DEV Community

Laura Salinas for AWS


First Impressions with Nova Act SDK 🤔

It's been a little over a week since the announcement of the Nova Act SDK, which is currently in research preview (US only). I carved out my lunch break to see how quickly I could get up and running. Join me below to see how far I get in about an hour 🙈.

Setup

You'll first need to request access to Nova via your Amazon login (the Amazon you shop from 📦, not your AWS account). If you don't see access granted immediately, you may have been placed on a waitlist, so keep an eye on your email.

With access granted, head to the Nova Act SDK. You'll see three steps outlined: installing the SDK, setting up an API key, and running a first workflow sample.

Screenshot of the Nova Act SDK Setup Screen

I head to the onboarding guide for more info on how to interact with the SDK and how to authenticate.

⚠️ I had some minor trouble here using the API key as an environment variable. Thinking I'd get ahead of the curve, I created a .env file with the API key so that my VS Code project could pick it up, but that kept throwing errors. TL;DR: just run export NOVA_ACT_API_KEY="your_api_key" in the terminal before you start a session and it'll make the SDK happy. Below is the error I kept getting.

An error message on vscode indicating authentication is not working
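
If you'd still rather keep the key in a .env file, here is a minimal sketch of one way to do it. It assumes the errors came from nothing actually loading the file into the process environment, which I haven't confirmed. Both `load_env_file` and `require_api_key` are hypothetical helpers I made up for illustration; neither is part of the Nova Act SDK. Only the variable name NOVA_ACT_API_KEY comes from the setup steps above.

```python
import os
from pathlib import Path


def load_env_file(path, environ=os.environ):
    """Copy KEY=VALUE pairs from a .env-style file into `environ`.

    Hypothetical helper, not part of the Nova Act SDK. Variables that are
    already set are left untouched, so a real `export` always wins.
    """
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Tolerate lines written in shell style: export KEY="value"
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        environ.setdefault(key.strip(), value.strip().strip("\"'"))


def require_api_key(environ=os.environ):
    """Return the API key, or fail early with a readable message."""
    key = environ.get("NOVA_ACT_API_KEY", "")
    if not key:
        raise RuntimeError(
            "NOVA_ACT_API_KEY is not set; export it or add it to your .env file"
        )
    return key
```

Calling `load_env_file(".env")` followed by `require_api_key()` at the top of a script, before opening the `NovaAct` session, should at least turn a confusing authentication error into an obvious one.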

You can use the SDK in a Python script or directly from the terminal with the standard Python shell.

Example Usage

I'm currently learning Japanese as a hobby and to feel a little more comfortable on my upcoming trip next month, so I figured let's test Nova Act by asking it to help me with some flashcards for practicing hiragana with diacritical marks (Duolingo just isn't enough 😭)

My first attempt looks like this:

from nova_act import NovaAct

with NovaAct(starting_page="https://www.google.com") as nova:
    nova.act("search google for Japanese hiragana with diacritical "
             "marks flashcards, find ones that can print well.")

This leads the agent to Google, where it searches for my query and then opens the first web result, which is Quizlet. So far so good. I see the agent describing its thinking process in my terminal, and it starts checking for a print button on the site.

As it does so, a popup captcha appears prompting for a click to confirm humanity, which... feels hilarious for this first use of Nova Act. Now, things go a little off the rails here: Nova Act understands there is a popup but gets stuck in a loop where it can't close the popup and can't continue to navigate the page either. I give it a few seconds before exiting the script.

💡 So, first lesson learned here: some sites are definitely detecting the use of agents, and this behavior currently breaks Nova Act (unless you get crafty, which... during this quick hack I didn't 🙂)

I'm curious whether I can get better results by prompting it to avoid individual sites and go directly to Google Images instead.

Attempt two looks like this:

from nova_act import NovaAct

with NovaAct(starting_page="https://www.google.com") as nova:
    nova.act("search google images for flashcards for japanese "
             "hiragana with diacritical marks. Find ones that can print well.")

This worked well, except I forgot to give the agent something specific to do once it finds the images tab, so it scrolls endlessly until I stop the script.
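
One way to guard against the endless scrolling above might be to hand the agent one small instruction at a time and cap how many browser actions each one may take. The sketch below is untested against the real SDK. `run_steps` and `FakeAgent` are hypothetical names of mine, and the `max_steps` keyword is an assumption about what `act()` accepts, so check the SDK README before relying on it. The fake agent is only there so the helper can be tried without a browser session.

```python
def run_steps(agent, steps, max_steps=10):
    """Send each instruction as its own act() call with a step cap.

    `agent` is any object with an act(prompt, max_steps=...) method.
    The max_steps keyword is an assumption about the Nova Act SDK.
    """
    results = []
    for step in steps:
        results.append(agent.act(step, max_steps=max_steps))
    return results


class FakeAgent:
    """Stand-in that records calls instead of driving a browser."""

    def __init__(self):
        self.calls = []

    def act(self, prompt, max_steps=None):
        self.calls.append((prompt, max_steps))
        return f"done: {prompt}"


if __name__ == "__main__":
    # Illustrative prompts only; I have not verified how the agent handles them.
    steps = [
        "Search Google Images for flashcards for Japanese hiragana "
        "with diacritical marks.",
        "Click the first image result to open its preview.",
    ]
    for result in run_steps(FakeAgent(), steps, max_steps=5):
        print(result)
```

With the real SDK, the idea would be to pass the `nova` object from the `with NovaAct(...)` block in place of `FakeAgent()`. How the SDK reports hitting the cap is not covered here.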

Attempt three looks like this:

from nova_act import NovaAct

with NovaAct(starting_page="https://www.google.com") as nova:
    nova.act(
        # Adjacent string literals are joined as-is, so each sentence
        # needs its own trailing space.
        "Search google images for flashcards for japanese hiragana "
        "with diacritical marks. "
        "Right click or secondary click an image. "
        "In the context menu click 'Open image in new tab'."
    )

I spent over 30 minutes trying to figure out the correct way to prompt Nova Act so it would perform a right-click (otherwise known as a secondary click) to open the context menu, and couldn't find a way...

Learnings

While the functionality of a web-capable agent is promising, finding the right way to tell the model to do what you want is still challenging, as with most agent/AI automation. I wanted to get more done with this example, but I couldn't justify spending more time figuring out how to get the model to understand a concept as simple to a human as a right-click 🖱️

Stay tuned for more as I continue to tinker with Nova Act (and perhaps come back to finish up this example...)

(PS- I have already given this missing context menu action feedback to the Nova Act team! PFR in the works 🔨)

Additional Resources 📚

Top comments (2)

Jason Dunn [AWS]

Looks like it has some interesting potential!

Laura Salinas • Edited

I missed this during the write-up, but the linked GitHub repo has tips for getting around CAPTCHAs, which I'll try in another post!