Srav Nayani
AI Agent Building Block: Native App Automation

Code for this article is available at https://github.com/shravyanayani/automation


What is Artificial Intelligence

Artificial Intelligence (AI) refers to a computer system's ability to perform tasks that usually need human intelligence. These tasks include learning, reasoning, problem-solving, and understanding language.

The key components of AI are data, algorithms, and models.

  • Data provides the examples or information that AI learns from.

  • Algorithms are the step-by-step methods that process this data to find patterns or make decisions.

  • The model is the outcome of training the algorithms on the data. It uses what it has learned to predict or act on new inputs.

These components work together so AI systems can keep improving their performance and make smart decisions in real-world situations.
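
The relationship between data, algorithm, and model can be illustrated with a minimal sketch. The data points and function names are made up for illustration, and ordinary least squares for a straight line is used only because it is short; real AI systems use far richer algorithms.

```python
# Data: example (x, y) pairs the system learns from (made-up values).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]


def fit_line(points):
    """Algorithm: ordinary least squares for a line y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Model: the learned parameters, applied to a new input."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(data)
print(predict(model, 5.0))  # prints 10.0
```

Here the model is just two numbers, but the split is the same as in larger systems: the algorithm runs once over the data, and the resulting model is what gets used on new inputs.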

What is AI Agent

Generic AI Agent Design

An AI agent is a system that can sense its environment, make decisions, and act to reach specific goals, often without ongoing human help.

The key components of an AI agent include the perception module, decision-making module, and action module.

  • The perception module collects information from the environment using sensors or data inputs and makes sense of it.
  • The decision-making module uses algorithms or models to select the best action based on goals, rules, or past experiences.
  • Finally, the action module executes those decisions, and the agent continuously repeats this cycle to learn and improve over time.
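
The perceive, decide, act cycle can be sketched as follows. This is a toy illustration, not a real agent framework: `SimulatedEnvironment`, `SimpleAgent`, and the "increase a counter" goal are all hypothetical stand-ins, and the learning part of the cycle is left out to keep the loop itself visible.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedEnvironment:
    """Stand-in for a real environment: a counter the agent can increase."""
    value: int = 0

    def read(self) -> int:
        return self.value

    def apply(self, action: str) -> None:
        if action == "increase":
            self.value += 1


@dataclass
class SimpleAgent:
    goal: int
    history: list = field(default_factory=list)

    def perceive(self, env: SimulatedEnvironment) -> int:
        """Perception module: read the current state of the environment."""
        return env.read()

    def decide(self, observation: int) -> str:
        """Decision module: a fixed rule standing in for a model."""
        return "increase" if observation < self.goal else "stop"

    def act(self, env: SimulatedEnvironment, action: str) -> None:
        """Action module: carry out the decision and remember it."""
        env.apply(action)
        self.history.append(action)

    def run(self, env: SimulatedEnvironment, max_steps: int = 100) -> int:
        """Repeat the cycle until the goal is reached or the step budget ends."""
        for _ in range(max_steps):
            action = self.decide(self.perceive(env))
            if action == "stop":
                break
            self.act(env, action)
        return env.read()


env = SimulatedEnvironment()
agent = SimpleAgent(goal=3)
print(agent.run(env), agent.history)
```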

What is Native Desktop App Automation

Native desktop app automation, or native automation, uses scripts to automatically perform tasks on native desktop apps, such as Windows native applications. This includes filling out forms, clicking buttons, scraping data, recognizing data, and testing applications without manual effort.

A key component of native automation is object detection, which can be done through Windows native object detection, OCR (Optical Character Recognition) based text detection, or AI-based pattern and object detection.

  • Native Object Detection: Apps written with native technology, or in a high-level language that compiles to native code, use native operating system controls. For example, the buttons in such Windows apps are native Windows button objects, which can be detected and controlled through the Windows SDK. Java AWT code, for instance, uses native Windows controls when it runs on Windows.
  • OCR (Optical Character Recognition): OCR uses pattern matching to detect the characters and objects drawn on the screen. The recognized text can be scraped or extracted in code, and the screen positions of detected objects can be used to control them, for example by clicking or entering text.
  • AI-based pattern and object detection: AI can detect objects by pattern, for example finding a specific button by the text on it, or associating labels with controls by proximity.
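
Associating labels with controls by proximity can be sketched with plain geometry. This is a simplified illustration: the `Box` type, the element names, and the coordinates are hypothetical, and a real detector (OCR or an AI model) would supply the bounding boxes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A detected UI element: a name or text, plus its bounding box in pixels."""
    name: str
    left: int
    top: int
    width: int
    height: int

    @property
    def center(self):
        return (self.left + self.width / 2, self.top + self.height / 2)


def nearest_control(label: Box, controls: list[Box]) -> Box:
    """Return the control whose center is closest to the label's center."""
    lx, ly = label.center
    return min(controls, key=lambda c: math.hypot(c.center[0] - lx, c.center[1] - ly))


# Made-up detections: two text labels and two unlabeled input fields.
labels = [Box("Symbol", 10, 10, 60, 20), Box("Quantity", 10, 50, 60, 20)]
inputs = [Box("edit_1", 80, 10, 120, 20), Box("edit_2", 80, 50, 120, 20)]

pairs = {label.name: nearest_control(label, inputs).name for label in labels}
print(pairs)  # prints {'Symbol': 'edit_1', 'Quantity': 'edit_2'}
```

Center-to-center distance is the simplest possible rule. Layouts where labels sit above their fields, or where fields differ greatly in size, would likely need a direction-aware or edge-based distance instead.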

Once the objects are identified, the native mouse and keyboard APIs can be used to simulate human actions.

Besides object detection, scripts are needed to perform actions on the controls by simulating human input such as mouse and keyboard actions. Once such building-block scripts are available, they can be combined into higher-level functionality aimed at specific goals.

These components enable developers to test, monitor, or interact with apps effectively and reliably.

Native Automation vs API

API (Application Programming Interface) integration lets an AI agent interact directly with an external system’s backend through structured requests. API integration is quick, efficient, and dependable. However, it is only possible when the external system exposes an API for the functionality needed, and the AI agent must be granted access to that API.

Native automation works through the same UI that users see, so no API needs to exist or be set up. Its drawbacks are that it can be occasionally unreliable, slow, and complex to build.
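
One way to combine the two approaches is to prefer the API when it is available and fall back to UI automation otherwise. The sketch below assumes this design; `get_price`, `fake_api`, and `fake_ui` are hypothetical names, and the two fake functions stand in for a real API client and a real UI scraper.

```python
from typing import Callable, Optional

# A quote source takes a symbol and returns a price.
QuoteSource = Callable[[str], float]


def get_price(symbol: str, api: Optional[QuoteSource], ui: QuoteSource) -> tuple[float, str]:
    """Return (price, channel), preferring the API and falling back to the UI."""
    if api is not None:
        try:
            return api(symbol), "api"
        except ConnectionError:
            pass  # API unreachable or access denied: fall through to the UI
    return ui(symbol), "ui"


def fake_api(symbol: str) -> float:
    """Stand-in for an API client whose access has not been granted."""
    raise ConnectionError("API access not granted")


def fake_ui(symbol: str) -> float:
    """Stand-in for a UI scraper that reads the price off the screen."""
    return 149.5


print(get_price("AAPL", fake_api, fake_ui))  # prints (149.5, 'ui')
print(get_price("AAPL", None, fake_ui))      # prints (149.5, 'ui')
```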

Web Automation vs Native App Automation

Web automation and native app automation involve using software to automatically test or interact with applications, but they target different platforms and use different tools.

Web automation focuses on automating actions in web browsers, such as Chrome or Edge. It uses tools like Selenium or Playwright. This method interacts with web elements, including HTML, CSS, and JavaScript, through a web driver. It works across various browsers and operating systems.

Native app automation, in contrast, targets mobile or desktop applications specifically designed for platforms like Android, iOS, or Windows. It employs tools like Appium, Espresso, or XCUITest, which communicate directly with the app’s native UI components instead of going through a browser.

Web automation relies on the Web Browser's DOM (Document Object Model) while native app automation relies on UI elements defined by the operating system.

In short, web automation tests websites, while native app automation tests standalone apps. Both ensure that software functions correctly in their respective environments.

How Native Automation can be a building block for AI Agents

Native Automation used with AI Agent

Native automation can be one of the most powerful capabilities of an AI system, because it lets the system act on native apps and interact with them much as a human user would.

Through native automation, an AI agent can navigate apps, collect data, fill in forms, or start a process, which gives it access to up-to-date information and lets it perform tasks without human intervention.

The automation layer physically does the clicking, typing, and scraping, while the AI layer supplies the intelligence by deciding what to do, why, and when, based on objectives or learned patterns.

As an illustration, an AI agent may use natural language understanding to interpret a request (“purchase a stock at a specific price”) and then, through native automation, open the relevant app, check the stock price, wait until the price condition is met, and purchase the stock.

Ultimately, combining AI-driven decision-making with native automation lets agents move from decision to action and apply their insights in the real world.
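
The split between the AI layer and the automation layer can be sketched as an interface. Everything here is hypothetical: `BuyGoal`, `AutomationLayer`, and `FakeBroker` are illustrative names, the decision is a single rule rather than a learned model, and the broker is an in-memory fake so the sketch runs without any app.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class BuyGoal:
    symbol: str
    max_price: float
    shares: int


class AutomationLayer(Protocol):
    """What the AI layer expects from the automation layer."""
    def read_price(self, symbol: str) -> float: ...
    def buy(self, symbol: str, shares: int) -> None: ...


def decide(goal: BuyGoal, price: float) -> str:
    """AI layer stand-in: decides what to do, never touches the UI."""
    return "buy" if price <= goal.max_price else "wait"


def step(goal: BuyGoal, automation: AutomationLayer) -> str:
    """One cycle: observe through automation, decide, then act through automation."""
    action = decide(goal, automation.read_price(goal.symbol))
    if action == "buy":
        automation.buy(goal.symbol, goal.shares)
    return action


class FakeBroker:
    """In-memory stand-in for UI automation of a brokerage app."""
    def __init__(self, prices):
        self._prices = iter(prices)
        self.orders = []

    def read_price(self, symbol: str) -> float:
        return next(self._prices)

    def buy(self, symbol: str, shares: int) -> None:
        self.orders.append((symbol, shares))


broker = FakeBroker([155.0, 149.5])
goal = BuyGoal("AAPL", max_price=150.0, shares=100)
print(step(goal, broker), step(goal, broker), broker.orders)
```

Because `decide` only sees plain values, it can be replaced by a model or a rules engine without changing the automation code, and the automation code can be swapped for a real UI driver without changing the decision logic.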

Technology Choices for Native Automation

The technology choices for native automation depend mainly on the tasks to be executed, the platforms to be targeted, and the degree of intelligence or scalability desired.

The main classes of technology are automation frameworks, programming languages, and supporting tools.

Automation Frameworks – The most commonly cited approaches are native SDK based, OCR based, and AI based.

  • Native SDK is the most fundamental and comparatively the most reliable of the technologies listed here.
  • OCR-based object and text detection works when the app doesn't use native objects. For example, if the UI is built with the Java Swing API, the controls are not native ones; Swing draws them with low-level drawing operations. In such cases the native SDK cannot detect the UI controls, so OCR helps identify the shapes and coordinates of the controls.
  • AI-based object detection goes one step further than simple OCR. It can be fault tolerant: if an object is renamed or moved, AI can still find it, much as a human would.
  • A combination of these techniques can be used to build reliable automation systems.
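
Combining the techniques can be as simple as trying them in order of reliability. The sketch below assumes that design; the `locate` function is hypothetical, and the two dictionaries stand in for real native-SDK and OCR lookups, with made-up coordinates.

```python
from typing import Callable, Optional

# A locator takes a control name and returns (x, y) screen coordinates, or None.
Locator = Callable[[str], Optional[tuple[int, int]]]


def locate(name: str, locators: list[tuple[str, Locator]]) -> tuple[str, tuple[int, int]]:
    """Try each locator in order and return the first hit with its strategy name."""
    for strategy, find in locators:
        position = find(name)
        if position is not None:
            return strategy, position
    raise LookupError(f"No strategy could locate {name!r}")


# Stand-in lookups: the native SDK only sees "Buy", OCR also sees "Confirm".
native_controls = {"Buy": (500, 150)}
ocr_results = {"Buy": (502, 149), "Confirm": (550, 200)}

chain = [
    ("native", native_controls.get),  # most reliable, tried first
    ("ocr", ocr_results.get),         # fallback for controls drawn by the app
]

print(locate("Buy", chain))      # prints ('native', (500, 150))
print(locate("Confirm", chain))  # prints ('ocr', (550, 200))
```

An AI-based locator would be appended to the same chain as a third entry.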

Programming Languages – The usual picks are Python, Java, JavaScript, and C#; the choice depends on the proficiency of the development team and the integration requirements.

Supporting Tools – Schedulers and triggers that run the automation at set times or when a condition is met are commonly integrated with these frameworks.

These are not competing technologies but complementary ones: the language expresses the logic, the framework interacts with the apps, and the tools integrate the automation with its environment.

Challenges with Native Automation

Here are a few key challenges with native automation:

  • Non-Native Elements – Apps built without the native SDK are difficult to automate; in that case pattern-based object detection such as OCR or AI must be used.
  • Platform Compatibility – Apps built for a specific platform generally do not run on other platforms, and neither do their automation scripts. Apps built with platform-neutral technologies like Java Swing need special-purpose tools to automate them.
  • Synchronization Issues – "Element not found" errors appear when the script looks for an element even a fraction of a second before the app has rendered it, so proper waiting or timing control should be used.
  • Maintenance Overhead – When an app's layout or functionality changes, the automation scripts must be updated before they can run again.
  • Authentication and Security Barriers – Security features such as MFA (Multi-Factor Authentication) can block or interrupt automation.
  • Scalability and Performance – Large-scale automation (e.g., running parallel tests) can require a lot of resources and well-planned infrastructure.
  • Handling Non-Standard Elements – Unlike regular UI components, complex ones such as canvases, pop-ups, and drag-and-drop areas can be hard to automate.

These challenges can make app automation difficult, but thoughtful design, robust frameworks, and continuous maintenance make it powerful.
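
For the synchronization issue in particular, a common remedy is to poll for a condition with a timeout instead of sleeping for a fixed time. The helper below is a generic sketch; `wait_until` and `find_buy_button` are hypothetical names, and the "button" is simulated so the example runs without a UI.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(probe: Callable[[], T], timeout: float = 10.0, interval: float = 0.25) -> T:
    """Call probe() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(interval)


# Simulated control that only "appears" on the third lookup.
attempts = {"count": 0}


def find_buy_button():
    attempts["count"] += 1
    return "Buy button" if attempts["count"] >= 3 else None


print(wait_until(find_buy_button, timeout=2.0, interval=0.01))  # prints Buy button
```

The script continues as soon as the element shows up and fails with a clear error if it never does, which is both faster and more reliable than a fixed `Sleep`.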

Coding sample: Native Automation using AutoIt

The following Windows app automation code is written in AutoIt, a BASIC-like scripting language. The automated app is a brokerage app that shows stock quotes, including price and volume. The script polls the price and volume to decide whether all the conditions for purchasing the stock are met. In practice, the decision-making can be handed off to an AI engine; the automation script then provides automation services to the AI agent, so each tool handles the part it does best. This script is written for educational purposes only. It cannot be used for real trading, because a real-world stock application needs more sophisticated logic and many more safeguards to handle financial information properly.

#include <AutoItConstants.au3>
#include <MsgBoxConstants.au3>
#include <WindowsConstants.au3>
#include <GUIConstantsEx.au3>

; Configuration Variables
Global $SYMBOL = "AAPL"              ; Stock symbol to monitor
Global $TARGET_PRICE = 150.00        ; Target price to trigger buy
Global $TARGET_VOLUME = 100000       ; Minimum volume required
Global $SHARES_TO_BUY = 100          ; Number of shares to buy
Global $CHECK_INTERVAL = 30          ; Seconds between each check
Global $FIDELITY_WINDOW = "Fidelity Active Trader Pro"  ; Main window title

; Function to check if Fidelity application is running
Func IsFidelityRunning()
    If WinExists($FIDELITY_WINDOW) Then
        Return True
    Else
        MsgBox($MB_ICONERROR, "Error", "Fidelity Active Trader Pro is not running!")
        Return False
    EndIf
EndFunc

; Function to activate Fidelity window
Func ActivateFidelity()
    If Not WinActivate($FIDELITY_WINDOW) Then
        Return False
    EndIf
    Sleep(1000)  ; Wait for window to activate
    Return True
EndFunc

; Function to enter symbol
Func EnterSymbol()
    ; Activate symbol input field (you'll need to adjust coordinates)
    MouseClick("left", 100, 100)
    Sleep(500)
    Send($SYMBOL)
    Sleep(500)
    Send("{ENTER}")
    Sleep(1000)
EndFunc

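; Placeholder control IDs: find the real ones with the AutoIt Window Info tool
Global $PRICE_CONTROL = "[CLASS:Static; INSTANCE:1]"
Global $VOLUME_CONTROL = "[CLASS:Static; INSTANCE:2]"

; Helper: read a control's text and convert it to a number (0 if unreadable)
Func ReadNumber($control)
    Local $text = ControlGetText($FIDELITY_WINDOW, "", $control)
    If @error Then Return 0
    ; Strip "$", "," and other non-numeric characters before converting
    Return Number(StringRegExpReplace($text, "[^0-9.]", ""))
EndFunc

; Function to get current price
Func GetCurrentPrice()
    ; If the app does not expose the price as a native control, use OCR here instead
    Return ReadNumber($PRICE_CONTROL)
EndFunc

; Function to get current volume
Func GetCurrentVolume()
    ; If the app does not expose the volume as a native control, use OCR here instead
    Return ReadNumber($VOLUME_CONTROL)
EndFunc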

; Function to place buy order
Func PlaceBuyOrder()
    ; Click Trade button
    MouseClick("left", 400, 100)  ; Adjust coordinates
    Sleep(1000)

    ; Enter number of shares
    Send($SHARES_TO_BUY)
    Sleep(500)

    ; Click Buy button
    MouseClick("left", 500, 150)  ; Adjust coordinates
    Sleep(1000)

    ; Confirm order
    MouseClick("left", 550, 200)  ; Adjust coordinates
    Sleep(1000)
EndFunc

; Main monitoring loop
While 1
    If Not IsFidelityRunning() Then
        Exit
    EndIf

    If Not ActivateFidelity() Then
        Sleep(1000)  ; Avoid a busy loop while the window cannot be activated
        ContinueLoop
    EndIf

    EnterSymbol()

    Local $currentPrice = GetCurrentPrice()
    Local $currentVolume = GetCurrentVolume()

    ConsoleWrite("Current Price: " & $currentPrice & " Volume: " & $currentVolume & @CRLF)

    ; Check if conditions are met; a price of 0 or less means the value could not be read
    If $currentPrice > 0 And $currentPrice <= $TARGET_PRICE And $currentVolume >= $TARGET_VOLUME Then
        PlaceBuyOrder()
        MsgBox($MB_ICONINFORMATION, "Order Placed", "Buy order placed for " & $SHARES_TO_BUY & " shares of " & $SYMBOL)
        Exit
    EndIf

    Sleep($CHECK_INTERVAL * 1000)
WEnd

Coding sample: Native Automation using pywinauto

The following Windows app automation code is written in Python using the pywinauto library. The automated app is a brokerage app that shows stock quotes, including price and volume. The script polls the price and volume to decide whether all the conditions for purchasing the stock are met. In practice, the decision-making can be handed off to an AI engine; the automation script then provides automation services to the AI agent, so each tool handles the part it does best. This script is written for educational purposes only. It cannot be used for real trading, because a real-world stock application needs more sophisticated logic and many more safeguards to handle financial information properly.

import os
import time
import logging
from datetime import datetime
from dotenv import load_dotenv
from pywinauto.application import Application
from pywinauto.keyboard import send_keys
import cv2
import numpy as np
import pytesseract
from PIL import ImageGrab

# Load environment variables
load_dotenv()

# Configure logging
logging.basicConfig(
    filename='fidelity_trader.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

class FidelityTrader:
    def __init__(self):
        # Trading Configuration
        self.symbol = os.getenv('STOCK_SYMBOL', 'AAPL')
        self.target_price = float(os.getenv('TARGET_PRICE', '150.0'))
        self.target_volume = int(os.getenv('TARGET_VOLUME', '100000'))
        self.shares_to_buy = int(os.getenv('SHARES_TO_BUY', '100'))
        self.check_interval = int(os.getenv('CHECK_INTERVAL', '30'))

        # Application Configuration
        self.app_path = os.getenv('FIDELITY_APP_PATH', '')
        self.window_title = "Fidelity Active Trader Pro"
        self.app = None
        self.main_window = None

    def connect_to_application(self):
        """Connect to Fidelity application"""
        try:
            # Try to connect to running instance
            self.app = Application(backend="uia").connect(title=self.window_title)
            logging.info("Connected to existing Fidelity application")
        except Exception as e:
            try:
                # Launch new instance if not running
                self.app = Application(backend="uia").start(self.app_path)
                logging.info("Launched new Fidelity application")
            except Exception as launch_error:
                logging.error(f"Failed to launch Fidelity: {launch_error}")
                raise

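        self.main_window = self.app.window(title=self.window_title)
        # Wait until the window is visible before interacting with it
        self.main_window.wait('visible', timeout=30)
        self.main_window.set_focus()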

    def capture_region(self, region):
        """Capture a specific region of the screen"""
        screenshot = ImageGrab.grab(bbox=region)
        return np.array(screenshot)

    def get_text_from_image(self, image):
        """Extract text from image using OCR"""
        # ImageGrab returns RGB data, so convert from RGB (not BGR) to grayscale
        gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        text = pytesseract.image_to_string(gray, config='--psm 6')
        return text.strip()

    def get_price(self):
        """Get current stock price from the application"""
        try:
            # Adjust coordinates based on where price appears in the application
            price_region = (100, 100, 200, 130)  # Example coordinates
            price_image = self.capture_region(price_region)
            price_text = self.get_text_from_image(price_image)
            # Remove the currency symbol and thousands separators before converting
            return float(price_text.replace('$', '').replace(',', '').strip())
        except Exception as e:
            logging.error(f"Error getting price: {e}")
            return None

    def get_volume(self):
        """Get current trading volume from the application"""
        try:
            # Adjust coordinates based on where volume appears in the application
            volume_region = (300, 100, 400, 130)  # Example coordinates
            volume_image = self.capture_region(volume_region)
            volume_text = self.get_text_from_image(volume_image)
            return int(volume_text.replace(',', '').strip())
        except Exception as e:
            logging.error(f"Error getting volume: {e}")
            return None

    def enter_symbol(self):
        """Enter stock symbol in the application"""
        try:
            # Find and click symbol input field
            symbol_input = self.main_window.child_window(auto_id="symbolInput")
            symbol_input.click_input()
            send_keys(self.symbol)
            send_keys('{ENTER}')
            logging.info(f"Entered symbol: {self.symbol}")
        except Exception as e:
            logging.error(f"Error entering symbol: {e}")
            raise

    def place_buy_order(self):
        """Place a buy order"""
        try:
            # Click trade button
            trade_button = self.main_window.child_window(title="Trade")
            trade_button.click_input()

            # Enter number of shares
            shares_input = self.main_window.child_window(auto_id="sharesInput")
            shares_input.type_keys(str(self.shares_to_buy))

            # Click buy button
            buy_button = self.main_window.child_window(title="Buy")
            buy_button.click_input()

            # Confirm order
            confirm_button = self.main_window.child_window(title="Confirm")
            confirm_button.click_input()

            logging.info(f"Buy order placed for {self.shares_to_buy} shares of {self.symbol}")
            return True
        except Exception as e:
            logging.error(f"Error placing buy order: {e}")
            return False

    def monitor_stock(self):
        """Main monitoring loop"""
        logging.info(f"Starting monitoring for {self.symbol}")
        logging.info(f"Target price: ${self.target_price}")
        logging.info(f"Target volume: {self.target_volume}")

        while True:
            try:
                self.enter_symbol()
                current_price = self.get_price()
                current_volume = self.get_volume()

                if current_price and current_volume:
                    logging.info(f"Current price: ${current_price}, Volume: {current_volume}")

                    if current_price <= self.target_price and current_volume >= self.target_volume:
                        if self.place_buy_order():
                            logging.info("Order executed successfully")
                            break
                        else:
                            logging.error("Failed to execute order")

                time.sleep(self.check_interval)

            except Exception as e:
                logging.error(f"Error in monitoring loop: {e}")
                time.sleep(self.check_interval)

if __name__ == "__main__":
    try:
        trader = FidelityTrader()
        trader.connect_to_application()
        trader.monitor_stock()
    except Exception as e:
        logging.critical(f"Critical error: {e}")
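
OCR output is rarely as clean as the sample assumes: it may carry labels, currency symbols, or thousands separators around the number. A small parsing helper can make the conversion more tolerant. This is a sketch under that assumption; `parse_number` is a hypothetical helper, not part of pywinauto or pytesseract.

```python
import re
from typing import Optional

# A digit, followed by more digits or thousands separators, then optional decimals.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_number(ocr_text: str) -> Optional[float]:
    """Return the first number found in OCR text, or None if there is none."""
    match = _NUMBER.search(ocr_text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_number("$149.50"))         # prints 149.5
print(parse_number("Vol: 1,250,300"))  # prints 1250300.0
print(parse_number("N/A"))             # prints None
```

Returning `None` instead of raising keeps the monitoring loop simple: an unreadable value just means "skip this cycle". For anything involving money, a further plausibility check on the parsed value would be worth adding, since OCR can misread digits.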

Testing Native Automation

Testing native automation code is vital to make sure that it works consistently and can handle the situations that come up in real-world use.

Native automation tests can use assertions to verify that expected controls have appeared, that data is accurate, and that navigation steps have completed successfully.

Wait conditions are also an effective way to handle dynamically loaded data, and they are more reliable than fixed delays.

Run your automated actions on different OS versions, screen resolutions, and display scaling settings to check that they behave the same way.
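
Assertion-based tests are easiest to write when the automation routine receives its window as a parameter, so that a fake can be passed in. The sketch below uses Python's built-in `unittest`; `FakeWindow` and `place_order` are hypothetical and only loosely mirror the order flow in the samples above.

```python
import unittest


class FakeWindow:
    """In-memory stand-in for an application window, for tests only."""
    def __init__(self):
        self.fields = {"symbol": "", "shares": ""}
        self.clicked = []

    def type_into(self, field, text):
        self.fields[field] = text

    def click(self, button):
        self.clicked.append(button)


def place_order(window, symbol, shares):
    """Automation routine under test: fills the form, then submits and confirms."""
    if shares <= 0:
        raise ValueError("shares must be positive")
    window.type_into("symbol", symbol)
    window.type_into("shares", str(shares))
    window.click("Buy")
    window.click("Confirm")


class PlaceOrderTest(unittest.TestCase):
    def test_fills_form_and_confirms(self):
        window = FakeWindow()
        place_order(window, "AAPL", 100)
        self.assertEqual(window.fields, {"symbol": "AAPL", "shares": "100"})
        self.assertEqual(window.clicked, ["Buy", "Confirm"])

    def test_rejects_non_positive_share_count(self):
        window = FakeWindow()
        with self.assertRaises(ValueError):
            place_order(window, "AAPL", 0)
        # Nothing may be clicked when the input is invalid
        self.assertEqual(window.clicked, [])
```

Saved as a module, this can be run with `python -m unittest`. Tests like these check the script's logic quickly and without the app; runs against the real application are still needed to catch layout and timing problems.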

Integrating Native Automation with AI Agent

A script for native automation is excellent for handling repetitive and predictable tasks, such as clicking buttons, filling out forms, moving between screens, or extracting data.

Conversely, an AI agent is capable of logical thinking, planning, learning, and making judgments based on available data.

When you combine these two, you get a smart system where the AI is in charge of the operations and the automation carries out the tasks.

Typical AI Agent System Components

  • AI Decision Engine - Processes goals, rules, or user commands and decides the sequence of actions. Can use ML models, NLP, or rule-based systems.
  • Native Automation Layer - Executes low-level actions on apps via tools like AutoIt, pywinauto, or WinAppDriver. Handles clicks, inputs, scrolling, and navigation.
  • Perception / Data Extraction Module - Observes the app environment and extracts relevant information (prices, flight options, stock data). Feeds this back to the AI agent.
  • Feedback / Learning Module - Evaluates the outcomes of automated actions (e.g., did the stock purchase decided by the AI actually meet the conditions?) and updates the decision-making models for future improvements.
  • Scheduler / Controller - Coordinates the flow: triggers native automation when needed, handles retries, logs progress, and ensures proper sequencing of tasks.
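
The retry and logging duties of the scheduler/controller can be sketched as a small wrapper around any automation task. The names `run_with_retries` and `flaky_task` are hypothetical, and the task is simulated; a real controller would also handle scheduling and sequencing.

```python
import time
from typing import Callable


def run_with_retries(task: Callable[[], bool], max_attempts: int = 3, delay: float = 0.0) -> list[str]:
    """Run task until it returns True, logging every attempt; give up after max_attempts."""
    log = []
    for attempt in range(1, max_attempts + 1):
        try:
            if task():
                log.append(f"attempt {attempt}: success")
                return log
            log.append(f"attempt {attempt}: task reported failure")
        except Exception as error:
            log.append(f"attempt {attempt}: {type(error).__name__}: {error}")
        if attempt < max_attempts:
            time.sleep(delay)  # pause before the next attempt
    log.append("giving up")
    return log


# Simulated automation task: the window is missing on the first call only.
calls = []


def flaky_task() -> bool:
    calls.append(1)
    if len(calls) < 2:
        raise RuntimeError("window not found")
    return True


for line in run_with_retries(flaky_task):
    print(line)
```

The returned log is also what a feedback module could inspect afterwards to see which actions needed retries.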
