<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arin Zingade</title>
    <description>The latest articles on DEV Community by Arin Zingade (@arinzingade).</description>
    <link>https://dev.to/arinzingade</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1898201%2F857a0546-e35b-49b0-a819-05a273cc8491.png</url>
      <title>DEV Community: Arin Zingade</title>
      <link>https://dev.to/arinzingade</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arinzingade"/>
    <language>en</language>
    <item>
      <title>AI Web Agents: The Future of Intelligent Automation</title>
      <dc:creator>Arin Zingade</dc:creator>
      <pubDate>Sat, 04 Jan 2025 15:27:18 +0000</pubDate>
      <link>https://dev.to/arinzingade/ai-web-agents-the-future-of-intelligent-automation-2odo</link>
      <guid>https://dev.to/arinzingade/ai-web-agents-the-future-of-intelligent-automation-2odo</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: What Are AI Web Agents?
&lt;/h2&gt;

&lt;p&gt;AI web agents represent a powerful, emerging force within the digital landscape, fundamentally reshaping how organizations approach automation. As software tools capable of simulating human-like interactions, AI agents can understand, execute, and adapt to user requests. They are not merely passive systems responding to pre-defined commands but actively work to understand broader goals, learn from interactions, and dynamically refine responses.&lt;/p&gt;

&lt;p&gt;A recent Capgemini survey of large enterprises reveals that one in ten organizations is already deploying AI agents, with over half planning to explore these technologies within the coming year. Forrester Research also highlights AI web agents as one of the top 10 emerging technologies for 2024, with VP &lt;a href="https://www.forrester.com/blogs/author/brian_hopkins/?utm_source=pr&amp;amp;utm_medium=pr_pitch&amp;amp;utm_campaign=tech" rel="noopener noreferrer"&gt;Brian Hopkins&lt;/a&gt; calling them “perhaps the most exciting development” on this year’s &lt;a href="https://www.forrester.com/blogs/top-10-emerging-technologies-for-2024/?utm_source=pr&amp;amp;utm_medium=pr_pitch&amp;amp;utm_campaign=tech" rel="noopener noreferrer"&gt;list&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The concept of &lt;a href="https://www.geeksforgeeks.org/rabbit-ai-large-action-models-lams/" rel="noopener noreferrer"&gt;Large Action Models&lt;/a&gt; (LAMs) has become a focal point in discussions about AI agents. Rabbit AI, a pioneering player in this space, has introduced a product—a custom OS-equipped device supporting a trainable AI assistant capable of handling a wide range of actions. This assistant leverages LAMs to manage tasks such as making reservations, giving directions, ordering services, and adapting to user-specific prompts.&lt;/p&gt;

&lt;p&gt;Imagine a team of robotic coworkers, each able to support various business operations, be it customer service, data analysis, or scheduling tasks. These agents act as powerful extensions of human teams, handling operational tasks so human team members can focus on higher-level strategic work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Large Action Models: A Step Toward the Future
&lt;/h3&gt;

&lt;p&gt;As AI technology advances, an exciting new category emerges: Large Action Models (LAMs). LAMs have a more expansive role than traditional language models, which primarily generate text: they are built to perform actions, executing complex multi-step tasks from clear instructions. This progression brings us closer to artificial general intelligence (AGI), the idea of AI capable of performing virtually any intellectual task a human can. Although AGI remains a distant vision, the development of LAMs brings us a step nearer to it in practical, impactful ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are Large Action Models?
&lt;/h3&gt;

&lt;p&gt;Large action models combine multiple components, creating a system capable of interpreting instructions, understanding context, and performing diverse tasks. Think of them as supercharged LLMs that operate with multimodal capabilities, meaning they can handle not just text but also images, videos, and more. Additionally, they’re designed to interact with external tools and environments, empowering them to execute actions within complex workflows seamlessly.&lt;/p&gt;
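&lt;p&gt;To make the tool-interaction idea concrete, here is a minimal, illustrative Python sketch of how a LAM-style system might route model-proposed actions to external tools. The action format and tool names are invented for the example; real systems use richer schemas such as structured function calling.&lt;/p&gt;

```python
# Minimal sketch of a LAM-style action dispatcher (illustrative only):
# the model emits a structured action, and a registry maps it to a tool.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Action:
    name: str        # which tool to invoke
    argument: str    # a single argument, kept simple for the sketch

def book_table(restaurant: str) -> str:
    return f"Reserved a table at {restaurant}"

def get_directions(place: str) -> str:
    return f"Directions to {place}: head north, then east"

# Registry of tools the model is allowed to call.
TOOLS: Dict[str, Callable[[str], str]] = {
    "book_table": book_table,
    "get_directions": get_directions,
}

def execute(action: Action) -> str:
    """Route a model-proposed action to the matching tool."""
    tool = TOOLS.get(action.name)
    if tool is None:
        return f"Unknown action: {action.name}"
    return tool(action.argument)

print(execute(Action("book_table", "Blue Hill")))  # prints "Reserved a table at Blue Hill"
```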

&lt;h4&gt;
  
  
  Capabilities and Real-World Applications
&lt;/h4&gt;

&lt;p&gt;Large action models are reshaping how we think about automation. They go beyond handling complex queries; they adapt to a range of situations and user requirements. For example, MultiOn agents use websites and online services to perform a variety of tasks, all based on a simple prompt. With applications in areas like personalized marketing, these agents are positioned to change how people interact with digital services by simplifying processes, automating repetitive tasks, and handling entire workflows end to end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-shot Learning&lt;/strong&gt;: LAMs are designed to perform new tasks without explicit training, relying on the vast data they’re trained on. This enables them to take on unfamiliar tasks with minimal guidance, broadening their application scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few-shot Learning&lt;/strong&gt;: LAMs can also handle custom tasks by learning from a few examples provided in the input. This lets us adapt them to specific needs or contexts, adding a level of flexibility that traditional automation tools often lack.&lt;/p&gt;
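&lt;p&gt;The difference is easiest to see in how the prompts are assembled. The following sketch is illustrative only; the model call itself is omitted, and only the prompt construction is shown:&lt;/p&gt;

```python
# Illustrative only: how zero-shot and few-shot prompts differ in structure.
def zero_shot_prompt(task: str) -> str:
    # No examples: the model must rely entirely on its pretraining.
    return f"Task: {task}\nAnswer:"

def few_shot_prompt(task: str, examples: list) -> str:
    # A handful of worked examples steers the model toward the desired format.
    shots = "\n".join(f"Task: {t}\nAnswer: {a}" for t, a in examples)
    return f"{shots}\nTask: {task}\nAnswer:"

examples = [("Extract the date from 'Meet on Jan 5'", "Jan 5")]
prompt = few_shot_prompt("Extract the date from 'Call on Mar 12'", examples)
print(prompt.count("Task:"))  # 2: the prompt holds both the example and the query
```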

&lt;h3&gt;
  
  
  Potential Limitations
&lt;/h3&gt;

&lt;p&gt;Despite their promise, LAMs face some hurdles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latency Issues&lt;/strong&gt;: Efficiency is a core design goal for LAMs, yet complex, multi-step tasks can introduce delays. This can impact user experience, particularly in environments where real-time responses are crucial.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experimental Phase&lt;/strong&gt;: Many LAMs are still in development, and while their capabilities are impressive, they may not be fully reliable in all real-world applications. Continued refinement and testing will be key to achieving consistency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Dependency&lt;/strong&gt;: Like any advanced AI, LAMs require extensive datasets to make accurate, informed decisions. In domains where data is scarce, their performance may be limited.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complexity of Integration&lt;/strong&gt;: Integrating LAMs into existing systems requires sophisticated infrastructure and support for multimodal processing, which can be challenging and costly.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Understanding AI Web Agents
&lt;/h2&gt;

&lt;p&gt;AI web agents are revolutionizing how we approach automation. These agents gather information from their surroundings, process that data, and take actions to transform the environment—whether physical, digital, or a blend of both. As technology continues to advance, many AI agents are becoming increasingly capable of learning and adapting their behavior over time. They explore new solutions to challenges, continuously refining their approach until they achieve the desired outcome.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowaqp6xlw5p1h4ykrtaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowaqp6xlw5p1h4ykrtaw.png" alt="Image description" width="800" height="659"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pipeline of how AI Web Agent functions&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AI Web Agents vs. Traditional Automation Tools
&lt;/h3&gt;

&lt;p&gt;For many years, businesses have relied on traditional automation tools to handle repetitive, rule-based tasks—things like data entry, email marketing, and scheduling. These tools were highly effective for straightforward, repetitive processes but often lacked flexibility and intelligence when faced with more complex scenarios. That’s where AI-powered web agents come in, offering a much more sophisticated approach to automation.&lt;/p&gt;

&lt;p&gt;Unlike traditional automation tools, which rely on fixed rules and processes, AI agents leverage advanced technologies such as machine learning, &lt;a href="https://www.ibm.com/think/topics/natural-language-processing" rel="noopener noreferrer"&gt;natural language processing&lt;/a&gt; (NLP), and &lt;a href="https://cloud.google.com/discover/what-is-cognitive-computing" rel="noopener noreferrer"&gt;cognitive computing&lt;/a&gt;. This allows them to perform tasks in a much more flexible manner, adapting to new information and evolving conditions in real time. With these capabilities, AI agents can learn from past experiences and make smarter decisions without being explicitly programmed for every possible scenario.&lt;/p&gt;

&lt;p&gt;In the past, web automation often required businesses to write custom scripts for each website, using techniques like DOM parsing and XPath-based interactions. However, these scripts could easily break if a website's layout or structure changed. AI agents, on the other hand, have evolved beyond such limitations, offering a more resilient and dynamic approach.&lt;/p&gt;
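&lt;p&gt;A toy example makes the brittleness concrete. Below, a page layout is modeled as a nested structure; a redesign adds one wrapper node, which silently breaks a hard-coded, XPath-style path while an attribute-based lookup keeps working:&lt;/p&gt;

```python
# Toy DOM trees as nested dicts: v2 adds a wrapper node, which breaks
# a fixed "path" lookup but not an attribute-based search.
page_v1 = {"tag": "body", "children": [
    {"tag": "div", "children": [
        {"tag": "span", "id": "price", "text": "42"}]}]}
page_v2 = {"tag": "body", "children": [
    {"tag": "main", "children": [          # new wrapper added in a redesign
        {"tag": "div", "children": [
            {"tag": "span", "id": "price", "text": "42"}]}]}]}

def by_fixed_path(node, path):
    """Follow a hard-coded list of child indices, like an absolute XPath."""
    for index in path:
        children = node.get("children", [])
        if index >= len(children):
            return None
        node = children[index]
    return node.get("text")

def by_attribute(node, wanted_id):
    """Search the whole tree for a node with a matching id."""
    if node.get("id") == wanted_id:
        return node.get("text")
    for child in node.get("children", []):
        found = by_attribute(child, wanted_id)
        if found is not None:
            return found
    return None

print(by_fixed_path(page_v1, [0, 0]))   # the old script finds the price
print(by_fixed_path(page_v2, [0, 0]))   # the layout change broke it (None)
print(by_attribute(page_v2, "price"))   # the resilient lookup still works
```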

&lt;h3&gt;
  
  
  Key Technologies Behind AI Web Agents
&lt;/h3&gt;

&lt;p&gt;AI web agents harness a suite of advanced technologies to bring a new level of automation and intelligence to digital workflows. At the core of these agents are systems that not only understand tasks but also adapt to dynamic online environments, recognize visual elements, interpret language, and extract meaningful data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Large Language Models (LLMs)
&lt;/h4&gt;

&lt;p&gt;LLMs play a central role in AI web agents. They understand the task at hand, process the language, and generate the necessary steps to complete the objective. Whether it’s interacting with a website or gathering information, the LLM drives the decision-making process.&lt;/p&gt;

&lt;h4&gt;
  
  
  Natural Language Processing (NLP)
&lt;/h4&gt;

&lt;p&gt;NLP allows AI agents to interpret and understand human language. It helps the agent communicate with websites, forms, and other digital environments, enabling tasks like reading text, answering questions, or extracting key information.&lt;/p&gt;

&lt;h4&gt;
  
  
  Computer Vision
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Computer_vision" rel="noopener noreferrer"&gt;Computer vision&lt;/a&gt; enables AI agents to "see" and interact with visual elements on a webpage. By scanning for images, buttons, or other interactive items, AI agents can make informed decisions about how to engage with the environment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Understanding Context
&lt;/h4&gt;

&lt;p&gt;Context is crucial for accurate decision-making. AI agents use context to adapt their behavior based on real-time data, past experiences, or user input. This ensures tasks are completed intelligently, even when conditions change.&lt;/p&gt;

&lt;h4&gt;
  
  
  Entity Recognition and Extraction
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.ibm.com/think/topics/named-entity-recognition" rel="noopener noreferrer"&gt;Entity recognition&lt;/a&gt; helps AI agents identify important pieces of information, like product names, dates, or locations, within text or data. This capability allows agents to make smarter decisions based on extracted entities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard Pipeline for Web AI Agents
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Understanding the Task&lt;/strong&gt;: The LLM interprets the task and identifies the objective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website Interaction&lt;/strong&gt;: The agent accesses the target website and uses computer vision to scan the content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action Generation&lt;/strong&gt;: The agent generates an action plan as executable code, for example a Selenium or Playwright script, to interact with the website.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution&lt;/strong&gt;: The generated code is executed to perform the required actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repeat&lt;/strong&gt;: The agent repeats the process until the task is completed.&lt;/li&gt;
&lt;/ol&gt;
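&lt;p&gt;The five steps above can be sketched as a loop. In this schematic, the scan, plan, and execute functions are trivial stand-ins for the vision model, the LLM, and the browser driver, so the block runs without any of those dependencies:&lt;/p&gt;

```python
# Schematic of the five-step pipeline. The scan/plan/execute functions
# are stand-ins; in a real agent they would call a vision model, an LLM,
# and a Selenium/Playwright driver respectively.
def scan_page(page: str) -> str:
    # Stand-in for computer vision / DOM scanning.
    return f"contents of {page}"

def generate_actions(objective: str, observation: str) -> list:
    # Stand-in for the LLM turning an objective into concrete actions.
    return [f"click element relevant to: {objective}"]

def execute_actions(plan: list, state: dict) -> dict:
    # Stand-in for executing generated browser-automation code.
    state["steps_taken"] = state.get("steps_taken", 0) + len(plan)
    state["done"] = state["steps_taken"] >= 2  # pretend the goal takes two actions
    return state

def run_agent(objective: str, max_steps: int = 5) -> dict:
    state = {"done": False, "page": "https://example.com"}
    for _ in range(max_steps):
        observation = scan_page(state["page"])            # step 2: scan the site
        plan = generate_actions(objective, observation)   # steps 1 and 3: interpret and plan
        state = execute_actions(plan, state)              # step 4: execute
        if state["done"]:                                 # step 5: repeat until complete
            break
    return state

print(run_agent("open the World Indices page")["done"])  # True after two iterations
```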

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlczb0l123c6x0aqdy0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlczb0l123c6x0aqdy0t.png" alt="Image description" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Demonstrates the recursive approach AI Web Agents take&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Leading AI Web Agents
&lt;/h2&gt;

&lt;p&gt;A growing ecosystem of AI-driven web agent frameworks, many of them open-source, is paving the way for developers to build powerful, customized solutions. These frameworks offer robust foundations, allowing us to focus on tailoring and scaling AI agents for specific needs rather than building everything from scratch.&lt;/p&gt;

&lt;h3&gt;
  
  
  LaVague
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.lavague.ai/" rel="noopener noreferrer"&gt;LaVague&lt;/a&gt; is an open-source framework designed for developers seeking to build AI web agents that automate processes for their users. It provides a comprehensive solution for creating adaptable and effective AI agents.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Features
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;World Model:&lt;/strong&gt; LaVague's World Model processes the current web page and the given objective to generate a set of instructions for the agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action Engine:&lt;/strong&gt; The Action Engine compiles these instructions into executable automation code, for example Selenium or Playwright scripts, and then performs the required action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supported Drivers&lt;/strong&gt;: LaVague supports three main driver options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Selenium WebDriver&lt;/li&gt;
&lt;li&gt;Playwright WebDriver&lt;/li&gt;
&lt;li&gt;Chrome Extension Driver&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Stagehand
&lt;/h3&gt;

&lt;p&gt;Stagehand, from &lt;a href="https://www.browserbase.com/" rel="noopener noreferrer"&gt;Browserbase&lt;/a&gt;, brings AI-driven automation to Browserbase's high-performance, serverless headless-browser platform. Together they let developers run, manage, and monitor web automation tasks at scale, offering a robust foundation for integrating AI web agents.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Features
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility:&lt;/strong&gt; Native compatibility with popular automation tools like &lt;a href="https://docs.browserbase.com/quickstart/playwright" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;, &lt;a href="https://docs.browserbase.com/quickstart/puppeteer" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt;, and &lt;a href="https://docs.browserbase.com/quickstart/selenium" rel="noopener noreferrer"&gt;Selenium&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration:&lt;/strong&gt; Seamless integration with AI frameworks like &lt;a href="https://docs.browserbase.com/integrations/crew-ai/introduction" rel="noopener noreferrer"&gt;crewAI&lt;/a&gt; and &lt;a href="https://docs.browserbase.com/integrations/langchain/introduction" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Full observability with the &lt;a href="https://docs.browserbase.com/features/session-inspector" rel="noopener noreferrer"&gt;Session Inspector&lt;/a&gt;, which provides deep insights into agent interactions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stealth Mode:&lt;/strong&gt; Automatically solves captchas and uses residential proxies for improved anonymity and reliability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Features:&lt;/strong&gt; Custom extensions, file downloads, long-running sessions, and an API for live views and session logs.&lt;/li&gt;
&lt;/ol&gt;
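&lt;p&gt;Because Browserbase exposes sessions over the Chrome DevTools Protocol, existing Playwright code can attach to a remote session. The endpoint format below is an assumption for illustration; consult the Browserbase quickstart for the authoritative connection string.&lt;/p&gt;

```python
# Hedged sketch: building a Browserbase-style CDP connection URL.
# The wss endpoint format here is assumed, not taken verbatim from docs.
import os

def browserbase_ws_url(api_key: str) -> str:
    # Assumed endpoint shape; treat as illustrative, not authoritative.
    return f"wss://connect.browserbase.com?apiKey={api_key}"

url = browserbase_ws_url(os.environ.get("BROWSERBASE_API_KEY", "demo-key"))
print(url.startswith("wss://"))  # True

# With Playwright installed, the remote session would be attached roughly as:
# browser = playwright.chromium.connect_over_cdp(url)
```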

&lt;p&gt;Stagehand offers a scalable, secure, and reliable infrastructure that supports the creation and deployment of powerful AI agents in the web automation space.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skyvern
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.skyvern.com/" rel="noopener noreferrer"&gt;Skyvern&lt;/a&gt; is a cutting-edge solution that automates browser-based workflows using large language models (LLMs) and computer vision. It’s designed to replace traditional automation solutions with a more robust, adaptable system.&lt;/p&gt;

&lt;p&gt;Skyvern offers an intuitive user interface, allowing you to automate workflows with ease. Here’s how to get started with setting up Skyvern on your machine.&lt;/p&gt;

&lt;h4&gt;
  
  
  Steps to Set Up Skyvern
&lt;/h4&gt;

&lt;h5&gt;
  
  
  Prerequisites
&lt;/h5&gt;

&lt;p&gt;Before you begin, make sure Docker is installed on your system. Docker will allow Skyvern to run seamlessly across different environments.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the Repository&lt;/strong&gt;
Start by cloning Skyvern's repository from GitHub:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   git clone https://github.com/Skyvern-AI/skyvern
   &lt;span class="nb"&gt;cd &lt;/span&gt;Skyvern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Configure Your API Key&lt;/strong&gt;
Open the &lt;code&gt;docker-compose.yml&lt;/code&gt; file in a text editor:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   nano docker-compose.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the placeholder with your OpenAI or Anthropic API key. This key enables Skyvern to access the AI functionalities needed for workflow automation.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Build and Run Skyvern&lt;/strong&gt;
Once you’ve added the API key, start Skyvern with Docker:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   docker-compose up &lt;span class="nt"&gt;--build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Open the Interface&lt;/strong&gt;
Skyvern will start running at &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;You can set up tasks and workflows using this user interface: visit &lt;a href="https://docs.skyvern.com/introduction" rel="noopener noreferrer"&gt;Skyvern Docs&lt;/a&gt; for more.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5lqnagwja43ivzwak9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5lqnagwja43ivzwak9b.png" alt="Image description" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Features
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Adaptability&lt;/strong&gt;: Skyvern can operate on websites it has never encountered before, thanks to its ability to map visual elements to actions necessary for completing workflows, without relying on custom code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience&lt;/strong&gt;: Unlike traditional automation tools that depend on fixed XPath selectors, Skyvern can adapt to website layout changes, ensuring it remains functional even as websites evolve.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Skyvern is capable of applying a single workflow across a large number of websites, reasoning through interactions and automating complex tasks reliably.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Automating Web Tasks with AI
&lt;/h2&gt;

&lt;p&gt;Automation is no longer limited to simple, rule-based processes; with AI-driven agents, we can automate nuanced, multi-step operations that require adaptability and intelligence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Use Cases
&lt;/h3&gt;

&lt;p&gt;We’re seeing AI web agents transform various workflows. Here are some common use cases where they’re making a real difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Extraction and Web Scraping&lt;/strong&gt;: Collecting and structuring information from online sources, saving us the time and effort of manual data gathering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automating Repetitive Tasks&lt;/strong&gt;: From logging data entries to filling forms, AI agents handle repetitive actions with precision, freeing us to focus on higher-level tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow Automation&lt;/strong&gt;: Our agents can coordinate multiple steps across platforms, streamlining workflows and reducing the need for human intervention.&lt;/li&gt;
&lt;/ul&gt;
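&lt;p&gt;The structuring half of data extraction is often the mundane work an agent automates. Here is a small sketch, with hypothetical scraped rows, of turning loosely formatted text into clean JSON records:&lt;/p&gt;

```python
# Sketch: turning loosely structured scraped rows into clean records,
# the "structuring" half of data extraction (the rows are hypothetical).
import json

raw_rows = [
    "ACME Widget | $19.99 | in stock",
    "Gizmo Pro | $42.00 | out of stock",
]

def parse_row(row: str) -> dict:
    name, price, stock = [part.strip() for part in row.split("|")]
    return {
        "name": name,
        "price": float(price.lstrip("$")),
        "in_stock": stock == "in stock",
    }

records = [parse_row(r) for r in raw_rows]
print(json.dumps(records, indent=2))
```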

&lt;h3&gt;
  
  
  Benefits of Automation
&lt;/h3&gt;

&lt;p&gt;By adopting AI automation, we’re not just saving time; we’re enhancing our work in meaningful ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time Savings and Efficiency&lt;/strong&gt;: AI agents allow us to focus on critical, creative aspects of our work, increasing our productivity and freeing up time for innovation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduction in Human Error&lt;/strong&gt;: With AI managing repetitive tasks, accuracy improves, errors decrease, and we benefit from more consistent, reliable results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv0j3ydblgo3t56qnuz3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv0j3ydblgo3t56qnuz3.png" alt="Image description" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How AI Agents can increase efficiency&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Demonstration
&lt;/h3&gt;

&lt;p&gt;In this demo, LaVague's Web Agent is used to automate the task of navigating from the &lt;a href="https://finance.yahoo.com/" rel="noopener noreferrer"&gt;Yahoo Finance&lt;/a&gt; homepage to the World Indices page. The process is broken down into a few simple steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install the Required Libraries:&lt;/strong&gt; To get started, you'll first need to install LaVague and its dependencies:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  pip &lt;span class="nb"&gt;install &lt;/span&gt;lavague llama_index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Set Up the Web Agent:&lt;/strong&gt; The agent uses the Selenium WebDriver to interact with the Yahoo Finance website. Here's the Python code that sets up the agent and directs it to the Yahoo Finance homepage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lavague.drivers.selenium&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SeleniumDriver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lavague.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ActionEngine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WorldModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lavague.core.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WebAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lavague.core.navigation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NavigationEngine&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lavague.core.retrievers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpsmSplitRetriever&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;lavague.contexts.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenaiContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.llms.groq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Groq&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;your_api_key&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;selenium_driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SeleniumDriver&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;action_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ActionEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selenium_driver&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;world_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorldModel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WebAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;world_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action_engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://finance.yahoo.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Objective: Go to the World Indices Page
1. Click on &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Markets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
2. Click on the &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;World Indices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; link in the &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Markets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; dropdown menu
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;objective&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the agent successfully completes the task, it automatically takes a screenshot of the current web page. LaVague stores these screenshots, which can later be fed to a Visual Language Model (VLM) to extract important information and build a workflow for further interactions.&lt;/p&gt;


&lt;h2&gt;
  
  
  Integrating AI Web Agents into Workflows
&lt;/h2&gt;

&lt;p&gt;Integrating AI web agents into workflows has become a transformative approach for businesses looking to enhance efficiency, automate repetitive tasks, and improve overall productivity. Here’s how organizations can practically implement AI agents in their operational frameworks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Define Use Cases&lt;/strong&gt;: Identify specific tasks or processes that can benefit from automation. Common use cases include customer service, order management, HR processes, and project management. For example, AI agents can automate customer inquiries, manage recruitment workflows, or optimize project task allocations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Select the Right Platform&lt;/strong&gt;: Choose an AI platform that supports the creation and management of AI agents. Platforms like Automation Anywhere's AI Agent Studio allow businesses to build custom agents tailored to their unique needs. These platforms often provide tools for integrating generative AI into existing workflows seamlessly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Design Agentic Workflows&lt;/strong&gt;: Implement agentic workflows that enable AI agents to operate independently while pursuing specific goals. Unlike traditional systems that react to commands, these workflows allow agents to analyze their environment and make proactive decisions based on real-time data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;These tools are not just simplifying automation—they are evolving it, bringing unprecedented adaptability, intelligence, and efficiency to workflows across industries. With the transformative capabilities of LAMs, we’re seeing a clear shift toward AI agents that understand and actively respond to the world around them.&lt;/p&gt;

&lt;p&gt;In this article, we’ve explored the technologies, frameworks, and key features that make AI web agents a game-changer. From enhanced data extraction to seamless workflow automation, these agents provide us with new possibilities for maximizing efficiency and minimizing errors. LAMs, especially, represent a leap forward, empowering agents to perform a broader range of tasks with little or no additional training. As LAMs continue to evolve, they’re opening doors to more complex actions, bringing us closer to the vision of artificial general intelligence.&lt;/p&gt;

&lt;p&gt;As we move forward, we’re excited to continue integrating these innovations into our processes, harnessing the full potential of AI agents to create smarter, more autonomous workflows.&lt;/p&gt;

&lt;p&gt;Let’s step confidently into this future together, knowing that we’re building a more productive, efficient, and innovative digital landscape.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>automation</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>ClickHouse Vs DuckDB</title>
      <dc:creator>Arin Zingade</dc:creator>
      <pubDate>Tue, 03 Dec 2024 06:13:24 +0000</pubDate>
      <link>https://dev.to/arinzingade/clickhouse-vs-duckdb-4o1l</link>
      <guid>https://dev.to/arinzingade/clickhouse-vs-duckdb-4o1l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The rise of OLAP databases has been hard to ignore, with the relentless growth of data and demand for powerful, fast analytics tools. OLTP databases have been invaluable for transaction-heavy applications but often fall short when faced with the sheer complexity of modern analytical workloads. Enter OLAP databases, designed to make slicing through massive datasets feel nearly effortless.&lt;/p&gt;

&lt;p&gt;In exploring OLAP solutions, we found two that stood out: ClickHouse and DuckDB. While both are OLAP-focused, they’re fundamentally different tools, each with unique strengths. ClickHouse is a powerhouse designed for multi-node, distributed systems that scale up to petabytes of data. DuckDB, on the other hand, is more like the SQLite of OLAP—a nimble, desktop-friendly database that brings OLAP capabilities to local environments without the need for elaborate setup. Despite their differences, these databases share a versatility that makes them adaptable to a range of tasks: querying data in object storage, handling cross-database queries, and even parsing compressed files or semi-structured data. &lt;/p&gt;

&lt;p&gt;This article will mainly focus on the capabilities of DuckDB, while touching on ClickHouse and the key differences that make both projects great in their own niche. I have covered ClickHouse in much detail &lt;a href="https://www.cloudraft.io/blog/clickhouse-key-to-faster-insights#how-does-clickhouse-work" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dysnanpdarykqw6734d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dysnanpdarykqw6734d.png" alt="Image description" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  DuckDB: Speedy Analytics, Zero Setup
&lt;/h3&gt;

&lt;p&gt;DuckDB is a welcome solution for data analysts and scientists seeking efficient, local OLAP processing without the usual infrastructure demands. DuckDB has carved out a unique niche as a lightweight relational database, built to perform analytical tasks with impressive speed while remaining incredibly easy to use.&lt;/p&gt;

&lt;p&gt;For those of us accustomed to the high costs of platforms like Redshift, Databricks, Snowflake, or BigQuery, DuckDB offers a refreshing alternative. You can simply upload files to cloud storage and let teams run analytics using the compute on their existing laptops, bypassing the need for expensive and complex infrastructure for smaller tasks and analyses.&lt;/p&gt;

&lt;p&gt;We use DuckDB as a go-to solution for tasks that exceed the capacity of tools like Pandas or Polars. Its ability to load large CSVs directly into dataframes with speed and efficiency has streamlined our workflows. It also performs well for ETL tasks in Kubernetes environments, showcasing its adaptability and reliability across a wide range of data processing needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafd4o26osfzhemsoxsyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafd4o26osfzhemsoxsyq.png" alt="Image description" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Versatility of DuckDB&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ClickHouse: Analytics for Scale
&lt;/h2&gt;

&lt;p&gt;Then there's ClickHouse, a database built for scale, known for its incredible speed and efficiency when working with vast amounts of data. ClickHouse’s column-oriented architecture, coupled with unique table engines, enables it to process millions of rows per second. Companies like Cloudflare use it to cut down memory usage by over four times, highlighting its role in real-world, large-scale applications. In environments where data volumes are measured in terabytes or petabytes, ClickHouse really shines.&lt;/p&gt;

&lt;p&gt;One of the standout aspects of ClickHouse is its ability to leverage the full power of the underlying hardware, optimizing memory and CPU usage to handle massive, complex queries with ease. The distributed nature of ClickHouse allows it to scale horizontally across nodes, making it resilient and highly available for mission-critical applications. Its real-time query capabilities allow companies to power dashboards and interactive reports with minimal latency, providing instant insights for fast-paced decision-making. For anyone managing large-scale analytics, it delivers a robust set of features for everything from web analytics to detailed log analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  Similarities and Differences between ClickHouse and DuckDB
&lt;/h2&gt;

&lt;p&gt;While both ClickHouse and DuckDB excel in fast, efficient querying and share similar columnar architectures, their design philosophies and deployment models cater to unique needs. Together, these tools showcase the spectrum of options available for handling analytical workloads, from enterprise-scale, distributed systems to flexible, embedded analytics.&lt;/p&gt;

&lt;h4&gt;
  
  
  ClickHouse: Enterprise-Grade, Big Data Workloads
&lt;/h4&gt;

&lt;p&gt;ClickHouse is designed for large-scale analytics, widely adopted by enterprises handling vast datasets for real-time analytics, monitoring, and business intelligence. Built to handle multi-node deployments, ClickHouse scales effectively with its Massively Parallel Processing (MPP) architecture, making it a strong choice for multi-terabyte, distributed, cloud-first deployments.&lt;/p&gt;

&lt;h4&gt;
  
  
  DuckDB: Small-to-Medium Data and Data Science Workflows
&lt;/h4&gt;

&lt;p&gt;DuckDB is ideal for small-to-medium datasets and data science tasks. It’s lightweight, embedded, and designed to run directly on local machines with minimal configuration. This makes it perfect for data exploration and prototyping on tens-of-gigabyte datasets without needing a complex database setup.&lt;/p&gt;

&lt;h4&gt;
  
  
  Installation and Embedding: ClickHouse’s chDB and DuckDB’s In-Process Model
&lt;/h4&gt;

&lt;p&gt;DuckDB is completely embedded—no server setup is needed, making it easily deployable within the same process as the host application. ClickHouse offers similar ease with &lt;strong&gt;chDB&lt;/strong&gt;, a library that allows ClickHouse SQL queries to be run directly in Python environments, providing a streamlined setup for local analytics.&lt;/p&gt;

&lt;h4&gt;
  
  
  In-Memory and Serialization Capabilities
&lt;/h4&gt;

&lt;p&gt;Both databases support in-memory processing, though they differ in approach. DuckDB can operate in-memory by default for fast, temporary analyses, while ClickHouse also provides in-memory storage options through specific storage engines. For data serialization to flat files, ClickHouse is generally faster, thanks to its optimized storage architecture.&lt;/p&gt;

&lt;h4&gt;
  
  
  Performance on Complex Computations and DataFrame Integration
&lt;/h4&gt;

&lt;p&gt;DuckDB excels at handling complex relational data operations, often providing faster local analysis for structured datasets. Its seamless querying of &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;Polars&lt;/strong&gt;, and &lt;strong&gt;Arrow&lt;/strong&gt; DataFrames within Python is a key feature that makes it highly useful for data scientists, serving as a powerful in-process SQL engine.&lt;/p&gt;

&lt;h4&gt;
  
  
  Distributed Scaling vs. Local, Serverless Execution
&lt;/h4&gt;

&lt;p&gt;ClickHouse’s MPP architecture allows for horizontal scaling across nodes, ideal for enterprise-grade, cloud-based analytics workloads. DuckDB, meanwhile, thrives in serverless, single-machine setups and is perfect for tasks like ETL, semi-structured data queries, and quick analyses on local storage.&lt;/p&gt;

&lt;p&gt;In summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;DuckDB&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column-oriented, OLAP&lt;/td&gt;
&lt;td&gt;Column-oriented, OLAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enterprise-grade, big data workloads&lt;/td&gt;
&lt;td&gt;Small-to-medium data volumes; data science workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Adoption&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Widely used in large enterprises for high-speed analytics&lt;/td&gt;
&lt;td&gt;Popular among data analysts and scientists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-node, distributed (supports MPP architecture)&lt;/td&gt;
&lt;td&gt;Single-machine, embedded (runs in-process)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Installation Requirement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server software typically required (chDB for local SQL)&lt;/td&gt;
&lt;td&gt;No server installation needed; embedded within host&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Volume Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Suited for multi-terabyte, large datasets&lt;/td&gt;
&lt;td&gt;Ideal for tens of GB-level datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;In-Memory Processing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supported through specific storage engines&lt;/td&gt;
&lt;td&gt;Supported with “:memory:” mode (default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serialization Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Faster for serializing flat file data&lt;/td&gt;
&lt;td&gt;Slower than ClickHouse for serialization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complex Query Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strong for large-scale aggregation and distributed tasks&lt;/td&gt;
&lt;td&gt;Excels with complex computations on relational schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DataFrame Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Available via chDB in Python&lt;/td&gt;
&lt;td&gt;Directly supports &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;Polars&lt;/strong&gt;, and &lt;strong&gt;Arrow&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Highly scalable across multiple nodes&lt;/td&gt;
&lt;td&gt;Limited to single-machine; serverless, embedded use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best For&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Big data analytics, enterprise BI, web analytics&lt;/td&gt;
&lt;td&gt;Local data exploration, data prototyping, ETL tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Performance Comparison: ClickHouse vs. DuckDB
&lt;/h2&gt;

&lt;p&gt;When it comes to high-performance analytical databases, both ClickHouse and DuckDB have unique strengths and limitations. &lt;/p&gt;

&lt;h4&gt;
  
  
  General Performance Overview
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt; generally outperforms &lt;strong&gt;DuckDB&lt;/strong&gt; for larger data volumes and relatively straightforward queries. This strength can be attributed to ClickHouse's columnar storage, distributed nature, and optimizations for large-scale data processing, which allow it to efficiently manage and retrieve massive datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DuckDB&lt;/strong&gt;, however, is highly optimized for &lt;strong&gt;in-memory data processing&lt;/strong&gt; and excels at handling &lt;strong&gt;complex analytical queries&lt;/strong&gt; on single-node setups. DuckDB's ability to work seamlessly in-memory allows it to execute queries quickly for moderate data sizes without needing the distributed setup that ClickHouse typically requires for peak performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Factors Affecting Performance&lt;/strong&gt;: The structure and complexity of the data (like normalized vs. denormalized tables) and query complexity also impact performance for both databases. For instance, ClickHouse performs best with denormalized data, while DuckDB handles normalized data more effectively, especially in analytical tasks.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  DuckDB's Strengths and Limitations
&lt;/h4&gt;

&lt;p&gt;DuckDB’s limitations resemble those of other single-node query engines like &lt;strong&gt;Polars, DataFusion, and Pandas&lt;/strong&gt;. Although it’s often compared to Spark, it’s a bit of an “apples to oranges” comparison due to Spark’s multi-node, distributed setup. DuckDB, Polars, and similar engines are more suitable for fast, single-node analytics and don’t scale out to multiple nodes like Spark or ClickHouse.&lt;/p&gt;

&lt;p&gt;For example, in a recent benchmark on a large dataset, DuckDB excelled at in-memory querying. This left us impressed and somewhat surprised, given the common praise for ClickHouse’s speed and efficiency on large datasets. &lt;/p&gt;

&lt;h4&gt;
  
  
  Observations on ClickHouse’s Speed with Memory-Table Engine
&lt;/h4&gt;

&lt;p&gt;With its Memory table engine, ClickHouse avoids disk I/O, decompression, and deserialization altogether. This kind of performance is advantageous for high-speed requirements and straightforward query patterns.&lt;/p&gt;

&lt;p&gt;However, the &lt;strong&gt;complexity of queries&lt;/strong&gt;, especially analytical ones like the TPC-DS benchmarks, can challenge ClickHouse’s performance, as it relies heavily on denormalized data for speed. Our test seemed to amplify the impact of query complexity on ClickHouse’s performance. If anything, this reinforces the need to tailor the setup and data structure to the use case, particularly when running ClickHouse in scenarios it’s not fully optimized for.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Advantages of DuckDB
&lt;/h2&gt;

&lt;p&gt;As discussed earlier, DuckDB stands out for several key advantages, making it a valuable tool for data analysis.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dependency-Free, Single Binary Deployment
&lt;/h4&gt;

&lt;p&gt;One of the biggest selling points of DuckDB is its minimalist approach to installation. Unlike other databases that require complex setups, DuckDB can be deployed as a single binary with no external dependencies. This makes it incredibly easy to get started with, as there’s no need to configure or maintain a separate database server. You can run it directly within your local environment or even integrate it into your existing workflows with minimal effort.&lt;/p&gt;

&lt;h4&gt;
  
  
  Querying Data Directly
&lt;/h4&gt;

&lt;p&gt;What sets DuckDB apart is its ability to query Pandas, Polars, and Arrow DataFrames directly using SQL. This means you can interact with your data in a familiar Python-based environment and use the full power of SQL for your analysis, without converting between formats or loading the entire dataset into memory. It is like running SQL queries directly on your existing data structures, which streamlines your workflows.&lt;/p&gt;

&lt;h4&gt;
  
  
  Filling the Gap Between Traditional Databases and Data Science Workflows
&lt;/h4&gt;

&lt;p&gt;DuckDB bridges the gap between traditional database management systems and the fast-paced, iterative work often done in data science. For many data scientists, working with large datasets typically means turning to complex and heavyweight systems like PostgreSQL or even Spark. DuckDB, however, offers a simpler, more lightweight alternative while still providing the power of SQL-based analytics. It enables analysts to perform complex queries directly on datasets, whether they're stored locally or in the cloud, without the overhead of setting up a full-fledged database system.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cost-Effective Analytics: Using DuckDB with Parquet Files on GCS
&lt;/h4&gt;

&lt;p&gt;DuckDB’s cost-effectiveness is becoming a major selling point for companies looking to reduce their cloud analytics costs. For instance, many teams are turning to DuckDB to perform analytics on &lt;strong&gt;Parquet&lt;/strong&gt; files stored in &lt;strong&gt;Google Cloud Storage (GCS)&lt;/strong&gt;, rather than using more expensive solutions like &lt;strong&gt;BigQuery&lt;/strong&gt;. BigQuery’s costs can add up quickly with frequent analytical queries, whereas DuckDB enables a "bring-your-own-compute" model, allowing users to leverage their local machines to process cloud data without incurring heavy charges. This makes DuckDB an attractive alternative for data teams looking to cut down on operational costs while still performing powerful analytics.&lt;/p&gt;

&lt;h4&gt;
  
  
  Efficient Data Handling Without Full Data Loading
&lt;/h4&gt;

&lt;p&gt;One of the key benefits of DuckDB over tools like &lt;strong&gt;SQLite&lt;/strong&gt; or &lt;strong&gt;Pandas&lt;/strong&gt; is its ability to process data without loading the entire file into memory. While &lt;strong&gt;Pandas&lt;/strong&gt; requires the full dataset to be loaded before any analysis can be done, DuckDB allows you to copy compressed data directly into memory, bypassing the need to load everything at once. This not only saves memory but also makes DuckDB more efficient when dealing with large files or datasets.&lt;/p&gt;

&lt;h4&gt;
  
  
  Enhancing Data Science Workflows: DuckDB and Polars
&lt;/h4&gt;

&lt;p&gt;While &lt;strong&gt;Polars&lt;/strong&gt; is known for its performance in data manipulation, DuckDB offers a unique advantage by being a full-fledged database. DuckDB can read data from a &lt;strong&gt;Polars&lt;/strong&gt; DataFrame without any manual conversion, allowing you to work with data in both systems seamlessly. You can process data in Polars, then pass it to DuckDB for further SQL-based operations, and even save the results directly to the DuckDB database—all without the need for manual copying or reformatting. This smooth integration significantly enhances productivity and streamlines workflows for data scientists.&lt;/p&gt;

&lt;h4&gt;
  
  
  SQL Support with Advanced Features
&lt;/h4&gt;

&lt;p&gt;Another key advantage of DuckDB is its SQL dialect, which we find to be incredibly powerful. It supports advanced features like macros, which allow for more flexible and reusable queries. This is especially useful for data scientists who need to run complex queries and streamline their analysis. DuckDB also has a functional interface, which means you can work with data in a way similar to &lt;strong&gt;Spark&lt;/strong&gt; or &lt;strong&gt;Pandas&lt;/strong&gt;, but with the power of SQL under the hood. This hybrid approach allows you to transform and manipulate data efficiently, combining the best aspects of both worlds.&lt;/p&gt;

&lt;p&gt;The appeal of DuckDB becomes even clearer when considering the limitations that existed before it. Previously, working with smaller datasets locally was manageable with formats like CSV or Parquet, but as data size increased, the process grew challenging. Setting up traditional databases like MySQL or PostgreSQL for these mid-sized tasks was cumbersome, and distributed systems like Spark felt excessive for datasets that didn’t require that scale. DuckDB fills this gap, allowing small-to-medium datasets to be processed locally, without the need for complex database setups.&lt;/p&gt;

&lt;p&gt;In modern data analysis, data must often be combined from a wide variety of different sources. Data might sit in CSV files on your machine, in Parquet files in a data lake, or in an operational database. DuckDB has strong support for moving data between many different data sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4sn2mah9499nlxf6mlj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm4sn2mah9499nlxf6mlj.png" alt="Image description" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://practicaldataengineering.substack.com/p/duckdb-beyond-the-hype" rel="noopener noreferrer"&gt;DuckDB beyond the hype&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Cases and Applications
&lt;/h2&gt;

&lt;p&gt;Both &lt;strong&gt;ClickHouse&lt;/strong&gt; and &lt;strong&gt;DuckDB&lt;/strong&gt; serve unique purposes in data processing, offering complementary strengths for different tasks.&lt;/p&gt;

&lt;h4&gt;
  
  
  ClickHouse for Large-Scale, Distributed OLAP Workloads
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt; excels in handling large-scale, &lt;strong&gt;distributed OLAP&lt;/strong&gt; workloads. Its &lt;strong&gt;MPP architecture&lt;/strong&gt; scales horizontally, making it perfect for real-time analytics over multi-terabyte datasets. It's used in industries like telecom, finance, and e-commerce where fast query performance on large datasets is crucial. Companies like &lt;strong&gt;Yandex&lt;/strong&gt; and &lt;strong&gt;Uber&lt;/strong&gt; leverage ClickHouse for real-time analytics, making it a top choice for enterprise-scale applications.&lt;/p&gt;

&lt;h4&gt;
  
  
  DuckDB for Serverless Pipelines and Local Data Processing
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;DuckDB&lt;/strong&gt; is ideal for &lt;strong&gt;serverless pipelines&lt;/strong&gt; and local data processing, excelling with &lt;strong&gt;small-to-medium datasets&lt;/strong&gt;. It's great for temporary staging in &lt;strong&gt;ELT jobs&lt;/strong&gt; and data transformations, especially when dealing with &lt;strong&gt;Parquet&lt;/strong&gt; and other semi-structured formats.&lt;/p&gt;

&lt;p&gt;In embedded systems or sensor data applications, DuckDB’s &lt;strong&gt;columnar storage&lt;/strong&gt; and compression make it highly efficient, processing data in tight memory constraints.&lt;/p&gt;

&lt;h4&gt;
  
  
  Complementary Roles in the Data Ecosystem
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;ClickHouse&lt;/strong&gt; is for large, distributed workloads, while &lt;strong&gt;DuckDB&lt;/strong&gt; handles smaller, local processing tasks. They complement each other, with &lt;strong&gt;ClickHouse&lt;/strong&gt; powering big data and cloud-based analytics, and &lt;strong&gt;DuckDB&lt;/strong&gt; simplifying local, serverless data tasks. Together, they provide a flexible, efficient data pipeline for different analytics needs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the evolving landscape of data analytics, both ClickHouse and DuckDB have carved out distinct yet complementary niches. ClickHouse has established itself as a powerhouse for large-scale, distributed OLAP workloads, making it the go-to choice for enterprise-grade deployments handling petabytes of data. DuckDB, meanwhile, has revolutionized local data analysis by offering a lightweight, embedded solution that seamlessly integrates with modern data science tools like Pandas, Polars, and Apache Arrow. &lt;/p&gt;

&lt;p&gt;While ClickHouse excels at handling massive distributed datasets with impressive performance, DuckDB shines in scenarios requiring quick, in-process analytics and complex queries on smaller datasets. The choice between these tools ultimately depends on specific use cases: ClickHouse for organizations requiring robust, distributed analytics at scale, and DuckDB for data scientists and analysts who need efficient, local data processing without the overhead of traditional database systems. &lt;/p&gt;

&lt;p&gt;As organizations continue to grapple with diverse data processing needs, having both tools in the modern data stack enables teams to choose the right tool for their specific analytical requirements, whether it's processing petabytes in the cloud or analyzing gigabytes on a local machine.&lt;/p&gt;

&lt;p&gt;At the end of the day, it’s not about which one is better—it’s about choosing the right tool for the job. &lt;strong&gt;ClickHouse&lt;/strong&gt; powers through big data at scale, while &lt;strong&gt;DuckDB&lt;/strong&gt; gives data scientists the flexibility to run powerful queries on their own machines. &lt;strong&gt;Together, they form the perfect duo&lt;/strong&gt;—the heavyweight and the lightweight—both designed to make data processing faster and more efficient. &lt;/p&gt;




</description>
      <category>database</category>
      <category>analytics</category>
      <category>opensource</category>
      <category>learning</category>
    </item>
    <item>
      <title>ClickHouse: The Key to Faster Insights</title>
      <dc:creator>Arin Zingade</dc:creator>
      <pubDate>Tue, 03 Dec 2024 06:08:32 +0000</pubDate>
      <link>https://dev.to/arinzingade/clickhouse-the-key-to-faster-insights-32me</link>
      <guid>https://dev.to/arinzingade/clickhouse-the-key-to-faster-insights-32me</guid>
      <description>&lt;p&gt;&lt;a href="https://clickhouse.com/" rel="noopener noreferrer"&gt;ClickHouse&lt;/a&gt; is rapidly gaining traction for its unmatched speed and efficiency in processing big data. Cloudflare, for example, uses ClickHouse to process millions of rows per second and reduce memory usage by over four times, making it a key player in large-scale analytics. With its advanced features and real-time query performance, ClickHouse is becoming a go-to choice for companies handling massive datasets.&lt;br&gt;
In this article, we'll explore why ClickHouse is increasingly favored for analytics, its key features, and how to deploy it on Kubernetes. We'll also cover some best practices for scaling ClickHouse to handle growing workloads and maximize performance.&lt;/p&gt;
&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;ClickHouse is a high-performance, column-oriented SQL database management system (DBMS) designed for online analytical processing (OLAP), excelling in handling large datasets with remarkable speed, particularly for filtering and aggregating data. By utilizing columnar storage, it enables rapid data access and efficient compression, making it ideal for industries that demand fast data retrieval and analysis. Its common use cases include web analytics, where it processes vast amounts of tracking data, business intelligence to power high-speed decision-making, and log analysis for large-scale monitoring and troubleshooting.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Features of ClickHouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Columnar Storage:&lt;/strong&gt; Enables fast data access and efficient compression, enhancing the speed of analytical queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Performance and Scalability:&lt;/strong&gt; Optimized for handling massive datasets and complex queries with unique table engines that determine how data is stored.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Analytics:&lt;/strong&gt; Supports real-time data processing and analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximizing Hardware Usage:&lt;/strong&gt; ClickHouse is designed to utilize all available resources of the system effectively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich Functionality:&lt;/strong&gt; Offers a wide array of built-in functions that enhance data manipulation and analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Does ClickHouse Work?
&lt;/h3&gt;

&lt;p&gt;ClickHouse is designed for speed and scalability, making it ideal for handling vast amounts of data. Its distributed nature allows for data replication across multiple nodes, ensuring both fault tolerance and high availability.&lt;/p&gt;
&lt;h4&gt;
  
  
  Architecture
&lt;/h4&gt;

&lt;p&gt;ClickHouse operates on a distributed architecture where data is partitioned and replicated across nodes. It employs a &lt;strong&gt;Shared Nothing Architecture&lt;/strong&gt;, moving towards a decoupled compute and storage model, facilitating parallel and vectorized execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29beiw388dct3liq1wc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29beiw388dct3liq1wc9.png" alt="Image description" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;An Example of Shared Nothing ClickHouse Cluster with 3 replica servers&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Storage Mechanism
&lt;/h4&gt;

&lt;p&gt;ClickHouse uses columnar storage which allows it to read and compress large amounts of data quickly. Organizations migrating from row-based systems like Postgres can benefit significantly in terms of performance.&lt;br&gt;
Tables utilize unique &lt;strong&gt;Table Engines&lt;/strong&gt;, notably the &lt;strong&gt;MergeTree&lt;/strong&gt; engine family, to store data effectively, leveraging ClickHouse’s strengths in analytical processing.&lt;/p&gt;
&lt;h4&gt;
  
  
  Query Execution
&lt;/h4&gt;

&lt;p&gt;ClickHouse utilizes a unique query engine optimized for high-speed data retrieval, leveraging Single Instruction, Multiple Data (SIMD) instructions to process multiple data points simultaneously. This parallel processing significantly enhances performance, especially for complex queries. As demonstrated in the video &lt;a href="https://www.youtube.com/watch?v=XpkFEj1rVXg&amp;amp;t=966s" rel="noopener noreferrer"&gt;A Day in the Life of a Query&lt;/a&gt;, ClickHouse efficiently breaks down and executes queries, focusing on answering specific questions rather than merely retrieving raw data.&lt;br&gt;
To further understand query execution, we can use the &lt;code&gt;EXPLAIN&lt;/code&gt; clause. The &lt;code&gt;EXPLAIN&lt;/code&gt; clause in SQL is used to display the execution plan of a query. When you run a query with &lt;code&gt;EXPLAIN&lt;/code&gt;, the database doesn't actually execute the query. Instead, it shows a detailed breakdown of how the query would be executed, including the steps the query optimizer will take.&lt;/p&gt;

&lt;p&gt;For ClickHouse, the query execution steps look like this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvze4tzf7xmwxh0wmv0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvze4tzf7xmwxh0wmv0m.png" alt="Image description" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://www.youtube.com/watch?v=hP6G2Nlz_cA&amp;amp;t=366s" rel="noopener noreferrer"&gt;Performance introspection EXPLAIN clause&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXPLAIN PLAN:&lt;/strong&gt; The query plan shows, in a generic way, the stages that need to be executed for the query. It does not show how ClickHouse executes the query using the available resources on the machine, but it is handy for checking the order in which the clauses are executed. Read the plan from bottom to top.&lt;/p&gt;

&lt;p&gt;For demonstration purposes, we will be using the &lt;a href="https://clickhouse.com/docs/en/getting-started/example-datasets/uk-price-paid" rel="noopener noreferrer"&gt;UK Property Prices dataset&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="n"&gt;PLAN&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;postcode1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;property_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;uk_price_paid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;is_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2023-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;postcode1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the above query, we get the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Expression &lt;span class="o"&gt;(&lt;/span&gt;Project names&lt;span class="o"&gt;)&lt;/span&gt;
Limit &lt;span class="o"&gt;(&lt;/span&gt;preliminary LIMIT &lt;span class="o"&gt;(&lt;/span&gt;without OFFSET&lt;span class="o"&gt;))&lt;/span&gt;
Sorting &lt;span class="o"&gt;(&lt;/span&gt;Sorting &lt;span class="k"&gt;for &lt;/span&gt;ORDER BY&lt;span class="o"&gt;)&lt;/span&gt;
Expression &lt;span class="o"&gt;((&lt;/span&gt;Before ORDER BY + Projection&lt;span class="o"&gt;))&lt;/span&gt;
Aggregating
Expression &lt;span class="o"&gt;(&lt;/span&gt;Before GROUP BY&lt;span class="o"&gt;)&lt;/span&gt;
Expression
ReadFromMergeTree &lt;span class="o"&gt;(&lt;/span&gt;default.uk_price_paid&lt;span class="o"&gt;)&lt;/span&gt;

Indexes:
    PrimaryKey
    Condition: &lt;span class="nb"&gt;true
    &lt;/span&gt;Parts: 1/1
    Granules: 3598/3598
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In analyzing the query execution plan, it's essential to interpret the steps from the bottom up (in this case from ReadFromMergeTree to Limit), as each layer represents a sequential operation performed on the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXPLAIN AST:&lt;/strong&gt; With this clause, we can explore the Abstract Syntax Tree; we can also visualize it via &lt;a href="https://graphviz.org/" rel="noopener noreferrer"&gt;Graphviz&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="n"&gt;AST&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;postcode1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;property_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;uk_price_paid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;is_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2023-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;postcode1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we get the Abstract Syntax Tree:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5djry4119qltjac83jb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5djry4119qltjac83jb.png" alt="Image description" width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXPLAIN PIPELINE&lt;/strong&gt;: Introspecting the query pipeline can help you identify where the bottlenecks of the query are.&lt;/p&gt;

&lt;p&gt;For the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="n"&gt;PIPELINE&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;postcode1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;property_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;uk_price_paid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;is_new&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2023-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;postcode1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we get the following output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwq1kgs6z0u0k04ne0sqa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwq1kgs6z0u0k04ne0sqa.png" alt="Image description" width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ClickHouse naturally parallelizes queries, with each step utilizing multiple threads by default. In this example, the stages are handled by 4 threads, meaning each thread processes roughly one-fourth of the data in parallel before combining the results. This approach speeds up execution significantly.&lt;br&gt;
Identifying stages that run in a &lt;strong&gt;single thread&lt;/strong&gt; is key to optimizing slow queries. By isolating these bottlenecks, we can target specific parts of the query for performance improvements, ensuring faster and more efficient execution overall.&lt;/p&gt;
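&lt;p&gt;The shape of this parallelism can be sketched in a few lines: split the data into chunks, aggregate each chunk on its own thread, then merge the partial results. This is only a conceptual sketch; ClickHouse's vectorized pipeline is far more sophisticated.&lt;/p&gt;

```python
# Minimal sketch of parallel partial aggregation: each of 4 threads
# aggregates roughly one-fourth of the data, then the partial results
# are merged, mirroring the structure of a parallel query pipeline.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

data = list(range(1, 1001))  # 1..1000
n_threads = 4
chunk_size = len(data) // n_threads
chunks = [data[i * chunk_size:(i + 1) * chunk_size] for i in range(n_threads)]

with ThreadPoolExecutor(max_workers=n_threads) as pool:
    partials = list(pool.map(partial_sum, chunks))

total = sum(partials)  # merge step
assert total == sum(data) == 500500
print(total)
```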
&lt;h4&gt;
  
  
  Integration Capabilities
&lt;/h4&gt;

&lt;p&gt;ClickHouse is highly compatible with a wide range of data tools, including ETL/ELT processes and BI tools like &lt;a href="https://superset.apache.org/" rel="noopener noreferrer"&gt;Apache Superset&lt;/a&gt;. It supports virtually all common data formats, making integration seamless across diverse ecosystems.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why Choose ClickHouse and Migrate?
&lt;/h3&gt;

&lt;p&gt;Choosing ClickHouse offers significant advantages, particularly for organizations dealing with large-scale data analytics. Its unique combination of performance, cost-effectiveness, and community support makes it a compelling choice for migrating from traditional databases.&lt;/p&gt;
&lt;h4&gt;
  
  
  Performance Advantages
&lt;/h4&gt;

&lt;p&gt;ClickHouse is optimized for &lt;strong&gt;OLAP&lt;/strong&gt; workloads, delivering exceptional speed in both data ingestion and query execution, offering sub-second query performance even when processing billions of rows. This makes it ideal for real-time analytics and decision-making in data-intensive industries. &lt;br&gt;
The &lt;strong&gt;primary key&lt;/strong&gt; in ClickHouse plays a crucial role in determining how data is stored and searched. It's important to select columns that are frequently queried, as the primary key should optimize query execution, especially for the &lt;code&gt;WHERE&lt;/code&gt; clause. &lt;strong&gt;In ClickHouse, the primary key is not unique to each row.&lt;/strong&gt;&lt;/p&gt;
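&lt;p&gt;The reason the primary key matters so much is ClickHouse's sparse index: it stores one index entry per granule of rows (8192 rows by default), so a range condition on the primary key lets it skip entire granules. A toy sketch of that skipping logic, with a deliberately tiny granule size:&lt;/p&gt;

```python
# Sketch of sparse-index granule skipping. ClickHouse keeps one index
# entry (mark) per granule; a range condition on the primary key
# selects a contiguous run of granules and skips the rest.
import bisect

GRANULE = 4                      # tiny for the demo; ClickHouse defaults to 8192
keys = list(range(0, 32))        # table sorted by primary key
index_marks = keys[::GRANULE]    # first key of each granule: [0, 4, 8, ...]

def granules_for_range(lo, hi):
    """Return indices of granules that may contain keys in [lo, hi]."""
    first = max(bisect.bisect_right(index_marks, lo) - 1, 0)
    last = bisect.bisect_right(index_marks, hi) - 1
    return list(range(first, last + 1))

selected = granules_for_range(10, 13)
print(selected)  # granules 2 and 3 cover keys 8..15; the other 6 are skipped
assert selected == [2, 3]
```

The "Granules: 3598/3598" line in the earlier EXPLAIN output is the real-world counterpart: it reports how many granules survived this kind of pruning.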
&lt;h4&gt;
  
  
  Real-World Success Stories
&lt;/h4&gt;

&lt;p&gt;Many organizations have successfully migrated to ClickHouse, achieving substantial improvements in performance and cost savings. From e-commerce giants to financial companies, success stories highlight ClickHouse’s ability to transform data analytics capabilities at scale. For more details, refer to &lt;a href="https://clickhouse.com/docs/en/about-us/adopters" rel="noopener noreferrer"&gt;ClickHouse Adopters&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Running ClickHouse on Kubernetes
&lt;/h2&gt;

&lt;p&gt;In this guide, we’ll walk through the process of running ClickHouse on a Kubernetes cluster in 7 steps:&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Install Kubectl
&lt;/h3&gt;

&lt;p&gt;First, we need to install &lt;code&gt;kubectl&lt;/code&gt;, the command-line tool for interacting with Kubernetes clusters. Run the following commands in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; kubectl
&lt;span class="c"&gt;# Download Minikube&lt;/span&gt;
&lt;span class="c"&gt;# Please check your OS configuration and download from:&lt;/span&gt;
&lt;span class="c"&gt;# https://minikube.sigs.k8s.io/docs/start/?arch=%2Flinux%2Fx86-64%2Fstable%2Fbinary+download&lt;/span&gt;
&lt;span class="nb"&gt;sudo install &lt;/span&gt;minikube-linux-amd64 /usr/local/bin/minikube
minikube version
minikube start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, you have set up Kubernetes locally.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Install Altinity ClickHouse Operator
&lt;/h3&gt;

&lt;p&gt;Next, we will download and install the Altinity ClickHouse operator to manage our ClickHouse deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml

kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the ClickHouse operator pod running, which indicates that the operator is successfully deployed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Install the ClickHouse Database
&lt;/h3&gt;

&lt;p&gt;Now we need to install the ClickHouse database itself. Follow these steps:&lt;/p&gt;

&lt;p&gt;A basic configuration example for our demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clickhouse.altinity.com/v1"&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ClickHouseInstallation"&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;  name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-clickhouse&lt;/span&gt;
&lt;span class="na"&gt;  namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test-clickhouse-operator&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;  configuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;    clusters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;      - name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cluster&lt;/span&gt;
&lt;span class="na"&gt;        layout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;          shardsCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;          replicasCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;  templates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;    podTemplates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;      - name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clickhouse-pod-template&lt;/span&gt;
&lt;span class="na"&gt;        spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;          containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;            - name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clickhouse&lt;/span&gt;
&lt;span class="na"&gt;              image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clickhouse/clickhouse-server:latest&lt;/span&gt;
&lt;span class="na"&gt;              resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;                requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;                  cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100m"&lt;/span&gt;
&lt;span class="na"&gt;                  memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
&lt;span class="na"&gt;                limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;                  cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;span class="na"&gt;                  memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
&lt;span class="na"&gt;  defaults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;    templates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;      podTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clickhouse-pod-template&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now apply the configuration and check the status of the pods and services:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;clickhouse-install.yaml | kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; -
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; test-clickhouse-operator
kubectl get services &lt;span class="nt"&gt;-n&lt;/span&gt; test-clickhouse-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see services running as defined in your installation configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Connect to ClickHouse Database
&lt;/h3&gt;

&lt;p&gt;To interact with the ClickHouse database, we need to install the ClickHouse client on our local machine. If you are using a different operating system, refer to the official &lt;a href="https://clickhouse.com/docs/en/install#quick-install" rel="noopener noreferrer"&gt;ClickHouse installation guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Run the following commands to install ClickHouse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; apt-transport-https ca-certificates curl gnupg

curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="s1"&gt;'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key'&lt;/span&gt; | &lt;span class="nb"&gt;sudo &lt;/span&gt;gpg &lt;span class="nt"&gt;--dearmor&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /usr/share/keyrings/clickhouse-keyring.gpg

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/clickhouse.list

&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; clickhouse-server clickhouse-client
&lt;span class="nb"&gt;sudo &lt;/span&gt;clickhouse start

kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; test-clickhouse-operator port-forward &amp;lt;pod_name&amp;gt; 9000:9000 &amp;amp;

clickhouse-client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Test Your Services
&lt;/h3&gt;

&lt;p&gt;To verify that everything is running correctly, execute the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; test-clickhouse-operator
kubectl get services &lt;span class="nt"&gt;-n&lt;/span&gt; test-clickhouse-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Execute Queries
&lt;/h3&gt;

&lt;p&gt;Now, let’s create a table and execute some queries in ClickHouse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;clickhouse-client
CREATE TABLE test_table &lt;span class="o"&gt;(&lt;/span&gt;
    id UInt32,
    name String
&lt;span class="o"&gt;)&lt;/span&gt; ENGINE &lt;span class="o"&gt;=&lt;/span&gt; MergeTree&lt;span class="o"&gt;()&lt;/span&gt;
ORDER BY &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

INSERT INTO test_table VALUES &lt;span class="o"&gt;(&lt;/span&gt;1, &lt;span class="s1"&gt;'CloudRaft'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="o"&gt;(&lt;/span&gt;2, &lt;span class="s1"&gt;'ClickHouse'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
SELECT &lt;span class="k"&gt;*&lt;/span&gt; FROM test_table&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the results in the CLI, and the changed prompt indicates that you are interacting directly with the ClickHouse cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Load Testing
&lt;/h3&gt;

&lt;p&gt;To further evaluate the performance of your ClickHouse installation, consider using load testing tools like &lt;a href="https://jmeter.apache.org/" rel="noopener noreferrer"&gt;Apache JMeter&lt;/a&gt; or &lt;a href="https://k6.io/" rel="noopener noreferrer"&gt;k6&lt;/a&gt; to simulate increased query loads. Measure how query response times change as you add more nodes to the cluster.&lt;/p&gt;
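&lt;p&gt;If you prefer a quick script before reaching for a full load-testing tool, the measurement loop is simple. In this sketch &lt;code&gt;run_query&lt;/code&gt; is a stub; in a real test it would issue a query over HTTP or the native protocol.&lt;/p&gt;

```python
# Quick-and-dirty latency harness: time each query, sort the timings,
# and report a p95. run_query is a stand-in stub for a real client call.
import time

def run_query():
    time.sleep(0.001)  # pretend each query takes at least 1 ms

def measure(n_queries):
    timings = []
    for _ in range(n_queries):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[int(0.95 * len(timings)) - 1]  # 95th percentile

p95 = measure(100)
print(f"p95 latency: {p95 * 1000:.2f} ms")
assert max(p95, 0.001) == p95  # each stubbed query sleeps at least 1 ms
```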

&lt;h2&gt;
  
  
  Key Differences between PostgreSQL and ClickHouse
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;Postgres&lt;/a&gt; and ClickHouse serve different purposes, and a key distinction lies in how they handle &lt;strong&gt;replication&lt;/strong&gt; and &lt;strong&gt;sharding&lt;/strong&gt;. Postgres is primarily designed for transactional workloads (OLTP), where data consistency and durability are prioritized. ClickHouse, on the other hand, is tailored for analytical workloads (OLAP) and optimized for high-speed querying and large-scale data analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Materialized Views
&lt;/h3&gt;

&lt;p&gt;In ClickHouse, &lt;a href="https://clickhouse.com/docs/en/materialized-view" rel="noopener noreferrer"&gt;Materialized Views&lt;/a&gt; are a powerful feature designed to improve query performance by pre-aggregating and storing data. Unlike regular views, which are calculated on-the-fly during query execution, materialized views physically store the results of a query, allowing for faster reads. These views can also leverage the efficient compression and fast access capabilities of the columnar storage model, further enhancing performance. &lt;/p&gt;

&lt;p&gt;Materialized views are particularly useful in environments where query performance is critical, as they provide pre-computed results that save time during execution.&lt;br&gt;
Postgres’s materialized views need to be refreshed manually, whereas ClickHouse updates them automatically at insert time, following its insert-and-optimize-later philosophy.&lt;/p&gt;
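&lt;p&gt;The difference between the two refresh models can be sketched in plain Python: an incrementally maintained aggregate that is updated on every insert (ClickHouse-style) versus a full recomputation on demand (like a manually refreshed Postgres materialized view).&lt;/p&gt;

```python
# Sketch of the two refresh models. mv_counts mimics a ClickHouse
# materialized view: each insert immediately updates the pre-aggregated
# state, so reads stay cheap. recompute() mimics a Postgres-style view
# that must be refreshed manually with a full rescan.
from collections import defaultdict

base_table = []
mv_counts = defaultdict(int)   # "materialized" property count per postcode

def insert(postcode, price):
    base_table.append((postcode, price))
    mv_counts[postcode] += 1   # updated at insert time, ClickHouse-style

def recompute():
    counts = defaultdict(int)  # full rescan, like REFRESH MATERIALIZED VIEW
    for postcode, _ in base_table:
        counts[postcode] += 1
    return counts

insert("SW1A", 500000)
insert("SW1A", 450000)
insert("E1", 350000)

assert mv_counts["SW1A"] == 2
assert dict(recompute()) == dict(mv_counts)  # both models agree on the answer
```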

&lt;h2&gt;
  
  
  Scaling ClickHouse
&lt;/h2&gt;

&lt;p&gt;In ClickHouse, scaling can be achieved through &lt;strong&gt;replication&lt;/strong&gt; and &lt;strong&gt;sharding&lt;/strong&gt; mechanisms. These help distribute data and queries across multiple nodes for performance and fault tolerance.&lt;/p&gt;

&lt;p&gt;ClickHouse traditionally relies on &lt;strong&gt;ZooKeeper&lt;/strong&gt;, a centralized service for coordinating distributed systems. ZooKeeper ensures that data replicas are in sync across nodes by maintaining metadata, managing locks, and handling failovers. It acts as a key component to keep the cluster’s state consistent, ensuring that replicas do not diverge and that read and write operations are properly distributed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Replication
&lt;/h3&gt;

&lt;p&gt;Replication ensures that copies of the same data are stored across multiple nodes to provide redundancy and improve fault tolerance. Replication in ClickHouse is at the Table Level.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ReplicatedMergeTree&lt;/strong&gt; is the engine used for replicated tables.&lt;/li&gt;
&lt;li&gt;Each table has a replica on multiple servers, and these replicas are kept in sync.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse Keeper&lt;/strong&gt; (or ZooKeeper) manages the coordination between these replicas, ensuring consistency by managing locks, transactions, and metadata related to replication.&lt;/li&gt;
&lt;li&gt;In case one replica goes down, the system can still read from and write to the available replicas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Replication Process Example&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Let’s assume there are two replicas, A and B. A write to Replica A will be logged and replicated to Replica B, ensuring that both have the same data. This happens asynchronously to avoid latency issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sharding
&lt;/h3&gt;

&lt;p&gt;Sharding in ClickHouse is the process of dividing data horizontally into smaller parts and distributing it across different servers (shards). This allows ClickHouse to handle very large datasets by spreading the load.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Table&lt;/strong&gt;: ClickHouse uses a distributed table to achieve sharding. A distributed table is a logical table that sits on top of local tables (sharded across different nodes) and acts as a query router.&lt;/li&gt;
&lt;li&gt;When a query is executed on a distributed table, it is automatically routed to the relevant shard(s) based on the sharding key.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sharding Process Example&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Suppose you have 3 nodes (Node 1, Node 2, Node 3), and data is sharded by a key such as user ID. A distributed table will split the data based on the user ID and store different users’ data on different nodes. Queries on user-specific data will be routed directly to the shard holding that user’s data, improving performance.&lt;/li&gt;
&lt;/ul&gt;
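&lt;p&gt;The routing step above can be sketched in a few lines. This is only an illustration of hash-based key routing; here &lt;code&gt;crc32&lt;/code&gt; stands in for whichever hash function the distributed table's sharding expression actually uses.&lt;/p&gt;

```python
# Sketch of sharding-key routing: hash the sharding key (user_id) and
# map it to exactly one shard. The same key always lands on the same
# shard, so a query filtered by user_id only touches that one node.
import zlib

SHARDS = ["node1", "node2", "node3"]

def shard_for(user_id):
    h = zlib.crc32(str(user_id).encode("utf-8"))
    return SHARDS[h % len(SHARDS)]

assert shard_for(42) == shard_for(42)  # routing is deterministic
placement = {uid: shard_for(uid) for uid in range(1, 7)}
print(placement)
```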

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, ClickHouse offers a powerful solution for businesses seeking high-speed, large-scale analytics. With its columnar storage, real-time query performance, and scalability through replication and sharding, it serves as an excellent alternative for organizations transitioning from traditional row-based databases like Postgres. Particularly effective in industries such as web analytics, business intelligence, and log analysis, ClickHouse meets the demands for rapid data retrieval and analysis.&lt;/p&gt;

&lt;p&gt;However, while ClickHouse excels in query performance and scalability, it may introduce complexities in data insertion compared to traditional databases, and it’s not well-suited for OLTP use cases. Organizations considering migration to ClickHouse should weigh these trade-offs, especially if they require frequent real-time inserts or updates. Ultimately, its scalability, cost-effectiveness, and growing community support make ClickHouse a compelling choice for modern data-driven applications, transforming how businesses manage and analyze data.&lt;/p&gt;




</description>
      <category>database</category>
      <category>analytics</category>
      <category>opensource</category>
      <category>learning</category>
    </item>
    <item>
      <title>Decoding OCR: A Comprehensive Guide</title>
      <dc:creator>Arin Zingade</dc:creator>
      <pubDate>Wed, 07 Aug 2024 19:49:09 +0000</pubDate>
      <link>https://dev.to/arinzingade/decoding-ocr-a-comprehensive-guide-4n86</link>
      <guid>https://dev.to/arinzingade/decoding-ocr-a-comprehensive-guide-4n86</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Optical Character Recognition (OCR) stands as a fundamental technology that transforms visual text representations into machine-readable formats. This capability is essential for digitizing printed documents and optimizing data entry processes. Advancements in artificial intelligence (AI) and machine learning (ML) have brought significant improvements to traditional OCR systems. These technologies enhance OCR's ability to accurately interpret text from complex or low-quality images by learning from data variations.&lt;/p&gt;

&lt;p&gt;Looking to the future, OCR is on track for exciting developments. The technology is expected to integrate more seamlessly with other AI domains, such as natural language processing (NLP) and image recognition. This evolution will not only refine its core functionalities but also extend its utility across more sophisticated and holistic data processing solutions. Moreover, the scope of OCR applications is set to expand dramatically. Beyond mere text digitization, future applications may include real-time translation services, accessibility tools for the visually impaired, and interactive educational platforms. This broadening of scope will undoubtedly make OCR an even more vital component of our increasingly digital world.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Metrics for Evaluating OCR Systems
&lt;/h3&gt;

&lt;p&gt;Evaluating the performance of OCR systems is crucial to ensure they meet the required accuracy and efficiency standards. The key metrics are Character Error Rate (CER) and Word Error Rate (WER), both computed by applying the &lt;strong&gt;Levenshtein Distance&lt;/strong&gt;.&lt;br&gt;
An advanced metric known as ZoneMapAltCnt provides even more comprehensive insights into the performance of OCR systems.&lt;/p&gt;
&lt;h4&gt;
  
  
  Levenshtein Distance
&lt;/h4&gt;

&lt;p&gt;Levenshtein Distance is a measure of the difference between two sequences. In the context of OCR, it quantifies how many single-character edits (insertions, deletions, or substitutions) are necessary to change the recognized text into the ground truth text. &lt;/p&gt;
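&lt;p&gt;As a quick illustration (a minimal sketch, not tied to any OCR library), the Levenshtein Distance can be computed with the classic dynamic-programming recurrence:&lt;/p&gt;

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic programming over a single rolling row:
    # prev[j] holds the edit distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

&lt;p&gt;For example, turning "kitten" into "sitting" requires three single-character edits, so the distance is 3.&lt;/p&gt;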
&lt;h4&gt;
  
  
  Character Error Rate (CER)
&lt;/h4&gt;

&lt;p&gt;Character Error Rate (CER) is a fundamental metric in OCR evaluation, representing the percentage of characters that were incorrectly recognized in a text document. It is calculated by comparing the recognized text to a ground-truth text and counting the number of insertions, deletions, and substitutions needed to make the recognized text identical to the ground truth. &lt;/p&gt;
&lt;h4&gt;
  
  
  Word Error Rate (WER)
&lt;/h4&gt;

&lt;p&gt;Word Error Rate (WER) measures the performance of OCR systems at the word level. It is similar to CER but evaluates errors in terms of whole words instead of individual characters. WER is calculated by the number of word insertions, deletions, and substitutions required to match the recognized text with the ground truth. &lt;/p&gt;
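&lt;p&gt;Both metrics reduce to an edit distance divided by the length of the ground truth; the only difference is whether the sequences being compared are characters (CER) or words (WER). A minimal sketch (the function names are illustrative, not taken from any OCR library):&lt;/p&gt;

```python
def edit_distance(ref, hyp):
    # Generic Levenshtein distance over any two sequences
    # (characters for CER, word lists for WER).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cer(ground_truth: str, recognized: str) -> float:
    # Character Error Rate: edits / characters in the ground truth.
    return edit_distance(ground_truth, recognized) / len(ground_truth)

def wer(ground_truth: str, recognized: str) -> float:
    # Word Error Rate: the same computation over word sequences.
    truth_words = ground_truth.split()
    return edit_distance(truth_words, recognized.split()) / len(truth_words)
```

&lt;p&gt;Recognizing "hallo world" for the ground truth "hello world" yields a CER of 1/11 (one wrong character out of eleven) but a WER of 0.5 (one wrong word out of two), which is why the two metrics are usually reported together.&lt;/p&gt;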
&lt;h4&gt;
  
  
  ZoneMapAltCnt
&lt;/h4&gt;

&lt;p&gt;The ZoneMapAltCnt metric represents a more advanced approach to evaluating OCR systems. It assesses both the accuracy of text segmentation and the correctness of the recognized text within those segments: it evaluates the precision of detected text zones and measures character and word accuracy within them, handling segmentation errors effectively. For more details, refer to this &lt;a href="https://inria.hal.science/hal-01981731/file/Paper-Devashish.pdf"&gt;document&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Factors Affecting OCR Accuracy
&lt;/h4&gt;

&lt;p&gt;Several factors influence the accuracy of OCR systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Document Condition:&lt;/strong&gt; Poor quality or damaged documents can significantly reduce OCR accuracy due to obscured or unreadable text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Resolution:&lt;/strong&gt; Higher resolution images provide more detail, allowing for better character recognition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Language:&lt;/strong&gt; OCR systems must be optimized for specific languages, as character sets and linguistic rules vary widely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessing:&lt;/strong&gt; Techniques such as noise reduction, binarization, and normalization improve text readability and OCR accuracy.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Framework For an OCR Model
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym5hfuuva8sts0sts0yo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym5hfuuva8sts0sts0yo.png" alt="Image description" width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Preprocessing Images for OCR
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Opening an Image
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;
&lt;span class="n"&gt;image_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PATH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Inverting Image
&lt;/h4&gt;

&lt;p&gt;Inverting an image in the context of OCR refers to reversing the color scheme of the image to enhance the text's readability and contrast for better recognition accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;inverted_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bitwise_not&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imwrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp/inverted.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inverted_image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, OpenCV handles the inversion for us, and we write the result to the "temp" folder for further analysis.&lt;/p&gt;

&lt;h4&gt;
  
  
  Binarization
&lt;/h4&gt;

&lt;p&gt;Binarization in the context of OCR is a crucial preprocessing step that involves converting a color or grayscale image into a binary image consisting of only two colors, typically black and white. This step is important because most OCR models are designed to work with this format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;grayScale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cvtColor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COLOR_BGR2GRAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gray_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;grayScale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;thresh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;im_bw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gray_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;230&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;THRESH_BINARY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imwrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp/bw.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;im_bw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Noise Removal
&lt;/h4&gt;

&lt;p&gt;Noise removal is a critical preprocessing step in OCR because it enhances the quality of the input images, leading to more accurate text recognition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;noiseRemoval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;kernal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dilate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;erode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;morphologyEx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MORPH_CLOSE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;medianBlur&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; 
&lt;span class="n"&gt;no_noise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;noiseRemoval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;im_bw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imwrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp/no_noise.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;no_noise&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Dilation and Erosion
&lt;/h4&gt;

&lt;p&gt;In OCR, the processes of dilation and erosion play pivotal roles in improving the readability and recognition accuracy of text. Dilation helps to enhance the visibility of characters by thickening them, thus aiding in better character recognition in low-quality or faint prints. Conversely, erosion is used to thin out characters, which prevents misinterpretations and enhances the separation of text from the background.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;#Erosion
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;thin_font&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bitwise_not&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;erode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bitwise_not&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;

&lt;span class="c1"&gt;#Dilation
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;thick_font&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bitwise_not&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dilate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bitwise_not&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt; &lt;span class="err"&gt; &lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;

&lt;span class="n"&gt;eroded_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;thin_font&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;no_noise&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dilate_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;thick_font&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;no_noise&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imwrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp/eroded_image.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eroded_image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cv2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;imwrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp/dilate.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dilate_image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The dilation and erosion processes work correctly only when the image is inverted: the background is black and the text is white.&lt;/em&gt;&lt;/p&gt;
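&lt;p&gt;Because both operations assume white text on a black background, a small polarity check can invert the image automatically when needed (a heuristic sketch assuming an 8-bit binarized image; the function name is illustrative):&lt;/p&gt;

```python
import numpy as np

def ensure_white_on_black(image: np.ndarray) -> np.ndarray:
    # In a binarized page, background pixels dominate. A bright mean
    # therefore implies a white background, so invert to obtain white
    # text on black before applying dilation or erosion.
    if image.mean() > 127:
        return 255 - image
    return image
```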

&lt;h3&gt;
  
  
  Best OCR Models for Different Use Cases
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Amazon Textract&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use Case&lt;/strong&gt;: Industry Level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strength&lt;/strong&gt;: Amazon Textract is highly effective for industrial-scale document processing, capable of extracting text and data from virtually any type of document, including forms and tables. It integrates seamlessly with other AWS services, making it ideal for businesses looking to automate document workflows in cloud environments.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/textract/" rel="noopener noreferrer"&gt;https://aws.amazon.com/textract/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SuryaOCR&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use Case&lt;/strong&gt;: Large Language Range Support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strength&lt;/strong&gt;: SuryaOCR stands out for its extensive language support, making it suitable for global applications where documents in multiple languages need to be processed. This makes it a valuable tool for international organizations and government agencies dealing with multilingual data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://github.com/VikParuchuri/surya" rel="noopener noreferrer"&gt;https://github.com/VikParuchuri/surya&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tesseract&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use Case&lt;/strong&gt;: Customizable and Versatile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strength&lt;/strong&gt;: Tesseract is an open-source OCR engine that offers flexibility and customization, which is perfect for developers looking to integrate OCR into their applications without significant investment. Its versatility makes it a popular choice for academic research, prototype development, and small business applications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://github.com/tesseract-ocr" rel="noopener noreferrer"&gt;https://github.com/tesseract-ocr&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;EasyOCR&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use Case&lt;/strong&gt;: Good for Small and Simple Projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strength&lt;/strong&gt;: EasyOCR is an accessible and straightforward tool for developers who need a quick and efficient solution for small-scale projects. It supports multiple languages and is easy to set up, making it ideal for startups and individual developers working on applications with less complex OCR requirements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://github.com/JaidedAI/EasyOCR" rel="noopener noreferrer"&gt;https://github.com/JaidedAI/EasyOCR&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Different Use Cases Where OCR Can Be Used
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Automated Form Processing&lt;/li&gt;
&lt;li&gt;Digital Archiving&lt;/li&gt;
&lt;li&gt;License Plate Recognition&lt;/li&gt;
&lt;li&gt;Legal Document Analysis&lt;/li&gt;
&lt;li&gt;Educational Resources&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Case Study - A Demo on &lt;a href="https://github.com/VikParuchuri/surya" rel="noopener noreferrer"&gt;Surya-OCR&lt;/a&gt;
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Overview of Surya OCR
&lt;/h4&gt;

&lt;p&gt;Surya OCR is a comprehensive document OCR toolkit. This toolkit is designed to handle a wide range of document types and supports OCR in over 90 languages, benchmarking favorably against other leading cloud services.&lt;/p&gt;

&lt;p&gt;They also have a hosted API &lt;a href="https://www.datalab.to/" rel="noopener noreferrer"&gt;https://www.datalab.to/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual Support:&lt;/strong&gt; Capable of performing OCR in more than 90 languages, making it highly versatile for global applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Text Detection:&lt;/strong&gt; Offers line-level text detection capabilities, which work effectively across any language.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sophisticated Layout Analysis:&lt;/strong&gt; Detects various layout elements such as tables, images, headers, etc., and determines their arrangement within the document.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reading Order Detection:&lt;/strong&gt; Identifies and follows the reading order in documents, which is crucial for understanding structured data like forms and articles.&lt;/li&gt;
&lt;li&gt;Surya also offers performance tips for optimizing GPU and CPU usage during OCR processing, ensuring efficient handling of resources, unlike Tesseract.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Surya is particularly adept at handling complex OCR tasks such as processing scientific papers, textbooks, scanned documents, and even mixed-language content efficiently.&lt;/li&gt;
&lt;li&gt;The toolkit is available through a hosted API that supports PDFs, images, Word documents, and PowerPoint presentations, ensuring high reliability and consistent performance without latency spikes.&lt;/li&gt;
&lt;li&gt; Surya can be installed via pip and requires Python 3.9+ and PyTorch. The model weights download automatically upon the first run.&lt;/li&gt;
&lt;li&gt;It includes a user-friendly Streamlit app that allows for interactive testing of the OCR capabilities on images or PDF files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this demonstration, we will explore how to perform &lt;strong&gt;Text Detection, OCR, Reading Layout, and Reading Order&lt;/strong&gt; using Surya. We will cover three methods: using the Streamlit GUI, through the Command Line Interface, and directly from Python code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Surya OCR Through GUI
&lt;/h4&gt;

&lt;p&gt;To run Surya OCR GUI locally on your machine, you will need to open your Command Line Interface (CLI) and follow the given instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;streamlit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After successfully installing Streamlit, execute the snippet below in the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;surya_gui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32acal5mhr3ekdo03232.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32acal5mhr3ekdo03232.png" alt="Image description" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The app will start running on &lt;br&gt;
&lt;a href="http://localhost:8501" rel="noopener noreferrer"&gt;http://localhost:8501&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq22fslqessoy9i4habyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq22fslqessoy9i4habyp.png" alt="Image description" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dashboard above will be displayed if all the previous steps executed successfully. &lt;br&gt;
Now just follow the steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on Browse File Button&lt;/li&gt;
&lt;li&gt;Select your desired file and language.&lt;/li&gt;
&lt;li&gt;Choose one between:

&lt;ol&gt;
&lt;li&gt;Run Text Detection&lt;/li&gt;
&lt;li&gt;Run OCR&lt;/li&gt;
&lt;li&gt;Run Layout Analysis&lt;/li&gt;
&lt;li&gt;Run Reading Order&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;All the images from this point are processed by Surya OCR&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Text Detection&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Identifies areas within an image or document where text is present. This step involves locating text blocks and distinguishing them from non-text elements like images and backgrounds.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjiqfep9fnv7ttksxb6zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjiqfep9fnv7ttksxb6zr.png" alt="Image description" width="800" height="702"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey2908nksglga3wlilx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fey2908nksglga3wlilx7.png" alt="Image description" width="800" height="702"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OCR (Optical Character Recognition)&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Transforms the detected text areas into machine-readable characters. This process involves analyzing the shapes of characters and converting them into corresponding text data.&lt;/li&gt;
&lt;li&gt;Note that here we used an inverted image as the uploaded file.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncxhkz46onln3wszfwx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncxhkz46onln3wszfwx8.png" alt="Image description" width="800" height="870"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4i4qte7pqbw1128ew4hh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4i4qte7pqbw1128ew4hh.png" alt="Image description" width="800" height="869"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Layout Analysis and Reading Order&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Analyzes the physical structure of the document to understand how different elements are organized. This includes the detection of headers, footers, columns, tables, and images, helping to interpret the document as a whole rather than just isolated text blocks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6djjml3j0ocwaq32ycs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6djjml3j0ocwaq32ycs.png" alt="Image description" width="679" height="960"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zh6gfdvpsmgyc4ovfwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zh6gfdvpsmgyc4ovfwv.png" alt="Image description" width="679" height="960"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Using Surya OCR via Command Line
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Text Recognition
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open a Command Prompt&lt;/strong&gt;: Navigate to the folder containing the images you wish to process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Execute Surya OCR&lt;/strong&gt;: Type the following command to process your images using Surya OCR. Replace &lt;code&gt;DATA_PATH&lt;/code&gt; with the path to your images relative to your current directory. This command will output the recognized text in the "results" folder.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;surya_ocr DATA_PATH &lt;span class="nt"&gt;--images&lt;/span&gt; &lt;span class="nt"&gt;--langs&lt;/span&gt; hi,en
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;In the command above, &lt;code&gt;--langs hi,en&lt;/code&gt; specifies the languages for OCR: &lt;code&gt;hi&lt;/code&gt; is Hindi and &lt;code&gt;en&lt;/code&gt; is English. Surya OCR supports 90+ ISO language codes. For the complete list of supported languages, refer &lt;a href="https://github.com/VikParuchuri/surya/blob/master/surya/languages.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;Similarly,&lt;/p&gt;

&lt;h4&gt;
  
  
  Text Line Detection
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;surya_detect DATA_PATH &lt;span class="nt"&gt;--images&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Layout Analysis
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;surya_layout DATA_PATH &lt;span class="nt"&gt;--images&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Reading Order
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;surya_order DATA_PATH &lt;span class="nt"&gt;--images&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the results obtained are the same as those presented earlier in this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Surya OCR from Python
&lt;/h3&gt;

&lt;p&gt;Sometimes the images we want to run OCR on are not up to the mark and need some of the preprocessing techniques discussed earlier in this article. In that case, we can build a pipeline through which each image passes before recognition. This is easy to do in Python by chaining preprocessing functions and then applying Surya OCR to the processed document.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;surya-ocr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
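&lt;p&gt;As a sketch of such a pipeline (using only Pillow; the scale factor and the choice of steps here are illustrative assumptions, not Surya requirements), an image can be converted to grayscale, contrast-stretched, and upscaled before being handed to the OCR functions shown below:&lt;/p&gt;

```python
from PIL import Image, ImageOps

def preprocess(image: Image.Image, scale: int = 2) -> Image.Image:
    """Prepare a scan for OCR: grayscale, stretch contrast, upscale."""
    gray = ImageOps.grayscale(image)         # drop colour information
    stretched = ImageOps.autocontrast(gray)  # spread the intensity range
    w, h = stretched.size
    return stretched.resize((w * scale, h * scale), Image.LANCZOS)

# Demo on an in-memory image; in practice, open your scanned page instead
# and pass the result to the Surya OCR calls in the snippets below.
page = Image.new("RGB", (200, 100), "white")
processed = preprocess(page)
print(processed.mode, processed.size)  # L (400, 200)
```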



&lt;h4&gt;
  
  
  Text Recognition
&lt;/h4&gt;

&lt;p&gt;This Python script utilizes the &lt;code&gt;surya.ocr&lt;/code&gt; library to perform optical character recognition (OCR) on images. The script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads an image for OCR.&lt;/li&gt;
&lt;li&gt;Initializes necessary models and processors for text detection and recognition.&lt;/li&gt;
&lt;li&gt;Executes the OCR process on the image, returning text predictions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;To use the segformer model, pin Surya to version 0.4.14; in the latest release the required file is missing.&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.ocr&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_ocr&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.model.detection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.model.recognition.model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.model.recognition.processor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_processor&lt;/span&gt;

&lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IMAGE_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;langs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;det_processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;det_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_processor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;rec_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rec_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;load_processor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="c1"&gt;# Perform OCR and get predictions
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_ocr&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;langs&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;det_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;det_processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rec_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rec_processor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Line Detection
&lt;/h4&gt;

&lt;p&gt;This segment of the code focuses on detecting textual lines within an image:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It loads an image and uses the &lt;code&gt;surya.detection&lt;/code&gt; module.&lt;/li&gt;
&lt;li&gt;Applies a text detection model to find textual lines.&lt;/li&gt;
&lt;li&gt;Outputs a list of dictionaries containing detected text lines for further processing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.detection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;batch_text_detection&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.model.detection.model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;load_processor&lt;/span&gt;

&lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IMAGE_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;load_processor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  

&lt;span class="c1"&gt;# Get predictions of text lines
&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;batch_text_detection&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Layout Analysis
&lt;/h4&gt;

&lt;p&gt;This script analyzes the layout of the page within an image:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads an image and initializes models for both line detection and layout analysis.&lt;/li&gt;
&lt;li&gt;First, detects text lines, then performs layout analysis based on these lines.&lt;/li&gt;
&lt;li&gt;Returns structured data indicating the layout of content in the image.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.detection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;batch_text_detection&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.layout&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;batch_layout_detection&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.model.detection.model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;load_processor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.settings&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;

&lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IMAGE_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;det_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;det_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;load_processor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LAYOUT_MODEL_CHECKPOINT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;load_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LAYOUT_MODEL_CHECKPOINT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

&lt;span class="c1"&gt;# First detect lines, then analyze layout
&lt;/span&gt;&lt;span class="n"&gt;line_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;batch_text_detection&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;det_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;det_processor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;layout_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;batch_layout_detection&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Reading Order
&lt;/h4&gt;

&lt;p&gt;This code snippet establishes the reading order within a document:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads an image and extracts bounding boxes (bboxes) of detected text elements.&lt;/li&gt;
&lt;li&gt;Utilizes the &lt;code&gt;surya.ordering&lt;/code&gt; module to determine the sequential order of text blocks.&lt;/li&gt;
&lt;li&gt;Outputs ordered text predictions to guide further content analysis or extraction.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.ordering&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;batch_ordering&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.model.ordering.processor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_processor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;surya.model.ordering.model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_model&lt;/span&gt;

&lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IMAGE_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;bboxes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bbox1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bbox2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt; 
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nf"&gt;load_processor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 

&lt;span class="c1"&gt;# Get ordered text predictions
&lt;/span&gt;&lt;span class="n"&gt;order_predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;batch_ordering&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bboxes&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Like text detection, this function returns structured data: the coordinates of each block together with its position in the reading order, organized in a way that reflects the physical structure of the page.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a deeper dive into Surya-OCR, an advanced OCR system, enthusiasts and developers can explore its extensive components on GitHub. This open-source project is readily accessible for those eager to understand its mechanics or contribute to its evolution. Visit &lt;a href="https://github.com/VikParuchuri/surya" rel="noopener noreferrer"&gt;Surya-OCR on GitHub&lt;/a&gt; to explore the documentation, source code, and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations of Surya-OCR &amp;amp; Scope of Improvement
&lt;/h3&gt;

&lt;p&gt;Surya-OCR stands out for its impressive multilingual support and specialization in digitizing printed documents. Despite its strengths, there are a few limitations users should be aware of. Primarily, Surya-OCR is optimized for printed text and can struggle with text on complex backgrounds or in handwritten formats, potentially leading to inaccuracies.&lt;/p&gt;

&lt;p&gt;Additionally, the toolkit requires substantial GPU resources for optimal performance, with recommendations like 16GB of VRAM for batch processing. This high demand may exclude users with limited hardware capabilities. Also, issues with the confidence levels in the model's text detection could affect its reliability, especially in critical applications where accuracy is paramount.&lt;/p&gt;

&lt;p&gt;Optical Character Recognition (OCR) technology has made significant strides, evolving from simple text digitization to becoming an integral part of complex AI-driven applications. This evolution can be further enhanced by the integration with multimodal Large Language Models (LLMs), which are capable of processing and understanding information from multiple data types, including text, images, and audio.&lt;/p&gt;

&lt;p&gt;Multimodal LLMs can complement traditional OCR systems in several ways. While OCR excels at extracting raw text from images, multimodal LLMs can interpret the context within which the text appears, understanding nuances and subtleties that OCR alone might miss. This synergy allows for a more nuanced understanding of documents in contexts where text is intertwined with visual elements, such as infographics, annotated diagrams, and mixed media documents.&lt;/p&gt;

&lt;p&gt;For example, in educational materials where diagrams are annotated with textual explanations, OCR can extract the text, and the multimodal LLM can provide insights into how the text relates to the graphical content. This could be invaluable for creating accessible educational tools, where both text and visuals need to be made comprehensible to users with different needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In my opinion, OCR has transcended its traditional role, enhanced by advancements in AI and ML, to become a cornerstone technology in our digital era. As it integrates further with fields like NLP and image recognition, OCR is expanding into dynamic applications such as real-time translation and accessibility tools, transforming how we interact with information. But there is still much to be done in the field.&lt;/p&gt;

&lt;p&gt;Multimodal Large Language Models (LLMs) represent a promising evolution in OCR technology. By combining OCR with these models, we can extract not just text but understand the context of images, making digital content more accessible and interpretable. &lt;/p&gt;

&lt;p&gt;As we continue to refine these technologies, the potential for creating seamless and intuitive user interfaces that can interpret and respond to a complex blend of textual, visual, and auditory inputs is immense. This could revolutionize the way we interact with our devices, making technology an even more integral part of everyday life.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>ai</category>
      <category>computervision</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
