Derek

AI-driven OCR Revolutionizes Intelligent Layout Analysis with 24+ Labels

With rapid technological development and ever-changing business needs, automating repetitive tasks has become key to improving efficiency in modern enterprises and a cornerstone of digital transformation. RPA (Robotic Process Automation) is an effective technology for addressing this challenge, and companies are increasingly adopting it to modernize their internal workflows.

 

Customer Background and Challenges 

 

A technology company specializing in office software development plans to create an RPA product and an intelligent Q&A product to help enterprises automate workflows and business processes, thereby meeting the needs of efficient, cost-effective, and compliant operations while enhancing customer experience. 

 

However, while developing the RPA and AI Q&A products, the company ran into challenges in processing unstructured documents: manually labeling massive volumes of documents was inefficient and error-prone, which increased costs and slowed development. They learned that ComIDP's intelligent document processing solution had once helped a data provider process over 3 million unstructured documents in 5 days, which prompted them to request automated data labeling for intelligent layout recognition and data parsing.

 

ComIDP customized layout recognition parameters for them and upgraded OCR technology using AI models, employing over 24 labels to restore the layout and logic of documents, ensuring the integrity and consistency of the document layout. This company deployed ComIDP's intelligent document solution in a clustered environment for developing RPA and intelligent Q&A products, significantly shortening their development cycle, reducing costs, and enabling rapid market entry for the products. 

 

Customer Pain Points 

Due to the complex content and inconsistent format of unstructured documents, data parsing and extraction become extremely challenging. Layout recognition is a major difficulty in parsing unstructured documents, as each layout has numerous page elements, varying layouts and styles, and different logical relationships between contents. Additionally, issues such as noise, skew, and perspective further increase the difficulty of recognition. This requires parsing technology with high adaptability and intelligence. However, lacking advanced technology support, this enterprise had to rely on manual processing, which was inefficient and inaccurate, directly impacting the effectiveness of the RPA and Q&A systems.

Manual Data Labeling

This technology enterprise previously used manual labeling of unstructured data for document layout recognition, which was time-consuming and prone to errors. When different people handled the same dataset, labeling results varied, leading to inconsistent data quality. This not only increased the cost and time for subsequent data verification but also complicated the development work and extended project timelines.

Massive Document Input

This company processes hundreds of thousands of files daily, which requires servers with high efficiency and high-load processing capacity. However, traditional server architectures could not handle data input at this scale, resulting in slow system performance.

Self-development Challenges

In a competitive market, self-development can bring personalized solutions but is costly and time-consuming. Long development cycles make it difficult for companies to quickly respond to market changes, risking the loss of market opportunities. 

 

Customer Requirements 

This company described their product's application scenarios to our ComIDP team in detail and proposed specific requirements for intelligent data labeling in layout analysis, aiming to improve data parsing results while achieving AI data automation.

Types of AI Data Labeling

They needed to annotate titles, paragraphs, code blocks, tables, formulas, lists, and non-text content within documents to ensure unstructured document completeness. Separating natural paragraphs and layout segmentation were particularly crucial.

 

| Type of Label | Sub-type | Note |
| --- | --- | --- |
| title | title | All levels of titles on the page need to be labeled. |
| paragraph | paragraph | Text fragments consisting of plain text are categorized as paragraphs. To facilitate data search and location, large sections with multiple independent semantic paragraphs should be split, typically segmented by natural paragraphs and punctuation, and each text fragment after segmentation is called a paragraph. |
| block | block: unknown category, non-text block | A block is the output of layout segmentation. Data of the same type of information that is visually in a connected domain is a block. 1. An image on the same row is a block, a table is a block, and a large section of text under the same column is a block. 2. Blocks must consist of same-type information, mixed areas of different types cannot form one block and must be split. For example, a mixed area with an image and a table cannot be a block. |
| | code-block: code block | |
| | img-block: mixed text and image block | |
| | table-block: table block | |
| | sci-block: scientific formulas block | |
| | list-block: list block, text, such as directories or text lists | Must be at least two lines (3 and more), with average line text not exceeding two rows, else it's a paragraph |



Beyond these fundamental needs, each data labeling type had specific restrictions: titles must be labeled as standalone elements, paragraphs and blocks must not overlap, blocks must not span multiple columns, and blocks must not contain mixed data types.
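Restrictions like these lend themselves to automated checks on labeled pages. The following is a minimal sketch, assuming each labeled region is stored as an axis-aligned bounding box in page coordinates. The `Region` type and the `find_overlaps` helper are hypothetical names used for illustration and are not part of ComIDP. The check flags any two regions that overlap, which is slightly stricter than the paragraph and block rule stated above.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Region:
    """Hypothetical labeled region: a label plus a bounding box (assumed format)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Region, b: Region) -> bool:
    """Return True if the two boxes share an area of positive size."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def find_overlaps(regions: list[Region]) -> list[tuple[Region, Region]]:
    """Return every pair of regions on a page that overlap."""
    return [(a, b) for a, b in combinations(regions, 2) if overlaps(a, b)]


# Illustrative page: the table box starts before the paragraph box ends.
page = [
    Region("title", 50, 40, 550, 80),
    Region("paragraph", 50, 100, 550, 300),
    Region("table-block", 50, 280, 550, 500),
]

for a, b in find_overlaps(page):
    print(f"overlap: {a.label} / {b.label}")
```

Boxes that only touch along an edge are not reported, since the comparisons are strict.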

Output Labeling

After parsing the unstructured documents, this company required the output files in JSON format, with output labels limited to title, paragraph, block, code-block, img-block, table-block, sci-block, and list-block. This supports subsequent key information extraction and semantic analysis, enhancing the accuracy of RPA and Q&A systems.
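The article does not publish the JSON schema, so the sketch below assumes a simple one: a list of elements, each with a `label` and a `text` field. Only the eight label names come from the requirements above; the field names and the `validate_output` helper are assumptions for illustration. It shows how a downstream consumer might reject output that uses a label outside the agreed set.

```python
import json

# The eight output labels listed in the customer requirements.
ALLOWED_LABELS = {
    "title", "paragraph", "block", "code-block",
    "img-block", "table-block", "sci-block", "list-block",
}


def validate_output(raw: str) -> list:
    """Parse a JSON document and reject any element with an unexpected label.

    Assumes the document is a list of objects with a "label" key
    (hypothetical schema, not ComIDP's documented format).
    """
    elements = json.loads(raw)
    for i, element in enumerate(elements):
        label = element.get("label")
        if label not in ALLOWED_LABELS:
            raise ValueError(f"element {i}: unexpected label {label!r}")
    return elements


sample = json.dumps([
    {"label": "title", "text": "Quarterly Report"},
    {"label": "paragraph", "text": "Revenue grew in the third quarter."},
    {"label": "table-block", "text": ""},
])

print(len(validate_output(sample)), "elements passed validation")
```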

 

ComIDP's R&D team customized the layout recognition parameters based on the customer's needs. After continuous updates and iterations, accuracy exceeded 95%, and the solution was successfully delivered to the customer for acceptance.

 

ComIDP Solution 

The ComIDP team held in-depth conversations with this enterprise's R&D team to understand their specific needs and business goals, ensuring a customized and practical solution. From data collection and AI model training to model optimization and test reports, we provide professional, flexible, and efficient services for customers.

Layout Analysis Model Training

By collecting different types of samples for manual data labeling, such as financial reports, papers, newspapers, and books, our R&D team trained a layout analysis AI model applicable to various industries. This model accurately identifies and classifies various elements on the page, such as titles, paragraphs, tables, and images, using 24 predefined labels, with recognition accuracy surpassing 95%. 

 

Based on the specific data labeling needs of this enterprise, we further optimized our AI model. Through refined labeling types and rules, we achieved precise automated data labeling of complex document content. For instance, special recognition algorithms were designed and adjusted for code blocks and formula blocks in technical documents, so that this content is accurately extracted and distinguished. AI-based ComIDP analyzes both the geometric and the logical layout of a document, ensuring 99% restoration of document layout and reading logic structure, thereby maintaining layout completeness and consistency. As requested by the client, labeled results are output in standardized JSON format, facilitating secondary processing and data analysis.

 

Test Reports Verify Effectiveness

Functional Testing

After AI model training was complete, we conducted multiple rounds of rigorous testing to validate its performance, using client-provided examples as validation sets to measure model accuracy, and produced a functional testing report. The report detailed how our AI OCR model behaved when automatically processing various document types, covering different formats, sizes, and languages, as well as elements such as stamps, charts, formulas, and flowcharts. These results served as critical acceptance criteria for the model.

| Dimension | Coverage |
| --- | --- |
| Format | PNG, JPG, JPEG, BMP |
| Size | 100KB ~ 30MB |
| Languages | Simplified Chinese, English, Mixed Chinese and English |
| Types | Tables, Complex Layouts, Stamps, Handwritten text, Exams, Formulas, Flowcharts, Skewed text, Scanned, and Photographed books and PPTs |
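The article does not state how accuracy was calculated. One plausible reading, sketched below under that assumption, is per-element label accuracy: the share of validation elements whose predicted label matches the reference label. The function names and the sample labels are made up for illustration.

```python
from collections import Counter


def label_accuracy(reference, predicted):
    """Fraction of elements whose predicted label equals the reference label."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must be the same length")
    if not reference:
        raise ValueError("validation set is empty")
    correct = sum(r == p for r, p in zip(reference, predicted))
    return correct / len(reference)


def confusions(reference, predicted):
    """Count (reference, predicted) pairs that disagree."""
    return Counter((r, p) for r, p in zip(reference, predicted) if r != p)


# Illustrative validation labels, not real test data.
reference = ["title", "paragraph", "sci-block", "table-block", "paragraph"]
predicted = ["title", "paragraph", "paragraph", "table-block", "paragraph"]

print(f"accuracy: {label_accuracy(reference, predicted):.2f}")
print(confusions(reference, predicted).most_common())
```

Listing the confused pairs alongside the headline number shows which label types, such as formula blocks mistaken for paragraphs, need further tuning.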

 

From the test report, we selected the final output of ComIDP processing a document containing formulas. It showed accurate recognition of both text and formulas, and the customer was very satisfied with the results.

 

Stress Testing

Because this enterprise handles over a hundred thousand document inputs daily, we performed comprehensive stress testing to ensure the system could withstand massive document input. We tested PDF to Word (Grid Layout) conversion with and without OCR in both synchronous and asynchronous environments. Our stress test report indicated that ComIDP maintained stability, accuracy, and quick responses under high load, demonstrating its performance and reliability in high-load tasks.

 

 

| | Synchronous Testing | Asynchronous Testing |
| --- | --- | --- |
| Test Scenario | 200 users converting files simultaneously. | 200 users converting files simultaneously, lasting over 10 minutes. |
| Test Results | All 200 users succeeded in conversion. | All 200 users succeeded in conversion. |
| | Success Rate and Accuracy reached 100%, with no error responses. | Success Rate and Accuracy reached 100%, with no error responses. |
| | 99% response time under 1 second. | 99% response time under 1 second. |
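To make the reported metrics concrete, here is a rough sketch of how a success rate and a 99th-percentile response time could be measured for 200 concurrent users. It does not call ComIDP: `fake_convert` is a stand-in that sleeps for a short random time, and every name in the block is an assumption. A real test would replace the stub with an actual conversion request.

```python
from __future__ import annotations

import asyncio
import math
import random
import time

USERS = 200  # concurrent users, as in the test scenario


async def fake_convert(user_id: int) -> bool:
    """Stand-in for one conversion request; always succeeds after a short delay."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return True


async def timed_call(user_id: int) -> tuple[bool, float]:
    """Run one request and return (succeeded, elapsed seconds)."""
    start = time.perf_counter()
    try:
        ok = await fake_convert(user_id)
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


async def run_load_test(users: int = USERS) -> tuple[float, float]:
    """Fire all requests concurrently; return (success rate, p99 latency)."""
    results = await asyncio.gather(*(timed_call(u) for u in range(users)))
    success_rate = sum(ok for ok, _ in results) / users
    p99 = percentile([latency for _, latency in results], 99)
    return success_rate, p99


rate, p99 = asyncio.run(run_load_test())
print(f"success rate: {rate:.0%}, p99 latency: {p99:.3f}s")
```

The nearest-rank percentile is used because it always returns an observed response time, which keeps a statement such as "99% of responses under 1 second" easy to verify against the raw samples.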

 

GPU&CPU Speed Testing

Additionally, we deployed a GPU to accelerate document processing speeds. Comparing GPU and CPU efficiency for the same tasks resulted in a detailed OCR GPU&CPU speed comparison report.

 

The comparison below shows the time ComIDP took to process 100 image samples on GPU versus CPU. Testing indicated that, in a dual-container environment on a dual-GPU system, ComIDP processes up to 20,000 images per minute on average. GPU processing was 100 times faster than CPU processing, a significant speed advantage that substantially reduces the time needed for large-scale document processing. To match the customer's actual applications and document processing demands, we provided a customized cluster deployment solution to keep ComIDP efficient in real-world use.
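As a back-of-the-envelope check on what these figures mean for a daily workload, the sketch below uses the two numbers reported above. The daily volume of 300,000 images is an illustrative assumption (the article only says "hundreds of thousands" of files), as is treating each file as one image and applying the 100x ratio to sustained throughput.

```python
GPU_IMAGES_PER_MINUTE = 20_000  # figure reported in the article
GPU_SPEEDUP_OVER_CPU = 100      # figure reported in the article
DAILY_IMAGES = 300_000          # assumption: illustrative volume, one image per file


def minutes_needed(images: int, images_per_minute: float) -> float:
    """Time to process a batch at a constant throughput."""
    return images / images_per_minute


gpu_minutes = minutes_needed(DAILY_IMAGES, GPU_IMAGES_PER_MINUTE)
cpu_minutes = minutes_needed(DAILY_IMAGES, GPU_IMAGES_PER_MINUTE / GPU_SPEEDUP_OVER_CPU)

print(f"GPU: {gpu_minutes:.0f} min, CPU: {cpu_minutes / 60:.0f} h")
# prints "GPU: 15 min, CPU: 25 h"
```

Under these assumptions a CPU-only setup could not finish one day's input within a day, which is consistent with the decision to deploy GPUs in a cluster.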
