DEV Community: Sudhanshu

Open-Source Table Extraction Tool: Extract Structured Data from Documents with OCR and Computer Vision

Sudhanshu — Fri, 24 Jan 2025 19:19:18 +0000

Extracting tabular data from documents remains one of the biggest challenges in industries like healthcare, insurance, and finance. When processing claims, invoices, or contracts, maintaining the structure of complex tables is crucial for accurate insights.

Traditional methods — such as OCR paired with Language Models — often lose the structural integrity of tables, leading to mismatched columns and rows. Vision-based LLMs promise better accuracy but come with significant computational costs and occasional hallucinations.

I’m excited to share a cost-effective and scalable open-source solution that addresses these challenges!

🛠️ What Does the Tool Do?

My solution is designed to extract structured tabular data from document images, combining the best of OCR and computer vision technologies with custom processing logic.

Here’s how it works:

Table Detection: Identifies and extracts tables from images using HuggingFace’s Table Detection.
OCR Integration: Uses PaddleOCR to read text within table cells.
Linked List Algorithm: Builds a structured linked list to preserve the table layout and outputs it in multiple formats like Pandas DataFrames, HTML tables, or CSVs.
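
The layout-preserving step can be pictured with a small sketch. This is not the repository's code: the Cell class, the link_cells, to_rows and to_csv helpers, and the row_tolerance value are hypothetical names chosen for illustration, and the sketch assumes OCR has already produced one text box per cell with x/y pixel coordinates.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    text: str
    x: float  # left edge of the OCR box
    y: float  # top edge of the OCR box
    right: Optional["Cell"] = None  # next cell in the same row
    down: Optional["Cell"] = None   # first cell of the next row (row heads only)


def link_cells(cells, row_tolerance=10.0):
    """Group cells into rows by y, order each row by x, and link them."""
    rows, row, row_y = [], [], None
    for cell in sorted(cells, key=lambda c: c.y):
        if row and abs(cell.y - row_y) > row_tolerance:
            rows.append(row)
            row = []
        if not row:
            row_y = cell.y  # y of the first cell anchors the row
        row.append(cell)
    if row:
        rows.append(row)

    head = prev_head = None
    for row in rows:
        row.sort(key=lambda c: c.x)
        for left, right in zip(row, row[1:]):
            left.right = right
        if prev_head is None:
            head = row[0]
        else:
            prev_head.down = row[0]
        prev_head = row[0]
    return head


def to_rows(head):
    """Walk the linked structure back into a list of rows."""
    rows = []
    while head:
        row, cell = [], head
        while cell:
            row.append(cell.text)
            cell = cell.right
        rows.append(row)
        head = head.down
    return rows


def to_csv(head):
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(to_rows(head))
    return buf.getvalue()


# OCR boxes usually arrive in arbitrary order and with slightly uneven y values
boxes = [Cell("Qty", 120, 12), Cell("Item", 10, 10),
         Cell("2", 121, 41), Cell("Widget", 11, 40)]
table = link_cells(boxes)
print(to_csv(table))
```

The same list of rows could be passed to a DataFrame constructor or rendered as an HTML table; CSV is used here only to keep the sketch dependency-free.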

🔍 Why Is This Important?

Maintains Structural Integrity: The tool ensures tables retain their format, significantly improving downstream processing accuracy.
Adaptable to Complex Cases: It can handle basic to moderately complex tables and provides a foundation for applying custom post-processing logic.
Cost-Effective: Unlike Vision LLMs, this solution uses lightweight open-source tools, making it highly affordable and efficient.

💡 How Can You Use It?

Directly use the structured output for simple workflows.
Feed the output into an LLM to improve the accuracy of information extraction, as the structural context is retained.
Replace the open-source components (e.g., PaddleOCR) with advanced tools for higher precision.

🔗 Get Started Today

This project is completely open-source and available on GitHub! It’s easy to set up and comes with detailed instructions for implementation.

👉 Explore the Repository on GitHub

If you’re looking for a scalable, reliable, and accurate solution to extract tabular data from documents, this tool is for you. Let me know your thoughts, and feel free to contribute to the project!

Improve Your API's Response Time: Use Threading!

Sudhanshu — Wed, 06 Mar 2024 16:31:16 +0000

When to use?

In general, threading is a valuable technique to employ in scenarios where your program is not primarily engaged in computational tasks but rather is waiting for some external event, such as

I/O operations,
network requests,
user input,
waiting for resources to become available

How to apply?

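There are several ways to apply threading in Python, for example:
a. the threading library
b. concurrent.futures
c. asyncio and async functions (cooperative concurrency on a single thread rather than real threads, but suited to the same waiting-heavy workloads)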
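
For option (c), a minimal sketch of the asyncio style is shown here. The fetch coroutine and its delays are made up for illustration; asyncio.sleep stands in for a real network call.

```python
import asyncio


async def fetch(name, delay):
    # asyncio.sleep stands in for awaiting a network response
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # gather runs the three coroutines concurrently and keeps their order
    return await asyncio.gather(
        fetch("a", 0.2), fetch("b", 0.2), fetch("c", 0.2)
    )


results = asyncio.run(main())
print(results)
```

All three waits overlap on a single thread, which is why this style fits the same I/O-bound situations listed earlier.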

General Approach

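import threading

def download_from_cloud(url, filename):
    # Placeholder: put the actual download logic here
    pass

# Example inputs (placeholders); replace with your own URLs and filenames
urls = ["https://example.com/a", "https://example.com/b"]
filenames = ["a.bin", "b.bin"]

threads = []
for url, filename in zip(urls, filenames):
    # Create and start a thread for each download
    thread = threading.Thread(target=download_from_cloud, args=(url, filename))
    threads.append(thread)
    thread.start()

# At this point all the downloads are running concurrently

# Wait for all threads to finish (optional)
for thread in threads:
    thread.join()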

Calling thread.join() on each thread makes the program wait for all the threads to finish before continuing with the rest of the code. You can also choose to move on without waiting, depending on whether the rest of your program depends on the work done in those threads.
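
Plain threading.Thread objects do not hand back return values. When the results are needed, concurrent.futures is a convenient alternative. The sketch below assumes a hypothetical fake_download function that sleeps instead of touching the network, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, filename):
    """Stand-in for a network call: sleeps instead of downloading."""
    time.sleep(0.2)
    return f"{url} -> {filename}"


urls = [f"https://example.com/file{i}" for i in range(5)]
filenames = [f"file{i}.bin" for i in range(5)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() pairs each url with its filename and preserves input order
    results = list(pool.map(fake_download, urls, filenames))
elapsed = time.perf_counter() - start

print(results[0])
print(f"{len(results)} downloads finished in {elapsed:.2f}s")
```

Run one after another, the five calls would take about a second in total; because the waits overlap, the pool finishes in roughly the time of a single call.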

Below is a Helper Module

Paste the code below into a .py file

from concurrent.futures import ThreadPoolExecutor
import time

class ConcurrentThreadExecutor:

    def __init__(self):
        pass

    def execute_parallel(self, tasks, max_threads):
        with ThreadPoolExecutor(max_workers=max_threads) as executor:
            results = list(executor.map(lambda task: self.execute_task(*task), tasks))
        return results

    def execute_task(self, task_name, task_function, task_args):
        t1 = time.time()
        print(f"Started ThreadTask : {task_name}")
        result = self.run_function(task_function, *task_args)
        print(f'ThreadTask : "{task_name}" took : {round(time.time() - t1,2)}sec')
        return (task_name, result)

    def run_function(self, func, *args, **kwargs):
        return func(*args, **kwargs)

class MultiThreadExecutor(ConcurrentThreadExecutor):

    def __init__(self, debug: bool = True) -> None:
        super().__init__()
        # Default worker count that ThreadPoolExecutor picks for this machine.
        # Note: _max_workers is a private attribute and may change between Python versions.
        with ThreadPoolExecutor() as probe:
            self.max_threads = probe._max_workers
        if debug:
            print("Initialized Multi-Thread Executor, Number of threads: " + str(self.max_threads))
        self.debug = debug

    def execute_tasks(self, tasks: list, no_of_cores: int = None):
        """
         ** Note: input must follow this format **
            tasks : [ ( TASK_NAME, FUNCTION, ARGS ), ]
        Returns a dict mapping each TASK_NAME to its function's return value.
        """
        if not tasks:
            # Nothing to run; avoids creating a pool with zero workers
            return {}
        num_threads = min(self.max_threads, len(tasks)) if no_of_cores is None else min(no_of_cores, self.max_threads)
        if self.debug:
            print("Number of threads being used: " + str(num_threads))
        results = self.execute_parallel(tasks, num_threads)

        return dict(results)

Now just call the MultiThreadExecutor to run multiple functions concurrently as below


import time  # required if this code lives in a separate file from the helper module

def func1(name):
    time.sleep(5)  # simulates an I/O-bound wait
    print(f"Completed : {name}")


def func2(name):
    [i*i for i in range(100000)]  # simulates a short computation
    print(f"Completed : {name}")

# Create the MultiThreadExecutor object
execute_concurrent = MultiThreadExecutor(debug = False)

# Pass all the functions to be executed concurrently

execute_concurrent.execute_tasks([
    ('func1', func1, ('Function 1',)),
    ('func2', func2, ('Function 2',))
])

print('Ended!')

Unlocking Efficiency with Threading:

Threading is a powerful tool to enhance responsiveness in Python applications. By allowing your program to handle waiting tasks concurrently, it can deliver a smoother user experience.

Hope you found it informative!

Search Word BOT (Tesseract | Tkinter | Selenium)

Sudhanshu — Sun, 03 Mar 2024 08:21:11 +0000

This bot takes input from a website (URL provided) in the form of an image and converts it into text using Pytesseract, a Python library (which was the most challenging part).

Once the text preprocessing is complete, it searches for words in all eight possible directions and displays the final output using Tkinter.
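
The eight-direction search can be sketched as follows. This is an illustrative version, not the bot's actual code: find_word and the sample grid are hypothetical, and the sketch assumes the OCR step has already produced a rectangular grid of letters.

```python
# (row step, column step) for the eight neighbouring directions
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]


def find_word(grid, word):
    """Return (row, col, direction) of the first match, or None."""
    if not word:
        return None
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in DIRECTIONS:
                # Skip directions where the word would run off the grid
                end_r = r + dr * (len(word) - 1)
                end_c = c + dc * (len(word) - 1)
                if not (0 <= end_r < rows and 0 <= end_c < cols):
                    continue
                if all(grid[r + dr * i][c + dc * i] == ch
                       for i, ch in enumerate(word)):
                    return r, c, (dr, dc)
    return None


grid = ["CATS",
        "XOXA",
        "DOGT",
        "TACX"]

for word in ("COG", "TAC", "ZZZ"):
    print(word, find_word(grid, word))
```

Each hit gives the starting cell and the direction, which is enough to highlight the word in a Tkinter canvas or any other display.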

Stack Used : Tkinter, Pytesseract, Selenium

Find more such projects at : Sudhanshu1304 GitHub

Multicollinearity

Sudhanshu — Thu, 29 Feb 2024 17:52:08 +0000

The what, why, and how of solving multicollinearity.

Do you know that Multicollinearity has almost no effect on the final accuracy of the machine learning model? So why is multicollinearity a problem, and what is it in the first place?

In this article, we will learn answers to such questions.

Description —

The assumption in regression-based models.
What is Collinearity?
About Multicollinearity.
Why is Multicollinearity a problem?
How to remove Multicollinearity?
1. Using Correlation and its Code Implementation
2. Using VIF and its Code Implementation
3. A Python library that automates the above methods.

Assumption —

It is important to know that one of the assumptions of regression-based models is:

No or little multicollinearity, i.e. the input features have no correlation among themselves; in other words, they are independent of each other.

But this is generally not the case: real datasets often contain features that are significantly correlated with each other, which leads to multicollinearity.

Let’s quickly see what collinearity is.

What is Collinearity?

Collinearity, or correlation, is a statistical measure of the extent to which two variables move together. In other words, it is simply a number (positive or negative) that indicates how two features interact: whether, when one increases, the other increases, decreases, or changes with no clear pattern.

The correlation could be positive or negative. A positive correlation indicates that the variables increase or decrease together. A negative correlation indicates that if one variable increases, the other decreases, and vice versa.

About Multicollinearity

It is important to understand which factors drive an outcome, such as the growth of a business, and one way to do that is to read the equation learned by our machine learning model. Let’s take an example:

No. of COVID-19 cases = 4*Area + intercept (just an assumption)

Let’s say our model gave us the above equation. It states that every unit increase in Area adds 4 cases, so if the intercept is 0, the number of cases is simply 4 times the Area. Knowing this factor as accurately as possible is very important, because we rely on it to decide how many vaccines and hospital beds to produce daily.

Why is Multicollinearity a problem?

So what does multicollinearity have to do with the above example?

Due to multicollinearity, we may get different coefficients for the same factors, leading to wrong interpretations that could have serious effects. For example, we may get a factor of 2*Area instead of 4*Area, which would lead to a shortage of the beds and vaccines produced daily and hence to an increase in deaths per day.

In general, we can say that —

Multicollinearity has a great negative impact on these coefficients and could lead to a wrong inference.
Technically it also affects the p-values which again affects the feature selection process.

How to Remove Multicollinearity?

In general, there are two different methods to remove Multicollinearity —

Using Correlations

Using VIF (variance inflation factor)

1. Using Correlation

As a rule of thumb, a correlation greater than 0.7 (in absolute value) between two features indicates the presence of multicollinearity, and we should drop one of the two features to solve it.

Code —
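
A minimal sketch of the correlation approach in plain Python. The column names, the made-up values, and the pearson and drop_correlated helpers are illustrative assumptions, not the article's original listing. With pandas, df.corr() would give the same pairwise correlations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(columns, threshold=0.7):
    """Keep a feature only if |r| <= threshold against every feature kept so far."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold
               for other in kept.values()):
            kept[name] = values
    return kept


# Made-up data: population is roughly 2 * area, rainfall is unrelated
features = {
    "area":       [1, 2, 3, 4, 5, 6, 7, 8],
    "population": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1],
    "rainfall":   [5, 3, 6, 2, 7, 1, 4, 8],
}

kept = drop_correlated(features)
print(list(kept))
```

Which feature of a correlated pair gets dropped depends on column order here; in practice that choice is usually guided by domain knowledge.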

2. Variance Inflation Factor (VIF)

There is a simple test to identify multicollinearity called VIF (variance inflation factor). VIF starts at 1 and has no upper bound. A VIF between 1 and 5 is considered moderate, but features with a VIF above 5 should be removed.

VIF = 1 / (1 - R²)

Here R² is the coefficient of determination obtained by regressing one predictor on all the other predictors. It indicates how well that predictor can be explained by the rest.

A VIF of 10 means that the variance of the predictor’s coefficient is 10 times what it would be if there were no collinearity.

Code —
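
A minimal sketch of the VIF computation in plain Python, following VIF = 1 / (1 - R²) with R² taken from regressing each feature on the others. The data and the solve, r_squared and vif helpers are illustrative assumptions, not the article's original listing. Note that a perfectly collinear feature would make R² equal to 1 and cause a division by zero.

```python
from statistics import mean


def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(y, xs):
    """R^2 of a least-squares fit of y on the columns in xs (with intercept)."""
    yc = [v - mean(y) for v in y]
    xc = [[v - mean(col) for v in col] for col in xs]
    # Normal equations on centered data: (X'X) beta = X'y
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in xc] for ci in xc]
    xty = [sum(p * q for p, q in zip(ci, yc)) for ci in xc]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, xc))
              for i in range(len(y))]
    ss_res = sum((a - f) ** 2 for a, f in zip(yc, fitted))
    ss_tot = sum(a ** 2 for a in yc)
    return 1 - ss_res / ss_tot


def vif(columns):
    """Return {feature name: VIF} for a dict of equally long numeric columns."""
    result = {}
    for name, values in columns.items():
        others = [col for other, col in columns.items() if other != name]
        result[name] = 1 / (1 - r_squared(values, others))
    return result


# Made-up data: population is roughly 2 * area, rainfall is unrelated
features = {
    "area":       [1, 2, 3, 4, 5, 6, 7, 8],
    "population": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1],
    "rainfall":   [5, 3, 6, 2, 7, 1, 4, 8],
}

vifs = vif(features)
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.2f}")
```

With this data, area and population come out far above the threshold of 5 while rainfall stays close to 1, so one of area or population would be dropped.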

Library Support —

You can use the Python library ModelAuto to solve multicollinearity easily.

It has a built-in module to remove multicollinearity via both methods.

pip install ModelAuto

from ModelAuto.Multicollinearity import handel_Multico_Corr

OR

from ModelAuto.Multicollinearity import handel_Multico_VIF

Conclusion —

Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness of fit. So if our primary goal is just to make predictions, we don’t need to reduce multicollinearity.
Multicollinearity mainly affects linear models such as Linear Regression and Logistic Regression. It has much less impact on non-linear algorithms such as KNN and Decision Trees.

If you found the information helpful consider leaving a 👏😏.

Dockerizing Django, A Step-by-Step Guide

Sudhanshu — Mon, 19 Feb 2024 10:14:35 +0000

Docker is a popular containerization tool that allows you to package your applications and their dependencies into lightweight, portable containers. Docker makes it easy to run your applications in a consistent environment, regardless of the host operating system.

Let's see what we will learn.

1. Creating a quick Django project
2. Dockerizing the Django app using two methods:
   a. Using only a Dockerfile
   b. Using Docker Compose (.yml file)
3. Understanding the syntax of a Dockerfile
4. Understanding the syntax of a docker-compose file
5. Making real-time changes in the container as you change your local files (volume mount)

Step 1: Create the Django project

First, let’s create a Django project named dockertest using the following command:

django-admin startproject dockertest

This will create the basic file structure for the Django project, including the manage.py file, which is used to manage the project. The file structure should look like this:

dockertest
|
|-- dockertest
|    |-- __init__.py
|    |-- asgi.py
|    |-- settings.py
|    |-- urls.py
|    |-- wsgi.py
|
|-- manage.py
|-- db.sqlite3

Now navigate to the outer dockertest folder (the one containing manage.py) in your command prompt (CMD/Bash).

To check that the application is running properly, run the following command in the terminal:

python manage.py runserver

Opening http://127.0.0.1:8000/ in your browser should show Django’s default welcome page.

If you can see that page, you have successfully created the basic Django application.

Step 2: Create the requirements.txt file

Next, we need to create a requirements.txt file that specifies the Python libraries and dependencies required to run the Django project. Run the following command to create it:

pip freeze > requirements.txt

This will create a requirements.txt file in the same directory as your Django project. The file structure should now look like this:

dockertest
|
|-- dockertest
|    |-- __init__.py
|    |-- asgi.py
|    |-- settings.py
|    |-- urls.py
|    |-- wsgi.py
|
|-- manage.py
|-- db.sqlite3
|-- requirements.txt

Step 3: Dockerizing

There are two methods for Dockerizing the Django project: using only a Dockerfile, or using Docker Compose.

Method A: Using only a Dockerfile

This method is suitable for smaller applications. We need to create a Dockerfile that specifies the steps needed to build a Docker image for the Django project, so that the file structure looks like this:

dockertest
|
|-- dockertest
|    |-- __init__.py
|    |-- asgi.py
|    |-- settings.py
|    |-- urls.py
|    |-- wsgi.py
|
|-- manage.py
|-- db.sqlite3
|-- requirements.txt
|-- Dockerfile

Add the following content in the docker file:
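
A minimal Dockerfile consistent with the line-by-line description that follows, assuming requirements.txt sits next to manage.py as in the tree above:

```dockerfile
# Minimal Python base image (Debian Buster)
FROM python:3.8-slim-buster

# All following commands run inside /app
WORKDIR /app

# Copy the dependency list first and install it
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the rest of the Django project into /app
COPY . .

# Start the development server, reachable from outside the container
CMD ["python3", "manage.py", "runserver", "0.0.0.0:8000"]
```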

This Dockerfile specifies the following:

The base image is python:3.8-slim-buster, a minimal Python image based on Debian Buster. Other base images are available on Docker Hub.
The working directory is set to /app. This creates a folder in which we can place all our project files, and all the commands below will run in this folder by default.
The first COPY directive copies the requirements.txt file from the host into the WORKDIR specified above.
The RUN directive installs the required libraries using pip.
The COPY . . command copies the rest of the Django project files from the host to the image. The first . represents the source directory (i.e. the current directory on the host), and the second . represents the destination directory (i.e. the working directory specified by the WORKDIR directive in the Dockerfile). This command copies all the Django project files into the /app directory in the image.
CMD ["python3", "manage.py", "runserver", "0.0.0.0:8000"] tells Docker how to start the Django application inside the container.

Method B: Using Docker Compose

This method is suitable for larger or more complex applications that involve multiple services. When using the docker-compose file we can make small changes in the docker file created above.

First, create a docker-compose.yml file

dockertest
|
|-- dockertest
|    |-- __init__.py
|    |-- asgi.py
|    |-- settings.py
|    |-- urls.py
|    |-- wsgi.py
|
|-- manage.py
|-- db.sqlite3
|-- requirements.txt
|-- Dockerfile
|-- docker-compose.yml

And add the below content
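
A docker-compose.yml consistent with the description that follows. The version value is an assumption, since the text does not state which file-format version is used; everything else is taken from the description.

```yaml
version: "3"  # assumed value; the text does not state the version

services:
  myapp:
    build: .                            # build from the Dockerfile in this directory
    image: docker_image_1               # name given to the built image
    container_name: docker_container_1
    volumes:
      - .:/app                          # mount the current directory into the container
    ports:
      - "8000:8000"                     # host port 8000 -> container port 8000
    command: python manage.py runserver 0.0.0.0:8000
```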

This docker-compose.yml file specifies the following:

version: In a docker-compose.yml file, the version field specifies the version of the Docker Compose file format. Docker Compose uses this field to determine how to interpret the rest of the file.
A service named myapp is defined, with the following configuration:
- The service is built using the Dockerfile in the current directory.
- The current directory is mounted as a volume in the container.
- Port 8000 on the host is mapped to port 8000 in the container.
- The image is given the name docker_image_1 and the container is given the name docker_container_1.
- The command to run the Django server is specified as python manage.py runserver 0.0.0.0:8000.

Step 4: Running the Dockerized Django Project

Now that you have Dockerized your Django project, you can use the following commands to run and manage your containers.

Before moving ahead let’s quickly learn about Volume in docker.

Solving the Issue of Persistent Changes in Volumes

One issue with Docker is that changes made to files on the host are not reflected in the container unless the image is rebuilt. If you change files on your local machine and want to see those changes in the container, you have to rebuild every time, which is time-consuming and inconvenient when you are making frequent code changes. To solve this, you can use a volume to mount the host directory into the container. Changes to your local files are then reflected in the container immediately, so you don’t have to rebuild after every change, which makes it easier to work on your Django project.

Method A: Using a Dockerfile

Build the Docker image using the following command:

docker build -t MyDockerImage .

Replace MyDockerImage with the name you want to give your Docker image (Docker requires image names to be lowercase, e.g. mydockerimage). The . at the end sets the build context to the current directory, which is also where Docker looks for the Dockerfile by default.

Run the Docker image to create a container using the following command:

docker run -it --rm -p 8080:8000 -v $(pwd):/app MyDockerImage

This command creates an interactive terminal (-it), deletes the container after it is stopped (--rm), maps port 8080 on your host to port 8000 in the container (-p 8080:8000), and mounts the current directory as a volume (-v $(pwd):/app) to enable real-time updates between your local files and the container. Replace MyDockerImage with the name of your Docker image.

If you are on Windows (CMD), use "%cd%" instead of $(pwd). Both return the path to the current directory.

Method B: Using Docker Compose

If you are using a docker-compose file, you can build the image with:

docker-compose build

This will build the images for all the services defined in the file.

Note that the docker-compose build command only builds the images for the services; it does not start the containers. To start the containers, use the docker-compose up command.

Some useful commands

docker images: This command lists all the Docker images on your system. It displays the image ID, the repository and tag, the image size, and the creation date.

docker run -it Image_Name bash: This command runs the specified Docker image and starts a bash shell inside the container. The -it flag creates an interactive terminal and allows you to input commands directly into the container.

docker rm: This command removes one or more Docker containers. You can specify the container ID or name as an argument to remove a specific container.

docker rmi: This command removes one or more Docker images. You can specify the image ID or the repository and tag as an argument to remove a specific image.

docker ps -a: This command lists all the Docker containers on your system, both running and stopped. The -a flag shows all containers, not just the running ones. The output includes the container ID, the image used to create the container, the command being run, the creation time, the status, and the container name.

To look at all the code visit my GitHub

https://github.com/Sudhanshu1304/Dockerizing-Django-Application

Conclusion

Congratulations on completing this tutorial on Dockerizing a Django project! By following the steps outlined in this tutorial, you have successfully set up Docker for your Django project.

I hope you found this tutorial on Dockerizing a Django project helpful and informative. Please consider giving it a clap👏 and following me on Medium.

Thank you for reading this tutorial!

Autoencoders

Sudhanshu — Thu, 05 Aug 2021 06:09:41 +0000

If you are new to deep learning and would love to understand neural network architectures, or would like to tinker with CNNs / ANNs, then autoencoders are a great place to start. In this post, we will go through a quick introduction to autoencoders.

Visit — machine.learns — Here you can Visualize the working of Autoencoders in different configurations and more.

So before directly jumping into the technical details, let’s first see some of its applications.

Noise Reduction: Autoencoders can be used for Reducing Noise. Noise could be in an Image or in Sound.
Image Compression: With an autoencoder, an image of 784 pixels can be compressed into a representation of just 64 values, from which the original image can then be approximately reconstructed.
Converting a Black and White Image to a Colored Image.
Removing Watermarks from an image.
Fraud Detection, e.g. credit card fraud detection.

What is an Autoencoder?

An autoencoder is an artificial neural network technique that follows an unsupervised learning approach. It mainly helps in achieving representation learning, i.e. we come up with an architecture that forces the model to learn a compressed representation of the input data.

This compressed representation is also known as a bottleneck or latent representation. The bottleneck basically learns the features of the data, i.e. it learns to represent a particular data point based on a certain number of features. For example, if the input is a digit dataset, it may learn features like the number of horizontal and vertical edges, and based on these features it identifies each data point.

[Figure: plot of the latent space (bottleneck)]

In the plot of the latent space, we can see clusters of similar colors. Each cluster represents a similar type of object; for example, the blue dots represent asphalt.

Note that each dot represents a single data point, and its color indicates the type of object.

So in total, we can divide an Autoencoder into three parts

Encoder: Part of the Architecture which compresses or forces the model to capture the important information from the data.
Bottleneck(latent representation): This is the compressed representation of the original data.
Decoder: This Part tries to reconstruct the original image from latent representation.
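
The three parts can be sketched with a tiny linear autoencoder in plain Python. Everything here is an illustrative assumption: the sizes (4 inputs, 2 bottleneck values, where an image example would use something like 784 and 64), the toy data, the learning rate, and the helper names. A real model would normally use a deep learning framework and non-linear layers.

```python
import random

random.seed(0)  # reproducible illustrative weights

INPUT_DIM, LATENT_DIM = 4, 2  # toy sizes for illustration


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]


encoder = init(LATENT_DIM, INPUT_DIM)  # input -> bottleneck
decoder = init(INPUT_DIM, LATENT_DIM)  # bottleneck -> reconstruction


def reconstruct(x):
    z = matvec(encoder, x)         # compressed (latent) representation
    return z, matvec(decoder, z)   # attempt to rebuild the input


def mean_loss(data):
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(x)
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)


# Toy data: every sample is [a, b, a, b], so two numbers describe it fully
data = [[a / 4, b / 4, a / 4, b / 4] for a in range(5) for b in range(5)]

initial_loss = mean_loss(data)
lr = 0.01
for epoch in range(300):
    for x in data:
        z, x_hat = reconstruct(x)
        err = [h - t for h, t in zip(x_hat, x)]
        # Gradient flowing back into the bottleneck: decoder^T @ err
        grad_z = [sum(decoder[i][j] * err[i] for i in range(INPUT_DIM))
                  for j in range(LATENT_DIM)]
        for i in range(INPUT_DIM):
            for j in range(LATENT_DIM):
                decoder[i][j] -= lr * 2 * err[i] * z[j]
        for j in range(LATENT_DIM):
            for k in range(INPUT_DIM):
                encoder[j][k] -= lr * 2 * grad_z[j] * x[k]
final_loss = mean_loss(data)

print(f"reconstruction loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

No labels are used anywhere: the input itself is the training target, which is what makes the approach unsupervised.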

This concept of Autoencoders can be applied using architectures like CNNs and LSTMs.

Conclusion

Despite being a very basic architecture, it helps a lot in understanding the various concepts of neural networks and their architectures. So I highly encourage you to learn the implementation of autoencoders too.

I hope you learned something valuable.

Visit: machine.learns 😄