Milo for Documind

Extract Structured Data from PDFs Using Documind

Extracting structured data from PDFs and other document types is a common challenge. Instead of relying on unreliable OCR tools or copying and pasting data by hand, you can use Documind's API to automate the extraction.

In this guide, you’ll learn how to:

  • Set up authentication to securely access the API
  • Create an extraction job for processing
  • Poll for job completion and retrieve the extracted data

To make this practical, we’ll extract key details from this invoice, including:

  • Invoice number
  • Invoice date
  • Due date
  • Supplier name
  • Supplier address
  • Supplier email
  • Supplier phone
  • Supplier VAT number
  • Payment method
  • Bank name
  • IBAN
  • SWIFT/BIC code
  • Payment reference
  • Subtotal
  • Tax amount
  • Total amount
  • Item name
  • Item SKU
  • Item quantity
  • Item unit price
  • Item discount
  • Item total price

Once the data is extracted, it can be stored in a database for tracking invoices and payments or sent directly to accounting platforms like Xero, QuickBooks, and other financial tools. This guide will take you through the complete process so you can integrate Documind’s API into your own applications.

Step 1: Setting Up Authentication

Before making requests to the Documind API, you'll need to install the necessary dependencies and obtain your API key for authentication.

1.1 Install Required Packages

Ensure you have Node.js installed, then install axios (for making HTTP requests) and dotenv (for securely storing API keys):

npm install axios dotenv


1.2 Get Your API Key

  1. Sign in to the Documind Dashboard
  2. Navigate to the Settings page
  3. Copy your Secret API Key

To keep your key secure, create a .env file in your project directory and store the key there:

DOCUMIND_API_KEY=your-secret-api-key


Documind is still in private beta; sign up here to get access.
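
If the .env file is missing or the variable name is misspelled, process.env.DOCUMIND_API_KEY is simply undefined, and the problem only shows up later as a rejected request. A minimal sketch of a fail-fast check is shown here; requireEnv is a hypothetical helper name, not part of dotenv or Documind:

```javascript
// Hypothetical helper (not part of dotenv or Documind): return a required
// environment variable, or throw a descriptive error if it is missing or blank.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value.trim();
}
```

Calling requireEnv('DOCUMIND_API_KEY') at startup, instead of reading process.env directly, would surface a configuration mistake before any request is sent.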

1.3 Create an Axios Instance

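Now, set up an axios instance with the API's base URL and authentication headers. The snippets in this guide use ES module syntax (import), so set "type": "module" in your package.json or save the file with an .mjs extension: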

import 'dotenv/config';
import axios from 'axios';

// Load API key from environment variables
const API_KEY = process.env.DOCUMIND_API_KEY;

const documindAPI = axios.create({
  baseURL: 'https://api.documind.xyz',
  headers: {
    'Authorization': `Bearer ${API_KEY}`,
    'Content-Type': 'application/json'
  }
});


Step 2: Creating an Extraction Job

To extract data from the invoice, you need to create an extraction job. This involves sending the document's URL to Documind along with a schema that defines the structure of the data you want to extract.

2.1 Defining the Schema

The schema acts as a blueprint, specifying what information should be extracted from the document. It outlines the key fields and their expected data types.

For a detailed breakdown of schema definitions and best practices, check out this guide. You can also find additional schema examples in the documentation.

Here’s how you can define a schema for an invoice:

const schema = [
  {
    "name": "invoiceNumber",
    "type": "string",
    "description": "Unique identifier for the invoice"
  },
  {
    "name": "invoiceDate",
    "type": "string",
    "description": "Date when the invoice was issued"
  },
  {
    "name": "dueDate",
    "type": "string",
    "description": "Payment due date for the invoice"
  },
  {
    "name": "supplier",
    "type": "object",
    "description": "Details of the supplier issuing the invoice",
    "children": [
      {
        "name": "name",
        "type": "string",
        "description": "Supplier's name"
      },
      {
        "name": "email",
        "type": "string",
        "description": "Supplier's email address"
      },
      {
        "name": "vatNumber",
        "type": "string",
        "description": "Supplier's VAT or Tax ID"
      }
    ]
  },
  {
    "name": "payment",
    "type": "object",
    "description": "Payment details for the invoice",
    "children": [
      {
        "name": "paymentMethod",
        "type": "enum",
        "description": "Payment method",
        "values": ["Bank Transfer", "Credit Card", "Cheque"]
      },
      {
        "name": "bankDetails",
        "type": "object",
        "description": "Bank details for wire transfers",
        "children": [
          {
            "name": "bankName",
            "type": "string",
            "description": "Name of the bank"
          },
          {
            "name": "iban",
            "type": "string",
            "description": "International Bank Account Number (IBAN)"
          },
          {
            "name": "swift",
            "type": "string",
            "description": "SWIFT/BIC code for international transfers"
          },
          {
            "name": "reference",
            "type": "string",
            "description": "Reference text for the payment"
          }
        ]
      }
    ]
  },
  {
    "name": "financialSummary",
    "type": "object",
    "description": "Breakdown of financial details",
    "children": [
      {
        "name": "subtotal",
        "type": "number",
        "description": "Total amount before taxes"
      },
      {
        "name": "tax",
        "type": "number",
        "description": "Total tax amount applied"
      },
      {
        "name": "totalAmount",
        "type": "number",
        "description": "Final total amount due after taxes"
      }
    ]
  },
  {
    "name": "items",
    "type": "array",
    "description": "List of purchased items in the invoice",
    "children": [
      {
        "name": "name",
        "type": "string",
        "description": "Name of the item"
      },
      {
        "name": "sku",
        "type": "string",
        "description": "Stock Keeping Unit (SKU) identifier"
      },
      {
        "name": "quantity",
        "type": "number",
        "description": "Number of units purchased"
      },
      {
        "name": "unitPrice",
        "type": "number",
        "description": "Price per unit"
      },
      {
        "name": "discount",
        "type": "number",
        "description": "Discount applied per unit"
      },
      {
        "name": "totalPrice",
        "type": "number",
        "description": "Total price for the item"
      }
    ]
  }
]

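A nested schema like this is easy to get subtly wrong, for example an enum without its values or an object without children. The following is a rough sketch of a structural check run before the schema is sent. The rules are inferred only from the example in this guide, not from Documind's documentation, and validateSchema is a hypothetical helper name:

```javascript
// Sketch of a structural check for a schema written in the style shown above.
// The rules are inferred from this guide's example only (assumption); the real
// API may accept shapes that this check rejects.
const CONTAINER_TYPES = ['object', 'array'];

function validateSchema(fields, path = '') {
  if (!Array.isArray(fields)) {
    return [`${path || 'schema'}: expected an array of field definitions`];
  }
  const errors = [];
  const seen = new Set();
  for (const field of fields) {
    if (field === null || typeof field !== 'object') {
      errors.push(`${path || 'schema'}: every field must be an object`);
      continue;
    }
    const label = path ? `${path}.${field.name}` : String(field.name);
    if (typeof field.name !== 'string' || field.name === '') {
      errors.push(`${label}: "name" must be a non-empty string`);
    } else if (seen.has(field.name)) {
      errors.push(`${label}: duplicate field name`);
    }
    seen.add(field.name);
    if (typeof field.type !== 'string') {
      errors.push(`${label}: "type" must be a string`);
    }
    if (field.type === 'enum' && (!Array.isArray(field.values) || field.values.length === 0)) {
      errors.push(`${label}: enum fields need a non-empty "values" array`);
    }
    if (CONTAINER_TYPES.includes(field.type)) {
      if (!Array.isArray(field.children) || field.children.length === 0) {
        errors.push(`${label}: ${field.type} fields need a non-empty "children" array`);
      } else {
        // Recurse so that nested fields are reported with a dotted path
        errors.push(...validateSchema(field.children, label));
      }
    }
  }
  return errors;
}
```

The function returns a list of human-readable problems, so an empty array means the schema passed every check in this sketch.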

2.2 Send the PDF to Documind

Now, let's create the extraction job by sending the document URL and the schema above:

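async function createJob(file) {
  try {
    // Send the document URL and the schema to start an extraction job
    const response = await documindAPI.post('/run-job', {
      file,
      schema
    });

    return response.data.id; // Store Job ID for polling
  } catch (error) {
    // Log the failure; the function then returns undefined, which the caller must check
    console.error('Error creating extraction job:', error.response ? error.response.data : error.message);
  }
}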


What Happens Here?

  • We send a POST request to /run-job
  • The file URL points to the document to be processed
  • The schema defines the expected structure of extracted data
  • The API returns a Job ID, which we use to check the status and get the results
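
Because the job is created from a URL, a malformed or local path would only fail after a round trip to the API. A small sketch of a pre-flight check is shown here. That the file field must be an absolute http(s) URL is an assumption based on this guide, and assertDocumentUrl is a hypothetical helper name:

```javascript
// Sketch: reject anything that is not an absolute http(s) URL before creating
// a job. The http(s)-only rule is an assumption, not a documented requirement.
function assertDocumentUrl(file) {
  let url;
  try {
    url = new URL(file); // throws a TypeError for relative paths and garbage input
  } catch {
    throw new Error(`Not a valid absolute URL: ${file}`);
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    throw new Error(`Unsupported URL scheme "${url.protocol}" in: ${file}`);
  }
  return url.href;
}
```

Calling this on the file argument before posting to /run-job turns an obvious input mistake into an immediate, readable error.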

Step 3: Polling for Job Completion

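Extraction runs asynchronously, so poll the job endpoint until the status is COMPLETED or FAILED. By default, the function below checks up to five times at five-second intervals.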

async function pollJob(jobId, maxRetries = 5, delay = 5000) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    let data;
    try {
      ({ data } = await documindAPI.get(`/job/${jobId}`));
    } catch (error) {
      // Network or HTTP error: log it and retry on the next attempt
      console.error("Error retrieving job status:", error.response?.data || error.message);
    }

    if (data?.status === "COMPLETED") {
      return data.result;
    }

    // A failed job will not recover, so stop polling immediately
    if (data?.status === "FAILED") {
      throw new Error(`Extraction failed for Job ID: ${jobId}`);
    }

    // Wait before the next attempt, but not after the last one
    if (attempt < maxRetries) {
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  // The job never reached a final state within the allowed attempts
  throw new Error(`Max retries reached. Job ID: ${jobId}`);
}
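
Five checks at a fixed five-second interval may be too short for large documents and needlessly frequent for small ones. One common alternative is exponential backoff, sketched here. The function names, default values, and the fetchStatus callback are all hypothetical, and only the COMPLETED and FAILED statuses are taken from this guide. Unlike pollJob, this sketch lets errors from fetchStatus propagate instead of retrying them:

```javascript
// Delay before the next attempt: base * 2^(attempt - 1), capped at maxDelayMs.
function backoffDelay(attempt, baseDelayMs = 2000, maxDelayMs = 30000) {
  return Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs);
}

// Sketch of a poller with exponential backoff. fetchStatus is a caller-supplied
// async function (hypothetical) that resolves to { status, result }, for example:
//   () => documindAPI.get(`/job/${jobId}`).then(res => res.data)
async function pollWithBackoff(fetchStatus, options = {}) {
  const {
    maxAttempts = 8,
    baseDelayMs = 2000,
    maxDelayMs = 30000,
    sleep = ms => new Promise(resolve => setTimeout(resolve, ms)),
  } = options;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await fetchStatus();
    if (job.status === 'COMPLETED') return job.result;
    if (job.status === 'FAILED') throw new Error('Extraction job failed');
    // Sleep only if another attempt will follow
    if (attempt < maxAttempts) {
      await sleep(backoffDelay(attempt, baseDelayMs, maxDelayMs));
    }
  }
  throw new Error(`Job still pending after ${maxAttempts} attempts`);
}
```

With these defaults the waits are 2, 4, 8, 16, 30, 30, and 30 seconds, so the poller gives up after about two minutes. Passing sleep as an option keeps the function easy to test without real delays.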

Step 4: Putting Everything Together

import fs from 'fs';

async function extractData(file) {
  const jobId = await createJob(file);
  if (!jobId) throw new Error("Failed to create extraction job.");

  const result = await pollJob(jobId);

  // You can save the extracted data to a JSON file to see the results
  fs.writeFileSync("invoice.json", JSON.stringify(result, null, 2));
}

// Usage
const file = "<Add your file URL here>";
extractData(file)
  .then(() => console.log("Extraction process completed."))
  .catch(error => console.error("Error:", error));


The Result

Once completed, you should receive structured JSON data like this:

{
  "items": [
    {
      "sku": "SRV-1001",
      "name": "Cloud Server Hosting",
      "discount": 0,
      "quantity": 1,
      "unitPrice": 3000,
      "totalPrice": 3000
    },
    {
      "sku": "LIC-4587",
      "name": "Software Licensing",
      "discount": 50,
      "quantity": 5,
      "unitPrice": 400,
      "totalPrice": 1750
    },
    {
      "sku": "CNS-2003",
      "name": "Consulting Services",
      "discount": 20,
      "quantity": 10,
      "unitPrice": 100,
      "totalPrice": 980
    }
  ],
  "dueDate": "March 1, 2024",
  "payment": {
    "bankDetails": {
      "iban": "GB29HBUK40127612345678",
      "swift": "HBUKGB4B",
      "bankName": "HSBC Bank",
      "reference": "Invoice #INV-2024-019"
    },
    "paymentMethod": "Bank Transfer"
  },
  "supplier": {
    "name": "Tech Solutions",
    "email": "accounts@techsolutions.com",
    "vatNumber": "GB123456789"
  },
  "invoiceDate": "February 1, 2024",
  "invoiceNumber": "INV-2024-019",
  "financialSummary": {
    "tax": 286.5,
    "subtotal": 5730,
    "totalAmount": 6016.5
  }
}
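
Before passing extracted figures to an accounting system, it can be worth checking that they add up. The sketch below cross-checks the line items against the financial summary. The field names match the sample result above, while checkTotals and the 0.01 tolerance are assumptions made for illustration:

```javascript
// Sketch: cross-check the arithmetic of an extracted invoice. Field names match
// the sample result in this guide; the 0.01 tolerance is an arbitrary assumption.
function checkTotals(invoice, tolerance = 0.01) {
  const problems = [];
  const { subtotal, tax, totalAmount } = invoice.financialSummary;
  const itemsSum = invoice.items.reduce((sum, item) => sum + item.totalPrice, 0);

  if (Math.abs(itemsSum - subtotal) > tolerance) {
    problems.push(`Line items sum to ${itemsSum}, but subtotal is ${subtotal}`);
  }
  if (Math.abs(subtotal + tax - totalAmount) > tolerance) {
    problems.push(`subtotal + tax is ${subtotal + tax}, but totalAmount is ${totalAmount}`);
  }
  return problems;
}
```

For the sample result, the line items sum to 3000 + 1750 + 980 = 5730 and 5730 + 286.5 = 6016.5, so the function returns an empty array.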

What Next?

  • Import the extracted data into a database such as PostgreSQL, Firebase, or MongoDB
  • Sync payments with accounting software like Xero or QuickBooks
  • Generate financial reports
  • Trigger payment processing workflows
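
As a starting point for the database option, the nested result can be flattened into one row per line item, with the invoice date normalized to ISO format. This is only a sketch: the column names and both helper names are hypothetical, and the date parser handles only the "Month D, YYYY" style seen in the sample result:

```javascript
const MONTHS = [
  'january', 'february', 'march', 'april', 'may', 'june',
  'july', 'august', 'september', 'october', 'november', 'december',
];

// Convert a date such as "March 1, 2024" to "2024-03-01". Returns null if the
// text does not match that exact "Month D, YYYY" pattern.
function toIsoDate(text) {
  const match = /^([A-Za-z]+) (\d{1,2}), (\d{4})$/.exec(String(text).trim());
  if (!match) return null;
  const month = MONTHS.indexOf(match[1].toLowerCase()) + 1;
  if (month === 0) return null;
  return `${match[3]}-${String(month).padStart(2, '0')}-${match[2].padStart(2, '0')}`;
}

// One flat row per line item, keyed by invoice number (hypothetical column names).
function toLineItemRows(invoice) {
  return invoice.items.map((item, index) => ({
    invoice_number: invoice.invoiceNumber,
    invoice_date: toIsoDate(invoice.invoiceDate),
    line_no: index + 1,
    sku: item.sku,
    name: item.name,
    quantity: item.quantity,
    unit_price: item.unitPrice,
    discount: item.discount,
    total_price: item.totalPrice,
  }));
}
```

Each row can then be passed to whichever database client the application already uses.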

Now that you've seen how to extract structured data from PDFs, you can test it out yourself. Try uploading an invoice in the playground to see how the extraction works in real time. When you're ready to integrate this into your own applications, sign up on Documind to start extracting structured data from PDFs and other documents.
