Ahmed Moussa

Posted on • Originally published at api.aaido.dev

DocProcessor API Tutorial: Automated Document Intelligence and Parsing

The DocProcessor API extracts structured data from invoices, contracts, receipts, and other documents. It uses machine learning to identify and parse key information automatically, which reduces manual data entry and processing time.

What DocProcessor Does

DocProcessor analyzes uploaded documents and returns structured JSON containing extracted text, metadata, and identified entities. The API handles several document types, including invoices, contracts, forms, and receipts, uploaded as files such as PDFs or images. It automatically detects the document type and applies the appropriate parsing logic to extract relevant fields such as dates, amounts, addresses, and custom data points.

Key capabilities include:

  • Text extraction with positional data
  • Entity recognition (dates, amounts, addresses, names)
  • Document classification
  • Table and form field detection
  • Multi-page document processing

Getting Started

First, create an account to obtain your API key:

curl -X POST https://api.aaido.dev/signup \
  -H "Content-Type: application/json" \
  -d '{
    "email": "your-email@example.com",
    "password": "your-secure-password"
  }'

The signup response includes your API key:

{
  "status": "success",
  "api_key": "dp_12345abcdef...",
  "user_id": "user_67890"
}

Store this API key securely as you'll need it for all subsequent requests.
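
One way to keep the key out of source code is to read it from an environment variable at startup. The sketch below assumes the variable is called DOCPROCESSOR_API_KEY (the name used in the CI examples in this tutorial); the helper name `auth_headers` is hypothetical and not part of any official client.

```python
import os


def auth_headers(env=None):
    """Build the Authorization header from the DOCPROCESSOR_API_KEY variable.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    api_key = env.get("DOCPROCESSOR_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DOCPROCESSOR_API_KEY is not set")
    # Same Bearer scheme as the curl examples in this tutorial.
    return {"Authorization": f"Bearer {api_key}"}
```

Failing fast when the variable is missing tends to be easier to debug than a 401 from the API later on.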

Basic Document Processing

Uploading and Processing a Document

The primary endpoint accepts document uploads via multipart form data:

curl -X POST https://api.aaido.dev/v1/products/docprocessor \
  -H "Authorization: Bearer dp_12345abcdef..." \
  -F "file=@invoice.pdf" \
  -F "document_type=invoice" \
  -F "extract_tables=true"

Parameters:

  • file: The document file (required)
  • document_type: Hint for processing type (optional: auto, invoice, contract, receipt, form)
  • extract_tables: Boolean flag for table extraction (default: false)
  • language: Document language code (default: auto)
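
The same upload can be prepared from Python. This is a minimal sketch using only the standard library; it assumes the endpoint and form field names from the curl example above, and the helper names (`encode_multipart`, `build_upload_request`) are illustrative rather than part of an official SDK.

```python
import mimetypes
import urllib.request
import uuid

ENDPOINT = "https://api.aaido.dev/v1/products/docprocessor"


def encode_multipart(fields, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The file part uses the field name "file", as in the curl example.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: {mime}\r\n\r\n'.encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_upload_request(api_key, filename, file_bytes,
                         document_type="auto", extract_tables=False):
    """Prepare (but do not send) the POST request for one document."""
    fields = {
        "document_type": document_type,
        "extract_tables": "true" if extract_tables else "false",
    }
    body, content_type = encode_multipart(fields, filename, file_bytes)
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
    )
```

Passing the returned request to `urllib.request.urlopen` sends it. Keeping request construction separate from sending makes the encoding easy to inspect before any document leaves your machine.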

Response Structure

The API returns a comprehensive JSON response:

{
  "document_id": "doc_abc123",
  "status": "completed",
  "document_type": "invoice",
  "pages": 2,
  "processing_time": 1247,
  "extracted_data": {
    "text": "Invoice #INV-2024-001\nDate: 2024-01-15\nAmount: $1,250.00...",
    "entities": {
      "invoice_number": "INV-2024-001",
      "date": "2024-01-15",
      "total_amount": 1250.00,
      "currency": "USD",
      "vendor_name": "Acme Corp",
      "vendor_address": "123 Main St, City, ST 12345"
    },
    "tables": [
      {
        "page": 1,
        "rows": [
          ["Item", "Quantity", "Price", "Total"],
          ["Web Development", "1", "$1000.00", "$1000.00"],
          ["Consulting", "5 hours", "$50.00", "$250.00"]
        ]
      }
    ],
    "confidence_scores": {
      "overall": 0.94,
      "entities": {
        "invoice_number": 0.98,
        "total_amount": 0.96,
        "date": 0.92
      }
    }
  }
}
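
To show how a client might consume this structure, here is a small sketch that flattens the response into the fields most callers need. It assumes the response shape shown above and that the first row of each table is a header row; `table_to_records` and `summarize` are hypothetical helper names.

```python
import json


def table_to_records(table):
    """Turn one entry of `tables` into a list of dicts keyed by the header row."""
    rows = table.get("rows", [])
    if not rows:
        return []
    header, *body = rows  # assumption: the first row holds the column names
    return [dict(zip(header, row)) for row in body]


def summarize(response):
    """Pull the commonly used fields out of a completed response."""
    if response.get("status") != "completed":
        raise ValueError(f"document not processed: status={response.get('status')!r}")
    data = response["extracted_data"]
    return {
        "document_id": response["document_id"],
        "entities": data.get("entities", {}),
        "line_items": [
            record
            for table in data.get("tables", [])
            for record in table_to_records(table)
        ],
    }


# Trimmed copy of the sample response above.
sample = json.loads("""
{
  "document_id": "doc_abc123",
  "status": "completed",
  "extracted_data": {
    "entities": {"invoice_number": "INV-2024-001", "total_amount": 1250.00},
    "tables": [
      {"page": 1, "rows": [
        ["Item", "Quantity", "Price", "Total"],
        ["Web Development", "1", "$1000.00", "$1000.00"],
        ["Consulting", "5 hours", "$50.00", "$250.00"]
      ]}
    ]
  }
}
""")

summary = summarize(sample)
print(summary["line_items"][1]["Total"])  # prints $250.00
```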

Practical Use Cases

1. Invoice Processing Automation

Automate accounts payable by extracting key invoice data:

curl -X POST https://api.aaido.dev/v1/products/docprocessor \
  -H "Authorization: Bearer dp_12345abcdef..." \
  -F "file=@vendor_invoice.pdf" \
  -F "document_type=invoice" \
  -F "extract_tables=true" \
  -F "extract_line_items=true"

This extracts vendor information, invoice numbers, dates, amounts, and line items for direct integration into accounting systems. The structured response allows automatic validation against purchase orders and streamlines approval workflows.
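
Before matching an invoice against a purchase order, a simple internal consistency check can catch extraction mistakes early. The sketch below assumes the table layout from the sample response (a header row containing a "Total" column, with dollar-formatted strings); the function names are hypothetical.

```python
from decimal import Decimal


def parse_money(text):
    """Convert a string such as "$1,250.00" to a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def line_items_match_total(rows, total_amount, total_column="Total"):
    """Check that the line totals in a table add up to the invoice total.

    `rows` is a table's `rows` list with the header in the first row;
    `total_amount` is the `total_amount` entity from the same response.
    """
    header, *body = rows
    idx = header.index(total_column)
    line_sum = sum((parse_money(row[idx]) for row in body), Decimal("0"))
    # Go through str() so a float such as 1250.0 converts without binary noise.
    return line_sum == Decimal(str(total_amount))
```

Using `Decimal` rather than floats avoids rounding surprises when summing currency values. An invoice that fails this check is a good candidate for manual review.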

2. Contract Analysis and Data Extraction

Extract critical terms and dates from legal contracts:

curl -X POST https://api.aaido.dev/v1/products/docprocessor \
  -H "Authorization: Bearer dp_12345abcdef..." \
  -F "file=@service_agreement.pdf" \
  -F "document_type=contract" \
  -F "extract_clauses=true" \
  -F "identify_parties=true"

The API identifies contract parties, effective dates, termination clauses, payment terms, and key obligations. This enables automated contract review processes and deadline tracking systems.

3. Receipt and Expense Management

Process expense receipts for automated reporting:

curl -X POST https://api.aaido.dev/v1/products/docprocessor \
  -H "Authorization: Bearer dp_12345abcdef..." \
  -F "file=@receipt.jpg" \
  -F "document_type=receipt" \
  -F "categorize_expenses=true"

This extracts merchant names, transaction amounts, dates, and expense categories for direct integration into expense management systems.

CI/CD Pipeline Integration

GitHub Actions Integration

Here's a practical example of integrating DocProcessor into a CI/CD pipeline for automated document processing:

name: Process Documents
on:
  push:
    paths: ['documents/**']

# Required so the final step can push the processed output back to the repository
permissions:
  contents: write

jobs:
  process-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Process new documents
        env:
          DOCPROCESSOR_API_KEY: ${{ secrets.DOCPROCESSOR_API_KEY }}
        run: |
          # Make sure the output directory exists before writing to it
          mkdir -p processed

          for file in documents/*.pdf; do
            if [[ -f "$file" ]]; then
              echo "Processing $file"

              response=$(curl -s -X POST https://api.aaido.dev/v1/products/docprocessor \
                -H "Authorization: Bearer $DOCPROCESSOR_API_KEY" \
                -F "file=@$file" \
                -F "document_type=auto")

              # Validate processing success before saving anything
              status=$(echo "$response" | jq -r '.status')
              if [ "$status" != "completed" ]; then
                echo "Processing failed for $file"
                exit 1
              fi

              # Extract document ID for tracking
              doc_id=$(echo "$response" | jq -r '.document_id')

              # Save extracted data
              echo "$response" | jq '.extracted_data' > "processed/${doc_id}.json"
            fi
          done

      - name: Commit processed data
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add processed/
          git commit -m "Add processed document data" || exit 0
          git push

Environment Variables Setup

Store your API key securely in your CI/CD environment:

# For GitHub Actions
# Add DOCPROCESSOR_API_KEY to repository secrets

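# For Jenkins
# (prefer injecting the value from the Jenkins credentials store over hardcoding it)
export DOCPROCESSOR_API_KEY="dp_12345abcdef..."

# For Docker containers
# (-e with no value passes the variable through from the host environment)
docker run -e DOCPROCESSOR_API_KEY your-app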

Error Handling

The API returns standard HTTP status codes with detailed error messages:

{
  "error": "invalid_document_type",
  "message": "Document type 'unknown' is not supported",
  "supported_types": ["auto", "invoice", "contract", "receipt", "form"]
}

Common error codes:

  • 400: Invalid request parameters
  • 401: Invalid or missing API key
  • 413: File too large (max 10 MB)
  • 415: Unsupported file format
  • 429: Rate limit exceeded
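
A client can use these codes to decide whether a failed request is worth retrying. The sketch below is one possible policy, not documented API behavior: it assumes the error body has the `error` and `message` fields shown above, treats only 429 as retryable, and uses exponential backoff with arbitrary default delays.

```python
import json

# Assumption: only rate limiting (429) is worth retrying automatically; the
# other documented codes indicate a request that has to be changed first.
RETRYABLE_STATUS = {429}


def classify_error(status_code, body_text):
    """Return (retryable, summary) for a failed response."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        payload = None
    if not isinstance(payload, dict):
        payload = {}
    code = payload.get("error", "unknown_error")
    message = payload.get("message")
    summary = f"{status_code} {code}"
    if message:
        summary += f": {message}"
    return status_code in RETRYABLE_STATUS, summary


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Exponential delays in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```

A caller would sleep for each delay in turn between retries of a 429, and surface the summary immediately for anything else.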

Best Practices

  1. File Size Optimization: Compress PDFs before upload to stay under the 10 MB limit and reduce processing time
  2. Document Type Hints: Set document_type when the type is known, rather than relying on auto, to improve accuracy
  3. Batch Processing: Process multiple documents in parallel for better throughput, keeping concurrency within your rate limit
  4. Confidence Validation: Check confidence_scores before using extracted data, and route low-scoring fields to manual review
  5. Rate Limiting: Throttle requests on the client side and back off when the API returns 429
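
Two of these practices, confidence validation and bounded parallelism, can be sketched in a few lines. The 0.95 threshold and the worker count are arbitrary assumptions to tune for your own data and rate limit, and `process_one` stands in for whatever function uploads a single document.

```python
from concurrent.futures import ThreadPoolExecutor


def fields_needing_review(response, threshold=0.95):
    """Return entity names whose confidence score is below `threshold`, sorted."""
    scores = (
        response.get("extracted_data", {})
        .get("confidence_scores", {})
        .get("entities", {})
    )
    return sorted(name for name, score in scores.items() if score < threshold)


def process_batch(paths, process_one, max_workers=4):
    """Run `process_one(path)` for each path with a bounded number of threads.

    Results come back in the same order as `paths`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, paths))
```

With the sample response from earlier, `fields_needing_review` returns `["date"]`, since only the date score (0.92) falls below 0.95. Capping `max_workers` keeps parallel uploads from running straight into 429 responses.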

The DocProcessor API streamlines document-heavy workflows by automating data extraction and enabling seamless integration into existing systems. For complete API documentation and advanced features, visit https://api.aaido.dev/products/docprocessor.
