Vishwaraja Pathi (Vishwa)
Building a PDF Parser for HDFC Bank Statements: From 165 Pages to CSV in Minutes

🚀 GitHub Repository | ⭐ Star it if you find it useful!

The Problem That Started It All

Picture this: You're an auditor, accountant, or financial analyst staring at a 165-page HDFC Bank statement with 3,602 transactions that need to be converted to CSV format. The manual process would take days, and the risk of errors is enormous.

That's exactly the challenge I faced recently, and it led me to build an open-source solution that I'm excited to share with the community.

The Solution: HDFC PDF to CSV Converter

I created a Python tool that automatically extracts all transactions from HDFC Bank PDF statements and converts them to CSV format with intelligent categorization. Here's what it accomplishes:

  • ✅ 100% extraction rate on a 165-page statement (165/165 pages)
  • ✅ 3,602 transactions processed automatically
  • ✅ 22 automatic categories (UPI, Foreign Exchange, Salary, etc.)
  • ✅ Multi-line narration support for complex transactions
  • ✅ Multiple output formats (CSV, Excel, Markdown)
  • ✅ Command-line interface for easy automation

Quick Start

# Clone the repository
git clone https://github.com/vishwaraja/hdfc-pdf-converter.git
cd hdfc-pdf-converter

# Install dependencies
pip install -r requirements.txt

# Convert your first PDF (creates ./results/ directory automatically)
python src/hdfc_converter.py your_statement.pdf

Technical Deep Dive

The Tech Stack

# Core dependencies
camelot-py[cv]  # PDF table extraction
pandas          # Data manipulation
PyPDF2          # PDF processing
pdfplumber      # Text extraction

The Challenge: Multi-line Narrations

One of the biggest challenges was handling transactions where the narration spans multiple lines. Here's how I solved it:

def _parse_transaction_row(self, row, page_num):
    """Parse a single transaction row with multi-line support.

    Excerpt from the converter class; only the narration handling is shown.
    """
    # Everything between the date (column 0) and the trailing five
    # columns (the amounts side of the row) is narration
    narration_start = 1
    narration_end = len(row) - 5

    narration_parts = []
    for i in range(narration_start, narration_end):
        part = str(row.iloc[i]).strip()
        # Skip empty cells and the 'nan' strings produced by missing values
        if part and part != 'nan':
            narration_parts.append(part)

    narration = ' '.join(narration_parts)
    return narration
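A standalone version of the same idea can make the slicing easier to test on its own. The sketch below assumes a row arrives as a plain list of cell strings rather than a pandas row; `join_narration`, `trailing_fields`, and the sample row are invented for illustration and are not taken from the repository.

```python
from typing import List, Optional


def join_narration(cells: List[Optional[str]], trailing_fields: int = 5) -> str:
    """Join the cells between the date (index 0) and the trailing fields."""
    parts = []
    # Slice out the middle of the row; an empty slice yields an empty narration
    for cell in cells[1:len(cells) - trailing_fields]:
        text = "" if cell is None else str(cell).strip()
        # Skip empty cells and 'nan' placeholders left by table extraction
        if text and text != "nan":
            parts.append(text)
    return " ".join(parts)


# Hypothetical row: the narration was split across three extracted cells
row = ["15/07/2020", "UPI-MERCHANT NAME", "nan", "PAYMENT FOR ORDER",
       "0000123", "15/07/20", "150.00", "", "1000.00"]
print(join_narration(row))
```

Keeping the number of trailing columns as a parameter is a design choice for the sketch only: it lets the same function cope with tables where extraction produced a different column count.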

Intelligent Categorization

The tool automatically categorizes transactions into 22 meaningful categories:

def categorize_transaction(narration):
    """Return a category for a narration; the first matching rule wins."""
    narration_lower = str(narration).lower()

    # Plain substring checks: a short keyword such as 'eur' also matches inside longer words
    if any(word in narration_lower for word in ['salary', 'payroll', 'betterplace']):
        return 'Salary & Employment'
    elif any(word in narration_lower for word in ['foreign', 'usd', 'eur', 'gbp']):
        return 'Foreign Exchange'
    elif 'upi' in narration_lower:
        return 'UPI Payments'
    # ... and 19 more categories
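As the rule list grows, the if/elif chain can become hard to maintain. One possible alternative, shown here as a sketch and not as the tool's actual code, is a table of rules matched with whole-word regular expressions. Note that this swaps the substring checks above for word-boundary matching, so "eur" no longer matches inside "entrepreneur". The `CATEGORY_RULES` table, the `categorize` name, and the "Uncategorized" fallback are assumptions; only the three categories shown above are included.

```python
import re

# Ordered rules: the first matching rule wins, like the if/elif chain
CATEGORY_RULES = [
    ("Salary & Employment", ["salary", "payroll", "betterplace"]),
    ("Foreign Exchange", ["foreign", "usd", "eur", "gbp"]),
    ("UPI Payments", ["upi"]),
]

# Compile one case-insensitive, whole-word pattern per category
_COMPILED = [
    (name, re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b",
                      re.IGNORECASE))
    for name, words in CATEGORY_RULES
]


def categorize(narration, default="Uncategorized"):
    """Return the first category whose keywords appear as whole words."""
    text = str(narration)
    for name, pattern in _COMPILED:
        if pattern.search(text):
            return name
    return default  # hypothetical fallback label


print(categorize("UPI payment to merchant"))
print(categorize("Entrepreneur grant"))
```

The trade-off is that whole-word matching misses inflected forms such as "salaries", so the keyword lists would need to spell those out.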

Real Results

Here's what the tool achieved with my 165-page statement:

Metric                | Result
Total Transactions    | 3,602
Pages Processed       | 165/165 (100%)
Extraction Time       | ~2 minutes
Categories Identified | 22
Data Quality          | 100% valid dates
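A figure like "100% valid dates" can be checked by trying to parse every extracted date. The sketch below is one way to do that and is not necessarily how the tool measures it; it assumes dates in DD/MM/YYYY form, and `invalid_dates` and the sample values are made up for illustration.

```python
from datetime import datetime


def invalid_dates(dates, fmt="%d/%m/%Y"):
    """Return the values that do not parse as real calendar dates."""
    bad = []
    for value in dates:
        try:
            datetime.strptime(value.strip(), fmt)
        except ValueError:
            # Wrong format or an impossible date such as 31 February
            bad.append(value)
    return bad


# Hypothetical column: two good dates, one impossible date, one stray label
dates = ["15/07/2020", "16/07/2020", "31/02/2020", "Closing Balance"]
bad = invalid_dates(dates)
valid_share = 100 * (len(dates) - len(bad)) / len(dates)
print(bad, f"{valid_share:.0f}% valid")
```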

Sample Output

Date,Narration,Category,Withdrawal_Amount,Deposit_Amount
15/07/2020,UPI payment to merchant,UPI Payments,150.00,0.00
16/07/2020,Salary credit from company,Salary & Employment,0.00,25000.00
17/07/2020,Foreign remittance from USA,Foreign Exchange,0.00,50000.00
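Once the data is in this shape, downstream analysis needs very little code. As a rough sketch, the snippet below totals withdrawals and deposits per category using only the standard library. It reads the three sample rows above from a string; `totals_by_category` is a hypothetical helper, and for a real file you would pass an open file handle instead.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

SAMPLE_CSV = """Date,Narration,Category,Withdrawal_Amount,Deposit_Amount
15/07/2020,UPI payment to merchant,UPI Payments,150.00,0.00
16/07/2020,Salary credit from company,Salary & Employment,0.00,25000.00
17/07/2020,Foreign remittance from USA,Foreign Exchange,0.00,50000.00
"""


def totals_by_category(csv_file):
    """Sum withdrawals and deposits for each category in the CSV."""
    totals = defaultdict(lambda: {"withdrawn": Decimal("0"),
                                  "deposited": Decimal("0")})
    for row in csv.DictReader(csv_file):
        bucket = totals[row["Category"]]
        # Decimal avoids the rounding drift of float arithmetic on money
        bucket["withdrawn"] += Decimal(row["Withdrawal_Amount"])
        bucket["deposited"] += Decimal(row["Deposit_Amount"])
    return dict(totals)


summary = totals_by_category(io.StringIO(SAMPLE_CSV))
for category, amounts in summary.items():
    print(category, amounts["withdrawn"], amounts["deposited"])
```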

Usage Examples

Command Line Interface

# Basic usage (creates ./results/ directory automatically)
python src/hdfc_converter.py statement.pdf

# Custom output directory
python src/hdfc_converter.py statement.pdf --output-dir ./my_results

# Verbose logging for debugging
python src/hdfc_converter.py statement.pdf --verbose

# Convert PDF from different directory
python src/hdfc_converter.py /path/to/statements/hdfc_2024.pdf
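For readers curious how a command line like this is usually wired up, here is a minimal `argparse` sketch that accepts the same arguments as the examples above. It is an illustration and not the repository's actual parser: `build_parser`, the `pdf_path` name, and the help strings are assumptions, while the `./results` default follows the behaviour described in the comments above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags used in the examples above."""
    parser = argparse.ArgumentParser(
        description="Convert an HDFC Bank PDF statement to CSV")
    parser.add_argument("pdf_path", help="path to the statement PDF")
    parser.add_argument("--output-dir", default="./results",
                        help="directory for the converted files")
    parser.add_argument("--verbose", action="store_true",
                        help="enable detailed logging")
    return parser


# Parse an explicit argument list so the sketch runs without a real shell call
args = build_parser().parse_args(
    ["statement.pdf", "--output-dir", "./my_results", "--verbose"])
print(args.pdf_path, args.output_dir, args.verbose)
```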

Programmatic API

from src.hdfc_converter import HDFCConverter

# Initialize converter
converter = HDFCConverter('statement.pdf', output_dir='./results')

# Convert PDF to CSV
success = converter.convert()

if success:
    print("✅ Conversion completed successfully!")
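The same two calls could be wrapped in a loop to process several statements at once. The sketch below only assumes what the example above shows, a constructor taking a PDF path and `output_dir`, plus a `convert()` method returning a success flag. `convert_all` and `FakeConverter` are hypothetical; the stand-in class exists so the sketch runs without the repository installed.

```python
from pathlib import Path


def convert_all(pdf_paths, converter_cls, output_root="./results"):
    """Convert each PDF into its own sub-directory; return {pdf: success}."""
    outcomes = {}
    for pdf in map(Path, pdf_paths):
        # One output folder per statement, named after the PDF file
        out_dir = Path(output_root) / pdf.stem
        converter = converter_cls(str(pdf), output_dir=str(out_dir))
        outcomes[str(pdf)] = bool(converter.convert())
    return outcomes


class FakeConverter:
    """Stand-in with the same constructor and convert() shape as above."""

    def __init__(self, pdf_path, output_dir):
        self.pdf_path = pdf_path

    def convert(self):
        # Pretend that only files ending in .pdf convert successfully
        return self.pdf_path.endswith(".pdf")


print(convert_all(["jan.pdf", "notes.txt"], FakeConverter))
```

With the real tool, you would pass `HDFCConverter` in place of `FakeConverter`, for example with the paths from `Path("statements").glob("*.pdf")`.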

The Impact

This tool has already saved me hours of manual work and removed manual transcription, and the errors that come with it, from the process. But more importantly, it's now available as an open-source solution for the entire community.

Key Benefits for Users:

  • Auditors: Quick conversion of bank statements for analysis
  • Accountants: Automated data entry from PDF statements
  • Fintech Developers: Foundation for building banking tools
  • Data Analysts: Clean CSV data for financial analysis

Open Source and Community

I've made this tool completely open source with:

  • 📚 Comprehensive documentation
  • 🧪 Unit tests and examples
  • 🤝 Contribution guidelines
  • 📋 Issue templates and PR templates
  • 🔄 CI/CD pipeline

🔗 Repository: https://github.com/vishwaraja/hdfc-pdf-converter

What's Next?

I'm excited to see how the community will use and improve this tool. Some potential enhancements:

  • Support for other bank PDF formats
  • Graphical interface for non-technical users
  • Cloud processing capabilities
  • Advanced filtering and search features

Lessons Learned

Building this tool taught me several valuable lessons:

  1. PDF parsing is complex - Different banks use different formats
  2. Multi-line data is tricky - Requires careful parsing logic
  3. Categorization needs intelligence - Simple regex isn't enough
  4. Documentation is crucial - Makes tools accessible to others
  5. Open source is powerful - Community feedback improves everything

Get Started

Ready to try it out? Here's how to get started:

# Clone the repository
git clone https://github.com/vishwaraja/hdfc-pdf-converter.git
cd hdfc-pdf-converter

# Install dependencies
pip install -r requirements.txt

# Convert your first PDF
python src/hdfc_converter.py your_statement.pdf

Conclusion

What started as a personal problem-solving exercise became a tool that could benefit the entire developer and financial community. This is the power of open source - turning individual solutions into community resources.

I'd love to hear your thoughts, suggestions, and use cases. Have you faced similar challenges with PDF processing? What other banking tools would be useful to the community?


Have questions about PDF parsing or want to contribute to the project? Leave a comment below - I'd love to discuss!
