Vishwaraja Pathi (Vishwa)

Posted on Sep 25

Building a PDF Parser for HDFC Bank Statements: From 165 Pages to CSV in Minutes

#python #opensource #pdf #hdfc

🚀 GitHub Repository: https://github.com/vishwaraja/hdfc-pdf-converter | ⭐ Star it if you find it useful!

The Problem That Started It All

Picture this: You're an auditor, accountant, or financial analyst staring at a 165-page HDFC Bank statement with 3,602 transactions that need to be converted to CSV format. The manual process would take days, and the risk of errors is enormous.

That's exactly the challenge I faced recently, and it led me to build an open-source solution that I'm excited to share with the community.

The Solution: HDFC PDF to CSV Converter

I created a Python tool that automatically extracts all transactions from HDFC Bank PDF statements and converts them to CSV format with intelligent categorization. Here's what it accomplishes:

✅ All 165 pages of a 165-page statement extracted (100%)
✅ 3,602 transactions processed automatically
✅ 22 automatic categories (UPI, Foreign Exchange, Salary, etc.)
✅ Multi-line narration support for complex transactions
✅ Multiple output formats (CSV, Excel, Markdown)
✅ Command-line interface for easy automation

Quick Start

# Clone the repository
git clone https://github.com/vishwaraja/hdfc-pdf-converter.git
cd hdfc-pdf-converter

# Install dependencies
pip install -r requirements.txt

# Convert your first PDF (creates ./results/ directory automatically)
python src/hdfc_converter.py your_statement.pdf

Technical Deep Dive

The Tech Stack

# Core dependencies
camelot-py[cv]  # PDF table extraction
pandas          # Data manipulation
PyPDF2          # PDF processing
pdfplumber      # Text extraction

The Challenge: Multi-line Narrations

One of the biggest challenges was handling transactions where the narration spans multiple lines. Here's how I solved it:

def _parse_transaction_row(self, row, page_num):
    """Join the narration cells of one extracted table row into a single string.

    `row` is indexed positionally with .iloc (a pandas Series);
    `page_num` is not used in this excerpt.
    """
    # Column 0 is the date and the last five columns (the amounts and other
    # trailing fields) are excluded, so everything in between is narration.
    narration_start = 1
    narration_end = len(row) - 5

    narration_parts = []
    for i in range(narration_start, narration_end):
        part = str(row.iloc[i]).strip()
        # Skip empty cells and the string form of a missing value.
        if part and part != 'nan':
            narration_parts.append(part)

    return ' '.join(narration_parts)
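
A narration can also wrap onto a following table row that has no date of its own. The sketch below shows one way to fold such rows into the previous transaction. It is a minimal illustration, not the converter's actual code: it assumes each row is a plain list of strings with the date in the first cell and the narration in the second, and the function name `merge_continuation_rows` and the sample rows are made up for the example.

```python
def merge_continuation_rows(rows):
    """Fold rows with an empty date cell into the preceding transaction.

    Assumed layout (hypothetical): row[0] is the date, row[1] is the
    narration, and any remaining cells are carried over unchanged.
    """
    merged = []
    for row in rows:
        date, narration = row[0].strip(), row[1].strip()
        if date:
            # A non-empty date starts a new transaction.
            merged.append([date, narration] + [cell.strip() for cell in row[2:]])
        elif merged and narration:
            # No date: append the text to the previous transaction's narration.
            merged[-1][1] = f"{merged[-1][1]} {narration}"
    return merged


# Made-up rows for illustration only.
rows = [
    ["15/07/2020", "UPI-MERCHANT NAME-", "150.00", ""],
    ["", "merchant@bank-REF", "", ""],
    ["16/07/2020", "SALARY CREDIT", "", "25000.00"],
]
result = merge_continuation_rows(rows)
```

Rows without a date that appear before any transaction are dropped here; a real parser might prefer to log them instead.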

Intelligent Categorization

The tool automatically categorizes transactions into 22 meaningful categories:

def categorize_transaction(narration):
    """Return the first category whose keywords appear in the narration."""
    narration_lower = str(narration).lower()

    # Checks are substring matches evaluated in order, so the first hit wins.
    if any(word in narration_lower for word in ['salary', 'payroll', 'betterplace']):
        return 'Salary & Employment'
    elif any(word in narration_lower for word in ['foreign', 'usd', 'eur', 'gbp']):
        return 'Foreign Exchange'
    elif any(word in narration_lower for word in ['upi']):
        return 'UPI Payments'
    # ... and 19 more categories
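
As the number of categories grows, the same first-match logic can be driven by a table of rules instead of a long if/elif chain. The following is a sketch under assumptions, not the project's implementation: it reuses only the three categories shown above, and the names `CATEGORY_RULES`, `categorize`, and the `"Uncategorized"` fallback are hypothetical.

```python
# Ordered (category, keywords) pairs; earlier rules take priority.
# Only the three categories shown in the post are listed here.
CATEGORY_RULES = [
    ("Salary & Employment", ("salary", "payroll", "betterplace")),
    ("Foreign Exchange", ("foreign", "usd", "eur", "gbp")),
    ("UPI Payments", ("upi",)),
]


def categorize(narration, rules=CATEGORY_RULES, default="Uncategorized"):
    """Return the first category with a keyword contained in the narration."""
    text = str(narration).lower()
    for category, keywords in rules:
        # Plain substring match, same as the if/elif version.
        if any(keyword in text for keyword in keywords):
            return category
    # The fallback label is an assumption made for this sketch.
    return default
```

Because matching is by substring and the first rule wins, a narration such as "UPI salary transfer" lands in Salary & Employment rather than UPI Payments, so rule order is part of the design.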

Real Results

Here's what the tool achieved with my 165-page statement:

Metric	Result
Total Transactions	3,602
Pages Processed	165/165 (100%)
Extraction Time	~2 minutes
Categories Identified	22
Data Quality	100% valid dates
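
A figure like "100% valid dates" can be re-checked on the output with a few lines of Python. This is a minimal sketch, assuming dates are written as DD/MM/YYYY as in the sample CSV in this post; the helper names `is_valid_date` and `valid_date_ratio` are hypothetical and not part of the tool.

```python
from datetime import datetime


def is_valid_date(value, fmt="%d/%m/%Y"):
    """Return True if value parses as a real calendar date in the given format."""
    try:
        datetime.strptime(value.strip(), fmt)
        return True
    except ValueError:
        return False


def valid_date_ratio(dates):
    """Fraction of entries that parse as valid dates (0.0 for an empty list)."""
    if not dates:
        return 0.0
    return sum(is_valid_date(d) for d in dates) / len(dates)
```

Using `strptime` rather than a regular expression also rejects impossible dates such as 31/02/2020.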

Sample Output

Date,Narration,Category,Withdrawal_Amount,Deposit_Amount
15/07/2020,UPI payment to merchant,UPI Payments,150.00,0.00
16/07/2020,Salary credit from company,Salary & Employment,0.00,25000.00
17/07/2020,Foreign remittance from USA,Foreign Exchange,0.00,50000.00
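
Once the data is in this shape, downstream analysis needs only the standard library. The sketch below totals deposits minus withdrawals per category using the three sample rows above. It assumes the column names shown in the sample header; the function name `net_by_category` is made up for the example.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# The sample rows from this post, inlined so the example is self-contained.
SAMPLE_CSV = """Date,Narration,Category,Withdrawal_Amount,Deposit_Amount
15/07/2020,UPI payment to merchant,UPI Payments,150.00,0.00
16/07/2020,Salary credit from company,Salary & Employment,0.00,25000.00
17/07/2020,Foreign remittance from USA,Foreign Exchange,0.00,50000.00
"""


def net_by_category(csv_file):
    """Sum deposits minus withdrawals for each category."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in csv.DictReader(csv_file):
        deposit = Decimal(row["Deposit_Amount"])
        withdrawal = Decimal(row["Withdrawal_Amount"])
        totals[row["Category"]] += deposit - withdrawal
    return dict(totals)


totals = net_by_category(io.StringIO(SAMPLE_CSV))
```

`Decimal` is used instead of `float` so that currency amounts add up without binary rounding error. For a real file, pass `open("path/to/output.csv", newline="")` in place of the `StringIO` object.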

Usage Examples

Command Line Interface

# Basic usage (creates ./results/ directory automatically)
python src/hdfc_converter.py statement.pdf

# Custom output directory
python src/hdfc_converter.py statement.pdf --output-dir ./my_results

# Verbose logging for debugging
python src/hdfc_converter.py statement.pdf --verbose

# Convert PDF from different directory
python src/hdfc_converter.py /path/to/statements/hdfc_2024.pdf
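
For readers curious how flags like these are typically wired up, here is a sketch using the standard `argparse` module. It is an illustration only and may not match the project's real argument handling: the function name `build_parser` and the help texts are assumptions, while the positional PDF path, `--output-dir`, `--verbose`, and the `./results` default come from the commands above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Convert an HDFC Bank statement PDF to CSV (sketch)."
    )
    parser.add_argument("pdf_path", help="Path to the statement PDF")
    parser.add_argument(
        "--output-dir",
        default="./results",
        help="Directory for the generated files",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Enable detailed logging",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["statement.pdf", "--output-dir", "./my_results"])
```

`argparse` converts `--output-dir` to the attribute `args.output_dir`, and `--verbose` stays `False` unless the flag is passed.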

Programmatic API

from src.hdfc_converter import HDFCConverter

# Initialize converter
converter = HDFCConverter('statement.pdf', output_dir='./results')

# Convert PDF to CSV
success = converter.convert()

if success:
    print("✅ Conversion completed successfully!")

The Impact

This tool has already saved me hours of manual work and eliminated the risk of transcription errors. But more importantly, it's now available as an open-source solution for the entire community.

Key Benefits for Users:

Auditors: Quick conversion of bank statements for analysis
Accountants: Automated data entry from PDF statements
Fintech Developers: Foundation for building banking tools
Data Analysts: Clean CSV data for financial analysis

Open Source and Community

I've made this tool completely open source with:

📚 Comprehensive documentation
🧪 Unit tests and examples
🤝 Contribution guidelines
📋 Issue templates and PR templates
🔄 CI/CD pipeline

🔗 Repository: https://github.com/vishwaraja/hdfc-pdf-converter

What's Next?

I'm excited to see how the community will use and improve this tool. Some potential enhancements:

Support for other bank PDF formats
Graphical interface (GUI) for non-technical users
Cloud processing capabilities
Advanced filtering and search features

Lessons Learned

Building this tool taught me several valuable lessons:

PDF parsing is complex - Different banks use different formats
Multi-line data is tricky - Requires careful parsing logic
Categorization needs intelligence - Simple regex isn't enough
Documentation is crucial - Makes tools accessible to others
Open source is powerful - Community feedback improves everything

Get Started

Ready to try it out? Here's how to get started:

# Clone the repository
git clone https://github.com/vishwaraja/hdfc-pdf-converter.git
cd hdfc-pdf-converter

# Install dependencies
pip install -r requirements.txt

# Convert your first PDF
python src/hdfc_converter.py your_statement.pdf

Conclusion

What started as a personal problem-solving exercise became a tool that could benefit the entire developer and financial community. This is the power of open source - turning individual solutions into community resources.

I'd love to hear your thoughts, suggestions, and use cases. Have you faced similar challenges with PDF processing? What other banking tools would be useful to the community?

Have questions about PDF parsing or want to contribute to the project? Leave a comment below - I'd love to discuss!
