Building a PDF Parser for HDFC Bank Statements: From 165 Pages to CSV in Minutes
GitHub Repository | Star it if you find it useful!
The Problem That Started It All
Picture this: You're an auditor, accountant, or financial analyst staring at a 165-page HDFC Bank statement with 3,602 transactions that need to be converted to CSV format. The manual process would take days, and the risk of errors is enormous.
That's exactly the challenge I faced recently, and it led me to build an open-source solution that I'm excited to share with the community.
The Solution: HDFC PDF to CSV Converter
I created a Python tool that automatically extracts all transactions from HDFC Bank PDF statements and converts them to CSV format with intelligent categorization. Here's what it accomplishes:
- 100% extraction rate from 165-page PDFs
- 3,602 transactions processed automatically
- 22 automatic categories (UPI, Foreign Exchange, Salary, etc.)
- Multi-line narration support for complex transactions
- Multiple output formats (CSV, Excel, Markdown)
- Command-line interface for easy automation
Quick Start
# Clone the repository
git clone https://github.com/vishwaraja/hdfc-pdf-converter.git
cd hdfc-pdf-converter
# Install dependencies
pip install -r requirements.txt
# Convert your first PDF (creates ./results/ directory automatically)
python src/hdfc_converter.py your_statement.pdf
Technical Deep Dive
The Tech Stack
# Core dependencies
camelot-py[cv] # PDF table extraction
pandas # Data manipulation
PyPDF2 # PDF processing
pdfplumber # Text extraction
The Challenge: Multi-line Narrations
One of the biggest challenges was handling transactions where the narration spans multiple lines. Here's how I solved it:
def _parse_transaction_row(self, row, page_num):
    """Parse a single transaction row with multi-line support."""
    # Handle multi-line narrations
    narration_parts = []
    # Everything between date and amounts is narration
    narration_start = 1
    narration_end = len(row) - 5
    for i in range(narration_start, narration_end):
        part = str(row.iloc[i]).strip()
        # Skip empty cells and NaN placeholders left by table extraction
        if part and part != 'nan':
            narration_parts.append(part)
    narration = ' '.join(narration_parts)
    return narration
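To see the slicing in isolation, here is a minimal sketch that applies the same idea to a plain Python list instead of a pandas row. The helper name `join_narration`, the `trailing_columns` parameter, and the sample cells are illustrative assumptions, not code from the repository; the default of 5 simply mirrors `len(row) - 5` above.

```python
def join_narration(cells, trailing_columns=5):
    """Join the cells between the date (index 0) and the trailing columns.

    Hypothetical helper: it mirrors the slice used in the method above,
    but works on a plain list so it can be tried without pandas.
    """
    parts = []
    for cell in cells[1:len(cells) - trailing_columns]:
        text = str(cell).strip()
        # Skip empty cells and NaN placeholders
        if text and text.lower() != "nan":
            parts.append(text)
    return " ".join(parts)


# Made-up row: date, three narration fragments, then five trailing cells
sample_row = [
    "15/07/2020",
    "UPI-MERCHANT NAME-",
    "merchant@bank-REF123",
    "nan",
    "0000012345",
    "15/07/2020",
    "150.00",
    "",
    "9850.00",
]
print(join_narration(sample_row))  # prints: UPI-MERCHANT NAME- merchant@bank-REF123
```

Keeping the number of trailing columns as a parameter makes it easier to adapt the slice if a statement layout has a different number of columns after the narration.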
Intelligent Categorization
The tool automatically categorizes transactions into 22 meaningful categories:
def categorize_transaction(narration):
    narration_lower = str(narration).lower()
    if any(word in narration_lower for word in ['salary', 'payroll', 'betterplace']):
        return 'Salary & Employment'
    elif any(word in narration_lower for word in ['foreign', 'usd', 'eur', 'gbp']):
        return 'Foreign Exchange'
    elif any(word in narration_lower for word in ['upi']):
        return 'UPI Payments'
    # ... and 19 more categories
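One possible refinement, sketched here as an assumption rather than as the repository's implementation, is to keep the keywords in an ordered table and match them on word boundaries. Plain substring checks can misfire (for example, `'eur'` is a substring of "entrepreneur"), which word-boundary regexes avoid. Only the three categories shown above are included; the `RULES` table, the `categorize` name, and the `"Other"` fallback are hypothetical.

```python
import re

# Ordered rules: the first matching category wins.
# Only the three categories quoted in the article are listed here.
RULES = [
    ("Salary & Employment", ["salary", "payroll", "betterplace"]),
    ("Foreign Exchange", ["foreign", "usd", "eur", "gbp"]),
    ("UPI Payments", ["upi"]),
]

# Compile one word-boundary pattern per category up front
COMPILED_RULES = [
    (category, re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"))
    for category, keywords in RULES
]


def categorize(narration, default="Other"):
    """Return the first category whose keywords appear as whole words."""
    text = str(narration).lower()
    for category, pattern in COMPILED_RULES:
        if pattern.search(text):
            return category
    return default  # hypothetical fallback label


print(categorize("UPI payment to merchant"))
print(categorize("Entrepreneur meetup fee"))
```

The trade-off is that a keyword glued to digits or letters (such as "UPI123") no longer matches, so real narrations may need extra patterns.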
Real Results
Here's what the tool achieved with my 165-page statement:
| Metric | Result |
|---|---|
| Total Transactions | 3,602 |
| Pages Processed | 165/165 (100%) |
| Extraction Time | ~2 minutes |
| Categories Identified | 22 |
| Data Quality | 100% valid dates |
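A "valid dates" figure like the one in the table can be checked with a few lines of standard-library Python. This is a minimal sketch, assuming dates are written as DD/MM/YYYY in the output CSV; the function name `is_valid_date` and the sample values are illustrative, not taken from the tool.

```python
from datetime import datetime


def is_valid_date(value, fmt="%d/%m/%Y"):
    """Return True if value parses as a real calendar date in the given format."""
    try:
        datetime.strptime(str(value).strip(), fmt)
        return True
    except ValueError:
        return False


# Made-up values: two valid dates and one impossible date
dates = ["15/07/2020", "16/07/2020", "31/02/2020"]
valid_count = sum(is_valid_date(d) for d in dates)
print(f"{valid_count}/{len(dates)} dates are valid")
```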
Sample Output
Date,Narration,Category,Withdrawal_Amount,Deposit_Amount
15/07/2020,UPI payment to merchant,UPI Payments,150.00,0.00
16/07/2020,Salary credit from company,Salary & Employment,0.00,25000.00
17/07/2020,Foreign remittance from USA,Foreign Exchange,0.00,50000.00
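Once the data is in this shape, downstream analysis needs very little code. The sketch below totals withdrawals and deposits per category using only the standard library; it assumes the column names shown in the sample, and the function name `totals_by_category` is hypothetical. The sample rows are embedded as a string so the snippet runs without a real statement.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

SAMPLE_CSV = """Date,Narration,Category,Withdrawal_Amount,Deposit_Amount
15/07/2020,UPI payment to merchant,UPI Payments,150.00,0.00
16/07/2020,Salary credit from company,Salary & Employment,0.00,25000.00
17/07/2020,Foreign remittance from USA,Foreign Exchange,0.00,50000.00
"""


def totals_by_category(csv_file):
    """Sum withdrawals and deposits per category from an open CSV file."""
    totals = defaultdict(lambda: {"withdrawn": Decimal("0"), "deposited": Decimal("0")})
    for row in csv.DictReader(csv_file):
        bucket = totals[row["Category"]]
        # Decimal avoids floating-point rounding on money values
        bucket["withdrawn"] += Decimal(row["Withdrawal_Amount"])
        bucket["deposited"] += Decimal(row["Deposit_Amount"])
    return dict(totals)


totals = totals_by_category(io.StringIO(SAMPLE_CSV))
for category, amounts in totals.items():
    print(category, amounts["withdrawn"], amounts["deposited"])
```

To run it against a real export, pass `open(path, newline="")` instead of the `io.StringIO` wrapper.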
Usage Examples
Command Line Interface
# Basic usage (creates ./results/ directory automatically)
python src/hdfc_converter.py statement.pdf
# Custom output directory
python src/hdfc_converter.py statement.pdf --output-dir ./my_results
# Verbose logging for debugging
python src/hdfc_converter.py statement.pdf --verbose
# Convert PDF from different directory
python src/hdfc_converter.py /path/to/statements/hdfc_2024.pdf
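For readers curious how such an interface can be wired up, here is a minimal `argparse` sketch that accepts the same positional argument and the `--output-dir` and `--verbose` flags used above. It is an illustration under assumptions, not the repository's actual argument parsing; the `build_parser` function, the help strings, and the `pdf_path` destination name are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="hdfc_converter.py",
        description="Convert an HDFC Bank PDF statement to CSV.",
    )
    parser.add_argument("pdf_path", help="Path to the PDF statement")
    parser.add_argument(
        "--output-dir",
        default="./results",  # matches the default directory mentioned above
        help="Directory for the generated files",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Enable verbose logging for debugging",
    )
    return parser


# Parse an explicit argument list so the example runs without a shell
args = build_parser().parse_args(["statement.pdf", "--output-dir", "./my_results"])
print(args.pdf_path, args.output_dir, args.verbose)
```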
Programmatic API
from src.hdfc_converter import HDFCConverter
# Initialize converter
converter = HDFCConverter('statement.pdf', output_dir='./results')
# Convert PDF to CSV
success = converter.convert()
if success:
    print("Conversion completed successfully!")
The Impact
This tool has already saved me hours of manual work and eliminated the risk of transcription errors. But more importantly, it's now available as an open-source solution for the entire community.
Key Benefits for Users:
- Auditors: Quick conversion of bank statements for analysis
- Accountants: Automated data entry from PDF statements
- Fintech Developers: Foundation for building banking tools
- Data Analysts: Clean CSV data for financial analysis
Open Source and Community
I've made this tool completely open source with:
- Comprehensive documentation
- Unit tests and examples
- Contribution guidelines
- Issue templates and PR templates
- CI/CD pipeline
Repository: https://github.com/vishwaraja/hdfc-pdf-converter
What's Next?
I'm excited to see how the community will use and improve this tool. Some potential enhancements:
- Support for other bank PDF formats
- A graphical interface for non-technical users
- Cloud processing capabilities
- Advanced filtering and search features
Lessons Learned
Building this tool taught me several valuable lessons:
- PDF parsing is complex - Different banks use different formats
- Multi-line data is tricky - Requires careful parsing logic
- Categorization needs intelligence - Simple regex isn't enough
- Documentation is crucial - Makes tools accessible to others
- Open source is powerful - Community feedback improves everything
Get Started
Ready to try it out? Here's how to get started:
# Clone the repository
git clone https://github.com/vishwaraja/hdfc-pdf-converter.git
cd hdfc-pdf-converter
# Install dependencies
pip install -r requirements.txt
# Convert your first PDF
python src/hdfc_converter.py your_statement.pdf
Conclusion
What started as a personal problem-solving exercise became a tool that could benefit the entire developer and financial community. This is the power of open source - turning individual solutions into community resources.
I'd love to hear your thoughts, suggestions, and use cases. Have you faced similar challenges with PDF processing? What other banking tools would be useful to the community?
Connect with me:
- GitHub: @vishwaraja
- Email: vishwaraja.pathi@adiyogitech.com
- Repository: hdfc-pdf-converter
Have questions about PDF parsing or want to contribute to the project? Leave a comment below - I'd love to discuss!