DEV Community

agenthustler
agenthustler

Posted on

Scraping Nonprofit Financial Data: Form 990 and IRS Records

Scraping Nonprofit Financial Data: Form 990 and IRS Records

Every nonprofit in the United States is required to file Form 990 with the IRS, disclosing executive compensation, revenue, expenses, and program activities. This data is public but notoriously difficult to access in bulk. Let's build a Python scraper to extract and analyze nonprofit financial data systematically.

Why Nonprofit Financial Data Matters

Donors, journalists, and researchers need to evaluate whether nonprofits use funds effectively. Form 990 reveals executive pay ratios, fundraising efficiency, program spending percentages, and financial health indicators.

Data Sources

  • IRS Exempt Organizations BMF — master list of all tax-exempt organizations
  • ProPublica Nonprofit Explorer API — structured 990 data (best free source)
  • IRS 990 XML files — raw filings on AWS (bulk download)
  • State charity registrations — additional compliance data

ProPublica Nonprofit Explorer API

import requests
import time

class NonprofitExplorer:
    """Thin client for the ProPublica Nonprofit Explorer v2 API.

    The API is free but rate-limited, so every call sleeps briefly to stay
    polite. Network failures and non-2xx responses raise ``requests``
    exceptions instead of silently returning garbage.
    """

    BASE_URL = "https://projects.propublica.org/nonprofits/api/v2"
    REQUEST_DELAY = 0.5  # seconds between calls, courtesy to the public API
    TIMEOUT = 30  # seconds; without a timeout a dead connection hangs forever

    def search_organizations(self, query, state=None, page=0):
        """Return a list of organization dicts matching *query*.

        query -- free-text search term (organization name, keyword, etc.)
        state -- optional two-letter state code filter
        page  -- zero-based result page number
        """
        params = {"q": query, "page": page}
        if state:
            params["state[id]"] = state
        response = requests.get(
            f"{self.BASE_URL}/search.json", params=params, timeout=self.TIMEOUT
        )
        # Fail loudly on HTTP errors rather than trying to .json() an error page.
        response.raise_for_status()
        time.sleep(self.REQUEST_DELAY)
        return response.json().get("organizations", [])

    def get_organization(self, ein):
        """Return the full organization record (including filings) for *ein*."""
        response = requests.get(
            f"{self.BASE_URL}/organizations/{ein}.json", timeout=self.TIMEOUT
        )
        response.raise_for_status()
        time.sleep(self.REQUEST_DELAY)
        return response.json().get("organization", {})
Enter fullscreen mode Exit fullscreen mode

IRS Bulk XML Data

import xml.etree.ElementTree as ET

class IRS990XMLParser:
    """Download and parse IRS Form 990 e-file XML from the public AWS bucket."""

    AWS_INDEX_URL = "https://s3.amazonaws.com/irs-form-990"
    TIMEOUT = 60  # seconds; bulk index/XML downloads can be slow but must not hang

    def get_filing_index(self, year):
        """Return the list of filing-index entries for *year* (network call)."""
        url = f"{self.AWS_INDEX_URL}/index_{year}.json"
        response = requests.get(url, timeout=self.TIMEOUT)
        # Surface a 403/404 as an HTTPError instead of a JSON decode error.
        response.raise_for_status()
        return response.json().get(f"Filings{year}", [])

    def parse_990_xml(self, xml_url):
        """Fetch one 990 XML filing and extract headline financial fields.

        Returns a dict with ein/name/tax_year, key dollar amounts, and an
        "officers" list of {name, title, compensation} dicts. Missing text
        fields come back as None; missing numeric fields as 0.
        """
        response = requests.get(xml_url, timeout=self.TIMEOUT)
        response.raise_for_status()
        root = ET.fromstring(response.content)
        # IRS e-file XML is namespaced; derive the "{uri}" prefix from the
        # root tag so the XPath queries below match regardless of schema year.
        ns = root.tag.split("}")[0] + "}" if "}" in root.tag else ""

        data = {
            "ein": self._find_text(root, f".//{ns}EIN"),
            "name": self._find_text(root, f".//{ns}BusinessName/{ns}BusinessNameLine1Txt"),
            "tax_year": self._find_text(root, f".//{ns}TaxYr"),
            "total_revenue": self._find_number(root, f".//{ns}TotalRevenueAmt"),
            "total_expenses": self._find_number(root, f".//{ns}TotalExpensesAmt"),
            "net_assets": self._find_number(root, f".//{ns}NetAssetsOrFundBalancesEOYAmt"),
            "program_service_expenses": self._find_number(root, f".//{ns}TotalProgramServiceExpensesAmt"),
        }
        data["officers"] = self._extract_officers(root, ns)
        return data

    def _find_text(self, root, path):
        """Return the text of the first element matching *path*, or None."""
        el = root.find(path)
        return el.text if el is not None else None

    def _find_number(self, root, path):
        """Return the first matching element's text as a float, or 0.

        Empty/missing/unparseable values all collapse to 0 so downstream
        arithmetic never has to null-check.
        """
        text = self._find_text(root, path)
        try:
            return float(text) if text else 0
        except ValueError:
            return 0

    def _extract_officers(self, root, ns):
        """Return officer/director compensation rows (entries without a name are skipped)."""
        officers = []
        for comp in root.findall(f".//{ns}OfficerDirectorTrusteeEmplGrp"):
            name = self._find_text(comp, f"{ns}PersonNm")
            title = self._find_text(comp, f"{ns}TitleTxt")
            compensation = self._find_number(comp, f"{ns}ReportableCompFromOrgAmt")
            if name:
                officers.append({"name": name, "title": title, "compensation": compensation})
        return officers
Enter fullscreen mode Exit fullscreen mode

Financial Health Analysis

import pandas as pd

class NonprofitAnalyzer:
    """Compute comparable financial-health metrics from parsed 990 filing data."""

    def calculate_metrics(self, filing_data):
        """Return a metrics dict for one filing.

        filing_data -- dict with keys like total_revenue, total_expenses,
                       program_service_expenses, net_assets, officers.
                       Missing numeric keys default to 0.
        Returns ein/name/revenue/expenses plus derived ratios and a
        coarse program_rating bucket.
        """
        revenue = filing_data.get("total_revenue", 0)
        expenses = filing_data.get("total_expenses", 0)
        program_expenses = filing_data.get("program_service_expenses", 0)
        net_assets = filing_data.get("net_assets", 0)

        metrics = {
            "ein": filing_data.get("ein"),
            "name": filing_data.get("name"),
            "revenue": revenue,
            "expenses": expenses,
        }

        # Share of total spending that goes to programs (vs. overhead/fundraising).
        metrics["program_ratio"] = (
            round(program_expenses / expenses * 100, 1) if expenses > 0 else 0
        )

        # How many months the org could operate on its current net assets.
        monthly_expenses = expenses / 12
        metrics["operating_reserve_months"] = (
            round(net_assets / monthly_expenses, 1) if monthly_expenses > 0 else 0
        )

        officers = filing_data.get("officers", [])
        if officers:
            top_comp = max(o["compensation"] for o in officers)
            metrics["top_executive_comp"] = top_comp
            if revenue > 0:
                # Key is intentionally absent when revenue is 0 or no officers
                # are listed — callers must use .get() for these two fields.
                metrics["comp_to_revenue_pct"] = round(top_comp / revenue * 100, 2)

        # Thresholds are strict (>), so exactly 80% rates GOOD, not EXCELLENT.
        metrics["program_rating"] = (
            "EXCELLENT" if metrics["program_ratio"] > 80
            else "GOOD" if metrics["program_ratio"] > 65
            else "FAIR" if metrics["program_ratio"] > 50
            else "POOR"
        )
        return metrics

    def compare_organizations(self, filings_list):
        """Rank filings by program_ratio (descending) as a DataFrame.

        An empty input returns an empty DataFrame; previously this raised
        KeyError because sort_values was called on a frame with no columns.
        """
        df = pd.DataFrame([self.calculate_metrics(f) for f in filings_list])
        if df.empty:
            return df
        return df.sort_values("program_ratio", ascending=False)
Enter fullscreen mode Exit fullscreen mode

Batch Processing

def batch_analyze_sector(sector_keyword, state=None, limit=500):
    """Search ProPublica for orgs matching *sector_keyword*, compute metrics
    for each org's most recent filing, and write the result to a CSV.

    sector_keyword -- free-text search term passed to the ProPublica API
    state          -- optional two-letter state code filter
    limit          -- maximum number of organizations to analyze
    Returns the metrics DataFrame (also saved as
    nonprofit_<keyword>_<state or 'all'>.csv).
    """
    explorer = NonprofitExplorer()
    analyzer = NonprofitAnalyzer()

    # Page through search results until we have enough orgs or run out.
    orgs = []
    page = 0
    while len(orgs) < limit:
        results = explorer.search_organizations(sector_keyword, state=state, page=page)
        if not results:
            break
        orgs.extend(results)
        page += 1

    all_metrics = []
    for org in orgs[:limit]:
        ein = org.get("ein")
        if not ein:
            # Skip records without an EIN; get_organization() would otherwise
            # request an invalid ".../organizations/None.json" URL.
            continue
        details = explorer.get_organization(ein)
        filings = details.get("filings_with_data", [])
        if not filings:
            continue
        latest = filings[0]  # API returns filings newest-first
        # NOTE(review): field mapping looks suspect — "totprgmrevnue" is
        # program service *revenue*, not program expenses, and "totassetsend"
        # is total assets at end of year, not net assets. Confirm against the
        # ProPublica API field reference before trusting the derived ratios.
        metrics = analyzer.calculate_metrics({
            "ein": ein,
            "name": org.get("name"),
            "total_revenue": latest.get("totrevenue", 0),
            "total_expenses": latest.get("totfuncexpns", 0),
            "program_service_expenses": latest.get("totprgmrevnue", 0),
            "net_assets": latest.get("totassetsend", 0),
            "officers": []
        })
        all_metrics.append(metrics)

    df = pd.DataFrame(all_metrics)
    df.to_csv(f"nonprofit_{sector_keyword}_{state or 'all'}.csv", index=False)
    return df
Enter fullscreen mode Exit fullscreen mode

Scaling with Proxies

For large-scale scraping of state charity registrar sites, use ScraperAPI for rendering-heavy portals. ThorData residential proxies avoid rate limiting on government sites. ScrapeOps monitors scraper health.

Use Cases

  • Donor due diligence — vet charities before giving
  • Investigative journalism — find compensation outliers or financial red flags
  • Academic research — study nonprofit sector trends at scale
  • Grant makers — evaluate applicant financial health

Nonprofit financial transparency shouldn't require a forensic accountant. With these tools, anyone can analyze how organizations spend their funding.

Top comments (0)