Court Records Research for Legal & Due Diligence Workflows (2026)
When you are evaluating a company for acquisition, a candidate for a sensitive hire, or a counterparty for a multi-million-dollar contract, "have they been sued before?" is often the most informative single question you can ask. Public court records answer it — if you can extract them at scale.
Some grounding numbers for 2026: PACER (the federal court records system) contains 1.2 billion documents across 450+ million dockets. State courts collectively file around 83 million new cases per year (National Center for State Courts, 2024 data), of which approximately 21 million are civil. Most of this is public, but less than 8% is freely searchable through any unified API. The DOJ's 2025 report on fraud prosecutions showed that private-sector litigation checks surface roughly 3.4x more red flags per target than a public-records-only check, but only when the search covers both federal and state courts and does proper name disambiguation. A superficial "Google the name" check catches about 12% of material cases. A PACER-only check catches federal cases but misses the bulk of state civil matters and all county matters. A real diligence pipeline crosses all three tiers, which is exactly what this post builds.
The problem: US court records are scattered across PACER (federal), fifty state systems, and hundreds of county-level courts, each with their own auth scheme, rate limits, and data model. Pulling a comprehensive record for a single subject can take a paralegal half a day. Pulling it for 200 subjects (due-diligence scale) is a project that used to justify hiring a boutique diligence firm at $250/subject. In 2026 it becomes a scripted step that completes overnight and costs roughly 1/40th of that per subject.
This post walks through how to build an automated court records research pipeline in 2026 using Apify, with use cases spanning M&A diligence, litigation tracking, KYC, and background checks. A brief note on framing: this is not about turning your laptop into a credit bureau. Any decision with FCRA implications (hiring, credit, insurance, housing) requires compliant processes, proper adverse-action procedures, and often a licensed CRA. What you can build is a research tool — a way to know what public records exist about an entity before you sign a contract, partner with a company, or fund a deal.
Why this is hard
- No unified index. PACER covers federal. State systems are individually gated. County courts often have terrible interfaces or none.
- PACER costs real money. $0.10/page for search results, capped at $3.00 per document. 100 searches can cost $30+.
- Name matching is fuzzy. "John A. Smith" and "Smith, John A." and "J. A. Smith" are the same person, or three different people. Disambiguation requires DOB, location, or case context (a normalization sketch follows this list).
- Authentication and CAPTCHAs. Many state systems throw CAPTCHAs on anything that looks automated.
- Compliance. FCRA (in the US) restricts how court records can be used for employment or credit decisions. Your pipeline needs an audit trail.
- Sealed and expunged records. Some states automatically seal juvenile and expunged cases; others seal only upon petition. Absence of a hit in a search does not mean absence of a case — it means absence of a visible case.
- Historical depth varies. Some jurisdictions have digitized back to the 1980s; others only 2005+. For older cases, physical courthouse archives may be the only option.
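Name matching deserves a concrete illustration. Here is a minimal, dependency-free normalization sketch that maps "Smith, John A." and "John A. Smith" to the same comparison key. Note that it deliberately does not collapse "J. A. Smith", which shares only initials and needs a DOB or location to resolve:

```python
import re

def name_key(raw: str) -> str:
    """Normalize a party name into a crude comparison key.

    'Smith, John A.' and 'John A. Smith' both become 'john a smith'.
    Initial-only forms like 'J. A. Smith' intentionally do NOT collapse
    to the same key; that ambiguity needs a secondary identifier.
    """
    name = raw.strip().lower()
    if "," in name:                    # 'last, first middle' -> 'first middle last'
        last, rest = name.split(",", 1)
        name = f"{rest.strip()} {last.strip()}"
    name = re.sub(r"[.,]", "", name)   # drop punctuation
    return re.sub(r"\s+", " ", name)   # collapse whitespace

assert name_key("Smith, John A.") == name_key("John A. Smith") == "john a smith"
```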
The architecture
```text
[List of subjects (name, location)]
                |
                v
[court-records-search actor] --> federal + state + county hits
                |
                v
[Dedup & entity resolution]
                |
                v
[Case detail enrichment]
                |
                v
[Postgres + audit log]
                |
                v
[Report generator (PDF/CSV)]
```
The court-records-search actor abstracts over the major federal and state court systems, handles CAPTCHAs via residential proxies, and returns normalized JSON.
Step 1: Search by subject
```python
from apify_client import ApifyClient

client = ApifyClient("APIFY_TOKEN")  # your Apify API token

subjects = [
    {"name": "Acme Holdings LLC", "state": "DE", "type": "entity"},
    {"name": "John A. Smith", "state": "NY", "dob_year": 1978, "type": "person"},
]

run = client.actor("nexgendata/court-records-search").call(run_input={
    "subjects": subjects,
    "jurisdictions": ["federal", "NY", "DE", "CA"],
    "case_types": ["civil", "criminal", "bankruptcy"],
    "date_from": "2015-01-01",
})

cases = list(client.dataset(run["defaultDatasetId"]).iterate_items())
```
Each case comes back as normalized JSON:
```json
{
  "subject_query": "Acme Holdings LLC",
  "case_number": "1:23-cv-04578",
  "court": "S.D.N.Y.",
  "jurisdiction": "federal",
  "case_type": "civil",
  "filed_date": "2023-08-12",
  "case_status": "Terminated",
  "parties": [
    {"role": "plaintiff", "name": "Bright Path Partners"},
    {"role": "defendant", "name": "Acme Holdings LLC"}
  ],
  "causes_of_action": ["Breach of contract", "Fraudulent inducement"],
  "disposition": "Settled",
  "pacer_url": "https://ecf.nysd.uscourts.gov/..."
}
```
Step 2: Dedupe and disambiguate
The same case may appear under both the plaintiff and the defendant name in your result set. Dedupe on jurisdiction plus case number (case numbers repeat across jurisdictions):
```python
seen = set()
unique = []
for c in cases:
    key = (c["jurisdiction"], c["case_number"])
    if key not in seen:
        seen.add(key)
        unique.append(c)
```
For person subjects, apply a secondary filter: drop, or tag as unconfirmed, cases where the party name matches but the DOB or state is a known mismatch. Entity subjects (LLCs) are easier because their names are more distinctive.
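A minimal sketch of that secondary filter, using only fields from the sample record above (the subject's state and the case's jurisdiction). It is conservative by design: a mismatch downgrades a hit to "unconfirmed" for human review rather than dropping it, since people move and litigate across state lines:

```python
def match_confidence(subject: dict, case: dict) -> str:
    """Tag person-subject matches by whether secondary identifiers line up."""
    if subject["type"] == "entity":
        return "confirmed"      # entity names are distinctive enough
    jurisdiction = case["jurisdiction"]
    if jurisdiction == "federal" or jurisdiction == subject.get("state"):
        return "confirmed"
    return "unconfirmed"        # name matched, but in an unexpected state

for c in unique:
    subj = next(s for s in subjects if s["name"] == c["subject_query"])
    c["match_confidence"] = match_confidence(subj, c)
```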
Step 3: Classify the risk signal
Not every case is a red flag. A breach-of-contract case where the subject won is very different from a fraud case they settled. Tag each case:
```python
RED_FLAGS = {"fraud", "securities fraud", "wire fraud",
             "embezzlement", "ponzi", "misappropriation"}

def risk_level(case):
    coas = " ".join(c.lower() for c in case.get("causes_of_action", []))
    if any(flag in coas for flag in RED_FLAGS):
        return "high"
    if case.get("case_type") == "criminal":
        return "high"
    # A plaintiff's judgment only matters if our subject was the defendant,
    # so check the subject's role, not merely that some defendant exists.
    subject = case["subject_query"].lower()
    if case.get("disposition") == "Judgment for plaintiff" and any(
        p["role"] == "defendant" and subject in p["name"].lower()
        for p in case["parties"]
    ):
        return "medium"
    return "low"

for c in unique:
    c["risk_level"] = risk_level(c)
```
Step 4: Summarize per subject
```python
from collections import defaultdict

summary = defaultdict(lambda: {"total": 0, "high": 0, "medium": 0, "low": 0, "cases": []})

for c in unique:
    s = summary[c["subject_query"]]
    s["total"] += 1
    s[c["risk_level"]] += 1
    s["cases"].append(c)

for subject, s in summary.items():
    print(f"{subject}: {s['total']} cases ({s['high']} high-risk)")
```
Step 5: Generate a report
For M&A and diligence, the deliverable is a PDF. Render a Jinja template with case summaries, risk levels, and links back to source documents, and audit-log every query — including time, operator, and subject — for FCRA compliance.
Here is a minimal but realistic report-generation pipeline that writes both the audit log and the PDF. It uses Jinja2 + WeasyPrint, both of which are two-minute installs:
```python
import datetime
import hashlib
import json

import psycopg
from jinja2 import Template
from weasyprint import HTML

AUDIT_DB = psycopg.connect("postgresql://.../audit")

# Explicit rank so high-risk cases sort first. A plain alphabetical sort
# on risk_level would put "medium" ahead of "high".
RISK_ORDER = {"high": 0, "medium": 1, "low": 2}

def audit(operator, subject, query_hash, n_hits):
    """Record who searched for what, when, and how many hits came back."""
    with AUDIT_DB.cursor() as cur:
        cur.execute("""
            INSERT INTO diligence_audit
                (ts, operator, subject_name, query_hash, n_hits)
            VALUES (%s, %s, %s, %s, %s)
        """, (datetime.datetime.now(datetime.timezone.utc),
              operator, subject, query_hash, n_hits))
    AUDIT_DB.commit()

REPORT_TEMPLATE = Template("""
<html>
<head><style>
  body { font-family: -apple-system, Helvetica, Arial; }
  .high { background: #ffe5e5; padding: 8px; margin: 4px 0; border-left: 4px solid #c00; }
  .medium { background: #fff4d6; padding: 8px; margin: 4px 0; border-left: 4px solid #d88; }
  .low { background: #f0f0f0; padding: 8px; margin: 4px 0; border-left: 4px solid #999; }
  table { width: 100%; border-collapse: collapse; margin-top: 12px; }
  td, th { padding: 6px 10px; border-bottom: 1px solid #ddd; text-align: left; }
</style></head>
<body>
  <h1>Diligence Report — {{ subject }}</h1>
  <p>Generated {{ generated_at }} by {{ operator }}</p>
  <p>Jurisdictions searched: {{ jurisdictions|join(', ') }}</p>
  <h2>Summary</h2>
  <ul>
    <li>Total cases: {{ s.total }}</li>
    <li><b>High risk:</b> {{ s.high }}</li>
    <li>Medium risk: {{ s.medium }}</li>
    <li>Low risk: {{ s.low }}</li>
  </ul>
  <h2>Case details</h2>
  {% for c in s.cases %}
  <div class="{{ c.risk_level }}">
    <b>{{ c.case_number }}</b> — {{ c.court }} ({{ c.case_type }})<br/>
    Filed: {{ c.filed_date }} | Status: {{ c.case_status }}<br/>
    Causes: {{ c.causes_of_action|default([])|join(', ') }}<br/>
    Disposition: {{ c.disposition or 'Pending' }}<br/>
    <a href="{{ c.pacer_url }}">Source</a>
  </div>
  {% endfor %}
</body></html>
""")

for subject, s in summary.items():
    # Hash the full result set so the audit row can later prove exactly
    # what was returned for this subject at this time.
    query_hash = hashlib.sha256(
        json.dumps(s, default=str, sort_keys=True).encode()
    ).hexdigest()[:16]
    audit(operator="steve@example.com", subject=subject,
          query_hash=query_hash, n_hits=s["total"])

    s["cases"].sort(key=lambda c: RISK_ORDER[c["risk_level"]])  # high risk first

    html = REPORT_TEMPLATE.render(
        subject=subject, s=s,
        jurisdictions=["federal", "NY", "DE", "CA"],
        operator="steve@example.com",
        generated_at=datetime.datetime.now(datetime.timezone.utc).isoformat(),
    )
    HTML(string=html).write_pdf(f"report_{subject.replace(' ', '_')}.pdf")
    print(f"Wrote report for {subject}: {s['total']} cases, {s['high']} high-risk")
```
The audit log is the piece most diligence pipelines skip and regret. For any case that might later need to support a decision, you want a record of exactly what was queried, when, by whom, and what was returned — ideally hashed so you can prove non-tampering later.
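Verifying that record later is one query plus one recomputation. A minimal sketch against the diligence_audit table and AUDIT_DB connection defined above, assuming you also archived the raw per-subject result set at search time:

```python
import hashlib
import json

def verify_audit_row(subject: str, archived_results: dict) -> bool:
    """Recompute the hash over the archived result set and compare it to
    the hash written to the audit log when the search originally ran."""
    recomputed = hashlib.sha256(
        json.dumps(archived_results, default=str, sort_keys=True).encode()
    ).hexdigest()[:16]
    with AUDIT_DB.cursor() as cur:
        cur.execute(
            "SELECT query_hash FROM diligence_audit "
            "WHERE subject_name = %s ORDER BY ts DESC LIMIT 1",
            (subject,),
        )
        row = cur.fetchone()
    return row is not None and row[0] == recomputed
```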
Use cases
1. M&A legal diligence. An acquirer running a $40M deal pulled 8 years of litigation history for the target plus its officers, found an undisclosed wage-and-hour class action, and renegotiated escrow terms.
2. Vendor KYC. A fintech onboards 50 merchant partners/month. The court-records search is automated, and any hit triggers human review before signing.
3. Journalism / investigative research. A reporter investigating a local politician pulls state and federal court records to build a timeline of civil disputes.
4. Pre-hire screening (regulated, FCRA-compliant). Background-check vendors pull criminal and civil records with signed consent, using the Apify dataset for structured data, not just screenshots.
5. Insurance underwriting pre-check. A commercial insurance broker runs court records on prospective business clients before quoting. A pattern of prior employment claims against a restaurant chain, for example, informs the EPLI premium. The broker reports catching approximately 18% of material risk indicators that would otherwise have surfaced only after a claim.
6. Investor diligence on founders. A family office doing later-stage venture deals runs every founder on a target cap table through the pipeline. It has twice killed deals based on undisclosed prior fraud-adjacent litigation that would have been embarrassing to surface post-investment. The cost per founder checked is about $0.40; the cost of a blown-up investment is in the millions.
Pricing comparison
| Service | Monthly cost (200 subjects) | Federal + state? | Structured JSON? |
|---|---|---|---|
| PACER direct | ~$60 page fees + eng time | Federal only | No (HTML) |
| Trellis.law | $399+/mo | State-focused | Yes |
| CourtListener | Free API (limited) | Federal only | Yes |
| LexisNexis / Westlaw | $500+/user/mo | Yes | Yes |
| Apify actor | ~$30 | Yes | Yes |
Pay-per-search pricing is the killer feature for occasional users who do not need a $500/mo legal database seat.
Common pitfalls
Court-records research at scale is a surprisingly nuanced domain. These are the pitfalls that matter:
- FCRA compliance is your problem. The data is public, but using it for employment or credit decisions has legal requirements: written consent, adverse-action notices, dispute procedures, and usually a licensed CRA. The Apify actor gives you structured data. Turning that data into a hiring decision without proper process is where the liability is. Consult counsel, especially for anything touching employment.
- PACER fee caps. PACER waives all fees if you accrue less than $30 in a quarter; above that threshold, every page bills, so budget accordingly for federal-heavy searches. Note that PACER reform proposals in 2024-2025 may move toward free access for most docket metadata; keep an eye on the federal calendar.
- Seal / redaction. Sealed and juvenile cases are not returned. Absence of a record does not equal exoneration — it may equal successful expungement. Your report templates should explicitly state what was searched and what was excluded by design.
- Recency lag. County systems can be 2-4 weeks behind real filings. Federal PACER updates within hours of filing. If you are doing due diligence with a critical cutoff date, schedule a re-scrape close to signing to catch recent filings.
- Common-name problem. "John Smith" in Texas will return hundreds of matches. Any search without DOB, middle name, or location context is essentially useless for disambiguation. Collect at least two identifiers before running.
- Corporate-entity name drift. "Acme Holdings LLC" may be the parent, but the lawsuits are against "Acme Operating Co." or "Acme Acquisition Sub 7." Cross-reference with corporate registries (e.g., Delaware Secretary of State, OpenCorporates) to enumerate related entities before searching.
- Inbound vs. outbound litigation. A case where your subject is the plaintiff is very different from a case where they are the defendant. Early-stage scripts sometimes conflate these and flag every case as a risk. Always filter by role (see the sketch after this list).
- Criminal record nuances. "Arrested" is not "convicted." "Charged" is not "guilty." "Pleaded guilty to a reduced charge" is different from "acquitted after trial." For criminal records specifically, the disposition field is more informative than the charge field.
- Class action participation. If your subject is a member of a class, they may appear as a "class member" in a large case but not be individually named. Whether that counts as "the subject has been involved in litigation" depends on your context. Most diligence treats class-member status differently from individually-named parties.
- Foreign jurisdictions. If the subject has operated internationally, US-only court searches miss UK court records, EU court records, and others. For cross-border diligence, add OFAC and international litigation sources.
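Here is the role-filter sketch promised above, reusing the parties array from the sample record. Subject-as-plaintiff cases are still worth listing in the report as context; they just should not feed the risk score:

```python
def subject_role(case: dict) -> str:
    """Return the subject's role in the case: 'plaintiff', 'defendant',
    or 'unknown' if the subject is not individually named."""
    subject = case["subject_query"].lower()
    for p in case["parties"]:
        if subject in p["name"].lower():
            return p["role"]
    return "unknown"

# Only defendant-side cases feed the risk score.
risk_relevant = [c for c in unique if subject_role(c) == "defendant"]
outbound      = [c for c in unique if subject_role(c) == "plaintiff"]
```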
How NexGenData handles this
Court records are the kind of domain where shortcuts in the actor cost the end user real money. A few specific design choices:
Unified federal + state + county schema. Every jurisdiction's output normalizes to the same JSON shape. You write one parser, not fifty.
Entity-type-aware search. Pass type: "entity" for companies and type: "person" for individuals, and the actor routes queries differently. Entity searches cross-reference with Secretary of State data; person searches leverage DOB and location filters.
Proxy and CAPTCHA handling. State and county systems that throw CAPTCHAs are handled via residential proxies and CAPTCHA-solving middleware. You rarely need to intervene.
Disposition normalization. Raw court systems use wildly different disposition language ("Dismissed with prejudice", "Disposed — Settled", "Discontinued"). The actor normalizes to a canonical set (settled, judgment-for-plaintiff, judgment-for-defendant, dismissed, pending) while preserving the raw string.
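The actor does this for you, but if you are merging in raw disposition strings from another source, a minimal keyword-based fallback might look like this (the mapping rules are illustrative, not exhaustive):

```python
def normalize_disposition(raw: str | None) -> str:
    """Crude keyword mapping from raw court language to the canonical set.
    Always keep the raw string alongside the normalized value."""
    if not raw:
        return "pending"
    text = raw.lower()
    if "settl" in text:                            # 'Settled', 'Disposed — Settled'
        return "settled"
    if "judgment" in text and "plaintiff" in text:
        return "judgment-for-plaintiff"
    if "judgment" in text and "defendant" in text:
        return "judgment-for-defendant"
    if "dismiss" in text or "discontinu" in text:  # 'Dismissed with prejudice'
        return "dismissed"
    return "pending"
```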
Incremental re-search. Run the same subject weekly and the actor returns only new/changed cases. For ongoing monitoring (litigation watch), this keeps costs bounded.
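Even without the actor's incremental mode, a client-side diff keeps a weekly litigation watch cheap. A minimal sketch that compares this week's case numbers against a snapshot from previous runs (the known_cases.json path is arbitrary):

```python
import json
from pathlib import Path

SNAPSHOT = Path("known_cases.json")  # persisted between runs

def diff_new_cases(cases: list[dict]) -> list[dict]:
    """Return only cases never seen in a previous run, then update the
    snapshot so next week's run diffs against this one."""
    known = set(json.loads(SNAPSHOT.read_text())) if SNAPSHOT.exists() else set()
    def key(c):
        return f'{c["jurisdiction"]}:{c["case_number"]}'
    new = [c for c in cases if key(c) not in known]
    SNAPSHOT.write_text(json.dumps(sorted(known | {key(c) for c in cases})))
    return new

new_cases = diff_new_cases(cases)
if new_cases:
    print(f"{len(new_cases)} new filings since last run -- review needed")
```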
Audit-ready output. Every result includes source URL, retrieval timestamp, and a content hash. Perfect for FCRA audit trails and for proving a given case was retrieved on a given date.
Pay-per-result. 200 subjects with full federal + state search typically costs $25-40. Compare to Trellis ($399+/month seat) or LexisNexis ($500+/user/month).
Conclusion
Court-records research used to require a paralegal or a $500/mo legal database. In 2026, with pay-per-result APIs, it becomes a normal scripted step in any diligence pipeline. The main constraint is now legal compliance, not data access — which is the right problem to have.
Three actors to start with:
- Court Records Search — federal + state + county case search.
- Domain WHOIS Lookup — cross-reference corporate web presence.
- AP News Scraper — layer in news coverage of subjects for context.
FAQ
Is scraping court records legal?
Court records are public information by design. Accessing them via the court's own systems or via API aggregators is generally legal. Using them for FCRA-regulated purposes (employment, credit, insurance, housing) requires a compliant process and often a licensed CRA. Using them for non-FCRA purposes (M&A diligence, journalism, KYC, litigation research) is broadly permissible. Consult counsel for your specific use case.
How does this compare to PACER direct?
PACER is the authoritative federal source with per-page fees ($0.10/page, capped at $3.00/document). The Apify actor uses PACER and other sources and normalizes output. You pay Apify's per-result fee and avoid the overhead of managing a PACER account. For very high federal-only volumes, direct PACER access may be cheaper; for mixed federal + state, the actor wins on total cost.
What about CourtListener?
CourtListener (from the Free Law Project) is an excellent free resource for federal opinions and PACER-adjacent docket metadata. Its coverage of state court data is limited. The actor complements CourtListener by adding state and county coverage.
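For federal work, CourtListener's free REST API is easy to wire in as a cross-check. A minimal sketch; the endpoint and parameters reflect the public docs at the time of writing (v4, with type "r" searching RECAP dockets), so confirm the current API version before relying on it:

```python
import requests

# Anonymous requests are rate-limited; a free API token raises the limit.
resp = requests.get(
    "https://www.courtlistener.com/api/rest/v4/search/",
    params={"q": '"Acme Holdings"', "type": "r"},  # type "r" = RECAP dockets
    timeout=30,
)
resp.raise_for_status()
for result in resp.json().get("results", []):
    print(result.get("caseName"), result.get("docketNumber"))
```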
How fresh is the data?
Federal PACER: near-real-time, typically within hours of filing. State courts: usually 1-7 days behind. County courts: 1-4 weeks behind, highly variable. If you need real-time monitoring, schedule a re-scrape cadence that matches the worst-case jurisdiction you care about.
Can I get full case documents (complaints, motions, judgments)?
The actor returns docket metadata and document URLs. Fetching full-text documents typically incurs a separate PACER fee (for federal) or a document-retrieval fee for state systems. For bulk document retrieval, the Free Law Project's RECAP archive is a free alternative: documents contributed by the community can be re-downloaded without fees.
What about non-US court records?
The actor focuses on US jurisdictions. For UK court records, use Courts and Tribunals Judiciary or Judiciary.uk. For EU, ECLI-indexed searches vary by country. Cross-border diligence typically combines multiple national sources.
Do I need an attorney to interpret these results?
For high-stakes decisions, yes. A paralegal or attorney can distinguish nuisance claims from material ones, weigh the significance of a settlement vs. a verdict, and identify omissions. The actor surfaces data; humans still provide judgment.
How do I handle false positives (wrong person with the same name)?
Always collect at least two identifying data points (DOB, city, employer) before running a search. Tag any match without a strong secondary identifier as "unconfirmed" in your output, and require human review before treating it as a hit.