Heartlin Machado

Posted on Jun 28

How I Built a RAG System Over More Than 100 USCIS Administrative Appeals Office Decisions with Gemini

#buildwithgemini #ai #immigration #rag

USCIS denial rates for EB-1A petitions nearly doubled in one year - from 25.6% to 46.6%. NIW denial rates hit 64.3%. Immigration attorneys charge $5,000 to $15,000 for case preparation that most applicants can't afford.

I'm building PetitionIQ, an immigration case preparation platform that analyzes visa petitions the way USCIS actually reviews them. The core of the platform is a RAG pipeline over 107 real USCIS Administrative Appeals Office (AAO) non-precedent decisions - not generic legal knowledge, not LLM training data, but actual adjudication outcomes with full provenance.

This post walks through every design decision in the RAG system: why the corpus is biased and how I handle it, why category isolation matters more than you'd think, and how hybrid retrieval with hard filters prevents the kind of cross-contamination that makes legal AI dangerous.

Why AAO decisions?

The Administrative Appeals Office publishes non-precedent decisions on uscis.gov. These are real adjudication outcomes: cases where someone filed a petition (Form I-140 for the EB categories, Form I-129 for O-1A), got denied, and appealed. The AAO either sustained the appeal (overturned the denial), dismissed it (upheld the denial), or remanded it (sent it back for further review).

This corpus is valuable because it shows exactly how USCIS evaluates evidence for each criterion. Not what the law says in the abstract, but how officers actually apply it to real cases. When the AAO writes "the petitioner's three publications in field-specific journals, while commendable, do not establish that the beneficiary's work constitutes original contributions of major significance," that's a data point no amount of LLM training captures.

But the corpus has a fundamental problem.

The corpus bias problem

AAO decisions are appeals of denials. Clean approvals never appear in this dataset. If someone filed an EB-1A petition and got approved, there's no AAO record of it.

This means the corpus is selection-biased toward rejection. If I built a system that naively learned from this data, it would conclude that almost nothing gets approved - because it only sees the cases that didn't.

Design decision: PetitionIQ never outputs approval probabilities.

No "you have a 73% chance of approval." No "based on similar cases, your likelihood is high." The system uses strength indicators (strong, moderate, weak) and cites specific AAO decisions to explain why evidence does or doesn't meet a particular criterion. Every response includes a corpus bias disclosure explaining that the AAO corpus only contains appeals of denials.

This is not a limitation I'm hiding. It's a design constraint I'm highlighting. The honest thing to do with biased data is to be transparent about the bias, not to paper over it with false confidence.

The crawl pipeline

The AAO publishes decisions as PDFs on uscis.gov, organized by category and year. The crawler is a polite, rate-limited scraper that:

Discovers PDFs via directory listings on the USCIS website
Falls back to candidate URL probing when directory listings aren't available (AAO filenames follow predictable patterns like JAN162026_01B2203.pdf)
Downloads each PDF with a 2-second rate limit between requests
Extracts text using pdfplumber
Maintains an idempotent manifest so re-runs don't re-download

The current corpus: 107 decisions across 4 visa categories (EB-1A: 44, EB-2 NIW: 54, EB-1B: 4, O-1A: 5), totaling 262,778 words.

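import time

# Polite rate limiting
class RateLimiter:
    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval  # minimum seconds between requests
        self._last_request = 0.0          # timestamp of the previous request

    def wait(self):
        # Sleep only for the part of the interval that has not already elapsed
        elapsed = time.time() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.time()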
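
The idempotent manifest from the list above could be sketched like this. It is a minimal illustration, not the crawler's actual code: the `Manifest` class, the `crawl` function, the JSON layout, and the `download` callback are all hypothetical names chosen for the example.

```python
import json
from pathlib import Path


class Manifest:
    """Tracks which PDFs were already downloaded so re-runs can skip them."""

    def __init__(self, path):
        self.path = Path(path)
        # Load prior state if a manifest file already exists.
        if self.path.exists():
            self.entries = json.loads(self.path.read_text())
        else:
            self.entries = {}

    def is_done(self, url):
        return url in self.entries

    def record(self, url, local_path):
        self.entries[url] = {"local_path": str(local_path)}
        # Write to a temp file, then rename, so a crash cannot leave
        # a half-written manifest behind.
        tmp = self.path.with_name(self.path.name + ".tmp")
        tmp.write_text(json.dumps(self.entries, indent=2, sort_keys=True))
        tmp.replace(self.path)


def crawl(urls, manifest, download):
    """Download only URLs the manifest has not seen; return the new ones."""
    fetched = []
    for url in urls:
        if manifest.is_done(url):
            continue
        local_path = download(url)  # caller supplies the rate-limited downloader
        manifest.record(url, local_path)
        fetched.append(url)
    return fetched
```

Because the manifest is saved after every download, an interrupted run resumes where it stopped instead of starting over.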

Gemini structured extraction

Raw AAO decision text is messy. Different officers write differently, formatting varies, and the same criterion can be discussed across multiple sections of a decision. I use Gemini 2.5 Flash to extract structured data from each decision:

Category (EB-1A, EB-1B, EB-2 NIW, O-1A)
Outcome (sustained, dismissed, remanded)
Criteria findings - which criteria were claimed, which were met, what the AAO's reasoning was for each
Field of endeavor - what field the petitioner worked in
Confidence score - how confident the extraction is

The extraction uses JSON response mode with a strict Pydantic schema. Decisions that fail validation (usually because Gemini returned null for a required boolean field) get quarantined rather than included with bad data.

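from google import genai
from google.genai import types

client = genai.Client()  # reads credentials and project settings from the environment

# prompt holds the extraction instructions plus the decision text (built elsewhere)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt,
    config=types.GenerateContentConfig(
        temperature=0.1,  # Low temperature for factual extraction
        response_mime_type="application/json",
    ),
)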

Out of 107 decisions, 93 extracted successfully and 14 were quarantined. The quarantined decisions were predominantly NIW cases where the Dhanasar prong analysis didn't map cleanly to the schema. I'd rather lose 13% of the corpus than include bad extractions.
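
The quarantine step can be illustrated with a standard-library stand-in for the Pydantic schema. This is a sketch under assumptions: the field names (`category`, `outcome`, `criteria_findings`, `met`) and the `validate_extraction` and `partition` helpers are hypothetical, and a hand-rolled check replaces Pydantic so the example runs without extra dependencies.

```python
import json

VALID_OUTCOMES = {"sustained", "dismissed", "remanded"}


def validate_extraction(raw_json):
    """Return (record, errors). An empty error list means the record is usable."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["top-level value is not an object"]

    errors = []
    if not isinstance(record.get("category"), str):
        errors.append("category: expected str")
    if record.get("outcome") not in VALID_OUTCOMES:
        errors.append("outcome: expected sustained, dismissed, or remanded")

    findings = record.get("criteria_findings")
    if not isinstance(findings, list):
        errors.append("criteria_findings: expected list")
    else:
        for i, finding in enumerate(findings):
            # A null where a boolean is required is the failure mode described above.
            if not isinstance(finding, dict) or not isinstance(finding.get("met"), bool):
                errors.append(f"criteria_findings[{i}].met: expected bool")
    return record, errors


def partition(raw_by_id):
    """Split raw model outputs into accepted records and quarantined errors."""
    accepted, quarantined = {}, {}
    for decision_id, raw in raw_by_id.items():
        record, errors = validate_extraction(raw)
        if errors:
            quarantined[decision_id] = errors
        else:
            accepted[decision_id] = record
    return accepted, quarantined
```

Keeping the error list next to each quarantined decision makes it easy to see later whether a schema change would recover those records.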

Why category isolation matters

This is the design decision I think is easiest to get wrong in legal AI.

O-1A (a temporary visa for extraordinary ability in the sciences, education, business, or athletics) and EB-1A (extraordinary ability for a green card) share almost identical criteria text. Both reference "awards," "published material," "original contributions," etc. But they apply different legal standards. O-1A uses a "distinction" standard. EB-1A uses a higher "sustained national or international acclaim" standard. The same evidence that satisfies O-1A may not satisfy EB-1A.

If your retrieval system returns EB-1A reasoning when a user asks about O-1A, the analysis is wrong even though the text looks relevant. The criteria names match, the evidence types match, but the legal standard is different.

Design decision: category is a hard filter, not a soft signal.

When a user requests analysis for O-1A, the retrieval system only returns chunks tagged as O-1A. Zero EB-1A chunks leak through, regardless of semantic similarity.

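def retrieve(query, category, top_k=10, store=None):
    # Fail loudly instead of hitting an AttributeError on None below
    if store is None:
        raise ValueError("retrieve() requires a chunk store")
    # Hard filter: only chunks matching the requested category
    category_chunks = [
        c for c in store.chunks
        if c.category == category
    ]
    # Semantic search only within the filtered set
    # (semantic_search is the embedding-ranking helper, defined elsewhere)
    results = semantic_search(query, category_chunks, top_k)
    return results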

I wrote an eval test that specifically checks for cross-category leakage. It queries for O-1A criteria and verifies that zero EB-1A chunks appear in the results. This test runs on every build.

[PASS] category_leakage - Zero cross-category contamination
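
A leakage check of this kind might look roughly like the following. It is a sketch, not the project's eval code: `Hit`, `find_leaks`, and `check_category_leakage` are hypothetical names, and the retriever is passed in as a plain function so the check can run against a fake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    category: str


def find_leaks(results, requested_category):
    """Return every result whose category differs from the one requested."""
    return [r for r in results if r.category != requested_category]


def check_category_leakage(retrieve_fn, queries, category):
    """Run each query through retrieve_fn and collect any off-category hits."""
    leaks = []
    for query in queries:
        leaks.extend(find_leaks(retrieve_fn(query, category), category))
    return leaks


# Toy index and a retriever that applies the hard filter.
INDEX = [Hit("a1", "O1A"), Hit("a2", "O1A"), Hit("b1", "EB1A")]


def filtered_retriever(query, category):
    return [h for h in INDEX if h.category == category]


leaks = check_category_leakage(filtered_retriever, ["awards criterion"], "O1A")
print("[PASS]" if not leaks else "[FAIL]", "category_leakage")
```

Returning the leaked chunks, rather than a bare boolean, means a failing build shows exactly which chunk crossed the boundary.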

Chunking by criterion, not by token window

Most RAG tutorials chunk by fixed token windows: 500 tokens with a 100-token overlap. That approach breaks down on legal documents.

AAO decisions are structured around criteria. An officer evaluates the "Awards" criterion in one section, the "Original Contributions" criterion in another. Cutting a chunk in the middle of a criterion analysis breaks the reasoning unit.

PetitionIQ chunks by criterion section. Each chunk represents one complete piece of legal reasoning about one criterion from one decision. The chunks carry full metadata:

from dataclasses import dataclass

@dataclass
class Chunk:
    id: str              # unique chunk ID
    text: str            # the reasoning text
    category: str        # EB1A, EB1B, EB2_NIW, O1A
    corpus: str          # "case" or "authority"
    decision_id: str     # source AAO decision
    criterion_id: str    # regulatory citation
    criterion_name: str  # controlled vocabulary name
    outcome: str         # sustained/dismissed/remanded
    field_of_endeavor: str
    source_ref: str      # citation reference

The current index has 361 chunks (327 case chunks + 34 authority chunks from regulatory text).
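
To make criterion-level chunking concrete, here is one way it could be approximated with a heading-based splitter. The heading vocabulary, the optional "A." style prefix, and the `split_by_criterion` function are assumptions made for this example; real AAO decisions label their sections less uniformly, so a production splitter would need more than a regex.

```python
import re

# Hypothetical heading vocabulary for illustration only.
CRITERION_HEADINGS = ["Awards", "Published Material", "Original Contributions"]

# Matches a line that consists of an optional "A. " style prefix plus a known heading.
_HEADING_RE = re.compile(
    r"^(?:[A-Z]\.[ \t]+)?("
    + "|".join(re.escape(h) for h in CRITERION_HEADINGS)
    + r")[ \t]*$",
    re.MULTILINE,
)


def split_by_criterion(decision_text):
    """Return [(criterion_name, section_text)], one entry per heading found."""
    matches = list(_HEADING_RE.finditer(decision_text))
    sections = []
    for i, match in enumerate(matches):
        start = match.end()
        # A section runs until the next criterion heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(decision_text)
        body = decision_text[start:end].strip()
        if body:
            sections.append((match.group(1), body))
    return sections


sample = """Background text that belongs to no criterion.
A. Awards
The petitioner submitted two regional prizes.
They are not nationally recognized.
B. Original Contributions
The record does not show major significance.
"""

for name, body in split_by_criterion(sample):
    print(name, "->", len(body.split()), "words")
```

Text before the first heading is dropped in this sketch; whether background sections deserve their own chunks is a separate design choice.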

Hybrid retrieval: cosine + TF-IDF + RRF

Pure semantic search misses important legal terminology. When a user asks about "Kazarian two-step analysis," semantic similarity might rank a chunk about "evaluation framework" higher than one that literally mentions Kazarian. Pure keyword search misses semantic meaning. A question about "impact of research on the field" should match chunks about "original contributions of major significance" even though the exact words don't overlap.

PetitionIQ uses hybrid retrieval:

Cosine similarity over gemini-embedding-001 embeddings (3072 dimensions) for semantic matching
TF-IDF for keyword matching with term weighting
Reciprocal Rank Fusion (RRF) to combine the two ranked lists into a single result

def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = {}
    for ranked_list in ranked_lists:
        for rank, (chunk_id, _) in enumerate(ranked_list):
            if chunk_id not in scores:
                scores[chunk_id] = 0.0
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

RRF is simple and it works. It doesn't require tuning weights between semantic and keyword scores, and it's robust to score distribution differences between the two methods.
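
A toy end-to-end run shows how the three pieces interact. This is a self-contained sketch under assumptions: the two-dimensional vectors stand in for real embeddings, the whitespace tokenizer and smoothed IDF formula are simplifications, and `rank_by_cosine` and `rank_by_tfidf` are hypothetical helper names. Only `reciprocal_rank_fusion` mirrors the function above.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_cosine(query_vec, vectors):
    """vectors: {chunk_id: embedding}. Returns [(chunk_id, score)], best first."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in vectors.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)


def rank_by_tfidf(query, texts):
    """texts: {chunk_id: text}. Chunks with no matching query term are omitted."""
    docs = {cid: Counter(text.lower().split()) for cid, text in texts.items()}
    n = len(docs)
    scored = []
    for cid, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for term in set(query.lower().split()):
            df = sum(1 for c in docs.values() if term in c)
            if df and counts[term]:
                idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF
                score += (counts[term] / total) * idf
        if score > 0:
            scored.append((cid, score))
    return sorted(scored, key=lambda x: x[1], reverse=True)


def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = {}
    for ranked_list in ranked_lists:
        for rank, (chunk_id, _) in enumerate(ranked_list):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


texts = {
    "c1": "the kazarian two-step analysis requires a final merits determination",
    "c2": "the evaluation framework weighs the totality of the evidence",
    "c3": "the awards were regional in scope",
}
# Toy embeddings: c2 sits closest to the query vector, c1 second.
vectors = {"c1": [0.8, 0.6], "c2": [0.9, 0.1], "c3": [0.0, 1.0]}

semantic = rank_by_cosine([1.0, 0.0], vectors)
keyword = rank_by_tfidf("kazarian two-step analysis", texts)
fused = reciprocal_rank_fusion([semantic, keyword])
print([cid for cid, _ in fused])  # c1 first: it appears in both lists
```

Semantic search alone puts the "evaluation framework" chunk on top here, and fusion promotes the chunk that literally mentions Kazarian because it scores in both rankings.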

The authority corpus

In addition to case chunks, the retrieval system includes an authority corpus: 34 chunks of regulatory text, USCIS Policy Manual excerpts, and key precedent decision summaries (Kazarian v. USCIS, Dhanasar, Chawathe). These provide the legal framework that case chunks are interpreted against.

Authority chunks are always included in retrieval results alongside case chunks. The generator uses both to produce grounded analysis: "Under the Kazarian two-step framework [authority], the AAO in [decision_id] found that..."
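
One possible way to assemble the generator's context from both corpora is sketched below. The `ContextChunk` type is a trimmed-down, hypothetical version of the chunk metadata, and the bracketed tag format in `build_context` is an assumption for illustration, not the prompt format PetitionIQ actually uses.

```python
from dataclasses import dataclass


@dataclass
class ContextChunk:
    text: str
    corpus: str       # "case" or "authority"
    decision_id: str  # left empty for authority chunks in this sketch
    source_ref: str


def build_context(chunks):
    """Format chunks for the prompt: authority first, each line tagged with its source."""
    authority = [c for c in chunks if c.corpus == "authority"]
    cases = [c for c in chunks if c.corpus == "case"]
    lines = [f"[authority: {c.source_ref}] {c.text}" for c in authority]
    lines += [f"[{c.decision_id}] {c.text}" for c in cases]
    return "\n".join(lines)
```

Tagging every line with its source gives the model a literal string to cite, which makes the citation rules easier to enforce afterwards.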

Generation with citations

Every claim in the generated analysis cites a specific source. Not "based on AAO precedent" but "[AAO-JAN162026_01B2203]" with a clickable link to the original PDF on uscis.gov.

The generation prompt is strict about this:

Every factual claim must reference a retrieved chunk
No approval probabilities
Corpus bias disclosure on every response
If the evidence is insufficient for a conclusion, say so
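
Prompt rules like these can also be checked after generation. The sketch below assumes citations follow the bracketed `[AAO-...]` form shown above and uses a naive sentence splitter; `unknown_citations` and `uncited_sentences` are hypothetical helpers, not part of the published pipeline.

```python
import re

# Assumes citation IDs look like AAO- followed by capitals, digits, or underscores.
CITATION_RE = re.compile(r"\[(AAO-[A-Z0-9_]+)\]")


def unknown_citations(answer, retrieved_ids):
    """Citations in the answer that do not match any retrieved decision."""
    cited = set(CITATION_RE.findall(answer))
    return sorted(cited - set(retrieved_ids))


def uncited_sentences(answer):
    """Sentences that carry no citation at all (naive split on ., !, ?)."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [s for s in sentences if s and not CITATION_RE.search(s)]
```

The first check catches invented citations and the second catches unsupported claims. Sentences that are meant to be uncited, such as the corpus bias disclosure, would need an allowlist.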

The eval suite

Four tests run on every build:

category_leakage - Query O-1A, verify zero EB-1A chunks in results
probability_leak - Generate a response and verify no approval probability language appears
probability_patterns - Test that the pattern detector catches probability language when it exists
retrieval_recall - Verify that relevant chunks are actually retrieved for known queries

[PASS] category_leakage     - Zero cross-category contamination
[PASS] probability_leak     - No approval odds language detected
[PASS] probability_patterns - Banned patterns correctly caught
[PASS] retrieval_recall     - Relevant chunks retrieved for all queries

All four passing on the current 361-chunk index.
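
The probability checks come down to a pattern detector plus a test that the detector itself works. A minimal sketch follows, with an assumed and deliberately small pattern list; the real banned-pattern set is not shown in this post, so these regexes and the `find_probability_language` name are illustrative only.

```python
import re

# Illustrative patterns only; a real list would be broader.
BANNED_PATTERNS = [
    # "73% chance", "40 % likelihood"
    re.compile(r"\b\d{1,3}(?:\.\d+)?\s*%\s*(?:chance|likelihood|probability|odds)", re.I),
    # "chance of approval", "odds of success"
    re.compile(r"\b(?:chance|likelihood|probability|odds)\s+of\s+(?:approval|success)\b", re.I),
    # "your likelihood is high"
    re.compile(r"\blikelihood\s+is\s+(?:high|low|strong|good)\b", re.I),
]


def find_probability_language(text):
    """Return every banned phrase found in the text (empty list if clean)."""
    hits = []
    for pattern in BANNED_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits


# Detector self-check: known-bad strings must be caught, clean text must pass.
assert find_probability_language("You have a 73% chance of approval.")
assert find_probability_language("Based on similar cases, your likelihood is high.")
assert not find_probability_language("The awards evidence is weak for this criterion.")
```

Testing the detector against known-bad strings matters because a detector that silently matches nothing would make the leak test pass for the wrong reason.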

What this enables

The RAG system powers PetitionIQ's deep analysis feature. When a user runs a deep analysis for their visa category, the system:

Embeds the query with gemini-embedding-001
Retrieves the top chunks via hybrid search (hard-filtered to the user's category)
Passes retrieved chunks + authority corpus to Gemini 2.5 Flash
Generates per-criterion analysis with AAO decision citations
Includes corpus bias disclosure

The entire pipeline runs on Google Cloud: Vertex AI for Gemini calls and embeddings, Cloud Run for the FastAPI backend, Firestore for persistence.

What I learned

Bias transparency beats bias mitigation. I spent time trying to "correct" for the selection bias in the AAO corpus before realizing the honest approach is to just tell the user about it. Every response says "this analysis is based on AAO appeal decisions, which only include cases that were denied and appealed. Approval patterns are not represented."

Hard filters beat soft signals for safety-critical retrieval. In legal analysis, returning the wrong category's reasoning isn't a "less relevant" result - it's an actively misleading one. Hard category filtering with eval tests is the only approach I trust.

Chunk by reasoning unit, not by token count. Legal reasoning has natural boundaries. Respect them.

Start with the eval suite. I wrote the four eval tests before building the retrieval system. They defined the contract the system had to satisfy. Every design decision was tested against them.

PetitionIQ is live at petitioniq.io. Free multi-visa analysis across 5 categories. The RAG-powered deep analysis, pre-submit consistency audit, document generation, and RFE response module are available with paid plans.

Built entirely on Gemini 2.5 Flash + gemini-embedding-001 + Google Cloud Run + Firestore for the Build with Gemini XPRIZE.

The full codebase is at github.com/4KInc/petitioniq.

Top comments (1)

Tae Kim • Jun 29

The corpus bias disclosure design decision is the right call, but the framing around selection bias opens a useful distinction: adjudication corpus (how rules get applied in contested cases) versus normative corpus (what the rules actually say). Working with a similar regulatory setup — banking regulation PDFs where enforcement records skew toward violations — the separation that mattered most was keeping those two corpora isolated and routing queries to the appropriate one based on intent. A "what does criterion X require" question and a "how has USCIS applied criterion X in practice" question need different source sets even when the surface answer looks similar.