
Ashutosh Rana
FERPA Compliance in RAG Pipelines: Five Rules Your Enterprise System Probably Breaks

If you are building a retrieval-augmented generation system for a higher-education institution, your pipeline is probably violating FERPA. Not because you meant to, but because the standard RAG tutorial pattern and the regulated record-access pattern are fundamentally different — and most documentation does not explain where they diverge.

This post covers five rules that most enterprise RAG implementations break, and what the correct pattern looks like for each.

What FERPA requires from a retrieval system
FERPA (Family Educational Rights and Privacy Act, 20 U.S.C. § 1232g; implementing regulations at 34 CFR Part 99) governs access to education records at institutions that receive federal funding. The relevant requirement for a RAG pipeline is simple: a student's education records must not be accessible to another student or to an unauthorized third party.

In a vector store-backed RAG system, "accessible" means more than whether the LLM produces the record in its response. It means whether the record enters the retrieval pipeline at all. A document that is retrieved, ranked, and then discarded by a post-filter has still been surfaced to a process that handles data for a different user. Under a strict reading of FERPA's minimum-disclosure principle — and under any reasonable security posture — that is not acceptable.

Rule 1: Filter before ranking, not after

What most systems do: Retrieve the top-k documents from the vector store based on semantic similarity, then apply a metadata filter to remove documents that belong to the wrong student.

Why this breaks FERPA: The unauthorized documents are scored, ranked, and processed by the retrieval pipeline before being discarded. If the post-filter has a defect — a misconfigured field name, a missing metadata key, a swallowed exception — the unauthorized content reaches the LLM context window. The failure mode is silent and the blast radius is wide.

The correct pattern: Apply the identity constraint as a metadata pre-filter on the vector store query. The query should not return unauthorized documents — they should not exist in the candidate set.

Wrong — retrieve all, then filter

all_docs = vector_store.similarity_search(query, k=20)
authorized = [d for d in all_docs if d.metadata["student_id"] == session.student_id]

Correct — filter at query time

authorized = vector_store.similarity_search(
    query,
    k=20,
    filter={"student_id": session.student_id, "institution_id": session.institution_id},
)
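Most vector stores support metadata filtering natively. Pinecone, Weaviate, Qdrant, and Chroma accept a filter expression as part of the query, and pgvector filters through an ordinary SQL WHERE clause. Use them, and check your store's documentation to confirm the filter is applied before ranking rather than to the ranked results.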
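
To make the difference concrete, here is a minimal, self-contained sketch built on a toy in-memory store. `ToyVectorStore`, its `search` method, the `Doc` class, and the sample data are all assumptions made up for illustration; they are not the API of any real vector store. The store records which documents it scored, so the two approaches can be compared directly.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store that remembers which docs it scored."""

    def __init__(self, docs):
        self.docs = docs
        self.scored = []  # every doc that entered ranking on the last search

    def search(self, query_vec, k, filter=None):
        # Pre-filter: restrict the candidate set BEFORE any scoring happens.
        candidates = [
            d for d in self.docs
            if filter is None
            or all(d.metadata.get(key) == val for key, val in filter.items())
        ]
        self.scored = list(candidates)
        ranked = sorted(
            candidates, key=lambda d: cosine(query_vec, d.embedding), reverse=True
        )
        return ranked[:k]


docs = [
    Doc("B transcript", [1.0, 0.0], {"student_id": "B", "institution_id": "U1"}),
    Doc("B aid letter", [0.9, 0.1], {"student_id": "B", "institution_id": "U1"}),
    Doc("A transcript", [0.7, 0.3], {"student_id": "A", "institution_id": "U1"}),
]
query_vec = [1.0, 0.0]

# Post-filter: rank everything, then drop other students' documents.
store = ToyVectorStore(docs)
top = store.search(query_vec, k=2)
post_filtered = [d for d in top if d.metadata["student_id"] == "A"]
post_scored_owners = {d.metadata["student_id"] for d in store.scored}

# Pre-filter: only student A's documents are ever candidates.
store = ToyVectorStore(docs)
pre_filtered = store.search(
    query_vec, k=2, filter={"student_id": "A", "institution_id": "U1"}
)
pre_scored_owners = {d.metadata["student_id"] for d in store.scored}
```

With this toy data, the post-filter path scores both students' documents and still ends up with an empty context, because student B's documents crowd student A's out of the top 2. The pre-filter path scores only student A's document and returns it.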

Rule 2: Filter on institution_id, not just student_id

What most systems do: Filter by student_id.

Why this breaks FERPA: A single vector store may hold records from multiple institutions, either in a multi-tenant deployment or after institutional transfers. A student_id that is unique within Institution A may collide with a record at Institution B if the namespace is not isolated. More fundamentally, a student authorized to access their own records at Institution A should not be able to retrieve records from Institution B's data, even if their student_id matches.

The correct pattern: Apply a compound AND filter: student_id == X AND institution_id == Y. Both conditions must be satisfied. A document that passes the student_id check but fails the institution_id check should be excluded.

filter={
    "$and": [
        {"student_id": {"$eq": session.student_id}},
        {"institution_id": {"$eq": session.institution_id}},
    ]
}
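
One way to keep the compound filter from silently degrading is to build it in a single function that fails closed when either identity value is missing. The sketch below assumes a hypothetical `SessionScope` container and a hypothetical `build_identity_filter` helper; `matches` is a toy evaluator for just the `$and`/`$eq` subset shown above, included only so the example runs without a real vector store. Operator syntax varies between stores, so treat the filter shape as illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionScope:
    """Hypothetical container for identity values set by the auth layer."""
    student_id: str
    institution_id: str


def build_identity_filter(scope: SessionScope) -> dict:
    # Fail closed: an empty value must never widen the query to "match anything".
    if not scope.student_id or not scope.institution_id:
        raise PermissionError("identity scope is incomplete; refusing to query")
    return {
        "$and": [
            {"student_id": {"$eq": scope.student_id}},
            {"institution_id": {"$eq": scope.institution_id}},
        ]
    }


def matches(metadata: dict, flt: dict) -> bool:
    """Toy evaluator for the $and/$eq subset used above (illustration only)."""
    return all(
        metadata.get(key) == cond["$eq"]
        for clause in flt["$and"]
        for key, cond in clause.items()
    )


records = [
    {"student_id": "1001", "institution_id": "inst_a", "doc": "A transcript"},
    # Same student_id at a different institution: an ID collision.
    {"student_id": "1001", "institution_id": "inst_b", "doc": "B transcript"},
]
flt = build_identity_filter(SessionScope("1001", "inst_a"))
visible = [r["doc"] for r in records if matches(r, flt)]
```

Here `visible` contains only the Institution A record, even though both records share the same student_id.
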
Rule 3: Enforce document categories as a second layer

What most systems do: Once the identity filter passes, all of the student's documents are fair game.

Why this breaks FERPA: Not all of a student's records are equally accessible. Treatment records made and maintained by a counseling or health professional, and used only in connection with treatment, may fall outside the definition of "education records" in 34 CFR § 99.3 in some contexts. Health records, disciplinary files, and financial aid records each have different access rules that vary by institution and applicable regulation.

More practically: even if the current retrieval is authorized, the category of document being retrieved should be logged separately. A financial aid query that incidentally surfaces a counseling note is retrieving the right student's record but the wrong type of record.

The correct pattern: After the identity pre-filter, apply a category authorization check. The authenticated session should carry a set of permitted document categories. Documents outside that set are excluded before they enter the context window.

session.allowed_categories = {"academic_record", "financial_record"}
authorized = [
    doc for doc in identity_filtered_docs
    if doc.metadata.get("category") in session.allowed_categories
]
This is the two-layer enforcement model. Layer 1 = identity boundary (who owns this document). Layer 2 = category authorization (what type of document is this, and is the session permitted to retrieve it).
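
Because the category of each retrieved document should be tracked separately, it can help for the category check to return what it rejected as well as what it allowed. The following is a sketch under stated assumptions: `authorize_categories` is a hypothetical helper, documents are plain dicts rather than objects with a `.metadata` attribute, and the category names are examples. A document with no category is rejected, never allowed by default.

```python
from collections import Counter


def authorize_categories(docs, allowed_categories):
    """Split docs into (allowed, rejected_counts); uncategorized docs are rejected."""
    allowed, rejected = [], Counter()
    for doc in docs:
        category = doc.get("metadata", {}).get("category")
        if category in allowed_categories:
            allowed.append(doc)
        else:
            # A missing category is counted under a sentinel label, never allowed.
            rejected[category or "<uncategorized>"] += 1
    return allowed, dict(rejected)


identity_filtered_docs = [
    {"text": "Fall GPA", "metadata": {"category": "academic_record"}},
    {"text": "Aid award", "metadata": {"category": "financial_record"}},
    {"text": "Session note", "metadata": {"category": "counseling_record"}},
    {"text": "Untagged upload", "metadata": {}},
]
allowed, rejected = authorize_categories(
    identity_filtered_docs, {"academic_record", "financial_record"}
)
```

The `rejected` counts give the audit layer something to record when a query brushes against a category the session is not permitted to see.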

Rule 4: Every retrieval event must produce an audit record

What most systems do: Log at the application level — a timestamped entry that a user made a query.

Why this breaks FERPA: 34 CFR § 99.32 requires institutions to maintain a record of each disclosure of education records. "Disclosure" includes allowing access to records, which includes retrieval by an AI pipeline. The audit record must capture who made the request, what was disclosed, the basis for disclosure, and the date. It must be accessible to the student for inspection.

An application-level log that records "user X made a query" does not satisfy this requirement. The log needs to record what records were accessed and on what authority.

The correct pattern: Produce a typed audit record for each retrieval event, keyed on session identity, containing the count of documents retrieved, the categories accessed, the policy version in effect, and the timestamp. Route this record to a durable, student-accessible store — not just an application log.

from datetime import datetime, timezone

audit_record = AuditRecord(
    student_id=session.student_id,
    institution_id=session.institution_id,
    documents_retrieved=len(raw_docs),        # documents returned by the store query
    documents_filtered=len(authorized_docs),  # documents left after authorization
    policy_version="v1.2",
    timestamp=datetime.now(timezone.utc),     # timezone-aware UTC timestamp
    requester_context={"session_id": session.id, "channel": session.channel},
)
audit_sink(audit_record)  # write to compliance database
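
For readers who want to see the moving parts without the library, here is a minimal sketch of a typed record and a sink. `RetrievalAuditRecord`, its fields, and `jsonl_sink` are hypothetical names chosen for this example; they are not the library's `AuditRecord` and are not a statement of what 34 CFR § 99.32 requires field by field. The in-memory buffer stands in for a durable, append-only store.

```python
import io
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class RetrievalAuditRecord:
    """Hypothetical record shape for one retrieval event (illustration only)."""
    student_id: str
    institution_id: str
    documents_returned: int
    categories_accessed: tuple
    policy_version: str
    basis: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def jsonl_sink(stream):
    """Return a sink that appends one JSON object per line to `stream`."""
    def write(record: RetrievalAuditRecord) -> None:
        stream.write(json.dumps(asdict(record), sort_keys=True) + "\n")
        stream.flush()
    return write


buffer = io.StringIO()  # stands in for a durable, append-only store
audit_sink = jsonl_sink(buffer)
audit_sink(RetrievalAuditRecord(
    student_id="1001",
    institution_id="inst_a",
    documents_returned=2,
    categories_accessed=("academic_record", "financial_record"),
    policy_version="v1.2",
    basis="eligible student accessing own records",
))
logged = json.loads(buffer.getvalue().splitlines()[0])
```

Making the record immutable and writing one line per event keeps the log easy to append to and hard to edit in place, which suits a disclosure record that a student may later inspect.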

Rule 5: The identity values must come from the session, not the query

What most systems do: Accept the student_id and institution_id as parameters in the API request, or extract them from user-supplied query text.

Why this breaks FERPA: If the filter values come from the request, an attacker (or a misconfigured agent) can supply a different student's ID and retrieve their records. This is not hypothetical: trusting a client-supplied identifier is the classic insecure direct object reference (IDOR) flaw, a well-documented form of broken access control.

The correct pattern: The student_id and institution_id used for filtering must come from the authenticated session token — not from the request body, not from the query, not from user input. The session token is verified by the authentication layer before it reaches the retrieval pipeline. The filter is constructed from the session; the user has no ability to influence it.

Wrong — accept from request

student_id = request.params["student_id"]

Correct — extract from verified session

session = verify_token(request.headers["Authorization"])
student_id = session.student_id # set by auth layer, not by user
institution_id = session.institution_id
This is not specific to FERPA — it is a basic authorization principle. In a RAG system it is easy to miss because most RAG tutorials treat the retrieval query as the only input and ignore the access control context entirely.
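
The sketch below shows the idea end to end with only the standard library. The token format (a base64 payload plus an HMAC-SHA256 signature), `issue_token`, this particular `verify_token`, `identity_filter_for`, and the hard-coded secret are all assumptions for illustration. A production system would normally rely on its existing authentication layer and would also check expiry, which this sketch omits.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-only-secret"  # assumption: real systems load this from a secret manager


def issue_token(claims: dict) -> str:
    """Sign claims so the retrieval layer can detect tampering."""
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"


def verify_token(token: str) -> dict:
    """Return the claims only if the signature matches; otherwise refuse."""
    try:
        payload, sig = token.rsplit(".", 1)
    except ValueError:
        raise PermissionError("malformed token")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks and str type errors.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        raise PermissionError("bad signature")
    return json.loads(base64.urlsafe_b64decode(payload))


def identity_filter_for(request_headers: dict, request_params: dict) -> dict:
    session = verify_token(request_headers["Authorization"])
    # request_params is deliberately ignored for identity values.
    return {
        "student_id": session["student_id"],
        "institution_id": session["institution_id"],
    }


token = issue_token({"student_id": "1001", "institution_id": "inst_a"})
# The caller tries to smuggle in another student's ID; it has no effect.
flt = identity_filter_for({"Authorization": token}, {"student_id": "9999"})
```

The filter is built entirely from the verified claims, so the `student_id` in the request parameters never reaches the vector store query.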

What a compliant pipeline looks like
Putting all five rules together, a FERPA-compliant RAG pipeline has this structure:

Authenticated session (student_id, institution_id, allowed_categories)
        ↓
Vector store pre-filter query
(metadata filter: student_id AND institution_id, applied at query time)
        ↓
Semantic ranking
(only authorized documents are scored)
        ↓
Category authorization check
(second enforcement layer: removes out-of-scope document types)
        ↓
Context assembly
        ↓
LLM call
        ↓
Audit record (34 CFR § 99.32)
(student_id, institution_id, documents retrieved, categories, timestamp)

Access is enforced twice before any document enters the LLM context window: the identity boundary at the vector store, then category authorization on the results. The audit record is produced for every retrieval event, regardless of whether the LLM produces a response.
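
One way to guarantee an audit record even when the LLM call fails is to emit it from a `finally` block. The sketch below is an assumption-laden illustration, not the reference implementation: `answer`, `fake_search`, `failing_llm`, the dict-based session, and the audit fields are all hypothetical, and the search and LLM are injected callables so the example runs without external services.

```python
from datetime import datetime, timezone


def answer(session, query, search, llm, audit_sink):
    """Run one retrieval turn; `search`, `llm`, and `audit_sink` are injected."""
    identity_filter = {
        "student_id": session["student_id"],
        "institution_id": session["institution_id"],
    }
    docs = []
    try:
        # Layer 1: the identity pre-filter is part of the store query itself.
        candidates = search(query, k=20, filter=identity_filter)
        # Layer 2: category authorization before context assembly.
        docs = [
            d for d in candidates
            if d["metadata"].get("category") in session["allowed_categories"]
        ]
        context = "\n".join(d["text"] for d in docs)
        return llm(f"Context:\n{context}\n\nQuestion: {query}")
    finally:
        # Runs on success and on failure, so every retrieval leaves a record.
        audit_sink({
            "student_id": session["student_id"],
            "institution_id": session["institution_id"],
            "documents_in_context": len(docs),
            "categories": sorted({d["metadata"]["category"] for d in docs}),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })


def fake_search(query, k, filter):
    """Toy stand-in for a vector store that honors an equality filter."""
    corpus = [
        {"text": "GPA 3.7", "metadata": {
            "student_id": "1001", "institution_id": "inst_a",
            "category": "academic_record"}},
        {"text": "Counseling note", "metadata": {
            "student_id": "1001", "institution_id": "inst_a",
            "category": "counseling_record"}},
        {"text": "Other student", "metadata": {
            "student_id": "2002", "institution_id": "inst_a",
            "category": "academic_record"}},
    ]
    hits = [
        d for d in corpus
        if all(d["metadata"].get(key) == val for key, val in filter.items())
    ]
    return hits[:k]


def failing_llm(prompt):
    raise RuntimeError("provider timeout")


audit_log = []
session = {
    "student_id": "1001",
    "institution_id": "inst_a",
    "allowed_categories": {"academic_record"},
}
try:
    answer(session, "What is my GPA?", fake_search, failing_llm, audit_log.append)
except RuntimeError:
    pass  # the LLM failed, but the audit record was still written
```

After this run `audit_log` holds one entry describing the single academic record that reached the context, even though no response was produced.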

A reference implementation
The patterns described here are implemented in enterprise-rag-patterns, an MIT-licensed Python library that provides:

StudentIdentityScope — defines the retrieval boundary per student and institution
FERPAContextPolicy — two-layer enforcement (pre-filter + category authorization)
AuditRecord — structured 34 CFR § 99.32 disclosure logging with a typed sink interface
make_enrollment_advisor_policy — factory for the most common higher-education RAG use case
The design is platform-agnostic (any vector store, any LLM provider) and cloud-agnostic (AWS, GCP, Azure, OCI, or on-premises). The same two-layer pattern applies to HIPAA's minimum-necessary standard and GLBA's safeguards rule.

Summary
| Rule | What breaks | What the fix is |
| --- | --- | --- |
| 1. Filter before ranking | Post-retrieval filter leaves unauthorized docs in pipeline | Metadata pre-filter at vector store query time |
| 2. Filter on institution_id | student_id alone allows cross-institution leakage | Compound AND filter: student_id + institution_id |
| 3. Enforce document categories | All of student's records are accessible regardless of type | Category authorization as second enforcement layer |
| 4. Audit every retrieval event | Application-level logs don't satisfy 34 CFR § 99.32 | Typed AuditRecord per retrieval event, routed to durable store |
| 5. Identity from session, not query | User-supplied filter values enable unauthorized access | Filter constructed from verified session token only |

These are not edge cases. They are the default failure modes of standard RAG architectures when applied to regulated record-access environments. The fix for each is straightforward once you know where to look.

The reference implementation is available at github.com/ashutoshrana/enterprise-rag-patterns.
