If you are building a retrieval-augmented generation (RAG) system for a higher-education institution, your pipeline is probably violating FERPA. Not because you meant to — but because the standard RAG tutorial pattern and the regulated record-access pattern are fundamentally different, and most documentation does not explain where they diverge.
This post covers five rules that most enterprise RAG implementations break, and what the correct pattern looks like for each.
What FERPA requires from a retrieval system
FERPA (Family Educational Rights and Privacy Act, 20 U.S.C. § 1232g; implementing regulations at 34 CFR Part 99) governs access to education records at institutions that receive federal funding.
The relevant requirement for a RAG pipeline is simple: a student's education records must not be accessible to another student or to an unauthorized third party.
In a vector store-backed system, "accessible" means more than whether the LLM produces the record in its response. It means whether the record enters the retrieval pipeline at all. A document that is retrieved, ranked, and then discarded by a post-filter has still been surfaced to a process that handles data for a different user.
Under FERPA's minimum-disclosure principle — and under any reasonable security posture — that is not acceptable.
Rule 1: Filter before ranking, not after
What most systems do: Retrieve the top-k documents from the vector store based on semantic similarity, then apply a metadata filter to remove documents that belong to the wrong student.
Why this breaks FERPA: The unauthorized documents are scored, ranked, and processed by the retrieval pipeline before being discarded. If the post-filter has a defect — a misconfigured field name, a missing metadata key, a swallowed exception — the unauthorized content reaches the LLM context window. The failure mode is silent and the blast radius is wide.
The correct pattern: Apply the identity constraint as a metadata pre-filter on the vector store query. Unauthorized documents should not exist in the candidate set.
# ❌ Wrong — retrieve all, then filter
all_docs = vector_store.similarity_search(query, k=20)
authorized = [d for d in all_docs if d.metadata["student_id"] == session.student_id]
# ✅ Correct — filter at query time
authorized = vector_store.similarity_search(
    query,
    k=20,
    filter={
        "student_id": session.student_id,
        "institution_id": session.institution_id,
    },
)
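Most vector stores support metadata filtering natively: Pinecone, Weaviate, Qdrant, pgvector, and Chroma all accept filter expressions at query time. Use them, and confirm in your store's documentation that the filter is applied during the search rather than after it, since behavior can vary with index type.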
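To make the difference concrete, here is a minimal in-memory sketch. It does not use any real vector store: `Doc`, `CORPUS`, `top_k`, and the two search functions are hypothetical stand-ins, and `score` replaces a real similarity computation. The point is only to show which documents enter the candidate set in each pattern.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    score: float  # stand-in for similarity to the query (higher = closer)
    metadata: dict = field(default_factory=dict)


# Toy corpus: student "s2" happens to have the closest matches to the query.
CORPUS = [
    Doc("s2 transcript", 0.99, {"student_id": "s2", "institution_id": "uni-a"}),
    Doc("s2 aid letter", 0.97, {"student_id": "s2", "institution_id": "uni-a"}),
    Doc("s1 transcript", 0.80, {"student_id": "s1", "institution_id": "uni-a"}),
    Doc("s1 aid letter", 0.75, {"student_id": "s1", "institution_id": "uni-a"}),
]


def top_k(docs, k):
    """Rank candidates by score and keep the best k."""
    return sorted(docs, key=lambda d: d.score, reverse=True)[:k]


def owned_by(doc, student_id, institution_id):
    """True when the document belongs to this student at this institution."""
    return (doc.metadata.get("student_id") == student_id
            and doc.metadata.get("institution_id") == institution_id)


def post_filter_search(k, student_id, institution_id):
    """Anti-pattern: rank everything, then drop the wrong student's documents."""
    candidates = top_k(CORPUS, k)  # other students' records are ranked here
    kept = [d for d in candidates if owned_by(d, student_id, institution_id)]
    return candidates, kept


def pre_filter_search(k, student_id, institution_id):
    """Preferred: restrict the candidate set first, then rank."""
    candidates = [d for d in CORPUS if owned_by(d, student_id, institution_id)]
    return candidates, top_k(candidates, k)


post_candidates, post_kept = post_filter_search(2, "s1", "uni-a")
pre_candidates, pre_kept = pre_filter_search(2, "s1", "uni-a")
```

In this toy run the post-filter version ranks two of another student's documents and then returns nothing for the requester, so it is both a leak risk and a recall problem. The pre-filter version only ever sees the requester's own documents.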
Rule 2: Filter on institution_id, not just student_id
What most systems do: Filter by student_id only.
Why this breaks FERPA: In a multi-tenant deployment, a student_id that is unique within Institution A may collide with a record at Institution B. More fundamentally, a student authorized to access their own records at Institution A should never retrieve records from Institution B — even if their student_id matches.
The correct pattern: Apply a compound AND filter: student_id == X AND institution_id == Y. Both conditions must be satisfied.
# ❌ Wrong — student_id alone
filter = {"student_id": session.student_id}
# ✅ Correct — compound identity predicate
filter = {
    "$and": [
        {"student_id": {"$eq": session.student_id}},
        {"institution_id": {"$eq": session.institution_id}},
    ]
}
Never query on student_id alone in a multi-institution deployment.
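One way to make the compound predicate hard to forget is to build it in a single helper that refuses to return a partial filter. The sketch below is illustrative only: `Session`, `build_identity_filter`, and `matches` are hypothetical names, and the `$and`/`$eq` shape simply mirrors the snippet above. Real filter syntax differs between vector stores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    student_id: str
    institution_id: str


def build_identity_filter(session):
    """Return a compound AND filter; refuse to build a partial one."""
    if not session.student_id or not session.institution_id:
        raise ValueError("both student_id and institution_id are required")
    return {
        "$and": [
            {"student_id": {"$eq": session.student_id}},
            {"institution_id": {"$eq": session.institution_id}},
        ]
    }


def matches(metadata, flt):
    """Tiny evaluator for the $and/$eq subset used above (illustration only)."""
    return all(
        metadata.get(key) == cond["$eq"]
        for clause in flt["$and"]
        for key, cond in clause.items()
    )


# Two institutions that both issued the ID "1001" to different people.
records = [
    {"student_id": "1001", "institution_id": "uni-a", "doc": "A transcript"},
    {"student_id": "1001", "institution_id": "uni-b", "doc": "B transcript"},
]

flt = build_identity_filter(Session("1001", "uni-a"))
visible = [r["doc"] for r in records if matches(r, flt)]
```

With the compound filter, only the Institution A record is visible even though both records carry the same `student_id`.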
Rule 3: Enforce document categories as a second layer
What most systems do: Once the identity filter passes, all of the student's documents are fair game.
Why this breaks FERPA: Not all of a student's records are equally accessible. Counseling records, health records, disciplinary files, and financial aid records each have different access rules. Even if the current retrieval is authorized for identity, the category of document being retrieved matters.
A financial aid query that incidentally surfaces a counseling note is retrieving the right student's record — but the wrong type of record.
The correct pattern: After the identity pre-filter, apply a category authorization check. The authenticated session carries a set of permitted document categories. Documents outside that set are excluded.
# Session carries permitted categories (set by auth layer)
session.allowed_categories = {"academic_record", "financial_record"}
# Second enforcement layer — category filter
authorized = [
    doc for doc in identity_filtered_docs
    if doc.metadata.get("category") in session.allowed_categories
]
This is the two-layer enforcement model:
- Layer 1 — Identity boundary: who owns this document?
- Layer 2 — Category authorization: what type of document is this, and is the session permitted to retrieve it?
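A slightly fuller sketch of the second layer follows. It assumes documents carry a `category` metadata key, treats a missing tag as a denial, and returns the excluded documents so they can be counted later. `Doc`, `authorize_categories`, and the `counseling_record` category name are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)


def authorize_categories(docs, allowed_categories):
    """Split docs into (permitted, excluded). Untagged documents are excluded."""
    permitted, excluded = [], []
    for doc in docs:
        category = doc.metadata.get("category")  # None when the tag is missing
        if category is not None and category in allowed_categories:
            permitted.append(doc)
        else:
            excluded.append(doc)
    return permitted, excluded


# All three documents already passed the identity pre-filter.
identity_filtered_docs = [
    Doc("Fall award letter", {"category": "financial_record"}),
    Doc("Counseling intake note", {"category": "counseling_record"}),
    Doc("Scanned form, never tagged", {}),
]

allowed = frozenset({"academic_record", "financial_record"})
permitted, excluded = authorize_categories(identity_filtered_docs, allowed)
```

Failing closed on untagged documents is a design choice: a document whose category is unknown cannot be shown to be in scope, so it stays out of the context window.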
Rule 4: Every retrieval event must produce an audit record
What most systems do: Log at the application level — a timestamped entry that a user made a query.
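Why this breaks FERPA: 34 CFR § 99.32 requires institutions to maintain a record of each request for access to, and each disclosure of, personally identifiable information from education records. Section 99.32(d) exempts some cases, including disclosures to the eligible student, but a RAG pipeline often serves advisors, staff, and automated agents as well, so recording every retrieval is the safe default. The audit record should capture: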
- Who made the request
- What was disclosed
- The basis for disclosure
- The date
An application log that records "user X made a query" does not satisfy this requirement.
The correct pattern: Produce a typed audit record for each retrieval event, containing the count of documents retrieved, the categories accessed, the policy version in effect, and the timestamp. Route it to a durable, student-accessible store — not just an application log.
from datetime import datetime, timezone

audit_record = AuditRecord(
    student_id=session.student_id,
    institution_id=session.institution_id,
    documents_retrieved=len(raw_docs),         # candidates returned by the store
    documents_filtered=len(authorized_docs),   # documents that passed both layers
    categories_accessed=list(session.allowed_categories),
    policy_version="v1.2",
    timestamp=datetime.now(timezone.utc),
    requester_context={"session_id": session.id, "channel": session.channel},
)
audit_sink(audit_record)  # write to compliance database, not the application log
Application logs rotate. FERPA compliance audit trails must be retained for as long as the education records themselves are retained.
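As a rough sketch of what a durable sink could look like, the example below appends one row per retrieval event to a SQLite table. Everything here is an assumption for illustration: `RetrievalAudit` is a hypothetical record type (not the library's `AuditRecord`), the table layout is invented, and an in-memory database stands in for a real compliance store.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RetrievalAudit:
    student_id: str
    institution_id: str
    requester: str        # who made the request
    basis: str            # why access was permitted
    document_ids: tuple   # what was disclosed
    categories: tuple
    policy_version: str
    timestamp: str        # ISO 8601, UTC


def make_sqlite_sink(conn):
    """Return a sink that appends one row per retrieval event."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS retrieval_audit ("
        "id INTEGER PRIMARY KEY, student_id TEXT, institution_id TEXT, "
        "recorded_at TEXT, payload TEXT)"
    )

    def sink(record):
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO retrieval_audit "
                "(student_id, institution_id, recorded_at, payload) "
                "VALUES (?, ?, ?, ?)",
                (record.student_id, record.institution_id,
                 record.timestamp, json.dumps(asdict(record))),
            )

    return sink


conn = sqlite3.connect(":memory:")  # use a file or a real database in practice
audit_sink = make_sqlite_sink(conn)
audit_sink(RetrievalAudit(
    student_id="s1",
    institution_id="uni-a",
    requester="session:abc123",
    basis="student accessing own records",
    document_ids=("doc-17", "doc-42"),
    categories=("financial_record",),
    policy_version="v1.2",
    timestamp=datetime.now(timezone.utc).isoformat(),
))

rows = conn.execute("SELECT student_id, payload FROM retrieval_audit").fetchall()
```

Keying each row by `student_id` and `institution_id` keeps the trail queryable per student, which matters if the record has to be produced for inspection.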
Rule 5: Identity values must come from the session, not the query
What most systems do: Accept student_id and institution_id as parameters in the API request, or extract them from user-supplied query text.
Why this breaks FERPA: If the filter values come from the request, an attacker (or a misconfigured agent) can supply a different student's ID and retrieve their records. This is the classic insecure direct object reference pattern, and it is an easy route to unauthorized record access in multi-tenant educational systems.
The correct pattern: The student_id and institution_id used for filtering must come from the authenticated session token — not from the request body, not from the query, not from user input.
# ❌ Wrong — accept from request body
student_id = request.params["student_id"]
# ✅ Correct — extract from verified session token
session = verify_token(request.headers["Authorization"])
student_id = session.student_id # set by auth layer, not by user
institution_id = session.institution_id
This is not FERPA-specific — it is a basic authorization principle. In RAG systems it is easy to miss because most tutorials treat the retrieval query as the only input and ignore the access control context entirely.
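The sketch below shows the idea end to end with a toy HMAC-signed token built from the standard library. It is an assumption-heavy illustration: `SECRET_KEY`, `sign_token`, `verify_token`, and `build_filter` are hypothetical, there is no expiry check, and a production system would normally rely on an established token format and library instead.

```python
import base64
import hashlib
import hmac
import json
from dataclasses import dataclass

SECRET_KEY = b"demo-only-secret"  # hypothetical; load from a secret manager


@dataclass(frozen=True)
class Session:
    student_id: str
    institution_id: str


def _signature(payload):
    """Hex HMAC-SHA256 of the encoded payload."""
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def sign_token(claims):
    """Issue a token as base64url(payload) + '.' + signature (auth layer only)."""
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    return f"{payload}.{_signature(payload)}"


def verify_token(token):
    """Return a Session if the signature checks out; raise otherwise."""
    payload, _, sig = token.partition(".")
    if not hmac.compare_digest(sig.encode(), _signature(payload).encode()):
        raise PermissionError("invalid token signature")
    claims = json.loads(base64.urlsafe_b64decode(payload.encode()))
    return Session(claims["student_id"], claims["institution_id"])


def build_filter(token, request_params):
    """Identity comes from the token; request_params is deliberately ignored."""
    session = verify_token(token)
    return {
        "student_id": session.student_id,
        "institution_id": session.institution_id,
    }


token = sign_token({"student_id": "s1", "institution_id": "uni-a"})
# The caller tries to smuggle in another student's ID; it has no effect.
flt = build_filter(token, {"student_id": "s2"})
```

Because `build_filter` never reads identity fields from `request_params`, a caller-supplied `student_id` cannot widen or redirect the retrieval scope.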
What a compliant pipeline looks like
Authenticated session
(student_id + institution_id + allowed_categories — from verified token)
│
▼
Vector store pre-filter query
(metadata filter: student_id AND institution_id — applied at query time)
│
▼
Semantic ranking
(only authorized documents are candidates)
│
▼
Category authorization check
(second enforcement layer — removes out-of-scope document types)
│
▼
Context assembly → LLM call
│
▼
Audit record (34 CFR § 99.32)
(student_id, institution_id, documents retrieved, categories, timestamp)
→ written to durable compliance store
Access is enforced twice before any document enters the LLM context window: the identity boundary at the vector store, then category authorization on the results. The audit record is produced for every retrieval event, regardless of whether the LLM produces a response.
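One way to guarantee the audit record is written even when the LLM call fails is to emit it from a `finally` block. The sketch below wires the stages together with plain dictionaries and stand-in callables. `answer`, `fake_search`, and `failing_llm` are hypothetical names, not part of any library.

```python
def answer(session, query, search, call_llm, audit_sink):
    """Run one retrieval turn; the audit sink is called even if the LLM fails."""
    identity_filter = {
        "student_id": session["student_id"],
        "institution_id": session["institution_id"],
    }
    docs = search(query, identity_filter)  # pre-filter + ranking
    permitted = [
        d for d in docs
        if d.get("category") in session["allowed_categories"]
    ]
    try:
        return call_llm(query, permitted)
    finally:
        audit_sink({
            "student_id": session["student_id"],
            "institution_id": session["institution_id"],
            "documents_retrieved": len(docs),
            "documents_in_context": len(permitted),
            "categories": sorted({d["category"] for d in permitted}),
        })


# Stand-ins for the real components.
def fake_search(query, flt):
    store = [
        {"student_id": "s1", "institution_id": "uni-a", "category": "financial_record"},
        {"student_id": "s1", "institution_id": "uni-a", "category": "counseling_record"},
        {"student_id": "s2", "institution_id": "uni-a", "category": "financial_record"},
    ]
    return [d for d in store if all(d.get(k) == v for k, v in flt.items())]


def failing_llm(query, docs):
    raise RuntimeError("model unavailable")


audit_log = []
session = {
    "student_id": "s1",
    "institution_id": "uni-a",
    "allowed_categories": {"financial_record"},
}
try:
    answer(session, "What is my aid balance?", fake_search, failing_llm, audit_log.append)
except RuntimeError:
    pass  # the caller still sees the failure
```

Here the model call raises, yet one audit entry is still recorded, because the retrieval itself already happened. If the search step fails before returning anything, no documents were retrieved and this sketch records nothing; whether to log failed attempts as well is a separate policy decision.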
Reference implementation
The patterns described here are implemented in enterprise-rag-patterns, an MIT-licensed Python library:
pip install enterprise-rag-patterns
It provides:
- StudentIdentityScope: defines the retrieval boundary per student and institution
- FERPAContextPolicy: two-layer enforcement (pre-filter + category authorization)
- AuditRecord: structured 34 CFR § 99.32 disclosure logging with a typed sink interface
- make_enrollment_advisor_policy: factory for the most common higher-education RAG use case
The design is platform-agnostic (any vector store, any LLM provider) and cloud-agnostic (AWS, GCP, Azure, OCI, or on-premises). The same two-layer pattern applies to HIPAA's minimum-necessary standard and GLBA's safeguards rule.
A companion library, regulated-ai-governance, provides policy enforcement and audit for AI agents across FERPA, HIPAA, GDPR, CCPA, GLBA, and SOC 2.
Summary
| Rule | What breaks | The fix |
|---|---|---|
| 1. Filter before ranking | Post-retrieval filter leaves unauthorized docs in pipeline | Metadata pre-filter at vector store query time |
| 2. Filter on institution_id | student_id alone allows cross-institution leakage | Compound AND filter: student_id + institution_id |
| 3. Enforce document categories | All of student's records are accessible regardless of type | Category authorization as second enforcement layer |
| 4. Audit every retrieval event | Application-level logs don't satisfy 34 CFR § 99.32 | Typed AuditRecord per retrieval, routed to durable store |
| 5. Identity from session | User-supplied filter values enable unauthorized access | Filter constructed from verified session token only |
These are not edge cases. They are the default failure modes of standard RAG architectures when applied to regulated record-access environments. The fix for each is straightforward once you know where to look.
Reference implementation: github.com/ashutoshrana/enterprise-rag-patterns