Hello Everyone 👋
So I had a bunch of scanned invoices and PDFs lying around from an old freelance gig. Manually reading them and extracting data felt like a punishment. Instead of going through them one by one, I decided to try something smarter (I know, classic Developer Syndrome 😜) and build a small AWS workflow that could:
- Read scanned docs,
- Summarise key info,
- Convert it to structured data, and
- Visualise everything inside a dashboard.
Here’s how I did it, step by step. You can follow the same process and tweak it for contracts, reports, receipts, whatever you’ve got.
✅ Step 1: Set Up Your AWS Environment (Production-ready)
1) Pick a single region that supports everything
- Goal: Keep S3, Textract, Bedrock, and QuickSight in the same region to avoid cross-region latency/fees.
- How to choose: Check that your chosen region offers Bedrock (model availability varies) and Textract. If QuickSight isn’t in that region, use Athena + Glue as a bridge (S3 anywhere → Athena in QuickSight’s region).
Edge cases
- Bedrock not in your favorite region: run Textract + S3 where your data lives, then send plain text only to Bedrock in a supported region (privacy + cost trade-off).
- Corporate constraint on regions: document the exception and add an S3 Replication rule or use Athena.
2) Create an S3 bucket with the right defaults
- Name: doc-insights-<env>-<region>-<account-id> (e.g., doc-insights-prod-eu-central-1-123456789012)
Settings to enable right away
- Block Public Access: ON (all four toggles).
- Versioning: ON (lets you recover bad parses).
- Default encryption: SSE-KMS (customer-managed key) for auditability.
- Object Ownership: Bucket owner enforced (disables ACLs, simplifies perms).
- Lifecycle rules:
  - Move raw/ to infrequent access after N days.
  - Expire tmp/ after M days.
  - Keep processed/ longer for audits.
Folder layout (simple + scalable)
s3://doc-insights-bucket/
raw/ # uploads: PDFs, images
textract-json/ # raw JSON from Textract (per doc)
extracted/ # normalized tables/kv (CSV/Parquet)
summaries/ # Bedrock text summaries (JSON/MD)
processed/ # final CSVs for BI
logs/ # app / pipeline logs
Edge cases
- Big uploads / flaky networks: use multi-part upload (SDK/CLI handles this).
- PII/PHI compliance: restrict bucket to VPC endpoints (below), rotate KMS keys, and enable CloudTrail data events for S3.
- Cross-account access: prefer bucket policy + role assumption over sharing long-lived keys.
3) IAM roles & least-privilege policies
Create three roles (one per workload) and scope them tightly:
a) Textract role (used by your Lambda/ETL or notebook)
- Trusted entity: lambda.amazonaws.com (or your compute service)
- Inline policy (example):
{
"Version": "2012-10-17",
"Statement": [
{"Effect":"Allow","Action":["textract:AnalyzeDocument","textract:StartDocumentAnalysis","textract:GetDocumentAnalysis"],"Resource":"*"},
{"Effect":"Allow","Action":["s3:GetObject","s3:PutObject"],"Resource":["arn:aws:s3:::doc-insights-bucket/*"]},
{"Effect":"Allow","Action":["s3:ListBucket"],"Resource":["arn:aws:s3:::doc-insights-bucket"]},
{"Effect":"Allow","Action":["kms:Encrypt","kms:Decrypt","kms:GenerateDataKey","kms:DescribeKey"],"Resource":"arn:aws:kms:REGION:ACCOUNT:key/KEY_ID"}
]
}
Use Start/Get APIs for async jobs (multi-page PDFs); AnalyzeDocument is fine for small/sync runs.
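If you'd rather script the role than click through the console, here is a minimal boto3 sketch (the role name, policy name, and the textract-policy.json file are illustrative; the policy document is the inline policy above):
import json
import boto3

iam = boto3.client("iam")

trust = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow",
                   "Principal": {"Service": "lambda.amazonaws.com"},
                   "Action": "sts:AssumeRole"}],
}

iam.create_role(RoleName="TextractRole",
                AssumeRolePolicyDocument=json.dumps(trust))

# Attach the inline policy shown above (saved locally as textract-policy.json)
with open("textract-policy.json") as f:
    iam.put_role_policy(RoleName="TextractRole",
                        PolicyName="textract-least-privilege",
                        PolicyDocument=f.read())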
b) Bedrock role (for summarization)
{
"Version":"2012-10-17",
"Statement":[
{"Effect":"Allow","Action":["bedrock:InvokeModel","bedrock:InvokeModelWithResponseStream"],"Resource":"*"},
{"Effect":"Allow","Action":["s3:GetObject","s3:PutObject"],"Resource":["arn:aws:s3:::doc-insights-bucket/*"]},
{"Effect":"Allow","Action":["kms:Encrypt","kms:Decrypt","kms:GenerateDataKey","kms:DescribeKey"],"Resource":"arn:aws:kms:REGION:ACCOUNT:key/KEY_ID"}
]
}
Limit Resource to the specific model ARNs once you pick them (Claude, Mistral, etc.).
c) QuickSight service role access to S3
Inside QuickSight → Admin → Security & permissions, grant access to your bucket. Backing IAM should allow:
{
"Version":"2012-10-17",
"Statement":[
{"Effect":"Allow","Action":["s3:GetObject"],"Resource":["arn:aws:s3:::doc-insights-bucket/processed/*"]},
{"Effect":"Allow","Action":["s3:ListBucket"],"Resource":["arn:aws:s3:::doc-insights-bucket"]}
]
}
Edge cases
- SSE-KMS blocks reads: add the QuickSight role principal to your KMS key policy.
- Cross-account S3: add a bucket policy that allows the QuickSight account role + use a manifest file.
4) KMS key policy that won’t bite you later
Add all compute roles and the QuickSight role to the key policy (not just IAM perms), e.g.:
{
"Version":"2012-10-17",
"Statement":[
{"Sid":"EnableRoot","Effect":"Allow","Principal":{"AWS":"arn:aws:iam::ACCOUNT:root"},"Action":"kms:*","Resource":"*"},
{"Sid":"AllowUseOfKey","Effect":"Allow","Principal":{"AWS":[
"arn:aws:iam::ACCOUNT:role/TextractRole",
"arn:aws:iam::ACCOUNT:role/BedrockRole",
"arn:aws:iam::ACCOUNT:role/service-role/QuickSight-ROLE"
]},"Action":["kms:Encrypt","kms:Decrypt","kms:GenerateDataKey","kms:DescribeKey"],"Resource":"*"}
]
}
Edge cases
- AccessDenied (kms:Decrypt) in QuickSight/Athena: 99% of the time it’s the key policy missing the service role.
5) Networking (private, compliant, and fast)
- Create VPC endpoints:
  - A Gateway endpoint for S3, plus interface endpoints (PrivateLink) for Textract, Bedrock, STS, CloudWatch Logs, etc.
- Lock your compute (Lambda/ECS/EC2) down to no internet access and route AWS traffic via the endpoints.
Edge cases
- QuickSight can’t sit in your VPC: it’s managed; keep the S3 bucket policy open to QuickSight role (not public) and allow KMS for that role.
- Corporate egress proxy: verify the Bedrock/Textract service endpoints are reachable through the proxy, or use VPC endpoints instead.
6) Turn on Amazon Bedrock access (once per account/region)
- In the Bedrock console, enable model access (Claude/Mistral etc.). Some models require an opt-in.
Edge cases
- Model not visible: you’re in a region without that model, or access not granted yet.
- Throughput errors: request service quota increases for InvokeModel if you batch summaries.
7) Enable Amazon QuickSight
- Choose the same region as your S3 if possible.
- In Security & permissions, tick your bucket and (if needed) Athena access.
- Decide SPICE vs Direct Query:
- SPICE (in-memory) is faster; watch capacity limits.
- Direct is fine for small CSVs; for large data, prefer Parquet in S3 via Athena.
Edge cases
- CSV > size limit / too many rows: convert to Parquet and query via Athena.
- Timezone mismatches: set dataset time zone explicitly; normalize date formats (ISO-8601).
8) Budget, cost alerts, and audit
- AWS Budgets: create a monthly budget with email alerts (S3 + Textract + Bedrock + QuickSight).
- CloudTrail data events: ON for S3 (per-object audit).
- CloudWatch Logs: centralize Lambda/app logs to logs/.
Edge cases
- Sudden spikes: enable S3 object-level CloudWatch metrics and add alarms on object count/size in raw/.
9) File formats & naming that make your life easy
- Uploads (raw): accept PDF/JPEG/PNG/TIFF; normalize on ingest.
- Outputs: prefer CSV (QuickSight-friendly) and/or Parquet (Athena-friendly).
- Names:
  - raw/{yyyy}/{mm}/{dd}/{vendor}/{fileid}.pdf
  - processed/{yyyy}/{mm}/{dd}/invoices.csv
Edge cases
- Weird encodings: force UTF-8, escape delimiters, include header row.
- Duplicate filenames: use a UUID suffix or checksum in the key.
10) Optional: IaC and CLI quick start
Create bucket (CLI)
aws s3api create-bucket --bucket doc-insights-bucket --create-bucket-configuration LocationConstraint=$AWS_REGION
aws s3api put-bucket-encryption --bucket doc-insights-bucket --server-side-encryption-configuration '{
"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms","KMSMasterKeyID":"arn:aws:kms:REGION:ACCOUNT:key/KEY_ID"}}]}'
aws s3api put-bucket-versioning --bucket doc-insights-bucket --versioning-configuration Status=Enabled
aws s3api put-public-access-block --bucket doc-insights-bucket --public-access-block-configuration \
'{"BlockPublicAcls":true,"IgnorePublicAcls":true,"BlockPublicPolicy":true,"RestrictPublicBuckets":true}'
Example lifecycle (CLI)
aws s3api put-bucket-lifecycle-configuration --bucket doc-insights-bucket --lifecycle-configuration '{
"Rules":[
{"ID":"raw-to-ia","Filter":{"Prefix":"raw/"},"Status":"Enabled","Transitions":[{"Days":30,"StorageClass":"STANDARD_IA"}]},
{"ID":"expire-tmp","Filter":{"Prefix":"tmp/"},"Status":"Enabled","Expiration":{"Days":7}}
]}'
📄 Step 2: Use Amazon Textract to Extract Text from Documents (Production-grade)
Textract can do three things you’ll care about:
- OCR (lines/words) → DetectDocumentText
- Generic forms & tables → AnalyzeDocument with FORMS, TABLES, optionally QUERIES
- Invoice/receipt specialization → AnalyzeExpense (often best for invoices)
Rule of thumb
- If it’s an invoice/receipt, try AnalyzeExpense first (it returns normalized fields like VENDOR_NAME, INVOICE_TOTAL).
- If it’s a contract/form/report, use AnalyzeDocument with FORMS and TABLES.
- Use QUERIES to target specific fields (e.g., “What is the due date?”) when layout varies wildly.
1) Sync vs Async: choose the right API
- Synchronous (single call; immediate result)
  - AnalyzeDocument (good for single-page or small docs; great for tests)
  - DetectDocumentText (basic OCR)
- Asynchronous (submit job → poll or page through results)
  - StartDocumentAnalysis + GetDocumentAnalysis
  - Required for multi-page and larger PDFs/images
  - Supports pagination via NextToken
- Guidance: Use sync for quick console/dev tests or one-pagers; use async for anything real (multi-page, batch).
2) Pre-processing (massively improves accuracy)
- Input formats: PDF, PNG, JPEG, TIFF work well; keep scans clean.
- Quality: Aim for readable text (good contrast, minimal blur). If possible:
  - Deskew, denoise, increase contrast, convert to grayscale/PNG (see the sketch below).
  - Fix rotation (many scanners rotate by 90° or 180°).
- Encrypted or password-protected PDFs: decrypt before upload.
- Color: Grayscale often beats noisy full-color scans.
- Multi-page splits: If you hit page/file limits, split before upload.
- File names: Include vendor/date hints in the key to simplify downstream routing (e.g., raw/2025/09/11/vendorX/invoice_123.pdf).
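A light pre-processing pass pays for itself. Here is a minimal sketch with Pillow (assumes pip install pillow; file names are illustrative, and proper deskew/denoise usually needs OpenCV or similar):
from PIL import Image, ImageOps

def preprocess_scan(src_path: str, dst_path: str) -> None:
    img = Image.open(src_path)
    img = ImageOps.exif_transpose(img)          # honor the rotation recorded by the scanner/phone
    img = ImageOps.grayscale(img)               # grayscale often OCRs better than noisy color
    img = ImageOps.autocontrast(img, cutoff=2)  # stretch contrast, clipping 2% outliers
    img.save(dst_path, format="PNG")            # lossless and Textract-friendly

preprocess_scan("invoice_raw.jpg", "invoice_clean.png")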
3) Minimal but robust code paths (Python/boto3)
A. Synchronous: quick test for forms + tables
import boto3
textract = boto3.client("textract", region_name="eu-central-1")
resp = textract.analyze_document(
Document={"S3Object": {"Bucket": "doc-insights-bucket", "Name": "invoice1.pdf"}},
FeatureTypes=["FORMS", "TABLES"]
)
blocks = resp.get("Blocks", [])
# → Save raw JSON, parse KEY_VALUE_SET/TABLE blocks later
B. Asynchronous: production for multi-page docs (handles pagination + retries)
import boto3, time, itertools
textract = boto3.client("textract", region_name="eu-central-1")
job = textract.start_document_analysis(
DocumentLocation={"S3Object": {"Bucket": "doc-insights-bucket", "Name": "long-invoice.pdf"}},
FeatureTypes=["FORMS","TABLES"]
)
job_id = job["JobId"]
def get_all_pages(job_id, sleep=2, max_wait_s=900):
    """Poll until the Textract job finishes, then page through every result via NextToken."""
    start = time.time()
    # 1) Wait for the job to complete
    while True:
        res = textract.get_document_analysis(JobId=job_id)
        status = res["JobStatus"]
        if status == "SUCCEEDED":
            break
        if status != "IN_PROGRESS":
            raise RuntimeError(f"Textract job failed: {res.get('StatusMessage', status)}")
        if time.time() - start > max_wait_s:
            raise TimeoutError("Textract job took too long.")
        time.sleep(sleep)
    # 2) Collect every page of results
    pages = [res]
    while res.get("NextToken"):
        res = textract.get_document_analysis(JobId=job_id, NextToken=res["NextToken"])
        pages.append(res)
    return pages
pages = get_all_pages(job_id)
# Combine blocks across pages for parsing
all_blocks = list(itertools.chain.from_iterable(p["Blocks"] for p in pages))
C. Invoices/Receipts: specialized AnalyzeExpense (often higher accuracy on totals/dates)
import boto3
textract = boto3.client("textract", region_name="eu-central-1")
resp = textract.analyze_expense(
Document={"S3Object": {"Bucket": "doc-insights-bucket", "Name": "invoice1.pdf"}}
)
# resp["ExpenseDocuments"][0]["SummaryFields"] includes types like:
# VENDOR_NAME, INVOICE_RECEIPT_DATE, INVOICE_RECEIPT_ID, TOTAL, DUE_DATE
D. Targeted fields with QUERIES (when layouts vary)
import boto3
textract = boto3.client("textract", region_name="eu-central-1")
resp = textract.analyze_document(
Document={"S3Object": {"Bucket": "doc-insights-bucket", "Name": "invoice1.pdf"}},
FeatureTypes=["QUERIES","TABLES"], # you can mix with FORMS too
QueriesConfig={
"Queries": [
{"Text": "What is the invoice number?", "Alias": "invoice_no"},
{"Text": "What is the total amount?", "Alias": "total_amount"},
{"Text": "What is the due date?", "Alias": "due_date"}
]
}
)
You can run QUERIES + TABLES together to pull exact fields and still capture line items.
4) How to parse Textract blocks safely (just enough for Step 2)
Textract represents a graph of Blocks with Relationships. The key ones:
- PAGE, LINE, WORD
- KEY_VALUE_SET (with EntityTypes: KEY or VALUE)
- TABLE → CELL (cell has row/col indices)
- QUERY → ANSWER
Parsing outline (KV):
- Build a dict of blocks by Id.
- Collect all KEY_VALUE_SET blocks.
- For each KEY block, follow Relationships to its VALUE block.
- Extract text by concatenating child WORD blocks; track Confidence.
- Normalize keys (vendor, invoice no, etc.) and keep the BoundingBox for traceability.
Parsing outline (TABLE):
- Find TABLE blocks, get their child CELLs.
- Use (RowIndex, ColumnIndex) to reconstruct rows.
- Heuristically detect headers (first row or bold-ish/upper-case hints if present in text).
- Output as CSV; preserve page number.
We’ll do the full CSV/JSON conversion in Step 3. Here we only ensure you’ve got clean raw JSON saved and a clear parser plan.
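To make the KV outline concrete, here is a minimal parsing sketch. It reuses all_blocks from the async example above; the helper names are mine, not Textract APIs:
def text_of(block, by_id):
    """Concatenate the child WORD text (and checked boxes) of a block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for cid in rel["Ids"]:
                child = by_id[cid]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
                elif child["BlockType"] == "SELECTION_ELEMENT" and child.get("SelectionStatus") == "SELECTED":
                    words.append("[X]")
    return " ".join(words)

def extract_key_values(all_blocks):
    """Return (key, value, key_confidence) tuples from AnalyzeDocument blocks."""
    by_id = {b["Id"]: b for b in all_blocks}
    pairs = []
    for b in all_blocks:
        if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", []):
            key_text = text_of(b, by_id)
            value_text = ""
            for rel in b.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for vid in rel["Ids"]:
                        value_text = text_of(by_id[vid], by_id)
            pairs.append((key_text, value_text, b.get("Confidence")))
    return pairs

kv_pairs = extract_key_values(all_blocks)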
5) Storage & idempotency
- Always save raw responses (one file per source doc):
  - textract-json/<doc-id>.json (for AnalyzeDocument / DetectDocumentText)
  - textract-json/<doc-id>.expense.json (for AnalyzeExpense)
- Add metadata to input S3 objects (e.g., processed=true, textract_job_id=...) to avoid reprocessing (see the sketch below).
- Use S3 object ETag/size as a cheap duplicate detector.
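A small idempotency sketch using object tags rather than metadata (metadata can only be changed by copying the object); the tag keys processed and textract_job_id are my own convention:
import boto3

s3 = boto3.client("s3")
BUCKET = "doc-insights-bucket"   # assumed bucket name from Step 1

def already_processed(key: str) -> bool:
    """True if the object already carries a processed=true tag."""
    tags = s3.get_object_tagging(Bucket=BUCKET, Key=key)["TagSet"]
    return any(t["Key"] == "processed" and t["Value"] == "true" for t in tags)

def mark_processed(key: str, job_id: str) -> None:
    """Tag the source object so later runs can skip it."""
    s3.put_object_tagging(
        Bucket=BUCKET,
        Key=key,
        Tagging={"TagSet": [
            {"Key": "processed", "Value": "true"},
            {"Key": "textract_job_id", "Value": job_id},
        ]},
    )

if not already_processed("raw/2025/09/11/vendorX/invoice_123.pdf"):
    pass  # start the Textract job here, then call mark_processed(key, job_id)
Note that put_object_tagging replaces the whole tag set, so merge any existing tags first if you use them for other purposes.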
6) Confidence, validation & fallbacks
- Confidence thresholds: treat fields under a threshold (e.g., < 0.8) as “needs review.”
- Numeric sanity checks (see the sketch after this list): parse amounts with a currency regex, verify:
  - sum(line_items) ≈ total (± a few cents, taxes/discounts considered)
  - date is valid & plausible (e.g., due date ≥ invoice date)
- Disambiguation: fight common OCR confusions: O↔0, l/I↔1, S↔5, B↔8.
- Fallback order for invoices:
  1. AnalyzeExpense
  2. AnalyzeDocument (FORMS+TABLES)
  3. DetectDocumentText + regex heuristics (as a last resort)
- Human-in-the-loop: flag low-confidence docs to a review queue (CSV row with a needs_review=1 flag).
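A minimal sketch of the numeric/date sanity checks above (the field names, the 0.05 tolerance, and the 80-point confidence threshold are illustrative):
import re
from datetime import date

AMOUNT_RE = re.compile(r"[-+]?\d[\d,]*\.?\d*")

def parse_amount(text: str) -> float | None:
    """Pull the first numeric token out of an OCR'd amount like '$1,234.50'."""
    m = AMOUNT_RE.search(text.replace(" ", ""))
    return float(m.group().replace(",", "")) if m else None

def needs_review(line_items: list[float], total: float,
                 invoice_date: date, due_date: date,
                 min_confidence: float, threshold: float = 80.0) -> bool:
    """Flag the document if totals, dates, or confidence look suspicious."""
    if abs(sum(line_items) - total) > 0.05:   # taxes/discounts may need a wider tolerance
        return True
    if due_date < invoice_date:
        return True
    return min_confidence < threshold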
7) Permissions & security edge cases (most common causes of failure)
- S3: role needs s3:GetObject, s3:ListBucket, and if encrypted, kms:Decrypt on the CMK.
- KMS key policy: include the calling role principal (not just IAM permissions).
- Cross-account: use bucket policy + role assumption; ensure KMS allows the assumable role’s principal.
- Private networking: if using VPC-only compute, add VPC endpoints for Textract/S3/STS/Logs.
8) Scaling, throughput, and cost control
- Batching: drive Textract from a queue (SQS) or Step Functions; cap concurrency to avoid throttling.
- Retries: exponential backoff on ThrottlingException and transient network errors.
- De-dup: check if a doc was already processed (S3 metadata) before calling Textract again.
- Specialization saves money: AnalyzeExpense or QUERIES can reduce post-processing and re-runs.
9) Troubleshooting cheat-sheet
| Symptom | Likely cause | Fast fix |
|---|---|---|
| AccessDenied on S3 or KMS | Missing s3:GetObject / kms:Decrypt or KMS key policy excludes role | Add actions to role and role principal to KMS key policy |
| Job stuck “IN_PROGRESS” forever | Not polling pages / ignoring NextToken | Implement full pagination loop until NextToken is empty |
| Empty or low-quality text | Bad scans (blur, skew), wrong rotation, low contrast | Preprocess image: deskew, increase contrast, convert to PNG, rotate correctly |
| Fields missing on invoices | Using generic forms instead of expense API | Switch to AnalyzeExpense or add QUERIES |
| Dates/amounts parsed wrong | Locale/regex issues or OCR confusions | Normalize formats (ISO-8601), add currency/number regex, apply character-swap heuristics |
| QuickSight can’t see outputs | Using KMS + missing QS access | Add QS role to bucket policy and KMS key policy |
10) Console workflow (great for validation)
- Upload a sample to S3 → Textract Console → Analyze.
- Toggle Forms + Tables, confirm highlights match expectations.
- For invoices, test Expense analysis and compare field names (often cleaner).
- Save the result JSON and move on to parsing (Step 3).
✏️ Step 3: Convert Textract Output to Structured Data (CSV or JSON)
1) Clean & normalize the extracted data
- Keep originals: store the raw Textract JSON (read-only) so you can re-parse later without re-running Textract.
- Whitespace & casing: trim, collapse double spaces; keep a _raw copy for each value if you’ll audit later.
- Key normalization (FORMS/QUERIES): map variants to canonical keys (see the sketch after this list):
  - invoice no, inv #, invoice number → invoice_number
  - total, grand total, amount due → total
  - supplier, vendor, billed by → vendor_name
  - invoice date, date → invoice_date; due date → due_date
- Dates: convert to ISO-8601 (YYYY-MM-DD). If ambiguous/parsing fails, leave blank and flag needs_review.
- Numbers & currency: strip thousands separators, unify decimals (.), extract currency symbol/code ($→USD, €→EUR, £→GBP, ₨→PKR, ₹→INR). If none, set a default and flag for review.
- Confidence tracking: carry Textract confidence (0–100). Anything below your threshold (e.g., 80) sets needs_review = 1.
- OCR confusions: apply gentle corrections when context demands numeric values (O↔0, I/l↔1, S↔5, B↔8) but still keep needs_review if confidence was low.
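A minimal normalization sketch for the mappings above (the variant lists are deliberately short; extend them from your own documents):
import re
from datetime import datetime

KEY_MAP = {
    "invoice no": "invoice_number", "inv #": "invoice_number", "invoice number": "invoice_number",
    "total": "total", "grand total": "total", "amount due": "total",
    "supplier": "vendor_name", "vendor": "vendor_name", "billed by": "vendor_name",
    "invoice date": "invoice_date", "date": "invoice_date", "due date": "due_date",
}
CURRENCY_MAP = {"$": "USD", "€": "EUR", "£": "GBP", "₨": "PKR", "₹": "INR"}

def normalize_key(raw_key: str) -> str:
    k = re.sub(r"\s+", " ", raw_key).strip(" :").lower()
    return KEY_MAP.get(k, k.replace(" ", "_"))

def normalize_date(raw: str) -> str:
    """Try a few common formats; return ISO-8601 or '' (caller sets needs_review)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return ""

def normalize_amount(raw: str) -> tuple[float | None, str | None]:
    """Return (amount, currency_code) from text like '$1,234.50'; locale quirks are left to you."""
    currency = next((code for sym, code in CURRENCY_MAP.items() if sym in raw), None)
    digits = re.sub(r"[^\d.\-]", "", raw.replace(",", ""))  # strip symbols and thousands separators
    try:
        return float(digits), currency
    except ValueError:
        return None, currency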
2) Choose your outputs
- CSV (analytics-ready):
  - KV (long format): doc_id, page, key_norm, value, confidence, needs_review
  - Tables (per page/table): doc_id, page, row, col, text, confidence, is_header
  - Invoice summary (if using Expense API): one row per document with normalized fields
- JSON (traceability): keep structured JSON when you need bounding boxes/page indices for audits.
- (Optional) Parquet: for Athena/QuickSight at scale.
3) Dump key–value pairs to CSV
- What to capture: each detected pair as one row in a long table:
  doc_id, page, key_raw, key_norm, value, conf_key, conf_val, needs_review
- Duplicates: keep all candidates. Later, pick the highest confidence (ties → closest to the label “Total”, latest on the page, or appears in a summary box).
- Missing values: if a KEY has no VALUE, record value = "" and set needs_review = 1.
- Multi-line values: join lines with a single space; preserve the original in value_raw if you keep audits.
4) Convert tables from JSON to CSV
- Reconstruct using cells (see the sketch below): one row per cell with indexes:
  doc_id, page, row, col, text, confidence, span_rows, span_cols, is_header
- Headers: default to row 1 = header; mark is_header = 1. If row 1 has poor confidence but contains many alphabetic tokens vs numeric rows below, still treat it as a header.
- Merged cells: repeat the merged text in the spanned area and record span_rows/span_cols for fidelity.
- Multi-page tables: keep separate by page; your business logic decides whether to stitch them later.
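A sketch of the cell flattening, reusing the text_of helper from the Step 2 parsing sketch (doc_id handling and the header heuristic are simplified):
import csv

FIELDS = ["doc_id", "page", "row", "col", "text", "confidence", "span_rows", "span_cols", "is_header"]

def table_cells_to_rows(all_blocks, doc_id):
    """Yield one dict per CELL block, ready for csv.DictWriter."""
    by_id = {b["Id"]: b for b in all_blocks}
    for table in (b for b in all_blocks if b["BlockType"] == "TABLE"):
        page = table.get("Page", 1)
        for rel in table.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cid in rel["Ids"]:
                cell = by_id[cid]
                if cell["BlockType"] != "CELL":
                    continue
                yield {
                    "doc_id": doc_id,
                    "page": page,
                    "row": cell["RowIndex"],
                    "col": cell["ColumnIndex"],
                    "text": text_of(cell, by_id),
                    "confidence": cell.get("Confidence"),
                    "span_rows": cell.get("RowSpan", 1),
                    "span_cols": cell.get("ColumnSpan", 1),
                    "is_header": 1 if cell["RowIndex"] == 1 else 0,   # naive header heuristic
                }

with open("extracted/tables/invoice1_table_1.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(table_cells_to_rows(all_blocks, doc_id="invoice1"))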
5) Build the simple “wide” table for your dashboard
Target columns from your outline:
| Invoice No (invoice_number) | Date (invoice_date) | Vendor (vendor_name) | Amount (total) | Due Date (due_date) |
|---|---|---|---|---|
| INV-123 | 2025-08-03 | Alpha Supplies | 423.90 | 2025-09-02 |
Population priority:
- Expense API (if used): INVOICE_RECEIPT_ID, INVOICE_RECEIPT_DATE, VENDOR_NAME, TOTAL, DUE_DATE.
- FORMS/QUERIES: after key normalization and numeric/date parsing.
- TABLES/OCR fallback: totals row or summary box heuristics (e.g., last column named Amount, Total, Balance).
Also add helpful derived fields:
- currency, source (expense/forms/ocr), needs_review, month, year.
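If you assemble this with pandas, a minimal sketch of the pivot from the long KV CSV to the wide table (file paths and the “highest confidence wins” merge rule are simplified assumptions):
import pandas as pd

kv = pd.read_csv("extracted/kv/invoice1_kv.csv")   # long format from section 3

# Keep the highest-confidence candidate per (doc_id, key_norm)
best = (kv.sort_values("conf_val", ascending=False)
          .drop_duplicates(subset=["doc_id", "key_norm"], keep="first"))

wide = (best.pivot(index="doc_id", columns="key_norm", values="value")
            .reindex(columns=["invoice_number", "invoice_date", "vendor_name", "total", "due_date"])
            .reset_index())

wide["total"] = pd.to_numeric(wide["total"], errors="coerce")
wide["invoice_date"] = pd.to_datetime(wide["invoice_date"], errors="coerce")
wide["month"] = wide["invoice_date"].dt.month
wide["year"] = wide["invoice_date"].dt.year
wide["needs_review"] = wide[["invoice_number", "total", "due_date"]].isna().any(axis=1).astype(int)

wide.to_csv("processed/invoices_wide.csv", index=False)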
6) Quality & integrity checks
- Arithmetic sanity: if you have subtotal and tax, verify subtotal + tax ≈ total (small tolerance for rounding). If not, needs_review = 1.
- Date logic: due_date ≥ invoice_date; otherwise needs_review = 1.
- Negatives & zeros: total must be > 0 for invoices; credits/notes may be negative, so tag them with doc_type = credit_note.
- Duplicates: the same invoice_number appearing across different doc_ids → keep the one with the higher composite score (weighted confidence of critical fields) and mark the others as superseded.
- Two documents in one file: if you detect two distinct vendors or invoice numbers in one doc_id, set multi_doc_suspected = 1 and route to review.
- Locale drift: month names/currency symbols outside your default locale → still parse, but set locale_detected and needs_review if parsing relied on heuristics.
- Handwritten/low DPI scans: values may be fragmented; prefer QUERIES for targeted fields and flag low confidence.
7) Foldering
textract-json/ # raw API outputs (never edited)
extracted/
kv/ # <doc_id>_kv.csv
tables/ # <doc_id>_table_<page>.csv
expense/ # <doc_id>_summary.csv, <doc_id>_line_items.csv (if used)
processed/
invoices_wide.csv # your dashboard input (or Parquet partitioned by year/month)
🧠 Step 4: Summarise the Document Using Bedrock
1) What you’ll summarise
Feed Bedrock only what it needs:
- Facts you already extracted in Step 3 (invoice_number, vendor_name, total, due_date, currency, etc.).
- Top signal text: page headers, totals area, vendor/address block, and (for invoices) the line-items table flattened to a few lines (don’t paste thousands of rows).
- Optional full text: chunked by page if you need narrative context.
Redact or truncate PII that isn’t needed (addresses beyond city/country, phone, full account numbers). Keep the doc_id so you can trace back.
2) Output format
Ask Bedrock for structured JSON + a short narrative. Example JSON schema:
{
"doc_id": "string",
"summary": "1–3 sentence human summary",
"key_fields": {
"vendor_name": "string|null",
"invoice_number": "string|null",
"invoice_date": "YYYY-MM-DD|null",
"due_date": "YYYY-MM-DD|null",
"currency": "string|null",
"total_amount": "number|null"
},
"observations": [
"string" // anomalies, discounts, late fees, notes
],
"data_quality": {
"missing_fields": ["string"],
"suspected_errors": ["string"],
"confidence_note": "string"
}
}
This slots straight into your processed/ layer or QuickSight as a secondary dataset.
3) Prompting
Use a system + user style prompt; instruct the model not to invent data.
System
You are a careful financial document assistant. Use ONLY the supplied text and fields.
If a field is missing or unclear, return null and add a helpful note in data_quality.
Output must be valid JSON matching the provided schema. Do not include any other text.
User
DOC_ID: {{doc_id}}
FACTS (from extraction):
- vendor_name: {{vendor_name}}
- invoice_number: {{invoice_number}}
- invoice_date: {{invoice_date}}
- due_date: {{due_date}}
- currency: {{currency}}
- total_amount: {{total}}
EXCERPTS (trimmed; page-tagged):
[PAGE 1]
{{top_of_page_text}}
[PAGE 2 - totals box]
{{totals_area_text}}
[OPTIONAL line items]
{{few_key_lines}}
TASK:
1) Produce a JSON object exactly matching this schema:
{{paste schema}}
2) Base fields on FACTS when available; only use EXCERPTS to confirm or fill blanks.
3) If numbers disagree, prefer FACTS but note the discrepancy in `data_quality.suspected_errors`.
4) Do not guess. Use `null` for unknowns.
Why this works
- It anchors the model to your extracted fields (reduces hallucinations).
- It forces a predictable JSON payload for downstream use.
- It adds a quality channel for “known unknowns”.
4) Model choice & params
- Claude (Anthropic): strong on instruction following & JSON fidelity.
- Mistral: concise, fast; good when payloads are small and schema is simple.
- Parameters: keep temperature low (≈0.0–0.2) for factual stability; set max tokens tight (e.g., 600–900) for cost control.
Tip: If JSON validation fails sometimes, wrap with a “fix to valid JSON” pass (or enable JSON mode where available).
5) Long documents or Map/Reduce
For multi-page or very verbose docs:
- Map step: summarize each page (fixed schema, plus page-level anomalies).
- Reduce step: feed all page summaries (+ your Step-3 facts) into a second prompt that merges into a single final JSON, breaking ties in favor of Step-3 facts.
This avoids token limits and preserves page context.
6) Robust invocation
Use bedrock-runtime with the model you enabled (ids vary by region/provider). Two common patterns:
Non-streaming (simple)
import json, boto3, time
brt = boto3.client("bedrock-runtime", region_name="YOUR_REGION")
def invoke_bedrock(model_id, system_prompt, user_prompt, max_tokens=800, temperature=0.1):
    # Anthropic (Claude) payload shape: on the Bedrock Messages API the system prompt
    # is a top-level "system" field, not a message role.
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "system": system_prompt,
        "messages": [
            {"role": "user", "content": user_prompt}
        ],
        "max_tokens": max_tokens,
        "temperature": temperature
    }
    resp = brt.invoke_model(modelId=model_id, body=json.dumps(body))
    out = json.loads(resp["body"].read())
    # Claude returns a list of content blocks; join their text parts
    text = "".join(block.get("text", "") for block in out.get("content", []))
    return text
Streaming
Stream when you want low latency for UI, otherwise skip (adds complexity).
Note: Exact request/response shapes differ by model family. Keep the modelId and payload format configurable and test in the Bedrock console first.
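Whatever model you call, validate the returned text before trusting it (more on this in the next section). A minimal jsonschema sketch (assumes pip install jsonschema; the schema shown is a trimmed version of the one above):
import json
from jsonschema import validate, ValidationError

SUMMARY_SCHEMA = {
    "type": "object",
    "required": ["doc_id", "summary", "key_fields", "observations", "data_quality"],
    "properties": {
        "doc_id": {"type": "string"},
        "summary": {"type": "string"},
        "key_fields": {
            "type": "object",
            "properties": {"total_amount": {"type": ["number", "null"]}},
        },
        "observations": {"type": "array", "items": {"type": "string"}},
        "data_quality": {"type": "object"},
    },
    "additionalProperties": False,
}

def parse_summary(model_text: str) -> dict | None:
    """Return the parsed summary, or None so the caller can re-prompt the model."""
    try:
        payload = json.loads(model_text)
        validate(instance=payload, schema=SUMMARY_SCHEMA)
        return payload
    except (json.JSONDecodeError, ValidationError):
        return None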
7) Validation & safety rails
- JSON validation: run a quick schema check (pydantic/jsonschema). If invalid, re-prompt the model with: “Return only valid JSON conforming to this schema. Do not add commentary.”
- No-invention rule: instruct “If unsure, use null and add a data_quality note.” Penalize hallucinations by rejecting outputs that add fields you didn’t request.
- Discrepancy handling: when EXCERPTS disagree with FACTS, keep FACTS and add a note like: “Totals area shows 432.90 but extracted total is 423.90 (difference 9.00).”
- PII minimization: don’t pass line-item text unless needed; mask account numbers (e.g., ****1234).
- Cost control: truncate excerpts, avoid full dumps; cap max_tokens; batch summarize.
8) Edge cases
| Edge case | What to do in prompt / output |
|---|---|
| Missing total or due date | Return "total_amount": null / "due_date": null; add a data_quality.missing_fields entry. |
| Multiple totals (invoice vs balance) | Prefer FACTS; if absent, pick the one labeled “Total” near summary; note the alternative. |
| Conflicting currency symbols vs codes | Prefer the currency from FACTS; if excerpt differs, note discrepancy. |
| Handwritten / low DPI | Expect nulls and notes; don’t interpolate from context. |
| Multi-currency line items | Set currency to document currency; add observation like “Line items include mixed currency not used for total.” |
| Multi-doc PDF | If two vendors or two invoice numbers appear, say so in observations and add suspected_errors. |
| Non-English docs | Add “Output in English.” Keep dates numeric (YYYY-MM-DD). |
| Extremely long inputs | Use Map/Reduce; only pass per-page top areas and final consolidation. |
9) Example “good” output
{
"doc_id": "INV-2025-0045",
"summary": "Invoice 0045 from Beta Traders for 89.50 EUR, due on 2025-08-28.",
"key_fields": {
"vendor_name": "Beta Traders",
"invoice_number": "0045",
"invoice_date": "2025-08-14",
"due_date": "2025-08-28",
"currency": "EUR",
"total_amount": 89.50
},
"observations": [
"Early-payment discount mentioned (2%)."
],
"data_quality": {
"missing_fields": [],
"suspected_errors": [],
"confidence_note": "All fields aligned with extracted facts."
}
}
10) Where to store results
- Write JSON summaries to summaries/ and (optionally) a compact CSV view processed/invoice_summaries.csv with doc_id, vendor_name, total_amount, due_date, needs_review, source='bedrock'.
- Join this dataset with invoices_wide.csv for dashboard overlays (e.g., a tooltip with the narrative summary).
🗂️ Step 5: Store Final Data Back in S3
1) Folder layout
Use a layout that separates raw vs extracted vs consumable and supports BI tools:
s3://doc-insights-bucket/
processed/
wide/ # final, dashboard-ready “wide” tables (CSV/Parquet)
kv/ # merged KV outputs across docs
tables/ # flattened tables across docs
expense/ # invoice summaries & line items (if using AnalyzeExpense)
summaries/ # Bedrock JSON summaries
parquet/ # same datasets in Parquet (for Athena/QuickSight at scale)
_staging/ # temp writes (see atomic pattern)
manifests/ # optional manifest/index files
Partitioning (for Athena/scale):
- Prefer hive-style partitions for Parquet:
  processed/parquet/invoices_wide/year=2025/month=09/day=12/part-0001.parquet
- Keep CSVs unpartitioned or lightly partitioned by month to avoid too many small files.
2) Naming strategy
- Document-scoped files:
  - processed/expense/summary/<doc_id>.csv
  - processed/summaries/<doc_id>.json
- Batch/append files: processed/wide/invoices_wide_YYYYMMDD.csv (or append into a single table and de-dupe by doc_id)
- Timestamped artifacts: add an ISO timestamp or a monotonically increasing batch id:
  .../2025-09-12T18-40-12Z_invoices_wide.csv
- Avoid collisions: include doc_id (or a UUID) in per-doc artifacts; never rely only on original filenames.
3) Atomic write pattern
S3 has strong read-after-write consistency and individual PUTs are atomic, but a consumer can still pick up a half-written batch or an unvalidated file if your writer crashes mid-run.
Pattern (sketched in code below):
- Write to processed/_staging/<uuid>/<file.tmp> (or a status=writing tag).
- Verify checksum/row count.
- Copy to the final key (e.g., processed/wide/invoices_wide_20250912.csv).
- Optionally delete the staging file, or keep it with tag status=archived.
Edge cases:
- Large files (>5 GB): use multipart upload; always send Content-MD5 for integrity when feasible.
- Concurrent writers: add a lightweight lock (S3 object processed/_locks/invoices_wide.lock with a short TTL) or write per-writer partitions and compact later.
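A minimal boto3 sketch of the staging → verify → copy flow (bucket and key names reuse the earlier examples; bucket-level SSE-KMS defaults are assumed, so no per-request encryption arguments are shown):
import hashlib
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "doc-insights-bucket"   # assumed bucket name from Step 1

def atomic_put_csv(local_path: str, final_key: str) -> None:
    staging_key = f"processed/_staging/{uuid.uuid4()}/{final_key.rsplit('/', 1)[-1]}.tmp"

    with open(local_path, "rb") as f:
        data = f.read()

    # 1) Write to staging
    s3.put_object(Bucket=BUCKET, Key=staging_key, Body=data, ContentType="text/csv")

    # 2) Verify: compare the local SHA-256 against a fresh read of the staged object
    staged = s3.get_object(Bucket=BUCKET, Key=staging_key)["Body"].read()
    if hashlib.sha256(staged).hexdigest() != hashlib.sha256(data).hexdigest():
        raise RuntimeError("Staged object does not match the local file")

    # 3) Copy to the final key, then clean up staging
    s3.copy_object(Bucket=BUCKET, Key=final_key,
                   CopySource={"Bucket": BUCKET, "Key": staging_key})
    s3.delete_object(Bucket=BUCKET, Key=staging_key)

atomic_put_csv("invoices_wide.csv", "processed/wide/invoices_wide_20250912.csv")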
4) Metadata, tags, and headers
- Server-side encryption: SSE-KMS with your CMK (consistent with earlier steps).
- Content-Type: set correctly (text/csv, application/json, application/parquet).
- Cache-Control: not crucial for data lakes; you can omit or set no-cache for rapidly changing CSVs.
- Object tags (helpful for governance/search): dataset=invoices, stage=processed, doc_id=<id>, status=ready, source=expense/forms/ocr/bedrock.
- Custom metadata (e.g., x-amz-meta-rowcount, x-amz-meta-hash): useful for quick validation.
5) Security & access
- Bucket policy: already denies insecure transport; also allow the QuickSight service role ListBucket and GetObject on processed/*.
- KMS key policy: include the QuickSight role principal (not just IAM permissions) so it can decrypt.
- Cross-account reads (optional): prefer role assumption + bucket policy + KMS grants.
- VPC endpoints: if compute is private, ensure endpoints for S3/KMS are present.
Edge cases:
- QuickSight can list but can’t read → key policy is missing the role principal.
- Using SPICE: watch dataset size limits; switch to Athena over Parquet when CSVs grow.
6) Data integrity & quality gates
Before moving from _staging → processed:
- Schema check: headers exist and match the expected column set.
- Row count > 0 (or allow zero with a note=empty_batch tag).
- Type validation: dates parse (ISO-8601), amounts numeric, currency present.
- Sanity checks: if files include subtotal, tax, total, verify subtotal + tax ≈ total (tolerance).
- Checksum: compute SHA-256 of the file and store it as a sidecar (see the sketch below):
  processed/wide/invoices_wide_20250912.csv.sha256
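A small sketch for writing that sidecar (key names follow the example above):
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "doc-insights-bucket"   # assumed bucket name from Step 1
key = "processed/wide/invoices_wide_20250912.csv"

body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
digest = hashlib.sha256(body).hexdigest()

# Sidecar object next to the data file, e.g. ...csv.sha256
s3.put_object(Bucket=BUCKET, Key=key + ".sha256",
              Body=f"{digest}  {key.rsplit('/', 1)[-1]}\n",
              ContentType="text/plain")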
7) Lifecycle & retention
- Versioning: keep enabled to roll back erroneous overwrites.
- Lifecycle rules:
  - Expire processed/_staging/ after 7 days.
  - Transition processed/*.csv to STANDARD_IA after 30 days.
  - Keep Parquet in STANDARD (or GLACIER if archived) based on access.
- S3 Inventory (optional): enable for daily listings & encryption audit.
8) Manifests & discovery (helps BI tools)
- Manifest file lists the latest batch outputs:
{
"dataset": "invoices_wide",
"updated_at": "2025-09-12T18:40:12Z",
"files": [
"s3://.../processed/wide/invoices_wide_20250912.csv"
],
"row_count": 1243,
"schema": ["doc_id","vendor_name","invoice_number","invoice_date","due_date","total","currency","needs_review","source","month","year"]
}
Save as processed/manifests/invoices_wide.latest.json.
- Glue Catalog (for Athena): create/update a table pointing to your Parquet path with partitions year, month, day.
9) Eventing
- S3 → EventBridge rule on the processed/ prefix:
  - Trigger a Lambda to refresh QuickSight (or kick a SPICE ingestion refresh), as sketched below.
  - Or publish a Slack/Email notification with the manifest summary.
- Idempotency: include run_id or batch_id in the event detail; ignore duplicates in the consumer.
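A minimal Lambda sketch for the refresh path (account and dataset IDs come from environment variables here; whether you refresh per event or debounce a batch is up to you):
import os
import time
import boto3

qs = boto3.client("quicksight")

def handler(event, context):
    """Triggered by the EventBridge rule on processed/; kicks a SPICE refresh."""
    ingestion_id = f"refresh-{int(time.time())}"   # unique per run; reuse a batch_id for idempotency
    qs.create_ingestion(
        AwsAccountId=os.environ["ACCOUNT_ID"],
        DataSetId=os.environ["DATASET_ID"],
        IngestionId=ingestion_id,
    )
    return {"ingestion_id": ingestion_id}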
📊 Step 6: Create a Dashboard in QuickSight
1) Pick your connection path
Path A | S3 CSV (fast start)
- Best for small/medium datasets (≤ a few million rows per file).
- In QuickSight: Datasets → New dataset → S3.
- Point to a single CSV or a manifest JSON (recommended if you have many files).
Path B | Athena + Parquet (scales)
- Best for large data, partitions, or frequent refresh.
- Store Parquet at processed/parquet/invoices_wide/year=YYYY/month=MM/day=DD/...
- In QuickSight: Datasets → New dataset → Athena → select database & table (from Glue Catalog).
Edge cases
- If QuickSight is in a different region than your S3 bucket, prefer Athena to avoid cross-region pain.
- Using SSE-KMS? Ensure the QuickSight service role is in your KMS key policy and bucket policy.
2) Connect to your data
A) S3 with a manifest (many files)
Create processed/manifests/invoices_wide.latest.json:
{
"fileLocations": [
{"URIPrefixes": ["s3://doc-insights-bucket/processed/wide/"]}
],
"globalUploadSettings": {
"format": "CSV",
"delimiter": ",",
"textqualifier": "\"",
"containsHeader": true
}
}
Then in QuickSight S3 connector, paste this manifest file’s S3 URL.
B) Athena (Parquet partitions)
- Make sure your Glue table points to processed/parquet/invoices_wide/ and has partitions year, month, day.
- In QuickSight, choose Direct Query (live) or Import to SPICE (in-memory; faster but size-limited).
Edge cases
- “AccessDenied” when previewing = missing KMS permission or bucket grant.
- No new data after partition upload = run crawler/add partition OR enable partition projection.
3) Prepare the dataset
In the data prep screen:
Map columns
- invoice_number → Text
- invoice_date, due_date → Date (format YYYY-MM-DD)
- total → Decimal
- vendor_name → Text
- currency → Text
- needs_review → Integer/Boolean
- (Optional) ingest_date or processed_at → Date/Time
Create calculated fields (QuickSight expressions)
- Month (bucket): truncDate('MM', {invoice_date})
- Year: extract('YYYY', {invoice_date})
- Overdue (flag): ifelse({due_date} < now(), 1, 0)
- Processing time (days) – pick the source you actually have:
  - If you kept an ingest timestamp: dateDiff('DD', {invoice_date}, {ingest_date})
  - Or from the S3 key date (if you saved it): dateDiff('DD', {invoice_date}, {s3_ingest_date})
- Clean total (fallback) – if some totals are text: parseDecimal({total})
Edge cases
- Mixed date formats? Normalize upstream; otherwise use a calculated field with parseDate() per pattern and coalesce.
- Multiple currencies? Keep a currency dimension; optional FX table join later.
4) Build the visuals
A) Total amount by vendor (Top-N bar)
- Visual: Horizontal bar
- X: sum({total})
- Y: vendor_name
- Sort: by sum(total) desc
- Filter: Top 10 vendors (add a Top N filter)
- Optional: color by needs_review to spot messy vendors
B) Monthly invoice totals (trend)
- Visual: Line chart
- X: truncDate('MM', {invoice_date})
- Y: sum({total})
- Forecast (optional): enable forecast to project the next months
- Tooltip: add countDistinct({invoice_number}) for volume context
C) Average processing time (days)
- Visual: KPI or bar by vendor
- Field: avg(dateDiff('DD', {invoice_date}, {ingest_date}))
- Group by: vendor_name (or monthly bucket)
- Filter: exclude needs_review = 1 if you want only clean rows
D) Bonus quick wins
- Overdue heatmap: vendor_name × month with color sum(ifelse({due_date} < now(), {total}, 0))
- Quality gate: table filtered to needs_review = 1 (your manual review queue)
5) Filters, controls, and parameters
- Date range control: Filter on {invoice_date} with “Relative dates” (e.g., Last 12 months).
- Vendor control: Dropdown on vendor_name (multi-select).
- Confidence/quality toggle: needs_review = 0 by default (checkbox control).
- Currency control (if applicable): filter or parameter to select a single currency.
Edge cases
- If your dataset mixes currencies, don’t sum across different codes in one visual unless you convert first.
6) SPICE vs Direct Query
- SPICE (Import): fastest visuals, schedule refreshes (e.g., hourly). Watch capacity limits.
- Direct Query: always live; best with Athena/Parquet; slightly slower per query.
Recommended
- Start with SPICE for S3 CSVs; switch to Athena + SPICE as you scale.
7) Refresh strategy
- S3 CSV path
  - If you overwrite one key (e.g., invoices_wide_latest.csv), set a refresh schedule (hourly/daily).
  - If you append new files, use a manifest and schedule refresh.
- Athena path
  - After adding partitions, either run a crawler or use partition projection.
  - In QuickSight, schedule SPICE refresh or stay on Direct Query.
Edge cases
- Refresh fails with KMS errors → add QuickSight role to CMK key policy.
- Stale data → you’re writing to new paths; either point QS at a stable key or use manifests.
8) Sharing, RLS, and governance
- Share analysis → Publish dashboard; grant users/groups access.
- Row-Level Security (RLS): upload an RLS mapping dataset (user, vendor_name) to restrict rows per viewer.
- Data dictionary tab: add field descriptions so the team knows what “processing time” means.
10) Minimal recipe
- Connect
  - S3: dataset → S3 → manifest → validate preview.
  - Athena: dataset → Athena → select table.
- Model
  - Types: map date/number fields.
  - Calcs: month = truncDate('MM', {invoice_date}), processing_days = dateDiff('DD', {invoice_date}, {ingest_date}).
- Visualize
  - Bar: sum(total) by vendor_name (Top 10).
  - Line: sum(total) by month.
  - KPI/Bar: avg(processing_days) by vendor.
- Control
  - Date range (last 12 months).
  - Vendor multi-select.
  - needs_review toggle (default 0).
- Operationalize
  - SPICE refresh schedule (hourly/daily).
  - If Athena: ensure partitions & permissions.
💭 Conclusion
This was honestly a fun little fortnight project.
It saved me hours of manual work and gave me clean, visual summaries of everything I needed. By combining S3, Textract, Bedrock, and QuickSight, you can turn unstructured, scanned documents into analysis-ready data and clear dashboards. This pipeline handles extraction, normalisation, summarisation, governed storage, and visualisation, built with security (SSE-KMS), least-privilege IAM, and scalable patterns (async Textract, Parquet/Athena). It replaces manual review with consistent, auditable automation and surfaces vendor spend, monthly trends, and processing KPIs in minutes. From here, you can harden orchestration with Step Functions, add Row-Level Security in QuickSight, and extend the same approach to receipts, contracts, or reports. Small lift, big leverage.
You could easily apply this to:
- Legal contracts
- Academic papers
- Medical prescriptions
- Tax docs
If you try it out and get stuck, drop a comment, and I'd be happy to help.