During the first half of 2022, inflation has pushed up prices in every category of goods and services. Inflation is in the news every day, but without manual bookkeeping it is hard to notice how it affects the daily cost of living. Small increases add up and can make a big difference to a monthly or yearly budget.
How can we track price changes for daily food essentials like milk and bread, the supplies that are bought again and again on every grocery run? That is the main question this blog post tackles.
Overview
To track grocery price trends, I came up with the idea of gathering the pricing data from grocery receipts. All the needed data is there:
- Item name
- Price per item
- Purchase date
- Shop name
Parsing and analysing a collection of grocery receipts provides the data needed for tracking grocery price trends. All we need is an automated pipeline to extract the data from the receipts.
Part 1 of the blog post covers how to set up the receipt ingestion pipeline. Once we have finished, we should have automation that extracts the data from a receipt into JSON objects.
In Part 2, I will present some ideas and results on price trend tracking. The data used in the analysis comes from a nearby supermarket and shows price trends for the grocery items my household buys regularly.
Architecture
The receipt data ingestion pipeline uses a serverless, event-driven workflow. An upload to the S3 input bucket triggers the receipt processing pipeline, which writes the extracted grocery item data in JSON format to the S3 output bucket.
The main AWS services used are:
- Amazon Textract detects and extracts lines of text from printed receipts.
- AWS Lambda is used for parsing the receipt data.
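To give a rough idea of what the parser has to work with: Textract returns its result as a list of blocks, and the detected lines of text are the blocks whose BlockType is LINE. The sketch below assumes a response dict shaped like Textract's text detection output. The sample data is hand-written for illustration, not real OCR output, and extract_lines is a hypothetical helper name, not necessarily what the repo uses.

```python
def extract_lines(textract_response):
    """Return the text of every LINE block, in the order Textract reported them."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written, trimmed-down response for illustration only.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "S-MARKET KALEVA"},
        {"BlockType": "WORD", "Text": "S-MARKET"},
        {"BlockType": "LINE", "Text": "MAITOJUOMA LAKTON RASVATO 1,25"},
    ]
}

for line in extract_lines(sample_response):
    print(line)
```

WORD blocks repeat the same text word by word, so filtering on LINE avoids reading everything twice.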
The following actions are triggered on every receipt upload:
- A receipt .jpg or .pdf is uploaded to the input bucket.
- The trigger Lambda passes the receipt filename and an SNS topic to Amazon Textract.
- When Textract has the OCR data ready, it publishes the Textract JobId to the provided SNS topic.
- The parser Lambda reads the Textract result data, parses the pricing data and writes the result JSON to the output bucket.
- (Part 2) The grocery receipt JSON data is analysed with Amazon QuickSight.
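The hand-off between the two Lambdas in the steps above could look roughly like the following. This is a minimal sketch, not the code from the repo: the function names and the SNS_TOPIC_ARN and TEXTRACT_ROLE_ARN environment variables are assumptions made for illustration. The Textract call itself, start_document_text_detection with a DocumentLocation and a NotificationChannel, is the standard boto3 API for asynchronous text detection.

```python
import json
import os
from urllib.parse import unquote_plus


def build_textract_request(s3_event, sns_topic_arn, role_arn):
    """Turn an S3 'object created' event into keyword arguments for
    Textract's asynchronous text detection call."""
    record = s3_event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "NotificationChannel": {"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    }


def job_id_from_sns(sns_event):
    """Pull the Textract JobId out of the SNS event received by the parser Lambda."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    if message.get("Status") != "SUCCEEDED":
        raise RuntimeError(f"Textract job did not succeed: {message.get('Status')}")
    return message["JobId"]


def trigger_handler(event, context):
    """Hypothetical trigger Lambda entry point: start the Textract job."""
    import boto3  # imported lazily so the helpers above run without AWS access

    request = build_textract_request(
        event,
        os.environ["SNS_TOPIC_ARN"],      # assumed variable name
        os.environ["TEXTRACT_ROLE_ARN"],  # assumed variable name
    )
    response = boto3.client("textract").start_document_text_detection(**request)
    return response["JobId"]
```

The parser Lambda would then pass the JobId to get_document_text_detection and page through the results before parsing them.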
Parsing the receipt data
Receipts from the following stores are supported:
- S-Market
- Prisma
- Sale
- K-Market
- KCM
- K-Citymarket
The image below contains an example of a grocery receipt. Depending on the grocery chain or supermarket, the receipt format may have some nuances, such as using commas instead of dots as the price decimal separator.
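The decimal separator nuance is easy to normalise before any further parsing. A minimal sketch, assuming prices never contain thousands separators (reasonable for grocery items, but an assumption nonetheless); parse_price is a hypothetical helper name:

```python
from decimal import Decimal


def parse_price(text):
    """Parse a price written as '1,25' or '1.25' into a Decimal."""
    return Decimal(text.strip().replace(",", "."))


print(parse_price("1,25"))
print(parse_price("1.59"))
```

Decimal keeps the cents exact; convert to float only when writing the final JSON, if that is the format the analysis tool expects.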
When parsing the extracted receipt data, the following variations of receipt item rows are handled:
GREEN. General information about the purchase: store name and receipt date.
YELLOW. A basic grocery item line has the item name (MAITOJUOMA LAKTON RASVATO, i.e. non-fat lactose-free milk) and the price.
RED. Alennus, a discount entry. A receipt item can have a reduced price for various reasons. This pipeline is for tracking grocery price trends, so we ignore the discount and record the full price.
BLUE. Multiple items on the same entry (EUR/KPL) or goods priced per kilogram (EUR/KG). The total price is on the first line, and the per-item or per-kilogram price is on the second. As with RED items, our aim is to track price trends, so we read the item name from the first line and the item price from the second. That way we can track the price of, for example, 1 kg of bananas rather than the cost of one day's banana purchase.
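The YELLOW, RED and BLUE rules can be sketched as a small line-by-line parser. The exact row layouts below are assumptions made for illustration (the real receipt text depends on the store and on how Textract splits the lines), and the regular expressions and function names are hypothetical, not taken from the repo:

```python
import re

# Assumed layouts: "NAME 1,25" for item rows, "... 1,79 EUR/KG" or
# "... 1,59 EUR/KPL" for the per-unit row that follows a BLUE item.
ITEM_RE = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+[.,]\d{2})$")
UNIT_RE = re.compile(r"(?P<price>\d+[.,]\d{2})\s*EUR/(?:KG|KPL)$")


def to_price(text):
    return float(text.replace(",", "."))


def parse_items(rows):
    """Return a list of {'name', 'price'} dicts from receipt text rows."""
    items = []
    for row in rows:
        row = row.strip()
        if row.upper().startswith("ALENNUS"):
            continue  # RED: skip discounts, keep the full price
        unit = UNIT_RE.search(row)
        if unit and items:
            # BLUE: replace the total with the per-item or per-kg price
            items[-1]["price"] = to_price(unit.group("price"))
            continue
        item = ITEM_RE.match(row)
        if item:
            # YELLOW: plain "name price" row
            items.append({"name": item.group("name"),
                          "price": to_price(item.group("price"))})
    return items


rows = [
    "MAITOJUOMA LAKTON RASVATO 1,25",
    "ALENNUS 0,20-",
    "100% KAURA 6KPL 3,18",
    "2 KPL 1,59 EUR/KPL",
    "BANAANI LUOMU 1,02",
    "0,570 KG 1,79 EUR/KG",
]
for item in parse_items(rows):
    print(item)
```

Rows that match none of the patterns, such as the store header or the totals, are simply ignored in this sketch.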
For the highlighted blocks on the receipt, the pipeline writes the following JSON structure to:
s3://my-grocery-tracking-bucket-output/store=S-MARKET KALEVA PUH 0107671180/20191226-173900.json
The grocery store name is included in the S3 prefix and used for data partitioning. More about that in the second part of the blog.
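Building such a key from the GREEN block data is a one-liner. A small sketch, assuming the store name and purchase time have already been parsed; output_key is a hypothetical helper name:

```python
from datetime import datetime


def output_key(store_name, purchased_at):
    """Build the partitioned object key: store=<name>/<YYYYMMDD-HHMMSS>.json"""
    return f"store={store_name}/{purchased_at:%Y%m%d-%H%M%S}.json"


key = output_key("S-MARKET KALEVA PUH 0107671180", datetime(2019, 12, 26, 17, 39, 0))
print(key)
```

The store=... prefix follows the key=value convention that query tools commonly recognise as a partition column.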
The JSON file contains one receipt item line per JSON object, one object per row:
20191226-173900.json:
{"name": "MAITOJUOMA LAKTON RASVATO", "price": 1.25, "currency": "EUR", "date": "2019-12-26 17:39:00"}
{"name": "100% KAURA 6KPL", "price": 1.59, "currency": "EUR", "date": "2019-12-26 17:39:00"}
{"name": "BANAANI LUOMU", "price": 1.79, "currency": "EUR", "date": "2019-12-26 17:39:00"}
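A file in this one-object-per-line layout can be produced and read back with the standard json module. The sketch below is only an illustration of the format; the function names are hypothetical and the repo may serialise its output differently:

```python
import json
from datetime import datetime


def to_json_lines(items, purchased_at, currency="EUR"):
    """Serialise receipt items as one JSON object per line."""
    date = purchased_at.strftime("%Y-%m-%d %H:%M:%S")
    return "\n".join(
        json.dumps(
            {"name": item["name"], "price": item["price"],
             "currency": currency, "date": date},
            ensure_ascii=False,  # keep characters such as ä and ö readable
        )
        for item in items
    )


def from_json_lines(text):
    """Parse the file content back into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


receipt_items = [
    {"name": "MAITOJUOMA LAKTON RASVATO", "price": 1.25},
    {"name": "BANAANI LUOMU", "price": 1.79},
]
body = to_json_lines(receipt_items, datetime(2019, 12, 26, 17, 39, 0))
print(body)
```

Keeping each item on its own line means the files can be concatenated or streamed without parsing a surrounding array.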
Deploy with CloudFormation
GitHub repo for grocery-receipt-textract
To try out the solution, you can deploy the ingestion pipeline from a CloudFormation template. The template creates all the needed AWS resources in your AWS account.
git clone https://github.com/markymarkus/grocery-receipt-textract.git
cd grocery-receipt-textract
aws cloudformation package --s3-bucket cf-stage-sandbox-markus --output-template-file packaged.yaml --region eu-west-1 --template-file template.yml
aws cloudformation deploy --template-file packaged.yaml --stack-name dev-grocery-pipeline --parameter-overrides InputBucketName=my-grocery-tracking-bucket --capabilities CAPABILITY_IAM
# After the stack finishes, two buckets for receipts and pipeline outputs are created:
# Input = my-grocery-tracking-bucket
# Output = my-grocery-tracking-bucket-output
Next, we will trigger the pipeline with some grocery receipt test data that is also included in the repo. Replace the bucket name with the input bucket name from the previous step:
aws s3 sync test_data s3://my-grocery-tracking-bucket/
# Wait for about 1 min and check the results:
aws s3 ls s3://my-grocery-tracking-bucket-output/
#PRE store=K-Market Domus/
#PRE store=S-MARKET KALEVA PUH 0107671180/
Conclusion
That's it! We have successfully created a grocery receipt ingestion pipeline. In Part 2, we will put the pipeline into action and see whether the extracted price data shows any hints of inflation.