In Part 1, we implemented ingestion pipeline for grocery receipts and if you have followed through the instructions, you should now have test data from two grocery receipts extracted to JSON files in S3.
In this second part, I'm going to show you basic steps on how to run analysis on the extracted grocery data. AWS services we are going to use are AWS Athena and Amazon Quicksight.
Setup Athena
First we are creating Athena table for extracted data. Data schema is a simple one. Just remember to change the bucket name to match your case and import partitioned data with MSCK REPAIR.
CREATE EXTERNAL TABLE grocery_items (
name string,
price float,
currency string,
unit string,
date string
)
PARTITIONED BY (store string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-grocery-tracking-output/'
MSCK REPAIR TABLE grocery_items
Query the data with SELECT * FROM grocery_items
and you should get 37 rows of grocery data. Nice, that means the solution is end-to-end working.
Next we need more data.
Backfill the purchase history
If you were to start collecting receipts from this day onwards, it can be weeks or more likely months before there is a price change on the item. Luckily it's possible to get grocery purchase receipt history to see price changes leading to the current point.
S-Group and Kesko, biggest players in Finnish grocery market, both have applications with opt-in services for digitised receipts(apps: S-Mobiili, K-Ruoka). Both applications can export receipts in PDF. All we need to do is to shovel the PDF receipts into the ingestion pipeline and visualise the results. You can also scan/photo receipts into JPG and that is working as well.
Some words about the data
In analysis, I have treated grocery item name as unique identifier. For example,
KAURAJUOMA
(Oat Milk) in following trends means The Oat Milk. It is very possible that actual receipt lineKAURAJUOMA
can refer to BrandA Kaurajuoma(1.5e) or BrandB Kaurajuoma(1.95e). In such case, same item name in the data would have two different prices. That is very possible anomaly in the price item analysis if your buying pattern is not consistent.If there is a data point missing on the graph that doesn't mean there has not been a price change. It just indicates there is no data on my receipts about the price change.
Query the grocery data
I have used S-Group's app and Omat Ostot service for about two years and on the following examples I am showing some trends based on that data. Basically I have first exported Omat Ostot - grocery receipts to PDF and then synced everything to the ingestion pipeline input - bucket.
I have about 6000 rows of data in grocery_items
table.
First query is to check which grocery items are occurring most in the data. Which items are purchased most frequently:
SELECT name,COUNT(name) AS item_line_count FROM items GROUP BY name ORDER BY item_line_count DESC
1 KAURAJUOMA 161
2 LUOMU RASVATON MAITO 125
3 MAITOJUOMA LAKTON RASVATO 124
4 PAPRIKA PUNAINEN IRTO 87
5 PEHMEÄ MAITORAHKA 84
6 RUISPUIKULAT 79
7 KURKKU SUOMI 79
8 BANAANI LUOMU 74
9 TOMAATTI SUOMI 73
10 AVOKADOPUSSI 70
Huh, no beer on that list. Just milk, rye bread and vegetables.
Visualize the price trends
Next, I created a dataset from Athena grocery data to Amazon Quicksight. Following graphs are created from the dataset, data filtered to S-Market Kaleva supermarket.
First price trends are for dairy and meat products:
Next, vegetables and fruits. Seasonal price fluctuation is clearly the trend with the fresh vegetables. Bananas though..
Cereal products, bread, flour, misc. Coffee was a first product that caught my attention of the price increases early spring.
Based on these, talk about grocery inflation 2022 is a real deal. Prices are increasing in almost every category of tracked grocery products. It is interesting to see how the trends will be a few months down the line.
Conclusion
By following the steps detailed in these posts, you can implement described receipt ingest pipeline and start doing analyses on the data.
For this quick exercise, I wanted to just focus on the grocery inflation. Another interesting angle for the data would be to analyse prices between grocery chains. That would require understanding of the grocery range in each market and how to match references in the data for meaningful comparison.. And much more receipts.
Thanks for reading!
Top comments (0)