This is a submission for the Gemma 4 Challenge: Build with Gemma 4
🤔 About the power of collections and our ability to compare things
First a bit of philosophy.
Did you notice how we tend to align things, tend to shape things so they can be aligned, compared (based on a common attributes like color, weight,...).
Comparing objects is much easier when they share common structure, then we can use attributes to get more knowledge, produce KPIs, make clever choices and put things in evidence.
👉 Well the same happens with machines : it's much much easier to compare and manage documents when they share a same structure.
This is the very core idea that motivated this work to explore how Open Source AI could help in a very pragmatic way... and feel the opportunities it opens with concrete prototypes.
🙋 What I Built
I've built a whole real life and live data pipeline that takes as input Open Data Public Sector Job offers (dataset/avis-de-vacances-de-poste-avp-drhfpnc) :
csv- Raw
PDFs
Then,
- From a dedicated GH repo
adriens/avpsI've prepared a whole structured mix ofmdthanks tocsv - Then from the GH Action I did transform brut raw PDFs with
pypi.org/marker-pdfintomarkdown - I ended to publish a dedicated Zensical
gh-pageswebsite :adriens.github.io/avps
Next, this is where things go really interesting : I wanted to be able to compare job offers the one against the others... but markdown were far too much different the one from others :
- Not the same number of sections
- Not the same section titles : hard skills, soft skills, missions,...
- Not the same levels of sections
- Not necessarly itemized the one
- Not the same style at all (section levels, CAPITALs, email, cities...)
... which made it very hard... or even impossible to compare them amongs the others... or even crazier : put them in a traditional SQL structured database.
👉 This is where gemma4:2b comes in to create a very well and consistent set of markdowns that can then be used for various use cases :
-
Create very well structured
ePubto read job offers on the go (and docx) - Create a very clean and well organized PDF : very easy to load in assistants, print or to drop in any assistant
-
Deliver structured data with clean
jsonfiles -
Make a
duckdbdatabase and performSQLon the data by using the now well structured markdowns, which made it possible to open unprecedented and exciting reporting opportunities (here induckdb) - Share all this as a dataset on Kaggle
SELECT '--- RÉPARTITION DES COMPÉTENCES PAR DOMAINE ---'
as titre_report;
SELECT domaine, count(*) as nb_competences
FROM savoir_faire
GROUP BY 1 ORDER BY 2 DESC;
🎯 Problems it solves
In input we really had very various kind of PDF documents, and no structured tabular data, now, they both are delivered as :
- Well formated and structured markdown
- A real database that embeds data as tables and views for advanced SQL reporting and charting
- Ready to use and perfectly well structred ePub and PDF documents, very easy for LLMs to understand
🤗 Experience it creates
The experience is rather an data experience as thanks to data normalisation and standardization we can load and compare job offers, which make job search and indexing much much more efficient, whatever the input.
Last but not least, using gemma4:2b-it proves that great things can be achieved even with small resources and that well prepared data opens so many intelligence opportunities, without having to deal with frontier models as "the output I got is good enough".
🍿 Demo
💰 The benefits : then and now
Below the benchmark of markdown before and after
⚖️ Benchmark : marker-pdf vs. marker-pdf ➕ gemma4:e2b
Below some results:
Structure consistency:
📊 Analytics on top of database
One the well-structured json could be produced from the markdown I could efficiently load them into a duckdb database and do some reporting see AVPS DRHFPNC - Les pdf en SQL avec duckdb Kaggle notebook :
📜 Code
- Kaggle Notebook : IA AVPs DRHFPNC Structurés
- Kaggle dataset :
avps-nouvelle-caldonie-structurs/data - GitHub repo
adriens/avps - Zensical website :
adriens.github.io/avps - Kaggle notebook that shows how to load structured
jsoninto aduckdbdatabase : AVPS DRHFPNC - Les pdf enSQLavecduckdb
🎁 Goodies
💡 How I Used Gemma 4
I chose google/gemma-4/transformers/gemma-4-e2b-it from kagglehub as I had a huge amounf of data to load (all New-Caledonia ones) and a restricted amount of time on Kaggle as well as small GPUs.
Also my intent was to be able to run this code one day onPrem on my very own hardware so I decided to stay as little as possible.
🤔 What remains to do...
Try to:
- Add an evaluation phasis to check output consistency
- Try to switch to CPU mode so the Notebook can be scheduled without exceeding the maximum Kaggle window
- Use
gemma-4-E4Band benchmark output quality -
Produce native
adocwith proper annotations -
First produce
json(and more standardized values, enums,...) then re-generatemd/adocfrom it


Top comments (1)
Submitted 🥳