🧑🎓 OSINT: "What is Open-Source Intelligence?"
According to sans.org :
"Open-Source Intelligence (OSINT) is defined as intelligence produced by collecting, evaluating and analyzing publicly available information with the purpose of answering a specific intelligence question."
In this post, I'll walk you through an experiment I ran.
My main goal is to show how apparently naive data can lead to valuable intelligence, help discover new kinds of information at scale, and reveal what strategic insights we can get out of it.
"[...] information does not equal intelligence. Without giving meaning to the data we collect, open-source findings are considered raw data. It is only once this information is looked at from a critical thinking mindset and analyzed that it becomes intelligence."
Finally, we'll focus on
"[...] finding meaningful information that is applicable to the intelligence question and being able to provide actionable intelligence[...]."
🔁 Intelligence cycle
We'll implement (and share the source code on Kaggle) a full "Intelligence cycle" that takes Open Data as input and delivers a brand new, enhanced and structured Open Data dataset, with an ollama-based GenAI processing step in the middle... and with all the code and the LLMs used publicly available.
🍿 For the impatient
💭 About Standard Occupation Classifications
The SOC system categorizes jobs in a standardized way to:
- Help governments and organizations track employment trends
- Understand skill needs
- Shape workforce policies
By using a consistent framework, SOC data makes it easier to compare job markets, plan training programs, and respond to shifts in the economy, benefiting both policymakers and employers.
So... a lot of strategic insights, which made me want to link these codes to other datasets.
🇫🇷 About the French ROME code
In France, the ROME (Répertoire Opérationnel des Métiers et des Emplois) system does something similar, classifying jobs like "Développeur Informatique" (Software Developer) under specific categories to match skills with job needs and support workforce planning. See 🗂️ Codes ROME database for more.
🤔 About acronyms
Every enterprise has its own set of acronyms. I find them very useful because, in a way, they reveal an organization's points of interest and activities.
Also, when you're a new recruit, you may need a reference to understand the common jargon, the documents and your colleagues in meetings (which was my case).
OPT-NC publicly shared its acronyms as an Open Data dataset.
In general an acronym has:
- Very few letters (let's say SaaS for example)
- A sentence that explains its meaning, which is very specific
So a collection of them embeds a lot of information, especially when they are specific to your activities.
💡 The idea : delegate to an LLM
Being able to relate acronyms to job classifications should (that's my hypothesis) give insights into activities.
☝️ But with an ever-increasing number of acronyms and activities, it becomes much more interesting to delegate relationship creation to an LLM.
🎯 Our goal
As output, we want a traditional, well-structured database with integrity constraints: a ready-to-use duckdb database (and csv) that links acronyms and activity codes.
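As a rough illustration of what "integrity constraints" could mean here, the sketch below declares a link table whose foreign keys reject any code that is absent from the reference table. The table and column names (acronym, rome_code, acronym_rome) and the sample codes are hypothetical, not those of the published dataset, and Python's built-in sqlite3 stands in for duckdb only so that the snippet runs without any extra install; the DDL is plain SQL.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the published dataset's.
SCHEMA = """
CREATE TABLE acronym (
    acronym TEXT PRIMARY KEY,
    meaning TEXT NOT NULL
);
CREATE TABLE rome_code (
    code  TEXT PRIMARY KEY,
    label TEXT NOT NULL
);
CREATE TABLE acronym_rome (
    acronym TEXT NOT NULL REFERENCES acronym(acronym),
    code    TEXT NOT NULL REFERENCES rome_code(code),
    PRIMARY KEY (acronym, code)
);
"""


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign keys enforced."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    con.executescript(SCHEMA)
    return con


def try_link(con: sqlite3.Connection, acronym: str, code: str) -> bool:
    """Insert an acronym/code link; return False if a constraint rejects it."""
    try:
        con.execute("INSERT INTO acronym_rome VALUES (?, ?)", (acronym, code))
        return True
    except sqlite3.IntegrityError:
        return False


con = build_db()
con.execute("INSERT INTO acronym VALUES ('SaaS', 'Software as a Service')")
# 'X0001' is a made-up placeholder, not a real ROME code.
con.execute("INSERT INTO rome_code VALUES ('X0001', 'Placeholder job family')")
```

With this in place, `try_link(con, "SaaS", "X0001")` is accepted, while a link to a code missing from rome_code is refused by the database itself rather than by application code.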
Here is how I'll give OSINT a try:
- Preparation : Transform existing open data into well structured datasets
- Collection : Make all required datasets available within a single Notebook
- Processing : Build relationships thanks to an LLM and structured outputs
- Analysis : Do some reporting on the output data with simple SQL and dataviz
- Dissemination : Deliver the output data as a Kaggle duckdb dataset
🦾 All about relationship automation
The main idea of this prototype is to delegate the hard stuff to an LLM, relying only on its core knowledge :
- No RAG
- No dedicated fine-tuned LLM
- No Pydantic to ensure well structured outputs
👉 Here, we'll focus on pure prompting over out-of-the-box LLMs.
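To make "pure prompting" a bit more concrete, here is a minimal sketch that asks a local Ollama server for JSON and parses the reply defensively, since no schema library validates the structure. It assumes Ollama's REST endpoint /api/generate on the default port 11434. The prompt wording, the rome_codes key, the helper names and the "one letter plus four digits" shape check are my own illustrations, not what the Notebook uses.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

# Illustrative prompt, not the one used in the Notebook.
PROMPT_TEMPLATE = (
    "You map an acronym to French ROME job codes.\n"
    'Answer only with JSON shaped like {{"rome_codes": ["<code>", ...]}}.\n'
    "Acronym: {acronym}\n"
    "Meaning: {meaning}\n"
)

# Assumed code shape: one capital letter followed by four digits.
CODE_SHAPE = re.compile(r"^[A-Z]\d{4}$")


def build_payload(model: str, acronym: str, meaning: str) -> dict:
    """Build the request body for a single acronym."""
    return {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(acronym=acronym, meaning=meaning),
        "format": "json",  # ask Ollama to constrain the reply to JSON
        "stream": False,   # one JSON object instead of a token stream
    }


def parse_codes(raw: str) -> list[str]:
    """Return well-shaped codes from the model's text, or [] if unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    codes = data.get("rome_codes", []) if isinstance(data, dict) else []
    if not isinstance(codes, list):
        return []
    return [c for c in codes if isinstance(c, str) and CODE_SHAPE.match(c)]


def ask(model: str, acronym: str, meaning: str) -> list[str]:
    """Query the local server (requires a running Ollama instance)."""
    body = json.dumps(build_payload(model, acronym, meaning)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return parse_codes(json.load(resp)["response"])
```

Switching models is then a matter of changing the `model` argument. Note that the shape check only filters malformed strings: a well-shaped code can still be invented, so checking against the reference database remains necessary.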
- Import the acronyms dataset 📘 Lexique des acronymes de l’OPT-NC
- Import the SOC/ROME codes dataset 🗂️ Codes ROME database
- Build a customized ollama model with a dedicated PROMPT to get structured json output : OPT-NC : Acronymes genai augmentés
- For each acronym, get a json collection of matching SOC/ROME codes from this custom model
- Load the json into a staging table in duckdb
- Check & remove hallucinations : check integrity between generated SOC codes and the reference database
- Share the generated data as a dataset : OPT-NC acronyms Enhanced by Open Source AI
- Enjoy generated knowledge: perform some analysis on the output database, see Kaggle Notebook OPT-NC acronyms genai exploration
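The "load into staging, then check and remove hallucinations" steps boil down to an anti-join between the staged LLM output and the reference codes. Below is a minimal sketch under stated assumptions: table and column names are hypothetical, every code in the sample data is made up, and Python's built-in sqlite3 stands in for duckdb so that the snippet runs with no extra install. The SQL is standard and should carry over largely unchanged.

```python
import json
import sqlite3

# Made-up stand-ins for the reference codes and the raw model replies.
REFERENCE_CODES = ["X0001", "X0002", "X0003"]
LLM_REPLIES = {
    "AAA": '{"rome_codes": ["X0001", "X0002"]}',
    "BBB": '{"rome_codes": ["X0003", "Q9999"]}',  # Q9999 is not in the reference
}

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rome_ref (code TEXT PRIMARY KEY)")
con.execute("CREATE TABLE staging (acronym TEXT, code TEXT)")
con.executemany("INSERT INTO rome_ref VALUES (?)", [(c,) for c in REFERENCE_CODES])

# Load: one staging row per (acronym, proposed code) pair.
for acronym, raw in LLM_REPLIES.items():
    for code in json.loads(raw)["rome_codes"]:
        con.execute("INSERT INTO staging VALUES (?, ?)", (acronym, code))

# Hallucinated: staged codes with no match in the reference table.
hallucinated = con.execute(
    """
    SELECT s.acronym, s.code
    FROM staging AS s
    LEFT JOIN rome_ref AS r ON r.code = s.code
    WHERE r.code IS NULL
    ORDER BY s.acronym, s.code
    """
).fetchall()

# Valid: everything that survives the inner join.
valid = con.execute(
    """
    SELECT s.acronym, s.code
    FROM staging AS s
    JOIN rome_ref AS r ON r.code = s.code
    ORDER BY s.acronym, s.code
    """
).fetchall()

print(f"valid={len(valid)} hallucinated={len(hallucinated)}")
```

Only the rows in `valid` would move on to the final link table; counting both result sets per model gives figures of the same kind as the valid / hallucinated columns in the benchmark.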
⚖️ Accuracy ratio and LLMs benchmark
With this approach, it is then possible to switch between LLMs and benchmark them just by changing a parameter and waiting for the Notebook to finish on Kaggle:
LLM | Hallucinated ROME Codes | Valid ROME Codes | Duration |
---|---|---|---|
reflection | 25 | 127 | 3h47' |
llama3.1:70b | 31 | 137 | 4h05' |
llama3.1:8b | 71 | 40 | 4' |
nemotron | 41 | 194 | 6h08' |
qwen2.5:72b | 52 | 146 | 6h15' |
nous-hermes2-mixtral:8x7b | 7 | 23 | 8h53' |
llama3.3 | 24 | 199 | 5h54' |
Next, we can compute the "Accuracy ratio", defined as (valid codes) / (valid codes + hallucinated codes) :
LLM | Hallucinated Codes | Valid Codes | Accuracy Ratio (%) |
---|---|---|---|
llama3.3 | 24 | 199 | 89.24% |
nemotron | 41 | 194 | 82.55% |
qwen2.5:72b | 52 | 146 | 73.74% |
llama3.1:70b | 31 | 137 | 81.55% |
reflection | 25 | 127 | 83.55% |
llama3.1:8b | 71 | 40 | 36.04% |
nous-hermes2-mixtral:8x7b | 7 | 23 | 76.67% |
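For anyone who wants to re-derive the percentages, here is a small sketch that recomputes the ratio from the counts in the table above. The function and variable names are mine; only the numbers come from the benchmark.

```python
# (hallucinated, valid) counts copied from the benchmark table above.
RESULTS = {
    "llama3.3": (24, 199),
    "nemotron": (41, 194),
    "qwen2.5:72b": (52, 146),
    "llama3.1:70b": (31, 137),
    "reflection": (25, 127),
    "llama3.1:8b": (71, 40),
    "nous-hermes2-mixtral:8x7b": (7, 23),
}


def accuracy_ratio(hallucinated: int, valid: int) -> float:
    """Share of generated codes that exist in the reference, in percent."""
    total = valid + hallucinated
    return 100.0 * valid / total if total else 0.0


# Rank models by ratio, best first.
ranking = sorted(
    RESULTS.items(), key=lambda kv: accuracy_ratio(*kv[1]), reverse=True
)

for name, (hallucinated, valid) in ranking:
    print(f"{name:28s} {accuracy_ratio(hallucinated, valid):6.2f}%")
```

Sorting on the ratio alone puts reflection (83.55%) just ahead of nemotron (82.55%), so a ranking that also values the raw number of valid codes can differ from this one.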
As my goal is to get as many valid ROME codes as possible, here are the three best LLMs in my case:
- 🥇 llama3.3:70B : This model outperforms the others with 199 valid codes and the highest accuracy ratio (89.24%). It strikes an excellent balance between extracting the largest number of valid codes and minimizing hallucinations, making it the top performer for this task.
- 🥈 Nemotron (nvidia/Llama-3.1-Nemotron-70B-Instruct) : Close behind, this model delivers 194 valid codes and a strong accuracy ratio (82.55%). It is particularly well-suited for tasks requiring comprehensive coverage and moderate control over hallucinations.
- 🥉 qwen2.5:72b : While it ranks third with 146 valid codes, its accuracy ratio (73.74%) is lower than that of the top two. This model is effective but generates more hallucinated outputs, which might require post-processing.
💰 Benefits
For example, it is then possible to drill down into categories:
📑 Resources : Notebooks and datasets
- 🦾 Notebook that links acronyms to ROME codes : OPT-NC : Acronymes genai augmentés
- 📚 Dataset OPT-NC acronyms Enhanced by Open Source AI
- 📊 Analysis Notebook : OPT-NC acronyms genai exploration
Core datasets:
- 📘 Lexique des acronymes de l’OPT-NC
- 🗂️ Codes ROME database