Data is king, but not knowing what data you have, or its nature and characteristics, is what keeps businesses at a loss. I implemented a RAG system to answer questions over a lengthy PDF document, but before feeding it to the model, I spent most of my time doing Exploratory Data Analysis (EDA).
This helped me go back into my data and clean it better, because I now know where most of the confidence scores lie, the areas I need to work on, and, most importantly, where humans need to spend their time.
I believe in saving cost and time by ensuring we focus on the most important tasks and let computers handle the ones without nuances. How would you know what those nuances are if you don't dig into your data and uncover its relationships, attributes, strengths, and anomalies? In a world where everyone is rushing to "AI," do they really have an in-depth understanding of their data and how to leverage it to thrive?
Your data is trying to tell you something. Go in and listen to it, ask it questions, and you'll gain insights.
Here are the statistics of the chunking process (I explain them in the next paragraph):
I enriched the original chunks with more data:

Looking at the enrichment report:
I broke the document into 434 pieces. First, I ran rule-based classification; chunks where the rules weren't confident were flagged for LLM processing. More than half (58.8%) fell into this category, requiring AI calls that provide their own confidence scores. Now imagine a business or individual with needs at 50x this scale. That's over 12,000 AI calls flagged before even running them. Even with batching, you're processing significantly more tokens through the LLM, and costs add up fast.
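
To make the routing concrete, here is a minimal sketch of a rule-first classifier with an LLM fallback. The 0.8 threshold, the rule patterns, and the function names are my illustrative assumptions, not the actual code from this project.

```python
# Sketch: rule-first classification with LLM fallback.
# The threshold, rule patterns, and helper names are illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.8

def classify_with_rules(chunk_text: str) -> tuple[str, float]:
    """Cheap, deterministic rules that return a (label, confidence) pair."""
    text = chunk_text.lower()
    if "|" in chunk_text and any(ch.isdigit() for ch in chunk_text):
        return "spec_table_row", 0.90
    if text.startswith(("system", "section")):
        return "system_section", 0.85
    return "unknown", 0.30  # rules aren't confident

def route_chunks(chunks: list[str]) -> tuple[list, list]:
    """Keep confidently rule-classified chunks; flag the rest for the LLM."""
    rule_handled, llm_flagged = [], []
    for chunk in chunks:
        label, confidence = classify_with_rules(chunk)
        if confidence >= CONFIDENCE_THRESHOLD:
            rule_handled.append((chunk, label, confidence))
        else:
            llm_flagged.append(chunk)
    return rule_handled, llm_flagged
```

Only the flagged list ever reaches the LLM, which is where the 58.8% figure above comes from.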
Without EDA, I might have fed all 434 chunks through the LLM, paying for processing that my rules already handled confidently. The 41.2% success rate from simple rules showed me what was already working; that's nearly half the workload I could have unnecessarily sent to AI processing. At 50x scale, that's nearly 9,000 wasteful LLM calls avoided.
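
The scale projection is simple enough to sanity-check in a few lines; the 50x multiplier mirrors the scenario above, and the rates come straight from the report.

```python
# Back-of-envelope projection of LLM-call volume at scale.
total_chunks = 434
llm_flag_rate = 0.588   # chunks the rules couldn't classify confidently
rule_rate = 0.412       # chunks handled confidently by rules alone
scale = 50

flagged_at_scale = round(total_chunks * scale * llm_flag_rate)  # ~12,760
avoided_at_scale = round(total_chunks * scale * rule_rate)      # ~8,940

print(f"LLM calls flagged at {scale}x scale: {flagged_at_scale:,}")
print(f"LLM calls avoided by rules: {avoided_at_scale:,}")
```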
Here is another insight: two categories (system_section and spec_table_row) made up 51% of everything. With 41.2% already automated through rules, the next step is to analyse why these two types trigger the LLM fallback and to build better rules that handle them confidently. However, if that's difficult or genuinely a waste of effort, let the LLM classify them and focus human review on the low-confidence results, where oversight catches nuances best.
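
One way to carve out that human-review slice from the enrichment report is sketched below; the record fields and the 0.6 cutoff are hypothetical, not this project's actual schema.

```python
# Sketch: route only low-confidence LLM classifications to human review.
# Field names and the 0.6 cutoff are assumptions for illustration.

HUMAN_REVIEW_CUTOFF = 0.6

def needs_human_review(record: dict) -> bool:
    """Flag LLM-classified chunks whose confidence is too low to trust."""
    return record["source"] == "llm" and record["confidence"] < HUMAN_REVIEW_CUTOFF

enriched = [
    {"chunk_id": 12, "label": "system_section", "source": "rule", "confidence": 0.90},
    {"chunk_id": 57, "label": "spec_table_row", "source": "llm",  "confidence": 0.45},
]

review_queue = [r for r in enriched if needs_human_review(r)]
print(f"{len(review_queue)} chunk(s) routed to human review")
```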
Not all problems require LLM calls. Some genuinely just need better engineering. With EDA, we understand what data we have and what it can do, and we deliver more value to users and stakeholders.
