I was doing investigative research and found a crucial piece of information in a single article. I then used an LLM to perform Deep Research on the same topic but, to my surprise, it returned a report claiming the argument was unsubstantiated.
I used an open-source LLM to perform the same Deep Research and the result was the same. I then attempted to direct the LLM to the source whilst providing a scope, but it was unable to find the information. Instead, the LLM defaulted to searching for higher-authority sources, such as official reports, effectively dismissing the original article's finding as an unverified outlier.
⚙️ The Flaw of Statistical Text Matching
This fundamental reliance on statistical text matching is a flaw in LLMs' ability to perform genuine deep research. Consequently, the model is biased towards articles and information that are frequent, common, and therefore treated as factual; rare information is often lost from the research altogether.
Whilst multimodal models are better at processing varied inputs, the issue often stems from the initial filtering. My research suggests that much niche information, even when accessible to a crawler, is disregarded due to poor Search Engine Optimisation (SEO) or poor indexing. The underlying LLM's research mechanism then filters this out due to low apparent authority or link popularity, reinforcing the LLM's bias towards common knowledge.
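To make that filtering step concrete, here is a deliberately simplified sketch of the idea: a source-quality cut applied before synthesis. The URLs, authority scores, and threshold are invented for illustration and do not reflect any specific provider's pipeline.

```python
# Illustrative only: a toy "source quality" filter applied before synthesis.
# The URLs, authority scores, and threshold below are invented assumptions.
search_results = [
    {"url": "https://official-report.example/2021", "authority": 0.92, "contains_rare_fact": False},
    {"url": "https://niche-blog.example/field-notes", "authority": 0.11, "contains_rare_fact": True},
]

AUTHORITY_THRESHOLD = 0.5  # hypothetical cut-off for "trustworthy" sources

retained = [r for r in search_results if r["authority"] >= AUTHORITY_THRESHOLD]

# The only page that actually contains the rare fact never reaches the model,
# so the final report can sincerely claim the argument is "unsubstantiated".
print([r["url"] for r in retained])
```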
Interestingly enough, smaller models are better at returning facts. This may be because the pre-trained knowledge of LLMs can interfere with their ability to accept new information. This is why fact-checking and proper research are far more reliable when there is human input and a RAG-based mechanism that relies on more than just text-based matching. The key lies in structured knowledge matching, for which a Named Entity Recognition (NER) system is a highly capable tool.
NER systems convert unstructured text into explicit facts and relationships, which can then be searched directly (a knowledge-based match). NER-based RAG is like asking a human to first annotate the text with every key fact and then query those annotations; it handles context far better. It does, however, require human input to set up and manage effectively.
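As a rough illustration of the NER layer, here is a minimal sketch using spaCy and its small English model (`en_core_web_sm`, assumed to be installed); the sample article text and the query are invented for the example.

```python
# Minimal sketch: turn unstructured text into searchable, structured facts.
# Assumes spaCy is installed along with `en_core_web_sm`
# (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_facts(text: str) -> list[tuple[str, str, str]]:
    """Return (entity, label, source sentence) triples extracted from the text."""
    doc = nlp(text)
    return [
        (ent.text, ent.label_, sent.text.strip())
        for sent in doc.sents
        for ent in sent.ents
    ]

def search_facts(facts: list[tuple[str, str, str]], entity_query: str) -> list[tuple[str, str, str]]:
    """Knowledge-based match: look up an entity directly, not a fuzzy vector neighbour."""
    return [fact for fact in facts if entity_query.lower() in fact[0].lower()]

article = (
    "Acme Ltd reported the discharge figures to the regulator in March 2021. "
    "The follow-up audit, however, was never published."
)

facts = extract_facts(article)
print(search_facts(facts, "Acme"))  # facts anchored to the entity, with their source sentences
```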
🔗 The Role of Context Dependency
Context dependency essentially means the 'fact' only exists as a full truth when the context is included. The standard LLM's automated process often loses this nuance in the search, retrieval, and synthesis stages.
For example, an article may state information implicitly rather than explicitly. This is where human attention to detail and reasoning are important for discernment. The model struggles because it operates on token probability (what word is most likely to come next) and semantic similarity (how closely a retrieved text snippet's vector resembles the vector of your prompt).
The LLM cannot perform the final step of human reasoning to connect: Premise A (in sentence 1) + Condition B (in sentence 3) → Explicit Fact C (missing piece).
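Here is a small sketch of why similarity-based retrieval alone cannot make that connection, assuming the `sentence-transformers` library and its `all-MiniLM-L6-v2` model; the query and snippets are invented for illustration. Each snippet is scored against the query independently, so even if both are retrieved, nothing in the retrieval pipeline combines Premise A and Condition B into Fact C.

```python
# Assumes `sentence-transformers` is installed; the query and snippets are invented examples.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Did the facility exceed its licensed discharge limit in 2021?"
snippets = [
    "The operating licence caps discharge at 40 cubic metres per day.",        # Premise A
    "Site logs from March 2021 record an average of 55 cubic metres per day.", # Condition B
]

query_emb = model.encode(query, convert_to_tensor=True)
snippet_embs = model.encode(snippets, convert_to_tensor=True)

# Each snippet receives an independent similarity score against the query.
# Ranking stops here; no step reasons that A + B together answer the question.
print(util.cos_sim(query_emb, snippet_embs))
```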
Human researchers are good at:
- Inference: Understanding what is implied, not just what is explicitly written.
- Context: Distinguishing between primary data and additional examples.
- Scepticism: Knowing which article to trust and how to connect its logic, even if the article is poorly written or inexplicit.
⚖️ Why Not Just Use Gemini? It Works Fine.
When I asked an LLM to compare models, one of the key factors that caused Gemini 2.5 Pro to rank highly was its ability to perform Source Quality Filtering. This is because Gemini leverages Google Search for real-time grounding, which means it has the best mechanism for filtering out misinformation, low-quality sites, and low-authority sources. This, however, directly leads to the risk of missing rare or new information.
I have ranked the models and their research mechanisms by their ability to find niche information during deep research, considering factors such as access to specialised data, customisability, and inherent biases in training data. GPT-4o ranks third, mainly due to its smaller context window of 128K tokens compared with Gemini 2.5 Pro's 1M, which is a limiting factor when dealing with large or numerous documents in deep research.
Based on customisation, context depth, and inherent bias in deep research for new information, my final ranking of the current solutions is:
| Model/Approach | Advantage |
|---|---|
| Custom RAG/NER | Guaranteed Accuracy (Low Hallucination) & specialised jargon understanding. |
| Gemini 2.5 Pro | Maximum Synthesis over 1M tokens of search results for rare connections. |
| Claude 3 Opus | Superior Reasoning and self-correction to vet complex, potentially niche findings. |
| GPT-4o | Autonomy & Multimodality (e.g., extracting data from niche charts/images found online). |
The choice depends on priorities: reliability and ease of use favour the general-purpose models, while surfacing rare or niche information favours a custom RAG/NER pipeline.
Top comments (11)
This is a fascinating breakdown of where LLMs fall short in genuine deep research. The point about statistical text matching bias really stood out—how models tend to dismiss rare but potentially crucial findings in favor of what’s most common or “authoritative.” It makes me wonder: as RAG + NER systems become more widely adopted, could they finally bridge this gap between surface-level synthesis and true contextual inference? Also curious how human-guided annotation could scale in practice without losing the depth of reasoning you’ve highlighted here.
Yes, NER can help LLMs create associations or form a mental picture closer to that of humans. Human-guided annotation does not necessarily need to scale because a small NER model can train an LLM. I’ll explore the NER layer more in another post!
I find this to be particularly troubling socially. If you use Deep Research for anything political, you will inherently only receive the establishment narrative, leading to technocratic groupthink.
Correct, the LLM's deep research strategy, based on frequency and authority, makes it inherently biased. This appeal to authority is a logical fallacy!
100%! Which is particularly worrying given how our institutions are so partisan nowadays. I don't think anyone in their right mind would consider either the UN or the US State Department to be impartial non-partisan sources, but the models certainly think they are. Humans know that most of these "institutions" are activists masquerading as experts, but when the lay person treats ChatGPT as a "meta-expert" and ChatGPT launders activism as fact, what happens to the nature of dissent away from institutions?
That's a fair statement. LLMs regard those institutions, including all governmental authorities, as high authority; models such as ChatGPT have a statistical bias, but Gemini seems to provide more balanced, centrist views on political topics despite its appeal to authority 🤷🏻‍♀️
Great write-up — you nailed the core weakness of LLMs as statistical engines rather than reasoning engines. The point on NER-based RAG is especially important: converting unstructured text into structured entities/facts is what lets retrieval become precise instead of probabilistic.
💯 - thanks for reading!
The point about LLMs biasing towards common info and missing out on niche data is so true. I’ve noticed that too, sometimes the more unique, less popular sources just get skipped. I’m also a fan of the RAG/NER combo. It’s like giving the model a better map for finding the good stuff.
Thank you for sharing this. I've found that having a human in the loop is always essential; providing context is a key part of effective collaboration. Understanding the strengths and limits of each side, human and machine, only helps us move forward together.
Amazing observation