🧠 Human Intelligence vs. LLMs

I was doing investigative research and found a crucial bit of information in a single article. I then used an LLM to perform Deep Research on the same topic but, to my surprise, the report it returned claimed the argument was unsubstantiated.

I used an open-source LLM to perform the same Deep Research and the result was the same. I then attempted to direct the LLM to the source whilst providing a scope, but it was unable to find the information. Instead, the LLM defaulted to searching for higher-authority sources, such as official reports, effectively dismissing the original article's finding as an unverified outlier.


⚙️ The Flaw of Statistical Text Matching

This fundamental reliance on statistical text matching is a flaw in LLMs' ability to perform genuine deep research. The model is biased towards information that is frequent, common, and statistically well-represented, so rare information is often lost or missing from the results.

Whilst multimodal models are better at processing varied inputs, the issue often stems from the initial filtering. My research suggests that much niche information, even when accessible to a crawler, is disregarded due to poor Search Engine Optimisation (SEO) or poor indexing. The underlying LLM's research mechanism then filters it out due to low apparent authority or link popularity, reinforcing the bias towards common knowledge.
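As a toy illustration of that filtering step (the sources, scores, and threshold below are entirely made up), an authority-weighted ranker discards a niche article before the model ever reads it:

```python
# Hypothetical search results: (title, relevance, authority / link popularity).
results = [
    ("Official regulator report",    0.60, 0.95),
    ("Major newspaper summary",      0.55, 0.90),
    ("Niche blog with the key fact", 0.90, 0.10),  # most relevant, least authoritative
]

AUTHORITY_THRESHOLD = 0.5  # assumed cut-off applied before ranking

# Filter by authority first, then rank by a blended score: the niche
# article never reaches the synthesis stage, however relevant it is.
survivors = [r for r in results if r[2] >= AUTHORITY_THRESHOLD]
ranked = sorted(survivors, key=lambda r: 0.5 * r[1] + 0.5 * r[2], reverse=True)

for title, relevance, authority in ranked:
    print(f"{title}: relevance={relevance}, authority={authority}")
```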

Interestingly enough, smaller models are better at returning facts. This may be because the pre-trained knowledge of LLMs can interfere with their ability to accept new information. This is why fact-checking and proper research are much better when there is human input and a RAG-based mechanism that relies on more than just text-based matching. The key lies in structured knowledge matching, for which an NER system is a highly capable tool.

NER systems convert unstructured text into explicit facts and relationships, which can then be searched directly (a knowledge-based match). NER-based RAG is like asking a human to first annotate the text with all of its key facts and then search those structured facts, which makes it far better at handling context. It does, however, require human input to set up and manage effectively.
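To make that concrete, here is a minimal sketch of the idea using spaCy's pre-trained NER pipeline. The two-sentence corpus, the query, and the overlap scoring are illustrative assumptions, not a production system:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

documents = [
    "Acme Ltd opened a lithium refinery in Cornwall in 2021.",
    "The site was quietly shut down after a single audit.",
]

# Step 1: annotate each sentence with its named entities (structured facts).
fact_index = []
for text in documents:
    entities = {(ent.text, ent.label_) for ent in nlp(text).ents}
    fact_index.append({"text": text, "entities": entities})

# Step 2: retrieve by entity overlap rather than raw text similarity.
def entity_match(query: str) -> list[str]:
    query_ents = {(ent.text, ent.label_) for ent in nlp(query).ents}
    scored = [(len(query_ents & f["entities"]), f["text"]) for f in fact_index]
    return [text for score, text in sorted(scored, reverse=True) if score > 0]

print(entity_match("What happened to Acme Ltd in Cornwall?"))
```

Because the match happens on explicit (entity, label) pairs, a rare fact can survive retrieval even when its phrasing is statistically unusual.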


🔗The Role of Context Dependency

Context dependency essentially means the 'fact' only exists as a full truth when the context is included. The standard LLM's automated process often loses this nuance in the search, retrieval, and synthesis stages.

For example, an article may state information implicitly rather than explicitly. This is where human attention to detail and reasoning are important for discernment. The model struggles because it operates on token probability (which word is most likely to come next) and semantic similarity (how closely a retrieved text snippet's vector resembles the vector of your prompt).

The LLM cannot perform the final step of human reasoning to connect them: Premise A (in sentence 1) + Condition B (in sentence 3) → Explicit Fact C (the missing piece).
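A small sketch of that failure mode, using sentence-transformers for the embeddings (the snippets and query are invented for illustration): the fact only emerges when Premise A and Condition B are read together, so neither chunk alone tends to score highly against the query.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Premise A and Condition B live in separate sentences; Fact C is never stated.
snippets = [
    "The plant's licence only covers grade-A ore.",         # Premise A
    "Output rose sharply in the third quarter.",            # filler
    "Since 2022 the site has processed grade-B ore only.",  # Condition B
]
query = "Is the plant operating outside its licence?"       # Fact C

snippet_vecs = model.encode(snippets, convert_to_tensor=True)
query_vec = model.encode(query, convert_to_tensor=True)

# Per-chunk similarity: each sentence on its own is a weak-to-middling match.
print(util.cos_sim(query_vec, snippet_vecs))

# Joining Premise A and Condition B typically scores noticeably higher,
# but a chunk-level retriever never constructs this combination.
combined = snippets[0] + " " + snippets[2]
print(util.cos_sim(query_vec, model.encode(combined, convert_to_tensor=True)))
```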

Human researchers are good at:

  • Inference: Understanding what is implied, not just what is explicitly written.

  • Context: Distinguishing between primary data and additional examples.

  • Scepticism: Knowing which article to trust and how to connect its logic, even if it is poorly written or inexplicit.


⚖️ Why Not Just Use Gemini? It Works Fine.

When I asked an LLM to compare models, one of the key factors that caused Gemini 2.5 Pro to rank highly was its ability to perform Source Quality Filtering. Gemini leverages Google Search for real-time grounding, which gives it the strongest mechanism for filtering out misinformation, low-quality sites, and low-authority sources. This, however, directly leads to the risk of missing rare or new information.

I have ranked models and their research mechanisms by their ability to find niche information during deep research, considering factors such as access to specialised data, customisability, and inherent biases in training data. GPT-4o ranks third, mainly due to its smaller context window of 128K tokens compared to Gemini 2.5 Pro, a limiting factor when dealing with large or numerous documents in deep research.

Based on customisation, context depth, and inherent bias in deep research for new information, my final ranking of the current solutions is:

| Model/Approach | Advantage |
| --- | --- |
| Custom RAG/NER | Guaranteed accuracy (low hallucination) and specialised jargon understanding. |
| Gemini 2.5 Pro | Maximum synthesis over 1M tokens of search results for rare connections. |
| Claude 3 Opus | Superior reasoning and self-correction to vet complex, potentially niche findings. |
| GPT-4o | Autonomy and multimodality (e.g., extracting data from niche charts/images found online). |

The choice depends on priorities: reliability and ease of use favour the general-purpose models, whilst guaranteed accuracy on niche information favours a custom RAG/NER pipeline.
