DEV Community: NEXADiag Nexa

# I stopped trusting a single AI for code review — here's why

NEXADiag Nexa — Mon, 01 Jun 2026 18:45:41 +0000

We've all been there:

You ask GPT-4o or Claude 3.5 to review your PR.
It says "Looks good!"
You ship.
Production crashes on an edge case the AI hallucinated its way past.

The problem isn't that AI makes mistakes. It's that we trust a single model. Every model family has blind spots. In my experience, GPT over-explains but misses logic errors, Gemini is strong on structure but weak on security patterns, and Claude is thorough but sometimes invents APIs that don't exist.

One model's "looks good" is just one vote.

Multi-LLM consensus

I built NexaVerify v1.6.0 — a local-first Windows tool that runs your code through 8 AI engines in parallel (Claude, GPT-4o, Gemini, Groq, Cerebras, Mistral, OpenRouter, Ollama).

The core idea:

Agreement is noise. If all models say it's fine, it probably is.
Disagreement is the signal. When Claude flags a security risk but GPT ignores it — that's where you need to look.

Every issue gets a confidence score based on how many providers confirmed it. Disagreements are surfaced, not buried.
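A minimal sketch of how such a score could be computed, assuming each provider returns a list of issue dicts with `file`, `line`, and `type` keys. The function name `score_issues` and the exact-match grouping key are illustrative assumptions, not NexaVerify's actual implementation:

```python
from collections import defaultdict


def score_issues(findings: dict[str, list[dict]]) -> list[dict]:
    """Group issues by (file, line, type) and score each by provider agreement."""
    providers_by_issue: dict[tuple, set[str]] = defaultdict(set)
    for provider, issues in findings.items():
        for issue in issues:
            key = (issue["file"], issue["line"], issue["type"])
            providers_by_issue[key].add(provider)

    total = len(findings)
    scored = []
    for (file, line, kind), providers in providers_by_issue.items():
        scored.append({
            "file": file,
            "line": line,
            "type": kind,
            "confirmed_by": sorted(providers),
            # Share of providers that reported this issue.
            "confidence": len(providers) / total,
            # Reported by some providers but not all: surface it for review.
            "disputed": len(providers) < total,
        })
    # Most-confirmed issues first, then by location for a stable order.
    return sorted(scored, key=lambda s: (-s["confidence"], s["file"], s["line"]))
```

Matching on the exact line number is the simplest possible grouping. Providers often report the same problem a line or two apart, so a real scanner would likely need a fuzzier key.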

Proof by fire: scanning itself

I ran v1.6.0 against its own 26,000-line codebase with 3 free-tier providers (Gemini, Groq, Cerebras).

It found 99 real issues, including a potential NameError in constants.py and a missing try-except in main.py that would crash the app before logging initializes. The bug-finder had bugs, and it found them.

The full report is live here.

Under the hood

3-stage JSON repair — syntactic repair → fallback extraction → schema validation. Catches truncated LLM responses instead of silently returning [].
RSA Proof Bundles — every verdict carries a SHA-256 audit trail. Verifiable after the fact.
Local-first, BYOK — your code and keys never touch my server.
Ollama support — full offline consensus for sensitive projects.
Free tier available today — 3 analyses/day, 3 providers (Gemini, Groq, Cerebras), 10 files/scan.
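To make the "fallback extraction" stage concrete, here is a minimal sketch of one way it could work: if the whole response does not parse, try the outermost `[...]` span, which handles models that wrap the array in prose. The function name `extract_json_array` is hypothetical and this is not NexaVerify's actual code. It returns `None` rather than `[]` on failure so the caller can tell "nothing recoverable" apart from "no issues found":

```python
import json
from typing import Optional


def extract_json_array(text: str) -> Optional[list]:
    """Parse text as a JSON array, falling back to the outermost [...] span."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: the model may have wrapped the array in commentary.
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            return None
        try:
            parsed = json.loads(text[start : end + 1])
        except json.JSONDecodeError:
            return None
    # Anything that is not a list does not match the expected shape.
    return parsed if isinstance(parsed, list) else None
```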

No signup wall. Download, add your API keys, run.

→ Try it free or grab Pro (€19 lifetime): https://nexaverify.netlify.app/

I ship solo. Every feature is driven by what early users actually need. What's your current workflow for verifying AI-generated code? Running a single pass, or already testing multi-model pipelines?

# Why JSON.parse() Fails Silently on Truncated LLM Responses (And What I Did About It)

NEXADiag Nexa — Wed, 13 May 2026 11:41:17 +0000

If you've shipped anything that asks an LLM to return JSON, you've already hit this bug. You just may not have noticed.

The LLM returns a response. Your code parses it. Most of the time it works. Sometimes you get back an empty result and assume the LLM didn't find anything. The reality is darker: the JSON was truncated mid-object, the parser raised an exception, a catch-all handler swallowed it, and your downstream code is now operating on an empty list instead of the partial result the LLM actually produced.

I lost six weeks to this bug. Here's what I learned.

The setup

I run code review with multiple LLMs in parallel. Each one returns a JSON array of issues found:


```json
[
  {"file": "main.py", "line": 47, "type": "security", "severity": "high", "description": "..."},
  {"file": "main.py", "line": 89, "type": "smell", "severity": "low", "description": "..."}
]
```

When the LLM hits its max_tokens limit mid-response, the response gets cut off. You receive something like:

```json
[
  {"file": "main.py", "line": 47, "type": "security", "severity": "high", "description": "..."},
  {"file": "main.py", "line": 89, "type": "smell", "seve
```

json.loads() raises JSONDecodeError. Most code catches the exception and returns []. The issues that were complete before the truncation point are lost.

The dumb solution that actually works

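You don’t need a streaming JSON parser. You need a small repair function that cuts the array back to its last complete object:

```python
import json
import logging

logger = logging.getLogger(__name__)


def guard_truncation(text: str, provider_id: str, file_path: str) -> str:
    """Return text unchanged if it parses; otherwise salvage the complete leading objects."""
    stripped = text.strip()
    if not stripped.startswith("["):
        return text  # not a JSON array; leave it for the caller to handle

    try:
        json.loads(stripped)
        return text  # already valid
    except json.JSONDecodeError:
        pass

    # Walk backward through every "}" and close the array there. The first
    # candidate that parses ends at the last complete top-level object, even
    # when the cut-off object contains nested objects or a "}" inside a string.
    last_close = stripped.rfind("}")
    while last_close != -1:
        repaired = stripped[: last_close + 1] + "\n]"
        try:
            json.loads(repaired)
        except json.JSONDecodeError:
            last_close = stripped.rfind("}", 0, last_close)
            continue
        # provider_id and file_path are used only for this log line.
        logger.warning("Repaired truncated JSON from %s for %s", provider_id, file_path)
        return repaired

    return "[]"  # no complete object could be recovered
```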

It’s not elegant, but it works. Every issue that was complete before the cutoff survives, so you recover 80-90% of the partial result instead of 0%.

The second bug that this revealed

Here’s where it gets worse.

My downstream code assumed every entry in the parsed list was a dictionary. Most of the time it was. But occasionally an LLM would return a string entry in the middle of the array:

```json
[
  {"file": "main.py", "line": 47, ...},
  "I noticed there might be an issue here but I'm not sure",
  {"file": "main.py", "line": 89, ...}
]
```

My code called entry.get("file") on every entry. When it hit the string, it raised AttributeError: 'str' object has no attribute 'get'. The exception was caught by a try/except too broad to be useful, and the entire scan silently produced empty results for that file.

Six weeks. No error log. The only signal was “the report has fewer issues than usual for this codebase”.

The fix:

```python
for entry in raw_issues:
    if not isinstance(entry, dict):
        continue
    # safe to call entry.get(...) here
```

Three lines. That’s it.

The bigger lesson

I don’t think LLM output should ever be trusted to match a schema. Even when you tell it “return valid JSON only”, you’ll get:

    Truncated JSON when you hit token limits
    Strings injected mid-array as informal commentary
    Wrong types in correct keys (line: "approximately 50" instead of line: 50)
    Extra keys not in your schema
    Missing required keys

The temptation is to use Pydantic or a JSON schema validator and reject malformed responses entirely. All-or-nothing rejection is the costliest choice, because you lose all the partial work the LLM did. The better choice is to repair what you can, type-check defensively at every step, and log what you couldn’t recover so you can iterate.
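A minimal sketch of what per-entry repair could look like for the failure modes listed above, assuming the five-key issue schema from the earlier example. The name `normalize_issue` and the rule of taking the first integer found in a string are illustrative assumptions:

```python
import re
from typing import Optional

REQUIRED_KEYS = ("file", "line", "type", "severity", "description")


def normalize_issue(entry: object) -> Optional[dict]:
    """Coerce one LLM-produced entry to the expected shape, or return None."""
    if not isinstance(entry, dict):
        return None  # stray commentary string, number, etc.
    if any(key not in entry for key in REQUIRED_KEYS):
        return None  # missing required key: caller should log and skip

    line = entry["line"]
    if isinstance(line, str):
        # Salvage the first integer from values like "approximately 50".
        match = re.search(r"\d+", line)
        line = int(match.group()) if match else None
    if isinstance(line, bool) or not isinstance(line, int):
        return None

    # Keep only schema keys so unexpected extras never leak downstream.
    issue = {key: entry[key] for key in REQUIRED_KEYS}
    issue["line"] = line
    return issue
```

Because each entry is handled on its own, one malformed item costs you that item and nothing else.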

Three patterns that have saved me from similar bugs:

    Always isinstance(x, dict) before .get() on LLM-derived data. Always.
    Bracket-repair truncated JSON before declaring failure. 80% recovery beats 0%.
    Log what you discarded. If you silently filter bad entries, you’ll never know how often it happens. I now log every malformed entry with the provider name and file path.
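The first and third patterns combine naturally. A minimal sketch, assuming the standard `logging` module; the function name `keep_dict_entries` and the logger name are hypothetical:

```python
import logging

logger = logging.getLogger("scan")


def keep_dict_entries(raw_issues: list, provider: str, file_path: str) -> list[dict]:
    """Return only dict entries, logging every entry that is dropped."""
    kept = []
    for index, entry in enumerate(raw_issues):
        if isinstance(entry, dict):
            kept.append(entry)
            continue
        # Record what was thrown away and where it came from.
        logger.warning(
            "Discarded malformed entry %d from %s for %s: %r",
            index, provider, file_path, entry,
        )
    return kept
```

Counting these warnings per provider over time shows how often each model drifts from the requested format.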

Why this matters in 2026

Most teams treat LLM output as “either it works or it doesn’t”. The reality is closer to “it partially works most of the time, and the partial-failure modes are silent”. Production code that consumes LLM output needs to be more paranoid than production code that talks to a normal API, because an HTTP 200 from an LLM API says nothing about whether the body is well-formed: the model’s output arrives through a single channel that mixes intent, format, and content.
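For truncation specifically, some provider APIs do report the cause outside the text channel. A minimal sketch, assuming the response has already been decoded into a dict: OpenAI's Chat Completions API sets `finish_reason` to `"length"` on a choice cut off by the token limit, and Anthropic's Messages API sets `stop_reason` to `"max_tokens"`. The function name `was_truncated` is hypothetical, and other providers use different field names, so check each API's reference:

```python
def was_truncated(response: dict) -> bool:
    """Return True if the provider reported hitting the output token limit."""
    # Anthropic Messages API: top-level "stop_reason" is "max_tokens".
    if response.get("stop_reason") == "max_tokens":
        return True
    # OpenAI Chat Completions: each choice carries a "finish_reason".
    choices = response.get("choices")
    if isinstance(choices, list):
        return any(
            isinstance(choice, dict) and choice.get("finish_reason") == "length"
            for choice in choices
        )
    return False
```

This flag only tells you that a repair pass is needed. It does not replace the repair itself.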

I built my entire scanning workflow around the assumption that any single LLM response will be 5-10% broken. That assumption has been a better friend than any prompt engineering trick.

What’s your experience? Anyone else burned by silent truncation, or am I the last one to notice?

# Stop copy-pasting AI code: The 6-step validation checklist for devs.

NEXADiag Nexa — Wed, 15 Apr 2026 14:42:58 +0000

It is impossible to be 100% certain that a tool or code generated by an LLM (like ChatGPT, Claude, etc.) is bug-free. LLMs are text predictors: they generate code that looks correct, but they do not "compile" or execute the code internally. Consequently, they can invent functions that do not exist (hallucinations) or make subtle logic errors.

However, you can achieve a very high level of confidence by following a rigorous validation method. Here are the essential steps:

1. Code Review (Never just copy-paste)

Have the code explained: Ask the LLM: "Explain this function to me line by line." If the explanation is logically sound, that is a good sign.
Check the business logic: Does the tool do exactly what you want, or did it simplify the problem to provide a faster answer?
Watch for LLM "habits": LLMs tend to use popular libraries even if they aren't the best fit, or they might ignore error handling (try/catch).

2. Edge Case Testing

This is where LLMs fail most often. A tool might work perfectly with normal data but crash with unusual data. Test for:

Empty inputs: What happens if you provide nothing?
Extreme values: A negative number where it should be positive? A text string of 10,000 characters?
Special characters: Accents, emojis, or HTML tags (<script>).
Wrong format: If the tool expects a date (DD/MM/YYYY), what happens if you type "Monday"?
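As an illustration, here is how those cases could be written as a test for a hypothetical `parse_date` helper that accepts DD/MM/YYYY. Both the helper and the `rejects` wrapper are assumptions made up for this example:

```python
from datetime import date, datetime


def parse_date(text: str) -> date:
    """Parse a DD/MM/YYYY string, raising ValueError on anything else."""
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


def rejects(text: str) -> bool:
    """Return True if parse_date refuses the input with ValueError."""
    try:
        parse_date(text)
    except ValueError:
        return True
    return False


def test_edge_cases() -> None:
    assert parse_date("15/04/2026") == date(2026, 4, 15)  # nominal case
    assert rejects("")             # empty input
    assert rejects("Monday")       # wrong format
    assert rejects("31/02/2026")   # impossible date
    assert rejects("<script>")     # HTML tag
    assert rejects("1" * 10_000)   # 10,000-character string
```

Run it with pytest, or call `test_edge_cases()` directly. The point is that each bullet in the list becomes one assertion you can rerun after every change.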

3. Dependency Validation

LLMs sometimes invent package names or use obsolete functions.

Verify that every import (Python), require (Node.js), or using (C#) corresponds to an actual, existing library.
Check that the library version is compatible with your environment.
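For Python, both checks can be sketched with the standard library. The helper names are hypothetical, and the check covers the local environment only: it does not tell you whether a package of that name on a public index is the library the LLM meant.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def module_exists(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None
```

If `module_exists` returns False for an import the LLM wrote, look the package up by hand before installing anything under that name.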

4. Use Automated Tools (Don't do everything manually)

Run the LLM's code through real development tools:

Linters: Tools like ESLint (JavaScript) and Pylint or Ruff (Python) detect syntax errors and poor practices.
Type Checkers: If using TypeScript or Python with type hints, the TypeScript compiler or a Python type checker such as mypy or Pyright will catch many silent errors (e.g., passing a string to a function expecting a number).
Ask the LLM to write unit tests: Ask: "Write unit tests (using Jest, pytest, etc.) for this code including nominal and edge cases," then execute those tests.

5. Security Check (Crucial)

Never trust an LLM with security.

Check for hardcoded passwords or API keys in the script.
If the tool interacts with a database, ensure it is protected against SQL injection (use parameterized queries, never string concatenation).
If the tool takes user input, ensure the data is sanitized before being displayed or processed.
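A minimal sketch of the last two checks using only the Python standard library. The `users` table and both function names are hypothetical:

```python
import html
import sqlite3


def find_user(conn: sqlite3.Connection, username: str) -> list[tuple]:
    """Look up a user with a parameterized query; input is never spliced into SQL."""
    return conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    ).fetchall()


def render_greeting(username: str) -> str:
    """Escape user input before placing it in HTML."""
    return f"<p>Hello, {html.escape(username)}</p>"
```

When reviewing LLM-generated code, the warning sign is the opposite pattern: an f-string or `+` building the SQL text or the HTML directly from user input.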

6. Cross-Checking Technique (Pitting LLMs against each other)

If you have doubts about a complex piece of code:

Take the code generated by ChatGPT.
Open Claude or Gemini and ask: "Here is code generated by an AI. Find the bugs, security flaws, or performance issues."

LLMs have different biases. An error that goes unnoticed by one is often caught by another.

A note: this checklist is partly automated in NexaVerify, the multi-LLM consensus scanner I'm building. Step 6 (LLM cross-checking) is its core mechanic. Free tier on Gumroad if you want to try it.