🧠Deterministic scoring for messy AI agent graphs: what I learned building OrKa v0.9.6

Over the past 8 months I have been quietly building my own cognition layer for AI systems.

Not a shiny frontend. Not another wrapper around a single API. I wanted something that would let me define how a system thinks, step by step, and then replay that thinking when things go wrong.

The project is called OrKa-reasoning. With v0.9.6 I finally shipped the part that annoyed me the most: deterministic, explainable path selection in messy agent graphs.

This post is a mix of story and architecture. It is not a launch announcement. It is more like a lab notebook entry from someone who got tired of magical routing and decided to replace it with a boring scoring function.

If you are building agentic systems, or you care about reproducible AI behaviour, you might find some of this useful. Or you might violently disagree, which is also useful. 🙂


Why I stopped trusting my own orchestration

Like many people, I started by wiring models and tools together in the fastest possible way.

  • Let the model decide which tool to call.
  • Let it read its own outputs and decide the next step.
  • Add some glue code.
  • If it seems to work, ship it.

At small scale, this feels fine. You can manually test a few flows and convince yourself it is "smart". The problem appears when:

  • you add more tools
  • you add branching logic
  • you add retries and fallbacks
  • you need to explain a weird decision three weeks later

This is where I found myself reading logs that looked like random walks.

The worst part was not that the system was wrong. Of course it was wrong sometimes. The worst part was that I had no clean way to answer the simplest question:

Why did it choose this path instead of the other one?

If the answer is always "because the large model said so", you do not really have a system. You have an expensive die that generates strings.

I wanted something stricter.


What OrKa is trying to be

Before talking about scoring, a quick snapshot of what OrKa is.

OrKa is a modular cognition layer where you define agents and orchestration in YAML.

Instead of burying logic inside a single prompt or a massive Python file, you write something like this:

orchestrator:
  id: research_orchestrator
  strategy: graph
  queue: redis_main

agents:
  - id: question_normaliser
    type: llm
    model: local_llm_0
    prompt: |
      Normalise the user question and extract the core task.
      Input: {{ input }}

  - id: graph_scout
    type: router
    implementation: GraphScoutAgent

  - id: decision_engine
    type: router
    implementation: DecisionEngine

  - id: executor
    type: executor
    implementation: PathExecutor

This is not the actual full config of OrKa, but the spirit is there.

The orchestrator knows which agents exist and how they can connect. The runtime executes this graph, logs every step, and writes traces to storage.

OrKa is not about inventing new models. It is about treating models as components inside a larger cognitive process that you can inspect and reproduce.

Which brings us to the main pain point: routing.


Routing in agent graphs is where the real intelligence hides

Once you have more than a linear sequence of agents, you need to decide which path a request will take.

Typical examples:

  • route a user question through either a summarisation path or a deep research path
  • decide whether to call an external API or not
  • choose between a cheap local model and an expensive remote one
  • pick a specific tool combination for a multi step workflow

Most frameworks solve this with one of these options:

  1. Let the LLM choose, based on a description of tools.
  2. Hard code a set of if/else rules.
  3. Use some vague "policy" mechanism that is not really documented.

All of these work at small scale. None of them made me happy for serious systems.

What I wanted was:

  • a clear separation between generating candidate paths and choosing one
  • a scoring function that is explicit and configurable
  • a trace that shows me every factor in that decision
  • a way to compare different scoring strategies without rewriting half the stack

So I decided to treat path selection as a scoring problem.


The idea: treat paths as candidates and score them

Instead of thinking "which tool should I call", I started thinking "which full path through the graph should win".

That leads to a simple structure:

  1. Look at the graph and current state.
  2. Generate a set of candidate paths that are valid next moves.
  3. Compute a score for each candidate using multiple factors.
  4. Pick the winner according to a clear policy.
  5. Log everything.
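
To make the shape of this loop concrete, here is a rough sketch in Python. The class and function names are hypothetical, not the actual OrKa API.

# Illustrative sketch of the selection loop; names are hypothetical, not the OrKa API.
def select_path(graph, state, scout, scorer, engine, tracer):
    # 1 + 2: propose valid candidate paths from the current graph and state
    candidates = scout.propose(graph, state)

    # 3: compute a multi factor score for each candidate
    scored = [(candidate, scorer.score(candidate, state)) for candidate in candidates]

    # 4: commit to a winner (and possibly a shortlist) according to policy
    decision = engine.decide(scored)

    # 5: log everything: candidates, per factor scores, weights, winner
    tracer.record(candidates=candidates, scored=scored, decision=decision)
    return decision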

In OrKa v0.9.6 this is handled by four main components:

  • GraphScoutAgent
  • PathScorer
  • DecisionEngine
  • SmartPathEvaluator

Let us walk through what each one does.


GraphScoutAgent: exploring the space of possible moves

The GraphScoutAgent is responsible for reading the current graph and state and proposing candidate paths.

Its job is intentionally limited:

  • it does not assign scores
  • it does not choose winners
  • it does not care about cost or latency

It just answers the question:

Given where we are now, what are the valid next paths I can take, and what information do I need to evaluate them?

A "path" here is not just a single next node. It can be a short sequence that represents a meaningful strategy.

For example:

  • ["normalise_question", "search_docs", "synthesise_answer"]
  • ["normalise_question", "ask_clarification", "search_docs", "synthesise_answer"]
  • ["normalise_question", "call_external_api", "summarise_api_result"]

The scout does some basic pruning. There is no point considering paths that are structurally impossible or obviously invalid.
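
To give a feel for what "propose and prune" means, here is a minimal sketch that enumerates bounded-length paths over a toy adjacency list. It is an illustration, not the real GraphScoutAgent code.

# Hypothetical sketch: enumerate candidate paths up to a fixed depth from the
# current node, with basic pruning. Not the real GraphScoutAgent.
def propose_candidates(graph, current_node, max_depth=3):
    candidates = []

    def walk(node, path):
        if len(path) > max_depth:
            return
        if path:
            candidates.append(list(path))
        for nxt in graph.get(node, []):
            if nxt in path:  # basic pruning: do not revisit the same agent
                continue
            path.append(nxt)
            walk(nxt, path)
            path.pop()

    walk(current_node, [])
    return candidates

# toy graph as an adjacency list: agent id -> possible next agents
graph = {
    "normalise": ["search_docs", "ask_clarification"],
    "ask_clarification": ["search_docs"],
    "search_docs": ["answer"],
    "answer": [],
}
print(propose_candidates(graph, "normalise"))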

Once we have a set of candidates, we can start scoring.


PathScorer: mixing LLM judgement with heuristics, priors, cost and latency

The PathScorer is where most of the interesting logic lives.

The scoring function is multi factor and looks roughly like this:

final_score = w_llm * score_llm
            + w_heuristic * score_heuristic
            + w_prior * score_prior
            + w_cost * penalty_cost
            + w_latency * penalty_latency

Each term is normalised to a consistent range before weighting, so scores are comparable across candidates.

The factors:

  • score_llm

    The output of a small evaluation model that looks at a candidate path and the current context and answers a simple question:

    How suitable is this path for what we are trying to do?

    This does not need to be a giant model. A small local model is often enough.

  • score_heuristic

    Hand written logic. For example:

    • prefer paths that include a safety checker
    • avoid paths that call the same API twice in a row
    • boost paths that reuse recent context

  • score_prior

    Domain or tenant specific priors. This is still a work in progress in 0.9.6.

    Think of it as "distaste" for some strategies in some domains. For instance, in a financial setting you might have a strong prior against generating free form explanations without a verification step.

  • penalty_cost

    Cost is not just money. Cost can be GPU time, external API calls, or latency budgets.

    This term penalises candidates that are likely to be expensive.

  • penalty_latency

    Expected latency. Sometimes you want to avoid slow paths even if they are slightly more accurate, especially in user facing flows.

All weights are configurable.

In v0.9.6 the default configuration is conservative. The point is not to ship a magic policy. The point is to ship a structure that you can bend to your needs.

And most importantly: every factor and weight is recorded in the trace.
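
For illustration, here is a stripped down sketch of how these factors might be normalised and combined. The weights, helper names and clamping choice are mine, not the actual PathScorer implementation.

# Illustrative multi factor combine; weights and helpers are hypothetical.
DEFAULT_WEIGHTS = {"llm": 0.4, "heuristic": 0.3, "prior": 0.1, "cost": 0.1, "latency": 0.1}

def clamp01(x):
    # keep each positive factor in [0, 1] so candidates stay comparable
    return max(0.0, min(1.0, x))

def combine(factors, weights=DEFAULT_WEIGHTS):
    # factors example: {"llm": 0.78, "heuristic": 0.9, "prior": 0.5,
    #                   "cost": -0.1, "latency": -0.05}
    return (
        weights["llm"] * clamp01(factors["llm"])
        + weights["heuristic"] * clamp01(factors["heuristic"])
        + weights["prior"] * clamp01(factors["prior"])
        + weights["cost"] * factors["cost"]        # penalty, already negative
        + weights["latency"] * factors["latency"]  # penalty, already negative
    )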


DecisionEngine: from scores to a committed path

Once every candidate has a score, the DecisionEngine kicks in.

Its responsibilities:

  • sort candidates by score
  • handle shortlist semantics
  • decide how to break ties
  • commit to a path and make that decision visible to the rest of the system

"Shortlist semantics" might sound like a detail, but it matters in practice.

Sometimes you want:

  • a strict winner takes all policy
  • a shortlist of two candidates, where the second one is a fallback
  • a policy that says "if scores are too close, ask the user or ask another agent"

The DecisionEngine contains this logic and is the main place where you can plug in different strategies without touching scoring itself.
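
As a sketch, a pluggable policy might look like the function below. The tie margin, shortlist size and return shape are made up for illustration.

# Hypothetical decision policy: winner takes all unless scores are too close,
# in which case keep a shortlist and flag the decision as ambiguous.
def decide(scored_candidates, tie_margin=0.05, shortlist_size=2):
    # scored_candidates: list of (candidate_id, final_score) pairs
    ranked = sorted(scored_candidates, key=lambda item: item[1], reverse=True)
    shortlist = [cid for cid, _ in ranked[:shortlist_size]]
    winner, best = ranked[0]
    ambiguous = len(ranked) > 1 and best - ranked[1][1] < tie_margin
    return {
        "winner": winner,
        "shortlist": shortlist,
        "ambiguous": ambiguous,  # caller may escalate to a user or another agent
    }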

One thing I learned quickly: if you do not formalise this part, you end up with ad hoc logic scattered across the codebase, which is very hard to test.


SmartPathEvaluator: the orchestration facing wrapper

The SmartPathEvaluator is simply the wrapper that orchestration code talks to.

From the outside, you do not care about scouts, scorers and engines. You want to say:

decision = evaluator.evaluate(current_state)

and get back:

  • the chosen path
  • the shortlist
  • a full scoring breakdown

The evaluator handles initialisation, plugs everything together and provides a stable API.
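
Here is a rough sketch of how orchestration code might consume that result; the field names are illustrative, not the actual OrKa contract.

# Hypothetical consumption of the decision object; field names are illustrative.
decision = evaluator.evaluate(current_state)

chosen_path = decision["path"]           # e.g. ["normalise", "search_docs", "answer"]
fallbacks = decision["shortlist"][1:]    # remaining shortlist entries, if any
breakdown = decision["scores"]           # per candidate, per factor scores and weights

for step in chosen_path:
    run_agent(step, current_state)       # run_agent is a hypothetical helper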

This is the layer where backwards compatibility matters. Internally I can keep iterating on the blocks. As long as the evaluator contract stays stable, orchestration code will not need to change much.


What traces look like now

A big motivator behind this refactor was trace quality.

A trace for a single decision now includes at least:

  • list of candidate paths
  • scores per factor for each candidate
  • weights used for each factor
  • final aggregated score
  • shortlist and winner
  • any errors encountered during scoring

This means that when something weird happens, the debugging flow is finally clear:

  1. Inspect which candidates were considered.
  2. Check if their structure makes sense.
  3. Look at the raw scores for each factor.
  4. Adjust weights or heuristics if necessary.
  5. Rerun with the same input and confirm the change.

No more guessing. No more "the model decided".

I am not claiming this is perfect, but at least there is a concrete trail to follow.


Testing: where the 74 percent coverage actually goes

Coverage numbers are easy to game, so here is what the 74 percent in OrKa v0.9.6 really means.

Things that are tested well:

  • scoring logic
  • normalisation and weighting functions
  • graph introspection for candidate generation
  • loop behaviour and basic convergence
  • DecisionEngine shortlist and commit semantics

These are mostly unit and component tests. They run fast and have no external dependencies.

Things that are partially tested:

  • integration between the new components and the rest of the orchestration runtime
  • logging format and trace emission

Here I lean on higher level tests that exercise the system in memory with mocks instead of real external services.

Things that are not properly tested in CI/CD:

  • full end to end flows against real local LLMs and a live memory backend
  • failure modes when LLM outputs violate schemas in fun new ways
  • long running behaviour under realistic load

These are exactly the items I am struggling with the most. All tests run in GitHub Actions, where there is no real LLM to call. Local tests are in place to make sure everything works before a release.

I am sharing this explicitly because I am tired of changelogs that say "improved reliability" without telling you what is still risky.


Why local models actually help here

One side effect of building around deterministic scoring is that local models become even more attractive.

You can use a small local model as the "judgement" part of the scoring function:

  • it reads the candidate path
  • it reads the context
  • it outputs a suitability score or a categorical judgement

Because the rest of the scoring function is deterministic and visible, even a slightly noisy local model can be stabilised by heuristics and priors.

This has a few advantages:

  • you do not leak your graph structure and decisions to a third party API
  • you can tune the model or swap it without changing the architecture
  • latency is predictable and under your control

In my own experiments I have used small local models through runtimes like Ollama for this purpose. They are not perfect, but they are good enough when combined with the other factors.

The important part is that the scoring pipeline does not trust the model blindly. It treats it as one signal among many.
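
To make the "one signal among many" point concrete, here is a sketch that asks a local model through Ollama's /api/generate endpoint for a suitability score and clamps whatever comes back. The prompt, model name and parsing are my own simplification, not OrKa code.

import requests

# Hypothetical helper: one noisy signal, clamped before it enters the scorer.
def llm_suitability(candidate_steps, question, model="llama3.1"):
    prompt = (
        "Rate from 0.0 to 1.0 how suitable this path is for the task.\n"
        f"Task: {question}\n"
        f"Path: {' -> '.join(candidate_steps)}\n"
        "Answer with a single number."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=30,
    )
    resp.raise_for_status()
    raw = resp.json()["response"].strip()
    try:
        score = float(raw.split()[0])
    except ValueError:
        score = 0.5  # garbage output: fall back to a neutral signal
    return max(0.0, min(1.0, score))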


A small concrete example

To make this less abstract, here is a simplified YAML style example of how a decision might play out.

Imagine you have two candidate paths for a user question:

  • Path A: ["normalise", "search_docs", "answer"]
  • Path B: ["normalise", "ask_clarification", "search_docs", "answer"]

The inputs to the scorer could look like:

{
  "candidates": [
    {
      "id": "path_a",
      "steps": ["normalise", "search_docs", "answer"]
    },
    {
      "id": "path_b",
      "steps": ["normalise", "ask_clarification", "search_docs", "answer"]
    }
  ],
  "context": {
    "question": "Need a short summary of last quarter revenue",
    "user_tolerance_ms": 2000
  }
}

The scoring results might be:

{
  "path_a": {
    "score_llm": 0.78,
    "score_heuristic": 0.9,
    "score_prior": 0.5,
    "penalty_cost": -0.1,
    "penalty_latency": -0.05,
    "final_score": 0.71
  },
  "path_b": {
    "score_llm": 0.82,
    "score_heuristic": 0.6,
    "score_prior": 0.5,
    "penalty_cost": -0.3,
    "penalty_latency": -0.3,
    "final_score": 0.52
  }
}

Weights are hidden here for brevity, but they are part of the trace.

Looking at this, it is clear that:

  • The model slightly prefers path B because it likes clarifications.
  • Heuristics strongly prefer path A for this kind of query.
  • Cost and latency kill path B because the user tolerance is low.

The DecisionEngine then chooses path A, possibly keeping path B in a shortlist as a fallback for specific error modes.

When someone asks "why did we not ask for clarification here", the trace says it plainly: cost and latency mattered more than that extra step.

This is the sort of conversation I want to be able to have about AI systems.


Known gaps and where this goes next

I do not pretend OrKa v0.9.6 is finished work. It is an advanced beta, not a stable 1.0.

The most important gaps right now:

  • End to end validation

    I need a small, boring suite of tests that run full flows with local LLMs and a real Redis or similar memory backend. No mocks. No shortcuts. Just reproducible runs.

  • Priors and safety heuristics

    The structure is there, but the library of domain specific priors and safety rules is still thin. This is probably the most important piece for high risk domains.

  • PathExecutor shortlist semantics

    I want more coverage of weird real world cases where the top candidate fails mid path and fallback logic kicks in.

  • LLM schema handling

    Right now a lot of schema work is done, but I want schema failures to be first class citizens in traces. If a model gives me garbage, the system should not quietly "fix" it. It should record that the schema was broken.

All of these items are focused and measurable. There is no magic backlog of vague ideas. Just a short list of concrete things that need to be built and tested.


Why I am sharing this

There is a lot of noise in the AI space. Huge claims, vague diagrams, no tests.

I am not trying to shout over that.

I am sharing this for a simpler reason: if you are also building agentic systems, we are probably facing similar problems. You might have better solutions, or you might see blind spots in mine.

In that sense, OrKa is a conversation starter as much as it is a tool.

  • If you have strong opinions about routing policies, I want to hear them.
  • If you have horror stories about "smart" orchestration gone wrong, I want to learn from them.
  • If you think scoring is the wrong abstraction entirely, I want to know why.

You can find more details and code here:

If you made it this far, thank you.

Feel free to steal any of these ideas, or tear them apart. That is how the next iteration will get better. 🚀
