Per-Step JSONL Logging for Agent Runs: Know What Your Agent Did and When

#hermeschallenge #ai #python #agents

You shipped an agent to production. A user filed a bug report. The agent gave a wrong answer. You need to know exactly what the agent did: which tools it called, what they returned, how many LLM turns it took, where it went sideways.

If you have no step log, your answer to "what did the agent do?" is "I don't know."

agent-step-log gives every agent run a structured, queryable JSONL record.

The Shape of the Fix

from agent_step_log import StepLog

log = StepLog(path="./logs/agent-steps.jsonl", run_id="run-abc123")

log.record_llm_call(
    model="claude-sonnet-4-6",
    input_tokens=1200,
    output_tokens=340,
    stop_reason="tool_use",
)

log.record_tool_call(
    tool="search_documents",
    input={"query": "Q3 revenue"},
    output={"results": [...]},
    duration_ms=240,
)

log.record_llm_call(
    model="claude-sonnet-4-6",
    input_tokens=1600,
    output_tokens=180,
    stop_reason="end_turn",
)

summary = log.summarize()
print(f"Steps: {summary.total_steps}")
print(f"Total tokens: {summary.total_tokens}")
print(f"Tool calls: {summary.tool_call_count}")
print(f"Duration: {summary.duration_ms}ms")

Each call appends a line to the JSONL file and updates an in-memory list. When the run ends, summarize() computes totals from the in-memory list without re-reading the file.

What It Does NOT Do

agent-step-log does not capture LLM payloads. For LLM calls, the log records metadata (model, tokens, stop reason, duration), not the full prompt or completion text. For wire-level capture including full request and response bodies, use agenttap.

It does not stream to external systems. The JSONL file is local. If you want to ship logs to Datadog, CloudWatch, or a custom sink, wrap the StepLog class and add your transport layer.
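
One way to do that wrapping is sketched below. ForwardingStepLog and the sink callable are hypothetical names, not part of agent-step-log. The wrapper relies only on the record_llm_call, record_tool_call, and summarize methods shown in this post, and it accepts any object that provides them, so the snippet runs without the package installed.

```python
from typing import Any, Callable


class ForwardingStepLog:
    """Hypothetical wrapper: delegate to a StepLog-like object, then forward each event."""

    def __init__(self, inner: Any, sink: Callable[[dict], None]):
        self._inner = inner  # e.g. a StepLog instance
        self._sink = sink    # e.g. a function that ships one event to your log pipeline

    def record_llm_call(self, **fields: Any) -> None:
        self._inner.record_llm_call(**fields)
        self._forward({"type": "llm_call", **fields})

    def record_tool_call(self, **fields: Any) -> None:
        self._inner.record_tool_call(**fields)
        self._forward({"type": "tool_call", **fields})

    def summarize(self) -> Any:
        return self._inner.summarize()

    def _forward(self, event: dict) -> None:
        # A failing transport should not break the agent run, so errors are swallowed.
        try:
            self._sink(event)
        except Exception:
            pass
```

The forwarded event carries only the fields passed by the caller. The run_id, step, and ts values are added inside StepLog, so a real transport would either add its own or tail the JSONL file instead.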

It does not aggregate across runs automatically. Each StepLog instance is scoped to one run. Cross-run analysis means loading multiple JSONL files. agent-debug-replay builds on agent-step-log to provide that cross-run view.
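
If all you need is a quick cross-run view, loading the files by hand is short. The sketch below assumes one or more *.jsonl files in a single directory and the record fields shown in this post (run_id, type, input_tokens, output_tokens). load_runs and tokens_per_run are illustrative helpers, not library functions.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_runs(log_dir: str) -> dict[str, list[dict]]:
    """Read every *.jsonl file in log_dir and group step records by run_id."""
    runs: dict[str, list[dict]] = defaultdict(list)
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                if line.strip():  # skip blank lines
                    record = json.loads(line)
                    runs[record["run_id"]].append(record)
    return dict(runs)


def tokens_per_run(runs: dict[str, list[dict]]) -> dict[str, int]:
    """Sum input and output tokens over the llm_call records of each run."""
    return {
        run_id: sum(
            s.get("input_tokens", 0) + s.get("output_tokens", 0)
            for s in steps
            if s["type"] == "llm_call"
        )
        for run_id, steps in runs.items()
    }
```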

Inside the Library

Each step record has a consistent shape:

# LLM call record
{
    "run_id": "run-abc123",
    "step": 1,
    "type": "llm_call",
    "model": "claude-sonnet-4-6",
    "input_tokens": 1200,
    "output_tokens": 340,
    "stop_reason": "tool_use",
    "duration_ms": 1840,
    "ts": 1748107200.123
}

# Tool call record
{
    "run_id": "run-abc123",
    "step": 2,
    "type": "tool_call",
    "tool": "search_documents",
    "input": {"query": "Q3 revenue"},
    "output": {"results": [...]},
    "error": null,
    "duration_ms": 240,
    "ts": 1748107202.456
}

The implementation:

import json
import threading
import time
from pathlib import Path
from typing import Any

class StepLog:
    def __init__(self, path: str, run_id: str):
        self._path = Path(path)
        self._run_id = run_id
        self._steps: list[dict] = []
        self._step_counter = 0
        self._start_ts = time.time()
        self._lock = threading.Lock()

    def _write(self, record: dict) -> None:
        with self._lock:
            self._step_counter += 1
            record["run_id"] = self._run_id
            record["step"] = self._step_counter
            record["ts"] = time.time()
            self._steps.append(record)
            with self._path.open("a") as f:
                f.write(json.dumps(record) + "\n")

    def record_tool_call(
        self,
        tool: str,
        input: dict,
        output: Any = None,
        error: str | None = None,
        duration_ms: float = 0,
    ) -> None:
        self._write({
            "type": "tool_call",
            "tool": tool,
            "input": input,
            "output": output,
            "error": error,
            "duration_ms": duration_ms,
        })
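
The excerpt shows record_tool_call but not record_llm_call. By analogy, the payload it hands to _write probably looks like the sketch below. This is an assumption based on the call sites and the LLM call record shown earlier, not the library's actual source, and build_llm_call_record is a hypothetical standalone name used so the snippet runs on its own.

```python
def build_llm_call_record(
    model: str,
    input_tokens: int,
    output_tokens: int,
    stop_reason: str,
    duration_ms: float = 0,
) -> dict:
    """Assumed llm_call payload, before _write adds run_id, step, and ts."""
    return {
        "type": "llm_call",
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "stop_reason": stop_reason,
        "duration_ms": duration_ms,
    }
```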

The summarize() method computes from the in-memory list:

def summarize(self) -> StepSummary:
    llm_steps = [s for s in self._steps if s["type"] == "llm_call"]
    tool_steps = [s for s in self._steps if s["type"] == "tool_call"]
    return StepSummary(
        run_id=self._run_id,
        total_steps=len(self._steps),
        llm_call_count=len(llm_steps),
        tool_call_count=len(tool_steps),
        total_tokens=sum(s.get("input_tokens", 0) + s.get("output_tokens", 0) for s in llm_steps),
        error_count=sum(1 for s in tool_steps if s.get("error")),
        duration_ms=(time.time() - self._start_ts) * 1000,
    )
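
StepSummary itself is not shown. Judging from the keyword arguments above, it could be a plain dataclass along these lines. The field types and the frozen setting are assumptions.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StepSummary:
    """Assumed shape, inferred from the arguments summarize() passes."""

    run_id: str
    total_steps: int
    llm_call_count: int
    tool_call_count: int
    total_tokens: int
    error_count: int
    duration_ms: float


# Example values taken from the three-step run at the top of the post.
example = StepSummary(
    run_id="run-abc123",
    total_steps=3,
    llm_call_count=2,
    tool_call_count=1,
    total_tokens=1200 + 340 + 1600 + 180,
    error_count=0,
    duration_ms=2100.0,  # illustrative value
)
summary_fields = asdict(example)  # plain dict, convenient for structured logging
```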

Writes are thread-safe because a single threading.Lock guards the step counter, the in-memory append, and the file append. All three happen under the same lock, so step numbers are unique and the in-memory list and the JSONL file stay in the same order.
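
A quick way to convince yourself the pattern holds is to hammer it from several threads and compare the file with the list. LockedAppender below is a stand-in that mirrors the _write pattern, not the library class, and hammer is a hypothetical test helper.

```python
import json
import threading
from pathlib import Path


class LockedAppender:
    """Stand-in mirroring the _write pattern: one lock around list and file append."""

    def __init__(self, path: str):
        self._path = Path(path)
        self._steps: list[dict] = []
        self._lock = threading.Lock()

    def write(self, record: dict) -> None:
        with self._lock:
            record["step"] = len(self._steps) + 1
            self._steps.append(record)
            with self._path.open("a") as f:
                f.write(json.dumps(record) + "\n")


def hammer(appender: LockedAppender, threads: int = 8, writes_each: int = 50) -> None:
    """Call write() concurrently from several threads and wait for all of them."""

    def worker(n: int) -> None:
        for i in range(writes_each):
            appender.write({"type": "tool_call", "worker": n, "i": i})

    pool = [threading.Thread(target=worker, args=(n,)) for n in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
```

The lock only coordinates threads inside one process. Two processes appending to the same file would need a different mechanism, such as one file per run.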

When to Use It

Use it for every agent run in production. The JSONL file is small: a typical 5-step run produces under 2 KB. Storage cost is negligible compared to the debugging value.

Use it for cost monitoring. The token counts per run let you identify which run types are expensive. Filter by stop_reason == "max_tokens" to find runs where a response was cut off at the output token limit.
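
The stop_reason filter takes only a few lines over the raw files. This is a sketch assuming the record fields shown in this post. iter_records and runs_with_stop_reason are illustrative helpers, not part of the library.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_records(paths: Iterable[Path]) -> Iterator[dict]:
    """Yield every step record from the given JSONL files."""
    for path in paths:
        with path.open() as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)


def runs_with_stop_reason(records: Iterable[dict], stop_reason: str = "max_tokens") -> list[str]:
    """Return the sorted, unique run_ids with at least one matching llm_call."""
    hits = {
        r["run_id"]
        for r in records
        if r.get("type") == "llm_call" and r.get("stop_reason") == stop_reason
    }
    return sorted(hits)
```

For example, runs_with_stop_reason(iter_records(Path("./logs").glob("*.jsonl"))) lists the affected runs, assuming your logs live under ./logs.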

Use it for SLA monitoring. Each step record has a timestamp and duration. You can compute P95 tool latency and P95 run duration directly from the log files with a few lines of Python.
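
A sketch of those few lines, using a nearest-rank percentile and assuming records are already loaded as dicts with the duration_ms and ts fields shown earlier. The helper names are illustrative, and run duration is approximated as the span between a run's first and last step timestamps.

```python
import math
from typing import Iterable


def percentile(values: Iterable[float], pct: int) -> float:
    """Nearest-rank percentile for pct in 1..100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p95_tool_latency_ms(records: Iterable[dict]) -> float:
    """P95 of duration_ms over tool_call records."""
    return percentile(
        (r["duration_ms"] for r in records if r.get("type") == "tool_call"), 95
    )


def run_duration_ms(steps: list[dict]) -> float:
    """Span between the first and last step timestamps of one run, in milliseconds."""
    stamps = [s["ts"] for s in steps]
    return (max(stamps) - min(stamps)) * 1000
```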

Skip it for single-shot scripts. If you are running a one-off agent task that you will never need to audit, the overhead of a step log adds complexity without benefit.

Install

pip install git+https://github.com/MukundaKatta/agent-step-log

# Or from PyPI
pip install agent-step-log

A complete agent loop with step logging, using agent-run-id to supply the run ID:

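import time

from agent_step_log import StepLog
from agent_run_id import RunContext

# call_llm, execute_tool, and extract_text are application-specific helpers defined elsewhere.
# logger is assumed to accept keyword fields, as structlog loggers do.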

async def run_agent(task: str) -> str:
    ctx = RunContext.start()
    log = StepLog(path=f"./logs/{ctx.run_id}.jsonl", run_id=str(ctx.run_id))

    messages = [{"role": "user", "content": task}]

    while True:
        response = call_llm(messages)
        log.record_llm_call(
            model=response.model,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            stop_reason=response.stop_reason,
        )

        if response.stop_reason == "end_turn":
            summary = log.summarize()
            logger.info("run_complete", **summary.__dict__)
            return extract_text(response)

        for tool_call in response.tool_calls:
            start = time.monotonic()
            try:
                result = execute_tool(tool_call.name, tool_call.input)
                log.record_tool_call(
                    tool=tool_call.name,
                    input=tool_call.input,
                    output=result,
                    duration_ms=(time.monotonic() - start) * 1000,
                )
            except Exception as e:
                log.record_tool_call(
                    tool=tool_call.name,
                    input=tool_call.input,
                    error=str(e),
                    duration_ms=(time.monotonic() - start) * 1000,
                )
                raise

        # Append the assistant turn and the tool results to `messages` here
        # (the format depends on your LLM client) so the next call_llm sees them.

Sibling Libraries

Library	What it solves
`agent-debug-replay`	Step-through navigator for agent-step-log JSONL files
`agenttap`	Wire-level capture including full prompt and response bodies
`agentsnap`	Usage tracking with cost aggregation per run
`agent-run-id`	Generates and propagates the run_id field
`agent-event-bus`	In-process pub/sub to broadcast step events to multiple subscribers

The observability stack: agent-run-id generates the run ID, agent-step-log records steps, agenttap captures wire traffic, and agent-debug-replay lets you navigate the combined picture.

What's Next

Structured error context: when record_tool_call receives an error, also capture the exception type and first 3 frames of the traceback. This makes the JSONL log actionable for error triage without a full debugger.
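
Until that lands, a caller can build similar context with the standard traceback module and pass it along in the error field or next to it. The sketch below is one possible shape, not the planned implementation. error_context and its output keys are hypothetical.

```python
import traceback


def error_context(exc: BaseException, max_frames: int = 3) -> dict:
    """Summarize an exception as JSON-serializable data: type, message, first frames."""
    # extract_tb lists frames from the outermost call to the raise site.
    frames = traceback.extract_tb(exc.__traceback__)[:max_frames]
    return {
        "error_type": type(exc).__name__,
        "error": str(exc),
        "frames": [
            {"file": f.filename, "line": f.lineno, "function": f.name}
            for f in frames
        ],
    }
```

Slicing with [:max_frames] keeps the outermost frames. If the frames nearest the failure are more useful for triage, [-max_frames:] keeps those instead.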

Run metadata header: a first record in each file with run-level metadata (agent version, task type, environment). Makes it easier to filter logs by deployment context rather than just run ID.
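
A header like that can be approximated today by appending one extra line before the first step is recorded. In this sketch the run_header type, the step value 0, and the write_run_header helper are all assumptions. They are not part of the current record format.

```python
import json
import time
from pathlib import Path


def write_run_header(path: str, run_id: str, **metadata: str) -> dict:
    """Append a run_header record; call it before any step is recorded."""
    header = {
        "run_id": run_id,
        "step": 0,             # assumed: real steps start at 1
        "type": "run_header",  # assumed type name
        "ts": time.time(),
        **metadata,
    }
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a") as f:
        f.write(json.dumps(header) + "\n")
    return header
```

Readers that filter on type == "llm_call" or type == "tool_call" skip the header line without changes.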

Storage backend: StepLog(path="s3://bucket/logs/", run_id=...) for agents running in cloud environments where local file writes are ephemeral or unavailable.

Built as part of the agent-stack family: composable Python primitives for production LLM agents.