DEV Community: Krishi Attri

stepback: rewind AI coding agent edits without touching your real git

Krishi Attri — Fri, 24 Jul 2026 15:52:27 +0000

an AI coding agent edits six files, decides one needs a full rewrite, and drops a stray debug_output.json in your repo root while it's at it. now you want the last good state back. git stash doesn't help: the agent never committed anything, so there's no intermediate state to return to, and a stash is all-or-nothing over your uncommitted changes (and skips untracked files like that json by default). your editor's undo stack unwinds one file at a time, if it even covers all six. so you do it by hand: git diff, squint, revert some hunks, delete the junk file, hope you didn't miss anything.

i built stepback for that moment. it's a CLI that wraps your agent: stepback run -- claude, stepback run -- codex, stepback run -- aider, or anything else that edits files on disk. it snapshots your working tree as the agent works and lets you rewind to any snapshot exactly, deleted files and binaries included.

$ stepback run -- claude
  ... let the agent work ...
$ stepback list
  #7    2m ago  3 file(s) [~3]      +conv:claude-code
  #6    5m ago  1 file(s) [+1]      +conv:claude-code
$ stepback diff 6
$ stepback rewind 6
  restored to checkpoint #6.  (`stepback redo` to undo this rewind)

how it snapshots without touching your git

the constraint that shaped the whole design: stepback must never be able to corrupt or even nudge the repo you're actually working in. no writes to HEAD, no commits on your branch, no touching your staged changes. so it doesn't use your real index at all.

every checkpoint runs through GIT_INDEX_FILE pointed at a throwaway index, not .git/index:

GIT_INDEX_FILE=<tmp>  git add -A .        # into a throwaway index, respects .gitignore
GIT_INDEX_FILE=<tmp>  git write-tree      # -> tree SHA, content-addressed

that tree gets committed with commit-tree under stepback's own author identity (so it works even if you have no git user configured locally) and pointed to by a ref under refs/checkpoints/<session>/<n>. never a branch ref, never HEAD. identical trees dedupe automatically, so an idle agent doesn't spam checkpoints. outside a git repo entirely, stepback creates a private bare object store at .stepback/shadow.git and runs the same plumbing against that instead.

restore is the mirror image. diff the current tree against the target, stage only the files that actually changed into a temp location with checkout-index, delete what shouldn't exist anymore, then move each staged file over the real one with an atomic same-filesystem rename. if a restore dies halfway, every individual file is either fully old or fully new, never a half-write. and before any of that runs, stepback commits your current state and pushes it onto a redo stack protected by its own ref, so stepback redo works even if the rewind itself gets interrupted.
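the per-file atomicity is the easiest piece to show in isolation. a minimal sketch, assuming the current and target states are already available as in-memory maps of relative path to bytes (stepback reads them from git trees and stages with checkout-index; the function names here are hypothetical). it also skips the redo-stack commit that runs before a real restore.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Replace `path` so a reader sees the old bytes or the new bytes, never a mix."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the destination so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".restore-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # atomic rename over the real file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def restore(root: Path, current: dict, target: dict) -> list:
    """Bring `root` from `current` to `target`; return the paths that were touched."""
    touched = []
    # Delete files that exist now but not in the target snapshot.
    for rel in sorted(current.keys() - target.keys()):
        (root / rel).unlink(missing_ok=True)
        touched.append(rel)
    # Rewrite only the files whose content actually differs.
    for rel, data in sorted(target.items()):
        if current.get(rel) != data:
            atomic_write(root / rel, data)
            touched.append(rel)
    return touched
```

unchanged files are never rewritten, so their mtimes survive and an interrupted restore leaves fewer files in question.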

the isolation guarantee, and how it's checked

"never touches your real git" is easy to claim and easy to accidentally violate the first time someone rewinds mid-merge. so it's tested directly, not just asserted: the suite checks that the index, HEAD, and branch are byte-for-byte unchanged after a checkpoint-and-rewind cycle under a normal repo, a detached HEAD, a repo with MERGE_HEAD present, and a repo with staged-but-uncommitted changes sitting in the index. 67 tests total, covering that isolation guarantee plus exact restore of binaries, symlinks, unicode and space and leading-dash filenames, .gitignore handling, crash recovery, and cross-process locking, so a running watcher and a manual rewind typed in another terminal can't corrupt each other's state.

two layers, and i mean it about "best-effort"

file checkpoint and restore is layer 1, and it's the part i'll vouch for: solid, tested, and the part that's actually hard to get right (git plumbing, atomicity, crash recovery). layer 2 sits on top of that. stepback also snapshots the agent's on-disk session transcript at each checkpoint (~/.claude/projects/<slug>/<uuid>.jsonl for claude code, the newest file under ~/.codex/sessions/ for codex), so a rewind can hand you a resume hint like claude --resume 9f3c... and you pick the conversation back up from that point, not just the code.

i'm not going to oversell that part. those are private, undocumented session formats with zero stability guarantee, owned by vendors who can change them in any release. layer 2 is quarantined behind an adapter interface on purpose: every adapter method is wrapped in try/except at the engine level, so an adapter that raises on every single call still can't touch layer 1. if your agent isn't claude code or codex, or the format shifted under an update, stepback drops to file-only rewind silently. no crash, no half-broken state, just one fewer convenience. that's the whole policy: a real result or a clean degrade, nothing in between.
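the quarantine pattern is small enough to sketch. this is an illustration under assumptions, not stepback's engine: guarded, checkpoint, BrokenAdapter, and snapshot_transcript are hypothetical names. the point is that only the layer 2 call goes through the guard, so an adapter failure can remove the conversation hint but cannot affect the file snapshot.

```python
import logging
from typing import Any, Callable, Optional

log = logging.getLogger("quarantine")


def guarded(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Optional[Any]:
    """Call an adapter method; on any exception, degrade to None instead of raising."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        log.debug("adapter call failed, continuing file-only", exc_info=True)
        return None


class BrokenAdapter:
    """Stand-in for an adapter whose vendor format changed: it raises on every call."""

    def snapshot_transcript(self, checkpoint_id: int) -> str:
        raise RuntimeError("session format changed")


def checkpoint(files_snapshot: Callable[[], str], adapter: Any, checkpoint_id: int) -> dict:
    """Layer 1 runs unguarded and must succeed; layer 2 is optional."""
    record = {"id": checkpoint_id, "tree": files_snapshot()}
    record["conversation"] = guarded(adapter.snapshot_transcript, checkpoint_id)
    return record
```

with the broken adapter, checkpoint still returns a record with a valid tree and a conversation of None, which is the "clean degrade" case.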

try it

pip install stepback
stepback run -- claude          # or -- codex, -- aider, or any command
stepback status                  # storage mode, session, watcher, detected adapters
stepback list                    # checkpoints, newest first
stepback diff <id>               # what a checkpoint changed (-w to diff vs current tree)
stepback rewind <id>             # preview + confirm + restore
stepback rewind <id> -n          # dry run, show the plan, touch nothing
stepback redo                    # undo the last rewind

it's 0.1.0. the file layer is the part that's had the real engineering effort, and it's where i want scrutiny. issues and PRs welcome, especially edge cases in restore. code is at github.com/Archerkattri/stepback, package is on PyPI.

Zero-shot point-cloud registration actually transfers: BUFFER-X inside splatreg

Krishi Attri — Thu, 02 Jul 2026 05:04:13 +0000

splatreg registers 3D Gaussian splats: two 3DGS scans in, one SE(3)/Sim(3) transform out, optionally one fused splat. Its coarse-init stage seeds a Levenberg–Marquardt refine, and until recently the practical default for real scans was a classical FPFH+RANSAC seed. This post is about what happened when I swapped in BUFFER-X (ICCV 2025), a zero-shot learned registration model — and, since it's probably the more useful part, the exact recipe for building its 2023-era CUDA extensions on a 2026 stack.

Why zero-shot matters for a splat registrar

Per-dataset-trained backbones like PSReg and DiffusionPCR top the 3DMatch leaderboard at 95%+ registration recall. But a splat registrar should not require training a per-scene or per-sensor model to align two captures someone made with a phone and a drone. So splatreg deliberately keeps a generalist seed: BUFFER-X is a single pretrained model that claims to register across sensors and scales with no per-dataset tuning. The question was whether the claim survives contact with the official benchmarks when wired in as a real seed.

The numbers

I ran the complete official gt.log pair sets — not a curated subset — with a pair counted as recalled at RRE < 15° and RTE < 0.3 m:

3DMatch (8/8 scenes, n=1619): BUFFER-X seed 0.962 recall, median RRE 1.46°, vs 0.630 / 2.12° for the classical FPFH seed.
3DLoMatch (the hard 10–30% overlap split, n=1781): 0.777 / 2.77° vs 0.122 / 103.4°.

That 3DLoMatch line is the story: 6.4× the recall, and the classical seed's median error of 103° means it isn't "less accurate" there — it's landing in random basins. BUFFER-X won every scene on both splits.

The caveat that keeps these numbers honest: both seeds were pushed through the identical lighter feature_align refine, so the comparison isolates the seed. These are not full-pipeline absolute numbers to lay next to leaderboard entries; they answer "which seed should splatreg trust on real scans," nothing more.
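The recall criterion used for these numbers (RRE < 15° and RTE < 0.3 m) can be written down in a few lines. This is a minimal pure-Python sketch using the standard definitions of the two errors, with rotations as 3×3 nested lists; the function names are mine, not splatreg's, and the actual evaluation script may differ in detail.

```python
import math


def rotation_error_deg(R_est, R_gt):
    """Relative rotation error: the angle of R_est^T R_gt, in degrees."""
    # trace(R_est^T R_gt) equals the elementwise product summed over all entries.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est, t_gt):
    """Relative translation error: Euclidean distance, in the units of the inputs."""
    return math.dist(t_est, t_gt)


def is_recalled(R_est, t_est, R_gt, t_gt, rre_max_deg=15.0, rte_max_m=0.3):
    """A pair counts as recalled only if both errors are under their thresholds."""
    return (rotation_error_deg(R_est, R_gt) < rre_max_deg
            and translation_error(t_est, t_gt) < rte_max_m)


def recall(flags):
    """Fraction of pairs recalled, given one boolean per evaluated pair."""
    flags = list(flags)
    return sum(flags) / len(flags)
```

Because the test is a conjunction, a seed that lands in the wrong rotational basin fails the pair no matter how close its translation is.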

Here is one real low-overlap pair (7-scenes-redkitchen 35→46, ground-truth overlap 0.10) watched end to end: the classical seed slews the fragment into the wrong basin at 151.5° error, then BUFFER-X + refine locks on at 2.0°. Both transforms are actual library outputs; the accompanying animation interpolates between real estimates, and nothing is hand-posed.

The build recipe (the part you actually came for)

BUFFER-X ships native extensions written for an older stack. Getting them to build on CUDA 12.8 / RTX 5090 (sm_120) / torch 2.11 / numpy 2.4, without sudo, took a day of archaeology. The full recipe is in docs/BUFFERX_BUILD_MODERN_CUDA.md; these are the walls I hit:

1. pointnet2_ops hardcodes dead GPU architectures. Its setup.py sets TORCH_CUDA_ARCH_LIST = "3.7+PTX;5.0;...", and nvcc 12.8 flat-out rejects compute_37. Patch that line to your real arch ("12.0" for sm_120), then install from the package directory with pip install --no-build-isolation . (the lone dot is the path argument).

2. The KPConv C++ wrappers use numpy 1.x C-API names that no longer compile against numpy 2.x. The port is mechanical once you know it: NPY_IN_ARRAY → NPY_ARRAY_IN_ARRAY, and cast the PyObject* handles to PyArrayObject* everywhere PyArray_NDIM/DIM/DATA is called.

3. You don't need apt install libtbb-dev. pip install tbb tbb-devel drops tbb/tbb.h under <venv>/include; point CPLUS_INCLUDE_PATH there (plus a --depth 1 clone of header-only Eigen) and the wrappers compile sudo-free.

4. Two CUDA deps don't deserve a build at all. BUFFER-X only uses knn_cuda.KNN with k=1 — that's torch.cdist + topk. And torch_batch_svd is just torch.linalg.svd, which batches natively now. Tiny pure-torch shim modules on the path replace both; they ship in docs/bufferx_shims/.

5. The silent killer: the pretrained checkpoints are full-model state dicts, with keys prefixed Desc. or Pose. (dot included). Load one into a submodule with strict=False and nothing matches: you get randomly initialized weights that produce garbage seeds with no error anywhere. If your zero-shot model performs like a random-pose generator, check this first.
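For wall 5, one way to make the failure loud is to strip the prefix yourself and refuse to continue unless the keys line up exactly. This is a minimal sketch on plain dicts, assuming only the Desc. and Pose. prefixes named above; the helper name and the layer names in the usage note are hypothetical.

```python
def submodule_state(full_state: dict, prefix: str, expected_keys) -> dict:
    """Extract one submodule's weights from a full-model state dict.

    Raises KeyError unless the stripped keys match `expected_keys` exactly,
    so a prefix mismatch cannot silently leave random weights in place.
    """
    sub = {key[len(prefix):]: value
           for key, value in full_state.items()
           if key.startswith(prefix)}
    expected = set(expected_keys)
    missing = expected - sub.keys()
    unexpected = sub.keys() - expected
    if missing or unexpected:
        raise KeyError(
            f"prefix {prefix!r}: missing={sorted(missing)} "
            f"unexpected={sorted(unexpected)}"
        )
    return sub
```

With PyTorch, expected_keys would be module.state_dict().keys(), and the result can then go into module.load_state_dict(sub, strict=True), which keeps the strict check as a second line of defense.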

Using it

pip install splatreg

from splatreg.api import register
result = register(target, source, init="bufferx")   # zero-shot seed + LM refine

If the BUFFER-X weights or extensions are absent, init="bufferx" logs a note and falls back to the classical robust seed — it never fails silently. Everything downstream (Sim(3) scale recovery, spherical-harmonic rotation via real-basis Wigner-D, pose covariance for pose graphs, merge + dedupe) is identical regardless of which seed you chose.

What it doesn't do

Zero-shot does not mean magic. Below roughly 40% retained overlap the rotation-disambiguating geometry is physically absent, and no seed fixes that — splatreg flags those cases as ambiguous rather than silently wrong-posing, and scale is unobservable under thin overlap no matter what. The 0.962/0.777 figures are seed-isolation numbers under one shared refine, not leaderboard entries; per-dataset-trained models still hold the absolute 3DMatch record and I say so in the README. The BUFFER-X path needs a real CUDA build (the recipe above) — CPU-only installs get the classical fallback. And splatreg itself registers splats; if all you have are raw point clouds, BUFFER-X upstream serves you directly without any of my wrapping.

Every number here has a reproduction path in RESULTS.md, and the figure/GIF generators live in examples/.

Your AI can stop hallucinating math: a real Lean kernel over MCP

Krishi Attri — Thu, 02 Jul 2026 05:03:52 +0000

I got tired of watching AI assistants confidently misquote theorems, so I built mathlas: an MCP server that gives any agent a real Lean kernel, PSLQ, OEIS matching, and a 3.68M-document theorem index. No LLM inside, no API key, Apache-2.0.

The premise is a strict division of labor: the AI is the brain, mathlas is the hands. Every tool returns data — candidates, verdicts, checklists — and the agent does the judging. No tool inside mathlas ever calls an LLM, which means no tool inside mathlas can hallucinate.

The discipline: airtight or nothing

Every verdict-producing tier follows one rule: return an independently-checkable fact, or an honest "nothing." Never a plausible guess.

Here is what that looks like in practice — real in-process tool outputs, captured from the live server:

verify_formal runs the actual Lean 4.31.0 kernel. Hand it a proposition and your Lean 4 proof and you get one of:

VERIFIED_PROOF — the kernel typechecked the full declaration;
REFUTED — with the kernel's error message verbatim, so the agent can repair and retry;
REJECTED — for sorry/admit holes (Lean itself exits 0 on a sorried proof; mathlas scans the source and the kernel's sorryAx diagnostics, so you can't sneak a hole past it);
UNDETERMINED — when the toolchain is missing, an import can't resolve, or the 60 s cap hits. An honest shrug, never a fake verdict.
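The four outcomes above can be modeled as a small decision function. This is a sketch of the idea, not mathlas's code: the function name, its parameters, and the order of the checks are assumptions. It also simplifies in two places. The regex is deliberately conservative and would flag the word sorry inside a comment, and an unresolved import (which the real tool reports as UNDETERMINED) is not distinguished from other kernel errors here.

```python
import re
from enum import Enum
from typing import Optional


class Verdict(Enum):
    VERIFIED_PROOF = "VERIFIED_PROOF"
    REFUTED = "REFUTED"
    REJECTED = "REJECTED"
    UNDETERMINED = "UNDETERMINED"


# Holes that make a proof incomplete even though Lean exits 0.
HOLE = re.compile(r"\b(sorry|admit)\b")


def classify(source: str, returncode: Optional[int], diagnostics: str) -> Verdict:
    """Map one kernel run to a verdict.

    `returncode` is None when no run completed (toolchain missing or timeout).
    """
    if HOLE.search(source) or "sorryAx" in diagnostics:
        return Verdict.REJECTED
    if returncode is None:
        return Verdict.UNDETERMINED
    if returncode != 0:
        return Verdict.REFUTED  # caller passes `diagnostics` back verbatim
    return Verdict.VERIFIED_PROOF
```

The useful property is that VERIFIED_PROOF is the last branch: it is only reachable when every cheaper reason to withhold it has been ruled out.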

On the numeric side, identify_constant and verify_numeric use PSLQ plus an independent high-precision re-evaluation (50–51 digits). Type in 1.6449340668482264... and it hands back pi**2/6, re-verified, or nothing at all. Measured false-positive rate across every tier (numeric, sequence, Ramanujan-style relations): zero. Structureless inputs produced no false hits in 8 out of 8 attempts to bait it. Full tables with reproduction commands live in RESULTS.md.
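The re-evaluation half of that check can be illustrated with nothing but the standard library. This sketch does not implement PSLQ (the search that proposes pi**2/6 in the first place); it only shows the second step, recomputing a proposed closed form to 50 digits and comparing it against the digits the caller supplied. The function names and the one-unit-in-the-last-place tolerance are my assumptions, and the pi series is the recipe from the Python decimal documentation.

```python
from decimal import Decimal, localcontext


def pi_decimal(digits: int) -> Decimal:
    """Compute pi to `digits` significant digits (series from the decimal docs)."""
    with localcontext() as ctx:
        ctx.prec = digits + 5  # guard digits while summing
        lasts, t, s, n, na, d, da = 0, Decimal(3), 3, 1, 0, 0, 24
        while s != lasts:
            lasts = s
            n, na = n + na, na + 8
            d, da = d + da, da + 32
            t = (t * n) / d
            s += t
        ctx.prec = digits
        return +s  # unary plus rounds to the context precision


def zeta2(digits: int = 50) -> Decimal:
    """Candidate closed form pi**2/6, evaluated independently of the input."""
    with localcontext() as ctx:
        ctx.prec = digits
        pi = pi_decimal(digits)
        return pi * pi / 6


def agrees(observed: str, candidate: Decimal, digits: int = 50) -> bool:
    """True if `candidate` matches `observed` to within one unit of its last digit."""
    with localcontext() as ctx:
        ctx.prec = digits
        obs = Decimal(observed)
        ulp = Decimal(1).scaleb(obs.as_tuple().exponent)
        return abs(obs - candidate) <= ulp
```

Passing the observed value as a string matters: it preserves exactly the digits that were typed, so the tolerance is tied to the input's stated precision rather than to binary float rounding.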

Twelve tools, one pipeline

The tools compose into a workflow the agent drives:

search_existing_math → applicability_checklist / mapping_scaffold → (AI judges) → verify_numeric / verify_formal

The retrieval side is search_existing_math, served from a 3,683,428-document dense + BM25 + RRF index (the text side is open on Hugging Face). applicability_checklist is the tool I'm proudest of and the one nobody else ships: it decomposes a theorem's hypotheses into atomic preconditions the AI verifies one by one — the guardrail against applying Banach's fixed-point theorem to an incomplete space. Then there's identify_sequence (exact OEIS term-match), search_formal_math (proxies Loogle + LeanSearch for mathlib declarations, provenance-labeled), conjecture_relation (Ramanujan Machine-style PSLQ over a rich basis), a sandboxed funsearch harness, and add_finding, which matters for the benchmark below.
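The fusion step in a dense + BM25 + RRF index is reciprocal rank fusion, which is simple enough to show whole. This is a generic sketch of the standard formula, where each document scores the sum of 1/(k + rank) over the rankings it appears in. The constant k = 60 is the value commonly used in the literature; what mathlas uses is not stated here, so treat it as an assumption.

```python
from collections import defaultdict


def rrf(rankings, k: int = 60):
    """Fuse several ranked lists of document ids into one list.

    Only ranks are used, so the dense and BM25 scores never need to be
    put on a common scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


dense = ["a", "b", "c"]   # hypothetical dense-retriever order
bm25 = ["b", "d", "a"]    # hypothetical BM25 order
fused = rrf([dense, bm25])  # ["b", "a", "d", "c"]
```

In the example, b wins because it is near the top of both lists, while c and d each appear in only one.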

The benchmark, with its caveat up front

On TheoremSearch's own 110 human-written queries, mathlas with its self-augmenting web loop scores 59.1% theorem-level Hit@20 (65/110) against TheoremSearch's 45.0%. Sounds great. Here is the part you should read before quoting it:

This is a loop-vs-static comparison, not a corpus-vs-corpus one. Corpus-only, mathlas's baseline on this benchmark is 10.0% — TheoremSearch withheld ~85% of their private 9.2M corpus (the non-redistributable arXiv-licensed papers), so 95 of the 110 target papers are unreachable for any open system. What the 59.1% measures is the add_finding loop: the agent web-finds each missing statement, embeds it with the same Qwen3-Embedding-8B, and fuses it into the live index at runtime. TheoremSearch's 45.0% is a static system answering from its full private corpus. The honest headline is "a validated writeback loop repairs an open system's coverage gap at runtime," not "my index beats theirs." The math domain is the right place for such a loop precisely because the write-back candidate can be deterministically checked (verify_numeric / verify_formal) before it's trusted.
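For readers who want to check the arithmetic, theorem-level Hit@20 is the fraction of queries whose target appears anywhere in the top 20 results. This is a minimal sketch with hypothetical data structures (one ranked list and one target per query); the benchmark script in the repo is the authoritative version.

```python
def hit_at_k(results: dict, targets: dict, k: int = 20) -> float:
    """Fraction of queries whose target id is among the first k results."""
    hits = sum(1 for query, ranked in results.items()
               if targets[query] in ranked[:k])
    return hits / len(results)


def as_percent(hits: int, total: int) -> float:
    """Percentage rounded to one decimal, as reported in the text."""
    return round(100.0 * hits / total, 1)
```

as_percent(65, 110) gives 59.1, matching the headline figure.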

Reproduce it with benchmarks/webaug_110_bench.py in the repo.

Try it in one line

With uv installed:

claude mcp add mathlas -- uvx mathlas-mcp

That's it — Claude Code now sees twelve tools. Plain pip works too (pip install mathlas-mcp), Cursor or any MCP client can point at the same stdio command, and if the official mcp SDK isn't installed the server falls back to a dependency-free stdio JSON-RPC implementation, so it always runs. It's also on the official MCP registry as io.github.Archerkattri/mathlas.
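To make the fallback concrete, here is roughly what a dependency-free stdio JSON-RPC loop looks like. This is a generic JSON-RPC 2.0 sketch using newline-delimited messages, not mathlas's implementation: handle_line, serve, and the echo method are hypothetical, and it omits the MCP handshake and tool listing that a real server needs.

```python
import json
import sys


def _error(req_id, code, message):
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "error": {"code": code, "message": message}})


def handle_line(line, methods):
    """Process one JSON-RPC 2.0 message; return the response text or None."""
    try:
        req = json.loads(line)
    except json.JSONDecodeError:
        return _error(None, -32700, "Parse error")
    if not isinstance(req, dict):
        return _error(None, -32600, "Invalid Request")
    handler = methods.get(req.get("method"))
    if "id" not in req:
        # Notification: run it if known, but never reply.
        if handler is not None:
            try:
                handler(req.get("params") or {})
            except Exception:
                pass
        return None
    if handler is None:
        return _error(req["id"], -32601, "Method not found")
    try:
        result = handler(req.get("params") or {})
    except Exception as exc:
        return _error(req["id"], -32603, f"Internal error: {exc}")
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})


def serve(methods, stdin=sys.stdin, stdout=sys.stdout):
    """Read one message per line and write one response per line."""
    for line in stdin:
        if not line.strip():
            continue
        response = handle_line(line, methods)
        if response is not None:
            stdout.write(response + "\n")
            stdout.flush()
```

Everything here is standard library, which is what lets a fallback of this kind run when the SDK is missing.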

What it doesn't do

mathlas does not write proofs — the generator/verifier split is absolute, so the kernel checks your proof and reports exactly why it fails, but the repair is on you (or your agent). Corpus-only retrieval will not beat TheoremSearch on their benchmark; the 10.0% baseline is a licensing-bounded floor and I report it as such. Two tools degrade without optional local data: identify_sequence wants a local OEIS copy and verify_formal wants a Lean toolchain — without them you get a clear "not available," never a fake answer. The full-quality index needs the Qwen3-Embedding-8B encoder, which is not laptop hardware; there are measured quantized and 0.6B tiers that trade 7–9 points of recall for running on 4 CPU threads, documented with their exact costs. And it's not a CAS — if you want to symbolically massage an expression you gave it, sympy is the right tool; mathlas finds, scopes, and verifies existing math.

Code, benchmarks, and every number's reproduction command: github.com/Archerkattri/mathlas.