Phase 1 of NumPath is done. Seven of eight Definition of Done items are checked — the eighth requires real children completing pilot sessions, which no amount of code will substitute for. The stack runs cleanly in Docker Compose, 56 unit tests pass, and a student can log in, answer ten problems, and see their knowledge state update in real time.
What the commit history doesn't show is the afternoon I spent fighting four bugs that don't appear in any FastAPI or Vue tutorial. This post is that afternoon.
What We Built
NumPath is an adaptive math tutor for children with dyscalculia. Phase 1 ships the minimum research instrument: a student practice loop, a rule-based adaptive engine, and a read-only teacher dashboard. No ML yet — just clean infrastructure and a data collection pipeline capable of generating the 150+ attempt records that Phase 2 needs to train the BKT model.
The stack: FastAPI 0.110 + SQLAlchemy 2 + Alembic + asyncpg on the backend; Vue 3 + Tailwind + Pinia on the frontend; PostgreSQL 16 + Redis 7 in Docker Compose.
Bug 1: passlib AttributeError on bcrypt ≥4.0
The symptom was immediate on first login attempt:
AttributeError: module 'bcrypt' has no attribute '__about__'
passlib has a version check that reads bcrypt.__about__.__version__. bcrypt 4.0 removed the __about__ module. The libraries have been incompatible for two years and passlib is effectively unmaintained.
The fix: delete passlib entirely. Replace it with three lines of direct bcrypt calls:
# backend/auth/password.py
import bcrypt
def hash_password(plain: str) -> str:
    return bcrypt.hashpw(plain.encode(), bcrypt.gensalt()).decode()
def verify_password(plain: str, hashed: str) -> bool:
    return bcrypt.checkpw(plain.encode(), hashed.encode())
pyproject.toml: swap "passlib[bcrypt]>=1.7.4" for "bcrypt>=4.0.0". Done. Don't reach for passlib on new Python projects — the dependency is dead.
Bug 2: pnpm 10 security policies blocking Docker builds
The frontend Dockerfile used node:20-slim and installed the latest pnpm via corepack. When pnpm 10 shipped, the build started failing with:
ERR_PNPM_PREPARE_PKG_FAILURE Error when preparing the package
Blocked by policy: electron-to-chromium@1.5.134 is not allowed
because it was released 0 days ago (policy: minimumReleaseAge=3 days)
pnpm 10 introduced release-age security policies that refuse to install packages published within the last N days. A reasonable feature in production — a CI-breaking surprise when your lock file pins a package that was published yesterday.
Two separate policies hit us: minimumReleaseAge and ignored-builds (which blocks esbuild and vue-demi unless explicitly allowed). The package.json "pnpm" field that's supposed to configure these policies did not take effect for us under pnpm 10: it logged a warning and the settings were never read.
The fix: pin to pnpm 9:
FROM node:22-slim
RUN corepack enable && corepack prepare pnpm@9.15.9 --activate
pnpm 9 has no release-age policies. The upgrade to pnpm 10 can wait until the project has a proper CI environment to absorb the breaking change.
Bug 3: FastAPI container connecting to localhost instead of postgres
The backend started cleanly. Every database call returned:
asyncpg.exceptions.ConnectionRefusedError: connection refused (host 127.0.0.1, port 5432)
The DATABASE_URL in .env was postgresql+asyncpg://numpath:numpath@localhost:5432/numpath. Inside a Docker Compose network, localhost is the container's own loopback — not the postgres service. The postgres container is reachable by its service name.
The fix: override the env var at the service level in docker-compose.yml:
backend:
  env_file: ../.env
  environment:
    DATABASE_URL: postgresql+asyncpg://numpath:numpath@postgres:5432/numpath
    REDIS_URL: redis://redis:6379/0
The environment block wins over env_file, so local development (which uses localhost) keeps working. Containers talk to each other by service name.
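A cheap guard against this class of mistake is to inspect the URL at startup. The sketch below is not from the NumPath codebase: database_host and check_database_url are hypothetical helpers, and the in_container flag is an assumption the caller would have to supply (for example from a variable set only in docker-compose.yml).

```python
from typing import Optional
from urllib.parse import urlsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def database_host(url: str) -> Optional[str]:
    """Return the host part of a SQLAlchemy-style database URL."""
    return urlsplit(url).hostname


def check_database_url(url: str, in_container: bool) -> None:
    """Raise early if a containerised process points at its own loopback."""
    host = database_host(url)
    if in_container and host in LOOPBACK_HOSTS:
        raise RuntimeError(
            f"DATABASE_URL host {host!r} is the container's own loopback; "
            "use the Compose service name instead"
        )


local_url = "postgresql+asyncpg://numpath:numpath@localhost:5432/numpath"
compose_url = "postgresql+asyncpg://numpath:numpath@postgres:5432/numpath"

check_database_url(local_url, in_container=False)   # fine on a dev machine
check_database_url(compose_url, in_container=True)  # fine inside Compose
```

Calling check_database_url(local_url, in_container=True) raises at boot, which is easier to diagnose than a connection error on the first request.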
Bug 4: SQLAlchemy column defaults not applied at construction time
This one cost the most time. POST /attempts returned a 500:
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
The BKT update equation was subtracting p_learn from an integer, and p_learn was None. The KCStateRecord model had:
class KCStateRecord(Base):
    p_learn: Mapped[float] = mapped_column(Float, default=0.3)
    p_guess: Mapped[float] = mapped_column(Float, default=0.2)
    p_slip: Mapped[float] = mapped_column(Float, default=0.1)
The bug: SQLAlchemy's default= is a client-side default that is applied when the INSERT is emitted at flush time, not when the object is constructed (server_default= is the server-side variant). When you construct KCStateRecord() in Python and haven't flushed to the database yet, those columns are None on the Python object. The domain code ran immediately after construction, before any flush.
The fix: set defaults explicitly in the constructor, then flush and refresh before returning:
record = KCStateRecord(
    student_id=student_id,
    skill_id=skill_id,
    p_mastery=0.1,
    p_learn=0.3,
    p_guess=0.2,
    p_slip=0.1,
    opportunity_count=0,
)
self._db.add(record)
await self._db.flush()          # write to DB so defaults are applied
await self._db.refresh(record)  # re-read the DB-populated values
The rule: if you use a newly constructed SQLAlchemy model object before any flush, assume every default= column is None. Either set defaults in the constructor or flush first.
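One way to keep the constructor values and the column defaults from drifting apart is to define the numbers once and build the constructor arguments from them. This is a sketch, not NumPath's code: BKT_DEFAULTS and new_kc_state_kwargs are hypothetical names, the argument types are assumptions, and the values are the ones shown above.

```python
# Hypothetical single source of truth for a fresh knowledge-component state.
BKT_DEFAULTS = {
    "p_mastery": 0.1,
    "p_learn": 0.3,
    "p_guess": 0.2,
    "p_slip": 0.1,
    "opportunity_count": 0,
}


def new_kc_state_kwargs(student_id: int, skill_id: str) -> dict:
    """Build explicit constructor arguments so no field is left as None."""
    kwargs = {"student_id": student_id, "skill_id": skill_id, **BKT_DEFAULTS}
    missing = [name for name, value in kwargs.items() if value is None]
    if missing:
        raise ValueError(f"unset fields on new KC state: {missing}")
    return kwargs


kwargs = new_kc_state_kwargs(student_id=1, skill_id="SUB_BORROW")
# In the repository this would become: KCStateRecord(**kwargs)
```

The same dictionary could also feed the mapped_column(default=...) arguments, so the model and the constructor cannot disagree.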
What the BKT update looks like in practice
With those bugs cleared, the full attempt flow works end to end. A correct answer on a SUB_BORROW problem with a fresh KCState shows:
before: p_mastery=0.100, opportunity_count=0
after: p_mastery=0.533, opportunity_count=1
That 0.1 → 0.533 jump is the Bayesian update working — prior p_mastery combines with p_learn, corrected for p_guess and p_slip. The math is covered in detail in Bayesian Knowledge Tracing in 37 lines of Python.
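The numbers above can be reproduced with the textbook BKT update. The sketch below assumes NumPath uses the standard two-step form (Bayesian posterior given the answer, then the learning transition); bkt_update is a hypothetical name, not the project's function.

```python
def bkt_update(p_mastery: float, correct: bool,
               p_learn: float, p_guess: float, p_slip: float) -> float:
    """Standard BKT step: condition on the answer, then apply learning."""
    if correct:
        evidence_known = p_mastery * (1 - p_slip)
        evidence_unknown = (1 - p_mastery) * p_guess
    else:
        evidence_known = p_mastery * p_slip
        evidence_unknown = (1 - p_mastery) * (1 - p_guess)
    # Probability the skill was already known, given what we observed.
    posterior = evidence_known / (evidence_known + evidence_unknown)
    # Chance the student learned the skill on this opportunity.
    return posterior + (1 - posterior) * p_learn


after = bkt_update(0.1, correct=True, p_learn=0.3, p_guess=0.2, p_slip=0.1)
print(f"{after:.3f}")  # 0.533
```

Step by step: the posterior is 0.09 / (0.09 + 0.18) = 0.333, and the learning transition adds (1 - 0.333) × 0.3 = 0.2, giving 0.533.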
Why It Matters for the Research
Phase 1's job was never to be elegant — it was to be instrumented. Every attempt record written to the attempts table is a training signal for Phase 2's BKT parameter estimation. We need ≥150 records (5 students × 3 sessions × 10+ problems) before Phase 2 can begin.
The bugs above are why research-grade software is harder than it looks. Three of them threaten the data in different ways: password hashing fails outright (detectable), Docker networking refuses every database call (detectable, but only once the stack runs in containers), and SQLAlchemy defaults produce None BKT parameters (corrupts ML inputs, hard to detect in test data).
The fix for all of them is the same: run the full stack. Not unit tests. Not import my_function; print(my_function()). Start the containers, log in as a real user, and watch what happens.
What We Learned
The honest retrospective:
Seed data is harder than it looks. Writing 60 hand-crafted math problems at three difficulty levels takes longer than writing the adaptive engine. Every problem needs a machine-checkable answer, a hint, and a calibrated difficulty score.
Docker Compose env_file + environment is the right pattern. env_file carries the defaults; environment carries container-specific overrides. The pattern is obvious in hindsight and invisible until you need it.
The flush() + refresh() pattern is load-bearing for async SQLAlchemy. Any code that creates an ORM object and immediately passes it to domain logic needs an explicit flush. Autoflush only runs before a query is emitted, so nothing applies column defaults between add() and the domain call.
What's Next
Phase 2: BKT parameter estimation from real student data, and a mistake classifier that categorises subtraction errors beyond "wrong." The attempts table is waiting.
Key Takeaways
- passlib is dead: use bcrypt directly; it's three functions and no transitive dependency risk
- Docker Compose containers reach each other by service name, not localhost; override DATABASE_URL in the environment block rather than the env_file
- SQLAlchemy default= columns are None on a freshly constructed Python object until after a flush() + refresh(); always set constructor defaults explicitly when domain code runs immediately after creation