"Centralizing billing across 5 products triggered a 403 nobody saw coming"

#multitenant #saas #fastapi #billing

We flipped USE_BCP=true on Red at 14:02. The first 403 hit Sentry at 14:06. By 14:11 the pattern was clear: any user who tried to do something that touched org-level credit (granting a teammate access, viewing the org credit balance, kicking off a fix run under an org-scoped project) got a 403 back from the Red API, which had received a 403 from BCP, which had received a "not a member" from Auth.

Staging didn't catch it. I want to be honest about that part before anything else. Staging had two users in one org, both of whom I had provisioned through the Auth admin path months ago, so their org memberships existed in Auth's org_members table by accident of history. Every code path I exercised in staging happened to read from a row that was already there. The bug only fires when a user accepts an org invitation on the product side after the cutover, and we had no synthetic flow for that in staging. Lesson noted, expensive way to learn it.

This post is about what actually broke, why the design wasn't wrong (the implementation was incomplete), and the three branches I considered for where org-membership authority should live before settling on the one that produced the bug.

Phase H: why centralize billing now

Codens is five products plus two platform services. Red does auto-fix, Blue does QA, Green does PRDs, Yellow is the engineering activity ledger, Purple is the orchestration layer. Auth is the identity service. BCP, the Billing Control Plane, is the newest piece and the subject of this story.

Until last quarter, each product calculated its own credit consumption. That was fine when Red was the only product taking money. It became untenable around the time Green went into beta, because we had three different rounding rules, two slightly different definitions of "what counts as a billable run," and a support ticket pattern that boiled down to "my org's credit balance on Red doesn't match my org's credit balance on Blue and you charged me twice." Phase H of the architecture roadmap pulls all of that into BCP. Every product reads its credit policy from BCP, posts consumption events to BCP, and asks BCP "can this user/org afford this operation?" before starting work.

The cutover is gated behind two env vars per product:

USE_BCP=true
BCP_API_URL=https://api.billing.codens.ai
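A minimal sketch of how a product might read those two variables at startup, assuming it wants to fail fast when the flag is on but the URL is missing. The names BillingSettings and load_billing_settings are hypothetical, not taken from the Codens codebase.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BillingSettings:
    use_bcp: bool
    bcp_api_url: Optional[str]


def load_billing_settings(env: Mapping[str, str] = os.environ) -> BillingSettings:
    """Read the two cutover variables (hypothetical helper)."""
    # Anything other than the literal "true" (case-insensitive) keeps local credit math.
    use_bcp = env.get("USE_BCP", "false").strip().lower() == "true"
    url = env.get("BCP_API_URL") or None
    if use_bcp and url is None:
        # Fail at startup instead of on the first billable request.
        raise RuntimeError("USE_BCP=true requires BCP_API_URL to be set")
    return BillingSettings(use_bcp=use_bcp, bcp_api_url=url)
```

Passing the environment in as a mapping keeps the flag easy to exercise in tests without touching the real process environment.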

I cut one product at a time, starting with Red because it has the highest traffic and the most mature billing surface. Red PR #266 was the actual flip. Blue PR #233 and Green PR #411 followed once Red had been stable for a week. Yellow and Purple are scheduled for next quarter, both still on local credit math.

The cutover order matters for this story because the 403 only manifests on org-scoped operations. Red individual-account billing kept working perfectly. So did Blue and Green individual accounts. It was specifically the org-shared credit pool path that exploded, and only for users who had joined their org through the product-side invitation flow rather than through Auth's admin console.

Tracing the 403

The first instinct was "BCP is misconfigured." It wasn't. BCP logs showed clean inbound requests with the right org_id, the right user_id, the right requested operation. BCP then made an internal call to Auth: "is user X a member of org Y?" Auth returned false. BCP returned 403. Red returned 403. User saw 403.

The Auth log line was the clarifying one:

GET /internal/orgs/{org_id}/members/{user_id} -> 404

So Auth wasn't broken either. Auth was correctly reporting that user X was not a member of org Y, as far as Auth knew. I pulled the user out of the database. The user existed in Auth's users table. The org existed in Auth's organizations table. The link row in Auth's org_members was missing.

I went over to Red's database. The link row was there. Red had a row that said user X belonged to org Y, with the role and joined-at timestamp from the day the user accepted the invitation. Red had been authoritative for this relationship the entire time.

CDTSK-1392 captured the root cause. Auth Codens is supposed to be master of organizations and memberships, but each product had grown its own organizations and org_members tables back when each product was a standalone service. Invitation acceptance was handled locally by each product. The row landed in the product's database, and nobody told Auth. Pre-BCP, this didn't matter, because the product was the one authorizing org-scoped operations against its own tables. Post-BCP, BCP asks Auth, Auth doesn't know, 403.

The bug is not in the centralization. The bug is that we shipped centralization assuming a sync that didn't exist.
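The chain traced in this section can be sketched roughly as follows. This is an illustration under assumptions, not BCP's actual code: the auth_get transport, the AuthLookupError type, and the function names are hypothetical. Only the Auth path and the 404-becomes-403 behavior come from the logs above.

```python
from typing import Callable

# Hypothetical transport: takes a request path and returns Auth's HTTP status.
AuthGet = Callable[[str], int]


class AuthLookupError(Exception):
    """Auth answered with something other than 200 or 404."""


def is_org_member(auth_get: AuthGet, org_id: str, user_id: str) -> bool:
    status = auth_get(f"/internal/orgs/{org_id}/members/{user_id}")
    if status == 200:
        return True
    if status == 404:
        # Auth has no link row. It cannot tell "never a member" apart from
        # "a product recorded the membership and never reported it".
        return False
    # Assumption: other statuses are treated as an Auth failure, not a denial.
    raise AuthLookupError(f"unexpected status {status} from Auth")


def authorize_org_operation(auth_get: AuthGet, org_id: str, user_id: str) -> int:
    """Status the billing service would return for an org-scoped operation."""
    return 200 if is_org_member(auth_get, org_id, user_id) else 403
```

Every layer in this sketch behaves correctly given what it knows, which is why the logs looked clean at each hop.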

Three branches for where authority lives

Before writing the sync, I had to decide whether the sync was even the right answer. There are three reasonable places to put authority over org membership in a multi-product setup like ours.

Authority in the auth service. Auth is the master record. Every product holds a local cache (or a foreign-key shadow) and reflects changes back to Auth as they happen. This is what we have. It's the most conventional choice. The downside is the one we just discovered: every product-side write path that affects membership has to remember to call Auth, and forgetting is silent until something else (like BCP) starts depending on Auth being correct.

Authority in billing itself. BCP owns the org and member tables. Every product reads from BCP. This has the appeal of "the system that needs to know the truth owns the truth." It also means every product becomes hard-dependent on BCP being up to render a user's basic org context, which is a much bigger blast radius than billing being temporarily degraded. I didn't want every Red dashboard render to fail because BCP was deploying.

Authority distributed across products. Each product remains the source of truth for memberships that originate in that product. BCP, when asked to authorize an org-scoped operation, routes the membership question to whichever product owns the org. This sounds clever for two products. With five products, the routing table is a permanent piece of infrastructure that has to be updated every time a new product launches, and the question "who owns this org" is itself a piece of state that has to live somewhere central. You've reinvented the auth service, badly.

I chose branch one. The 403 wasn't evidence of a wrong choice. It was evidence that I'd shipped half of a choice. The half I shipped (BCP queries Auth) was correct. The half I hadn't shipped (products tell Auth about new memberships) was the gap.

The sync endpoint

The fix has two halves. Auth needs an endpoint that products can call. Products need to call it at the right moments.

On the Auth side, I added POST /api/v1/internal/organizations/{org_id}/members:upsert. The verb is upsert deliberately. The endpoint is idempotent and the products call it both on invitation acceptance and on role changes, so the handler has to be willing to create or update without the caller knowing which case applies. The response status differentiates: 201 if a new membership row was created, 200 if an existing row was updated.
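To make the create-versus-update contract concrete, here is a small in-memory sketch. The created flag matches the result.created attribute used by the handler later in this post. InMemoryOrgMembers, UpsertResult, and status_for are hypothetical stand-ins for the real repository and use case.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class UpsertResult:
    org_id: str
    user_id: str
    role: str
    created: bool  # True only when no row existed before this call


class InMemoryOrgMembers:
    """Hypothetical stand-in for Auth's org_members table."""

    def __init__(self) -> None:
        self._roles: Dict[Tuple[str, str], str] = {}

    def upsert(self, org_id: str, user_id: str, role: str) -> UpsertResult:
        key = (org_id, user_id)
        created = key not in self._roles
        # Same write for both cases; repeating a call leaves the same final state.
        self._roles[key] = role
        return UpsertResult(org_id, user_id, role, created)


def status_for(result: UpsertResult) -> int:
    """201 for a new membership row, 200 for an update to an existing one."""
    return 201 if result.created else 200
```

A caller can retry the same request safely: the first call yields 201, every repeat yields 200, and the stored role is identical either way.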

Getting FastAPI to actually return 201 vs 200 from the same handler was the part that almost shipped broken. PR #124 was the fix. The original handler looked like this:

@router.post(
    "/organizations/{org_id}/members:upsert",
    response_model=UpsertOrgMemberResponse,
)
async def upsert_org_member(
    org_id: UUID,
    payload: UpsertOrgMemberRequest,
    use_case: UpsertOrgMemberUseCase = Depends(get_upsert_use_case),
) -> UpsertOrgMemberResponse:
    result = await use_case.execute(org_id, payload)
    return UpsertOrgMemberResponse.from_domain(result)

When the handler returns a Pydantic model, FastAPI builds the response itself and applies the route's status code: 200 by default, or whatever you set with status_code= on the decorator (201, say). Either way, a handler that just returns the model can't branch, short of injecting a Response parameter and mutating its status_code. You get one status for both the create and the update case, which silently broke the response contract for any caller that wanted to distinguish the two.

The fix is to return JSONResponse directly so the handler controls the status:

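from uuid import UUID

from fastapi import Depends
from fastapi.responses import JSONResponse

@router.post("/organizations/{org_id}/members:upsert")
async def upsert_org_member(
    org_id: UUID,
    payload: UpsertOrgMemberRequest,
    use_case: UpsertOrgMemberUseCase = Depends(get_upsert_use_case),
) -> JSONResponse:
    result = await use_case.execute(org_id, payload)
    # 201 when a new membership row was created, 200 when an existing one was updated.
    status = 201 if result.created else 200
    # Returning JSONResponse bypasses response_model handling, so serialize explicitly.
    return JSONResponse(
        status_code=status,
        content=UpsertOrgMemberResponse.from_domain(result).model_dump(mode="json"),
    )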

You lose automatic OpenAPI response model inference, which is a real cost. You get correct semantics, which is a bigger gain. I document the response shape with responses={200: ..., 201: ...} on the decorator to keep the OpenAPI spec honest.
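As a rough illustration of that responses= mapping, the dictionary below uses the shape FastAPI accepts for additional responses (status code keys, with "description" and optionally "model" in each entry). The description strings and the UPSERT_RESPONSES name are my own placeholders, not the text from PR #124.

```python
from typing import Any, Dict

# Hypothetical documentation mapping for the upsert route.
UPSERT_RESPONSES: Dict[int, Dict[str, Any]] = {
    200: {"description": "An existing membership row was updated."},
    201: {"description": "A new membership row was created."},
}

# Usage, assuming the router and handler shown above:
#
#   @router.post(
#       "/organizations/{org_id}/members:upsert",
#       responses=UPSERT_RESPONSES,
#   )
#
# Adding "model": UpsertOrgMemberResponse to each entry would also put the
# response schema back into the generated OpenAPI document.
```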

On the product side, Red PR #264 added the client call at the two moments membership state changes: invitation acceptance and role update.

async def accept_invitation(self, invitation_id: UUID, user_id: UUID) -> None:
    invitation = await self.invitations.get(invitation_id)
    # Local write first: the product's own org_members row.
    await self.org_members.create(
        org_id=invitation.org_id,
        user_id=user_id,
        role=invitation.role,
    )
    # Then tell Auth about the membership. This call is outside the local
    # transaction; the nightly reconciliation job repairs any drift.
    await self.auth_client.upsert_org_member(
        org_id=invitation.org_id,
        user_id=user_id,
        role=invitation.role,
    )
    await self.invitations.mark_accepted(invitation_id)

The Auth call is not in a transaction with the local write, which is a deliberate choice and a place where I might be wrong. If the local write succeeds and the Auth call fails, we have drift. The current mitigation is a nightly reconciliation job that compares product org_members to Auth org_members and re-upserts anything missing. I'd rather drift and reconcile than block invitation acceptance on Auth being reachable.
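The reconciliation idea can be sketched in a few lines. This is a simplified illustration, not the actual nightly job: it assumes both sides can be loaded into memory, and the reconcile function and its upsert callback are hypothetical names.

```python
from typing import Callable, Dict, List, Set, Tuple

MemberKey = Tuple[str, str]  # (org_id, user_id)
Upsert = Callable[[str, str, str], None]  # (org_id, user_id, role)


def reconcile(
    product_rows: Dict[MemberKey, str],
    auth_keys: Set[MemberKey],
    upsert: Upsert,
) -> List[MemberKey]:
    """Re-upsert memberships the product has and Auth is missing."""
    # One direction only: rows that exist in Auth but not in this product
    # are left alone, since another product may have created them.
    missing = sorted(key for key in product_rows if key not in auth_keys)
    for org_id, user_id in missing:
        upsert(org_id, user_id, product_rows[(org_id, user_id)])
    return missing
```

Because the Auth endpoint is an upsert, running this twice is harmless: the second pass finds nothing missing. The sketch only repairs missing rows. Detecting a role that differs between the two sides would need the Auth roles as well as the keys.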

Blue and Green shipped matching calls in their respective PRs.

Side cleanup: while I was in BCP I noticed that the bonus-credit endpoint silently lost the grant type when the field name on the wire didn't match what the receiver expected. The sender was using bonus_type, the receiver was reading grant_type, and because the model was configured with extra="ignore", Pydantic accepted the payload and quietly inserted a row with the default grant type. PR #265 fixed the Red caller and PR #231 fixed Blue. The lesson there is not to use extra="ignore" on internal wire models, but that's another post.
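For a feel of the strict alternative, here is a stdlib-only sketch that uses a dataclass instead of Pydantic. In Pydantic v2 the equivalent switch is model_config = ConfigDict(extra="forbid"). The BonusGrant fields other than grant_type, and the "standard" default, are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class BonusGrant:
    org_id: str                   # hypothetical field
    amount: int                   # hypothetical field
    grant_type: str = "standard"  # hypothetical default


def parse_strict(payload: Dict[str, Any]) -> BonusGrant:
    """Build a BonusGrant, rejecting any key the model does not declare."""
    # Dataclass constructors raise TypeError on unknown keyword arguments,
    # so a misnamed field fails loudly instead of falling back to the default.
    return BonusGrant(**payload)
```

With this shape, a payload carrying bonus_type raises at the boundary, and the sender finds out on the first request instead of after a row with the wrong grant type has been written.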

Lessons

The biggest one is that staging only catches the bugs you have data for. The org-membership row was present in staging by historical accident, so the path that read it worked. I now provision a fresh, end-to-end test user (sign up, accept invitation, perform org-scoped action) as part of pre-cutover validation, scripted, not "remember to do it."

Cutting one product at a time was the only thing that kept the blast radius survivable. If I had flipped all three on the same morning the triage would have taken twice as long, because every signal would have been duplicated three ways. The order Red, then Blue, then Green wasn't optimized for anything clever. It was just the order in which I trusted the metrics.

Naming the endpoint :upsert instead of overloading POST .../members mattered more than I expected. When the FastAPI status code issue came up, the conversation was "the upsert endpoint should return different codes for create vs update," which is a one-sentence problem statement. If the endpoint had been POST /members I'd have spent another hour arguing about whether 200 or 201 was correct in the abstract.

Wrap

The hardest part of centralizing anything across a product family is not the new service. The new service is straightforward: you write it, you deploy it, you wire up clients. The hard part is figuring out who is allowed to be the source of truth for the relationships the new service depends on, and then making every existing write path honor that choice. We chose Auth as the master for org membership, which I still think is right. We just hadn't enforced it everywhere it mattered, and BCP was the first dependent that actually cared.

If you want to see how the rest of the harness fits together, the English landing page is at https://www.codens.ai/en/. Yellow and Purple come onto BCP next quarter. I'll write that one up too, hopefully without the same shape of bug.