Joshua Gutierrez

Posted on Jun 5

Hardening Two Multi Tenant SaaS APIs

#security #saas #python #fastapi

What We Found, What We Fixed, and What Changed

Security hardening is not glamorous work.

Most of it is careful reading, boring verification, uncomfortable edge cases, and refusing to trust assumptions that have quietly become part of the system.

Recently, I completed a remediation pass across two multi tenant SaaS products: Site2CRM and Made4Founders. Both products had grown into real platforms, with authenticated dashboards, public webhooks, billing flows, CRM integrations, OAuth style connections, file uploads, background jobs, and customer owned data.

That kind of product has a wide attack surface.

The work started with security reports and ended with two hardened branches, 24 total commits, more than 140 new regression tests, database migrations, centralized security utilities, audit logging, startup gates, and CI checks that now block the same classes of mistakes from coming back.

This article is a breakdown of what we found, how we fixed it, and the engineering lessons that came out of the process.

The goal was not just to close findings

A security report usually arrives as a list of issues.

That can make the work feel transactional.

Fix this endpoint. Add this check. Reject this payload. Patch this route.

That is necessary, but it is not enough.

The real question is:

What class of mistake allowed this bug to exist?

That question changed the remediation strategy.

For every finding, the goal became:

Fix the specific issue
Add a regression test for that finding
Centralize the security pattern where possible
Add a guard so the same mistake is harder to reintroduce
Document any deployment or data migration steps clearly

That approach turned the work from a cleanup pass into a hardening pass.

Finding 1: Tenant data must always be scoped by tenant

In a multi tenant application, the most important security rule is simple:

A customer should only be able to access data that belongs to their organization.

That rule sounds obvious, but enforcing it consistently is where systems get tested.

A risky query often looks harmless:

lead = db.query(Lead).filter(Lead.id == lead_id).first()

The problem is that lead_id alone is not a tenant boundary.

In a multi tenant system, the query needs to prove both identity and ownership:

lead = (
    db.query(Lead)
    .filter(
        Lead.id == lead_id,
        Lead.organization_id == current_user.organization_id,
    )
    .first()
)

The same principle applies to updates, deletes, dashboard feeds, background jobs, calendar items, configuration records, and integration data.

In one case, a calendar feed was using a join path that could leak data across tenants. In another, a vault configuration model had rows that were not safely attached to an organization. Both issues were fixed by making organization ownership explicit, adding migrations where needed, and creating regression coverage around the exact failure modes.
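One way to make the scoped query the default is to put it behind a small helper, so routes never write the bare ID lookup themselves. The sketch below is illustrative only: it uses the standard library sqlite3 module instead of the ORM shown above so it runs on its own, and the leads table and get_lead_for_org name are assumptions, not code from either product.

```python
import sqlite3


def get_lead_for_org(conn, lead_id, organization_id):
    """Return the lead row only if it belongs to the caller's organization."""
    return conn.execute(
        "SELECT id, organization_id, name FROM leads "
        "WHERE id = ? AND organization_id = ?",
        (lead_id, organization_id),
    ).fetchone()  # None when the lead is missing or owned by another tenant


def build_demo_db():
    """Hypothetical two-tenant fixture used only to exercise the helper."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE leads ("
        "id INTEGER PRIMARY KEY, organization_id INTEGER NOT NULL, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO leads VALUES (?, ?, ?)",
        [(1, 100, "Acme lead"), (2, 200, "Globex lead")],
    )
    return conn
```

A lead that exists but belongs to another organization looks the same as a lead that does not exist. The caller cannot tell the two cases apart, which is usually the desired behavior.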

The broader lesson was this:

Tenant isolation should not depend on developer memory.

It should be a pattern the codebase makes easy and the test suite actively defends.

Finding 2: Organization admins are not platform admins

One of the clearest authorization lessons came from a global resource protected by a tenant level role check.

Most SaaS products have organization roles like:

OWNER
ADMIN
USER

Those roles are useful inside a customer account. An organization owner should be able to manage users, settings, forms, integrations, and billing details for that organization.

But tenant authority is not platform authority.

A customer OWNER should not be able to manage global platform resources, issue marketplace codes, access internal tools, or make changes that affect other organizations.

The risky pattern was conceptually simple:

if current_user.role not in ["OWNER", "ADMIN"]:
    raise HTTPException(status_code=403)

That check answers the wrong question.

It asks:

Is this user powerful inside their own organization?

For global platform actions, the application needs to ask:

Is this user trusted to operate the platform itself?

The fix was to introduce a real platform staff boundary and move global operations behind that boundary.

require_platform_staff(current_user)

This was not just a one line fix. Tests were added to prove that tenant owners and tenant admins could not access platform scoped functionality. A static guard was also added so future global table routes cannot quietly be protected only by tenant roles.
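The article does not show the body of require_platform_staff, so the following is only a guess at its shape. It assumes a separate is_platform_staff flag on the user record and uses a local Forbidden exception as a stand-in for an HTTP 403 response. Both are hypothetical.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Stand-in for an HTTP 403 response in this sketch."""


@dataclass(frozen=True)
class User:
    id: int
    organization_id: int
    role: str                        # tenant role: OWNER, ADMIN, or USER
    is_platform_staff: bool = False  # hypothetical flag granted only by the operator


def require_platform_staff(user: User) -> User:
    # The tenant role is deliberately ignored: being OWNER of one
    # organization says nothing about authority over the platform.
    if not user.is_platform_staff:
        raise Forbidden("platform staff required")
    return user
```

The design point is that the check never reads user.role. A tenant OWNER and a tenant USER are treated identically unless the platform flag is set.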

The lesson:

Never reuse customer roles for platform administration.

They represent different trust models.

Finding 3: Public webhooks are public, not trusted

Public webhook endpoints need to be reachable by external providers.

That does not mean they should trust incoming requests.

Several webhook surfaces needed stronger sender verification. The risk was not that the endpoints existed. The risk was that state changing payloads could be processed without proving they came from the provider they claimed to represent.

The corrected model is:

Read the raw request body
Verify the provider signature
Reject missing or invalid signatures
Parse the payload only after verification
Apply idempotency where relevant
Mutate state
Return success

This mattered for provider events such as:

Inbound messaging events
Email bounce and complaint notifications
Billing lifecycle events

Billing webhooks received extra attention because they can change plan state. A forged billing event can potentially activate, cancel, downgrade, or otherwise alter customer access.

The fix was to make verification mandatory. If the provider webhook secret or webhook ID is missing in production, the app should fail closed. If the signature is invalid, the route should reject the request instead of logging the error and continuing.
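The verification order above can be sketched with a generic HMAC-SHA256 scheme. This is an assumption for illustration: real providers define their own header names, signing formats, and timestamp rules, and their documented scheme should be followed instead. The handle_webhook name, the hex-encoded signature, and the in-memory set of seen event IDs are all hypothetical.

```python
import hashlib
import hmac
import json


class WebhookRejected(Exception):
    """The request could not be tied to the provider and must not be processed."""


def handle_webhook(raw_body: bytes, signature_header, secret, seen_event_ids: set) -> str:
    if not secret:
        # Fail closed: a missing secret is a deployment error, not a reason to skip checks.
        raise RuntimeError("webhook secret is not configured")
    if not signature_header:
        raise WebhookRejected("missing signature")

    # Sign the raw bytes exactly as received, before any parsing.
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature_header.encode()):
        raise WebhookRejected("invalid signature")

    event = json.loads(raw_body)  # parsed only after verification
    if event["id"] in seen_event_ids:
        return "duplicate"        # idempotency: a replayed event changes nothing
    seen_event_ids.add(event["id"])
    # State changes would happen here.
    return "processed"
```

The comparison uses hmac.compare_digest rather than == so the check does not leak timing information about how much of the signature matched.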

The lesson:

A webhook endpoint can be public without being unauthenticated.

Reachability and trust are separate things.

Finding 4: External install identifiers are not authentication

One integration trusted a client supplied install identifier as if it were a credential.

That is dangerous.

Install IDs, instance IDs, account IDs, and external resource IDs are identifiers. They are not proof that the caller owns the installation.

The hardened approach was to require a signed provider token, verify it server side, and only then extract the canonical installation identity from the verified payload.

The corrected flow became:

Receive signed provider instance token
Verify the token using the provider app secret
Extract the canonical instance ID from the verified payload
Resolve the install record from that verified identity
Refuse cross organization rebinding unless ownership is proven

This fixed both the authentication issue and the related install rebinding risk.
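The flow can be sketched as follows. The token layout here (a base64url signature, a dot, then a base64url JSON payload signed with HMAC-SHA256) is an assumption chosen for illustration, as are the instanceId field and every function name. The real provider's token format is not described in this article and should be taken from its documentation.

```python
import base64
import hashlib
import hmac
import json


class InvalidInstanceToken(Exception):
    """The token is malformed or was not signed with the app secret."""


class RebindingRefused(Exception):
    """The install is already bound to a different organization."""


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_instance(payload: dict, app_secret: str) -> str:
    """Plays the provider's role so the sketch can be exercised locally."""
    body = _b64url_encode(json.dumps(payload).encode())
    signature = hmac.new(app_secret.encode(), body.encode(), hashlib.sha256).digest()
    return f"{_b64url_encode(signature)}.{body}"


def verified_instance_id(token: str, app_secret: str) -> str:
    try:
        signature_part, body = token.split(".", 1)
        supplied = _b64url_decode(signature_part)
    except ValueError as exc:
        raise InvalidInstanceToken("malformed token") from exc
    expected = hmac.new(app_secret.encode(), body.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(supplied, expected):
        raise InvalidInstanceToken("bad signature")
    # Only a verified payload is trusted enough to parse.
    return json.loads(_b64url_decode(body))["instanceId"]


def bind_install(installs: dict, instance_id: str, organization_id: int) -> int:
    """Bind an install to an organization, refusing silent cross-tenant rebinding."""
    owner = installs.setdefault(instance_id, organization_id)
    if owner != organization_id:
        raise RebindingRefused(f"{instance_id} already belongs to another organization")
    return owner
```

The instance ID never comes from a query parameter or request body field. It is read out of the payload only after the signature has been checked.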

The lesson:

Do not authenticate integrations with bare IDs.

Use provider signed assertions, server side verification, and explicit ownership rules.

Finding 5: User controlled URLs need SSRF protection

Several features across the two products involved server side requests to external URLs.

That pattern is common:

Outbound webhooks
RSS imports
Social integrations
Callback URLs
Remote media fetches
Health checks

The security risk is SSRF, or Server-Side Request Forgery.

SSRF occurs when a user can influence a URL that the server requests. Without guardrails, an attacker may be able to point the server at internal infrastructure:

localhost
private network addresses
cloud metadata services
link local addresses
internal admin panels

The fix was to centralize outbound fetching through a safe fetch utility.

That utility blocks dangerous destinations, applies timeouts, restricts redirects, limits response size, and prevents internal response bodies from being reflected back to users.

The important part was centralization.

Instead of asking every route to remember SSRF rules, risky outbound requests now go through one safer path.
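The destination check at the core of such a utility might look like the sketch below. It covers only the address validation step. Timeouts, redirect limits, and response size caps would sit in the fetch code around it. The function names and the injectable resolver argument are assumptions, not the utility from either codebase.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


class UnsafeURL(ValueError):
    """The URL points somewhere the server should not connect to."""


def _is_public(ip_text: str) -> bool:
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # unparseable addresses are treated as unsafe
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def _resolve(host: str, port: int) -> list:
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def validate_outbound_url(url: str, resolver=_resolve) -> list:
    """Return the resolved addresses, or raise UnsafeURL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise UnsafeURL("only http and https are allowed")
    if not parts.hostname:
        raise UnsafeURL("URL has no host")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    addresses = resolver(parts.hostname, port)
    if not addresses:
        raise UnsafeURL("host did not resolve")
    # Every resolved address must be public, not just the first one.
    for address in addresses:
        if not _is_public(address):
            raise UnsafeURL(f"{parts.hostname} resolves to a blocked address")
    return addresses
```

Validating a hostname and then letting the HTTP client resolve it again leaves a gap, because DNS can answer differently the second time. A complete utility would connect to one of the addresses returned here rather than to the hostname.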

The lesson:

Any feature that lets a user provide a URL should be treated as a network boundary.

Finding 6: Cryptography should not have production fallbacks

Development conveniences can become production vulnerabilities if they are allowed to survive past local use.

The hardening pass removed weak secret fallbacks and hardcoded encryption key fallbacks. Token encryption was versioned, and a re encryption path was added so legacy stored values could be migrated safely.

The improved model included:

No hardcoded production fallback keys
Explicit application encryption key required
Versioned encrypted token format
Migration script for legacy tokens
Production flag to reject legacy plaintext tokens after migration
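The article does not specify the stored token format, so the sketch below invents one for illustration: a prefix of the form enc:v<version>: followed by the ciphertext. It shows only the key loading and format handling around the cipher, not the encryption itself. The APP_ENCRYPTION_KEY variable name, the prefix, and the function names are all assumptions.

```python
TOKEN_PREFIX = "enc:"  # hypothetical stored format: enc:v<version>:<ciphertext>


class LegacyTokenRejected(Exception):
    """A plaintext token was found after the migration flag was enabled."""


def load_encryption_key(env: dict) -> bytes:
    key = env.get("APP_ENCRYPTION_KEY", "")
    if not key:
        # Loud failure: never substitute a hardcoded or generated default.
        raise RuntimeError("APP_ENCRYPTION_KEY is required")
    return key.encode()


def parse_stored_token(stored: str, reject_legacy_plaintext: bool):
    """Return (version, body). Version 0 means a legacy plaintext value."""
    if stored.startswith(TOKEN_PREFIX):
        version_part, sep, ciphertext = stored[len(TOKEN_PREFIX):].partition(":")
        if not sep or not version_part.startswith("v") or not version_part[1:].isdigit():
            raise ValueError("malformed encrypted token")
        return int(version_part[1:]), ciphertext
    if reject_legacy_plaintext:
        raise LegacyTokenRejected("plaintext token found after migration")
    return 0, stored
```

The version number is what makes re-encryption possible: a migration script can find every value below the current version, decrypt it with the old scheme, and write it back with the new one. Once that is done, the reject flag turns any remaining plaintext value into an error instead of a silent pass.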

The lesson:

Cryptographic failure should be loud.

If the key is missing in production, the app should not quietly invent one.

Finding 7: File uploads should stay boring

File uploads are easy to underestimate.

For brand logos and media uploads, the safest approach was to restrict formats and improve storage behavior.

The hardening changes included:

Rejecting risky file types where they were unnecessary
Using high entropy media names
Enforcing body size limits
Handling chunked uploads safely
Avoiding active content in logo uploads

One practical example was SVG.

SVG can be useful, but it can also contain active content. If served from the wrong origin or with weak headers, it can become a stored XSS risk.

XSS means Cross-Site Scripting. It occurs when attacker controlled content executes JavaScript in a trusted browser context.

For a logo upload feature, SVG was not worth the additional risk.
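A minimal allowlist for logo uploads could look like this. It identifies the file by its leading bytes rather than by the client supplied filename or content type, and it generates the stored name itself. The size limit, the set of accepted formats, and the function names are assumptions for illustration.

```python
import secrets

MAX_LOGO_BYTES = 2 * 1024 * 1024  # hypothetical limit


def sniff_image_type(data: bytes):
    """Identify a raster image by its magic bytes, or return None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None  # SVG, HTML, and everything else fall through to rejection


def stored_name_for_logo(data: bytes) -> str:
    if len(data) > MAX_LOGO_BYTES:
        raise ValueError("file too large")
    kind = sniff_image_type(data)
    if kind is None:
        raise ValueError("unsupported file type")
    # The client filename is never used; the stored name is random and
    # the extension comes from the detected type.
    return f"{secrets.token_urlsafe(24)}.{kind}"
```

Magic bytes only show what a file claims to be at its start. They are a cheap first filter, not proof that the rest of the file is a well formed image.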

The lesson:

If a product only needs images, do not accept formats that behave like documents or code.

Finding 8: Security relevant configuration should fail closed

Some risks were not about code paths. They were about missing configuration.

Examples included:

Missing webhook secrets
Missing encryption keys
Missing CAPTCHA secrets
Missing Redis for rate limiting
Weak application secret keys
Insecure production cookie settings

The fix was to add production startup gates.

In development, flexible configuration is helpful.

In production, missing security configuration should stop the app from starting.

A startup gate turns a hidden runtime weakness into an obvious deployment failure.
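A startup gate can be as small as one function called before the app begins serving requests. The sketch below is an assumption about shape only: the environment variable names, the minimum secret length, and the InsecureConfiguration exception are hypothetical and would need to match the real deployment.

```python
REQUIRED_IN_PRODUCTION = (
    "SECRET_KEY",
    "APP_ENCRYPTION_KEY",
    "WEBHOOK_SECRET",
    "CAPTCHA_SECRET",
    "REDIS_URL",
)
MIN_SECRET_LENGTH = 32  # hypothetical threshold for a "weak" secret


class InsecureConfiguration(RuntimeError):
    """Raised at boot so a misconfigured deploy never starts serving."""


def production_config_problems(env: dict) -> list:
    problems = [f"{name} is not set" for name in REQUIRED_IN_PRODUCTION if not env.get(name)]
    secret = env.get("SECRET_KEY", "")
    if secret and len(secret) < MIN_SECRET_LENGTH:
        problems.append("SECRET_KEY is too short")
    if env.get("COOKIE_SECURE", "").lower() != "true":
        problems.append("COOKIE_SECURE must be true")
    return problems


def enforce_startup_gate(env: dict) -> None:
    if env.get("APP_ENV") != "production":
        return  # development stays flexible
    problems = production_config_problems(env)
    if problems:
        raise InsecureConfiguration("refusing to start: " + "; ".join(problems))
```

Collecting every problem before raising means one failed deploy reports the full list, instead of revealing one missing variable per attempt.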

The lesson:

Failing to boot is better than booting insecurely.

Finding 9: Rate limiting should not silently degrade

Rate limiting is often treated as a nice to have, but for authentication and abuse prevention it is a security control.

If Redis is unavailable and the system silently falls back to per process memory, each worker keeps its own counter, so the effective limit grows with the number of workers.

For example, a limit of 10 attempts may effectively become 40 attempts across four workers.

The production behavior was hardened so that security sensitive rate limiting depends on a real shared backend.
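The arithmetic is easy to reproduce. The toy simulation below is not the limiter from either product. It models each worker as an independent in-memory counter, spreads requests across workers in round robin order, and counts how many attempts get through.

```python
class InMemoryLimiter:
    """Per process counter: the kind of fallback that weakens under multiple workers."""

    def __init__(self, limit: int):
        self.limit = limit
        self.counts = {}

    def allow(self, key: str) -> bool:
        used = self.counts.get(key, 0)
        if used >= self.limit:
            return False
        self.counts[key] = used + 1
        return True


def attempts_allowed(workers: list, key: str, tries: int) -> int:
    """Send requests round robin across workers and count how many pass."""
    return sum(workers[i % len(workers)].allow(key) for i in range(tries))


# Four workers with private counters versus four workers sharing one counter.
separate = [InMemoryLimiter(10) for _ in range(4)]
shared_counter = InMemoryLimiter(10)
shared = [shared_counter] * 4

print(attempts_allowed(separate, "login:alice", 100))  # 40
print(attempts_allowed(shared, "login:alice", 100))    # 10
```

The shared case stands in for a real shared backend such as Redis. The limit holds at 10 only because every worker consults the same counter.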

The lesson:

A degraded security control should be visible, not silent.

Finding 10: Regression tests are part of the fix

Every meaningful finding received a named regression test.

That mattered.

A test named after a security issue tells future maintainers why the behavior exists. It also prevents a fix from being accidentally removed during a refactor.

Examples of the test coverage included:

Tenant data cannot be accessed across organizations
Platform scoped routes require platform staff
Unsigned webhooks are rejected
Invalid webhook signatures are rejected
Unsafe webhook URLs are rejected
SVG logo uploads are rejected
Weak production secrets fail startup
Legacy plaintext tokens can be rejected after migration

The lesson:

If a security issue was important enough to fix, it is important enough to test.

Finding 11: Static guards catch what tests miss

Tests are excellent for specific behavior.

Static linting is better suited to broad architectural patterns.

Both repositories now have security lint guards for risky patterns:

Tenant owned queries without organization scope
Raw outbound requests using user controlled URLs
State changing public routes without auth or signature classification
Global platform routes protected only by tenant roles

The linter supports a baseline, which is important for mature codebases.

A baseline allows known accepted cases to remain documented while CI fails only on new violations. That keeps the guard practical instead of noisy.
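The article does not show the linter, so the following is a deliberately simple heuristic built on the standard library ast module, covering only the first pattern in the list. It flags any statement that calls .query() on a tenant owned model without mentioning organization_id anywhere in the same statement. The model names, the finding format, and both function names are assumptions.

```python
import ast

TENANT_MODELS = {"Lead", "Contact"}  # hypothetical list of tenant owned models


def unscoped_queries(source: str, filename: str = "<string>") -> list:
    """Return 'file:line:Model' for tenant queries with no organization scope."""
    findings = []
    for stmt in ast.walk(ast.parse(source, filename)):
        if not isinstance(stmt, (ast.Assign, ast.Expr, ast.Return)):
            continue
        models, scoped = set(), False
        for node in ast.walk(stmt):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "query"
                and node.args
                and isinstance(node.args[0], ast.Name)
                and node.args[0].id in TENANT_MODELS
            ):
                models.add(node.args[0].id)
            if isinstance(node, ast.Attribute) and node.attr == "organization_id":
                scoped = True
        if models and not scoped:
            findings.extend(f"{filename}:{stmt.lineno}:{m}" for m in sorted(models))
    return findings


def new_violations(findings: list, baseline: set) -> list:
    """CI fails only on findings that are not in the reviewed baseline."""
    return [f for f in findings if f not in baseline]
```

A heuristic like this will produce false positives and miss scoping applied in a separate statement, which is exactly why the baseline matters. Keying the baseline on line numbers, as done here for brevity, is brittle. A real guard would likely key on something more stable.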

The preferred workflow is:

Fix the violation
Add a clearly marked exception only when intentional
Update the baseline only when the exception is reviewed and accepted

The lesson:

The best time to add a guard is right after fixing the bug class.

That is when the pattern is fresh, the risk is understood, and the team knows what should never happen again.

Finding 12: Some fixes require operational discipline

Not every security fix ends with a commit.

Some work has to happen during deployment:

Coordinate breaking integration changes
Set required production secrets
Run database migrations
Inspect quarantined records
Re encrypt stored tokens
Enable flags that reject legacy formats
Scrub sensitive files from Git history
Run full test suites before release

These steps were intentionally left as human controlled actions because they affect production data, integration behavior, and deployment timing.

That is part of responsible hardening.

The code can be ready before production is ready.

The lesson:

A remediation branch is not deployable until the operational checklist is complete.

What changed by the end

Across both products, the hardening pass produced:

24 total commits
More than 140 new regression tests
Database migrations
Centralized webhook verification
Centralized SSRF safe fetch behavior
Fail closed production startup gates
Platform staff authorization
Audit logging for sensitive actions
Versioned token encryption
Tenant isolation checks
Security lint guards enforced in CI
Main branches left untouched for review

More importantly, the products now have a better security shape.

The fixes were not just scattered patches. They became reusable boundaries.

Tenant data access now has stronger conventions.

Platform operations now have a separate trust model.

Webhook verification now follows a consistent pattern.

Outbound URL fetching now has a safer path.

Production misconfiguration now fails early.

Dangerous patterns now have CI visibility.

Final takeaway

The most valuable security work is not only fixing what was found.

It is asking what the finding reveals about the system.

A good remediation process should answer:

What failed?
Where else could it fail?
What is the correct shared pattern?
How do we test the fix?
How do we prevent the class from returning?
What must be true before deployment?
