Multi-Tenant Architecture in ERP Systems: The Anatomy of Sharing

#multitenant #erp #architecture #dataisolation

When working on a manufacturing ERP, one of the fundamental architectural decisions we often face is the multi-tenant structure. This model becomes inevitable when we want to serve multiple clients or different internal departments from the same software and infrastructure. In my experience, designing this architecture required me to focus not only on technical details but also on organizational flows and business requirements.

In this post, I will describe the challenges I encountered and the solutions I implemented while bringing a multi-tenant ERP system to life, covering everything from data isolation to performance, security, and operational processes. My aim is to shed light on the anatomy of this complex structure through my practical experiences.

Data Isolation Models and Their Challenges

One of the most critical aspects of multi-tenant architectures is ensuring that data from different tenants does not mix, i.e., data isolation. I worked with three main models for this, and I found that each had its own advantages and disadvantages. When making my choice, I had to balance factors like cost, complexity, and performance.

The first model was using a separate database for each tenant. This is the method that provides the highest level of isolation. For example, when developing an internal platform for a bank, this model was preferred due to sensitive financial data. However, this approach significantly increases operational costs and management complexity, especially when you have more than 50 tenants. Database upgrades and schema changes had to be applied to each database individually, which meant weeks of work.

⚠️ Separate Database Fatigue

Using a separate database for each tenant might seem ideal for data isolation, but it can turn into an operational nightmare, especially in projects with a large number of tenants. A single schema change or index addition might require manual operations on hundreds of databases, increasing the risk of errors and lengthening deployment processes.
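
The per-database rollout described above is easier to reason about when it is scripted and produces a report of which tenants succeeded. The following is a minimal sketch under stated assumptions: in-memory SQLite connections stand in for real per-tenant databases, and the tenant names, the `MigrationReport` class, and the `migrate_all` function are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class MigrationReport:
    succeeded: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)


def migrate_all(tenant_dbs, statement):
    """Apply one DDL statement to every tenant database, one at a time.

    A failure on one tenant is recorded and does not stop the others, so the
    report shows exactly which databases still need attention.
    """
    report = MigrationReport()
    for tenant, conn in tenant_dbs.items():
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(statement)
            report.succeeded.append(tenant)
        except sqlite3.Error as exc:
            report.failed[tenant] = str(exc)
    return report


# Hypothetical tenants; in-memory SQLite stands in for real tenant databases.
tenant_dbs = {name: sqlite3.connect(":memory:") for name in ("acme", "globex", "initech")}
for name, conn in tenant_dbs.items():
    if name != "initech":  # simulate one tenant database whose schema has drifted
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")

report = migrate_all(tenant_dbs, "CREATE INDEX idx_orders_id ON orders (id)")
print("succeeded:", report.succeeded)
print("failed:", report.failed)
```

Collecting failures instead of aborting on the first error is a design choice: with dozens of databases, a partial rollout is the normal case, and the report becomes the to-do list.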

The second model involved using a separate schema for each tenant. In PostgreSQL, you can easily achieve this with the CREATE SCHEMA tenant_x; command. This approach optimizes resource utilization by running on a single database server while also preventing naming conflicts for tables. In a manufacturing ERP, we used this model for medium-sized tenants. However, there were complexities here too. Things got complicated when taking pg_dump backups or writing cross-tenant queries (e.g., "inventory report across all tenants"). I had to write dynamic SQL to join or search across schemas, which increased code complexity.
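
To give a feel for the dynamic SQL that cross-schema reporting requires, here is a small sketch that builds a single UNION ALL query over several tenant schemas. The `inventory` table, its `sku` and `quantity` columns, and both function names are assumptions made for illustration; schema names are validated strictly because they cannot be passed as bound parameters.

```python
import re

# Only simple lowercase identifiers are accepted as schema names.
_IDENT = re.compile(r"[a-z_][a-z0-9_]*")


def quote_ident(name):
    """Validate a schema name and return it double-quoted."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe schema name: {name!r}")
    return f'"{name}"'


def inventory_report_sql(schemas):
    """Build one UNION ALL query that reads the same table from every tenant schema."""
    if not schemas:
        raise ValueError("at least one schema is required")
    validated = [(name, quote_ident(name)) for name in schemas]
    parts = [
        f"SELECT '{name}' AS tenant, sku, quantity FROM {quoted}.inventory"
        for name, quoted in validated
    ]
    return "\nUNION ALL\n".join(parts)


print(inventory_report_sql(["tenant_a", "tenant_b"]))
```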

Finally, the most common model, and the one I used the most, was using a tenant_id column within a shared schema. By having a tenant_id field in every table, you filter the data. This was the most flexible and resource-efficient approach, but it required developers to correctly implement this tenant_id filter in every query. A small mistake could result in one tenant seeing another tenant's data. To reduce this risk, I tried implementing a global tenant_id filter in the ORM layer, ensuring that all queries automatically passed through this filter.

-- Data isolation with tenant_id column in a shared schema
SELECT * FROM orders
WHERE tenant_id = 'my_tenant_uuid' AND order_status = 'processing';

-- Separate schema model (PostgreSQL)
SET search_path TO tenant_x, public;
SELECT * FROM orders WHERE order_status = 'processing';

In my practice, the shared schema with the tenant_id column became the most practical solution. Database management was simple, and I was working on a single schema. However, developer discipline and a robust filtering mechanism in the ORM layer were essential. I also considered using PostgreSQL's row-level security (RLS) feature to enforce this discipline at the database level, but due to the added complexity it introduced initially, I focused more on filtering at the application layer.
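
The idea of a global tenant filter can be sketched without any particular ORM. The example below is a simplified stand-in, not the filter from the project described here: it uses SQLite and a context variable, and the names `current_tenant` and `TenantScopedTable` are hypothetical. The point is that callers never write the `tenant_id` condition themselves.

```python
import sqlite3
from contextvars import ContextVar

# Set once per request (for example in middleware); the name is an assumption.
current_tenant = ContextVar("current_tenant")


class TenantScopedTable:
    """Tiny stand-in for an ORM-level global filter on one table."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table  # trusted, hard-coded name; never user input

    def select(self, where="1=1", params=()):
        tenant = current_tenant.get()  # raises LookupError if no tenant is set
        sql = f"SELECT * FROM {self.table} WHERE tenant_id = ? AND ({where}) ORDER BY id"
        return self.conn.execute(sql, (tenant, *params)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, tenant_id TEXT, order_status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "tenant_a", "processing"), (2, "tenant_b", "processing"), (3, "tenant_a", "shipped")],
)
orders = TenantScopedTable(conn, "orders")

current_tenant.set("tenant_a")
rows = orders.select("order_status = ?", ("processing",))
print(rows)
```

Failing loudly when no tenant is set is deliberate: a query that runs without a tenant context is exactly the mistake this layer is meant to catch.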

Shared Infrastructure and Resource Sharing

At its core, multi-tenant architecture involves sharing the same physical or virtual infrastructure across multiple tenants. This means resource sharing across many layers, from server infrastructure to network components. In my experience, one of the biggest challenges brought by this sharing was the "noisy neighbor" syndrome. One tenant's heavy resource usage could negatively impact the performance of other tenants.

On application servers, especially when working with a FastAPI-based manufacturing ERP, it was important to identify which tenant each request came from and monitor that tenant's resource usage. I solved this by using a reverse proxy like Nginx to route requests to the correct application servers based on specific Host headers or URL paths. More importantly, when I ran the application servers in Docker containers, I assigned specific CPU and memory limits to each container using Linux cgroup limits. For example, to prevent a tenant's heavy reporting job from affecting the API response times of other tenants, I set a soft limit on the memory usage of the container running the reporting service using the memory.high value. This was critical for the overall stability of the system.

# Defining cgroup memory limit with Docker Compose
services:
  app-service:
    image: my-erp-app
    deploy:
      resources:
        limits:
          memory: 512M # Hard limit
        reservations:
          memory: 256M # Soft limit (memory reservation; not the same knob as cgroup memory.high)
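
Identifying the tenant from the Host header, as described above for the reverse proxy, can also be mirrored at the application level. This is a minimal sketch assuming a subdomain-per-tenant scheme; the base domain `erp.example.com` and the function name `tenant_from_host` are hypothetical.

```python
from typing import Optional


def tenant_from_host(host: str, base_domain: str = "erp.example.com") -> Optional[str]:
    """Return the tenant slug for 'slug.erp.example.com', or None if it does not match."""
    hostname = host.split(":", 1)[0].strip().lower().rstrip(".")  # drop port and trailing dot
    suffix = "." + base_domain
    if not hostname.endswith(suffix):
        return None
    slug = hostname[: -len(suffix)]
    # Accept exactly one DNS label made of ASCII letters, digits, and hyphens
    if not slug or "." in slug or not slug.isascii() or not slug.replace("-", "").isalnum():
        return None
    return slug


print(tenant_from_host("acme.erp.example.com:8443"))
print(tenant_from_host("unrelated.example.org"))
```

Returning `None` rather than a default tenant keeps unknown hosts from silently landing in someone else's data.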

On the database side, managing the PostgreSQL connection pool was very important. Instead of each tenant opening its own connection pool, we centrally managed database connections using a connection pooler like PgBouncer. This reduced the load on the database and ensured efficient use of connections. In my observation, an improperly configured connection pool could increase server response times by up to 30% during peak hours. In caching layers like Redis, I prevented cache data from mixing by using a separate prefix for each tenant. I used key patterns like tenant_id:user_session:token.
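
The key-prefix convention is simple enough to centralize in one helper so that no call site builds keys by hand. A minimal sketch, with the function name `tenant_key` being an assumption:

```python
def tenant_key(tenant_id: str, *parts: str) -> str:
    """Build 'tenant_id:part1:part2', rejecting empty segments and embedded separators."""
    segments = (tenant_id, *parts)
    for segment in segments:
        if not segment or ":" in segment:
            raise ValueError(f"invalid key segment: {segment!r}")
    return ":".join(segments)


print(tenant_key("tenant_a", "user_session", "token123"))
```

Rejecting `:` inside a segment prevents a crafted value from producing a key that collides with another tenant's namespace.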

At the network layer, I used VLAN segmentation to isolate traffic from different tenants. On a large e-commerce site, I had different departments (logistics, finance, sales) communicating through their own VLANs with separate security policies. In the multi-tenant ERP, I similarly applied egress controls by moving services that accessed sensitive data to a separate network segment. This was a reflection of Zero Trust Network Access (ZTNA) principles; it was essential not to trust any service by default and to verify every connection.

Security and Multi-Tenant Environments

Security in multi-tenant architectures is not just about data isolation; it also encompasses a broad range of areas such as authentication, authorization, and preventing potential data leaks. Preventing one tenant from accessing another tenant's data, either accidentally or maliciously, was one of my biggest security concerns.

On the authentication side, I used JWT (JSON Web Tokens) and OAuth2 patterns. Each tenant had its own user pool, and when users logged into the system, their tenant_id and permissions were included in the JWT. This token was sent with every API request and verified at the entry point of the application layer. This way, a request with an invalid or mismatched tenant_id was rejected before it reached any business logic or database query.

// Example JWT payload
{
  "sub": "user123",
  "name": "Mustafa Erbay",
  "tenant_id": "my_production_company_uuid",
  "roles": ["admin", "production_planner"],
  "exp": 1789123456
}
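
To make the verification step concrete, here is a standard-library-only sketch of checking an HS256 token's signature, expiry, and `tenant_id` claim. It is written this way only to stay self-contained; a production system would normally rely on a maintained JWT library. The function names, the demo secret, and the choice of HS256 are assumptions, not details from the system described here.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(data: str) -> bytes:
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Create a compact HS256 token (used here only to build a demo token)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(payload).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url_encode(sig)}"


def verify_for_tenant(token: str, secret: bytes, expected_tenant: str) -> dict:
    """Check signature, expiry, and tenant_id; return the claims or raise PermissionError."""
    try:
        header_b64, body_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(body_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError as exc:  # wrong part count, bad base64, or bad JSON
        raise PermissionError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise PermissionError("malformed token")
    if header.get("alg") != "HS256":
        raise PermissionError("unexpected algorithm")
    expected_sig = hmac.new(secret, f"{header_b64}.{body_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected_sig):
        raise PermissionError("bad signature")
    if claims.get("exp", 0) <= time.time():
        raise PermissionError("token expired")
    if claims.get("tenant_id") != expected_tenant:
        raise PermissionError("tenant mismatch")
    return claims


secret = b"demo-secret"  # hypothetical; real secrets come from configuration
token = sign_hs256(
    {"sub": "user123", "tenant_id": "tenant_a", "exp": int(time.time()) + 300}, secret
)
print(verify_for_tenant(token, secret, "tenant_a")["sub"])
```

The `expected_tenant` argument would typically come from the routing layer (for example the host name), so the token's claim is compared against something the client cannot choose freely.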

The authorization mechanism was built on top of the tenant_id filter. As I mentioned earlier, all database queries had to include the tenant_id filter. This was enforced at the ORM layer, but I also considered some database-level checks against potential bypass scenarios. PostgreSQL's Row-Level Security (RLS) feature offered an additional layer of security by allowing specific roles or users to see only rows belonging to their tenant_id. However, due to the potential impact on performance and the risk of complicating query planning, I primarily considered it as a fallback mechanism for critical tables.

ℹ️ SQL Injection and Tenant ID

In multi-tenant systems, SQL injection attacks carry the risk of not only data exposure but also inter-tenant data leakage. Using parameterized queries and enforcing the tenant_id filter at every layer of the application is vital against such attacks.
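
The difference between string-built and parameterized queries is easiest to see side by side. The sketch below uses SQLite as a stand-in database with a hypothetical `orders` table; the unsafe function is an anti-pattern shown only for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, tenant_id TEXT, customer TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "tenant_a", "alice"), (2, "tenant_b", "bob")],
)

malicious = "alice' OR '1'='1"


def find_unsafe(tenant_id, customer):
    # Anti-pattern: user input is spliced into the SQL text, so the OR clause
    # in the input overrides the tenant_id condition.
    sql = (
        f"SELECT id FROM orders WHERE tenant_id = '{tenant_id}' "
        f"AND customer = '{customer}' ORDER BY id"
    )
    return conn.execute(sql).fetchall()


def find_safe(tenant_id, customer):
    # Bound parameters are treated as data, never as SQL syntax.
    sql = "SELECT id FROM orders WHERE tenant_id = ? AND customer = ? ORDER BY id"
    return conn.execute(sql, (tenant_id, customer)).fetchall()


print("unsafe:", find_unsafe("tenant_a", malicious))  # leaks tenant_b's row as well
print("safe:  ", find_safe("tenant_a", malicious))    # no customer has that literal name
```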

DDoS mitigation and rate limiting were also important. One tenant overwhelming the API could affect the services of other tenants. I implemented IP-based or user token-based rate limiting on Nginx using the limit_req module. I also monitored failed login attempts with tools like fail2ban and temporarily blocked the relevant IP addresses. At a lower level, I hardened the shared server with kernel module blacklisting, for example by blacklisting potentially vulnerable modules like algif_aead.
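
Per-tenant rate limiting can also be illustrated at the application level. The sketch below is a token-bucket limiter keyed by tenant or token; it is an analogue for explanation, not a reimplementation of Nginx's limit_req (which uses a leaky-bucket approach). The class name and parameters are hypothetical, and the clock is injectable so the behavior can be tested deterministically.

```python
import time


class TenantRateLimiter:
    """Token bucket per key: 'rate' tokens per second, bursts up to 'burst'."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._buckets = {}  # key -> (tokens, last_seen)

    def allow(self, key):
        now = self.clock()
        tokens, last = self._buckets.get(key, (float(self.burst), now))
        # Refill according to elapsed time, capped at the burst size
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[key] = (tokens - 1.0, now)
            return True
        self._buckets[key] = (tokens, now)
        return False


fake_time = [0.0]
limiter = TenantRateLimiter(rate=1.0, burst=2, clock=lambda: fake_time[0])
decisions = [limiter.allow("tenant_a") for _ in range(3)]
print(decisions)
```

Because each key has its own bucket, one tenant exhausting its allowance does not consume capacity reserved for another.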

Finally, by using the audit subsystem (auditd), I logged all system calls and file accesses. This allowed me to perform retrospective analysis in case of any security breach and understand which user did what, when, and in the context of which tenant_id.

Performance Optimizations and Odd Corner Cases

Performance optimization in multi-tenant architectures is much more layered and complex than in single-tenant systems. Identifying and resolving bottlenecks caused by shared resources was a constant struggle. In a manufacturing ERP, I had to do fine-grained tuning to ensure users did not experience delays on real-time reports and operator screens.

On the database side, PostgreSQL indexing strategies played a vital role. Since every table had a tenant_id column, applying the correct indexes on this column was critical. I generally used composite indexes like (tenant_id, other_column). For example, in an orders table, indexes like (tenant_id, order_date) or (tenant_id, customer_id) significantly improved performance when querying a specific tenant's orders by date or customer. I regularly reviewed EXPLAIN ANALYZE outputs to ensure the query planner was using the correct index. On one occasion, I realized the index was not being used because I was trying to query with TEXT on a table where tenant_id was defined as VARCHAR; type mismatches can lead to such odd corner cases.
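
Checking that a composite index is actually picked up can be demonstrated in a few lines. Note the loud assumption: this sketch uses SQLite's `EXPLAIN QUERY PLAN` purely as a stand-in, since it runs without a server; in PostgreSQL the equivalent check is reading `EXPLAIN ANALYZE` output, and the planners behave differently. The table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT, order_date TEXT, customer_id INTEGER)"
)
# Composite index with tenant_id as the leading column
conn.execute("CREATE INDEX idx_orders_tenant_date ON orders (tenant_id, order_date)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM orders WHERE tenant_id = ? AND order_date >= ?",
    ("tenant_a", "2024-01-01"),
).fetchall()

# The last column of each plan row holds the human-readable detail
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

Putting `tenant_id` first means every tenant-filtered query can use the index prefix, which is why it is the usual leading column in this model.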

💡 Tenant ID Indexing Strategies

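If the tenant_id column is frequently queried in PostgreSQL, creating composite B-tree indexes that lead with this column can improve performance. If the distribution of tenant_id is skewed (a few tenants have a lot of data, others have little), partial indexes for the largest tenants or partitioning the table by tenant_id are worth exploring. BRIN indexes only help when rows are physically ordered by the indexed column, and GIN indexes target multi-valued data such as arrays or JSONB, so neither is a general replacement for B-tree here.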

I extensively used Redis caching, especially for real-time dashboards and operator screens requiring instant data analysis. To ensure each tenant's data remained isolated, I used tenant_id as a prefix in Redis keys, for example tenant_uuid:dashboard:open_orders. I also set Redis's maxmemory-policy to allkeys-lru so that when the cache filled up, the least recently used keys would be evicted first, keeping frequently accessed data in memory. On one occasion, an incorrect eviction policy choice on my part caused critical cache data to be evicted prematurely, leading to sudden performance drops. Getting such settings right is essential for system stability.
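
What "least recently used" eviction means in practice can be shown with a toy cache. This is only an illustration of the idea: Redis itself uses an approximated LRU based on sampling, and the `LRUCache` class below is hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Exact LRU cache with a fixed number of entries (illustration only)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry

    def keys(self):
        return list(self._data)


cache = LRUCache(max_entries=2)
cache.set("tenant_a:dashboard:open_orders", 17)
cache.set("tenant_b:dashboard:open_orders", 4)
cache.get("tenant_a:dashboard:open_orders")       # tenant_a's key is now the freshest
cache.set("tenant_c:dashboard:open_orders", 9)    # pushes out tenant_b's key
print(cache.keys())
```

The example also shows the multi-tenant side effect: with a shared keyspace, one tenant's activity decides which other tenant's entries get evicted.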

At the application layer, ORM traps (especially the N+1 query problem) can be even more devastating in multi-tenant systems. In a manufacturing ERP, I noticed that fetching the details of sub-components for a Bill of Materials (BOM) resulted in over 1000 queries for a single user. This caused the database to lock up when 10 users accessed the report concurrently. As a solution, I dramatically reduced the number of queries by correctly using the ORM's eager-loading mechanisms or writing manually optimized JOINs. These kinds of problems often don't surface in test environments and only appear under heavy load in production. In my experience, continuous performance monitoring tools were necessary to detect such regressions.
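
The N+1 pattern and its JOIN-based fix can be reproduced on a small scale by counting queries. The sketch below uses SQLite with hypothetical `bom` and `component` tables and a thin wrapper that counts executed statements; it stands in for what an ORM does behind the scenes.

```python
import sqlite3


class CountingDB:
    """Wraps a connection and counts how many queries are issued."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bom (id INTEGER PRIMARY KEY, tenant_id TEXT, name TEXT);
    CREATE TABLE component (id INTEGER PRIMARY KEY, bom_id INTEGER, part TEXT);
    """
)
conn.executemany(
    "INSERT INTO bom VALUES (?, 'tenant_a', ?)", [(i, f"bom-{i}") for i in range(1, 51)]
)
conn.executemany(
    "INSERT INTO component (bom_id, part) VALUES (?, ?)",
    [(i, f"part-{i}-{j}") for i in range(1, 51) for j in range(3)],
)


def load_n_plus_one(db, tenant_id):
    """One query for the parents, then one more query per parent."""
    result = {}
    for bom_id, name in db.query("SELECT id, name FROM bom WHERE tenant_id = ?", (tenant_id,)):
        parts = db.query("SELECT part FROM component WHERE bom_id = ? ORDER BY id", (bom_id,))
        result[name] = [part for (part,) in parts]
    return result


def load_joined(db, tenant_id):
    """A single JOIN returns parents and children together."""
    rows = db.query(
        "SELECT b.name, c.part FROM bom b JOIN component c ON c.bom_id = b.id "
        "WHERE b.tenant_id = ? ORDER BY b.id, c.id",
        (tenant_id,),
    )
    result = {}
    for name, part in rows:
        result.setdefault(name, []).append(part)
    return result


slow_db, fast_db = CountingDB(conn), CountingDB(conn)
slow = load_n_plus_one(slow_db, "tenant_a")
fast = load_joined(fast_db, "tenant_a")
print(slow_db.count, "queries vs", fast_db.count, "query; same result:", slow == fast)
```

One caveat of the inner JOIN used here: a BOM with no components would not appear in the joined result, so a LEFT JOIN may be the better fit when empty BOMs matter.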

Deployment and Operational Challenges

Deploying and operating a multi-tenant system is considerably more complex than running a single-tenant application, and the complexity grows with every tenant added. When I was working on a manufacturing ERP, I faced this complexity at every step, from rolling out new features to fixing bugs.

The reliability of CI/CD pipelines becomes extremely important here. Since every deployment has the potential to affect multiple tenants, the automation must work flawlessly. Using blue-green deployment strategies, I was able to test the new environment thoroughly before rolling out a new version to production and to quickly roll back to the old version if any issues arose. This was a lifesaver, especially when we had to perform an emergency rollback due to a disk full issue on April 28th; I intervened immediately when the WAL rotation alarm fired at 03:14. Without automatic rollback mechanisms, it would not have been possible to recover so quickly.

Schema migrations were also a significant headache. In a shared database schema, adding a new column to a table or modifying an existing one required extreme caution as it affected all tenants. It was essential to understand the locks taken by ALTER TABLE commands and how to add NOT NULL constraints safely. For example, when adding a new NOT NULL column, I first added it as nullable, populated the data, and only then added the NOT NULL constraint, which avoided long table locks. Although I didn't have an equivalent of a tool like pt-online-schema-change (which targets MySQL), I could perform similar online schema changes with scripts I wrote myself.
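
The three-step approach to adding a NOT NULL column can be sketched as a small driver that backfills in batches. This is an illustration under assumptions, not the scripts mentioned above: the function name, the `execute` callable (which runs one statement and returns the affected row count), the `id` primary key, and the batch size are all hypothetical, and identifiers are assumed to be trusted, hard-coded values. A fake executor is used so the sketch runs without a database.

```python
def add_not_null_column(execute, table, column, sql_type, fill_expr, batch_size=10_000):
    """Add a column as nullable, backfill it in batches, then enforce NOT NULL.

    Returns the number of backfill batches that updated at least one row.
    """
    # Step 1: adding a nullable column without a default is a quick metadata change
    execute(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")

    # Step 2: backfill in small batches so each UPDATE holds row locks only briefly
    batches = 0
    while True:
        updated = execute(
            f"UPDATE {table} SET {column} = {fill_expr} "
            f"WHERE id IN (SELECT id FROM {table} WHERE {column} IS NULL LIMIT {batch_size})"
        )
        if updated == 0:
            break
        batches += 1

    # Step 3: enforce the constraint once no NULLs remain
    execute(f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL")
    return batches


# Demo with a fake executor that pretends two backfill batches were needed.
executed = []
fake_rowcounts = iter([0, 10_000, 4_200, 0, 0])


def fake_execute(sql):
    executed.append(sql)
    return next(fake_rowcounts)


batches = add_not_null_column(fake_execute, "orders", "priority", "integer", "0")
print(batches, "batches,", len(executed), "statements")
```

The final step still needs to confirm that no NULLs exist, so on very large tables it is worth scheduling it for a quiet period.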

🔥 Risk-Free Schema Changes

When making schema changes in a multi-tenant database, a single wrong step can affect the systems of all tenants. Make changes as non-blocking as possible and always have a rollback plan. In particular, avoid blocking ALTER TABLE operations on large tables, or apply them gradually.

Observability, i.e., monitoring system metrics, logs, and traces, is even more critical in multi-tenant environments. Using Prometheus and Grafana, I monitored the performance metrics of each tenant separately (API response times, database query times). Collecting logs in a central system and being able to filter them by tenant_id dramatically sped up the troubleshooting process. When addressing a tenant's complaint that "the late shipment report was always missing," I was able to find the root cause of the problem (an integration service timing out) in just 3 days by searching the relevant logs with the tenant_id filter. Managing SLOs (Service Level Objectives) and error budgets was also a must to provide a certain level of service quality to each tenant.
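
Making logs filterable by tenant_id starts with attaching the tenant to every log record. Here is a minimal sketch using Python's standard `logging` module and a context variable; the logger name, the `TenantFilter` class, and the log format are assumptions, and an in-memory stream stands in for the central log system.

```python
import io
import logging
from contextvars import ContextVar

# Set per request; "-" marks records logged outside any tenant context.
current_tenant = ContextVar("current_tenant", default="-")


class TenantFilter(logging.Filter):
    """Copy the current tenant onto each record so formatters can print it."""

    def filter(self, record):
        record.tenant_id = current_tenant.get()
        return True


stream = io.StringIO()  # stand-in for a real log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s tenant=%(tenant_id)s %(message)s"))
handler.addFilter(TenantFilter())

logger = logging.getLogger("erp.integration")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

current_tenant.set("tenant_a")
logger.warning("shipment sync timed out")
print(stream.getvalue(), end="")
```

With the tenant on every line, a search such as `tenant=tenant_a` narrows an incident to one customer's requests without touching application code at each call site.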

The Future and Architectural Developments

Multi-tenant architectures are constantly evolving, and so are my own thoughts on them. In particular, the recent rise of artificial intelligence and serverless architectures has led me to reshape my approach in this field.

Serverless architectures, especially platforms like AWS Lambda or Google Cloud Functions, provide a natural fit for multi-tenancy. The fact that each request is handled as a separate function call and resources scale automatically greatly alleviates the noisy neighbor problem. By designing some microservices in the backend of my own side project as serverless, I built a more cost-efficient structure that can scale individually according to each tenant's load. This is much more flexible and less management-intensive than traditional VM-based approaches.

The integration of AI models into multi-tenant ERPs is also one of my areas of interest. While developing an AI-powered production planning module in a manufacturing company's ERP, I attempted to optimize production plans by customizing the model's behavior with each tenant's own data, relying mainly on prompt engineering. RAG (Retrieval-Augmented Generation) patterns allow AI models to produce more accurate responses by dynamically adding tenant-specific data to the model's context. By configuring fallbacks via OpenRouter across different models and providers such as Gemini Flash, Groq, or Cerebras, I can dynamically select the most cost-effective or fastest option according to each tenant's needs.
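
The fallback idea itself is independent of any specific provider API. The sketch below tries a list of callables in order and returns the first successful answer; the provider functions are fakes written for this example, and no real OpenRouter or vendor API is shown or implied.

```python
def complete_with_fallback(prompt, providers):
    """Try each (name, callable) pair in order; return (name, answer) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider failure triggers the next fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Fake providers for illustration only.
def flaky_provider(prompt):
    raise TimeoutError("upstream timed out")


def stable_provider(prompt):
    return f"plan for: {prompt}"


used, answer = complete_with_fallback(
    "schedule line 3",
    [("primary", flaky_provider), ("secondary", stable_provider)],
)
print(used, "->", answer)
```

Ordering the list per tenant (cheapest first for one, fastest first for another) is one simple way to express the per-tenant selection described above.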

💡 Tenant Isolation in AI Models

In multi-tenant AI applications, ensure that each tenant's data is isolated during model training or prompting processes. RAG patterns can support this isolation by dynamically adding tenant-specific data to the model's context, provided the retrieval step itself is filtered by tenant_id.
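
A tenant-scoped retrieval step can be sketched with a toy in-memory store. The word-overlap scoring below is a deliberately naive placeholder for a real vector search, and the `Doc` class, function names, and sample documents are hypothetical; the part that matters is that the tenant filter runs before any ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    tenant_id: str
    text: str


def retrieve(docs, tenant_id, query, k=2):
    """Return up to k documents of this tenant, ranked by naive word overlap."""
    terms = set(query.lower().split())
    scoped = [d for d in docs if d.tenant_id == tenant_id]  # isolation filter comes first
    scored = [(len(terms & set(d.text.lower().split())), d) for d in scoped]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [d for _, d in ranked[:k]]


def build_prompt(docs, tenant_id, question):
    """Assemble a prompt whose context contains only the given tenant's documents."""
    context = "\n".join(f"- {d.text}" for d in retrieve(docs, tenant_id, question))
    return f"Context:\n{context}\n\nQuestion: {question}"


docs = [
    Doc("tenant_a", "line 3 is down for maintenance on friday"),
    Doc("tenant_b", "line 3 runs a night shift this week"),
    Doc("tenant_a", "supplier delivery for steel is delayed"),
]
prompt = build_prompt(docs, "tenant_a", "when is line 3 down")
print(prompt)
```

Filtering before ranking matters: if the filter ran after a top-k search, another tenant's documents could crowd out the results or, worse, leak into the prompt.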

Zero-trust architectural principles are also becoming increasingly important for internal network security in multi-tenant systems. Company segmentation and ZTNA egress controls are critical not only against external threats but also for preventing potential lateral movement between tenants. A model where no internal service trusts another by default, and every connection undergoes identity and authorization checks, will form the basis of future ERP architectures. For example, to ensure a production line operator screen only accesses production data belonging to its own tenant_id, I pass every API call through a ZTNA proxy to verify its identity.

Conclusion

Multi-tenant architecture is an inevitable design choice for complex enterprise applications like ERP systems. However, this structure brings with it numerous challenges such as data isolation, resource sharing, security, performance, and operational management. What I have seen in my twenty years of field experience is that overcoming these challenges requires not only technical knowledge but also a pragmatic approach and continuous learning.

Every decision has a trade-off, and the best solution varies depending on the project's specific requirements, budget, and team's capabilities. For instance, I preferred tenant_id-based data isolation in a manufacturing ERP, while a banking project called for separate databases. The important thing is to make these trade-offs consciously and to be able to foresee potential risks in advance. My experiences in this field have taught me that architecture is not just about code; it is largely a reflection of organizational flows and the human factor.

Next step: I will explain how event-sourcing and CQRS patterns can be applied in multi-tenant systems with my own experiences.