DEV Community

Mustafa ERBAY
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

Anatomy of Supply Chain Data Flow: ERP Integrations

When developing a manufacturing ERP, my biggest nightmare was the delay in the digital twin of a physical movement. The moment an operator scans a barcode on the factory floor, it's not enough for that data to simply be written to the database; inventory must be updated, costs calculated, the shipping department notified, and if an AI-based production planning module exists, it should recalculate the route. In my 20 years of field experience, I've seen that supply chain software isn't just "software"; it's the codified representation of an organizational flow.

In this article, we'll examine the data's journey in a manufacturing ERP, from raw material acceptance to invoicing, the technical bottlenecks I encountered, and the architectural approaches I used to overcome them. I'll particularly focus on PostgreSQL transaction management, network layer segmentation, and asynchronous data processing. If your system receives complaints like "I sent the data, but it's not showing in stock," you're likely experiencing a leak somewhere in this flow.

The Data's Journey: From Raw Material to Product – SQL Operations

Everything in the supply chain begins with an INSERT or UPDATE, but these operations must be atomic. When working in a manufacturing ERP, if the system crashes while 500 kg of polypropylene leaving the raw material warehouse is being assigned to a work order, it leads to "lost" inventory. In such situations, I always use the Transaction Outbox Pattern to maintain consistency even under heavy load. While writing data to the main table, I also record the pending task in an "outbox" table within the same transaction.

Let me give you a real-world example: In one of my client's projects, data was being entered simultaneously from 40 different operator screens. When we tried to do too much work within BEGIN and COMMIT blocks on PostgreSQL, lock times increased, and we encountered deadlock errors. The solution I found was to move heavy calculations (like cost accounting) outside the main transaction and manage them asynchronously via a worker. This reduced UI latency from 800ms to 40ms.

-- Transaction Outbox Example
BEGIN;
  -- Record stock movement
  INSERT INTO stock_movements (product_id, quantity, type) 
  VALUES (101, -50, 'PRODUCTION_CONSUMPTION');

  -- Work order notification to outbox table
  INSERT INTO outbox_events (event_type, payload, status) 
  VALUES ('STOCK_UPDATED', '{"product_id": 101, "diff": -50}', 'PENDING');
COMMIT;
Enter fullscreen mode Exit fullscreen mode

The trade-off of this approach is the risk of "eventual consistency." The operator might see the stock reduced on their screen, but the central planning screen updates 2 seconds later. On a production line, these 2 seconds are tolerable, but the entire system halting due to database locks is not. As I mentioned in my [related: PostgreSQL performance tuning] article, if you don't set up your indexing strategies correctly, this outbox table will also eventually become a bottleneck.

Network Layer Bottlenecks: iSCSI and VLAN Segmentation

Data flow doesn't just happen at the application layer; the physical cables and switch configurations that carry this data are also part of the equation. When setting up the ERP infrastructure for a manufacturing plant, leaving office traffic and the traffic from IoT devices and barcode readers on the production floor in the same VLAN is a big mistake. Last year, in one factory, a broadcast storm on a switch in a staff member's office locked up all handheld terminals on the production line.

To overcome this problem, I strictly segmented the network. I allowed only necessary ports (5432 for PostgreSQL, 8000 for API) to pass through the firewall between the Production Area (VLAN 10), Server Room (VLAN 20), and Office (VLAN 30). Additionally, I run iSCSI traffic for storage units on completely isolated switches and dedicated ports. Be careful when setting MTU 9000 (Jumbo Frames) on the iSCSI side; if it's not supported along the entire path (Switch-NIC-Storage), performance can drop by 400% due to packet fragmentation.

⚠️ MTU Mismatch

If you have remote storage connected via VPN, remember to reduce the MTU value to 1400 or 1380 instead of the standard 1500. If you don't perform MSS clamping, large JSON packets from the ERP (e.g., a 200-line delivery note) may not pass through the tunnel, causing the connection to "hang."

Regarding network security, switch hardening is a must. In a network where DHCP Snooping and Dynamic ARP Inspection (DAI) are not active, a malicious device (or an accidentally connected router) can hijack all ERP traffic (Man-in-the-Middle). In a manufacturing ERP, the accuracy of data is as critical as its confidentiality; a fake "production completed" packet can collapse the entire supply chain planning.

Asynchronous Flows: FastAPI and Background Tasks Management

My main reason for using FastAPI in the ERP backend is its asynchronous (async/await) structure. However, making everything asynchronous isn't always the solution. Especially for CPU-intensive tasks (e.g., generating a thousands-line Excel output or running a complex production planning algorithm), you shouldn't block FastAPI's event loop.

In one of my client's projects, when the operator pressed the "Finish Production" button, the system wouldn't respond for 12 seconds. Upon reviewing the logs, I saw that the application was generating PDF barcodes and sending emails at that time. As a solution, I moved these tasks to a separate worker structure (Redis + Celery or a lighter solution like taskiq) instead of BackgroundTasks.

from fastapi import FastAPI, BackgroundTasks
from redis import Redis

app = FastAPI()
redis_conn = Redis(host='localhost', port=6379)

@app.post("/production/finish/{job_id}")
async def finish_production(job_id: int, background_tasks: BackgroundTasks):
    # 1. Perform critical database operation immediately
    await update_job_status(job_id, "COMPLETED")

    # 2. Offload heavy tasks to the background
    # background_tasks.add_task(generate_pdf_and_mail, job_id) # This is still in the same process
    redis_conn.lpush("task_queue", job_id) # This transfers it to another worker

    return {"status": "success", "message": "Operation queued."}
Enter fullscreen mode Exit fullscreen mode

The most critical point in this architecture is idempotency. What happens if the operator presses the button twice? Storing a unique request_id for each operation in the database and rejecting the second request with the same ID prevents double-counting of stock. I usually expect an X-Idempotency-Key via the header and store it in Redis for 24 hours.

Integration Security: JWT, Rate Limiting, and Fail-Safe Mechanisms

The supply chain isn't limited to your factory; you communicate with logistics companies, customs brokers, and raw material suppliers via APIs. This connection to the outside world is the system's weakest link. When an external API (e.g., a cargo tracking service) slows down or crashes, it's unacceptable for your ERP to slow down as well.

For such situations, I implement the Circuit Breaker pattern. If an external service returns 5 consecutive errors or its response time exceeds 2 seconds, the system stops trying to reach that service for 60 seconds and returns a message to the user like "Cargo service is currently unavailable, please try again later." This prevents consuming all system resources (threads/memory) waiting for that faulty service.

Security Layer Implementation Method Benefit Provided
Auth OAuth2 + JWT (RS256) Secure authorization between microservices
Rate Limit Nginx limit_req Prevents brute force and erroneous loops
Validation Pydantic (Strict Mode) Prevents dirty data from leaking into the database
Audit PostgreSQL Triggers Proof of who changed what, when

Speaking of security, I also take some kernel-level precautions. On ERP servers, I blacklist all unnecessary kernel modules. For example, preventing the loading of modules like algif_aead which have had past CVE records (like CVE-2026-31431), narrows the attack surface. Scanning Nginx logs with fail2ban and blocking IPs that serially produce 404 or 401 errors at the firewall level is also part of my standard procedure.

Database Performance: PostgreSQL WAL Bloat and Index Strategies

In a manufacturing ERP, data volume grows rapidly. The stock movements table alone can reach 50 million rows in a year. At this point, standard B-tree indexes start to become insufficient. Especially for timestamped data, using BRIN (Block Range Index) can reduce index size from 100 MB to 500 KB and significantly increase insert performance.

In an incident last month, I saw PostgreSQL's WAL (Write Ahead Log) files fill the disk to 100%. The reason was the failure of archive_command during a massive data transfer (import). When the disk filled up, the database went into self-protection mode and shut down. To prevent such disasters, it's necessary to tune max_wal_size and min_wal_size settings according to server RAM and definitely monitor long-running transactions with a monitoring tool (e.g., via pg_stat_activity).

# PostgreSQL Configuration Check
# /var/lib/postgresql/data/postgresql.conf
max_connections = 200
shared_buffers = 4GB # 25% of total RAM
effective_cache_size = 12GB # 75% of total RAM
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1 # If SSD is used
Enter fullscreen mode Exit fullscreen mode

Additionally, I offload reporting load from the main server by using logical replication at the database level. Pulling heavy cost reports while production is ongoing can inflate disk I/O and freeze operator screens. Running reports always from a "Read Replica" is essential for operational continuity. [related: PostgreSQL WAL bloat management]

Operator Screens and UX: Low-Latency Data Visualization

The furthest point of the supply chain is the operator walking around the warehouse with a tablet or standing at a machine. The design of these screens must be different from white-collar screens. On an operator screen, "speed" and "error margin" are more important than "aesthetics." The most important lesson I've learned from screens I've developed since 2006 is this: large buttons that an operator can press even with gloves, and clear audio/visual feedback (green/red light) at the end of each operation.

On the technical side, to keep these screens up-to-date, I prefer Server-Sent Events (SSE) or a smart polling mechanism over WebSockets. Wi-Fi signals often drop in a factory environment. WebSocket connection state management becomes difficult when the connection is lost. However, SSE or an idempotency-supported polling structure can resume from where it left off as soon as the connection is re-established.

In one production project, we added an AI-powered control layer to prevent operators from entering incorrect raw material data. This model, running on Gemini Flash, checks the entered data's compatibility with past production patterns. If an operator types 100 kg where they should normally enter 10 kg, the system warns, "This value is outside the norm, are you sure?" We perform this check as an asynchronous task in the backend, not client-side, so as not to interrupt the operator's speed.

In conclusion, supply chain data flow is not just about writing code; it requires considering every layer, from the network to the database kernel, from operator psychology to the reliability of external APIs. In these systems, there is no "perfect," only "working and recoverable" systems. In the next article, I will explain how I set up the monitoring and observability processes for these complex structures.

Next step: [related: Observability in software architecture].

Top comments (0)