DEV Community

Mustafa ERBAY
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

Supply Chain Data Flow: An Anatomical Analysis for ERP Systems

The difference between a supply chain integration run with synchronous REST API calls and a data flow based on an asynchronous, Transactional Outbox architecture is the difference between the production line stopping or keeping on running. In classic monolithic structures or hastily written microservices, sending an HTTP POST request to an external supplier system and waiting for a "200 OK" is the most common design mistake. When the external API fails to respond for 15 seconds or the connection drops, the internal PostgreSQL transaction remains open, row-level locks accumulate, and eventually, the connection pool is exhausted.

In this anatomical analysis, I dissect the data flow architecture that I built while developing a production ERP—which manages hundreds of inventory movements and supplier orders per second—across all its layers. From database locks to network segmentation, real-time operator screens to AI-powered supplier risk analysis, I will explain everything using methods I have personally implemented in the field.


From Order to Production: The Anatomy of Purchasing and Inventory Reservation

At the entry of an order, which is the first link in the supply chain, the consistency of inventory quantities is a critical threshold. In a production ERP, imagine 10 different sales representatives trying to reserve the same critical raw material (for example, a special alloy steel coil) at the exact same time. If your database isolation level is left at the default Read Committed setting, "double booking"—meaning the duplicate reservation of the same inventory for two different orders—is inevitable.

To overcome this issue, I use a pessimistic locking mechanism with SELECT ... FOR UPDATE on PostgreSQL. However, there is a major danger here: Deadlock. If transaction A tries to lock inventory X first and then inventory Y, while transaction B simultaneously tries to lock inventory Y first and then inventory X, the system locks up. The way to prevent this is to always sort the inventory IDs in the application layer before sending them to the database.

The SQL block below shows how a secure inventory reservation process and a structure that limits the lock wait time in case of a potential conflict work on PostgreSQL 16+.

-- We define a 5-second lock wait timeout
SET LOCAL lock_timeout = '5000'; -- 5 seconds

BEGIN;

-- We lock the inventory IDs in a sorted order (Deadlock prevention)
SELECT id, sku, stock_qty, reserved_qty 
FROM inventory_items 
WHERE id IN (1042, 1089, 2101) 
ORDER BY id 
FOR UPDATE;

-- Inventory check and reservation update
UPDATE inventory_items 
SET reserved_qty = reserved_qty + 150
WHERE id = 1042 AND (stock_qty - reserved_qty) >= 150;

-- If the affected row count is not 1, the balance is insufficient, we rollback.
-- The application layer catches this situation and returns "Insufficient Stock" to the user.

COMMIT;
Enter fullscreen mode Exit fullscreen mode

In a client project where I set up this structure, I reduced the error margin to zero across 45,000 daily inventory movements. If I hadn't limited the lock wait time (lock_timeout), database connections would have hung, and the entire ERP panel would have become inaccessible.


Queue Management and Idempotency: The Battle Against Duplicate Records

When integrating with external supplier portals (for example, placing raw material orders over iSCSI or via an external web service), network outages are your greatest enemy. Your application sends the purchase request, the supplier's server processes the transaction, but the response back to you gets lost along the way. When your application retries the request, a duplicate order is created on the supplier's side. This situation can come back to the company as thousands of dollars in incorrect invoices.

The only solution to this is an Idempotency architecture (where repeating the same request does not change the state). I assign a unique idempotency_key (UUIDv4) to every order request. The supplier API stores this key on Redis for 24 hours. When a new request arrives with the same key, the transaction is not executed again; the old successful response stored in Redis is returned directly.

ℹ️ Transaction Outbox Pattern

To make the database update and the message writing to the queue a single atomic unit, I use the Transactional Outbox pattern. The message is not written directly to the queue; it is first written to the outbox table in PostgreSQL, and from there, it is transferred to the Redis/RabbitMQ queue with the help of a daemon.

Below, you can examine the service function implementing the transaction outbox pattern and ensuring idempotency, written using FastAPI (Python) and PostgreSQL:

import uuid
from sqlalchemy.orm import Session
from models import PurchaseOrder, OutboxMessage

def create_purchase_order(db: Session, supplier_id: int, items: list, idempotency_key: str) -> PurchaseOrder:
    # 1. We perform the idempotency check
    existing_order = db.query(PurchaseOrder).filter(PurchaseOrder.idempotency_key == idempotency_key).first()
    if existing_order:
        return existing_order  # Already created, return the existing order

    # 2. We create the new order
    new_order = PurchaseOrder(
        id=uuid.uuid4(),
        supplier_id=supplier_id,
        items=items,
        status="PENDING",
        idempotency_key=idempotency_key
    )
    db.add(new_order)

    # 3. We add the message to the outbox table (Within the same transaction!)
    outbox_payload = {
        "event_type": "ORDER_CREATED",
        "order_id": str(new_order.id),
        "supplier_id": supplier_id
    }
    outbox_entry = OutboxMessage(
        aggregate_type="PurchaseOrder",
        aggregate_id=str(new_order.id),
        payload=outbox_payload,
        processed=False
    )
    db.add(outbox_entry)

    # Both are written together or not at all with a single commit at the DB level
    db.commit()
    db.refresh(new_order)
    return new_order
Enter fullscreen mode Exit fullscreen mode

I have made this method a standard in my own side products and enterprise projects. When the network drops, the system does not crash; a systemd service running in the background (with the help of a timer) scans the outbox table and safely transmits unprocessed records to the external service.


Production Planning Algorithms and Real-Time Operator Screens

The screens in front of operators on the shop floor are where real-time data flow is most critical. In the event of a machine breakdown or running out of raw materials, production must be stopped immediately, and the planning algorithm (our AI-powered planning engine) must reroute. To provide this flow, the polling method (sending requests at specific intervals) is a complete disaster; 150 operator tablets querying the database every 2 seconds would push disk I/O values through the roof.

In this architecture, I completely rely on WebSockets and the Nginx reverse proxy layer. Operator screens connect to the FastAPI backend via Nginx using a persistent TCP connection. When an event occurs, the backend pushes the message only to the relevant operator group.

However, Wi-Fi signals in factories drop frequently. When the network disconnects, the WebSocket connection drops. To manage this situation, I had to configure custom timeout and buffer settings on Nginx. Below, I share the Nginx configuration I use to compensate for unstable network connections on the production floor:

# /etc/nginx/conf.d/websocket.conf

upstream websocket_backend {
    server 127.0.0.1:8000;
    keepalive 64; # Number of idle connections to keep open with the backend
}

server {
    listen 80;
    server_name fabrika.local;

    location /ws/ {
        proxy_pass http://websocket_backend;
        proxy_http_version 1.1;

        # Required headers for WebSocket
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
        proxy_set_header Host $host;

        # We extend timeout periods to prevent connection drops
        proxy_read_timeout 3600s; # Closes the connection if no data flows for 1 hour
        proxy_send_timeout 3600s;

        # We turn off buffering so there is no latency
        proxy_buffering off;

        # We enable TCP keepalive packets
        proxy_socket_keepalive on;
    }
}
Enter fullscreen mode Exit fullscreen mode

Previously, while setting up a similar WebSocket proxy layer on a VPS infrastructure, I noticed connections dropped every 60 seconds because the keepalive setting was missing. Thanks to this configuration, even if industrial tablets switch between access points while moving inside the factory, the data flow continues uninterrupted.


PostgreSQL WAL Bloat and Managing Heavy Disk I/O in the Supply Chain

IoT sensors, barcode readers, and RFID antennas in the supply chain bring a massive write load to the database. In PostgreSQL, every UPDATE operation actually means writing a new version of the row (due to MVCC) and marking the old one as a "dead tuple". If your autovacuum mechanism is not configured correctly, database files (especially Write-Ahead Logging - WAL files) bloat incredibly (WAL bloat).

In a production ERP, I saw disk occupancy reach 98% on a table receiving approximately 1.2 million inventory status updates per day. The solution was not just to make autovacuum settings more aggressive, but also to define custom storage parameters for tables receiving heavy writes.

Below, you can find the postgresql.conf optimization parameters I prepared for PostgreSQL 15+ that prevent disk bottlenecks in systems with this kind of heavy I/O:

# /etc/postgresql/16/main/postgresql.conf or custom.conf

# WAL Write Settings
wal_level = replica
max_wal_size = 16GB       # Depending on disk status; prevents frequent checkpoints
min_wal_size = 2GB
checkpoint_completion_target = 0.9 # Smooths out disk I/O distribution

# Aggressive Autovacuum Settings
autovacuum = on
autovacuum_max_workers = 5         # Can be increased based on CPU core count
autovacuum_vacuum_scale_factor = 0.05 # Trigger vacuum when 5% of the table changes
autovacuum_vacuum_cost_limit = 1000  # Raise the disk I/O limit for vacuum
autovacuum_vacuum_cost_delay = 10    # Wait time in milliseconds
Enter fullscreen mode Exit fullscreen mode

After applying these settings, the disk write queue length dropped from 4.2 to around 0.3. Additionally, by setting the fillfactor parameter to 80% on status tables that are updated very frequently, I ensured that new updates remain on the same disk page. This also prevented index bloat.


Network Segmentation and Zero-Trust: Safely Connecting the Factory Floor to the ERP Server

Networks on the factory floor are generally "dirty" networks. Maintenance technicians' laptops, IoT devices, and even guest Wi-Fi networks can pass through the same physical switches. If an operator tablet on the factory floor is compromised or the network is breached, direct access to the ERP database or main servers must not be possible. I never fall into the "internal network is secure" fallacy.

I apply the Zero-Trust (Never Trust, Always Verify) principle in the architecture. I place devices on the factory floor into a separate VLAN (for example, VLAN 120 - Production). This VLAN has no direct access to the main server network (VLAN 10 - DMZ/Core). Access is allowed only from specific IPs to specific ports (for example, only HTTPS - 443) through a firewall positioned in between.

⚠️ Security Risk

Opening the SQL port (5432) directly to devices inside the factory is suicide. A breach on any device could result in the entire database being encrypted (ransomware). All access must pass through an API Gateway.

Additionally, I set up dual-stack WireGuard VPN tunnels for warehouses in external branches to connect to the central server. Below, you can see the iptables rules and WireGuard configuration I use on the central gateway to limit requests coming from the factory network only to the API port:

# /etc/wireguard/wg0.conf (Central VPN Server)
[Interface]
Address = 10.8.0.1/24
ListenPort = 51820
PrivateKey = <central_private_key>

# Factory Warehouse Server Connection
[Peer]
PublicKey = <factory_warehouse_public_key>
AllowedIPs = 10.8.0.10/32
Enter fullscreen mode Exit fullscreen mode

The firewall rules I applied for restriction are as follows:

#!/bin/bash
# We restrict traffic coming from the VPN interface
# Only allow IP 10.8.0.10 (Factory Warehouse) to access port 443 of server 192.168.10.50 (ERP API)

# First, clear existing rules
iptables -F FORWARD

# Define the rules
iptables -A FORWARD -s 10.8.0.10/32 -d 192.168.10.50/32 -p tcp --dport 443 -j ACCEPT
iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -j DROP # Block all other traffic
Enter fullscreen mode Exit fullscreen mode

Thanks to these strict rules, even if a machine on the factory floor gets infected with a virus, I completely eliminate the possibility of the virus jumping to the central ERP servers.


API Fallback Architecture: Smart Supply Chain with Multi-Provider and AI Integration

In supply chain planning, we now integrate artificial intelligence models (LLMs) to predict delivery times and map alternative supplier routes. However, there is a major risk here: When the API of the AI service provider (for example, OpenAI or Anthropic) crashes or returns a rate limit (HTTP 429) error, a critical shipping decision cannot be left half-finished.

To overcome this issue, I built a Multi-Provider Fallback architecture. The system primarily sends requests to the fastest and cheapest model (for example, Gemini Flash via OpenRouter). If a timeout (for example, 3 seconds) or an error is returned, it instantly switches to a Llama model on Groq or the Cerebras infrastructure.

The Python class below shows how I implement this smart fallback mechanism, along with an exponential backoff algorithm:

import time
import requests
from typing import Dict, Any

class ResilientAIService:
    def __init__(self):
        # Provider priority order and API endpoints
        self.providers = [
            {"name": "OpenRouter-Gemini", "url": "https://openrouter.ai/api/v1/chat/completions", "key": "KEY_A", "model": "google/gemini-2.5-flash"},
            {"name": "Groq-Llama", "url": "https://api.groq.com/openai/v1/chat/completions", "key": "KEY_B", "model": "llama3-70b-8192"},
            {"name": "Cerebras-Llama", "url": "https://api.cerebras.ai/v1/chat/completions", "key": "KEY_C", "model": "llama3.1-8b"}
        ]

    def analyze_supplier_risk(self, prompt: str) -> Dict[str, Any]:
        payload = {
            "messages": [{"role": "user", "content": prompt}]
        }

        for provider in self.providers:
            payload["model"] = provider["model"]
            headers = {
                "Authorization": f"Bearer {provider['key']}",
                "Content-Type": "application/json"
            }

            # We make a maximum of 3 attempts for each provider
            for attempt in range(3):
                try:
                    response = requests.post(
                        provider["url"], 
                        json=payload, 
                        headers=headers, 
                        timeout=3.5 # Our timeout limit is 3.5 seconds
                    )

                    if response.status_code == 200:
                        return {
                            "success": True,
                            "provider": provider["name"],
                            "data": response.json()["choices"][0]["message"]["content"]
                        }

                    if response.status_code == 429: # Rate limit
                        time.sleep(2 ** attempt) # Exponential backoff

                except requests.exceptions.RequestException:
                    # In case of connection error or timeout, silently move to the next attempt
                    pass

            # If this provider failed completely, log it and move to the next one
            print(f"Provider {provider['name']} failed. Trying next provider...")

        return {"success": False, "error": "All AI providers failed."}
Enter fullscreen mode Exit fullscreen mode

I use this design in the purchasing module of a production ERP I developed. Supplier delay risk analyses are completed through alternative channels within milliseconds, even if one of the providers crashes. The system always remains operational.


Summary and My Architectural Choice

Enterprise software architecture is the art of coordinating organizational and physical flows rather than just writing code. In structures like the supply chain that do not tolerate errors and directly touch physical production and logistics, one must break away from the monolithic mindset.

My architectural choices presented in this analysis are as follows:

Layer Preferred Technology / Method Reason for Choice
Database PostgreSQL (Pessimistic Locking + Autovacuum Tuning) Data consistency, deadlock prevention, and high I/O management.
Queue / Idempotency Transactional Outbox + Redis Preventing data loss and duplicate records during network outages.
Communication Layer WebSockets (Nginx Keepalive and Buffering Off) Low latency and stable connection on operator screens.
Security VLAN Segmentation + WireGuard Zero-Trust security model and isolation of the factory floor.

Top comments (0)