Our order processing queue hit a backlog of over 5,000 messages—compared to a normal baseline under 100. Even worse, customer order statuses were stuck at "Paid" for over 4 hours with no updates. Investigation revealed that our scheduled polling script for syncing 1688 order statuses had exhausted the database connection pool under high concurrency, causing the entire system to hang.
Background: The Cost of Polling
We were running a Cron Job every 5 minutes that called the 1688 Open API to fetch order status changes from the past 24 hours. The code looked like this:
import requests
import time
from myapp.models import Order
from myapp.db import get_db_connection
def poll_1688_orders():
conn = get_db_connection()
cursor = conn.cursor()
# Fetch 100 records per page, no incremental markers
orders = requests.get("https://api.1688.com/order/list", params={"pageSize": 100, "pageNum": 1})
for order in orders.json()["data"]:
cursor.execute("UPDATE orders SET status = %s WHERE order_id = %s", (order["status"], order["id"]))
conn.commit()
cursor.close()
conn.close()
This script had three critical issues:
- Full fetch: Retrieved all orders on each run, with no incremental ID or timestamp markers, causing duplicate processing
-
Synchronous blocking:
requests.getwas blocking—when the API responded slowly, the entire process stalled -
Connection leaks: Under high concurrency, connections created by
get_db_connection()weren't properly released, eventually exhausting the connection pool
The direct consequence of that incident: database connections spiked to 200+, exceeding MySQL's default maximum of 151, causing all write operations to fail. We spent 45 minutes manually restarting services and clearing the queue, during which 120 order shipping status updates were delayed.
Solution: Replacing Polling with Webhooks
We decided to migrate to Webhook mode. The 1688 Open API supports configuring a callback URL for each application. When order statuses change, it proactively pushes a JSON payload to our server. This fundamentally eliminates the resource waste and latency caused by polling.
Step 1: Configure the Webhook endpoint in the 1688 Open Platform.
Step 2: Write a lightweight Webhook receiver using an async framework to process requests without blocking.
from fastapi import FastAPI, Request, HTTPException
from pydantic import BaseModel
import hmac
import hashlib
import asyncio
from myapp.services import update_order_status
app = FastAPI()
class OrderWebhookPayload(BaseModel):
order_id: str
new_status: str
timestamp: int
signature: str
@app.post("/webhook/1688/order")
async def handle_1688_webhook(payload: OrderWebhookPayload, request: Request):
# 1. Verify signature - prevent forged requests
secret = "your_1688_webhook_secret"
raw_body = await request.body()
expected_signature = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
if payload.signature != expected_signature:
raise HTTPException(status_code=403, detail="Invalid signature")
# 2. Async database update - use connection pool
await update_order_status(payload.order_id, payload.new_status)
# 3. Return 200 quickly to avoid timeout retries
return {"status": "ok"}
Key improvements:
-
Async processing:
async defallows FastAPI to handle thousands of concurrent Webhook requests simultaneously without blocking I/O - Signature verification: Uses HMAC-SHA256 to verify payload origin, preventing malicious requests
-
Connection pool:
update_order_statusinternally uses SQLAlchemy's connection pool, keeping maximum connections under 20
Lessons Learned: Refactoring from the Incident
This incident prompted us to redesign the entire order sync architecture. Here are the specific optimizations:
1. Decouple with a Message Queue
The Webhook receiver only handles verification and enqueuing—it doesn't write directly to the database. We use Redis as a lightweight queue:
import json
import aioredis
redis = await aioredis.from_url("redis://localhost:6379")
async def enqueue_order_update(order_id: str, new_status: str):
message = json.dumps({"order_id": order_id, "status": new_status})
await redis.lpush("order_updates", message)
2. Batch Processing in Consumer
Another async worker pulls messages from the queue in batches, updating the database every 10 messages or every 5 seconds:
async def batch_update_worker():
while True:
messages = []
for _ in range(10):
msg = await redis.rpop("order_updates")
if msg:
messages.append(json.loads(msg))
else:
break
if messages:
async with db_pool.acquire() as conn:
await conn.executemany(
"UPDATE orders SET status = $1 WHERE order_id = $2",
[(m["status"], m["order_id"]) for m in messages]
)
await asyncio.sleep(5)
After deploying this architecture, our order sync latency dropped from an average of 5 minutes to under 10 seconds, and database connections stabilized at 15-20. More importantly, we haven't experienced any more cascading failures caused by polling.
Summary: Don't use polling to solve real-time problems, especially when a third-party API provides Webhooks. The Webhook + message queue combination not only reduces resource consumption but also makes systems more stable when handling burst traffic. If your order system is still using Cron Job polling, now is the time to refactor.
What do you think of this approach? Has your order sync system encountered similar bottlenecks? Let's discuss in the comments.
About the Author: Building cross-border purchasing solutions with taocarts — a daigou system for 1688/Taobao purchasing, order management, and international shipping.
Top comments (0)