Introduction
A web crawler is only as useful as it is stable.
When a scraper sends a log every 7 seconds — and dozens or even hundreds of users are using it simultaneously — things can go wrong, fast.
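With, say, 100 concurrent users, that is roughly 14 log events per second, or over a million rows per day if every event were written straight to the database.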
How do you stop:
- The database from exploding?
- The system from slowing to a crawl?
- The support team from drowning in unmanageable logs?
To prevent this, you need a logging architecture that scales.
This article walks through a real-world implementation of a logging system purpose-built for a scalable web scraper, focusing on performance, durability, and developer experience.
Two Hidden Enemies of Real-Time Logging Systems
1- Unbounded Data Growth
Without structure and filtering, logs quickly saturate the database and make analytics almost impossible.
2- Performance Degradation
Poorly designed logging directly impacts frontend responsiveness and backend throughput.
Frontend Logging Strategy (React): Store Less, Show More
In the React frontend, I chose to keep only the latest 50 logs in memory and display them in the UI.
const MAX_LOGS = 50;

// Prepend the new log and keep only the newest MAX_LOGS entries in state
const handleLog = useCallback((log: ScrapLog) => {
  setLogs((prev) => [log, ...prev.slice(0, MAX_LOGS - 1)]);
}, []);
Why this works:
- Users see only fresh, relevant logs
- DOM and memory stay lightweight
- Logs can be exported as CSV for support teams (sketched below)
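Because the logs already live in React state, the CSV export can stay entirely client-side. Here is a minimal sketch, assuming ScrapLog exposes timestamp, level, and message fields (illustrative names, not necessarily the real type):

// Illustrative shape; the real ScrapLog type may carry different fields
type ScrapLog = { timestamp: string; level: string; message: string };

// Convert the in-memory logs to CSV, escaping quotes so embedded commas stay safe
function logsToCsv(logs: ScrapLog[]): string {
  const header = "timestamp,level,message";
  const rows = logs.map((l) =>
    [l.timestamp, l.level, l.message]
      .map((v) => '"' + String(v).replace(/"/g, '""') + '"')
      .join(",")
  );
  return [header, ...rows].join("\n");
}

// Trigger a browser download of the current 50 logs for the support team
function downloadCsv(logs: ScrapLog[]) {
  const blob = new Blob([logsToCsv(logs)], { type: "text/csv" });
  const url = URL.createObjectURL(blob);
  const a = document.createElement("a");
  a.href = url;
  a.download = "scrap-logs.csv";
  a.click();
  URL.revokeObjectURL(url);
}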
Backend Architecture: Separate the Signals from the Noise
1- Summary Logs
Each scraping session generates one summary record (sketched after this list) containing:
- Number of list pages scraped (pagination)
- Number of product items extracted
- URL of the first and last list pages scraped (pagination)
- First and last items extracted
- Final status (success or failure)
- Total execution time
Retention: Permanent
Use case: Dashboard analytics and long-term monitoring
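As a rough illustration, the summary payload could be modeled like this (field names are hypothetical, not the actual schema):

// Hypothetical shape of the per-session summary record
interface ScrapSessionSummary {
  listPagesScraped: number;       // pagination pages visited
  itemsExtracted: number;         // product items pulled from those pages
  firstListPageUrl: string;       // first pagination URL
  lastListPageUrl: string;        // last pagination URL
  firstItem: string;              // first extracted item (URL or identifier)
  lastItem: string;               // last extracted item
  status: "success" | "failure";  // final session status
  durationMs: number;             // total execution time
}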
2- Fine-Grained Logs (Triggered by Unexpected Errors)
If the scraper encounters an unexpected error (i.e., a type of error not seen in the past 24 hours), the frontend sends the last 50 logs to the server (see the sketch after this list). These include:
- URLs of visited pages
- Actions performed
- Any captured errors
Format: Lightweight, structured JSON
Retention: 7 days (configurable)
Cleanup: Automatically via scheduled job
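A minimal sketch of that trigger on the client, assuming a localStorage-based 24-hour de-duplication per error type and a hypothetical /api/scrap-logs/batch endpoint (both are illustrative, not the exact implementation):

const SEEN_KEY = "seenErrorTypes";
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns true only if this error type has not been reported in the last 24 hours
function isNewErrorType(errorType: string): boolean {
  const seen: Record<string, number> = JSON.parse(localStorage.getItem(SEEN_KEY) ?? "{}");
  const last = seen[errorType];
  const isNew = last === undefined || Date.now() - last > DAY_MS;
  if (isNew) {
    seen[errorType] = Date.now();
    localStorage.setItem(SEEN_KEY, JSON.stringify(seen));
  }
  return isNew;
}

// On an unexpected error, ship the latest 50 in-memory logs to the backend
async function reportUnexpectedError(errorType: string, logs: ScrapLog[]) {
  if (!isNewErrorType(errorType)) return; // already reported within 24 h
  await fetch("/api/scrap-logs/batch", {  // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ errorType, logs: logs.slice(0, 50) }),
  });
}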
Smart Cleanup in Laravel + MySQL
Efficient storage isn't enough; you also have to clean up intelligently.
🔍 Why Indexing Matters
To speed up deletion of old logs, we index the created_at field. This drastically improves the performance of time-based queries.
CREATE INDEX idx_created_at ON scrap_logs (created_at);
🧹 Scheduled Cleanup Job in Laravel
// app/Console/Commands/DeleteOldLogs.php
namespace App\Console\Commands;

use App\Models\ScrapLog;
use Illuminate\Console\Command;

class DeleteOldLogs extends Command {
    protected $signature = 'logs:cleanup';

    public function handle() {
        // The created_at index keeps this time-based delete fast
        ScrapLog::where('created_at', '<', now()->subDays(7))->delete();
        $this->info('Old logs cleaned successfully!');
    }
}
And register it in the scheduler (app/Console/Kernel.php):

protected function schedule(Schedule $schedule) {
    $schedule->command('logs:cleanup')->dailyAt('01:00');
}
Why This Architecture Works
✅ Only essential logs are stored
✅ Tables stay clean and queryable
✅ Debugging becomes painless
✅ Data analysis remains performant
Conclusion
A well-designed logging system isn't just for debugging; it's a critical survival mechanism. With a scalable, performance-conscious architecture, your system can remain:
- Stable under load
- Transparent when things go wrong
- Insightful for business and technical teams
- User-friendly
Since implementing this system:
- Debugging is fast
- User behavior is easy to analyze
- We retain full traceability when incidents occur
I hope these insights help you on your journey.
📣 Let's Talk
How do you handle logging in production systems?
Share your thoughts in the comments - I'd love to hear your approach. If this article helped, consider clapping 👏 and following for more insights on web development, browser automation, and software engineering.
📬 Get in Touch
- Email: amin.mashayekhan@gmail.com
- Book a Quick Tech Call: https://calendly.com/amin-mashayekhan/15min-tech-call
Let's build better tools, faster!