Winola Joe
Handling Big Data Challenges: A Case Study of AllFreeNovel.cc

## Technical Challenges & Solutions

1. Data Ingestion Bottlenecks

Problem:

Daily ingestion of 50,000+ new chapters from multiple sources (CN/JP/KR) with varying formats:

  • XML feeds from Korean publishers
  • JSON APIs from Chinese platforms
  • Raw text dumps from Japanese partners

Solution:

```python
# Distributed ETL pipeline: every source streams into a shared Kafka topic
class ChapterIngestor:
    def __init__(self, producer, schema_registry):
        self.producer = producer                # async Kafka producer (e.g. aiokafka)
        self.kafka_topic = "raw-chapters"
        self.schema_registry = schema_registry  # Avro schema registry client

    async def process(self, source):
        # Stream in chunks so large text dumps never sit fully in memory
        async for chunk in source.stream():
            normalized = await self._normalize(chunk)
            await self.producer.produce(
                self.kafka_topic,
                value=normalized,
                schema=self.schema_registry.get(source.format),
            )
```
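The `_normalize` step has to fold those three source formats into one canonical record. A minimal sketch of what that dispatch might look like (the field names and format tags here are illustrative assumptions, not the production schema):

```python
import json
import xml.etree.ElementTree as ET

def normalize_chapter(raw: bytes, fmt: str) -> dict:
    """Fold the three source formats into one canonical record.
    Field names are illustrative; the real schema lives in the registry."""
    if fmt == "xml":    # Korean publisher feeds
        root = ET.fromstring(raw)
        return {"title": root.findtext("title"), "body": root.findtext("body")}
    if fmt == "json":   # Chinese platform APIs
        doc = json.loads(raw)
        return {"title": doc["title"], "body": doc["content"]}
    if fmt == "text":   # Japanese raw text dumps: first line is the title
        title, _, body = raw.decode("utf-8").partition("\n")
        return {"title": title.strip(), "body": body.strip()}
    raise ValueError(f"unknown source format: {fmt}")
```

Whatever the real mapping looks like, doing it at ingestion time means every downstream consumer sees a single schema.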

2. Search Performance Optimization

Metrics Before Optimization:

  • 1200ms average query latency
  • 78% cache miss rate
  • 12-node Elasticsearch cluster at 85% load

Implemented Solutions:

  1. Hybrid Index Strategy

    • Hot data (latest chapters): In-memory RedisSearch
    • Warm data: Elasticsearch with custom tokenizer
    • Cold data: ClickHouse columnar storage
  2. Query Pipeline:

```mermaid
graph TD
    A[User Query] --> B{Query Type?}
    B -->|Simple| C[RedisSearch]
    B -->|Complex| D[Elasticsearch]
    B -->|Analytics| E[ClickHouse]
    C --> F[Result Blender]
    D --> F
    E --> F
    F --> G[Response]
```
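A toy version of the router behind that diagram could look like this (the classification heuristics and backend callables are illustrative assumptions, not the actual AllFreeNovel.cc code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class QueryRouter:
    """Dispatch each query to the tier that can answer it fastest."""
    simple_backend: Callable[[str], list]     # e.g. a RedisSearch client
    complex_backend: Callable[[str], list]    # e.g. an Elasticsearch client
    analytics_backend: Callable[[str], list]  # e.g. a ClickHouse client

    def classify(self, query: str) -> str:
        # Toy heuristics: a real classifier would inspect the parsed query
        if any(op in query for op in ("GROUP BY", "count(", "sum(")):
            return "analytics"
        if any(op in query for op in ("AND", "OR", "NOT", '"')):
            return "complex"
        return "simple"

    def search(self, query: str) -> list:
        backend = {
            "simple": self.simple_backend,
            "complex": self.complex_backend,
            "analytics": self.analytics_backend,
        }[self.classify(query)]
        return backend(query)
```

The point of the split is that the cheap path (single-term lookups against hot data) never pays Elasticsearch's coordination overhead.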

3. Real-time Recommendations

Challenge:

Generate personalized suggestions for 2M+ DAU with <100ms latency

ML Serving Architecture:

```
┌──────────────┐      ┌─────────────┐
│ Feature Store│◄─────│ Flink Jobs  │
└──────┬───────┘      └─────────────┘
       │
┌──────▼───────┐      ┌─────────────┐
│ Model Cache  │─────►│    ONNX     │
└──────┬───────┘      │   Runtime   │
       │              └─────────────┘
┌──────▼───────┐
│     User     │
│ Interactions │
└──────────────┘
```
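The model cache does much of the work for the latency target: loading a session is slow, so recently used models stay warm. A minimal sketch of one way to build it, as an LRU of loaded sessions with a pluggable loader (in production the loader might wrap `onnxruntime.InferenceSession`; everything here is an assumption, not the site's code):

```python
from collections import OrderedDict
from typing import Any, Callable

class ModelCache:
    """Keep the N most recently used model sessions warm in memory."""

    def __init__(self, loader: Callable[[str], Any], capacity: int = 8):
        self._loader = loader
        self._capacity = capacity
        self._sessions: OrderedDict[str, Any] = OrderedDict()

    def get(self, model_id: str) -> Any:
        if model_id in self._sessions:
            self._sessions.move_to_end(model_id)  # mark as recently used
            return self._sessions[model_id]
        session = self._loader(model_id)          # cold path: load from disk
        self._sessions[model_id] = session
        if len(self._sessions) > self._capacity:
            self._sessions.popitem(last=False)    # evict least recently used
        return session
```

On the hot path a request then costs one dict lookup plus inference, which is what makes a sub-100ms P99 plausible.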

Results:

  • P99 latency reduced from 2200ms → 89ms
  • Recommendation CTR increased by 37%
  • Monthly infrastructure cost saved: $28,500

Key Takeaways

  1. Data Tiering is crucial for cost-performance balance
  2. Asynchronous Processing prevents pipeline backpressure
  3. Hybrid Indexing enables optimal query performance
  4. Model Optimization (ONNX conversion) dramatically improves ML serving
