Winola Joe

Handling Big Data Challenges: A Case Study of AllFreeNovel.cc

## Technical Challenges & Solutions

1. Data Ingestion Bottlenecks

Problem:

Daily ingestion of 50,000+ new chapters from multiple sources (CN/JP/KR) with varying formats:

  • XML feeds from Korean publishers
  • JSON APIs from Chinese platforms
  • Raw text dumps from Japanese partners

Solution:

```python
# Distributed ETL pipeline: every incoming chapter is normalized into the shared
# Avro schema and published to Kafka for downstream consumers to pick up asynchronously.
class ChapterIngestor:
    def __init__(self):
        self.kafka_topic = "raw-chapters"
        self.schema_registry = AvroSchemaRegistry()  # wraps the Avro schema registry client

    async def process(self, source):
        # `source` is one upstream feed (XML, JSON, or raw text) exposing an async stream.
        async for chunk in source.stream():
            normalized = await self._normalize(chunk)
            # `kafka` is the shared async producer; the schema is resolved per source format.
            await kafka.produce(
                self.kafka_topic,
                value=normalized,
                schema=self.schema_registry.get(source.format)
            )
```
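
The interesting work hides in `_normalize`, which maps each source format onto a common chapter record before it ever reaches Kafka. Below is a minimal sketch of that dispatch, assuming a flat chapter shape; the field names and parser helpers are illustrative assumptions, not the production schema:

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict

# Illustrative target shape; the real contract lives in the Avro schema registry.
@dataclass
class NormalizedChapter:
    novel_id: str
    chapter_no: int
    title: str
    body: str
    source_format: str

def normalize_chunk(raw: bytes, source_format: str) -> dict:
    """Convert one raw payload (XML / JSON / plain text) into a common dict."""
    if source_format == "xml":          # Korean publisher feeds
        root = ET.fromstring(raw)
        chapter = NormalizedChapter(
            novel_id=root.findtext("novelId", ""),
            chapter_no=int(root.findtext("chapterNo", "0")),
            title=root.findtext("title", ""),
            body=root.findtext("content", ""),
            source_format="xml",
        )
    elif source_format == "json":       # Chinese platform APIs
        doc = json.loads(raw)
        chapter = NormalizedChapter(
            novel_id=str(doc["novel_id"]),
            chapter_no=int(doc["chapter_no"]),
            title=doc.get("title", ""),
            body=doc["content"],
            source_format="json",
        )
    else:                               # Japanese raw text dumps: first line is the title
        title, _, body = raw.decode("utf-8").partition("\n")
        chapter = NormalizedChapter(
            novel_id="unknown",
            chapter_no=0,
            title=title.strip(),
            body=body,
            source_format="text",
        )
    return asdict(chapter)
```

Keeping the normalizer synchronous and side-effect free makes it easy to unit-test against sample payloads from each publisher before they hit the pipeline.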

2. Search Performance Optimization

Metrics Before Optimization:

  • 1200ms average query latency
  • 78% cache miss rate
  • 12-node Elasticsearch cluster at 85% load

Implemented Solutions:

  1. Hybrid Index Strategy

    • Hot data (latest chapters): In-memory RedisSearch
    • Warm data: Elasticsearch with custom tokenizer
    • Cold data: ClickHouse columnar storage
  2. Query Pipeline:

```mermaid
graph TD
    A[User Query] --> B{Query Type?}
    B -->|Simple| C[RedisSearch]
    B -->|Complex| D[Elasticsearch]
    B -->|Analytics| E[ClickHouse]
    C --> F[Result Blender]
    D --> F
    E --> F
    F --> G[Response]
```
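
Behind that diagram sits a thin routing layer. The sketch below shows one way to wire it up, assuming each backend client exposes an async `search()`; the classification heuristic and the blending rule are placeholders rather than the actual implementation:

```python
import asyncio
from enum import Enum

class QueryType(Enum):
    SIMPLE = "simple"        # title/author lookups   -> RedisSearch (hot tier)
    COMPLEX = "complex"      # full-text with filters -> Elasticsearch (warm tier)
    ANALYTICS = "analytics"  # aggregations           -> ClickHouse (cold tier)

def classify(query: dict) -> QueryType:
    """Toy heuristic standing in for the real query classifier."""
    if query.get("aggregate"):
        return QueryType.ANALYTICS
    if query.get("filters") or len(query.get("text", "").split()) > 3:
        return QueryType.COMPLEX
    return QueryType.SIMPLE

class QueryRouter:
    def __init__(self, redis_search, elasticsearch, clickhouse):
        # Backend clients are injected; their interfaces are assumed here.
        self._backends = {
            QueryType.SIMPLE: redis_search,
            QueryType.COMPLEX: elasticsearch,
            QueryType.ANALYTICS: clickhouse,
        }

    async def search(self, query: dict) -> list[dict]:
        backend = self._backends[classify(query)]
        hits = await backend.search(query)
        return self._blend(hits)

    @staticmethod
    def _blend(hits: list[dict]) -> list[dict]:
        # "Result Blender": dedupe by document id while preserving backend ordering.
        seen, blended = set(), []
        for hit in hits:
            if hit["id"] not in seen:
                seen.add(hit["id"])
                blended.append(hit)
        return blended
```

Routing by query type keeps the hot RedisSearch tier small and lets the Elasticsearch and ClickHouse tiers scale independently.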

3. Real-time Recommendations

Challenge:

Generate personalized suggestions for 2M+ DAU with <100ms latency

ML Serving Architecture:

```
┌──────────────┐      ┌─────────────┐
│ Feature Store│◄─────│ Flink Jobs  │
└──────┬───────┘      └─────────────┘
       │
┌──────▼───────┐      ┌─────────────┐
│ Model Cache  │─────►│    ONNX     │
└──────┬───────┘      │   Runtime   │
       │              └─────────────┘
┌──────▼───────┐
│     User     │
│ Interactions │
└──────────────┘
```
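
A minimal sketch of the serving hot path implied by that layout, assuming the Flink jobs have already materialised per-user feature vectors and the ranking model is exported to ONNX; the feature-store client and tensor names below are assumptions, not the production code:

```python
import numpy as np
import onnxruntime as ort

class Recommender:
    def __init__(self, model_path: str, feature_store):
        # One in-process ONNX Runtime session, loaded at startup (the "Model Cache").
        self._session = ort.InferenceSession(model_path)
        self._input_name = self._session.get_inputs()[0].name
        self._features = feature_store  # client for the Flink-materialised features

    async def recommend(self, user_id: str, candidate_ids: list[str], k: int = 10) -> list[str]:
        # 1. Pull precomputed feature vectors, one per candidate (assumed interface).
        matrix = await self._features.get_vectors(user_id, candidate_ids)
        batch = np.asarray(matrix, dtype=np.float32)

        # 2. Score every candidate in a single batched ONNX Runtime call.
        scores = self._session.run(None, {self._input_name: batch})[0].ravel()

        # 3. Return the top-k candidate ids by score.
        top = np.argsort(scores)[::-1][:k]
        return [candidate_ids[i] for i in top]
```

Loading the session once per process and scoring all candidates in a single batched `run()` call is the main lever for keeping tail latency inside a sub-100ms budget.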

Results:

  • P99 latency reduced from 2200ms → 89ms
  • Recommendation CTR increased by 37%
  • Monthly infrastructure cost saved: $28,500

Key Takeaways

  1. Data Tiering is crucial for cost-performance balance
  2. Asynchronous Processing prevents pipeline backpressure
  3. Hybrid Indexing enables optimal query performance
  4. Model Optimization (ONNX conversion) dramatically improves ML serving (see the export sketch below)
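
On that last point, the export step itself is small. Here is a hedged example of converting a PyTorch ranking model to ONNX with a dynamic batch dimension; the model class and tensor names are placeholders, since the post does not show the actual model:

```python
import torch
import torch.nn as nn

# Placeholder ranking model standing in for the real recommender network.
class RankingModel(nn.Module):
    def __init__(self, n_features: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = RankingModel().eval()
example = torch.randn(1, 64)  # one example feature vector

# Export with a dynamic batch axis so ONNX Runtime can score any candidate set size.
torch.onnx.export(
    model,
    example,
    "ranker.onnx",
    input_names=["features"],
    output_names=["score"],
    dynamic_axes={"features": {0: "batch"}, "score": {0: "batch"}},
)
```

The dynamic batch axis lets the same exported graph score a single user or a full candidate batch without re-exporting the model.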
