Ricky512227
6 Common Redis and Kafka Challenges I Faced (And How I Solved Them)

Lessons learned building a real-time microservices monitoring platform

I built EventStreamMonitor as a personal project to apply my knowledge of microservices, Redis, and Kafka in a real-world monitoring system. I had worked with these technologies before, so I thought integrating them would be straightforward - just add the libraries and start using them. Boy, was I wrong!

This post documents the real challenges I faced and how I solved them. I'm sharing this because when I was stuck, I couldn't find simple explanations that covered these common pitfalls. Hopefully, this helps someone else avoid the same headaches.

What is EventStreamMonitor?

EventStreamMonitor is a real-time microservices monitoring platform I built. It:

  • Collects logs from multiple microservices
  • Streams events through Apache Kafka
  • Caches data using Redis for performance
  • Provides a live dashboard for error tracking

I built it to apply my microservices knowledge in a practical project and add to my portfolio. It's not production-grade, but it works and taught me a lot about the practical challenges of distributed systems - even when you think you know the basics.

Why Redis and Kafka?

I needed to monitor multiple microservices in real-time. The challenge was: how do you collect logs from 4+ services, process them quickly, and display them on a dashboard without everything slowing down?

I chose Redis because:

  • It's fast (in-memory)
  • Reduces load on my PostgreSQL databases
  • I could use it for session management and rate limiting too

I chose Kafka because:

  • It handles high-volume event streams
  • It decouples my services (they don't need to know about each other)
  • It's built for real-time processing
  • It's durable (messages are persisted and replicated, so they don't just disappear)

Sounds good in theory, right? Well, here's what actually happened...


Challenge 1: Redis Connection Management

I started simple - just create a Redis connection whenever I needed it:

redis_client = redis.Redis(host='redis', port=6379, db=0)

This worked fine during development. But when I started testing with multiple services and higher traffic, things broke. My services would crash with "too many connections" errors. Redis was rejecting new connections because I was creating hundreds of them.

The problem happened:

  • When multiple services started at once
  • During any traffic spike
  • After Redis restarted (all connections were lost)

My entire system was affected - services crashed, the dashboard stopped updating, everything broke.

How I Fixed It: Connection Pooling

I learned about connection pooling. Instead of creating a new connection every time, you create a pool and reuse connections:

# Before: Creating new connection each time (BAD!)
redis_client = redis.Redis(host='redis', port=6379, db=0)

# After: Using connection pool (GOOD!)
pool = redis.ConnectionPool(
    host='redis',
    port=6379,
    db=0,
    max_connections=50,  # Limit total connections
    socket_timeout=5,
    socket_connect_timeout=5
)
redis_client = redis.Redis(connection_pool=pool)

This works because:

  • Connections are reused instead of created/destroyed constantly
  • You control the maximum number of connections
  • Timeouts prevent hanging connections
  • No more "too many connections" errors

After this fix, my connection overhead dropped by about 80%, and I never saw connection errors again. Simple fix, huge impact!
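If you're curious what the pool is actually doing under the hood, here's a toy sketch of the idea - not redis-py's real implementation, just the check-out/check-in pattern it builds on:

```python
import queue

class ToyPool:
    """Toy connection pool: reuse idle connections, create up to a cap."""
    def __init__(self, factory, max_connections=50):
        self.factory = factory             # callable that makes one "connection"
        self.max_connections = max_connections
        self.created = 0
        self.idle = queue.Queue()

    def get_connection(self):
        try:
            return self.idle.get_nowait()  # reuse an idle connection if any
        except queue.Empty:
            if self.created >= self.max_connections:
                raise RuntimeError("too many connections")  # the cap in action
            self.created += 1
            return self.factory()

    def release(self, conn):
        self.idle.put(conn)                # hand it back for the next caller

# Two sequential requests share one underlying connection
pool = ToyPool(factory=lambda: object(), max_connections=2)
c1 = pool.get_connection()
pool.release(c1)
c2 = pool.get_connection()                 # same object, reused - no new connection
```

The real pool does the same thing with actual sockets, which is why the created/destroyed churn disappears.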


Challenge 2: Redis Error Handling

One day I restarted my Redis container to test something. All my services crashed immediately. Every single one. Why? Because when Redis wasn't available, my code threw exceptions and crashed the entire service.

This happened:

  • When I restarted Redis
  • When Redis ran out of memory
  • During network hiccups
  • If I misconfigured Redis

My entire system went down just because Redis was unavailable. That's terrible design - Redis should be a performance enhancement, not a critical dependency!

How I Fixed It: Graceful Degradation

I wrapped all Redis operations in try/except blocks. If Redis fails, the service falls back to the database:

def get(self, key: str) -> Optional[str]:
    """Get value from Redis - returns None if Redis fails.

    Assumes the client was created with decode_responses=True;
    otherwise this returns bytes, not str.
    """
    try:
        return self.client.get(key)
    except Exception:
        # Redis failed? No problem, just return None
        # The service will fetch from database instead
        return None

def set(self, key: str, value: str, ttl: Optional[int] = None) -> bool:
    """Set value in Redis - returns False if Redis fails"""
    try:
        if ttl:
            return self.client.setex(key, ttl, value)
        else:
            return self.client.set(key, value)
    except Exception:
        # Redis failed? Cache just won't work, but service continues
        return False

Now if Redis goes down:

  • Services keep working (just slower, fetching from database)
  • No crashes
  • Users don't notice (unless they're watching response times)
  • I can restart Redis without breaking everything

This was a game-changer. Redis is now a "nice to have" performance boost, not a critical dependency. Much better!
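The database fallback is just cache-aside on top of a wrapper like that. Here's a minimal sketch - `cache` is any object whose get/set swallow Redis errors, and `fetch_user_from_db` is a hypothetical database call, both injected so the fallback logic is visible:

```python
import json

def get_user(user_id, cache, fetch_user_from_db, ttl=300):
    """Cache-aside read: try the cache first, fall back to the database.

    cache.get/cache.set are assumed to return None/False on failure
    instead of raising (like the wrapper above), so a Redis outage
    just turns every read into a database read.
    """
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                  # cache hit - no database trip

    user = fetch_user_from_db(user_id)             # cache miss or Redis down
    cache.set(key, json.dumps(user), ttl=ttl)      # best effort; may silently fail
    return user
```

Note the service never checks *whether* Redis is up - it just pays one extra database query when it isn't.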


Challenge 3: Redis Database Selection

I had three services all using Redis database 0. They all used keys like user:123 or session:abc. Can you guess what happened? They started overwriting each other's data!

My User Management service would cache a user, then my Task Processing service would overwrite it with task data using the same key. Chaos.

This happened:

  • All the time during development
  • When I tested multiple services together
  • Whenever keys had the same names (which was often)

All three services were affected - data was getting mixed up and overwritten.

How I Fixed It: Separate Databases Per Service

Redis supports multiple logical databases (16 by default, numbered 0-15). I gave each service its own:

# User Management Service
redis_client = RedisClient(db=0)  # Database 0

# Task Processing Service
redis_client = RedisClient(db=1)  # Database 1

# Notification Service
redis_client = RedisClient(db=2)  # Database 2

Now each service has its own isolated space. No more conflicts! I can use the same key names in different services without worrying.

This also made debugging easier - I can check what's in each database separately. Simple solution, big problem solved.
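One caveat worth knowing: Redis Cluster only supports database 0, so the other common fix is to prefix every key with a service namespace instead of (or in addition to) using numbered databases. A sketch of that approach - `client` stands in for a `redis.Redis` instance, or any object with get/set:

```python
class NamespacedCache:
    """Prefix every key with a service namespace instead of using SELECT/db.

    More portable than numbered databases (Redis Cluster only has db 0),
    and the namespace shows up in the key itself when you're debugging.
    """
    def __init__(self, client, namespace):
        self.client = client
        self.namespace = namespace

    def _key(self, key):
        return f"{self.namespace}:{key}"   # e.g. "users:user:123"

    def get(self, key):
        return self.client.get(self._key(key))

    def set(self, key, value):
        return self.client.set(self._key(key), value)

# Each service wraps the same client with its own namespace:
# users_cache = NamespacedCache(redis_client, "users")
# tasks_cache = NamespacedCache(redis_client, "tasks")
```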


Challenge 4: Kafka Consumer Connection Issues

I wrote my Kafka consumer code on my local machine using localhost:9092. It worked perfectly. Then I tried running it in Docker and... nothing. The consumer couldn't connect at all.

The problem? In Docker, localhost doesn't mean what you think. Each container has its own localhost. I needed to use the Docker service name instead.

This happened:

  • Every time I tried to run in Docker
  • After Kafka container restarted
  • When I deployed to different environments
  • Basically whenever I wasn't running locally

My entire log monitoring system was broken because of this.

How I Fixed It: Proper Bootstrap Server Configuration

I learned about Docker networking and environment variables:

# Before: Hardcoded localhost (BROKEN in Docker!)
bootstrap_servers = 'localhost:9092'

# After: Environment-based (WORKS everywhere!)
kafka_bootstrap_servers = os.getenv(
    'KAFKA_BOOTSTRAP_SERVERS', 
    'kafka:29092'  # Docker service name, not localhost!
)

consumer = KafkaConsumer(
    *topics,
    bootstrap_servers=kafka_bootstrap_servers.split(','),
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='latest',
    enable_auto_commit=True,
    group_id='log-monitor-group'
)

Key lessons:

  • In Docker, use service names (like kafka:29092), not localhost
  • Use environment variables so it works locally AND in Docker
  • You can support multiple brokers by splitting the string

Now it works everywhere - local development, Docker, and any other environment. This was a frustrating bug that took me way too long to figure out!


Challenge 5: Kafka Message Processing Errors

I was processing messages from Kafka, and one message had invalid JSON. My consumer crashed. Then I fixed that, but another message was missing a required field. Consumer crashed again.

The problem? If ANY message failed to process, the entire consumer loop would stop. All subsequent messages (even valid ones) would be blocked.

This happened:

  • When messages had invalid JSON
  • When required fields were missing
  • When my processing code threw exceptions
  • Basically whenever anything went wrong with a single message

My entire log monitoring stopped because of one bad message. That's terrible!

How I Fixed It: Per-Message Error Handling

I wrapped each message processing in its own try/except:

for message in consumer:
    try:
        log_data = message.value

        # Process message
        level = log_data.get('level', '').upper()
        if level in ['ERROR', 'CRITICAL', 'EXCEPTION']:
            error_store.add_error(log_data)

    except Exception as e:
        # Log the error but DON'T STOP!
        logger.error(f"Error processing log message: {e}")
        # Continue to next message - don't let one bad message break everything

Now:

  • One bad message doesn't kill the consumer
  • I log errors so I can debug later
  • All other messages still get processed
  • Much more resilient system

This is a common pattern - always handle errors per message, not per consumer loop. Simple but critical!
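If you want to actually debug those bad messages later instead of just logging them, the usual next step is a dead-letter topic. Here's a minimal sketch of the routing logic - `dead_letters` is any sink with `append()` (a plain list here so the behavior is testable; in production it would be a `producer.send()` to a dead-letter topic):

```python
import json

def process_batch(raw_messages, handle, dead_letters):
    """Process each message independently; route failures to a dead-letter sink.

    raw_messages are raw bytes as they'd arrive from Kafka, handle() is
    your real per-message logic. One bad message lands in dead_letters
    with its error attached; everything else still gets processed.
    """
    processed = 0
    for raw in raw_messages:
        try:
            log_data = json.loads(raw.decode("utf-8"))
            handle(log_data)
            processed += 1
        except Exception as e:
            # Keep the original payload plus the error for later debugging
            dead_letters.append({"payload": raw, "error": str(e)})
    return processed
```

The nice part over log-and-skip: the broken payload is preserved verbatim, so you can replay it once the parsing bug is fixed.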


Challenge 6: Kafka Producer Initialization Failures

I had a service that needed to send logs to Kafka. If Kafka wasn't running when the service started, the producer initialization would fail and... the entire service wouldn't start. Or worse, it would start but crash when trying to log anything.

This happened:

  • When I started services before Kafka was ready
  • When Kafka was temporarily unavailable
  • During network issues
  • If I misconfigured Kafka

All my services that logged to Kafka were affected. They couldn't even start if Kafka was down!

How I Fixed It: Lazy Initialization with Error Handling

I made the producer initialization handle failures gracefully:

def _init_producer(self):
    """Initialize Kafka producer - but don't crash if it fails"""
    try:
        self.producer = KafkaProducer(
            bootstrap_servers=self.kafka_bootstrap_servers.split(','),
            value_serializer=lambda v: json.dumps(v).encode('utf-8'),
            acks='all',
            retries=3,
            max_in_flight_requests_per_connection=1
        )
    except Exception as e:
        # Kafka not available? That's okay, service can still run
        print(f"Failed to initialize Kafka producer: {e}")
        self.producer = None  # Just set to None, don't crash

def emit(self, record: logging.LogRecord):
    """Emit log - skip if Kafka not available"""
    if not self.producer:
        return  # No Kafka? Just skip, don't crash

    try:
        # Build a serializable payload from the log record
        log_data = {
            'level': record.levelname,
            'message': record.getMessage(),
            'logger': record.name,
        }
        future = self.producer.send(self.topic, log_data)
        future.add_errback(self._on_send_error)
    except Exception as e:
        # Error sending? Log it but don't crash
        print(f"Error sending log to Kafka: {e}")

Now:

  • Services start even if Kafka is down
  • Logging still works (just doesn't go to Kafka)
  • I can retry producer initialization later
  • Much more reliable startup

The key insight: Kafka logging should be optional, not required. Services should work without it.
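One easy extension of the pattern above is to retry initialization on the next emit instead of giving up forever, with a cooldown so you don't hammer a down broker. A sketch with the producer factory injected (so the retry logic is visible without a running Kafka - in the real project the factory would construct the `KafkaProducer`):

```python
import time

class LazyProducer:
    """Retry producer creation on use, with a cooldown between attempts."""
    def __init__(self, factory, retry_interval=30.0):
        self.factory = factory           # callable that builds the real producer
        self.retry_interval = retry_interval
        self.producer = None
        self._last_attempt = 0.0

    def _ensure_producer(self):
        if self.producer is not None:
            return True
        now = time.monotonic()
        if now - self._last_attempt < self.retry_interval:
            return False                 # cooldown: don't hammer a down broker
        self._last_attempt = now
        try:
            self.producer = self.factory()
            return True
        except Exception:
            return False                 # still down - try again later

    def send(self, topic, value):
        if not self._ensure_producer():
            return False                 # Kafka unavailable; caller keeps going
        self.producer.send(topic, value)
        return True
```

Now a service that started while Kafka was down quietly picks it back up once the broker returns - no restart needed.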


What I Learned: Best Practices

Redis Best Practices

  1. Always Use Connection Pools - Don't create connections on every request
  2. Handle Errors Gracefully - Redis should enhance performance, not be critical
  3. Use Separate Databases - One database per service prevents conflicts
  4. Set Appropriate TTLs - Don't let cache grow forever
  5. Monitor Memory Usage - Redis is in-memory, it can fill up
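On point 4: in Redis a TTL is one call (`SETEX key seconds value`, or `set(..., ex=seconds)` in redis-py). For intuition about what it buys you, here's the behavior sketched with a dict-backed cache and an injectable clock:

```python
import time

class TTLCache:
    """Dict-backed cache with per-key expiry - the behavior Redis SETEX gives you.

    clock is injectable for testing; defaults to time.monotonic.
    """
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.store = {}                  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        self.store[key] = (value, self.clock() + ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.store[key]          # expired: evict and report a miss
            return None
        return value
```

Without the expiry every cached key lives forever, which is exactly how an in-memory store fills up.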

Kafka Best Practices

  1. Use Environment Variables - Different configs for different environments
  2. Handle Message Errors Per-Message - One bad message shouldn't stop everything
  3. Use Consumer Groups - Enables parallel processing
  4. Configure Retries - Network issues happen
  5. Monitor Consumer Lag - Know if you're falling behind
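On point 5: consumer lag is just the gap, per partition, between the newest offset in the log and the offset your group has committed. The arithmetic is simple - in the real project you'd feed it numbers from `consumer.end_offsets()` and `consumer.committed()` in kafka-python:

```python
def consumer_lag(end_offsets, committed):
    """Per-partition lag: how far behind the consumer group is.

    end_offsets maps partition -> latest offset in the log;
    committed maps partition -> last committed offset (absent or None
    if the group has never committed for that partition).
    """
    lag = {}
    for partition, end in end_offsets.items():
        done = committed.get(partition) or 0   # no commit yet = everything pending
        lag[partition] = max(end - done, 0)
    return lag

# Sum across partitions to get one "are we falling behind?" number:
# total = sum(consumer_lag(ends, committed).values())
```

If that total keeps growing, your consumers aren't keeping up with producers - add consumers to the group (up to the partition count) or speed up processing.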

Results

After fixing all these issues, here's what improved:

  • Connection overhead: Dropped by about 80% with connection pooling
  • Service crashes: Zero crashes from Redis/Kafka failures (they degrade gracefully now)
  • Message processing: 99.9% success rate (bad messages don't stop everything)
  • System resilience: Services work even when Redis/Kafka are down

Key Lessons

  1. Plan for failures - Everything will break, so handle it gracefully
  2. Connection pooling is essential - Don't create connections per request
  3. Error handling is critical - One failure shouldn't kill everything
  4. Use environment variables - Makes deployment so much easier
  5. Monitor everything - You can't fix what you can't see

Conclusion

Integrating Redis and Kafka isn't as simple as the tutorials make it seem. You'll run into connection issues, error handling problems, configuration headaches, and more. But these are all solvable with the right patterns.

The main takeaways:

  1. Use connection pooling for Redis
  2. Handle errors gracefully - external services will fail
  3. Separate databases for different services
  4. Configure for your environment (Docker vs local)
  5. Monitor and log - you'll need it when debugging

This is a learning project, so there's always more to improve. But for now, it works and I learned a lot. That's what matters for a portfolio project!


Check Out My Project

EventStreamMonitor: https://github.com/Ricky512227/EventStreamMonitor

Feel free to check it out, star it, or contribute! I'm always learning and improving.


Tags: redis kafka microservices python docker distributed-systems software-engineering backend-development
