c0d3l0v3r

Posted on Jun 2

From a Simple Auth Service to a Distributed Authentication Platform with Kafka, Debezium, and Observability

#devchallenge #githubchallenge #distributedsystems #backend

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

I built a Distributed Authentication System as a long-term learning project to explore real-world backend and distributed systems concepts through a familiar use case: user authentication.

The project started as a simple authentication service with signup and login functionality. Over time, it evolved into a distributed architecture that incorporates event-driven communication, change data capture (CDC), observability, horizontal scaling, and performance testing.

The current system consists of:

PostgreSQL as the primary source of truth for user credentials
Redis for refresh token storage and session management
Kafka as the event streaming platform
Debezium for Change Data Capture (CDC) from PostgreSQL
MongoDB for materialized user profile documents
Nginx for load balancing across multiple service instances
Prometheus and Grafana for monitoring and observability
k6 for load testing and performance analysis

When a user signs up, the authentication service writes user data to PostgreSQL. Debezium captures the database change and publishes it to Kafka. A separate profile service consumes the event and creates a corresponding profile document in MongoDB. This allows services to communicate asynchronously while keeping the architecture loosely coupled.
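As a rough illustration of the consuming end of this flow, here is a minimal Python sketch of how a profile service might handle one change event. It assumes Debezium's default JSON envelope (`op` and `after`, optionally nested under `payload`). The column names (`id`, `email`), the `build_profile` and `handle_change_event` helpers, and the in-memory `profiles` dict standing in for MongoDB are illustrative assumptions, not the repository's actual code.

```python
import json
from typing import Optional

# In-memory stand-in for the MongoDB profiles collection (assumption for this sketch).
profiles: dict = {}


def build_profile(row: dict) -> dict:
    """Map a users-table row to a profile document (field names are hypothetical)."""
    return {
        "_id": row["id"],
        "email": row["email"],
        "display_name": row["email"].split("@")[0],
    }


def handle_change_event(raw: Optional[bytes]) -> Optional[dict]:
    """Process one Debezium change event; return the profile created, if any."""
    if raw is None:
        return None  # Kafka tombstone record: nothing to materialize
    event = json.loads(raw)
    # With schemas enabled the envelope sits under "payload"; otherwise it is top level.
    payload = event.get("payload", event)
    # "c" = insert, "r" = snapshot read; updates and deletes are ignored in this sketch.
    if payload.get("op") not in ("c", "r"):
        return None
    row = payload.get("after")
    if not row:
        return None
    profile = build_profile(row)
    profiles[profile["_id"]] = profile
    return profile


if __name__ == "__main__":
    sample = json.dumps({
        "payload": {
            "op": "c",
            "before": None,
            "after": {"id": 1, "email": "alice@example.com"},
        }
    }).encode()
    print(handle_change_event(sample))
```

Because the handler only reads the event and writes to its own store, the authentication service never needs to know the profile service exists, which is the loose coupling described above.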

Beyond implementing features, the main goal of this project was to understand the trade-offs involved in building distributed systems. Throughout development I conducted load-testing experiments, measured replication lag, analyzed bottlenecks, and documented the architectural decisions that shaped the system.
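One way to measure replication lag in a pipeline like this is to compare the commit timestamp Debezium records for each change (`source.ts_ms`, in milliseconds) with the moment the change is applied downstream. The sketch below assumes that approach; the repository may measure lag differently, and the helper names and sample numbers are purely illustrative.

```python
import math
import time
from typing import Optional


def replication_lag_ms(event: dict, applied_at_ms: Optional[int] = None) -> int:
    """Milliseconds between the source commit and the moment the change was applied."""
    payload = event.get("payload", event)
    committed_at_ms = payload["source"]["ts_ms"]
    if applied_at_ms is None:
        applied_at_ms = int(time.time() * 1000)
    # Clamp at zero: clock skew between hosts can produce small negative values.
    return max(0, applied_at_ms - committed_at_ms)


def percentile(samples: list, pct: float) -> int:
    """Nearest-rank percentile of a non-empty list of lag samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up lag samples in milliseconds, only to show the calculation.
    lags = [12, 15, 11, 240, 18, 14, 13, 16, 900, 17]
    print("p50:", percentile(lags, 50), "ms")
    print("p95:", percentile(lags, 95), "ms")
```

Looking at tail percentiles rather than the average matters here, since a few slow events under load can hide behind a healthy-looking mean.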

This project has become my personal distributed systems playground where I can experiment with new ideas, evaluate design decisions, and learn how real systems behave under load.

Demo

GitHub Repository

Repository: https://github.com/c0d3l0v3r-HeHe/distributed-auth-system

[Diagram: Signup Flow]

[Diagram: Login Flow]

The repository contains additional architecture diagrams, performance experiments, load-testing results, observability metrics, and detailed documentation covering the design decisions and trade-offs explored throughout the project.

Rather than reproducing all of those results here, I've kept this submission focused on the project's journey and evolution. If you're interested in the deeper technical details, bottleneck analysis, replication lag measurements, scaling experiments, or observability setup, you'll find them documented in the repository.

The Comeback Story

This project was not originally intended to be a standalone distributed authentication platform.

In May, I started building it as the authentication backend for a larger job portal project. The initial goal was fairly straightforward: implement signup, login, token management, and the supporting infrastructure needed for user authentication.

After building the core authentication service and setting up the initial infrastructure, I shifted my focus to other work and the project was left unfinished. While the foundation existed, many of the ideas I wanted to explore—distributed systems patterns, observability, scalability, and performance analysis—were still missing.

A few weeks later, I came across the GitHub Finish-Up-A-Thon Challenge and decided to revisit the project instead of letting it remain another abandoned repository.

Rather than simply cleaning up old code, I used the opportunity to significantly expand the project and turn it into a distributed systems playground.

During the revival, I:

Added Redis-based refresh token management
Implemented a CDC pipeline using PostgreSQL, Debezium, and Kafka
Added a dedicated profile service backed by MongoDB
Introduced Prometheus metrics and Grafana dashboards
Added k6 load-testing infrastructure
Scaled the authentication service horizontally behind Nginx
Conducted multiple performance experiments and documented the results
Created architecture diagrams and expanded the project documentation
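To make the refresh token item from the list above more concrete, here is a minimal sketch of token storage with expiry and single-use rotation. It is an assumption about how such a store could work, not the repository's implementation: the `RefreshTokenStore` class, its method names, and the 7-day TTL are hypothetical, and a plain dict stands in for Redis (where `SET key value EX ttl` and `GETDEL key` would play the same roles).

```python
import secrets
import time
from typing import Optional


class RefreshTokenStore:
    """In-memory stand-in for Redis: maps refresh token -> (user_id, expiry time)."""

    def __init__(self, ttl_seconds: int = 7 * 24 * 3600, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._tokens: dict = {}

    def issue(self, user_id: str) -> str:
        """Create and store a new opaque refresh token for the user."""
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (user_id, self._clock() + self._ttl)
        return token

    def rotate(self, token: str) -> Optional[str]:
        """Consume a token and return a fresh one, or None if unknown or expired."""
        entry = self._tokens.pop(token, None)  # single use: the old token is always removed
        if entry is None:
            return None
        user_id, expires_at = entry
        if self._clock() >= expires_at:
            return None
        return self.issue(user_id)

    def revoke(self, token: str) -> None:
        """Invalidate a token, for example on logout."""
        self._tokens.pop(token, None)


if __name__ == "__main__":
    store = RefreshTokenStore()
    first = store.issue("user-42")
    second = store.rotate(first)
    print(second is not None)           # the first rotation succeeds
    print(store.rotate(first) is None)  # replaying the old token fails
```

Keeping this state in a shared store rather than in process memory is what lets any instance behind the load balancer validate or revoke a token.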

One of the most valuable outcomes of revisiting the project was discovering bottlenecks that only became visible under load. While scaling the authentication service improved CPU utilization, load testing revealed that MongoDB and the CDC pipeline became the primary bottlenecks. Investigating these trade-offs taught me far more than simply building the original authentication service.

By the end of the challenge, the project had evolved from an unfinished backend component into a fully documented distributed authentication system that I can continue using to explore distributed systems concepts, scalability patterns, and performance engineering.

My Experience with GitHub Copilot

GitHub Copilot acted as an implementation and exploration partner throughout the revival of this project.

For many components, I first designed the architecture and wrote the interfaces, function signatures, and high-level implementation plan. I then used Copilot to generate boilerplate code, suggest implementations, and accelerate repetitive development tasks.

Copilot was particularly useful for:

Generating service scaffolding and repetitive CRUD logic
Assisting with Docker Compose configuration
Helping configure Prometheus metrics collection
Suggesting Kafka consumer and producer implementations
Writing integration and infrastructure tests
Explaining configuration options for Debezium and Kafka Connect
Speeding up refactoring and cleanup work

One workflow I found especially effective was defining the architecture and API contracts myself, leaving implementation placeholders, and then using Copilot to generate an initial implementation. This allowed me to focus more on system design decisions and less on repetitive coding.

I also used Copilot while setting up and validating the distributed architecture. It helped me troubleshoot configuration issues, understand service interactions, and quickly iterate on infrastructure changes during development.

The biggest benefit wasn't code generation itself—it was the ability to move from an idea to an experiment much faster. Since this project is intended as a distributed systems learning playground, that rapid feedback loop allowed me to spend more time investigating architectural trade-offs, performance bottlenecks, and scalability challenges.

Feedback Welcome

This project started as a learning exercise and has evolved into my personal distributed systems playground.

If you're an experienced backend or distributed systems engineer and happen to read this submission, I'd genuinely appreciate any feedback on the architecture, design decisions, bottlenecks, or trade-offs discussed throughout the project.

Some areas I'm currently thinking about include:

Improving the CDC pipeline under higher load
Reducing replication lag and read amplification
Better approaches to profile materialization
Scaling strategies beyond the current setup
Observability improvements and production-readiness considerations

Constructive criticism, alternative approaches, and architecture suggestions are all welcome. One of the main goals of this project is to learn from engineers who have solved these problems in real systems.

Top comments (4)

Valentyn Kit • Jun 25

Using Debezium CDC here is the right instinct, because it sidesteps the dual-write problem: you commit to Postgres and let the log produce the Kafka event instead of writing to both and hoping they agree. The thing I'd watch is the consumer side. CDC gives you at-least-once with per-partition ordering, so a replayed login or token-revocation event has to be idempotent or you reintroduce the consistency bug you just removed. Are you keying by entity and deduping on the consumer, or relying on Kafka's exactly-once semantics end to end?

c0d3l0v3r • Jun 25

Hi Valentyn, thank you for taking the time to review the project and for the insightful question.

You are exactly right: I am deduping on the consumer rather than using Kafka's EOS end to end. I key by the PostgreSQL user.id and check whether the profile already exists. If it does, the consumer skips the event and returns cleanly. This also avoids "poison pill" scenarios, so the consumer offset keeps advancing instead of getting stuck in a crash/retry loop.

(I am actually planning to replace this "read-then-write" check with a true atomic upsert soon, to rule out race conditions between concurrent consumers.)

Given your experience at Solana, do you generally prefer datastore-level idempotency like this, or do you find the operational overhead of Kafka transactions worth it for EOS in production?

Valentyn Kit • Jun 26

I usually prefer simpler solutions that prioritize maintainability over scalability, so I wouldn't choose Kafka transactions. But it also depends on other factors.

c0d3l0v3r • Jun 26

That makes sense. So your philosophy is to optimize for simplicity and maintainability first, and only introduce additional complexity like Kafka transactions when the guarantees are worth the operational cost. Thanks for the insight! I will remember it for sure 😃😁