The Repository Pattern in Python
We've all inherited it: a critical 500-line function with raw SQL strings wedged between error handling, business logic, and an API call. Your instinct is to refactor it and separate concerns; that's the right call! However, the pattern most tutorials teach for this job just creates a different kind of mess.
I'm talking about the repository pattern: an approach to separate your business layer (business logic) from your data layer (persistence and retrieval from a database).
Understanding what drives this pattern - and where the standard implementation goes wrong - will permanently change how you structure database logic. The end result is maintainable, testable, and readable code.
Martin Fowler describes this as mediating interaction, "… between the domain and data mapping layers using a collection-like interface." Let's break down what that actually means, and how it's been so misinterpreted.
The (short) theory
The fundamentals are exactly the same between good and bad approaches.
1) The "collection-like interface" is this: a class that defines database method signatures. In Python, we use Protocol from the typing module - this gives us structural subtyping (essentially implicit interfaces, like Go's):
from typing import Protocol
from datetime import datetime

# User is the application's user model, defined elsewhere.

class UserRepository(Protocol):
    async def get_user_birthday_by_id(self, user_id: int) -> datetime: ...
    async def create_user(self, user: User) -> None: ...
    async def delete_user(self, user: User) -> None: ...
2) The implementation is a class with a single Unit of Work (database session) attribute.
from datetime import datetime

from sqlalchemy import select

from unit_of_work import UnitOfWork

class UserStore:  # implements UserRepository
    def __init__(self, uow: UnitOfWork) -> None:
        self._uow = uow

    async def get_user_birthday_by_id(self, user_id: int) -> datetime:
        result = await self._uow.session.execute(
            select(User.birth_date).where(User.id == user_id)
        )
        return result.scalar_one()  # raises NoResultFound if missing

    async def create_user(self, user: User) -> None:
        self._uow.session.add(user)

    async def delete_user(self, user: User) -> None:
        await self._uow.session.delete(user)
Notice: UserStore never inherits from UserRepository. Python's Protocol uses structural subtyping - if the class has the right methods with the right signatures, it satisfies the protocol automatically. No inheritance required. This is the same behavior as Go's implicit interface satisfaction.
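One thing the snippet above imports but never shows is the UnitOfWork itself. Here is a minimal sketch of what such a class might look like - an async context manager that owns one session and decides whether to commit or roll back. The session_factory parameter and the commit-on-success behavior are my assumptions, not the article's actual implementation:

```python
class UnitOfWork:
    """Hypothetical Unit of Work sketch - the text only shows `uow.session`
    being used, so the shape here is an assumption."""

    def __init__(self, session_factory) -> None:
        self._session_factory = session_factory
        self.session = None

    async def __aenter__(self) -> "UnitOfWork":
        self.session = self._session_factory()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        if exc_type is None:
            await self.session.commit()  # all repository writes land together
        else:
            await self.session.rollback()
        await self.session.close()
```

The payoff of this shape is that one UnitOfWork can be shared by several repositories within a single request, so all of their writes commit or roll back as one atomic unit.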
The Bad Way
To create a code version of the Pacific garbage patch, follow outdated tutorials and define a single, giant protocol of database methods per table.
# Lives in: src/everything/user_repository_and_motorcycle_parts_slash_comic_book_store.py

class UserRepository(Protocol):
    async def get_user_birthday_by_id(self, user_id: int) -> datetime: ...
    async def create_user(self, user: User) -> None: ...
    async def delete_user(self, user: User) -> None: ...
    # ... 50 more methods for every niche edge case
This becomes a dumping ground for every method any service in your application needs.
This is a producer-defined interface - the repository declares everything it can do, and every consumer must accept the whole thing. Any test that touches this needs to mock all 50+ methods.
Let's see how this approach scales, starting fresh.
Two engineers - Sarah and Mike - start developing separate features that work with user data. Sarah needs an add_or_upgrade_user_subscription_tier database method to upgrade paying users (or add them if they're on a Free account).
class UserRepository(Protocol):
    async def add_or_upgrade_user_subscription_tier(
        self, user_id: int, tier: Tier
    ) -> None: ...
Mike now needs a method that adds time, in days, to a user's subscription. It's somewhat related, but he doesn't have a choice where to put it - it goes in the hole.
He can either:
1) Widen the existing method: Expand Sarah's query to accommodate his work, forcing all of its callers to follow the unrelated contract he shoved into it
class UserRepository(Protocol):
    async def add_or_upgrade_user_subscription_tier_or_time(
        self, user_id: int, tier: Tier | None = None, extend_days: int | None = None
    ) -> None: ...
2) Add a near-duplicate method: break the interface segregation principle, giving every service that works with users one more single-use, near-identical method it will never call
class UserRepository(Protocol):
    async def add_or_upgrade_user_subscription_tier(
        self, user_id: int, tier: Tier
    ) -> None: ...

    async def add_or_upgrade_user_subscription_time(
        self, user_id: int, extend_days: int
    ) -> None: ...
Neither is good. The interface grows either way, becoming a Pandora's box of hundreds of tangentially related queries. The implementation of this interface is even worse: thousands of lines of unoptimized ORM code in a single file, with the unavoidable blast radius of each subsequent change climbing.
"The bigger the interface, the weaker the abstraction." - Rob Pike
Eventually, we have a god object that is injected into all parts of your code that have to touch user data. Testing your single method becomes a game of ensuring the other 50+ are mocked.
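You can watch this cost appear mechanically. With @runtime_checkable, Python will tell you that a fake implementing only the one method your test cares about does not satisfy the bloated protocol - you're on the hook for every method. A toy demonstration, with a three-method protocol standing in for the fifty-method one:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class UserRepository(Protocol):
    # Trimmed to 3 methods; imagine 50 in the real producer-defined interface.
    async def get_user_birthday_by_id(self, user_id): ...
    async def create_user(self, user): ...
    async def delete_user(self, user): ...

class MinimalFake:
    """Only stubs the one method our test actually exercises."""
    async def get_user_birthday_by_id(self, user_id):
        return None

class FullFake(MinimalFake):
    """To satisfy the protocol, we must stub everything else too."""
    async def create_user(self, user): ...
    async def delete_user(self, user): ...

print(isinstance(MinimalFake(), UserRepository))  # False - methods missing
print(isinstance(FullFake(), UserRepository))     # True
```

(Note: isinstance on a runtime_checkable protocol only checks that the methods exist, not their signatures - a static checker like mypy reports the same gap at type-check time.)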
So, how can solely changing the location of these methods transform this approach into the gold standard for maintainable database logic? By spreading it back out.
The good way
Martin Fowler didn't say the "collection-like interfaces" need to be Python files that are thousands of lines long. So let's make them smaller and more focused. Instead of defining a 1000-point Swiss army knife for your application, we let each service define exactly what it needs: just a screwdriver; a hammer and 3 nails; a butter knife and some tweezers.
We switch from producer-defined interfaces to consumer-defined interfaces, making specialized, mini-repositories per feature.
In Python, the Service defines the Protocol, and the Repository satisfies it structurally - without inheritance.
# src/features/notifications/service.py

from typing import Protocol
from datetime import datetime, timedelta

class NotificationStore(Protocol):
    """Service-owned protocol: only the db methods this feature needs."""

    async def get_last_notified(self, user_id: int) -> datetime: ...
    async def mark_as_notified(self, user_id: int) -> None: ...

class NotificationService(Protocol):
    async def notify_user_by_id(self, user_id: int) -> None: ...

class Service:
    def __init__(
        self, store: NotificationStore, notifier: NotificationService
    ) -> None:
        self._store = store
        self._notifier = notifier

    async def notify_user(self, user_id: int) -> None:
        last = await self._store.get_last_notified(user_id)
        if (datetime.utcnow() - last) < timedelta(hours=24):
            return
        await self._notifier.notify_user_by_id(user_id)
        await self._store.mark_as_notified(user_id)
The implementation lives in the same feature folder, satisfying the protocol of the service.
# src/features/notifications/notification_database.py

from datetime import datetime

from sqlalchemy import select, update

from unit_of_work import UnitOfWork

# UserModel is the SQLAlchemy user table model, defined elsewhere.

class PostgresStore:
    """Satisfies NotificationStore without inheriting from it."""

    def __init__(self, uow: UnitOfWork) -> None:
        self._uow = uow

    async def get_last_notified(self, user_id: int) -> datetime:
        result = await self._uow.session.execute(
            select(UserModel.last_notified_at).where(UserModel.id == user_id)
        )
        return result.scalar_one()  # raises NoResultFound if missing

    async def mark_as_notified(self, user_id: int) -> None:
        await self._uow.session.execute(
            update(UserModel)
            .where(UserModel.id == user_id)
            .values(last_notified_at=datetime.utcnow())
        )
This gives us clean, modular interfaces; readable and maintainable; testability; and a great separation of concerns between features - at the cost of a Protocol.
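That testability claim is easy to demonstrate. Because the protocol has only two methods, a hand-rolled in-memory fake is a few lines - no mocking library, no 50 stubs. The sketch below inlines a copy of the Service logic from above so it runs standalone:

```python
import asyncio
from datetime import datetime, timedelta

class Service:
    # Same logic as the notifications Service above, inlined for the demo.
    def __init__(self, store, notifier) -> None:
        self._store = store
        self._notifier = notifier

    async def notify_user(self, user_id: int) -> None:
        last = await self._store.get_last_notified(user_id)
        if (datetime.utcnow() - last) < timedelta(hours=24):
            return
        await self._notifier.notify_user_by_id(user_id)
        await self._store.mark_as_notified(user_id)

class FakeStore:
    """Satisfies NotificationStore structurally - two tiny methods."""
    def __init__(self, last: datetime) -> None:
        self.last = last
        self.marked: list[int] = []

    async def get_last_notified(self, user_id: int) -> datetime:
        return self.last

    async def mark_as_notified(self, user_id: int) -> None:
        self.marked.append(user_id)

class FakeNotifier:
    def __init__(self) -> None:
        self.sent: list[int] = []

    async def notify_user_by_id(self, user_id: int) -> None:
        self.sent.append(user_id)

# Last notified 2 days ago -> the user should be notified again.
store = FakeStore(datetime.utcnow() - timedelta(days=2))
notifier = FakeNotifier()
asyncio.run(Service(store, notifier).notify_user(42))
print(notifier.sent)  # [42]
print(store.marked)   # [42]
```

Compare that with mocking a 50-method god repository: here the fakes are shorter than the test setup would otherwise be.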
The drawback of this approach is repetition. If 5 services need a get_user() method, are we going to implement it 5 times?
Let's look at the nuances of this approach: how those 5 services probably aren't using the same get_user() method you think they are, and why "a little copying is better than a little dependency" - Go Proverbs.
Nuance (with pushback)
A different get_user() method per feature seems crazy - I'm with you; what happened to Don't Repeat Yourself (DRY)?
Upfront: application-wide repositories are okay and can absolutely be the right move. But they are often not needed.
Let's see who's calling this get_user() method:
- Authentication service
- Notification service
- Payment service
Each of these callers needs different parts of a user's data for radically different purposes; a user in the context of billing is fundamentally a different entity than a user in the context of authentication.
A) The authentication service wants the user's username & password.
B) The payment service just needs the user's payment info.
C) The notification service only needs an email and the last time they were notified.
Making a global get_user() method that returns a god User object that has these + 50 more attributes - just to satisfy all callers - sounds eerily similar to the interface explosion problem we just solved.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class User:
    id: int
    username: str
    email: str
    password_hash: str
    plan_id: str
    subscription_status: str
    stripe_customer_id: str
    trial_ends_at: datetime | None
    last_login_at: datetime | None
    # ... 50 more
Now every test that touches user data has to construct this entire object, even if the feature only cares about two fields.
I challenge you to always start local:
# src/features/notification_service/repository.py

from typing import Protocol
from dataclasses import dataclass
from datetime import datetime

# A user specialized for the service (defined before the protocol
# so the return annotation below can resolve it)
@dataclass(frozen=True, slots=True)
class NotificationUser:
    id: int
    contact_method: str
    contact_address: str
    last_notified_at: datetime
    has_pending_notif: bool

class MikesNotificationStore(Protocol):
    async def get_user_last_utc_time_notified(
        self, user_id: int
    ) -> datetime: ...

    async def get_all_premium_users_with_pending_notifications(
        self,
    ) -> list[NotificationUser]: ...
Allow queries to be "repeated". Forcing generic CRUD methods to work with 5 disparate features increases bad coupling, maintenance cost, and lines of code - as the User object blows up to accommodate working with everything. Specific, business-case queries are the way to go.
Lean into high-affinity coupling: components that change together belong together in the same package, not sparsely connected by a generic method. Individual implementations should be able to naturally evolve with their service - at little risk to the rest of the application. Requirements will change; the question is whether your architecture will fight or accommodate that.
A note on Go
Python's Protocol and Go's implicit interfaces work the same way conceptually: the consumer defines what it needs, and any struct/class that has the right methods satisfies it - no inheritance or explicit declaration required.
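A quick self-contained demonstration of that equivalence on the Python side. The function below is annotated against a consumer-defined protocol, and an unrelated class satisfies it purely by shape (the Greeter protocol and both classes are hypothetical toys, not from the article):

```python
import asyncio
from typing import Protocol

class Greeter(Protocol):
    # Consumer-defined: the caller declares the one method it needs.
    async def greet(self, name: str) -> str: ...

class EnglishGreeter:
    """Never mentions Greeter, yet satisfies it structurally."""
    async def greet(self, name: str) -> str:
        return f"Hello, {name}!"

async def welcome(greeter: Greeter, name: str) -> str:
    # mypy accepts EnglishGreeter here with zero inheritance - the same
    # way Go accepts any struct whose method set matches an interface.
    return await greeter.greet(name)

print(asyncio.run(welcome(EnglishGreeter(), "Ada")))  # Hello, Ada!
```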
Conclusion
Bad code is rarely a skill issue; it's a pattern issue. The wrong abstraction taught confidently in a tutorial does more damage than no abstraction at all. Knowing the benefits of the right solution matters just as much as knowing the pitfalls of the wrong one; you now have a solid grasp on both.
It wasn't Mike who caused the pileup with his added method; it was starting with the wrong pattern. Bad decisions compound. So set the standard: make a consumer-defined interface today so you aren't fighting someone else's 3000-line producer-defined interface six months from now.
Next time you're adding a database method to a shared repository, setting up a new feature, or want to refactor that 500-line Jenga block, pause for a second. Ask yourself: "does every function in my codebase need launch_user_into_space_query(), or just the one I'm working with?"
I'm Ivan, follow me for more content like this! :)