Mustafa ERBAY

Posted on May 30 • Originally published at mustafaerbay.com.tr

JWT Revocation: Stateless Promise Meets Real-World Challenge

#security #jwt #authentication #stateless

In many projects, especially in systems transitioning to microservice architectures, we use JWT (JSON Web Token) for authentication. JWT's core promise is to authenticate users in a "stateless" manner, without needing to maintain session state on the server side. This is very appealing, especially for scalability and distributed systems. However, this "stateless" structure comes with a cost: things get a bit complicated when the need arises to revoke a token.

Over the years, I've used JWT in many different scenarios, and each time I've faced the question, "what if I need to revoke this token immediately?" When a user's password changes, their permissions are updated, or worst of all, their token is stolen, that token must instantly lose its validity. This is where a tension arises between JWT's stateless nature and the urgent demands of the real world. In this post, I'll explain how I've managed this tension, what approaches I've tried, and which solutions have worked for me in practice.

The Stateless Promise and Realities of JWT

JWTs are essentially digitally signed data packets. They contain information such as the user's identity, permissions, when the token was issued (iat), and when it expires (exp). This information is signed with a key and sent to the client. The client sends this token back to the server with each request, and the server merely checks the validity of the signature and whether the token has expired. If the signature is correct and the token hasn't expired, the information within the token is trusted, and the operation proceeds.

The beauty of this model is that the server doesn't have to go to a database or cache for every request to ask, "is this user currently logged in?" This provides a significant performance and scalability advantage, especially for high-traffic applications or distributed architectures with multiple servers. For example, in an internal banking platform where tens of thousands of employees perform transactions simultaneously, using JWTs instead of checking every request against a database significantly reduced the system's overall response time and resource consumption.

ℹ️ JWT Structure

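A JWT fundamentally consists of three parts: Header, Payload, and Signature. Each part is base64url-encoded, and the parts are joined with dots (header.payload.signature). The signature is computed over the encoded header and payload, either as an HMAC with a shared secret (e.g., HS256) or with a private key (e.g., RS256), so any alteration of the token's content is detected. Note that the payload is only encoded, not encrypted, so anyone holding the token can read it.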
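To make the three-part structure concrete, here is a minimal sketch using only the Python standard library. It assumes the HS256 algorithm (HMAC with SHA-256); the helper names sign_hs256 and verify_hs256 are illustrative, and the sketch deliberately skips exp and header checks, so a maintained JWT library remains the right choice in real code.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(raw: bytes) -> str:
    """base64url without the trailing '=' padding, as JWT uses it."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Build header.payload.signature for the given claims."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{b64url_encode(signature)}"


def verify_hs256(token: str, secret: bytes) -> dict:
    """Return the claims if the signature matches, else raise ValueError.

    Illustration only: exp, nbf and the header's alg are NOT checked here.
    """
    header_b64, payload_b64, signature_b64 = token.split(".")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # compare_digest avoids leaking information through timing differences
    if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))
```

Changing a single character of the payload makes verify_hs256 fail, which is exactly the tamper-evidence the signature is meant to provide.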

However, this stateless structure also means that the token is not actively tracked by the server. Once a JWT is signed and issued to a client, it remains valid until its expiration time. This situation puts us in a difficult position when an urgent revocation is needed. For example, in an ERP system for a manufacturing company, if an operator's permissions needed to be revoked immediately at the end of their shift, an 8-hour token remaining valid until the end of that period could pose a serious security vulnerability.

Why Does the Need for Revocation Arise?

Token revocation is not just a theoretical concern; it's a practical requirement frequently encountered in daily operations. Here are some scenarios I've faced in the backend of my own side projects or client projects:

User Logout: When a user clicks the "logout" button, all tokens they currently hold are expected to become invalid. Otherwise, a token obtained from browser history or cache could still grant access without the user logging in again. In one of my mobile applications, not revoking the token on the server when a user logged out sometimes caused the user to return to their old session without logging in again, creating both a security and user experience issue.
Password Change: When a user changes their password, it's a critical security requirement that all tokens obtained with the old password are immediately invalidated. This prevents an attacker from accessing the system with old tokens in case of a potential password theft.
Permission Changes: When a user's roles or permissions are updated, it's undesirable for their current token to carry old permissions. The token needs to be refreshed or revoked for the new permissions to take effect immediately. In a manufacturing ERP, when a manager's authority was removed, it was vital for this change to be effective instantly for the integrity of the system.
Security Breach (Token Theft): This is perhaps the most critical scenario. If a user's JWT is somehow stolen, an attacker can use this token to access the system with the user's permissions. In such a case, the stolen token must be revoked immediately. In my experience, if such an event occurred, the lack of a quick revocation mechanism could lead to potential data leaks or unauthorized operations.
Account Closure or Suspension: When a user's account is closed or temporarily suspended, all active sessions and tokens associated with that account must be terminated.

These scenarios clearly demonstrate that JWT's "stateless" promise is not always sufficient. The real world often brings with it stateful requirements.

Token Revocation Methods and Trade-offs

Although JWTs are inherently stateless, various methods have been developed to meet these revocation needs. Each method has its own advantages and disadvantages, and the choice depends on the project's security requirements, performance goals, and architectural constraints.

1. Blacklisting

This method involves storing tokens that have not yet expired but need to be invalidated in a central location (typically Redis, a fast in-memory caching system). When a token is revoked, its unique ID (JTI - JWT ID claim) or the token itself is added to this blacklist. For every incoming request, the server first checks the token's signature and expiration, then queries whether it's on the blacklist. If it's on the list, the token is considered invalid.

💡 JTI Claim

The JWT specification (RFC 7519) defines an optional jti (JWT ID) claim that gives a token a unique identifier. Issuing every token with a jti simplifies the blacklisting process: blacklisting only the jti instead of the entire token saves storage space.

Pros:

Instant Revocation: Once a token is blacklisted, it becomes invalid almost immediately.
Simple Implementation: Relatively easy to set up with systems like Redis.

Cons:

Deviation from Statelessness: Although it's a separate service, this method ultimately requires the server to track a "state." This means compromising on JWT's core promise.
Performance Overhead: Checking the blacklist for every request adds an extra load to the system. Especially in high-traffic systems, every call to the Redis server can cause latency.
Storage Cost: If tokens are long-lived and there are many users, the blacklist can grow over time and lead to significant memory consumption on Redis. However, storing each entry with a TTL (Time-To-Live) equal to the token's remaining lifetime makes this cost manageable.

In a client project, I had to use blacklisting due to very strict security requirements. I observed an additional latency of about 5-10ms per request, but this was acceptable within the project's overall response time targets.

# FastAPI example: Blacklist check
import time

from fastapi import Depends, HTTPException, status
from jose import jwt, JWTError
from redis import Redis

# ... (JWT settings, SECRET_KEY, ALGORITHM, oauth2_scheme, etc.) ...

redis_client = Redis(host='localhost', port=6379, db=0)

# Declared with plain "def" so FastAPI runs it in its threadpool;
# the synchronous Redis call would otherwise block the event loop.
def verify_token(token: str = Depends(oauth2_scheme)):
    try:
        # jwt.decode verifies the signature and rejects expired tokens
        payload = jwt.decode(token, SECRET_KEY, algorithms=[ALGORITHM])
        jti: str = payload.get("jti")
        if jti is None:
            raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid token (missing JTI)")

        # Blacklist check
        if redis_client.exists(f"blacklist:{jti}"):
            raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Token revoked")

        # Other checks (user_id, etc.) go here
        return payload
    except JWTError:
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Could not validate credentials")

# Function to blacklist a token
def revoke_token(jti: str, exp_timestamp: int):
    # Keep in Redis for the remaining lifetime of the token.
    # time.time() is always UTC-based; calling .timestamp() on a naive
    # utcnow() value would be off by the server's UTC offset.
    ttl = exp_timestamp - int(time.time())
    if ttl > 0:
        redis_client.setex(f"blacklist:{jti}", ttl, "revoked")

2. Short-Lived Access Tokens + Refresh Tokens

This approach aims to mitigate the revocation problem while preserving the stateless nature of JWT. When a user first logs in, they receive two different tokens:

Short-Lived Access Token: A token typically valid for a short period, like 5-15 minutes. Used for API requests.
Long-Lived Refresh Token: Can typically be valid for days or weeks. Used to obtain a new access token. This token is stored in a database or cache on the server side and actively tracked.

Since the access token has a short lifespan, even if stolen, it remains valid for a very short time. When a token needs to be revoked, simply deleting the refresh token from the database is sufficient. This prevents the user from obtaining a new access token, and the existing access token automatically becomes invalid when it expires.

Pros:

More Secure: The short lifespan of access tokens minimizes the damage they can cause if stolen.
Performance: Most API requests are still processed completely stateless, as no database check is performed for access tokens.
Revocation Mechanism: Refresh tokens can be easily revoked because they are stateful.

Cons:

Added Complexity: Managing two different token types (creation, renewal, storage) adds complexity to the system architecture.
Refresh Token Security: Since refresh tokens are long-lived, their theft poses a more serious security risk. Therefore, secure storage and transmission of refresh tokens are critical (HTTP-only cookies, encrypted storage, etc.).
Not Instant Revocation: If an access token is stolen, it remains valid until it expires (e.g., 5-15 minutes). During this time, an attacker can perform operations on the system. This is a disadvantage in scenarios requiring urgent revocation (e.g., security breach).

I used this approach in the backend of my side project's mobile application. It spared users from logging in repeatedly, and I reduced the security risk by keeping the refresh token in a secure HttpOnly cookie. Access tokens had a 15-minute lifespan, while refresh tokens had a 30-day lifespan.
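The server-side half of this scheme can be sketched as follows. This is a minimal illustration, not a production design: the in-memory dictionary stands in for the database table, and the names RefreshTokenStore, issue, is_active, and revoke_all_for_user are hypothetical.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class RefreshRecord:
    user_id: int
    expires_at: float  # Unix timestamp


class RefreshTokenStore:
    """In-memory stand-in for the table that tracks refresh tokens."""

    def __init__(self, clock=time.time):
        self._records: dict[str, RefreshRecord] = {}
        self._clock = clock  # injectable so expiry can be tested

    def issue(self, user_id: int, lifetime_seconds: int = 30 * 24 * 3600) -> str:
        """Register a new refresh token and return its identifier (jti)."""
        jti = str(uuid.uuid4())
        self._records[jti] = RefreshRecord(user_id, self._clock() + lifetime_seconds)
        return jti

    def is_active(self, jti: str) -> bool:
        """True only if the token is known and has not expired."""
        record = self._records.get(jti)
        if record is None:
            return False
        if record.expires_at <= self._clock():
            del self._records[jti]  # lazy cleanup of expired entries
            return False
        return True

    def revoke(self, jti: str) -> None:
        """Revoke a single refresh token (e.g., logout on one device)."""
        self._records.pop(jti, None)

    def revoke_all_for_user(self, user_id: int) -> int:
        """Revoke every refresh token of a user; returns how many were removed."""
        doomed = [jti for jti, rec in self._records.items() if rec.user_id == user_id]
        for jti in doomed:
            del self._records[jti]
        return len(doomed)
```

A refresh endpoint would call is_active before handing out a new access token; once the record is gone, the session ends as soon as the current access token expires.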

3. Stateful Session Management (Traditional Approach)

In some cases, completely abandoning the stateless promise of JWT and reverting to traditional stateful session management might be more sensible, especially in projects with high security requirements or complex authorization scenarios. In this approach, the server maintains each user's session state in its own database or cache (e.g., a session ID and corresponding user information). Only a session ID (in a cookie) is sent to the client, and with each request, session information is queried from the server using this ID.

Pros:

Full Control: Since full control of sessions is on the server, they can be revoked instantly and easily. User permissions can be updated immediately.
Less Complexity (in some aspects): No need to deal with JWT's signature verification, expiration management, refresh token mechanism, etc.
Security: Even if a session ID is stolen, it's easier to prevent unauthorized access by performing additional checks on the server side (IP address, user-agent control, etc.).

Cons:

Scalability Challenges: Since state queries must be performed on the server side for every request, it can create performance and scalability issues in high-traffic or distributed systems. Sharing session data across multiple servers (sticky sessions or a central session store) adds complexity.
Performance Overhead: Making a database or cache query for every API call increases latency.

In an internal banking platform, I used stateful session management instead of JWT for some critical modules (e.g., money transfers). Although we sacrificed some performance, security and instant revocation requirements necessitated this approach. I overcame the scalability issue to some extent by storing sessions on a Redis Cluster.
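As a rough sketch of why revocation and permission changes are immediate in this model, consider the following in-memory session store. The class and method names are hypothetical, and a real deployment would back it with a shared store such as Redis rather than a Python dictionary.

```python
import secrets
from typing import Optional


class SessionStore:
    """Server-side sessions: the client only ever holds an opaque ID."""

    def __init__(self):
        self._sessions: dict[str, dict] = {}

    def create(self, user_id: int, roles: list[str]) -> str:
        """Start a session and return the ID to place in the cookie."""
        session_id = secrets.token_urlsafe(32)
        self._sessions[session_id] = {"user_id": user_id, "roles": list(roles)}
        return session_id

    def get(self, session_id: str) -> Optional[dict]:
        """Looked up on every request; None means 'not logged in'."""
        return self._sessions.get(session_id)

    def set_roles(self, user_id: int, roles: list[str]) -> None:
        """A permission change is visible on the very next request."""
        for session in self._sessions.values():
            if session["user_id"] == user_id:
                session["roles"] = list(roles)

    def destroy(self, session_id: str) -> None:
        """Instant revocation: delete the server-side record."""
        self._sessions.pop(session_id, None)
```

The per-request lookup in get is the price paid for this control; it is the same query that a stateless access token avoids.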

4. Distributed Cache Usage

Distributed cache systems like Redis or Memcached, frequently mentioned in the blacklisting and refresh token approaches above, form the foundation of JWT revocation mechanisms. These systems are ideal for quickly checking token IDs thanks to their high-performance key-value storage capabilities.

Pros:

Speed: They offer very low-latency queries thanks to in-memory storage.
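Scalability: Redis Cluster shards keys across nodes for horizontal scaling, while Sentinel provides monitoring and automatic failover for high availability; both matter for high-traffic systems.
TTL Support: The TTL feature automatically deletes entries once their lifetime has elapsed, so setting it to the token's remaining validity keeps storage costs in check.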

Cons:

Additional Infrastructure: Adding an extra component (Redis server or cluster) to the system means setup and management costs.
Consistency: Cache consistency can sometimes be an issue in distributed systems, but "eventual consistency" is generally acceptable for JWT revocation.

In a manufacturing ERP, I used the Redis Cluster that I was already using for caching AI model results for production planning, also for tracking JWT refresh tokens and blacklisting access tokens in emergencies. This allowed me to get maximum efficiency from the existing infrastructure.

A Pragmatic Approach: Hybrid Solutions

In my experience, no single approach perfectly meets all scenarios. Most of the time, it's necessary to develop hybrid solutions for different needs. My generally preferred approach is as follows:

Short-Lived Access Tokens and Long-Lived Refresh Tokens: This provides a fundamental balance of security and performance. Access tokens have a lifespan of 5-10 minutes, while refresh tokens are longer-lived, like 1 week or 1 month. Refresh tokens are stored in a database or a secure cache like Redis and actively tracked.
Refresh Token Revocation for User Logout and Password Change: When a user logs out or changes their password, I delete all active refresh tokens for that user from the database. This prevents the user from obtaining a new access token. Existing access tokens remain valid until they expire, but this period is already short.
Blacklisting for Emergencies (Token Theft): If a security breach occurs and an access token is found to be stolen, I immediately add its JTI (JWT ID) to a blacklist in Redis. This ensures the stolen token is instantly invalidated. Since this scenario is rare, the additional overhead on every request is an acceptable cost. I also give tokens in the blacklist a TTL equal to their remaining lifespan, so Redis memory doesn't unnecessarily swell.

This hybrid approach allows me to both leverage the stateless advantages of JWT and have the ability to instantly revoke tokens in critical security scenarios. In a client project, by implementing this approach, I was able to meet the expectations of both the development team and the security team.

⚠️ Important Note: 'iat' and 'exp' Claims

It is critically important for every JWT to contain iat (issued at) and exp (expiration) claims. These determine how long the token will be valid and are used for TTL management in the blacklist mechanism. Additionally, the nbf (not before) claim can also be useful in some scenarios.
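A small sketch of how these claims might be checked by hand is shown below. JWT libraries normally do this during decoding; the function names and the leeway parameter (a tolerance for clock skew between servers) are assumptions for illustration.

```python
import time
from typing import Optional


def check_time_claims(claims: dict, now: Optional[float] = None, leeway: int = 0) -> None:
    """Raise ValueError if the token is expired, not yet valid, or lacks exp/iat."""
    now = time.time() if now is None else now
    if "exp" not in claims or "iat" not in claims:
        raise ValueError("token must carry exp and iat")
    # A token is valid only strictly before its expiration time
    if now >= claims["exp"] + leeway:
        raise ValueError("token expired")
    # nbf is optional; when present the token must not be used earlier
    if "nbf" in claims and now < claims["nbf"] - leeway:
        raise ValueError("token not yet valid")


def remaining_ttl(claims: dict, now: Optional[float] = None) -> int:
    """Seconds until exp, never negative; usable as a blacklist TTL."""
    now = time.time() if now is None else now
    return max(0, int(claims["exp"] - now))
```

The value returned by remaining_ttl is what a blacklist entry would use as its TTL, so the entry disappears at the moment the token would have expired anyway.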

# Example: Creating Refresh token and Access token
import time
import uuid  # For generating unique JTI
from datetime import datetime, timedelta, timezone
from typing import Optional

from jose import jwt
from redis import Redis

# Placeholder only: load a strong random key from configuration in production
SECRET_KEY = "mysecretkey"
ALGORITHM = "HS256"

redis_client = Redis(host='localhost', port=6379, db=0)

def create_access_token(data: dict, expires_delta: Optional[timedelta] = None):
    to_encode = data.copy()
    now = datetime.now(timezone.utc)  # timezone-aware replacement for utcnow()
    expire = now + (expires_delta or timedelta(minutes=15))  # Short-lived
    to_encode.update({"exp": expire, "iat": now, "jti": str(uuid.uuid4())})  # Adding JTI
    encoded_jwt = jwt.encode(to_encode, SECRET_KEY, algorithm=ALGORITHM)
    return encoded_jwt

def create_refresh_token(data: dict, expires_delta: Optional[timedelta] = None):
    to_encode = data.copy()
    now = datetime.now(timezone.utc)
    expire = now + (expires_delta or timedelta(days=7))  # Long-lived
    to_encode.update({"exp": expire, "iat": now, "jti": str(uuid.uuid4())})  # Adding JTI
    encoded_jwt = jwt.encode(to_encode, SECRET_KEY, algorithm=ALGORITHM)
    return encoded_jwt

# When user changes password, delete all refresh tokens
def revoke_all_user_refresh_tokens(user_id: int):
    # Delete all refresh tokens belonging to user_id from the database
    # E.g.: db.session.query(RefreshToken).filter_by(user_id=user_id).delete()
    print(f"All refresh tokens for user {user_id} have been deleted.")

# Blacklist access token in an emergency
def blacklist_access_token(jti: str, exp_timestamp: int):
    # time.time() is UTC-based; .timestamp() on a naive utcnow() value
    # would be skewed by the server's UTC offset
    ttl = exp_timestamp - int(time.time())
    if ttl > 0:
        redis_client.setex(f"blacklist:{jti}", ttl, "revoked")
        print(f"Token {jti} blacklisted, will be deleted after {ttl} seconds.")

The code snippets above provide a simple example of how I implemented this logic in a FastAPI application. The create_access_token and create_refresh_token functions generate tokens with different durations, and I add a unique jti (JWT ID) to both. This jti is used to uniquely identify the token, especially in the blacklist mechanism. With revoke_all_user_refresh_tokens, I can terminate all sessions of a user, while with blacklist_access_token, I can instantly invalidate a specific access token in an emergency. These structures provide both flexibility and security, especially in a production environment.

Performance and Scalability Concerns

When we add state tracking for token revocation, performance and scalability concerns naturally arise. Especially in the blacklisting approach, every incoming request going to Redis to query whether the token is on the blacklist creates additional latency. Even if this latency is at the millisecond level, in high-traffic systems with tens of thousands of requests per second, it can accumulate and become a significant bottleneck.

In my experience, on an API Gateway processing 5000 requests per second, adding a 2ms Redis query to each request could increase the total response time by 10-15%. Therefore, optimizing such checks becomes critical:

TTL Management: Keeping blacklisted tokens in Redis only for their remaining validity period reduces storage costs and Redis's memory consumption. If a token has 10 minutes left, it should remain on the blacklist for a maximum of 10 minutes.
Caching: If possible, caching blacklist checks at higher layers (e.g., in an API Gateway) can reduce the number of calls to Redis. However, this introduces consistency issues between updating the blacklist and updating the cache.
Asynchronous Revocation: In some scenarios, "almost instant" revocation or eventual consistency is acceptable instead of instant revocation. For example, when a user's permissions change, you can let the existing access token expire so that the user picks up the updated permissions with the next access token. Instant blacklisting is then reserved for urgent security breaches.
Efficient Data Structures: Storing each blacklisted jti as its own string key lets EXISTS run in O(1) and gives every entry an individual TTL. A Redis SET also offers O(1) membership checks through SISMEMBER, but its members cannot expire individually, so expired entries would need separate cleanup.
Network Latency: Positioning the Redis server or cluster close to the application servers, for example within the same Availability Zone in AWS, reduces network latency and improves performance.
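The caching idea from the list above can be sketched as a thin wrapper around the blacklist lookup. Everything here is an assumption for illustration: the class name, the 5-second default, and the lookup callable (which in practice would wrap something like a Redis EXISTS call). The cache is unbounded in this sketch; a real one would need eviction.

```python
import time
from typing import Callable


class CachedRevocationCheck:
    """Remember recent blacklist answers to cut calls to the backing store."""

    def __init__(
        self,
        lookup: Callable[[str], bool],
        cache_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._lookup = lookup              # e.g. a function that queries Redis
        self._cache_seconds = cache_seconds
        self._clock = clock                # injectable for testing
        self._cache: dict[str, tuple[bool, float]] = {}

    def is_revoked(self, jti: str) -> bool:
        now = self._clock()
        cached = self._cache.get(jti)
        if cached is not None and cached[1] > now:
            return cached[0]               # still fresh: skip the lookup
        revoked = self._lookup(jti)
        self._cache[jti] = (revoked, now + self._cache_seconds)
        return revoked
```

The trade-off is explicit: a token revoked right after a cached "not revoked" answer stays usable for up to cache_seconds, so this window should be chosen according to how quickly an emergency revocation must take effect.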

In an ERP system for a manufacturing company, for the revocation of JWTs used for operator screens, I used blacklisting only in critical security situations (token theft). For normal logout procedures, I preferred to delete the refresh token from the database and wait for the short-lived access tokens to expire. This approach maintained the system's overall performance while providing a security net for emergencies.

ℹ️ Related: Cache Management in Distributed Systems

I previously shared my experiences with cache consistency in distributed systems. The JWT blacklist can present similar challenges.

Conclusion

JWTs are a powerful tool for authentication in distributed systems, and their stateless promise offers significant advantages. However, when the need for token revocation arises in real-world scenarios, this stateless structure alone is not sufficient. In my experience, a "hybrid" solution, blended according to needs, provides the most balanced outcome rather than a purely "stateless" approach.

Using short-lived access tokens and stateful refresh tokens provides sufficient security and performance for most scenarios, while having a Redis-based blacklist mechanism for emergencies offers the ability to instantly invalidate stolen tokens. The important thing is to thoroughly analyze your project's security requirements, performance goals, and architectural constraints to determine the most suitable trade-off for you.

Remember, no security solution is 100% flawless. The key is to understand the risks and mitigate them to a manageable level. In my next post, I'll discuss the "phantom transaction" problem I encountered in a manufacturing ERP and the event-sourcing pattern I used to solve it.
