<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Josiah Liciaga-Silva</title>
    <description>The latest articles on DEV Community by Josiah Liciaga-Silva (@jliciagasilva).</description>
    <link>https://dev.to/jliciagasilva</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F597604%2F274437b7-4c08-4286-a7eb-a6a8078ce054.jpg</url>
      <title>DEV Community: Josiah Liciaga-Silva</title>
      <link>https://dev.to/jliciagasilva</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jliciagasilva"/>
    <language>en</language>
    <item>
      <title>Differential Transformers Explained</title>
      <dc:creator>Josiah Liciaga-Silva</dc:creator>
      <pubDate>Tue, 15 Oct 2024 15:12:48 +0000</pubDate>
      <link>https://dev.to/jliciagasilva/differential-transformers-explained-3h2j</link>
      <guid>https://dev.to/jliciagasilva/differential-transformers-explained-3h2j</guid>
      <description>&lt;h2&gt;
  
  
  The Basics
&lt;/h2&gt;

&lt;p&gt;Before diving into the new Differential Transformer, let's go over how a traditional Transformer works. At its core, Transformers use an attention mechanism to allow a model to focus on specific parts of an input sequence. This attention is computed using a softmax function:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V&lt;/code&gt;&lt;/p&gt;


&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Q&lt;/code&gt; is the query matrix&lt;/li&gt;
&lt;li&gt;&lt;code&gt;K&lt;/code&gt; is the key matrix&lt;/li&gt;
&lt;li&gt;&lt;code&gt;V&lt;/code&gt; is the value matrix&lt;/li&gt;
&lt;li&gt;&lt;code&gt;d_k&lt;/code&gt; is the dimensionality of the key vectors&lt;/li&gt;
&lt;/ul&gt;
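&lt;p&gt;As a sanity check, the formula above can be sketched in a few lines of NumPy (a toy sketch; the shapes are illustrative):&lt;/p&gt;

```python
import numpy as np

def softmax(x):
    # subtract the row max for numerical stability
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how relevant each key is to each query
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```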

&lt;p&gt;This mechanism assigns weights to input tokens based on their relevance. Despite its success, it is easily distracted: the standard softmax tends to over-allocate attention to irrelevant parts of the context. In long sequences the model focuses too broadly, which hurts learning efficiency and degrades in-context learning.&lt;/p&gt;

&lt;p&gt;The Differential Transformer addresses these challenges by introducing a new mechanism. Instead of relying on a single attention map, it calculates two distinct attention maps:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Adiff=softmax(A1)−softmax(A2)A_{diff} = softmax(A_1) - softmax(A_2)&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;ff&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;so&lt;/span&gt;&lt;span class="mord mathnormal"&gt;f&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;so&lt;/span&gt;&lt;span class="mord mathnormal"&gt;f&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ma&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;Yes, that's right, it's that simple. (In the paper, the queries and keys are split into two groups to produce the two maps, and the second map is scaled by a learnable scalar λ before the subtraction.) Subtracting the two maps cancels their common noise, promoting sparser and more focused attention. In turn, this prevents over-allocating attention to irrelevant tokens and lets the model better manage long sequences and complex in-context learning scenarios.&lt;/p&gt;

&lt;p&gt;Key Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sparse Attention Patterns: By reducing redundant attention, the model can better focus on critical parts of the input sequence.&lt;/li&gt;
&lt;li&gt;Improved Long-Context Modeling: Differential Attention allows the model to handle longer contexts more effectively, improving tasks like document summarization and question answering.&lt;/li&gt;
&lt;li&gt;In-Context Learning: The differential attention mechanism dynamically adapts based on the input context, enhancing the model's ability to learn from examples within the input.&lt;/li&gt;
&lt;li&gt;Hallucination Mitigation: In generation tasks, the DIFF Transformer reduces hallucinations by focusing more accurately on relevant context, leading to more coherent outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The DIFF Transformer has broad applications, particularly in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handling long texts while focusing on the core information in Text Summarization tasks&lt;/li&gt;
&lt;li&gt;Improved performance in QA systems which require nuanced understanding of context&lt;/li&gt;
&lt;li&gt;Robust Generation, mitigating hallucinations in current models (GPT, Claude, Llama, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Start by modifying the attention mechanism within a Transformer architecture. Instead of computing attention with a single softmax, compute two separate attention maps and subtract them to form a differential attention map.&lt;/p&gt;

&lt;p&gt;Here's a high-level sketch in Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;diff_attention&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;     
    &lt;span class="n"&gt;A1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# first attention map    
&lt;/span&gt;    &lt;span class="n"&gt;A2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Q&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d_k&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# second attention map    
&lt;/span&gt;    &lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;A2&lt;/span&gt;  &lt;span class="c1"&gt;# differential attention    
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;diff&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach allows you to integrate differential attention into any Transformer-based architecture.&lt;/p&gt;
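&lt;p&gt;For experimentation, here's a self-contained runnable version (a sketch assuming NumPy; the shapes and the fixed &lt;code&gt;lam&lt;/code&gt; value stand in for the paper's learnable λ):&lt;/p&gt;

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def diff_attention(Q1, K1, Q2, K2, V, lam=0.8):
    d_k = K1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d_k))  # first attention map
    A2 = softmax(Q2 @ K2.T / np.sqrt(d_k))  # second attention map
    return (A1 - lam * A2) @ V              # differential attention

rng = np.random.default_rng(1)
Q1, K1, Q2, K2, V = (rng.standard_normal((4, 8)) for _ in range(5))
print(diff_attention(Q1, K1, Q2, K2, V).shape)  # (4, 8)
```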

&lt;h2&gt;
  
  
  That's a Wrap
&lt;/h2&gt;

&lt;p&gt;DIFF Transformers are a significant leap forward. By refining the attention mechanism, they address key weaknesses we all encounter in the traditional Transformer architecture, leading to more efficient, focused, and context-aware models. Implementing these ideas can enhance the performance of large-scale language models in your applications, from NLP tasks to GenAI.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/pdf/2410.05258" rel="noopener noreferrer"&gt;ARXIV - Differential Transformer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>learning</category>
    </item>
    <item>
      <title>The Basics of Event-Driven Architecture</title>
      <dc:creator>Josiah Liciaga-Silva</dc:creator>
      <pubDate>Thu, 15 Aug 2024 23:46:28 +0000</pubDate>
      <link>https://dev.to/jliciagasilva/the-basics-of-event-driven-architecture-5bf7</link>
      <guid>https://dev.to/jliciagasilva/the-basics-of-event-driven-architecture-5bf7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As a company evolves, its back-end system often grows into many small services, called microservices. Keeping them all working together smoothly is a real challenge, and there are many ways to tackle it. One common approach is the fabled Event-Driven Architecture (EDA). Large companies like LinkedIn, Uber, and Amazon use this technique to build near real-time services, providing the smooth user experience we all enjoy today. It's time to FAFO. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Event-Driven Architecture?
&lt;/h2&gt;

&lt;p&gt;At its core, EDA is a software design pattern. If you are a front-end developer, you already use it: this is exactly how the client works. A user presses a button, which creates an event, and another component of the client handles that event. In other words, it is a system that produces, detects, and consumes (reacts to) events. &lt;/p&gt;

&lt;p&gt;Let's break this down further.&lt;/p&gt;

&lt;p&gt;An event is created by a producer. That event is then transmitted through an event channel. One or more consumers receive the event and react to it. &lt;/p&gt;
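&lt;p&gt;That flow can be sketched in a few lines of Python (an in-process toy, with a plain queue standing in for the event channel):&lt;/p&gt;

```python
from queue import Queue

# a minimal in-process event channel: producer -> channel -> consumer
channel = Queue()

def producer():
    # the producer emits an event without knowing who will handle it
    channel.put({"type": "user_registered", "user_id": 42})

def consumer():
    event = channel.get()  # receive the next event from the channel
    return f"handled {event['type']}"

producer()
print(consumer())  # handled user_registered
```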

&lt;h2&gt;
  
  
  There are a few key components to this architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Event:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;This is a notable occurrence or change in the state of a system. For example: a user registers, an order is placed, your Apple Watch records a sensor reading, the price of a stock fluctuates.&lt;/li&gt;
&lt;li&gt;This typically contains an event type, timestamp, and the relevant data associated with the event. &lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  The Event Producer:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;These are the components of our system that generate the events when something noteworthy occurs. &lt;/li&gt;
&lt;li&gt;These can be user actions, system processes, or external systems performing operations.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Event Broker / Channel:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;This acts as the intermediary for events. It is in charge of routing events between producers and consumers. &lt;/li&gt;
&lt;li&gt;They are often implemented as Message Queues or Event Streams.&lt;/li&gt;
&lt;li&gt;Apache Kafka, RabbitMQ, and Amazon SNS/SQS are a few of the technologies commonly used to fill this role. &lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Event Bus:
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;This is a system that handles event distribution across different parts of our system. &lt;/li&gt;
&lt;li&gt;This is what allows us to decouple the Event Producers from the Event Consumers.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Event Consumer / Processor:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;These are the components (services) in our system that listen for and react to specific events. &lt;/li&gt;
&lt;li&gt;This is the action step for our event: a consumer can aggregate, filter, enrich, save, update, or destroy data, and generate new events. &lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Event Store:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;This is the persistent storage for our events. It is what enables us to replay events and reconstruct our system state. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  There are a few patterns in EDA
&lt;/h2&gt;

&lt;p&gt;In the wild, you will come across some of these...&lt;/p&gt;

&lt;h3&gt;
  
  
  The fabled Publish-Subscribe (Pub/Sub) Model
&lt;/h3&gt;

&lt;p&gt;Here, publishers emit events without any knowledge of subscribers; subscribers receive the events they're interested in and act on them. GraphQL subscriptions use this model effectively. &lt;/p&gt;
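&lt;p&gt;A toy version of the pattern (names here are illustrative): the bus keeps a list of handlers per event type, and publishers never touch subscribers directly.&lt;/p&gt;

```python
from collections import defaultdict

class EventBus:
    """Toy pub/sub bus: publishers don't know who subscribes."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # fan the event out to every registered handler
        for handler in self.subscribers[event_type]:
            handler(payload)

bus = EventBus()
received = []
bus.subscribe("order_placed", lambda p: received.append(p["order_id"]))
bus.publish("order_placed", {"order_id": 1})
print(received)  # [1]
```

In production, a broker like Kafka or RabbitMQ plays this role, adding the persistence and delivery guarantees the toy lacks.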

&lt;h3&gt;
  
  
  Event Sourcing
&lt;/h3&gt;

&lt;p&gt;This pattern stores the state of the application as a sequence of events. It allows us to replay events to reconstruct system state, and it gives us an audit trail of every change. &lt;/p&gt;
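&lt;p&gt;A minimal sketch of the idea: the event log is the source of truth, and state (here, an account balance) is recomputed by folding over it.&lt;/p&gt;

```python
# event sourcing: state is derived by replaying the event log
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

def replay(events):
    balance = 0
    for e in events:
        if e["type"] == "deposited":
            balance += e["amount"]
        elif e["type"] == "withdrawn":
            balance -= e["amount"]
    return balance

print(replay(events))  # 75
```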

&lt;h3&gt;
  
  
  Command Query Responsibility Segregation (CQRS)
&lt;/h3&gt;

&lt;p&gt;This pattern separates read and write operations for a data store. It is often used in conjunction with Event Sourcing and allows us to improve scalability and performance for complex domains. &lt;/p&gt;

&lt;h3&gt;
  
  
  Event Stream Processing
&lt;/h3&gt;

&lt;p&gt;This pattern allows for continuous processing of event streams in real time. It is used in analytics, monitoring, and reactive systems. &lt;/p&gt;

&lt;h2&gt;
  
  
  There are advantages to using EDA as well as challenges to be overcome...
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;EDA allows us to scale horizontally by adding more Event Consumers. &lt;/li&gt;
&lt;li&gt;We benefit from Dependency Inversion &lt;/li&gt;
&lt;li&gt;EDA can handle high volumes of events concurrently&lt;/li&gt;
&lt;li&gt;It allows for loose coupling, as components interact through events, reducing direct dependencies 

&lt;ul&gt;
&lt;li&gt;This in turn makes it easier to modify, replace, or add new components to the system. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;New Event Consumers can be added without affecting existing ones.

&lt;ul&gt;
&lt;li&gt;This in turn helps the system evolve with changing business requirements with relative ease. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;EDA also enables real-time processing and reaction to events &lt;/li&gt;

&lt;li&gt;EDA also improves our system's ability to adapt to ever-changing conditions (see the loose-coupling point above for why)&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Challenges in EDA
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The propagation of events takes time, which leads to periods of temporary inconsistency, also known as Eventual Consistency. 

&lt;ul&gt;
&lt;li&gt;This adds real complexity: the system needs to be designed to handle out-of-order and duplicate events. &lt;/li&gt;
&lt;li&gt;It also requires additional mechanisms, like sequence numbers or idempotent consumers, to achieve robust processing semantics. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Changes to event structure must be managed carefully; you need to make sure events are forward- and backward-compatible&lt;/li&gt;

&lt;li&gt;Distributed systems make tracing and debugging much harder; you will need robust logging and monitoring to overcome this. &lt;/li&gt;

&lt;/ul&gt;
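&lt;p&gt;As a tiny illustration of the idempotent-consumer idea mentioned above, a consumer can track processed event IDs so redelivered events become no-ops:&lt;/p&gt;

```python
# idempotent consumer: remember processed event IDs so duplicates are no-ops
processed_ids = set()
total = 0

def handle(event):
    global total
    if event["id"] in processed_ids:
        return  # duplicate delivery: ignore it
    processed_ids.add(event["id"])
    total += event["amount"]

# event id 1 is delivered twice, but only counted once
for e in [{"id": 1, "amount": 10}, {"id": 1, "amount": 10}, {"id": 2, "amount": 5}]:
    handle(e)
print(total)  # 15
```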

&lt;h2&gt;
  
  
  Best Practices to follow
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Events should be meaningful and atomic; include all the data consumers need: nothing more, nothing less. &lt;/li&gt;
&lt;li&gt;Implement Idempotent Consumers&lt;/li&gt;
&lt;li&gt;Use asynchronous processing, this much should be obvious: DECOUPLE event production from consumption. This improves responsiveness. &lt;/li&gt;
&lt;li&gt;Version each event, plan for backward and forward compatibility. &lt;/li&gt;
&lt;li&gt;Implement Proper Error Handling, design for failure scenarios and network issues, use circuit breakers and retries where appropriate. &lt;/li&gt;
&lt;/ul&gt;
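&lt;p&gt;To illustrate event versioning (the field names here are invented for the example), a consumer can upcast old event shapes before handling them:&lt;/p&gt;

```python
# versioned events: include a schema version so consumers can handle old shapes
def upcast(event):
    """Translate a v1 event to the current v2 shape (hypothetical fields)."""
    if event.get("version", 1) == 1:
        # v1 stored a single "name"; v2 splits it into first and last
        first, _, last = event["name"].partition(" ")
        return {"version": 2, "first_name": first, "last_name": last}
    return event

print(upcast({"version": 1, "name": "Ada Lovelace"}))
# {'version': 2, 'first_name': 'Ada', 'last_name': 'Lovelace'}
```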

&lt;h2&gt;
  
  
  That's a wrap....
&lt;/h2&gt;

&lt;p&gt;EDA allows us to create incredible experiences for our users. In the age of speed and efficiency, this architecture powers real-time applications ranging from Uber, to Spotify, to Battlefield 1 (the goat), to Netflix. Be aware of the challenges and plan ahead. Also, don't start your unicorn side-project with this architecture in place. You won't get far. &lt;/p&gt;

&lt;p&gt;Til next time!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
