Mathieu Ledru

Posted on Apr 17

⚙️ Message-oriented vs. Data-oriented orchestration - from data to knowledge

#architecture #data #distributedsystems #systemdesign

For intellectual property reasons, the case study used in this article is not the original subject, although it is closely related. For any further information, please contact Omer, who will be happy to answer. Apologies for any inconvenience.

In this article, we explore two fundamental approaches to software orchestration:

Message-Oriented Orchestration: via Symfony Messenger, in both synchronous and asynchronous modes
Data-Oriented Orchestration: via Navi for synchronous execution and Flow for asynchronous execution

The case study is based on a classic but instructive problem: text mining applied to a set of Git repositories.

For the practical demonstration, I reuse the EIT tutorial on classification from 2007/2008, which I completed at the time with Matthieu Beyou during computer science tutorial sessions.

The data comes from Omer (a former colleague) and is available on his site https://git.arkalo.ovh (via the API).

The goal is not to produce the best machine learning model, but to understand how the form of orchestration influences the complexity, readability, and scalability of the system.

The problem: transforming repositories into usable knowledge

The dataset consists of a list of Git repositories defined in a repos.json file, built from the repositories listed on https://git.arkalo.ovh/explore/repos.

For reference, you can extract this information with the Composio connector for Gitea: https://composio.dev/toolkits/gitea. See my previous article for the implementation: https://blog.darkwood.com/fr/article/relacher-les-connecteurs-des-outils-au-langage.

Each repository becomes a canonical document constructed from:

repository name
description
README
metadata (owner, topics…)
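As an illustration, the canonical document can be sketched as a simple concatenation of those fields. This is a minimal sketch, not the project's actual code: the function name `buildCanonicalDocument` and the flat array keys (`name`, `description`, `readme`, `owner`, `topics`) are assumptions about what an entry of repos.json might look like, and the sample values are made up.

```php
<?php

// Hypothetical helper: the keys mirror the list above, but the exact
// shape of an entry in repos.json is an assumption.
function buildCanonicalDocument(array $repo): string
{
    $parts = [
        $repo['name'] ?? '',
        $repo['description'] ?? '',
        $repo['readme'] ?? '',
        $repo['owner'] ?? '',
        implode(' ', $repo['topics'] ?? []),
    ];

    // Trim every field, drop the empty ones, and join the rest into one text.
    $parts = array_filter(array_map('trim', $parts), fn (string $p): bool => $p !== '');

    return implode("\n", $parts);
}

// Made-up sample entry, for illustration only.
$document = buildCanonicalDocument([
    'name' => 'demo-repo',
    'description' => 'Data-oriented orchestration',
    'readme' => '',
    'owner' => 'example-owner',
    'topics' => ['php', 'pipeline'],
]);
```

Keeping this step as a plain function means the rest of the pipeline only ever sees a string, whatever the source of the repository data.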

This document is then transformed via a classic text mining pipeline:

Preprocessing (cleaning, tokenization)
Feature Extraction
TF-IDF Weighting
Similarity between documents
Classification / clustering

This pipeline is directly inspired by historical approaches:

TF-IDF: weight = tf * log(N / df)
Cosine similarity between documents
Supervised Naive Bayes Classification
Unsupervised k-means clustering
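The TF-IDF weighting above fits in a few lines. The sketch below is one possible reading of the formula, under two assumptions that the article does not state: tf is the raw count of a term in a document, and log is the natural logarithm. The function name `tfIdf` is illustrative.

```php
<?php

/**
 * @param array<string, list<string>> $documents tokens per document id
 * @return array<string, array<string, float>> TF-IDF weight per term, per document
 */
function tfIdf(array $documents): array
{
    $n = count($documents);

    // df: number of documents that contain each term at least once.
    $df = [];
    foreach ($documents as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $weights = [];
    foreach ($documents as $id => $tokens) {
        $weights[$id] = [];
        // tf: raw count of the term in the document (assumption).
        foreach (array_count_values($tokens) as $term => $tf) {
            $weights[$id][$term] = $tf * log($n / $df[$term]);
        }
    }

    return $weights;
}

$weights = tfIdf([
    'a' => ['git', 'php', 'php'],
    'b' => ['git', 'flow'],
]);
```

A term that appears in every document (here `git`) gets a weight of zero, which is the expected behaviour of the formula: it carries no discriminating information.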

What interests us here is not the algorithm, but the way to orchestrate it.

If you want to go deeper, the Resources section at the bottom of the article lists a number of references on data mining applied to computer science.

Business pipeline (independent of orchestration)

First and foremost, the core business needs to be isolated.

Repository → Document → Tokens → Features → TF-IDF → Similarity → Results

This pipeline represents a data transformation.

Each step:

takes a piece of data
produces new data
without strong dependence on an external context
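These three properties can be sketched as a composition of pure functions. The stage bodies below are deliberately simplified placeholders and the helper name `pipeline` is an assumption; only the shape (data in, new data out, no external context) reflects the text above.

```php
<?php

// Each stage is a pure function: it takes a piece of data and returns new data.
$tokenize = fn (string $document): array => preg_split('/[^a-z0-9]+/', strtolower($document), -1, PREG_SPLIT_NO_EMPTY);

$countFeatures = fn (array $tokens): array => array_count_values($tokens);

// Compose stages left to right into a single callable.
function pipeline(callable ...$stages): callable
{
    return fn (mixed $input): mixed => array_reduce(
        $stages,
        fn (mixed $data, callable $stage): mixed => $stage($data),
        $input,
    );
}

$run = pipeline($tokenize, $countFeatures);
$features = $run('Flow: a PHP flow library');
```

Nothing in this sketch knows how it is triggered or scheduled, which is exactly what lets the same business code sit behind either orchestration style.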

This is precisely where the two approaches diverge.

Approach 1: Message-Oriented - Orchestration via Symfony Messenger

In the Message-Oriented implementation, the pipeline is not expressed as a continuous data transformation.

It is encapsulated in a message, then executed via the Symfony bus.

Execution Model

Command → Message Bus → Handler → PipelineService → Stages

In concrete terms:

a CLI command triggers the execution
a message is dispatched
a handler takes care of the execution
the core business remains centralized in a shared service

RunMessengerPipelineMessage
→ RunMessengerPipelineHandler
→ PipelineService

Separation of responsibilities

This implementation adheres to a key project constraint:

The core business is strictly shared between the two approaches

So:

Messenger contains no business logic
it only orchestrates the execution

Actual Pipeline Executed

The handler triggers a deterministic pipeline:

1. ingest
2. preprocess
3. feature build
4. classification
5. clustering

Each step is executed in a common application service (PipelineService).
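To make the chain concrete, here is a minimal sketch of the message, handler and service shape. The class names RunMessengerPipelineMessage, RunMessengerPipelineHandler and PipelineService come from the project, but the `reposFile` property, the `run()` method and the stubbed stage list are assumptions. `SyncBus` is a hypothetical stand-in so the example runs without Symfony; in the real application the Symfony Messenger bus does the routing.

```php
<?php

// Shared business service (stub): the real one performs the actual stages.
final class PipelineService
{
    /** @return list<string> names of the stages executed, in order */
    public function run(string $reposFile): array
    {
        return ['ingest', 'preprocess', 'feature build', 'classification', 'clustering'];
    }
}

// The message is a plain envelope: it carries only what the handler needs.
final class RunMessengerPipelineMessage
{
    public function __construct(public readonly string $reposFile) {}
}

// The handler holds no business logic; it delegates to the shared service.
final class RunMessengerPipelineHandler
{
    public function __construct(private readonly PipelineService $pipeline) {}

    public function __invoke(RunMessengerPipelineMessage $message): array
    {
        return $this->pipeline->run($message->reposFile);
    }
}

// Hypothetical stand-in for the bus: routes a message to its handler by class name.
final class SyncBus
{
    /** @param array<class-string, callable> $handlers */
    public function __construct(private readonly array $handlers) {}

    public function dispatch(object $message): mixed
    {
        return ($this->handlers[$message::class])($message);
    }
}

$bus = new SyncBus([
    RunMessengerPipelineMessage::class => new RunMessengerPipelineHandler(new PipelineService()),
]);
$stages = $bus->dispatch(new RunMessengerPipelineMessage('repos.json'));
```

Even in this reduced form, three classes exist purely for orchestration before a single business stage runs.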

Concepts introduced by Messenger

The orchestration explicitly introduces:

a message class
a dedicated handler
a dependence on the bus
a dispatch layer

Command → Message → Handler → Service

These elements are specific to Messenger and do not exist in the data-oriented model.

Observability and debugging

Messenger offers a natural debugging model:

message inspection
middleware
bus logging
extensibility towards async / queue

Debug = message level + middleware

Nature of the overhead

In this MVP, the overhead can be measured in terms of concepts:

introduction of an artificial message
indirection via a handler
the need to structure the execution around the bus

But this overhead is located in the orchestration adapter, not in the business core.

Summary

This approach transforms the pipeline into:

a distributed work unit

It favors:

Symfony standardization
extensibility towards async
integration with the ecosystem

At the cost of an additional layer of indirection.

Conceptual Example

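use Symfony\Component\Messenger\Attribute\AsMessageHandler;
use Symfony\Component\Messenger\MessageBusInterface;

// The message carries only an identifier, not the document itself.
final class ComputeTfIdfMessage
{
    public function __construct(public readonly DocumentId $id) {}
}

#[AsMessageHandler]
final class ComputeTfIdfHandler
{
    // DocumentRepository and TfIdfCalculator are illustrative names for the shared services.
    public function __construct(
        private readonly DocumentRepository $repository,
        private readonly TfIdfCalculator $tfidf,
        private readonly MessageBusInterface $bus,
    ) {}

    public function __invoke(ComputeTfIdfMessage $message): void
    {
        // Reload the document from its ID, compute its vector, then chain the next step via the bus.
        $document = $this->repository->get($message->id);
        $vector = $this->tfidf->compute($document);

        $this->bus->dispatch(new ComputeSimilarityMessage($vector));
    }
}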

Benefits

strong decoupling
resilience (retry, queue)
native parallelization
Symfony standard

Structural Limitations

The problem quickly becomes apparent:

➡️ the message becomes an artificial envelope

We manipulate:

IDs
persistent states
indirect transitions

Yet the underlying problem is simply:

data → transformation → data

This introduces:

boilerplate
implicit dependencies
a loss of overall readability

Approach 2: Data-Oriented - Orchestration via Navi (synchronous) and Flow (asynchronous)

In the Data-Oriented implementation, the pipeline is expressed as an ordered sequence of actions applied to a context.

There is no message.

There is no dispatch.

There is only:

a piece of data
a context
a sequential transformation

Execution Model

Command → WorkflowRunner → Actions → PipelineService → Data

In concrete terms:

a command triggers a workflow
the WorkflowRunner executes a list of actions
each action transforms a Context
the business services are identical to those used with Messenger

WorkflowRunner
→ PipelineStageAction[]
→ Context
→ PipelineService

Pipeline Structure

The pipeline is explicitly defined as a sequence:

[IngestAction,
 PreprocessAction,
 FeatureBuildAction,
 ClassificationAction,
 ClusteringAction]

Each action:

takes a Context
applies a transformation
returns a new Context

Nature of the Context

The Context becomes the central object:

it contains the pipeline status
it evolves at each stage
it is inspectable

Context₀ → Context₁ → Context₂ → ... → Contextₙ
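The chain of contexts can be sketched with an immutable Context and invokable actions. The names Context, PipelineStageAction, WorkflowRunner, IngestAction and PreprocessAction come from the article, but every signature below (`with()`, `run()`, the `data` array) is an assumption made for illustration and not the actual Navi or Flow API. The action bodies are placeholders.

```php
<?php

// Immutable context: each action returns a new instance instead of mutating.
final class Context
{
    /** @param array<string, mixed> $data */
    public function __construct(public readonly array $data = []) {}

    public function with(string $key, mixed $value): self
    {
        $data = $this->data;
        $data[$key] = $value;

        return new self($data);
    }
}

interface PipelineStageAction
{
    public function __invoke(Context $context): Context;
}

final class IngestAction implements PipelineStageAction
{
    public function __invoke(Context $context): Context
    {
        // Placeholder: a real action would load repos.json here.
        return $context->with('documents', ['Demo repo README']);
    }
}

final class PreprocessAction implements PipelineStageAction
{
    public function __invoke(Context $context): Context
    {
        $tokens = array_map(
            fn (string $doc): array => explode(' ', strtolower($doc)),
            $context->data['documents'],
        );

        return $context->with('tokens', $tokens);
    }
}

final class WorkflowRunner
{
    /** @param list<PipelineStageAction> $actions */
    public function run(array $actions, Context $context): Context
    {
        foreach ($actions as $action) {
            $context = $action($context);
        }

        return $context;
    }
}

$initial = new Context();
$final = (new WorkflowRunner())->run([new IngestAction(), new PreprocessAction()], $initial);
```

Because `with()` never mutates, the initial context is still empty after the run, and every intermediate context remains a valid value that can be inspected on its own.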

Concepts introduced by Flow

This approach introduces:

explicit actions
a runner
an evolving context

Data → Action → Data

Unlike Messenger:

no message
no handler
no bus

Observability and debugging

The debugging process changes completely in nature:

Debug = sequence of actions + context snapshots

Benefits:

visible execution order
inspectable intermediate state
deterministic pipeline
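Snapshot-based debugging can be illustrated with a tiny runner that records the context after each action. This is a sketch under assumptions: the function `runWithSnapshots`, the use of plain arrays as context and the two placeholder actions are all hypothetical, and do not describe how Navi or Flow expose their state.

```php
<?php

/**
 * Runs named actions in order and keeps a snapshot of the context after each one.
 *
 * @param array<string, callable> $actions action name => fn (array $context): array
 * @return array{0: array, 1: array<string, array>} final context, then snapshots by action name
 */
function runWithSnapshots(array $actions, array $context): array
{
    $snapshots = [];
    foreach ($actions as $name => $action) {
        $context = $action($context);
        // PHP arrays are copied by value, so each snapshot is frozen at this step.
        $snapshots[$name] = $context;
    }

    return [$context, $snapshots];
}

[$final, $snapshots] = runWithSnapshots([
    'ingest' => fn (array $c): array => $c + ['documents' => ['Demo repo']],
    'preprocess' => fn (array $c): array => $c + ['tokens' => [['demo', 'repo']]],
], []);
```

The keys of `$snapshots` give the execution order, and comparing two consecutive snapshots shows exactly what a given action added.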

Nature of readability

The pipeline can be directly read as a stream:

ingest → preprocess → features → classification → clustering

Without structural transformation.

Structural Overhead

The cost introduced is different:

need for a Context
abstraction via actions

But:

no envelope
no bus detours
no break in the data flow

Summary

This approach transforms the pipeline into:

a series of data transformations

It favors:

immediate readability
direct transformation of data
absence of envelope
deterministic pipeline
ease of testing

Limitations

less suitable for complex distributed systems
requires strict discipline regarding the purity of the transformations
tooling less standard than Messenger

Direct Comparison

Criteria	Message-Oriented	Data-Oriented
Mental model	Events / Messages	Data streams
Readability	fragmented	linear
Overhead	high (messages, handlers)	low
Scalability	excellent	depends on the design
Debug	indirect	direct
Business coupling	weak but diffuse	strong but explicit

Aspect	Messenger	Navi / Flow
Central unit	Message	Context
Orchestration	Bus + Handler	Runner + Actions
Flow	indirect	direct
Debug	message-centric	data-centric
Overhead	message + handler	action + context
Pipeline	encapsulated	explicit

Key point: the illusion of complexity

In the case of text mining, each step is:

pure
deterministic
functional

Examples:

TF-IDF → simple mathematical formula
Cosine similarity → normalized dot product
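To show how small such a step is, here is a sketch of cosine similarity as a pure function over sparse vectors (term => weight). The function name and the convention of returning 0.0 for an all-zero vector are choices made for this example.

```php
<?php

/**
 * Cosine similarity between two sparse vectors (term => weight).
 *
 * @param array<string, float> $a
 * @param array<string, float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    // Dot product: only terms present in both vectors contribute.
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        $dot += $weight * ($b[$term] ?? 0.0);
    }

    // Euclidean norm of a vector.
    $norm = fn (array $v): float => sqrt(array_sum(array_map(fn ($w) => $w * $w, $v)));
    $denominator = $norm($a) * $norm($b);

    // Convention chosen here: an all-zero vector is similar to nothing.
    return $denominator > 0.0 ? $dot / $denominator : 0.0;
}
```

The function depends only on its two arguments, so it can be called directly, tested in isolation and reordered freely, with no message or persisted state involved.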

There is no natural need for messages.

The introduction of Messenger is therefore an architectural decision, not a business necessity.

Main Insight

Message-oriented transforms data into events.

Data-oriented transforms data into data.

In a system like this:

Message-Oriented adds a layer
Data-Oriented reveals the model

Implications for Symfony

Symfony is evolving towards:

async
workers
sidekicks (FrankenPHP)
distributed orchestration

But this raises a fundamental question:

👉 Does everything have to be orchestrated via messages?

The answer depends on the problem.

When to use each approach