This article is an extended and refined version of my previous post on this topic published at DZone.
Data Processing Patterns
Every application somehow deals with data. After all, this is why applications are necessary in the first place.
In the case of backend applications, there are well-known stable patterns for how code deals with data. Everyone knows about CRUD, for example.
But there is another set of patterns present in the vast majority of backend applications. Unlike traditional software design patterns, which revolve mostly around code and much less around data, these patterns are purely data-driven. To be precise, they are defined by the dependencies between different pieces of data circulating inside the application.
Data Dependencies
A very high-level look at backend applications shows almost the same picture regardless of the application itself and the technologies used to implement it: a set of entry points (Inputs), some kind of persistence layer and/or external services (Outputs), and a blob of code that transforms data between these two sets (Application Logic):
The large blob in the middle is where all the magic happens, and exactly this part is the focus of the discussion below.
Each entry point takes some data (parameters, possibly optional) as input and produces some data (a response) as output. Let's forget about parameters for the moment and look at how the entry point generates the response: in the vast majority of cases, the entry point requests one or more pieces of data from external sources (databases, internal or external services, etc.) and then transforms them into a response. Once we abstract away the details of data retrieval and focus only on the data itself, we immediately notice that the response depends on the data received from external sources. This dependency can be one of two kinds:
- Response can be created only when the data from all external sources is successfully obtained. This is a dependency of type All.
- Response can be created as soon as data from any external source is successfully obtained. This is a dependency of type Any.

Note that the dependencies described above do not require that the data obtained from an external source actually be used to generate the response. For example, CRUD's Update or Remove entry points usually need only the operation status: success or failure. Nevertheless, the data dependency still exists: we can't generate the response until we receive data from the external source.
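To make the two kinds of dependencies more tangible, here is a minimal sketch in plain, synchronous Java. The service and type names are hypothetical and exist only for illustration: the profile endpoint has an All dependency (it needs both pieces of data), while reading user data from either a primary or a replica source is an Any dependency (one successful source is enough).

```java
// Hypothetical types and services, for illustration only.
record UserData(String name) {}
record UserLogin(java.time.Instant at) {}
record UserProfileResponse(UserData data, UserLogin lastLogin) {}

interface UserService     { UserData getUserData(String id); }
interface ActivityService { UserLogin getLastLogin(String id); }

class ProfileEndpoint {
    UserService primaryUsers;
    UserService replicaUsers;
    ActivityService activity;

    // All: the response can be produced only after BOTH pieces of data are obtained.
    UserProfileResponse profile(String id) {
        UserData data  = primaryUsers.getUserData(id);
        UserLogin login = activity.getLastLogin(id);
        return new UserProfileResponse(data, login);
    }

    // Any: data from EITHER source is enough; the next source is tried only on failure.
    UserData userData(String id) {
        try {
            return primaryUsers.getUserData(id);
        } catch (RuntimeException primaryFailed) {
            return replicaUsers.getUserData(id);
        }
    }
}
```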
What is even more interesting is that these patterns repeat inside the external sources we're calling. In most cases, they retrieve data from other external sources, transform it, and return it to us. This happens again and again until we reach an external source that actually holds the necessary data. If we look at the whole picture from the application entry point down to every source of data, we will see a tree of data dependencies, or a Data Dependency Graph (DDG). The external sources that actually hold data and don't depend on other external sources are the leaves of this tree. For convenience, I'll call them terminal sources.
So, the whole tree has a root (the application entry point) and, through intermediate nodes (services' methods), goes down to the terminal sources.
The diagram below shows an example of such data dependencies:
Each rounded rectangle on the diagram contains the name of the service and the name of the data structure (or table name) it returns to the requester. The direction of the arrows represents the data transfer direction.
Data Dependency Graph Representation
The diagram shown above is good for illustration, and something similar might be convenient to draw on a whiteboard. In some cases, though, a pure text representation is more convenient. We can write the All and Any patterns as functions that take dependencies as parameters. The returned value is our response. For example, the DDG nodes from the diagram shown above can be written as follows:
UserProfileResponse = All(UserProfile, LastUserLogin)
UserProfile = All(UserData)
LastUserLogin = All(UserLogin)
UserLogin = All(user_logins)
Sometimes, for brevity, it is convenient to substitute the intermediate dependencies and write the whole DDG as a single function:
UserProfileResponse = All(All(UserData), All(All(user_logins)))
While this representation omits some useful information (the names of intermediate data, for example), it makes the DDG depth explicit and contains only terminal sources as dependencies. The DDG depth is an important property for architecture analysis, as we'll see below.
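For readers who prefer code, one possible (purely illustrative) encoding of a DDG in Java, together with a depth computation, could look as follows. The article does not prescribe any particular encoding; the Node type and names below are assumptions made for this sketch.

```java
import java.util.List;

// One possible encoding of a DDG node: a named node combines its dependencies
// with either All or Any; a terminal source simply has no dependencies.
enum Kind { ALL, ANY }

record Node(String name, Kind kind, List<Node> deps) {
    static Node terminal(String name) { return new Node(name, Kind.ALL, List.of()); }

    // DDG depth: terminal sources have depth 0, every other node
    // is one level above its deepest dependency.
    int depth() {
        return deps.isEmpty() ? 0
             : 1 + deps.stream().mapToInt(Node::depth).max().getAsInt();
    }
}

class DdgExample {
    public static void main(String[] args) {
        // UserProfileResponse = All(All(UserData), All(All(user_logins)))
        Node userData      = Node.terminal("UserData");
        Node userLogins    = Node.terminal("user_logins");
        Node userProfile   = new Node("UserProfile", Kind.ALL, List.of(userData));
        Node userLogin     = new Node("UserLogin", Kind.ALL, List.of(userLogins));
        Node lastUserLogin = new Node("LastUserLogin", Kind.ALL, List.of(userLogin));
        Node response      = new Node("UserProfileResponse", Kind.ALL,
                                      List.of(userProfile, lastUserLogin));

        System.out.println("DDG depth: " + response.depth());   // prints 3
    }
}
```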
Some Terminology
Data Dependency Forest (DDF): a set of DDGs, where each DDG represents one application entry point
Data Dependency Analysis (DDA): the process of building a DDG or DDF
DDA and Architecture of Existing Applications
It's easy to notice that the DDG is rather abstract: it does not depend on the implementation language, framework, or packaging (monolith or microservices). Legacy applications often have neither documentation nor tests and are very often written...let's say...in an outdated style. Building a DDF for such an application is somewhat tedious work, but once it's done, we obtain a valuable source of information:
- We get information about the application's internal structure
- We get information about how different pieces of data interact with each other

With this information in hand, it's much simpler to start reworking the application into something more modern and maintainable, or even to rewrite it from scratch.
DDA and Architecture of New Applications
Since the DDG is abstract, we can build it for an application that does not yet exist. Once the architecture is detailed enough to discover dependencies between data, we can build the DDF for the application. From the DDF, we can find:
- How to split the whole application into services
- How dependencies interact with each other and where potential bottlenecks may appear
- Estimates of how long the processing of each request will take (discussed below)

Needless to say, this information is useful during design, as it allows us to avoid many pitfalls and costly mistakes. By rearranging data, it is possible to optimize the architecture before the first line of code is written.
Estimation of Request Processing Time
Obviously, obtaining dependencies takes time. Some time is necessary to retrieve data from a terminal source; then, as the data propagates up the DDG, some communication overhead is added at each step. By collecting these times across the whole DDG, we can estimate how long processing will take. Note that there is a difference in how time is calculated for synchronous and asynchronous handling of All and Any dependencies.
Synchronous Handling
The estimate for obtaining a dependency of type All is the sum of the estimates for each data dependency. Since processing is synchronous, we obtain all dependencies one by one, hence the total time is the sum of the individual times. If processing is somehow parallelized, the time is reduced to the same estimate as for asynchronous processing.
The estimate for obtaining a dependency of type Any is the time to retrieve the first dependency plus the time for each subsequent dependency tried because the previous one failed. This is because we have to retrieve dependencies one by one until one of them succeeds.
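As a small illustration, assuming three made-up per-dependency estimates, the synchronous numbers could be computed like this (the values are purely hypothetical):

```java
import java.util.Arrays;

// Hypothetical illustration of the synchronous estimates (all times in ms are made up).
class SyncEstimateDemo {
    public static void main(String[] args) {
        long[] times = {5, 8, 3};   // per-dependency estimates

        // All: dependencies are fetched one by one, so the times simply add up.
        long allSync = Arrays.stream(times).sum();          // 16

        // Any: we stop at the first success; best case the first attempt succeeds,
        // worst case every attempt before the last one fails.
        long anyBest  = times[0];                           // 5
        long anyWorst = Arrays.stream(times).sum();         // 16

        System.out.printf("All=%dms, Any best=%dms, Any worst=%dms%n", allSync, anyBest, anyWorst);
    }
}
```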
Asynchronous Handling
Usually, asynchronous processing mechanisms provide dedicated methods to obtain All and Any results in parallel. This results in a significant reduction of processing time.
The estimate for obtaining a dependency of type All is the longest estimate among the data dependencies. All dependencies are retrieved in parallel, so all of them are available once the longest-running one is obtained.
The estimate for obtaining a dependency of type Any is equal to the estimate for the first successful data dependency. Again, all of them are processed in parallel, and once one of them returns success, there is no need to wait for the remaining ones.
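In Java, for example, the CompletableFuture API provides such combinators (thenCombine/allOf for All, anyOf for Any). The sketch below is only an illustration with hypothetical fetch methods, not code from the article:

```java
import java.util.concurrent.CompletableFuture;

// Sketch: how All/Any can map onto Java's CompletableFuture combinators.
// fetchProfile and fetchLastLogin are hypothetical non-blocking calls.
class AsyncHandlingSketch {
    CompletableFuture<String> fetchProfile()   { return CompletableFuture.supplyAsync(() -> "profile"); }
    CompletableFuture<String> fetchLastLogin() { return CompletableFuture.supplyAsync(() -> "last-login"); }

    // All: both results are required; completion time is roughly that of the slower call.
    CompletableFuture<String> allExample() {
        return fetchProfile().thenCombine(fetchLastLogin(), (p, l) -> p + " + " + l);
    }

    // Any: the first completed result is taken; completion time is roughly that of the faster call.
    // Note: anyOf completes with the first future to finish even if it failed,
    // so strict "first success" semantics would need extra error handling.
    CompletableFuture<Object> anyExample() {
        return CompletableFuture.anyOf(fetchProfile(), fetchLastLogin());
    }
}
```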
Simple Example of Request Processing Time Estimation
Let's try to estimate the request processing time for the DDG shown above, repeated here for convenience:
UserProfileResponse = All(UserProfile, LastUserLogin)
UserProfile = All(UserData)
LastUserLogin = All(UserLogin)
UserLogin = All(user_logins)
Let's assume the following average access times for terminal sources and network delays:
Access time for Auth SaaS = 5ms
Access time for Activity Repository = 2ms
Average communication delay between services = 3ms
Synchronous Version
UserLogin time = 2ms (access) + 3ms (network) = 5ms
LastUserLogin time = UserLogin time + 3ms (network) = 8ms
UserProfile time = 5ms (access) + 3ms (network) = 8ms
UserProfileResponse time = LastUserLogin time + UserProfile time = 16ms
Asynchronous Version
UserLogin time = 2ms (access) + 3ms (network) = 5ms
LastUserLogin time = UserLogin time + 3ms (network) = 8ms
UserProfile time = 5ms (access) + 3ms (network) = 8ms
UserProfileResponse time = max(LastUserLogin time, UserProfile time) = 8ms
Even in this simple example, the asynchronous version is twice as fast as the synchronous one.
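For completeness, here is the same arithmetic expressed as a small, hypothetical Java snippet; the constants simply mirror the assumed access times above:

```java
// A hypothetical check of the arithmetic above (all times in ms).
class EstimateDemo {
    static final long AUTH_SAAS_ACCESS     = 5;
    static final long ACTIVITY_REPO_ACCESS = 2;
    static final long NETWORK_HOP          = 3;

    public static void main(String[] args) {
        long userLogin     = ACTIVITY_REPO_ACCESS + NETWORK_HOP;   // 5
        long lastUserLogin = userLogin + NETWORK_HOP;               // 8
        long userProfile   = AUTH_SAAS_ACCESS + NETWORK_HOP;        // 8

        long syncResponse  = lastUserLogin + userProfile;           // All, synchronous: sum = 16
        long asyncResponse = Math.max(lastUserLogin, userProfile);  // All, asynchronous: max = 8

        System.out.println("sync = " + syncResponse + "ms, async = " + asyncResponse + "ms");
    }
}
```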
Conclusion
Data Dependency Analysis is a new approach to architecture design and analysis. While it already looks very interesting and useful, research in this direction is still incomplete. In particular, the part that allows the systematic synthesis of code for DDG entry points is still in progress. DDA might also be useful for refactoring and performing code reviews, but these areas also need deeper research.