<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vladimir Podolskiy</title>
    <description>The latest articles on DEV Community by Vladimir Podolskiy (@remit).</description>
    <link>https://dev.to/remit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1155028%2F6d350d45-bb78-498b-b3dd-197dc1ad7765.png</url>
      <title>DEV Community: Vladimir Podolskiy</title>
      <link>https://dev.to/remit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/remit"/>
    <language>en</language>
    <item>
      <title>Reverse Engineering for the Good: From the Source Code to the System Blueprint (Part II)</title>
      <dc:creator>Vladimir Podolskiy</dc:creator>
      <pubDate>Sun, 17 Sep 2023 11:53:21 +0000</pubDate>
      <link>https://dev.to/remit/reverse-engineering-for-the-good-from-the-source-code-to-the-system-blueprint-part-ii-3357</link>
      <guid>https://dev.to/remit/reverse-engineering-for-the-good-from-the-source-code-to-the-system-blueprint-part-ii-3357</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/remit/reverse-engineering-for-the-good-from-the-source-code-to-the-system-blueprint-part-i-1665"&gt;the previous part of the article&lt;/a&gt;, we briefly discussed the explanatory mindset that one requires to get better at reverse engineering of complex software systems. We have also tipped our toes into the nasty waters of reconstructing the meaning of the source code with the help of data state transition diagrams and pseudocode. In this second and last part of the article, I’d like to build on top of reverse engineering for the code fragments and focus on getting the software system’s blueprint. As always, I’m summarizing my personal experience, so it may or may not align with how you tackle this complex topic. Please add anything that you find relevant to this topic and particularly curious in the comments section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reverse Engineering for the System Blueprint
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data and Control Flow diagrams
&lt;/h3&gt;

&lt;p&gt;I believe that most of us have drawn some kind of data or control flow diagram when we wanted to better understand how a piece of software works. These diagrams are natural to draw for software systems. Since any program performs operations on data according to some set of conditions, a software system can be represented as various entities that pass data to each other. In the following paragraphs, I’ll refer to these entities as operating entities.&lt;/p&gt;

&lt;p&gt;There is no academic definition behind this term, so I’ll try to explain it on an intuitive level. An operating entity is something that has a well-defined function in the system or, sometimes, multiple functions. Importantly, this function (or functions) should not be shared with any other operating entity. Let’s discuss an example. Imagine a class that handles incoming requests for some persisted data. Upon getting a request, this class puts the request into a queue. Next, some other code within the same class polls requests from the queue to process them and persist the results. Although the polling part of the code might be implemented within the class and does not necessarily have its own encapsulation within some other class, it is the only part of the code that performs polling from the queue and decides how to process the polled request. In that sense, depending on the complexity of the logic, one could wrap this code as a Request Processor operating entity or maybe as two entities, namely, Requests Poller and Requests Dispatcher (if the processing is excluded and handled separately). Again, it does not matter if these entities are not implemented in the code. When reconstructing a data/control flow diagram, you do not invent anything new; you just collect the pieces of functionality and make them visible and addressable. ‘Purification’ and naming of such operating entities is an important prerequisite for creating a comprehensive and clean data/control flow diagram.&lt;/p&gt;
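&lt;p&gt;To make the example above concrete, here is a minimal Python sketch of such a class. All the names (RequestHandler, the poller, the dispatcher) are hypothetical; the point is only to show how several operating entities can hide inside a single class:&lt;/p&gt;

```python
from collections import deque

class RequestHandler:
    """Hypothetical class that hides several operating entities inside itself."""

    def __init__(self):
        self.queue = deque()   # shared data structure connecting the entities
        self.persisted = []    # stand-in for the persistent storage

    # Operating entity #1: accepts incoming requests and enqueues them.
    def handle(self, request):
        self.queue.append(request)

    # Operating entity #2 ("Requests Poller"): the only code polling the queue.
    def poll(self):
        return self.queue.popleft() if self.queue else None

    # Operating entity #3 ("Requests Dispatcher"): decides how to process
    # the polled request and persists the result.
    def dispatch(self):
        request = self.poll()
        if request is not None:
            self.persisted.append(f"processed:{request}")
        return request
```

&lt;p&gt;Even though all three pieces of functionality live in one class, each of them deserves its own box on the data/control flow diagram.&lt;/p&gt;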

&lt;blockquote&gt;
&lt;p&gt;How can you determine whether you have overlooked an operating entity? If you experience challenges in connecting the operating entities that you’ve found so far, then, most likely, there is some other entity implemented in the code that you have not yet discovered. A challenge might take the form of having entities A and B, where B clearly processes data that was at some point processed by A, yet you struggle to find a concise description for the data flow arrow that connects them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When operating entities have been discovered, list them all and briefly describe which function(s) they perform. You may also find it useful to list the data that these operating entities might be communicating through, like some shared data structures or communication channels (if the software is distributed). This will be useful when you start drawing the actual diagram.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The process of discovering operating entities in the code might remind you of refactoring, and, indeed, it is quite similar. Nevertheless, implementing operating entities that are not directly represented in the source code might not be warranted for performance reasons or due to the sheer overhead of making a small piece of functionality pronounced and addressable. In any case, while performing reverse engineering, we do not make any actual changes to the code, regardless of how appealing they may appear.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although the academic literature advises us to keep the data and control flow diagrams separate (&lt;a href="https://en.wikipedia.org/wiki/Data-flow_diagram"&gt;DFD&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Control-flow_diagram"&gt;CFD&lt;/a&gt;), I personally prefer to combine them for reverse-engineered software systems. The advantage of merging them is that it allows one to see the triggers of data flows and of other operations on the data (e.g. data removal upon expiration, although one could model that as a negative/removal data flow). To distinguish between the control flow and the data flow on the same diagram, I use different types of arrows. Solid arrows depict how the data moves from one operating entity in the system to another. Dashed arrows depict the control flow, like triggering the events relevant for data processing. Such an arrow starts at the entity that triggers the event and ends at the recipient of the event.&lt;/p&gt;

&lt;p&gt;Now let’s look at the example data/control flow diagram below. Although I have not yet explained everything shown there, you will find it useful to take a look first and then read the explanation. This will give you a reference point to keep in mind.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nmSjrMhW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v81ppr1wxd7peivtbocp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nmSjrMhW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v81ppr1wxd7peivtbocp.png" alt="Data and Control Flow Diagram" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let me take some time to explain what you just saw. &lt;/p&gt;

&lt;p&gt;An additional layer of information on a data/control flow diagram is provided by the coloring. If you are colorblind, you can use various kinds of fill (solid, dashed, dotted, etc.) in addition to, or instead of, color.&lt;/p&gt;

&lt;p&gt;I prefer to start with a separate color for the entities that initiate data processing in the system. Such an entity is likely to be external, commonly a client, or some internal asynchronous process, like a cleanup or some other scheduled task. In the diagram above, I’ve used red to highlight these entities (Client, Index Updater, and Expiration Service). Note that the client is external to the system, so you may want to make this explicit in your diagram by giving it a different shape than the internal entities or by drawing an additional bold line between it and the rest of the system (a system boundary). In addition to the client, there are two operating entities that run asynchronous processes in the system itself: the Expiration Service, which triggers the removal of expired versions of the data from the persistent storage, and the Index Updater, which reads the persisted data from time to time and updates the search index.&lt;/p&gt;

&lt;p&gt;The other operating entities, whose boxes are colored yellow and green on the sample diagram, are reactive, meaning that they do some processing only when there is data to process or when they are explicitly triggered. One could color all of them in the same way, but I wanted to distinguish between the layer that the client directly interacts with and the deeper parts of the system. Although this is not necessary, you may find it useful to somehow emphasize (with a color or otherwise) a specific entity or a group thereof if, let’s say, they perform common functions or if you want to distinguish between the entities on the producing and consuming ends of some queue.&lt;/p&gt;

&lt;p&gt;Check the different coloring below. The composition (boxes and arrows) is exactly the same. However, instead of emphasizing that two of the entities are in direct contact with the external client, we focus on two reactive entities (namely, In Handlers and Internal Dispatcher) participating in the production of requests for the queue and three other entities (namely, Requests Processor, Multiversion Reader, and Out Handlers) participating in the consumption of requests from the queue. This kind of coloring makes the producer-consumer pattern implemented in this system more pronounced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--em2Ytpz_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rgy9p9g4krsigt859fa4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--em2Ytpz_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rgy9p9g4krsigt859fa4.png" alt="Data and Control Flow Diagram Split by Producer/Consumer" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you have probably noticed (and, if not, take another look at the diagram), in addition to the operating entities, the diagram also depicts the data entities that act as sources or destinations for the data flow arrows. Depicting these data entities explicitly along with the operating entities turns out to be very useful for understanding the system. Most likely, the data entities will be represented as data structures in the code (maybe from some library, or even as custom classes).&lt;/p&gt;

&lt;p&gt;There are three data entities in the diagram: the Client Requests Queue, the Search Index, and the Persistent Storage. The queue has its own drawing style: a box containing the queued elements. I prefer to draw queues like that to convey the semantics of this data entity graphically - it shows where different parts of the system communicate with each other.&lt;/p&gt;

&lt;p&gt;You can also spot that I’ve used slightly different coloring across the data entities. All of them use the same color (blue), but two entities are filled with strokes and one has a solid fill. Why is that? Well, that simply reflects the fact that the queue and the search index are in-memory structures whereas the persistent storage is… persisted (stored on disk). This difference in fill highlights another piece of semantics that is relevant to how the system tolerates faults, which might or might not be important for your reverse engineering goals. In the very same style, you could distinguish between encrypted and unencrypted data entities if security is a cross-cutting concern that you want to emphasize in your diagrams. One could also duplicate the diagram for each cross-cutting concern and ‘recolor’ the boxes in each of them.&lt;/p&gt;

&lt;p&gt;That was a lengthy discussion of coloring but now you can see how color and fill can be used to represent various kinds of information on the diagram. Next, let’s focus on the logic of the data/control flow.&lt;/p&gt;

&lt;p&gt;In the example diagram, it is pretty intuitive to deduce the sequence in which the data passes through the operating entities. The data flow starts at the client (remember that the client is the originator of a request, or a proactive operating entity) and proceeds through the system. At some point the data lands in the queue, which it later leaves to be processed by the request processor. The request processor needs additional pieces of data to process the request: a specific location on the disk (let’s call it a bucket) to fetch the data from, and the profile info fetched from the disk. Here things become slightly ambiguous, although still understandable. Does the request processor get the bucket first and then the profile info? Or both at the same time? The first assumption is the correct one. The request processor needs to first learn which bucket of the persistent storage to query; only then can it query the data from the storage.&lt;/p&gt;

&lt;p&gt;When you reverse engineer a software system, you will likely encounter cases where the correct sequence is not all that evident. For these cases, I number the arrows with small encircled numbers that highlight the sequence in which the data flows through the system and in which the events are triggered. Sometimes, one needs to show simultaneous or unordered data flows, e.g. if a component extracts data from multiple sources and combines it in the scope of one meaningful logical step. Then, one can use the same number on multiple arrows. In the diagram below, I do this with ‘3’, which labels both the arrow that puts the profile info request into the queue and the arrow that notifies the request processor that it can poll from the queue. Such a numbered and chained representation comes in quite handy when you need to reflect the data/control flow in a system full of asynchronous calls and callbacks. Such systems require a considerable amount of mental effort to understand, so we’ll not cover them in depth in this article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tvep34at--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l1m8cgc8gm7iksumuy9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tvep34at--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l1m8cgc8gm7iksumuy9w.png" alt="Data and Control Flow Diagram with numbered arrows" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that we have not numbered all of the arrows. Indeed, how would you number an arrow that originates at some operating entity internal to the system? Or another arrow, originating at the client, that represents a write operation? Should it precede the read operation (“fetch profile info”) or not? These are all valid questions, but they have nothing to do with how we represent the flow of data or control on the diagram. They arise because the above diagram does not do its job well: it shows multiple distinct data/control flows at once! The one that we have numbered is a read flow (more specifically, the fetch profile info flow, because there might be more read flows). However, there are two more. Both originate from the asynchronous operating entities - the expiration service and the index updater. Depending on your goals, you may leave them in the same diagram or you may want to drag them into a separate one, since they meet only at the data entities; they depend on the same data, but in different ways.&lt;/p&gt;

&lt;p&gt;To address the issue of intersecting data/control flows, I prefer to draw multiple data/control flow diagrams - one for each such flow. Otherwise, the diagrams tend to get cluttered and thus hard to navigate and reason about. So, below is how I’d dissolve the previous diagram into three: the profile info read flow, the index update flow, and the data expiration flow. As you can see, some operating entities are repeated across these diagrams; however, it becomes easy to reason about each flow separately. The disadvantage of introducing a separate diagram per flow is that it becomes slightly more difficult to find the dependencies between different flows, and, as practice shows, most of the obscure design issues and bugs lurk at the intersection of multiple flows (usually, those operating on shared data). Hence, one may want to preserve the bigger diagram along with the sub-diagrams for every flow to clearly see where the flows intersect, e.g. by relying on some shared data like the search index or the persistent storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eJLV-qy_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y8nhr0kortkcwb69p97t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eJLV-qy_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y8nhr0kortkcwb69p97t.png" alt="Data and Control Flow Diagram disaggregated by process" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Space-time diagrams (aka Lamport diagrams)
&lt;/h3&gt;

&lt;p&gt;Let’s admit it, engineering distributed systems is far from easy. Therefore, a good diagram is worth a thousand words in that domain. If your software system falls into the distributed category, then you need to reflect this aspect in your diagrams. When dealing with the distributed aspects, we depart from the functionality-first view of this article. The reason is that some characteristics of your system that contribute to its ultimate value proposition, like reliability, rely on its distributed design.&lt;/p&gt;

&lt;p&gt;The key aspect of a distributed system that needs to be properly understood and visualized is replication. In a nutshell, replicated data is the same data repeated on several distinct pieces of virtual or physical infrastructure, so, kind of like having multiple copies of the same data spread across several servers.&lt;/p&gt;

&lt;p&gt;Why do we replicate the data? There are a couple of reasons: first, to increase fault tolerance; second, to achieve higher performance. We will ignore the second aspect and focus on the first.&lt;/p&gt;

&lt;p&gt;What makes replication hard? Ideally, one wants to modify the data and have every copy of this data be exactly the same at every point in time; that is, one wants all the replicas to be consistent with each other. At the same time, people are somewhat reluctant to have a single point of failure, e.g. a server that all the operations pass through. On the side of limitations, there is the ‘hardcoded’ limit on the speed of light and various kinds of communication failures that just happen. As a result, a modification applied to one of the replicas (copies) might get lost or delayed on its way to another replica. Therefore, engineers have to figure out the mixture of design goals that best satisfies the use cases valid for the designed system. These requirements and limitations then manifest as a replication algorithm in the code of your system. It might be external (in some library) or it might be custom-built. Either way, if you offer a stateful distributed system, you have to deal with the task of replication and with ensuring some sort of consistency.&lt;/p&gt;

&lt;p&gt;So, how best to describe the replication algorithm implemented in the system that you reverse-engineer?&lt;/p&gt;

&lt;p&gt;For the purpose of fault-tolerance, it is helpful to describe how the replication is handled in various operational scenarios. It is usually easier to start with a happy path where no crashes and network partitions occur, and the system operates as intended. Then, one proceeds to specify how the replication algorithm handles common failures e.g. an instance going down or a communication being temporarily interrupted. Let’s begin with the normal operation scenario.&lt;/p&gt;

&lt;p&gt;There is a certain kind of diagram that helps to represent distributed algorithms in a clear and concise form - definitely desirable properties, given their inherent complexity. Such diagrams are called space-time diagrams or Lamport diagrams (named after the Turing Award laureate Leslie Lamport). These diagrams may remind you of sequence diagrams, but they introduce the notion of request (call) delays and the possibility of an entity crash.&lt;/p&gt;

&lt;p&gt;A Lamport diagram focuses only on the system instances that maintain replicas of the same data entity. So, even if your system is deployed on hundreds of servers, it is not necessary to bring all of them into the diagram; you just need as many as your replication factor (the total number of replicas maintained for each data entity) is set to be. This number is rather small in trusted environments. In addition to the servers, each maintaining a replica of a data item, you will need to depict a client that serves as the origin of requests to your system and the destination of its responses. Sometimes, requests may be purely internal; then one of the instances of your system acts as the client.&lt;/p&gt;

&lt;p&gt;Each server (also called a node) and the client gets its own horizontal line, with its name to the left of the line. This line represents the ‘lifetime’ of the server or client; it is like a local time axis of the entity. Time in this diagram flows from left to right. If your system relies on shared storage, like AWS S3, then it should also appear on such a diagram. For the sake of keeping the discussion short, we’ll focus on a shared-nothing distributed system design, i.e. one where no two servers have common storage. Below is an example of nodes and a client depicted on such a diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qDgfSUlS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/11dyrn1hbpszw8ogbpt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qDgfSUlS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/11dyrn1hbpszw8ogbpt4.png" alt="Image description" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s assume that the system that we reverse-engineer implements one of the most common replication approaches - primary-backup replication. In this kind of replication, each node that keeps the replicated data item assumes one of the following roles: coordinator or follower. The roles might also be named differently, e.g. master and follower, or primary and backup.&lt;/p&gt;

&lt;p&gt;The coordinator role requires the node to order the updates to the replicated data that is present on the coordinator and on the followers. Intuitively, it should be clear that only one such node should be present per replicated data entity to avoid conflicting updates. The coordinator role can be transferred to another node if the original coordinator fails or becomes unresponsive. The nodes with the follower role can only perform writes that are issued by the coordinator node.&lt;/p&gt;
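&lt;p&gt;A minimal Python sketch of these two roles is shown below. It is illustrative only - a real implementation would involve networking, persistence, and failure handling - and the class and method names are assumptions, not taken from any particular codebase:&lt;/p&gt;

```python
class Node:
    """Illustrative node in a primary-backup replication scheme."""

    def __init__(self, name):
        self.name = name
        self.store = {}       # the node's replica of the data
        self.followers = []   # non-empty only on the coordinator

    # Coordinator role: orders the update and forwards it to every follower.
    def coordinator_write(self, key, value):
        self.store[key] = value
        for follower in self.followers:
            follower.follower_write(key, value)

    # Follower role: applies writes only when they come from the coordinator.
    def follower_write(self, key, value):
        self.store[key] = value

# One coordinator (A) and two followers (B, C), as in the diagrams below.
a, b, c = Node("A"), Node("B"), Node("C")
a.followers = [b, c]
a.coordinator_write("X", 42)
```

&lt;p&gt;Note that the single entry point for writes (the coordinator) is exactly what gives the updates a total order.&lt;/p&gt;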

&lt;p&gt;Depending on the specific replication behavior that you want to depict, you may color the label of the node to reflect its role, or you may draw an appropriately colored box around the local time axis of the entity. The former helps to avoid clutter when analyzing common scenarios, whereas the latter helps when you want to accurately represent replication behavior in the presence of changing node roles (useful for debugging). Both options are shown below. We will focus on the first one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bb-WlCOM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qt3iwd9pa0qrzyv1w9pn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bb-WlCOM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qt3iwd9pa0qrzyv1w9pn.png" alt="Image description" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vTkePWSh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pyya5sfp1tsvjw02u7dx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vTkePWSh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pyya5sfp1tsvjw02u7dx.png" alt="Image description" width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Probably the easiest way to find the code implementing the replication is to search for the requests that the nodes exchange to replicate the data, or to track how new data propagates through the system. You can deduce whether a piece of code belongs to the master (coordinator) or the follower (backup) role depending on whether it is triggered by a request that originates from the client or from another node of the system (the master). Figuring out the distribution of roles in replication algorithms might be the most challenging problem, but it is still solvable. Since there is only a handful of replication algorithms commonly used in software systems, after getting acquainted with them you will have an easier time matching the code to the algorithm it implements.&lt;/p&gt;

&lt;p&gt;Once you’ve figured out the roles of the nodes, say master and follower, you need to track the requests that they send to each other as well as the conditions for sending and accepting these requests. Ideally, you would find the original request from the client that hits the master and depict it on the diagram with an arrow. Then, it should be possible to track all the subsequent requests and responses of the other nodes. Be mindful of the ordering of these requests, because the guarantees and the performance that you can deduce from the diagram will depend on how you depict this communication.&lt;/p&gt;

&lt;p&gt;In our example, primary-backup replication starts with the client sending a request to the coordinator node, that is, node A. Once node A is done processing the request and persisting its results, it may not reply to the client straight away but will instead send the result of this request to both nodes B and C. Node A will also wait for the acknowledgements from these nodes prior to confirming to the client that the data has been successfully modified. In the diagram below, the acknowledgements are shown with dotted arrows labeled ACK Write X, whereas the actual writes are shown with solid arrows labeled Write X. As you can see, the whole write issued by the client can take quite a while because the acknowledgement from node B was delayed (notice the write time depicted with the dashed green line). It is not necessary to depict the delays like that - you may want the diagram to be more general. However, be aware that delays can rearrange the order of many requests and responses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5ggrhZiJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ac4ze543q9ejfps0ia25.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ggrhZiJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ac4ze543q9ejfps0ia25.png" alt="Image description" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let’s change this diagram slightly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BN_3nzow--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m1q2vgrhgk87hnz2b7kx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BN_3nzow--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m1q2vgrhgk87hnz2b7kx.png" alt="Image description" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Can you see what changed in this diagram compared to the previous one? In the new diagram, the ACK Write X arrow from Node A to the client directly follows the Write X arrow from the client on the time axis of node A; that is, Node A does not wait for acknowledgements from nodes B and C. In practice, this means that the guarantee on the write that the client performed is different. With the algorithm reflected in the new diagram, when the client gets an acknowledgment of its write, the data is only on Node A, whereas in the previous diagram it meant that the data was present on all the nodes: A, B, and C. Is that bad? In terms of fault tolerance - yes. However, if we consider the latency of client operations, the second option is actually faster. Compare the length of the segment of Node A’s timeline labeled write time to the segment with the same label in the previous diagram. The write time segment in the new diagram is considerably shorter. If your system needs to perform thousands or tens of thousands of such writes per second, then this option of waiting for only one acknowledgement may become very appealing since the client gets its response quickly. On the other hand, the cost of this latency reduction might be inappropriate for your software if the use case requires guaranteeing some level of fault tolerance when one of the nodes crashes after the client got the acknowledgement of the write operation. With that in mind, let’s dive into representing the failure scenarios with Lamport diagrams.&lt;/p&gt;
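&lt;p&gt;The tradeoff between the two diagrams can be sketched as follows. Both functions are hypothetical; each simply reports how many replicas hold the value at the moment the client acknowledgement is issued:&lt;/p&gt;

```python
def write_sync(nodes, key, value):
    """All replicas are written before the client ack: durable but slower."""
    for node in nodes:
        node[key] = value
    # At ack time, every replica already has the data.
    return sum(1 for node in nodes if node.get(key) == value)

def write_async(nodes, key, value):
    """Only the coordinator (nodes[0]) is written before the ack: fast but risky."""
    nodes[0][key] = value
    # At ack time, only the coordinator has the data.
    replicas_at_ack = sum(1 for node in nodes if node.get(key) == value)
    for node in nodes[1:]:   # background replication after the client ack
        node[key] = value
    return replicas_at_ack
```

&lt;p&gt;With the synchronous variant, the acknowledgement guarantees that every replica has the data; with the asynchronous one, it guarantees only that the coordinator does - which is precisely the difference between the two diagrams.&lt;/p&gt;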

&lt;p&gt;You could imagine the following failure scenario, which a Lamport diagram can represent fairly well: Node A acknowledged the write to the client but then immediately crashed without completing the replication to the other followers. This might be a plausible scenario for the system that you attempt to reverse-engineer, and it thus becomes very important to catch such a case and depict it. With such diagrams, you will learn a lot about the ways in which you may be losing data and whether that aligns with the guarantees that you provide to the users of your system. In the diagram below, the crash of Node A is represented with a red crossed circle, and the timeline of this node ends abruptly at that circle. Node B then becomes the coordinator for X, which is depicted with its repeated label colored blue. Notice that a subsequent read of X by the client from node B returns nothing because node A failed before replicating the data to nodes B and C. The same could have happened if node A had succeeded in replicating X to node C but not to node B.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JH-kftmp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aftbqqtxwtoy22vgc05x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JH-kftmp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aftbqqtxwtoy22vgc05x.png" alt="Image description" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;
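&lt;p&gt;The same scenario can be sketched in a few lines of Python. The dict-based stores and node names are, of course, assumptions made purely for illustration:&lt;/p&gt;

```python
# Toy sketch of the crash scenario: Node A acknowledges the write,
# then dies before replicating it to its followers.
stores = {"A": {}, "B": {}, "C": {}}

def write(key, value):
    stores["A"][key] = value  # the coordinator persists locally
    return "ACK"              # ...and acknowledges the client right away;
                              # replication to B and C would only start now

def crash(node):
    stores.pop(node)          # the node and its local data are gone

ack = write("X", 42)          # the client believes X is safely stored
crash("A")                    # Node A dies before replicating X
print(stores["B"].get("X"))   # None: Node B, the new coordinator, has no X
```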

&lt;p&gt;In addition, the same situation may occur even if Node A does not crash but the client is allowed to read from the followers. Below, you can see an example where the replication to the followers was not fast enough, so the client could not read the value of X that it wrote previously since it chose a follower for its read (maybe due to some load balancing mechanism). For the sake of convenience, the parts of the nodes’ timelines shown in red represent the fact that the value of X has not yet been written to those nodes. With that, it becomes obvious why the node that the client attempts to read from returns that it does not know X.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dgmHhaNT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ohodkwcthgfy27n1tydb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dgmHhaNT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ohodkwcthgfy27n1tydb.png" alt="Image description" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Such mishaps occur because, from the point of view of the client, all three nodes act as a single virtual entity, also called a replication group. Such an interface makes it possible to hide the concern of fault tolerance from the client but, as you can see, depending on how fault tolerance is implemented, the client may get different guarantees on what to expect from the system.&lt;/p&gt;

&lt;p&gt;The topic of distributed systems and guarantees is very vast; we’ve just scratched the surface in this article. The intention was to show how certain kinds of diagrams can help you reverse engineer the distributed algorithms that your software implements. The challenge is that you have to take into account both the code that embodies the happy path and the possible edge cases with various kinds of failures occurring in your system. Thus, it usually makes sense to create the happy path diagram first and then introduce various kinds of failures and see how the diagram changes in response to them. From these diagrams you will also get another piece of information that becomes very important at the next stage of reverse engineering: the guarantees that the system provides to its use cases with respect to the data stored in it, and which properties of the system are prioritized (possibly at the cost of others).&lt;/p&gt;

&lt;h3&gt;
  
  
  Concise written statements about the system
&lt;/h3&gt;

&lt;p&gt;All the previous tools were instrumental in reaching this pinnacle of reverse engineering. We started at the code level and then gradually climbed up, peeling off everything that does not directly relate to the logic of the system. We’ve been reducing the level of detail while increasing the scope with each new diagram and description. At the very end of the reverse engineering exercise, an engineer should be able to describe the reverse-engineered system in a set of concise statements. These written statements might be quite diverse.&lt;/p&gt;

&lt;p&gt;Among the most important categories I could list the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;guarantees to the application/use case of the system, like “System Foo supports strong consistency for the client data” or “System Foo stores the client data in an encrypted format”;&lt;/li&gt;
&lt;li&gt;key system architecture choices, like “System Foo implements distributed system architecture based on shared storage” and “System Foo replicates client data using chain replication approach”;&lt;/li&gt;
&lt;li&gt;interfaces available to the use cases/applications, like “A client of System Foo can connect and perform reads and writes of its own state”;&lt;/li&gt;
&lt;li&gt;summary of the key functionality, like “System Foo dispatches the client requests to the internal queues by using the client ID and the key of the request to look up the destination queue ID in the Queue Index”.&lt;/li&gt;
&lt;/ul&gt;
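&lt;p&gt;To show how the last kind of statement maps back to code, here is a hypothetical sketch of the Queue Index lookup it describes. Every name here (the index contents, the queue IDs) is invented for the example:&lt;/p&gt;

```python
# Hypothetical Queue Index: maps (client ID, request key) to the
# destination queue ID, as the example statement describes.
queue_index = {
    ("client-1", "orders"): "q-7",
    ("client-2", "orders"): "q-3",
}
queues = {"q-7": [], "q-3": []}

def dispatch(client_id, key, request):
    # Look up the destination queue ID by client ID and request key...
    queue_id = queue_index[(client_id, key)]
    # ...and place the request into that internal queue.
    queues[queue_id].append(request)
    return queue_id

print(dispatch("client-1", "orders", {"op": "write"}))  # q-7
```

&lt;p&gt;A single concise statement like this compresses the whole dispatch path into one sentence that the team can reason about without rereading the code.&lt;/p&gt;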

&lt;p&gt;These concise statements will serve you and your team as a reminder of how the system works at the highest level and what it offers its users. Concise statements also help bring one’s attention to the relevant pieces of logic and behaviors of the system. Note that using these statements in discussions requires that the participants have a similar level of understanding of what hides behind each of them. It is also of great value to understand what each of these statements means for the design of the system in terms of opportunities and limitations, as well as for the services that the system is able to perform for its users. In other words, the participants of an architectural discussion need to share the same context. Otherwise, there is a real danger that the discussion turns out to be too superficial or abstract for those who lack this context.&lt;/p&gt;

&lt;p&gt;Let’s have a closer look at the example statement “System Foo implements distributed system architecture based on shared storage”. Having read this statement, an engineer will recall both the fundamental advantages and the disadvantages of this design choice. He or she will also quickly discover what implications this choice has for the availability of the system and for replication speed on cluster composition changes. Such statements are a very powerful tool to shape the discussions around key aspects of the system design in the team and thus should be carefully crafted based on the evidence collected in the previous steps. Ideally, you would want to refine and agree on these statements as a team since they will likely stick with you for a while and thus need to be very precise; in addition, some of these statements may be converted into marketable value propositions for the product of the company if it sells software or offers SaaS solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instead of Conclusion: Key Takeaways
&lt;/h2&gt;

&lt;p&gt;I will not repeat myself on the use of specific diagrams and representations for the task of reverse engineering. Instead, I’d like to revisit the overall approach to it.&lt;/p&gt;

&lt;p&gt;First and foremost, you need to cultivate an explanatory mindset in yourself to succeed in reverse engineering. You can acquire it by trying to simplify your explanations of how the software works, gradually reducing the amount of detail and using more and more general terms. Reverse engineering is a bottom-up approach, but you may also find it useful to compare the results of your reverse engineering exercise with the documentation that is already available for the software system (if you are lucky enough to have it). Given that we live in an age of abundant, versatile software systems that gradually get rooted into various business processes of companies, this will be one of the most sought-after skills in software engineering.&lt;/p&gt;

&lt;p&gt;Another important point is that you need to think outside the box when trying to unveil the design blueprint of the software system. Try different diagrams and representations; do not limit yourself to whatever you have seen in this article or in other sources (like UML diagrams and whatever is taught at the university). Being creative and at the same time focused will help you get a clear picture of the software that you are working on and will take you far in improving and expanding it.&lt;/p&gt;

&lt;p&gt;Reverse engineering is a destination, but at the same time it is also a journey. While wrestling with unfamiliar code and trying to figure out better and more concise representations, you will also rewire numerous connections in your brain and start thinking differently about the software system that you work on. This is a necessary part of professional growth that you will likely miss if you only work on hobby or greenfield projects. What will help you on this path is binding the reverse engineering exercise to reading various theoretical books that expose you to software design patterns, data structures, and so on. However, those should not be read in isolation from the reverse engineering activity. Just by reading them you won’t learn much; you may even make yourself worse off because of the time spent reading while detached from the code and the practice of programming. One should strive to avoid this at any cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Note on Other Approaches
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Written concise statements are good but what about ADRs?
&lt;/h3&gt;

&lt;p&gt;I’ve used &lt;a href="https://adr.github.io/"&gt;ADRs&lt;/a&gt; for documenting design decisions. Personally, I’ve found them practical for initiating discussions about specific design issues (RFC-style) and for documenting fresh design decisions (i.e. ones that are being made now rather than recovered from the past). I did try to use them to document past design decisions as well, but performing the archeological task this way turns sour very quickly.&lt;/p&gt;

&lt;p&gt;In my experience, ADRs are VERY verbose for documenting past design choices and not very useful for it, since one would have to read through all of these ADRs in order to restore the full state of the system design. This is very time-consuming. In addition, they are quite linear and do not let one easily highlight cross-cutting design concerns without repetition across multiple ADRs. Last but not least, in an ideal world, you’d need to interview the stakeholders who took the design decision in the past. Otherwise you’ll end up documenting your fantasies, which may be reasonable, but they are still, well, fantasies.&lt;/p&gt;

&lt;p&gt;In contrast, the approach taken in this article focuses on creating a snapshot of the state of your software system’s design and on making it as independent of minute details as possible. &lt;/p&gt;

&lt;h3&gt;
  
  
  Aren’t you reinventing UML diagrams here?
&lt;/h3&gt;

&lt;p&gt;First of all, my goal is not to invent THE diagrams (as compared to &lt;a href="https://xkcd.com/927/"&gt;THE standards&lt;/a&gt; which one just cannot get enough of) but rather to show how to produce these artifacts based on the code at hand and how to proceed from one diagram (or, more generally, a representation) to another at a higher level in a somewhat meaningful way. Indeed, the diagrams in this article resemble some of the UML diagrams and others; in the end, they revolve around the same concepts and the same relations. You may use UML diagrams if you like them (I, personally, don’t, and I did try them on multiple occasions). They are a perfectly valid tool for the task of reverse engineering as well; a bit more strict and verbose, perhaps, but still valid.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>diagrams</category>
      <category>development</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Reverse Engineering for the Good: From the Source Code to the System Blueprint (Part I)</title>
      <dc:creator>Vladimir Podolskiy</dc:creator>
      <pubDate>Wed, 06 Sep 2023 15:36:36 +0000</pubDate>
      <link>https://dev.to/remit/reverse-engineering-for-the-good-from-the-source-code-to-the-system-blueprint-part-i-1665</link>
      <guid>https://dev.to/remit/reverse-engineering-for-the-good-from-the-source-code-to-the-system-blueprint-part-i-1665</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The second part of the article is here: &lt;a href="https://dev.to/remit/reverse-engineering-for-the-good-from-the-source-code-to-the-system-blueprint-part-ii-3357"&gt;LINK&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Roots of Reverse Engineering and the Explanatory Mindset
&lt;/h2&gt;

&lt;p&gt;Reverse engineering in the public discourse refers to semi-legal practices of deconstructing a product to extract its design blueprint. This applies to software products as well. In the case of software, this process includes reconstructing design and algorithms from binaries. This is not the kind of reverse engineering that I’d like to address in this article. Instead, I’d like to focus on the special kind of reverse engineering that we, software engineers, have to deal with, willingly or unwillingly, on a daily basis. I mean our efforts to understand how the software works on the conceptual level, what it does, and why it was shaped in this specific way.&lt;/p&gt;

&lt;p&gt;The major part of the reverse engineering effort lies in reading and understanding the code. However, reverse engineering is much more than that. It also produces various artifacts that you, as an engineer, your colleagues, and even other teams could use to navigate the software system better and to modify it without breaking it. This improved understanding can lead to better outbound communication that builds more trust with the users of the software. It also makes visible the technical tradeoffs that were taken at some point in the past and that you have to abide by, or maybe reconsider if the circumstances have changed or the use cases of your system have evolved and broadened.&lt;/p&gt;

&lt;p&gt;People rarely talk about reverse engineering in this latter sense although it is, without a doubt, a critical skill when it comes to developing enterprise software systems and libraries used in production systems. This skill is overlooked by the classical academic curriculum and has to be acquired on the job or by contributing to open-source software projects. Quite frankly, there is no surefire method to acquire this skill and to get better at it. Any software engineer (unless they constantly jump between greenfield projects) has to figure out their own way of unveiling the inner workings of the software that they are assigned to develop or maintain. What definitely helps is to start using the system and to poke around it by changing various pieces of code and then observing how its behavior changes. Although this might not seem as systematic as one would hope, this approach has worked pretty well so far for most of us.&lt;/p&gt;

&lt;p&gt;Could one skip reverse engineering a system that does have fairly comprehensive and clear documentation? You can do so at your own risk. Reverse engineering in and of itself is a very useful and satisfying exercise that allows software engineers to understand the interdependencies of modules more deeply and, on top of that, to discover why this or that engineering decision has been taken. These parts can barely be represented in the documentation, and keeping them up-to-date is also a huge challenge. I’d argue that understanding the reasons behind the specific implementation is the crux of reverse engineering. Why is that? Well, because one gets down to the list of requirements that explicitly, or, what’s more important, implicitly, led to the software being implemented the way it is. In some sense, this is akin to the archeological task of reconstructing the software requirements specification (SRS) from the code.&lt;/p&gt;

&lt;p&gt;Having requirements distilled, one can track which ones fit nicely together and which may result in a design tradeoff. For the sake of clarity, imagine that during your deep dive into the source code of some software system you’ve discovered that it replicates certain kinds of data. You can deduce that the use cases for this system require it to be fault-tolerant and available in the sense that a crash of an instance of that system would not render the whole deployment unusable and at the same time, even if the node never recovers, the data will still be preserved. When done digging a bit more, you realize that every read operation waits for the majority of the replicas to respond. Wow, isn’t it a contradiction? Doesn’t it sound like a reduction in availability? Indeed, it is. Say, if the majority of nodes goes down or gets separated from the rest of the system, the system won’t be able to read or modify the data. However, this does not affect the capability of the system to tolerate the node crashes in a sense of data not being lost. Why would waiting for the majority to perform operations even be required? This is likely due to the use cases that require the system to offer a high level of consistency between replicas of the data. One could imagine this behavior to be quite beneficial for financial organizations like banks where the cost of data inconsistency is monetary.&lt;/p&gt;
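&lt;p&gt;The majority-based reasoning above can be condensed into a tiny sketch. The replica names and the single quorum rule are illustrative assumptions; real systems typically distinguish read and write quorums:&lt;/p&gt;

```python
# Toy sketch of the majority rule: operations need a quorum to proceed,
# but data survives on the remaining replicas regardless.
REPLICAS = {"A", "B", "C", "D", "E"}

def majority_available(alive):
    """Reads and writes proceed only if a majority of replicas responds."""
    return len(alive) > len(REPLICAS) // 2

# A minority of nodes is down: the system still serves requests.
print(majority_available({"A", "B", "C"}))  # True
# The majority is down or partitioned away: no reads or writes succeed,
# yet the surviving replicas still hold their copies, so no data is lost.
print(majority_available({"A", "B"}))       # False
```

&lt;p&gt;This is exactly the tradeoff distilled from the code: availability of operations is sacrificed under large failures, while durability of the data is not.&lt;/p&gt;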

&lt;p&gt;Finding such requirements and formulating them clearly enough to spot the contradictions is rarely achieved by just looking at the source code. The engineer has to perform a deliberate and highly focused mental trip that starts in the code and then climbs up the abstraction ladder, mercilessly stripping away what is not relevant to the logic being reconstructed. Along this way, one produces a sequence of explanations of what the source code really does, how, and why.&lt;/p&gt;

&lt;p&gt;One usually starts with a fairly complex explanation and gradually peels off more and more details to simplify it. Naturally, this leads to a loss in accuracy the further along one is in this process. On the positive side, the capability to produce simpler yet coherent explanations helps the engineer form a better understanding of the software system. A sophisticated reader might recognize here an approach popularized by Richard Feynman and developed further by David Deutsch in “The Beginning of Infinity”. I’d go as far as to proclaim that reverse engineering requires one to have an explanatory mindset, that is, wanting and being able to discover or forge explanations of various phenomena.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;reverse engineering requires one to have an explanatory mindset&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But enough of philosophy. Let’s proceed to the practical tips on how to reverse engineer a software system.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Bird’s Eye View on Reverse Engineering the Software Systems
&lt;/h2&gt;

&lt;p&gt;In most of the common programming languages, one would encounter a cross-cutting concern that gets in the way of reverse engineering and that has to be cut off as long as it is not in the spotlight of the analysis: performance optimizations. Performance optimizations, as you might have guessed, aim at improving the overall performance of the system. Performance itself is pretty much an umbrella term that can be viewed from the application perspective (throughput in operations per second, latency of each operation, etc.) and the resource perspective (CPU usage, memory usage, disk throughput, etc.). Sometimes, both aspects are joined to form more comprehensive characteristics like throughput per 1% of CPU used. In addition, they may be expressed as a monetary cost in SaaS and IaaS offerings.&lt;/p&gt;

&lt;p&gt;Optimizing performance in the source code boils down to implementing some kind of ‘better’ resource management. The ‘better’ part here stands not for something that is necessarily ‘universally better’ but for something that is perceived as better for the use cases of this specific software system. If, for instance, the use case puts a premium on reliability of the data, more intensive disk usage and higher CPU utilization might come as an additional price to pay since the system will try hard to pack all the critical operations into transactions that persist the data to stable storage. On the other hand, if the use case demands pushing operation latencies as low as possible, high memory usage should not be a surprise since the latency/size ratio for RAM makes it appealing for caching that hides expensive disk operations.&lt;/p&gt;

&lt;p&gt;Thus, the first step on the path of reverse engineering is to carefully peel away all the ‘unimportant’ parts: caches, various encoding schemes that optimize memory usage, thread pools, transaction logic, and so on. In this step, one has to pay special attention to the data structures. Instead of focusing on how a data structure works, consider which interface it offers. By interface here I mean the set of operations that the surrounding code depends on to perform its piece of the job. One could imagine using a B-tree or a trie to map some keys to some values, but in the end what matters to the description of the system is that a specific key maps to a specific value (or a set of values) and that this value can at any time be retrieved using this key. Taking this explanation a step further, one could represent this mapping as a discrete function. However, this level of representation is rarely needed and can be, in a sense, too abstract, so that the context of the specific use of the data structure is lost.&lt;/p&gt;
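&lt;p&gt;A short sketch of this interface-first view (the names are invented for illustration): the caller depends only on the mapping operations, so a B-tree, a trie, or a plain hash map could back the index interchangeably.&lt;/p&gt;

```python
from typing import Optional, Protocol

class KeyValueIndex(Protocol):
    """The interface the surrounding code depends on; names are invented."""
    def put(self, key: str, value: int) -> None: ...
    def get(self, key: str) -> Optional[int]: ...

def record_latest_offset(index: KeyValueIndex, topic: str, offset: int) -> None:
    # The caller is oblivious to whether a B-tree, trie, or hash map backs it.
    index.put(topic, offset)

class DictIndex:
    """A hash-map-backed stand-in; a B-tree would satisfy the same interface."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

idx = DictIndex()
record_latest_offset(idx, "events", 128)
print(idx.get("events"))  # 128
```

&lt;p&gt;For the system blueprint, only the Protocol matters; the concrete structure belongs to the performance layer that we peel away.&lt;/p&gt;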

&lt;p&gt;In the remaining part of the article I describe a few tools/approaches that I personally find very useful for the exercise of reverse engineering a complex codebase. Feel free to take it or leave it; these have proven to be very useful in my work. The tools are arranged in order of decreasing level of detail, with an occasional twist towards the specifics of reverse engineering distributed systems.&lt;/p&gt;

&lt;p&gt;The tools and techniques are roughly split into two categories. The first one covers the tools used when one starts with some code snippet and has to build their way up the ladder of abstractions. The second covers the tools used when one needs to glue descriptions of multiple code pieces together into a coherent system description. It is appealing to name this second category the tools for reverse engineering the architecture. Since the use of the term architecture is opinionated beyond any reason, I’d leave it up to the reader to get along with this name or to simply refer to these tools as the ‘second category’ or ‘tools up the abstraction stack’.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reverse-engineering for the Code Fragments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  State Change Diagrams
&lt;/h3&gt;

&lt;p&gt;You might be familiar with state diagrams from your university years. There, you might’ve encountered them in compiler design or computation theory classes. Although this tool has a very ‘academic’ flavor, it can, in fact, be repurposed for the first analysis step on the path to understanding a codebase. The academic literature frequently focuses on the state of the program, which allows for very comprehensive descriptions but is mostly applicable to well-scoped algorithms with firm boundaries. Unfortunately, this is rarely the case in most practical software systems. So, instead, let us focus on the state of the data when showing how state diagrams can be applied to reverse-engineering software systems.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the software systems domain, one sometimes refers to the data as state. This is quite misleading in the context where we discuss how the pieces of data change their values which can also be referred to as state (state of the data). In addition to that, there is also the state of the program which includes both the data and supportive register values like program counter etc. As you can see, the term state is overused. Hence, in this text we will refer to the state of the data unless stated otherwise. You can imagine some variable, let’s say X, which changes its state (aka value) from 0 to 1 and then to 10 or something like that.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before heading off to drawing circles and arrows, a hallmark of such diagrams, it is customary to find which data the program modifies during execution. Usually, this is an easy task for most programming languages: you need to look no further than the coded data structures. Most importantly, be on the lookout for the data structures whose lifetime does not end with the method call. Although local variables might be useful for understanding how the algorithm works, they are frequently there only for the sake of the developer’s convenience or for the sake of optimization. Their presence on your diagrams will likely bloat them, so try to keep them away as much as possible or prune them when you refine your diagrams iteratively. All in all, class variables might be the best place to start if the codebase happens to follow the object-oriented programming (OOP) paradigm.&lt;/p&gt;

&lt;p&gt;Not all data is created equal. Upon discovering all (or the majority) of the pieces of data, you would likely need to filter out those that perform some supporting function, like compare-and-swap primitives serving to organize accesses to a piece of state shared across multiple threads. I call this negative filtering since one focuses on removing what is unnecessary. Those removed pieces might still be interesting to consider when talking about performance and safety, but they are fairly orthogonal to the task of initially understanding what the code actually accomplishes for the use cases of the system. Once you are done pruning, you will likely discover that the final list of the data is not very long. Indeed, most of the data manipulation tasks that the code performs require just a handful of data structures, possibly various sorts of indices or intermediate representations.&lt;/p&gt;

&lt;p&gt;You may as well arrive at pretty much the same list of data entities with positive filtering, by focusing on keeping the data that is (possibly through a long chain of operations) transformed into the user-visible data. Personally, I don’t like using positive filtering in isolation since it may drag along a lot of data that in the end has a purely supportive function. In contrast, negative filtering propels one to always ask oneself whether this specific data structure should really be represented in the diagram. Remember that you can hardly overdo the removal part: if by pure accident you’ve removed some critical piece of data from consideration, you will discover it at the next step when trying to make sense of the state change diagram. Personally, I find it better to add details at later stages than to end up with a barely readable diagram.&lt;/p&gt;

&lt;p&gt;Another important piece of data in the system that you don’t want to overlook is the one persisted to the disk. Actually, the persisted data might be the most important in the system from the business point of view. Its presence on the disk means that the use cases care about the durability even though persisting to disk is more expensive than keeping the data in main memory. Such data must also be included into the diagram.&lt;/p&gt;

&lt;p&gt;Once you’ve found the most important data structures/entities, it becomes really helpful to think of the program as something that mutates or changes the state of this data. Essentially, all the coded operations serve the purpose of modifying the state of the data in some way. An operation may update only one variable or multiple at once, i.e. atomically. Ideally, you’d want to group logically related operations even if they change multiple variables at once and even if they are not, strictly speaking, atomic.&lt;/p&gt;

&lt;p&gt;You then draw some shape around these grouped operations on multiple data items. Personally, I prefer circles, but it could be a rectangle or something else like a cloud - whatever is more readable for you and/or your team. If you happen to have multiple data structures/variables changed, then you may explicitly bring them into the diagram. This would make your ‘states’ (grouped operations with a shape around them, e.g. a circle) look more like the basic blocks used in compiler design but, well, such a notation serves us well enough.&lt;/p&gt;

&lt;p&gt;You may also want to distinguish between the past and the future in your grouped and encircled operations. Usually, it can easily be deduced from the position of the assignment sign: everything to the left of it is the future state (i.e. after the operation on the right hand side is performed) and everything on the right side refers to the past state. However, if your logically grouped operations use modified values straight away, then it might make sense to somehow distinguish between them explicitly. Like many distributed systems practitioners, I prefer to add a tiny Ꞌ symbol next to the name of variable that represents its future state, like this: &lt;strong&gt;NꞋ = N + 1&lt;/strong&gt;. The below picture shows how the &lt;strong&gt;X&lt;/strong&gt; variable gets incremented and the new value of &lt;strong&gt;Y&lt;/strong&gt; is computed based on the updated value of &lt;strong&gt;X&lt;/strong&gt; and the current value of &lt;strong&gt;Y&lt;/strong&gt;. You can see now why Ꞌ might be important in such diagrams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllppfthi7a95zo7zb3i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllppfthi7a95zo7zb3i4.png" alt="Example group of operations"&gt;&lt;/a&gt;&lt;/p&gt;
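&lt;p&gt;The grouped update from the picture can also be written out in code. The formula for the new Y is an assumed example (the text only says that it uses the updated X and the current Y); tuple assignment plays the role of the Ꞌ notation by making explicit which values are ‘past’ and which are ‘future’:&lt;/p&gt;

```python
# The grouped update as code: X' = X + 1, and Y' is computed from the
# new X and the current Y. The concrete formula for Y' is an assumed
# example, not taken from the diagram.
X, Y = 3, 10

# Tuple assignment evaluates the whole right-hand side against the
# current ("past") values before any variable changes, mirroring
# the prime notation: X' = X + 1, Y' = X' + Y.
X, Y = X + 1, (X + 1) + Y

print(X, Y)  # 4 14
```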

&lt;p&gt;If the piece of logic in one of the circles becomes too big to be meaningfully represented in the diagram, you can always denote it with some label like &lt;strong&gt;S0&lt;/strong&gt; and write down the operations that occur in that group somewhere else. Personally, I’d rather go for a slightly bigger figure like a rectangle to avoid forcing the reader (most likely, myself) to look at multiple different places to understand the whole picture.&lt;/p&gt;

&lt;p&gt;Some data state modifications might be conditional, i.e. something must have happened or some condition needs to be met in order for the state to get updated. Such conditions are a part of a so-called control flow.&lt;/p&gt;

&lt;p&gt;You can see it on the diagram below, which is a state diagram sketch of a simple program that writes batches of events of &lt;strong&gt;Batch_Size&lt;/strong&gt; size to the file. For whatever reason, the original program uses a separate variable called &lt;strong&gt;N&lt;/strong&gt; to track the number of events in the list. It first performs the operations in the circle pointed at by the hanging arrow (notation used to recognize the starting state of an automaton), that is, it assigns zero to &lt;strong&gt;N&lt;/strong&gt; and initializes the list &lt;strong&gt;L&lt;/strong&gt; of events to be an empty list. Upon arrival of the first event &lt;strong&gt;E1&lt;/strong&gt;, the program performs an increment on &lt;strong&gt;N&lt;/strong&gt; and appends this first event to the list &lt;strong&gt;L&lt;/strong&gt;. Then, the self-loop in the diagram signifies that upon getting new events, the same operations are performed, namely, the length of the list is incremented and the new event is appended to the list. Once the length of the list becomes &lt;strong&gt;Batch_Size&lt;/strong&gt;, the diagram shows that the program proceeds to write the list &lt;strong&gt;L&lt;/strong&gt; at the end of the file &lt;strong&gt;File&lt;/strong&gt;, and, once the write is over, it returns to the initialization operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjhckoy4t6u4daal2u35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjhckoy4t6u4daal2u35.png" alt="Example state transition diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since &lt;strong&gt;N&lt;/strong&gt; is only used to track the length of the list of events, we should get rid of it: it adds no value to the understanding of this piece of code. We can substitute it with a call to some function Len that returns the length of the list. Note that the programming language and its libraries might not offer such a function at all. Regardless, we do not strive to represent the code 1-to-1 in these diagrams; if we wanted a 1-to-1 mapping, we would simply skip drawing any diagrams since the code is already there. Therefore, do not be afraid to lose precision when drawing these diagrams. State diagrams should reflect the logic of the underlying process, the algorithm, rather than be an accurate graphical representation of the code.&lt;/p&gt;
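&lt;p&gt;The simplified batching logic can be sketched, for instance, in Python; the &lt;strong&gt;write_batch&lt;/strong&gt; callback and the concrete batch size here are illustrative assumptions, not part of the original code:&lt;/p&gt;

```python
# A hypothetical sketch of the batching logic from the diagram,
# with the redundant counter N replaced by len(L).
BATCH_SIZE = 3  # stands in for Batch_Size; the value is chosen for illustration

def make_batcher(write_batch):
    """Return a callback that buffers events and flushes full batches."""
    L = []  # the list of buffered events, initially empty

    def on_event(event):
        L.append(event)                # append the new event to L
        if len(L) == BATCH_SIZE:       # the batch is full
            write_batch(list(L))       # e.g. append the batch to File
            L.clear()                  # return to the initial (empty) state

    return on_event
```

&lt;p&gt;Note that the sketch keeps only what the diagram expresses: accumulation, the fullness check, and the reset; everything else (error handling, file formats) is deliberately left out.&lt;/p&gt;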

&lt;p&gt;In our diagram, new events are appended (added at the very end) to the list &lt;strong&gt;L&lt;/strong&gt;. The use of a particular data structure (such as a list or an array) might be dictated by the application requirements, by performance requirements, or by both. If it is there only for performance reasons, then try your best to find the most generic structure that keeps the same application-relevant properties, such as the uniqueness of the elements or the order in which they are traversed. In the diagram below we might have wanted to replace the list &lt;strong&gt;L&lt;/strong&gt; with something more generic, like a set. However, switching from a list to a set in this case would not be a simple matter of performance. The list gives us an additional ordering guarantee which might be crucial in the context of reconstructing the temporal order of events later on. Moreover, the list does not require its elements to be unique, unlike a set. Hence, this list can be viewed as a sequence of non-unique events, and, if these properties are important in the context of the application, then one should avoid replacing the list with another structure. On the other hand, being more specific, like whether it is a singly-linked or doubly-linked list or whether it is a blocking structure or not, won’t be of much help in clarifying the purpose of the software, so these details can often be omitted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43hiujy1wvwl9huagtj8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43hiujy1wvwl9huagtj8.png" alt="Example state transition diagram, simplified"&gt;&lt;/a&gt;&lt;/p&gt;
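&lt;p&gt;The semantic difference between the two structures is easy to demonstrate with a toy example (the event names below are made up for illustration):&lt;/p&gt;

```python
# Illustrating why replacing the list with a set would change the semantics:
# the list keeps both the arrival order and the duplicate events; a set does not.
events = ["open", "click", "click", "close", "open"]

as_list = list(events)   # a sequence of non-unique events, order preserved
as_set = set(events)     # duplicates collapsed, arrival order lost

print(as_list)           # both "click" and the second "open" survive
print(sorted(as_set))    # only three distinct events remain
```

&lt;p&gt;If the application later reconstructs the temporal order of events or counts repetitions, the collapse from five elements to three is exactly the kind of property loss the paragraph above warns about.&lt;/p&gt;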

&lt;blockquote&gt;
&lt;p&gt;In case you want to avoid implementation specifics, you may prefer to use a sequence instead of a list. However, try not to overdo it since you still want these artifacts to be useful to your colleagues who might not necessarily want to speak in high-level abstractions. Remember: too much abstraction may undermine the clarity of your explanation!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you study the state diagram that we’ve produced so far, you might spot a bug (or, rather, a design flaw). If the software that this diagram represents can be used in applications that have only a finite number of events, then it might happen that the last few events (fewer than &lt;strong&gt;Batch_Size&lt;/strong&gt; of them) will never be written to the file and will hang out in the main memory until the machine is powered down or the application crashes. Your inner voice may suggest you fix that by adding a timeout like on the diagram below. But do remember my warning: &lt;strong&gt;&lt;em&gt;DON’T try to fix bugs/design issues while doing reverse engineering&lt;/em&gt;&lt;/strong&gt;! Note the design issue down for later and ignore it. Regardless of how much accuracy you remove from the diagram in comparison to the source code, it should NEVER contain anything that is not in fact a part of the existing design. In simple words: &lt;strong&gt;&lt;em&gt;you are allowed to remove but you are not allowed to add&lt;/em&gt;&lt;/strong&gt;. Adding something which is not actually a part of the codebase is akin to ChatGPT hallucinating software libraries that never existed or books and papers that were never published. Basically, you are a sculptor working on a block of marble, trying to reveal the figure hidden inside it. You can only do this by removing the excess marble; you cannot add more of it. The purpose of reverse engineering is to reveal what is hidden, not to invent something new.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3twpnirfwr79tpypbrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn3twpnirfwr79tpypbrh.png" alt="Example state transition diagram, with added arrow that does not reflect what is in the code"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might start your dive into the code by rigorously noting down all the operations on all the variables and then find yourself face-to-face with a crammed diagram like the one below. Well, it’s just 20 circles and a handful of arrows, but it could be 50, 100, or more… To top it off, imagine all these arrows to be in fact labeled. Analyzing such diagrams could get daunting and would likely get you confused instead of providing a clear picture of what’s going on in the code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyaoj2b1tcc7uyyrul6b4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyaoj2b1tcc7uyyrul6b4.png" alt="Large state transition diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what if all these states are indeed a required part of the business logic? What if you have reached the point where you cannot shed any more of these groups of operations and arrows without sacrificing the correctness of the diagram? Unfortunately, sometimes a piece of code may attempt to do way more than it should. This haziness of the code is likely a sign of hidden abstractions, that is, abstractions that have not been spelled out but which determine how the implementation is structured and evolves. How can we deal with the diagrams that got bloated because of this?&lt;/p&gt;

&lt;p&gt;Personally, I try very hard to split complex state change (transition) diagrams across the semantic dimensions present in the code. One of the most useful semantic dimensions in my practice is the dimension of performed roles. One can understand this intuitively by imagining the software as performing multiple roles like Data Storer (code that is responsible for storing the data), Data Replicator (code that is responsible for replicating the data), Notificator (code that is responsible for notifying other subsystems). This allows us to speak about the same code in its different roles which narrows the focus of technical discussions. Introducing such roles explicitly in the code itself might not even be required since it frequently adds complexity and might not fit well with the rest of the classes. However, for the purpose of building higher-level abstractions, role distillation may yield very helpful results.&lt;/p&gt;

&lt;p&gt;With the roles figured out, a diagram might be split into multiple simpler ones - one for each role. Below, we can see the previous complex diagram being split into three roles that have multiple transitions between them. The corresponding state change diagrams for each role contain the connected state from another role marked with Ꞌ. Having these states interspersed into a different role helps to contextualize the state transitions within the role and how they depend on another role. This is particularly useful to describe the operations that transform the state and precede the transition into a different role, like, for example, filling up a buffer to send as a Data Bundler and then proceeding to send it as a Reliable Sender.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcf16t8qzdnrh22gsb70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffcf16t8qzdnrh22gsb70.png" alt="Large state transition diagram split into multiple with roles added"&gt;&lt;/a&gt;&lt;/p&gt;
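&lt;p&gt;To make the role split concrete, here is a minimal Python sketch of the two roles mentioned above. The role names (Data Bundler, Reliable Sender) come from the diagram; the class and method names, the buffer size, and the retry count are purely illustrative assumptions, not taken from any real codebase:&lt;/p&gt;

```python
# Hypothetical sketch: one tangled state machine split into two roles.

class ReliableSender:
    """Role: deliver a bundle, retrying on failure."""
    def __init__(self, transport, retries=3):
        self.transport = transport  # callable returning True on success
        self.retries = retries

    def send(self, bundle):
        for _attempt in range(self.retries):
            if self.transport(bundle):
                return True
        return False  # delivery gave up after all retries

class DataBundler:
    """Role: accumulate items until the bundle is full, then hand it over."""
    def __init__(self, bundle_size, sender):
        self.bundle_size = bundle_size
        self.sender = sender
        self.buffer = []

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) == self.bundle_size:
            # Transition into the Reliable Sender role (the state marked
            # with the prime on the other role's diagram).
            self.sender.send(list(self.buffer))
            self.buffer.clear()
```

&lt;p&gt;Each class now corresponds to one of the simpler per-role diagrams, and the &lt;em&gt;send&lt;/em&gt; call marks exactly the cross-role transition that the interspersed primed state depicts.&lt;/p&gt;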

&lt;p&gt;Once we are done with the low-level state transition diagrams, we may proceed to spell out what algorithm these diagrams (or parts of them) represent. We’ll do so with pseudocode.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pseudocode
&lt;/h3&gt;

&lt;p&gt;Chances are that you are quite familiar with pseudocode if you’ve studied computer science/computer engineering or encountered some research papers/books on algorithms and data structures. You may wonder why pseudocode comes after the state transition diagram. Why wouldn’t you want to gradually abstract the details by writing down pseudocode that focuses on the most important pieces of data and logic and then express it as a diagram for a better understanding?&lt;/p&gt;

&lt;p&gt;First of all, the order in which you create these representations is a matter of perspective and taste. However, my personal reason for starting with the state transition diagram and then proceeding to the pseudocode is that the pseudocode then gets compact and focused. This comes in very handy in the later stages and also when you consider redesigning your software system. Concise pseudocode snippets can be implemented in the program code without all the bells and whistles of the original implementation and thus can serve as a ground to experiment with different concepts, data structures, algorithms, etc. Concise pseudocode also makes for clearer communication since, on one hand, it does not carry unnecessary complexity with it, and, on the other hand, your teammates might have an easier time studying the pseudocode and transforming it into the real implementation than wrestling with elaborate diagrams.&lt;/p&gt;

&lt;p&gt;When talking about pseudocode, it is very useful to pay attention to the algorithms that actually compute something rather than simply move the data around or perform search operations over it. If the code that you reverse-engineer contains computations that amount to more than simple counting (incrementing, decrementing), then it makes a lot of sense to restore the mathematical formula behind these computations. This formula should then be placed into the pseudocode, substituting for the sequence of actions used to compute the end result. Consider the following pseudocode snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mean := 0
for eventsCntPerGroup in eventsCountsTotal:
  mean += eventsCntPerGroup

mean := mean / groupsCnt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although this is only a few lines of code, the idea of the mean events count per group is obscured by the implementation details. In contrast, the formula is compact:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;mean = ∑&lt;sub&gt;groups&lt;/sub&gt; e&lt;sub&gt;group&lt;/sub&gt; / n&lt;sub&gt;groups&lt;/sub&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Essentially, you sum the event counts over all the groups and divide the result by the number of groups. Such a formula is easily readable and allows one to clearly visualize how the data state (the variable &lt;strong&gt;mean&lt;/strong&gt;) would change depending on the changes in its constituents (the number of groups and the events per group).&lt;/p&gt;
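&lt;p&gt;Translated back into code, the formula collapses the accumulation loop into a single expression; the sample values below are, of course, purely illustrative:&lt;/p&gt;

```python
# Mean events count per group, computed directly from the formula
# instead of the explicit accumulation loop in the pseudocode above.
events_per_group = [4, 7, 5, 8]          # e_group for each group (made-up data)
n_groups = len(events_per_group)         # n_groups
mean = sum(events_per_group) / n_groups  # sum over groups divided by n_groups
print(mean)
```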

&lt;p&gt;Formulas better capture the behavioral aspect of the application and sometimes they might be way more expressive than the pseudocode. However, as with every explanation, try not to overdo it and take into account how well-versed the reader of your explanation is in math. You could probably expect some understanding of the basic concepts in mathematical statistics and of basic mathematical operations like summation and product. More complex concepts like integrals and differentials might require some written clarifications. One can also add a graph that explains the relation captured by the formula or explain it using textual description or an example. There are plenty of tools like &lt;a href="https://www.desmos.com/calculator" rel="noopener noreferrer"&gt;Desmos&lt;/a&gt; that allow one to visualize formulas and play with parameter values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion of Part I
&lt;/h2&gt;

&lt;p&gt;I’d like to draw a line under this lengthy first part of the article here. In my opinion, state change diagrams and pseudocode are essential and at the same time sufficient to get a fairly good understanding of what various bits and pieces of the codebase try to achieve. Although these representations will get you far in analyzing and improving some well-scoped parts of the codebase, unfortunately, they won’t provide you with a holistic picture of how all these fragments fit together and what impact a change in one of these fragments incurs on the rest of the system. The second part of the article will share my experience and tools for extracting the system blueprint and requirements that are desperately needed when performing large-scale changes to your codebase.&lt;/p&gt;

</description>
      <category>softwaredevelopment</category>
      <category>architecture</category>
      <category>softwareengineering</category>
      <category>diagram</category>
    </item>
  </channel>
</rss>
