<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JohnN6TSM</title>
    <description>The latest articles on DEV Community by JohnN6TSM (@johnn6tsm).</description>
    <link>https://dev.to/johnn6tsm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F124166%2F23d3955d-9bfb-402e-a240-f3c994525b5e.png</url>
      <title>DEV Community: JohnN6TSM</title>
      <link>https://dev.to/johnn6tsm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/johnn6tsm"/>
    <language>en</language>
    <item>
      <title>A Tale of 2 Codebases (Part 2 of 4): Testability</title>
      <dc:creator>JohnN6TSM</dc:creator>
      <pubDate>Wed, 05 Oct 2022 22:22:57 +0000</pubDate>
      <link>https://dev.to/johnn6tsm/a-tale-of-2-codebases-part-2-of-4-testability-1gcb</link>
      <guid>https://dev.to/johnn6tsm/a-tale-of-2-codebases-part-2-of-4-testability-1gcb</guid>
      <description>&lt;p&gt;As I discussed in &lt;a href="https://dev.to/johnn6tsm/a-tale-of-two-codebases-one-developers-reflections-on-solid-software-design-part-1-of-4-3a0p"&gt;Part 1&lt;/a&gt; the premise of this series is a simple natural experiment: comparing 2 large codebases written by the same solo programmer before and after introduction of SOLID Design principles.  &lt;a href="https://github.com/DrJohnMelville/PhotoDoc"&gt;PhotoDoc&lt;/a&gt;, the pre-intervention project, is an electronic medical record dedicated to medical forensics.  &lt;a href="https://github.com/DrJohnMelville/Pdf"&gt;Melville.PDF&lt;/a&gt; is a free, open-source PDF renderer for .NET.  In this article I will discuss testing differences between the two projects.&lt;/p&gt;

&lt;p&gt;Both projects use similar testing infrastructure.  I write unit tests in C# using XUnit.net.  I frequently use mock objects in testing, and &lt;a href="https://github.com/moq/moq4"&gt;MOQ&lt;/a&gt; is my tool of choice.  I utilize continuous testing and coverage analysis through Rider.  I do not have specific objectives for code coverage.  When writing complicated algorithms, I frequently shoot for 100% coverage of the algorithm. I test simple properties inconsistently, and frequently do not test guard clauses.&lt;/p&gt;

&lt;p&gt;I ran each project’s unit tests under the coverage analyzer in Rider and considered results only from assemblies that might actually run in production.  Thus, unit test assemblies, performance test harnesses, and apps designed only for developer use are not included.  The production assemblies of Melville.Pdf have significantly more unit test coverage than PhotoDoc (85.6% vs 30.3%, p &amp;lt; 0.0001).  I was actually quite surprised to find the low test coverage for PhotoDoc, as I have for many years considered myself to be a test driven developer.&lt;/p&gt;

&lt;p&gt;I think there are two primary reasons that Melville.PDF has significantly more test coverage, one of which is due to SOLID design, or more particularly Clean Architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson #1: Poorly chosen dependencies make testing hard.
&lt;/h2&gt;

&lt;p&gt;One of my biggest surprises in reading Clean Architecture was the assertion that the UI should be a plugin on the periphery of the system.  (Martin, &lt;a href="https://www.amazon.com/Clean-Architecture-Craftsmans-Software-Structure/dp/0134494164"&gt;Clean Architecture&lt;/a&gt; pg 172-173.)  PhotoDoc began its life as a WPF application.  (Such things were fashionable in 2007.)  If my memory serves correctly, I began the PhotoDoc project by hitting File|New Project and generating a shell of an app that popped up an empty window titled “PhotoDoc.”  PhotoDoc has forever been a WPF app, and since its original goal was to manipulate photos, it made sense to embed WPF imaging classes deep in the new application’s data model.  I now regret that decision, but despite multiple attempts, reversing such a fundamental decision has always proven to be more effort than my estimate of its worth, so I live with an awkward architecture 15 years later.&lt;/p&gt;

&lt;p&gt;That decision makes testing hard.  WPF has strong thread affinity and runs only in single-threaded apartments.  Thus my 16-processor computer runs the PhotoDoc tests one at a time.  My tests run much more slowly, and thus get run less frequently, than they might otherwise.  &lt;/p&gt;

&lt;p&gt;Single-threaded code is easier to unit test, but PhotoDoc uses multithreading extensively.  Early on I eliminated multithreading in test builds with #if DEBUG directives, and for many years I could not even run the unit tests in release mode, so my actual production code went out the door untested. &lt;/p&gt;

&lt;p&gt;The PhotoDoc business logic’s dependence on the UI interacts with my other mistake: flagrant violations of the Law of Demeter.  PhotoDoc’s metaphor is of a “patient chart” that contains “sheets” for different kinds of data.  One type of “sheet” is a folder that can contain other sheets.  Early in the design I included in each sheet a reference to the folder that contained it, and the sheets would liberally walk the folder tree looking for information they found useful.&lt;/p&gt;

&lt;p&gt;That design decision turned into a nightmare for testing.  Eventually even the simplest tests required constructing an elaborate object model.  Because many of those objects contain WPF objects, they run on STA threads, and many access system-wide resources.  These system dependencies are often hidden deep in the domain model.  In 2016 I made a major effort to simplify the system by segmenting objects and using dependency injection.  I eventually broke the reference between sheets and their parents, but did so with a more complicated IOC configuration than I would have chosen.  I did manage to cut down the 10,000 line ShellViewModel class to 1200 lines, but its 18 constructor parameters make it a real bear to construct.  I require a separate inversion of control framework just for the test cases.  It is more brittle than I would like, but at least it lets me do some automated testing.&lt;/p&gt;

&lt;p&gt;In contrast, WPF is currently one of two front ends that plug into the Melville.PDF business objects.  There are a small number of tests for the WPF functionality – and all the problems described above apply to that small subset of the test suite.  The remainder of the test suite runs multithreaded.  Inside Melville.PDF, the business logic classes rely upon a small number of abstractions that are easy to mock, and most of the unit tests run quickly and within a very narrow scope.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson #2: Integration testing can greatly improve testing effectiveness
&lt;/h2&gt;

&lt;p&gt;Another reason that Melville.PDF has so much higher test coverage is extensive use of integration testing.  Part of this is in the nature of the code – because Melville.Pdf is a rendering library, it is trivial to generate a bitmap of any page of a PDF document.  This contributed to easy integration testing.&lt;/p&gt;

&lt;p&gt;The first element of the integration testing is a collection of reference documents.  I store my collection of reference documents as a C# assembly where each class implementing the IPdfGenerator interface generates one PDF document.  This allows a hierarchy of classes to concisely describe the differences between different test documents.  Test programs use reflection over this assembly to get a tree of reference documents.&lt;/p&gt;
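&lt;p&gt;The reflection step above can be sketched as follows.  IPdfGenerator is named in this article, but its members and this discovery code are my illustration, not the actual Melville.PDF source:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Illustrative sketch: enumerate every reference document in an assembly
// by reflecting over the concrete classes implementing IPdfGenerator.
// The interface members here are assumptions.
public interface IPdfGenerator
{
    string Title { get; }
    void WritePdf(Stream output);
}

public static class ReferenceDocuments
{
    public static IEnumerable&amp;lt;IPdfGenerator&amp;gt; Discover(Assembly assembly) =&amp;gt;
        assembly.GetTypes()
            .Where(t =&amp;gt; typeof(IPdfGenerator).IsAssignableFrom(t) &amp;amp;&amp;amp; !t.IsAbstract)
            .Select(t =&amp;gt; (IPdfGenerator)Activator.CreateInstance(t)!);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;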

&lt;p&gt;The comparing reader is a WPF application that simultaneously displays the generated reference PDF files in four different renderings.  The first is a PDF renderer included in the Windows API; the second and third render with the two bindings of Melville.Pdf.  The fourth renderer shows a tree view of the PDF objects that make up the document.  The comparing reader also loads external PDF files, making it an ideal tool for investigating rendering failures, and it can open documents in Adobe’s PDF reader as well.&lt;/p&gt;

&lt;p&gt;The comparing reader is essential because the PDF rendering algorithms are complicated and high fidelity to existing renderers is essential to the purpose of the code.  (In other words, an effective PDF renderer needs to produce output that looks like all the other PDF renderers.)  The reference document generator makes it easy to establish a library of test documents that fully exercise the renderer.  The comparing reader makes it easy to ensure high-fidelity rendering of each document.  Incidentally, this test mechanism must be effective, as it uncovered numerous instances where the Microsoft renderer disagrees with Adobe’s PDF reader on how various files should appear.&lt;/p&gt;

&lt;p&gt;The comparing reader works while writing rendering code, but it does not prevent rendering regressions over time.  The RenderingTest class in the Melville.PDF.IntegrationTesting assembly renders the first page of each reference document to a PNG file using both the WPF and Skia renderers.  The resulting PNG files are hashed, and the hashes are stored in the codebase and checked into source control.  This gives a robust set of integration tests to prevent rendering regressions.  These integration tests contribute significantly to code coverage because much of the rendering code was written directly against the integration tests, without intermediate unit tests.&lt;/p&gt;
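&lt;p&gt;The hash comparison can be sketched like this.  This is my reconstruction of the idea, not the actual RenderingTest code, and the choice of hash algorithm is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Sketch: hash the rendered PNG bytes and compare against a hash
// previously recorded in source control.
public static class RenderRegression
{
    public static string HashOf(byte[] pngBytes)
    {
        using var sha = System.Security.Cryptography.SHA256.Create();
        return Convert.ToHexString(sha.ComputeHash(pngBytes));
    }

    public static bool MatchesRecorded(byte[] pngBytes, string recordedHash) =&amp;gt;
        HashOf(pngBytes) == recordedHash;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;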

&lt;h2&gt;
  
  
  Lesson 3: Single Responsibility Principle allows unit testing of private methods.
&lt;/h2&gt;

&lt;p&gt;Object-oriented design says that classes should hide their implementations.  A consumer should only depend on the public members of a class, and everything not meant for public consumption should be private to allow the class to change in the future.  Unfortunately, making things private often works against unit tests, which, absent messy reflection hacks, can only call or inspect public members.&lt;/p&gt;

&lt;p&gt;Say we have a class with this code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Sut&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;A&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;inputData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;B&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;TrickyComputation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputData&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;B&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;computedValue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="c1"&gt;//Something worth testing on its own&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Method B is going to be difficult to unit test because designing input data to method A that covers all the cases we want to test for method B may be nontrivial.  Furthermore, the public method A stores the result in a private field, and it may be difficult to get Sut to give up the private value.  This is a common pattern in PhotoDoc: groups of methods within large classes, with a private field or fields dedicated to those methods.&lt;/p&gt;

&lt;p&gt;The insight to resolve this problem comes from the Single Responsibility Principle.  We reason that if we want to unit test B alone, then B must have a responsibility of its own, and since B is private, it must not be Sut’s only responsibility.  The solution is to move B into its own class.  There are many patterns for doing so, but here is one.&lt;/p&gt;

&lt;p&gt;`&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Sut&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="n"&gt;BHolder&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; 
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;A&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;inputData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;B&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;TrickyComputation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputData&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BHolder&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;B&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;computedValue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="c1"&gt;//Something worth testing on its own&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The critical observation is that Sut’s encapsulation is not violated by this refactoring.  Sut has exactly the same public interface, but its implementation dependence on B() is hidden in a private field rather than a private method.  B() can now be trivially tested.  As a bonus, BHolder’s implementation of B is now available to other classes that may need it.&lt;/p&gt;
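&lt;p&gt;A test for the extracted class might look like the following xUnit sketch.  Because B’s body is elided above, the assertion is left as a comment rather than an invented expected value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// With B moved into BHolder, a unit test can call it directly --
// no reflection hacks and no Sut construction required.
public class BHolderTest
{
    [Fact]
    public void BIsTestableInIsolation()
    {
        var sut = new BHolder();
        int result = sut.B(1.5);
        // Assert.Equal(expectedValue, result); -- expectation depends on B's spec
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;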

&lt;p&gt;This is exactly what happened when I was writing the Lempel–Ziv–Welch (LZW) decoder for Melville.PDF.  LZW requires a stream of bytes to be read as a stream of bits.  This was a responsibility I wanted to test independently, so I made a class, BitReader, to handle it.  I tested byte-to-bit-stream conversion long before I started the LZW implementation.  Eventually, however, the CCITT and JBIG decoders also needed to read bytes as a bit stream.  This single class, with minimal changes, now serves all three decoders.  It is both reusable and easy to test because it follows the SRP – it does exactly one thing well.&lt;/p&gt;
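&lt;p&gt;The BitReader idea can be sketched as below.  The real class’s API is not shown in this article, so the signature is an assumption; PDF’s LZW filter packs codes most significant bit first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Minimal sketch: expose a byte array as a stream of bits,
// most significant bit first.
public class BitReader
{
    private readonly byte[] source;
    private int bitPosition;

    public BitReader(byte[] source) =&amp;gt; this.source = source;

    // Read 'count' bits (up to 32) as an unsigned integer.
    public uint ReadBits(int count)
    {
        uint result = 0;
        for (int i = 0; i &amp;lt; count; i++)
        {
            int bit = (source[bitPosition / 8] &amp;gt;&amp;gt; (7 - bitPosition % 8)) &amp;amp; 1;
            result = (result &amp;lt;&amp;lt; 1) | (uint)bit;
            bitPosition++;
        }
        return result;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;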

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Melville.PDF has much better test coverage than PhotoDoc.  Even though test coverage is an imprecise metric of test quality, this confirms my subjective impression that Melville.PDF is the much better tested codebase.  One reason for this is accidental.  The other is architectural.&lt;/p&gt;

&lt;p&gt;The accidental reason is that the problem Melville.PDF solves is easy to test.  At the highest level of abstraction, Melville.PDF converts ASCII strings into graphics.  Testing Melville.PDF comes down to generating interesting ASCII strings and making sure you like the output.  PhotoDoc, on the other hand, started as an image processing application that runs with a GUI.  GUIs are classically difficult to test.&lt;/p&gt;

&lt;p&gt;While GUIs are hard to test, PhotoDoc shot itself in the foot by embedding system-level concepts, like WPF imaging classes or system calls to record audio, deep in the domain model.  An enormous ShellViewModel object at the root of the domain model was too attractive a target for violations of the Law of Demeter.  This means that every unit test is essentially an integration test, because I have to create large portions of the domain model to test anything.&lt;/p&gt;

&lt;p&gt;The sad part of this story is that I tried to refactor my way out of it, and was only partially successful.  The ShellViewModel is a seventh of its original size, but it still takes 18 constructor parameters – and uses many of them in the constructor, so they can’t just be simple mocks.  I wrote a testing configuration of the IOC container that is brittle and slow just to be able to do some testing. &lt;/p&gt;

&lt;p&gt;Consistent, perhaps even fanatical, application of the Single Responsibility Principle has resulted in much more test flexibility.  Melville.PDF uses dependency injection but no DI framework, because a library shouldn’t impose one.  All of the classes rely on a modest number of abstract dependencies one level of abstraction below themselves and are easily testable without a DI framework.  Reasonable defaults make the library usable without a DI container even though it uses dependency injection throughout.  All the code I want to test is in a public member of some class.  Private fields or explicit constructor calls are used to hide public implementations within classes at a higher level of abstraction.&lt;/p&gt;
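&lt;p&gt;The “reasonable defaults” pattern can be sketched like this; the type names are invented for illustration.  The dependency is injectable for tests, but a consumer gets working behavior with no DI container at all:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Hypothetical types illustrating the pattern, not Melville.PDF's actual classes.
public interface IFontLoader { }
public class DefaultFontLoader : IFontLoader { }

public class PageRenderer
{
    private readonly IFontLoader fonts;

    // Tests inject a mock; library consumers take the default.
    public PageRenderer(IFontLoader? fonts = null) =&amp;gt;
        this.fonts = fonts ?? new DefaultFontLoader();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;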

&lt;p&gt;Testability is a major goal of the SOLID principles, and for me they succeeded.  Melville.PDF is objectively and subjectively more testable than PhotoDoc.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/johnn6tsm/a-tale-of-two-codebases-part-3-of-4-code-reuse-11ei"&gt;next part&lt;/a&gt; of this series will address code reuse.&lt;/p&gt;


</description>
      <category>csharp</category>
      <category>cleancode</category>
    </item>
    <item>
      <title>A Tale of Two Codebases (Part 3 of 4): Code Reuse</title>
      <dc:creator>JohnN6TSM</dc:creator>
      <pubDate>Wed, 05 Oct 2022 22:22:38 +0000</pubDate>
      <link>https://dev.to/johnn6tsm/a-tale-of-two-codebases-part-3-of-4-code-reuse-11ei</link>
      <guid>https://dev.to/johnn6tsm/a-tale-of-two-codebases-part-3-of-4-code-reuse-11ei</guid>
      <description>&lt;p&gt;As I discussed in &lt;a href="https://dev.to/johnn6tsm/a-tale-of-two-codebases-one-developers-reflections-on-solid-software-design-part-1-of-4-3a0p"&gt;Part 1&lt;/a&gt; the premise of this series is a simple natural experiment: comparing 2 large codebases written by the same solo programmer before and after introduction of SOLID Design principles.  &lt;a href="https://github.com/DrJohnMelville/PhotoDoc"&gt;PhotoDoc&lt;/a&gt;, the pre-intervention project, is an electronic medical record dedicated to medical forensics.  [Melville.PD(&lt;a href="https://github.com/DrJohnMelville/Pdf"&gt;https://github.com/DrJohnMelville/Pdf&lt;/a&gt;) is a free, open-source PDF renderer for .NET.  In this article I discuss code re-use.&lt;/p&gt;

&lt;p&gt;Bob Martin claims that “Duplication may be the root of all evil in software.” (Martin, &lt;a href="https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882"&gt;Clean Code&lt;/a&gt;, p. 48.)  One of the promises of SOLID design is to divide code into reusable units.  The insight of the SOLID principles is that classes are more reusable if they do only one thing.  Even if I must chain several classes together to get the behavior I want, it is much easier to accumulate the desired behavior from multiple objects than to remove an unwanted feature from a big and chunky class.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 1: Large classes are difficult to reuse.
&lt;/h2&gt;

&lt;p&gt;Large classes make reuse difficult.  I approached the design of PhotoDoc such that all the code that touched an object’s data was within that class.  If data was used to do more than one thing, then the class acquired multiple responsibilities.  PhotoDoc’s name implies its original purpose -- to analyze digital photos.  There is no surprise that a Photo class exists.  Photos can load themselves from a disk file, display metadata, manage a rich collection of pixel shaders that filter the image, and display a collection of tools like lines, arrows, and text that annotate the image.  Using PhotoDoc’s “folders of sheets” metaphor, a PhotoSheet class holds a collection of photos.&lt;/p&gt;

&lt;p&gt;Years after PhotoDoc implemented PhotoSheets, I added support for scanned documents.  Scanned documents typically come in as PDF, XPS, TIFF, or DICOM files, but fundamentally they are a sequence of images.  Over the years I have added metadata display, filters, tools, and annotations to the scanned documents as well.  You would expect that PhotoSheets and ScannedDocuments, both representing a series of images with annotations, filters, and so on, would share almost 100% of their code – except they don’t.&lt;/p&gt;

&lt;p&gt;As I said earlier, Photo and PhotoSheet are chunky objects that reflect my 2007 understanding of objects: they map directly onto concepts in the real world and directly implement all the behaviors of those objects.  Photo classes know how to load themselves from a disk path, and use that path as a key to prevent reloading when a photo is used more than once.  For scanned documents, page images do not have a unique path – they have a path plus a page number.  They get loaded using different libraries and are cached differently.  While I could refactor this to a better design, it would be expensive and, so far, the problem has not been big enough to make it into active development – I suspect it never will.&lt;/p&gt;

&lt;p&gt;By all rights, PhotoSheets and ScannedDocuments should just be alternative views of equivalent data structures.  In fact, they remain very different.  Photos can be associated with an injury noted on a traumagram, and scanned pages can be associated with a line in a document index, but not vice-versa.  I even implemented features to turn a sequence of photos into a scanned document, and another feature to create a PhotoSheet from the pages of a scanned document.  Chunky objects make even this obvious code reuse a higher cost refactoring than I have been willing to execute.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lesson 2: The Single Responsibility Principle Facilitates Unexpected Code Reuse
&lt;/h2&gt;

&lt;p&gt;In contrast, small classes with single responsibilities plus the right abstractions create happy coincidences where objects just fit together in unanticipated patterns.  One of those happy coincidences happened when I was working on content streams.  &lt;/p&gt;

&lt;p&gt;Content streams are a domain specific language which PDF uses to describe the appearance of visual elements.  Melville.PDF obviously has no choice but to implement a content stream parser that takes a content stream and produces the rendered page.  The “and” in the preceding sentence is a giveaway that these are two different responsibilities, and the Single Responsibility Principle dictates that they be represented as separate classes.&lt;/p&gt;

&lt;p&gt;Thus Melville.PDF’s content stream parser design emerges from these requirements.  The IContentStreamOperations interface contains one method for each legal combination of opcode and parameters in the content stream DSL.  The ContentStreamParser class accepts input bytes from a pipe and calls the corresponding methods on an IContentStreamOperations instance passed in the parser’s constructor.  This design separates the concern of interpreting the bytes of the content stream from the concern of executing a sequence of drawing commands.&lt;/p&gt;
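&lt;p&gt;The split can be sketched as follows.  IContentStreamOperations and ContentStreamParser are named in this article, but the specific members below are my assumptions, not the published API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// One method per content stream operation; the parser only dispatches.
public interface IContentStreamOperations
{
    void MoveTo(double x, double y);   // the "m" operator
    void LineTo(double x, double y);   // the "l" operator
    void StrokePath();                 // the "S" operator
    // ...and so on for the rest of the DSL
}

public class ContentStreamParser
{
    private readonly IContentStreamOperations target;

    public ContentStreamParser(IContentStreamOperations target) =&amp;gt;
        this.target = target;

    // Parsing elided: read operands and an opcode from the pipe,
    // then dispatch, e.g. target.MoveTo(x, y) upon "x y m".
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;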

&lt;p&gt;Separately, it became clear that Melville.PDF would need a nontrivial library of test documents, and so PDF generation capabilities were needed to support the test code.  The ContentStreamWriter class implements IContentStreamOperations and responds to various method calls by writing the equivalent content stream code to the designated output stream.  The ContentStreamWriter separates the concern of designating which content stream actions should be produced from the concern of generating the correct syntax of those operations.  Because C# requires method calls to have the proper number of correctly typed arguments, proper PDF syntax is enforced by the C# compiler and supported by C# IntelliSense.&lt;/p&gt;

&lt;p&gt;Late in the development of Melville.PDF, a need developed to pretty print content streams.  Debugging a renderer is a loop of 3 steps: 1. Find a document that renders differently in Melville.PDF than in Adobe Reader.  2. Understand what that document is doing to cause the different rendering.  3. Fix Melville.PDF to render the file the same as Adobe Reader.  Step 2 involves reading many content streams that often span thousands of lines.  Because PDF is a binary format not intended for human consumption, most content streams come in a very concise format that resembles minified JavaScript, even though whitespace is ignored in content streams, so an indented rendering is allowed.  I wanted a pretty printer to produce indented, easy-to-read content streams.&lt;/p&gt;

&lt;p&gt;Conceptually, pretty printers and code minifiers can both be thought of as a parser that outputs to a code generator which regenerates the source language.  Conveniently, I had a parser that outputs operations to an interface and a code generator that implemented the interface my parser targeted.  Thus, creating a PDF minifier was trivial – I just had to pass the ContentStreamWriter to the ContentStreamParser and let it run.&lt;/p&gt;
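&lt;p&gt;The minifier wiring amounts to a few lines.  The method names here are assumptions rather than the published API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// The parser reads the input; the writer re-emits each operation
// in concise, valid content stream syntax.
IContentStreamOperations writer = new ContentStreamWriter(outputStream);
var parser = new ContentStreamParser(writer);
parser.Parse(inputContentStream);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;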

&lt;p&gt;Converting my dirt cheap minifier to the pretty printer required only the creation of an IndentingContentStreamWriter to implement the “pretty” part of pretty printing.  IndentingContentStreamWriter runs just over 100 lines of code, most of which is spent designating which constructs should begin or end indented regions.  IndentingContentStreamWriter delegates writing operations to a contained ContentStreamWriter so the class itself is focused on the single responsibility of adding whitespace to the content stream output.  The entire process of implementing this feature took about 30 minutes.&lt;/p&gt;

&lt;p&gt;The content stream pretty printer is a developer convenience that users of the library will never see.  The feature was feasible because it was so unbelievably cheap that efficient debugging more than paid for the minimal development cost.  &lt;/p&gt;

&lt;p&gt;The joy of clean coding is that these serendipitous opportunities for code reuse become more, rather than less, frequent as the project continues.  In Melville.PDF, big-endian binary integer parsing from the ICC parser got reused in the CCITT and JBIG parsers, the entire CCITT parser got reused in the JBIG parser, and a byte-stream-to-bit-stream adapter was repurposed from the LZW parser into the binary image parsers.  As development progresses, the developer accumulates a library of small classes with single responsibilities that form a toolbox uniquely customized to the problem domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SOLID code is reusable because classes are small, and they do one thing.  Problems tend to recur within a problem domain.  In reuse scenarios, augmenting a small class is much easier than removing undesired “features” from a larger class. When classes are small and focused, the probability that the next problem encountered can be solved with code that already exists and works increases with time.  &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://dev.to/johnn6tsm/a-tale-of-two-codebases-part-4-of-4-dependency-smell-4cm2"&gt;last post&lt;/a&gt; in this four part series will address dependency management.&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>cleancode</category>
    </item>
    <item>
      <title>A Tale of Two Codebases: One Developer’s Reflections on SOLID Software Design (Part 1 of 4)</title>
      <dc:creator>JohnN6TSM</dc:creator>
      <pubDate>Wed, 05 Oct 2022 22:22:19 +0000</pubDate>
      <link>https://dev.to/johnn6tsm/a-tale-of-two-codebases-one-developers-reflections-on-solid-software-design-part-1-of-4-3a0p</link>
      <guid>https://dev.to/johnn6tsm/a-tale-of-two-codebases-one-developers-reflections-on-solid-software-design-part-1-of-4-3a0p</guid>
      <description>&lt;p&gt;I suspect that my programming background is somewhat unusual – I am a hobbyist programmer with a master’s degree in computer science.  Weeks after I finished CS grad school, I entered the UCSD School of Medicine.  I have spent the past twenty years as a practicing physician.  I develop software in the early mornings, evenings, and weekends, largely for enjoyment, and to build the tools I use during my “day job.”&lt;/p&gt;

&lt;p&gt;My formal computer science education spanned 1993–1998.  I took the required software engineering classes and was taught straight waterfall design.  In my 4 years of CS education, I never heard the words “unit test.”  Over the intervening years I have read about test driven development, agile methods, refactoring, and design patterns.  More recently, in 2019, I read &lt;a href="https://cleancoders.com/"&gt;Bob Martin’s&lt;/a&gt; “Clean Code” and was impressed with the SOLID principles.&lt;/p&gt;

&lt;p&gt;The purpose of this article is to compare two nontrivial codebases – one legacy codebase preceding my introduction to the SOLID principles and a second, newer codebase featuring an intentional SOLID design.  This first post asks whether SOLID principles measurably changed my coding practices, using objective source code metrics.  Next, I will compare unit testing in the two projects.  A third article will compare code reuse within the two projects.  Lastly, I will close with an article about dependencies in the two projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet the Codebases
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/DrJohnMelville/PhotoDoc"&gt;PhotoDoc&lt;/a&gt;, is a legacy codebase that has grown up with my career.  PhotoDoc was born in 2007 when I volunteered as an examiner for the Northwest Arctic Borough Sexual Assault Response Team.  It irked me that I was being asked to make complex medicolegal decisions based of photographs, without even the basic image manipulation tools I was used to from my computer science training.  My volunteer interest in medical forensics became a career in 2010 when I entered a postdoctoral fellowship in child abuse pediatrics, and PhotoDoc grew along with me.  PhotoDoc picked up examination forms, growth charts, x-ray, video, and audio analysis during my fellowship.  As I moved on to my current position leading the Child Abuse Pediatrics division at the &lt;a href="//www.musc.edu"&gt;Medical University of South Carolina&lt;/a&gt;, PhotoDoc learned how to talk to many research, billing, and clinical databases that are so essential to my current position.&lt;/p&gt;

&lt;p&gt;Despite being developed largely in my free time, PhotoDoc is a working codebase.  PhotoDoc is used daily by the approximately 10 people unfortunate enough to report to me. The Department of Pediatrics considers it critical infrastructure for our division.  I still maintain the code, but rarely feel the need for new features.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/DrJohnMelville/Pdf"&gt;Melville.Pdf&lt;/a&gt; came to be in mid 2021.  PDFs are important to PhotoDoc because much of the information I consume at work comes to me in PDFs.  Some of the forms the state makes me fill out go out as PDF forms.  My reports are printed to PDFs before being sent to partner agencies.  PhotoDoc relies on three different PDF libraries, and I wasn’t happy with any of them.  I was especially irked that my choices for rendering PDF were either not completely free, closely tied to Windows, or hopelessly buggy.  &lt;/p&gt;

&lt;p&gt;About this time PhotoDoc was transitioning to maintenance mode, and so I was looking for a new project.  I thought “how hard can it be to render PDF?” then “it sounds like fun, and I can give away a free .NET PDF renderer.”  GitHub records my first commit on 6/27/2021, and 1,014 commits later Melville.PDF is available on NuGet and GitHub.&lt;/p&gt;

&lt;p&gt;Recently, it occurred to me that the development of these two codebases constitutes an interesting natural experiment on the effect of SOLID programming.  The strength of this experiment is that neither codebase is a kata or a toy – they are each significant codebases with tens of thousands of lines of code.  Each codebase responds to significant external requirements and was not intended to be an example of a particular architectural style.  The two codebases are each the product of a single programmer (me), so team communication is not a factor, and my raw programming talent probably has not significantly changed in the past decade and a half.&lt;/p&gt;

&lt;p&gt;Like all experiments, this one has some limitations.  The two codebases do different things, and it could be that one problem is markedly harder than the other.  PhotoDoc developed over time in response to shifting requirements, whereas Melville.PDF is coded to a static PDF specification.  Most notably, in the 15 years that PhotoDoc has existed, C# itself has been in active development and has become more concise.  Especially in the code metrics one must recall that PhotoDoc is written in a variety of historical styles whereas Melville.PDF is exclusively modern C#.  An additional limitation is that maintenance continues on PhotoDoc, including some recent efforts to clean up this legacy codebase.  Thus there is some crossover, as I have been refactoring PhotoDoc toward a SOLID design.&lt;/p&gt;

&lt;h2&gt;Comparing the Codebases&lt;/h2&gt;

&lt;p&gt;This first article concludes with a simple question.  Did my decision to adopt clean code result in measurable changes to the codebases?&lt;/p&gt;

&lt;p&gt;I used the Visual Studio code analysis feature to compute various metrics for both projects.  For each project I excluded unit test code.  On the Melville.PDF side I included code that generates test documents and applications written pretty much exclusively to view and debug the rendering output.  Test code is not unimportant, but it is different.  I may look at the tests in a future part.  Because this is an investigation of my personal coding practices, I also excluded the JPEG and JPEG2000 code that I copied from other libraries into the Melville.PDF codebase.  &lt;/p&gt;

&lt;h3&gt;Total Code&lt;/h3&gt;

&lt;p&gt;PhotoDoc is a much larger codebase at 118,236 lines of code.  Melville.PDF weighs in at 46,624 total lines of code.  Despite C# notation growing more concise over time, PhotoDoc has a greater proportion of lines containing executable code (29.2% vs 26.4%, p &amp;lt; 0.0001).  This may reflect that Melville.PDF contains hardcoded data tables for various character mapping schemes.  It may also reflect that clean code emphasizes many small methods, and method declaration lines are not executable.&lt;/p&gt;

&lt;h3&gt;Classes&lt;/h3&gt;

&lt;p&gt;Regarding classes, Uncle Bob insists “The first rule of classes is they should be small.  The second rule of classes is they should be smaller than that.” (Martin, &lt;a href="https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882"&gt;Clean Code&lt;/a&gt; pg 136)  Martin suggests that classes of less than 100 lines are easy to read, and I adopted this informal guideline in Melville.Pdf.  Median class size in Melville.Pdf is smaller than in PhotoDoc (20 vs 29 lines, p &amp;lt; 0.0001.)  The overwhelming majority of classes in both projects are less than 100 lines.&lt;/p&gt;

&lt;p&gt;If the target for classes is less than 100 lines, then classes over 200 lines are definitely suspect.  Despite PhotoDoc being roughly 3 times the size of Melville.Pdf, PhotoDoc contains 101 classes &amp;gt; 200 lines and Melville.Pdf contains 15 (p for difference of proportions &amp;lt; 0.0001.)&lt;/p&gt;

&lt;p&gt;Even more interesting than the raw numbers is looking at the 15 instances where Melville.Pdf classes exceed 200 lines.  7 of the 15 classes are essentially dictionaries that contain no significant computation or algorithms.  The 4422-line behemoth, for example, maps entries in the Adobe Glyph List to their Unicode equivalents.  Other “dictionary classes” include flyweights for name objects used throughout the PDF spec, character mappings, and standardized Huffman tables from the JBIG specification document.  Other large classes include a strongly typed property bag for drawing contexts, and a class featuring an unavoidably large switch statement that dispatches drawing commands for the PDF content stream parser.  PhotoDoc, in contrast, includes multiple classes that got big because they do a lot of stuff – and they have stayed big despite recent efforts to make them smaller.&lt;/p&gt;

&lt;h3&gt;Methods&lt;/h3&gt;

&lt;p&gt;Methods were already small in PhotoDoc, with the median method being three lines of code.  Melville.PDF reduced this to 2 lines, which is statistically significant (p = 0.0002) but may not be practically significant.  The reduction in median method length remains significant (5 vs 4 lines, p &amp;lt; 0.0001) when simple property setters and getters are excluded.  &lt;/p&gt;

&lt;p&gt;Melville.Pdf contains only 6 methods with more than 50 lines of code.  Two are methods that generate test documents and contain large amounts of quoted document content but a single control path with no loops or decisions.  Two methods are unavoidably large switch statements; one dispatches content stream drawing operations and the other maps Unicode to the MacRoman character set.  The last two simply create static dictionaries with hardcoded values.  None of the long methods contains branching logic more complicated than a single switch statement.&lt;/p&gt;

&lt;p&gt;In contrast, PhotoDoc’s 29 methods over 50 lines of code are a mix of “data centric” lookups and several methods with complicated control flow.&lt;/p&gt;

&lt;h3&gt;Other Metrics&lt;/h3&gt;

&lt;p&gt;Differences in method and class size are of some interest, but I set out explicitly to trim method and class sizes, so it is not surprising that they changed.  The remaining Visual Studio code metrics were not specifically targeted and provide some insight into the effect of clean coding on quality metrics which predate the clean code movement.&lt;/p&gt;

&lt;p&gt;Median cyclomatic complexity (1) and class coupling (3) did not differ between the codebases.  A slight difference in median value for Microsoft’s maintainability index (86 vs 88) likely has no practical significance.  Further exploratory data analysis on these metrics did not reveal insightful differences in the interquartile range, total range, or patterns of outliers for these three metrics.&lt;/p&gt;

&lt;h3&gt;Discussion&lt;/h3&gt;

&lt;p&gt;Empirical code metrics confirm that I successfully adopted cleaner coding practices.  Both classes and methods are significantly smaller in Melville.PDF than they are in PhotoDoc.  Classes shrank more than methods did in the transition.  Most interestingly, other quality metrics did not significantly change.&lt;/p&gt;

&lt;p&gt;The size of classes decreased by a larger factor than the size of methods, but this is likely because the pre-intervention methods were already quite small.  Subjectively, one of the biggest differences I noted in writing the two projects was a stronger insistence on the single responsibility principle for classes.  In the PhotoDoc codebase, many classes had “subparts” delineated by #region blocks with extra-private fields that, by convention, should only be manipulated within that block.  In Melville.PDF these subparts are separated into their own classes.  Melville.PDF also makes extensive use of read-only struct types to logically contain related methods without putting additional pressure on the garbage collector.&lt;/p&gt;
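
&lt;p&gt;As a hypothetical sketch (the class and member names below are invented for illustration and appear in neither codebase), the shift from a #region “subpart” to its own type looks something like this:&lt;/p&gt;

```csharp
// Before (PhotoDoc style): a #region "subpart" inside a large class.
// By convention, only code inside the region touches its private
// fields, but the compiler does not enforce that convention.
public class ImagePage
{
    #region Zoom handling
    private double zoomFactor = 1.0;
    public void ZoomIn() => zoomFactor *= 1.25;
    public void ZoomOut() => zoomFactor /= 1.25;
    #endregion
    // ... many more regions ...
}

// After (Melville.PDF style): the subpart becomes its own small type.
// A readonly struct groups the related methods while adding no
// garbage collector pressure.
public readonly struct ZoomState
{
    public double Factor { get; }
    public ZoomState(double factor) => Factor = factor;
    public ZoomState In() => new ZoomState(Factor * 1.25);
    public ZoomState Out() => new ZoomState(Factor / 1.25);
}
```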

&lt;p&gt;I was very surprised that cyclomatic complexity, class coupling, and maintainability index did not differ significantly between the codebases.  The Melville.PDF codebase subjectively “feels” easier to maintain, as will be discussed in future parts, but this difference was not reflected in objective code metrics.  This may be because two of the metrics, cyclomatic complexity and maintainability index, are very method-centric and my pre-intervention methods were already quite short.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This first part has proposed a natural experiment – comparing two large codebases, each written by the same single programmer, before and after the introduction of SOLID software design.  We have confirmed that SOLID design principles measurably changed characteristics of the code, but that those changes were largely restricted to measures that are directly targeted by the SOLID guidelines.  Three classic metrics of code maintainability were unchanged by the introduction of SOLID design.&lt;/p&gt;

&lt;p&gt;In the next part I will look at the &lt;a href="https://dev.to/johnn6tsm/a-tale-of-2-codebases-part-2-of-4-testability-1gcb"&gt;testability&lt;/a&gt; of the two codebases.&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>cleancode</category>
    </item>
    <item>
      <title>A Tale of Two Codebases (Part 4 of 4): Dependency Smell</title>
      <dc:creator>JohnN6TSM</dc:creator>
      <pubDate>Wed, 05 Oct 2022 22:21:18 +0000</pubDate>
      <link>https://dev.to/johnn6tsm/a-tale-of-two-codebases-part-4-of-4-dependency-smell-4cm2</link>
      <guid>https://dev.to/johnn6tsm/a-tale-of-two-codebases-part-4-of-4-dependency-smell-4cm2</guid>
      <description>&lt;p&gt;As I discussed in &lt;a href="https://dev.to/johnn6tsm/a-tale-of-two-codebases-one-developers-reflections-on-solid-software-design-part-1-of-4-3a0p"&gt;Part 1&lt;/a&gt; the premise of this series is a simple natural experiment: comparing 2 large codebases written by the same solo programmer before and after introduction of SOLID Design principles.  &lt;a href="https://github.com/DrJohnMelville/PhotoDoc"&gt;PhotoDoc&lt;/a&gt;, the pre-intervention project, is an electronic medical record dedicated to medical forensics.  &lt;a href="https://github.com/DrJohnMelville/Pdf"&gt;Melville.PDF&lt;/a&gt; is a free, open-source PDF renderer for .NET.In this article I discuss dependencies.&lt;/p&gt;

&lt;p&gt;Dependencies are a mixed bag at best.  One might think that “any code I don’t have to write is good code.”  On the other hand, NIH syndrome came from somewhere – somebody else’s code is never going to be exactly what you hoped it would be.&lt;/p&gt;

&lt;p&gt;In Clean Architecture I read that adopting a framework is an “asymmetric marriage,” because the dependency might impose significant constraints on the application, but the application has no influence on the framework. (Martin R, &lt;a href="https://www.amazon.com/Clean-Architecture-Craftsmans-Software-Structure/dp/0134494164"&gt;Clean Architecture&lt;/a&gt; pg 293)  Unfortunately, I already knew this firsthand.  Early on, PhotoDoc married 2 frameworks.&lt;/p&gt;

&lt;h2&gt;Lesson #1: Your domain code will last longer than any of your dependencies.&lt;/h2&gt;

&lt;p&gt;I have already mentioned that my earliest thoughts about PhotoDoc were as a WPF app.  (I might go so far as to admit that some of the early features in PhotoDoc were inspired by the WPF demos that were a dime a dozen in 2007.)  I have already discussed, in part 2, what a mess integrating WPF controls into my domain model made for testing.  The choice to marry WPF has had other consequences as well.&lt;/p&gt;

&lt;p&gt;My life is different now than it was in early 2007.  At the time I was a physician living my dream in rural Alaska.  I ran a two-person forensic examiner program at a rural hospital that saw about 100 patients a year.  I saw patients one at a time.  What I really thought I needed was a simple image manipulation program.  At the time my hospital used paper records, so I could write my forensic notes in Word and print them out for the patient chart.&lt;/p&gt;

&lt;p&gt;My life is different now: I run a relatively large academic child abuse practice with 10 practitioners that sees well over a thousand patients a year.  In addition to digital photographs, I get x-rays, audio, video, and documents in multiple formats.  I have multiple funders and research partners, each of whom requires slightly different data in a slightly different format.  I have to handle data transfers (in both directions) between my system and the legal electronic medical record at the medical center that hosts my clinics.&lt;/p&gt;

&lt;p&gt;WPF was the hot new technology in 2007, and while it is not dead in 2022, it is no longer the darling getting all the attention.  Now some of my employees prefer Macs, so they have to run windows in a VM, because I am tied to WPF and Windows.  I have years of patient data stored in PhotoDoc files – but the only parser I have for those files is strongly tied to WPF and the specific windows I create in PhotoDoc, which kneecaps my ability to search through and manage the large mass of patient data I have accumulated through the years.  I did not understand in 2007 that my interest in medical forensics was going to last longer than WPF’s heyday.&lt;/p&gt;

&lt;p&gt;But today is not what I really worry about.  I don’t turn 65, and become nominally eligible for retirement, until 2040.  At that point WPF will be 33 years old.  When my turn comes to move along to something else, WPF will be as old then as MS-DOS 6.0 is right now.  Microsoft has a very good record with backward compatibility, so there is a reasonable chance that I might avoid a catastrophic and costly rewrite.  If I were a Mac programmer, PhotoDoc would already be obsolete.  In 2007 I never dreamed I would be running a university division, let alone using PhotoDoc, in 2040.  Now that future looks entirely probable.&lt;/p&gt;

&lt;p&gt;Melville.PDF is a brand-new codebase, so I do not yet have 15 years of regrets to complain about.  But I hope it will be more durable than PhotoDoc.  Melville.PDF does depend on WPF, but rather than having the data model depend on WPF, a single assembly plugs into the data model and provides WPF functionality.  If WPF disappeared tomorrow, I would continue using Melville.PDF with the Skia binding.  Furthermore, building Melville.Pdf to support 2 different frameworks, WPF and SkiaSharp, forced me to carefully define and segregate common PDF rendering code from framework-specific rendering code.&lt;/p&gt;

&lt;h2&gt;Lesson #2: Don’t Buy the Cow When You Can Get the Milk for Free&lt;/h2&gt;

&lt;p&gt;Taking a dependency on WPF is not the worst of my dependency sins in PhotoDoc.  WPF shipped with, and still has, what I consider to be a significant flaw.  WPF makes it trivially easy to bind to properties on POCO objects.  WPF does not have a corresponding mechanism to bind UI events to an arbitrary method on a POCO object.  This deficiency has resulted in an unending stream of MVVM frameworks for WPF.&lt;/p&gt;

&lt;p&gt;I picked Caliburn Micro, and I have lived to regret it.  At the time I took the author’s advice and copied the source code into my source, so it is not as bad as it could be.  I have fixed or deleted some of the most objectionable parts, I have enhanced some of the other parts – and I still hate it.  The problem is that I now have literally hundreds of view classes that don’t work without Caliburn Micro’s “magic.”  Worse still, Caliburn Micro uses conventions based on the WPF Name properties assigned to controls.  Even if I were willing to modify and re-test the hundreds of dependent classes, there is no obvious way to search for all the locations that depend on the framework.  I bought that cow and now she’s mine to keep.&lt;/p&gt;

&lt;p&gt;Years later, I wrote my own MVVM binding for WPF.  I think it’s better, of course, because I wrote it.  Now I have two ways to bind to events, two ways to bind mouse moves, two ways to associate ViewModels with Views, and so on.  It grates on me to see the “old” way of doing things littered throughout the codebase, but there is no way to fix it without a massive refactoring and manual testing effort.&lt;/p&gt;

&lt;p&gt;Writing Melville.PDF, I have been very selective about the dependencies I take, especially dependencies in the core assemblies.  Eventually I took 3 dependencies outside of the .NET framework: a JPEG parser, a JPEG 2000 parser, and a library that parses multiple font file formats.  These dependencies are stable – they parse decades-old file formats.  I hope I have not chosen poorly.&lt;/p&gt;

&lt;p&gt;Should I develop regrets, however, I didn’t buy the cow this time, I just took the milk!  As evidence, the JPEG library is the fourth library I have used to parse JPEGs.  It turns out that PDF has some rather unique requirements for JPEG parsing, and so the WPF image parser, Six Labors’ ImageSharp, and even an educational but frustrating attempt at writing my own parser all had unacceptable liabilities.  Eventually, I was able to use the insight I gained from writing my own parser to modify an open-source parser, JpegLibrary, to meet my needs.&lt;/p&gt;

&lt;p&gt;Unlike the dependency hell I experienced with PhotoDoc, each of these replacements was a trivial operation.  Melville.PDF has only one class that knows anything about JpegLibrary, named DctDecoder.  (I enforce this constraint – see the next section.)  The low-level PDF parser, which is the customer in this case, declared an interface, ICodecDefinition, defining how it would like to request JPEG decompression.  &lt;/p&gt;

&lt;p&gt;Writing small adapters to make any of four JPEG libraries implement this interface has been trivial.  Each time I switch, the adapter class is the only thing that gets thrown out and rewritten.  During the switchover from ImageSharp to JpegLibrary I had 2 adapters, and I switched back and forth several times by just commenting or uncommenting a few lines of code. &lt;/p&gt;
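
&lt;p&gt;A minimal sketch of the pattern (the member signature and adapter name below are invented; the real ICodecDefinition interface in Melville.Pdf differs):&lt;/p&gt;

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// The PDF parser owns the abstraction.  It states, in its own terms,
// how it wants decompression done and knows nothing about any
// particular JPEG library.
public interface ICodecDefinition
{
    ValueTask<Stream> DecodeAsync(Stream encodedData);
}

// A thin adapter is the only class that mentions the dependency.
// Swapping JPEG libraries means throwing away and rewriting only
// this class; the parser and the interface are untouched.
public class JpegLibraryCodec : ICodecDefinition
{
    public ValueTask<Stream> DecodeAsync(Stream encodedData)
    {
        // ... call into the current JPEG library's decoder here ...
        throw new NotImplementedException();
    }
}
```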

&lt;p&gt;I got 2 benefits from this design.  1) My PDF parsing code, which is the code that matters, treats all stream compression formats identically, using an interface that the PDF parser defines and that makes sense for the PDF parser.  2) Implementing this interface for a variety of formats in terms of a variety of dependencies has proven to be trivial.  Very little code is thrown away when the dependencies change.  &lt;/p&gt;

&lt;p&gt;Right now, I have chosen a static dependency from my parser to the Jpeg parser.  Jpeg is a very stable format, and I seriously doubt it is going to change significantly, even over the next five decades I might remain on the planet.  The unlikely possibility that a user would want to supply their own JPEG parser was not worth the complexity of injecting the dependency.  Because I cabined this dependency behind an abstraction that I own, however, I retain the choice to inject this dependency if this library becomes a problem in the future.  I will never be at the mercy of JpegLibrary in Melville.Pdf like I am to Caliburn Micro in PhotoDoc.&lt;/p&gt;

&lt;h2&gt;Lesson #3: Give Architectural Rules Teeth&lt;/h2&gt;

&lt;p&gt;The previous lesson taught us that the risk of taking a dependency is that it insidiously weaves its way into the code: the more you use a dependency, the more its types infest the code, and when dependencies change, the removal can be painful.  The reader might rightly argue that JpegLibrary was a very simple interface – it takes a stream and returns a stream – so it may not be the best illustration of the ability to contain dependencies within a codebase.&lt;/p&gt;

&lt;p&gt;For the next demonstration I would ask you to look at the SharpFont dependency.  SharpFont provides core services very near to the heart of PDF rendering.  SharpFont implements an abstraction over 5 or 6 different font file formats that PDF supports.  The library is intimately involved in every character that is written.  Furthermore, PDF defines character mappings as a complicated mix of tables from the font file and tables from the PDF file that are combined using a complicated mess of overlapping rules.  SharpFont is an ideal candidate to embed itself in the codebase, never to be removed.&lt;/p&gt;

&lt;p&gt;I would love to ditch SharpFont someday.  It has native dependencies I would prefer to avoid.  It also has a very C-centric API that does not play nicely with the C# garbage collector.  Its glyph mapping scheme is not thread safe, so I must serialize all the font operations with a semaphore.  As much as I hate this library, it parses several notoriously tricky font file formats quickly and correctly, and nothing else I found does.  My relationship with SharpFont is not that of a cherished wife, but a necessary mother-in-law.&lt;/p&gt;

&lt;p&gt;The clean solution, as I already discussed, is to wrap up all the ugliness I don’t like in a thin wrapper class that implements the interface I wish SharpFont presented me.  That class is FreeTypeFont, which implements the IRealizedFont interface.  Unlike the wrapper class in the past section, FreeTypeFont is not a trivial class.  It has numerous helper classes, some static data, and implements a significant portion of Melville.Pdf’s useful features.&lt;/p&gt;

&lt;p&gt;One risk is that SharpFont defines a bunch of accessory types on its own.  SharpFont defines enums for various character styles, classes to represent font families, fonts, characters, and various mapping tables.  If my wrapper class takes these types as arguments or returns them from public operations, then the wrapper class will fail to insulate the rest of the code from this least favored of my dependencies.  Even putting FreeTypeFont in its own assembly would be insufficient because in C# assembly dependencies are transitive.&lt;/p&gt;
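
&lt;p&gt;The difference can be sketched as follows (the signatures are invented for illustration; the real IRealizedFont members differ):&lt;/p&gt;

```csharp
// Leaky wrapper: SharpFont types appear in the public surface, so
// every caller silently takes a SharpFont dependency of its own.
public interface ILeakyFont
{
    SharpFont.GlyphSlot RenderGlyph(SharpFont.Face face, uint glyphIndex);
}

// Insulating wrapper: arguments and return values use only types
// the wrapping code defines, so callers never see SharpFont at all.
public interface IDrawTarget
{
    // drawing operations expressed in Melville.Pdf's own terms
}

public interface IRealizedFont
{
    double RenderGlyph(uint character, IDrawTarget target);
}
```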

&lt;p&gt;The Roslyn C# compiler allows custom analyzers that run during compilation and can emit warnings, or even errors, that effectively add project-specific constraints to C#.  I implemented such an analyzer to contain dependencies.  In an architecture definition file I have restricted all references to SharpFont to the Melville.Pdf.Model.Renders.FontRendering.FreeType namespace and its descendants.  The code I could possibly be required to rewrite if I eventually switch dependencies is carefully penned in one namespace, because the compiler will not let me say the name of any of SharpFont’s types outside that namespace.  (And yes, the analyzer uses the Roslyn semantic model, so it is smart enough to detect forbidden type usages that do not explicitly say the type name.)&lt;/p&gt;

&lt;p&gt;The FreeTypeFont wrapper is a significant piece of code.  If I ever find a replacement for SharpFont, rewriting FreeTypeFont will be expensive and difficult – that is the cost of rewriting a significant part of the library.  Because I have the architecture analyzer, I have a solid upper bound on how expensive it will be: I might have to replace all nine classes that can see the library, but it won’t be worse than that.  &lt;/p&gt;

&lt;p&gt;Incidentally, as I went through the code writing this article, I noticed the opposite architectural problem.  Code that parses PDF font structures had migrated into the “danger zone” where accessing SharpFont was allowed.  It took me about 15 minutes to move these classes to more appropriate namespaces, and no unit or integration tests broke in the process.  If I ever actually get to ditch SharpFont, there is more code I could move out of the danger zone and reuse.  It was not worth creating those abstractions right now because I suspect I am stuck with SharpFont for the foreseeable future.&lt;/p&gt;

&lt;p&gt;I use the architecture analyzer throughout Melville.PDF to keep the high-level dependency graph acyclic and to contain my dependencies.  As I have been living with the architecture analyzer for the last year, I am surprised by the number of times I inadvertently violate architectural rules even when trying to create a carefully layered design.  I am convinced that architectural rules need teeth to be observed.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Dependencies are inevitable in any software project because no one writes on the bare metal anymore.  Dependencies cause problems when your project ages more slowly than the code you depend upon.  Since software has an insidious habit of lasting longer than anyone anticipated, Clean Architecture dictates that one contain dependencies specifically because they are likely to change.&lt;/p&gt;

&lt;p&gt;This article also ends the four-part series reporting the results of a natural experiment comparing two codebases from before and after I adopted clean coding practices as promoted by &lt;a href="https://cleancoders.com/"&gt;Robert Martin&lt;/a&gt;.  As discussed above, I have reaped significant benefits in terms of testability, code reuse, and flexibility to switch dependencies.  I believe this will make it easier to adapt Melville.PDF to future platforms than has been possible with PhotoDoc, but the future remains to be seen.  Most interesting to me, though, is that cyclomatic complexity, class coupling, and Microsoft’s maintainability index did not differ appreciably between the projects.  This suggests to me that SOLID design provides additional maintainability benefits beyond those considered by earlier code quality metrics.&lt;/p&gt;

</description>
      <category>csharp</category>
      <category>cleancode</category>
    </item>
  </channel>
</rss>
