Tony Robalik

Posted on Nov 24

Is the Java ecosystem cursed? A dependency analysis perspective

#gradle #java #kotlin

I am the author of the moderately popular (⭐ 2k) Dependency Analysis Gradle Plugin, a static analysis tool that helps Gradle build authors maintain a healthy dependency graph. I also maintain some of the largest Gradle repos on the planet: a Kotlin backend repo with over 2500 subprojects, and an Android repo with more than 7200 subprojects (both proprietary). I have… seen some shit.

Note: I refer to both the cases above as being part of the "Java ecosystem," though both use Kotlin as the preferred language, and one runs on the JVM while the other runs on ART (the Android runtime) on mobile devices.

I come to you with a simple proposition: I believe the Java ecosystem is cursed. Hear me out.

We are cursed with…

Lying metadata, overuse of "fat" jars with underuse of package relocation, split packages, undocumented usage of reflection to access upstream dependencies, usage of terms like "upstream" that have different meanings in different contexts, misuse of protobuffers, different compilers with different notions of their obligations vis-a-vis the Java class file format…

Lying metadata

This was already covered in depth in "This is why we can't have nice things: When POM files lie," but the summary is: sometimes dependencies have hand-written metadata, which is certainly A Choice given that build tools exist. I suppose it's harder to teach a build tool to lie.

It's just a list, man

Despite the bewildering complexity of dependency resolution engines in tools like Gradle and Maven, at the end of the day a classpath is just a list of class files (and jars that package class files). When your running program "sees" a class or interface for the first time, it has to load it. It does this with a ClassLoader. The classloader searches the classpath (just a list of class files!)¹ and picks the first class file that matches the class it just encountered. Importantly, your classpath may have more than one class file for that class. Even well-behaved builds may have this problem, for a variety of reasons, some of which are noted below.
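One way to see this for yourself is to ask a classloader for every location of a class file instead of only the first. The following is a minimal sketch; the class name `ClasspathProbe` and its helpers are made up for illustration and are not part of any tool mentioned in this post.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ClasspathProbe {

    /** Converts a binary class name like "com.example.Foo" into "com/example/Foo.class". */
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    /**
     * Returns every location the given loader reports for the class file, in the
     * order the loader reports them. More than one entry means there are duplicates.
     */
    static List<URL> locations(ClassLoader loader, String className) throws IOException {
        return new ArrayList<>(Collections.list(loader.getResources(toResourcePath(className))));
    }

    public static void main(String[] args) throws IOException {
        // Pass a class name as the first argument; defaults to a JDK class.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        for (URL url : locations(ClassLoader.getSystemClassLoader(), name)) {
            System.out.println(url);
        }
    }
}
```

With the standard application classloader, the first URL printed is typically the copy that gets loaded; anything after it is a duplicate waiting to win if the classpath order changes.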

As I was writing this post, I saw yet another reason to fear the classpath, in the November Gradle newsletter: Maven-Hijack: Software Supply Chain Attack: Exploiting Packaging Order. Bad actors can make use of this fundamental property of the JVM to insert malicious code into your applications. Or, as we'll see below, you can just do it to yourself!

Fat jars without package relocation

Shadow is a powerful tool for creating "uber" or "fat" jars, which are jars that contain all their external dependencies rather than relying on a classpath. This can simplify deployments of applications since deployers only need to worry about a single jar instead of dozens, hundreds, or thousands of jars. This is fine. It becomes cursed when libraries make use of this tool, resulting in broken classpaths that contain duplicate class files such that runtime behavior is dependent on the classpath's order. I would like to point the maintainers of these libraries at Shadow's powerful relocation abilities, which enable it to change the package of bundled classes such that there can be no duplicate class problem.

As I've said before, the extent to which Java's packages exist in a global namespace is not well-appreciated.

Split packages

As will be discussed tangentially below, the existence of split packages complicates dependency analysis because it makes it harder to connect class names with the modules that provide them, since there is now a 1-to-many relationship between packages and modules.
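To make the 1-to-many relationship concrete, here is a small sketch that flags split packages given a map from module name to the class names that module provides. The class `SplitPackages` and the shape of its input are assumptions for illustration; this is not how DAGP models the problem.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SplitPackages {

    /** Returns the package part of a fully qualified class name ("" for the default package). */
    static String packageOf(String className) {
        int lastDot = className.lastIndexOf('.');
        return lastDot < 0 ? "" : className.substring(0, lastDot);
    }

    /**
     * Given module -> class names, returns package -> providing modules,
     * keeping only the packages that more than one module provides.
     */
    static Map<String, Set<String>> find(Map<String, Set<String>> classesByModule) {
        Map<String, Set<String>> modulesByPackage = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : classesByModule.entrySet()) {
            for (String className : entry.getValue()) {
                modulesByPackage
                    .computeIfAbsent(packageOf(className), k -> new TreeSet<>())
                    .add(entry.getKey());
            }
        }
        // A package with a single provider is not split, so drop it.
        modulesByPackage.values().removeIf(modules -> modules.size() < 2);
        return modulesByPackage;
    }
}
```

If modules `:a` and `:b` both contribute classes to `com.example.util`, the result maps that package to both modules, which is exactly the ambiguity an analysis tool has to resolve.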

I mostly work in Kotlin repos, both backend and Android, neither of which uses JPMS (the Java Platform Module System). I can't say from direct experience how widely used JPMS is in the pure Java world, but as the maintainer of an increasingly complicated static analysis tool, I can say I wish more projects used it.

(Kotlin users would say they get the benefits of JPMS thanks to the internal visibility modifier, but they're wrong.)

Funhouse mirrors (aka reflection)

The com.amazonaws:aws-java-sdk-core library has a method, getProfileCredentialService(), which uses reflection to access a class from the com.amazonaws:aws-java-sdk-sts library. This is a compound curse, composed of these properties:

com.amazonaws:aws-java-sdk-sts depends on com.amazonaws:aws-java-sdk-core, not the other way around.
Triggering the code path for getProfileCredentialService() will throw an exception if com.amazonaws:aws-java-sdk-sts is not on the classpath, raising the question of why the dependencies are structured this way.
The Java ecosystem has first-class functionality, Service Loaders, for dynamically instantiating something that might or might not be on the classpath.
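The contrast between the two approaches can be sketched as follows. Everything here is hypothetical: `ProfileCredentialsService` is an invented interface, not a real AWS SDK type, and the code only illustrates the general shape of reflective lookup versus `java.util.ServiceLoader`.

```java
import java.util.Iterator;
import java.util.ServiceLoader;

/** Hypothetical service interface; not a real AWS SDK type. */
interface ProfileCredentialsService {
    String describe();
}

final class OptionalDependencyLookup {

    /** Ad hoc approach: the dependency is only a string, which is hard to see statically. */
    static boolean presentViaReflection(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /**
     * First-class approach: providers are declared in META-INF/services (or in a
     * module descriptor), so the runtime and analysis tools know where to look.
     * Returns null when no provider is available.
     */
    static ProfileCredentialsService viaServiceLoader() {
        Iterator<ProfileCredentialsService> providers =
            ServiceLoader.load(ProfileCredentialsService.class).iterator();
        return providers.hasNext() ? providers.next() : null;
    }
}
```

In both cases the caller has to handle the "not on the classpath" outcome, but only the service loader leaves a declaration behind that a tool can find without guessing at string contents.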

The Dependency Analysis Gradle Plugin has had support for Service Loaders since the beginning of its existence. First-class features such as service loading are great for static analysis tools like DAGP, as they give it well-known places to search during analysis. Ad hoc approaches like reflection are trickier, and require substantially more complex approaches to handle. DAGP added support for Class.forName("...") in v3.3.0. Pre-3.3.0, DAGP would suggest removing the sts dependency as unused if it couldn't detect any direct reference to any of the classes it provides in the bytecode, leading to runtime failures, either in CI (ok) or post-deployment (bad).

Protocol Buffers

Protocol buffers, aka protobufs, are an amazing tool for making a build engineer's days a living nightmare. First we must note that there are at least two competing protobuf compilers in the JVM world: Google's protoc and Square's Wire. I happen to work at a company that uses both. I don't think I hate myself, but maybe God does. These compilers generate code (Java or Kotlin) from the protobuf format that is mutually incompatible without adapters,² meaning that once you have both in your codebase, you will probably always have both. Congrats.

I have also seen several modules with both plugins in use simultaneously. Well.

I work with Gradle. Each of the competing compilers comes with a Gradle plugin. I may be slightly biased, but I think the Wire Gradle Plugin is better. Nevertheless, the relative ease with which either can be configured leads to Fun Situations such as: two modules can each depend on the same proto files, possibly at different versions, leading to generated code with the same exact class name but different definitions. And now if you have a third module that depends on these two modules, you're in a situation where your module may fail to compile if you just so happen to change the order of your dependency declarations, or worse, it may compile in both cases but fail at runtime for a similar reason. This is because, as discussed above, a classpath is just a collection of jars and class files, and whichever class file gets loaded first wins forever.

I have now worked in two separate extremely large codebases that have significant usage of protos, and it is no exaggeration to say that dealing with them is almost the worst part of my job ("AI" has recently taken that crown).

Yolo compilers

It turns out that different compilers have different ideas of what the resultant class files should look like. Chapter 4 of the JVM specification discusses the Constant Pool. Class files contain a table, known as the constant_pool, which contains a reference to every constant present in the source code of a Java file. This is useful for static analysis because the various JVM compilers all³ inline constants for runtime efficiency. This means that a constant like public static final String CONSTANT = "magic" gets turned into simply "magic" at the use-site, and similarly for Kotlin's const val. Therefore simply analyzing the bytecode directly with a tool like ASM won't enable static analysis tools to connect the user of a constant to the maybe-separate module that provides the constant. Thanks to the constant pool, however, we can see the full reference to the provider and make the connection.

This only works for class files compiled with javac, however. For both kotlinc and ecj (the Eclipse Compiler for Java; yes, this does exist, and Real Teams in the world rely on it), keeping these full references to inlined constants in the constant pool is considered unnecessary.
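The inlining itself is easy to observe at runtime, because reading a compile-time constant does not even initialize the class that declares it. A minimal sketch, with the invented class names `ConstantProvider` and `InliningDemo`:

```java
final class ConstantProvider {
    // A compile-time constant: the compiler copies the value into every use-site.
    static final String CONSTANT = "magic";

    static {
        // Runs only if ConstantProvider is actually initialized at runtime.
        InliningDemo.providerInitialized = true;
    }
}

final class InliningDemo {
    static boolean providerInitialized = false;

    /** The value is inlined here, so no field access on ConstantProvider remains. */
    static String readConstant() {
        return ConstantProvider.CONSTANT;
    }

    public static void main(String[] args) {
        System.out.println(readConstant());
        System.out.println("ConstantProvider initialized? " + providerInitialized);
    }
}
```

Compiled with javac, this should print `magic` followed by `ConstantProvider initialized? false`: the consumer never touches the provider class at runtime, which is why a tool that only follows bytecode instructions cannot see the dependency.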

The Dependency Analysis Gradle Plugin has a class, ConstantPoolParser, which parses the constant pool of a class file and extracts the set of class file references for all the class's inlined constants. When passed a reference to a class file compiled with something other than javac, the returned set is empty. This leads to "unused dependency" false positives, when a dependency is only used for the constants it contains, a surprisingly common situation.
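For a rough idea of what reading the constant pool involves, here is a simplified sketch that walks the table as laid out in chapter 4 of the JVM specification and collects the names held by CONSTANT_Class entries. It is not DAGP's ConstantPoolParser; the class name `ConstantPoolClassNames` is invented, and any filtering DAGP applies on top is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.TreeSet;

final class ConstantPoolClassNames {

    /** Returns the names held by CONSTANT_Class entries, e.g. "java/lang/String". */
    static Set<String> parse(byte[] classFile) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(classFile))) {
            if (in.readInt() != 0xCAFEBABE) {
                throw new IOException("Not a class file");
            }
            in.readUnsignedShort(); // minor_version
            in.readUnsignedShort(); // major_version
            int count = in.readUnsignedShort(); // valid indices are 1..count-1

            String[] utf8 = new String[count];
            int[] classNameIndex = new int[count];

            for (int i = 1; i < count; i++) {
                int tag = in.readUnsignedByte();
                switch (tag) {
                    case 1: utf8[i] = in.readUTF(); break;                     // Utf8
                    case 7: classNameIndex[i] = in.readUnsignedShort(); break; // Class
                    case 3: case 4:                      // Integer, Float
                    case 9: case 10: case 11: case 12:   // Field/Method/InterfaceMethod refs, NameAndType
                    case 17: case 18:                    // Dynamic, InvokeDynamic
                        in.skipBytes(4); break;
                    case 5: case 6:                      // Long, Double occupy two slots
                        in.skipBytes(8); i++; break;
                    case 8: case 16: case 19: case 20:   // String, MethodType, Module, Package
                        in.skipBytes(2); break;
                    case 15: in.skipBytes(3); break;     // MethodHandle
                    default: throw new IOException("Unknown constant pool tag " + tag);
                }
            }

            // Class entries point at Utf8 entries, which may appear later in the table,
            // so names are resolved only after the whole pool has been read.
            Set<String> names = new TreeSet<>();
            for (int index : classNameIndex) {
                if (index != 0) {
                    names.add(utf8[index]);
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```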

As a consequence, DAGP uses some heuristics to try to work around this situation, with imperfect results. I won't go into details here, but they involve parsing source code for import statements,⁴ looking at the ldc bytecode instruction,⁵ etc. Source parsing falls over in the presence of split packages, and the ldc bytecode only provides the constant value, not its name. Together, the heuristics get most of the way there, and it's unlikely the tool will get more accurate here without much more sophisticated source code analysis. Happy to work on that if you want to fund me!
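To give a flavor of the import-statement heuristic, here is a deliberately naive sketch that pulls import targets out of Java or Kotlin source with a regular expression. The class `ImportScanner` is invented for illustration and is not DAGP's source parser; it ignores comments, string literals, and Kotlin's backticked names.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImportScanner {

    // Matches import lines, with an optional `static` keyword and optional trailing `;`.
    private static final Pattern IMPORT = Pattern.compile(
        "^\\s*import\\s+(?:static\\s+)?([\\w.*]+)\\s*;?", Pattern.MULTILINE);

    /** Returns the imported names in the order they appear in the source text. */
    static Set<String> imports(String source) {
        Set<String> result = new LinkedHashSet<>();
        Matcher matcher = IMPORT.matcher(source);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }
}
```

An import only names a class or package, so when that package is split across several modules the scan still cannot say which module the constant came from.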

Special thanks

Special thanks to Luis Cortés once again for the thorough review!

Who stalks us in the darkness?

The above list of ~~grievances~~ curses should not be taken as comprehensive. It is merely the list of things that have most recently destroyed my will to live.⁶

…wait, who is that behind me, in the dark wood…?

[Image: Baba Yaga (no, not that one).] This is what I think of when I imagine the Kotlin compiler given human form.

1. I'm eliding some complexity around the classloader hierarchy, and the possibility you may have a custom classloader that doesn't follow standard behavior. See "A crash course in classpaths" for more information.
2. To be clear, this is not about Java/Kotlin interop, but the fact that each compiler (protoc and Wire) simply emits different code from the same protobuf schema.
3. I think they all do, but I haven't checked exhaustively. At least javac, ecj, and kotlinc do.
4. See here for where DAGP parses source code in a very simplified way.
5. This post is already too long for me to explain in depth what I mean here. You can see where DAGP visits the LDC instruction here.
6. Kidding! Once again the thing that's killing my will to live is just "AI."

Top comments (2)

Ben Halpern • Nov 24

Cursed is a good way to describe most software ecosystems

GnomeMan4201 • Nov 24

If it has dependencies, it has an attack surface. Java’s issue isn’t “too many packages” — it’s the transitive trust model combined with almost no runtime verification. When you’re pulling 200+ libraries just to print “Hello World,” you’re not just cursed… you’re handing adversaries 200 potential injection points before your code even runs.