Anna Voronina

Posted on Nov 21

Building the PVS-Studio megapolis

#cpp #vscode #programming #testing

Have you ever wanted to see your code in a whole new light? For example, imagine what your code base would look like if it were a city. Sounds a bit unreal, doesn't it? Let's take a walk through the city of PVS-Studio and discover its secrets :).

First things first

While surfing the web, we came across a tool that can transform code into a city. Grappl builds a visual map of your code base, showing the scale and hierarchy of all files and classes. The result is breathtaking—it feels like you're flying over your own project!

If you, like me, have ever wondered what your code looks like from a bird's eye view, you can now experience it with the Visual Studio Code source-code editor and the Grappl extension.

The creators designed the tool to tackle the following tasks.

Visualize the code base architecture to enhance communication between technical and non-technical team members, such as managers and analysts.
Automatically link code with tasks from trackers such as Trello or Jira, as well as with commits. This approach speeds up code delivery, reduces maintenance time, and streamlines the work of reviewers and testers.

Today, however, I invite you to indulge a bit and simply explore the buildings of our vast PVS-Studio megapolis.

Grappl assigns a different color to each type of code element. Let's take a look at the symbols in the following diagram to understand how it works:

First, we'll test the tool on some other projects. Since we regularly use our analyzer to check different projects, I still have a few saved repositories. Take the PHP project, for example. I wrote an article about the bugs we found there. This is what its code city looks like:

The map shows many files, functions, and structures, but it lacks classes and namespaces. The reason is that the code is written in C, where these entities simply don't exist.

If you have already explored this map, you have probably noticed the tall blue towers. I wondered what was inside the one on the far left. It turned out to be the lex_scan(zval *, zend_parser_stack_elem *) function from the Zend/zend_language_scanner.c file. It's about ten thousand lines long, which explains why the building is so tall.

Another example is the Xenia project in C++. We also checked it using PVS-Studio, and you can find the analysis results in this article. This is what its city looks like:

The project isn't very big, so parsing the code didn't take long, but the result is impressive. Here, you can see a lot of different namespaces and classes. Compare this map to the PHP one: if the C project looked like Washington, this one is more like New York. And since the project's developers separated their own code from the third-party libraries, you can spot a suburban area off in the distance. Its code base is much smaller than the main city.

So, what kind of city is PVS-Studio?

We looked at cities of other projects, but the real magic begins when you visualize the tool you use every day.

A quick reminder

PVS-Studio is a static code analyzer that has been on the market for over 15 years. It was originally created to detect issues during the migration from 32-bit to 64-bit systems and supported only the C and C++ languages. However, its range of features has expanded significantly over time, and it now effectively detects a wide variety of common errors. Follow the link to learn more about the stages of the project development.

PVS-Studio can now detect errors in C, C++, C#, and Java code. Additionally, the analyzer provides an extensive set of diagnostic rules that identify potential vulnerabilities from the CWE list, as well as deviations from the MISRA and OWASP ASVS standards.

Today, I invite you to take a look at the visualization of the C and C++ analyzer code. As someone involved in developing this part of the tool, I'm fascinated to see what our city would look like. Let's take a look under the hood and explore the architecture of PVS-Studio.

We're looking at a vast city whose structure reveals a great deal about the project. The division into districts is clear, so we'll look at the most important ones.

Large residential area: testing modules

Don't let the size of our test database surprise you. Building a reliable product requires extensive testing on thousands of samples, and this is just a small part of it. We also run additional tests on a separate set of open-source projects that aren't included in this visualization.

As for this test set, it consists of several parts. The diagram shows the following tests.

Rules are functional diagnostic rule tests. They're our main indicator: any changes to the core or diagnostic rules immediately appear here, showing progress or regression.
DocumentationTests are functional tests that check whether the analyzer issues warnings for examples from the documentation. If the documentation describes an error, the analyzer must find it.
ParseTests are functional code parsing tests. Most of them are combinations of standard header files generated using our utility. The goal is to ensure that code parsing completes without errors or crashes. We update these tests whenever major changes are made to the standard library.
UETests are functional tests that check how the analyzer handles Unreal Engine's custom containers.
StructTests are tests that check the accuracy of structure size and alignment calculations. There are plenty of them, so we check all possible scenarios.
AnnotationsTests are functional tests for the user annotation system, which allows users to mark up types and functions in JSON format. This provides the analyzer with additional context.
CodeCheck are functional tests that check various compiler-specific code constructs.
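To give a feel for what a structure size and alignment check involves, here is a minimal sketch. It is not PVS-Studio's test code: the names Member, Layout, computeLayout, and Sample are hypothetical, and the layout rule (each member aligned to its own alignment, tail padding up to the strictest one) is the common ABI convention, assumed here for plain structs without bit-fields or packing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Size and alignment requirement of one struct member (hypothetical helper type).
struct Member {
    std::size_t size;
    std::size_t align;
};

// Result of the hand-made layout calculation.
struct Layout {
    std::size_t size;
    std::size_t align;
};

// Round `offset` up to the next multiple of `align`.
inline std::size_t alignUp(std::size_t offset, std::size_t align) {
    return (offset + align - 1) / align * align;
}

// Lay members out in declaration order, padding each one to its alignment,
// then add tail padding so the total size is a multiple of the strictest alignment.
inline Layout computeLayout(const std::vector<Member>& members) {
    std::size_t offset = 0;
    std::size_t maxAlign = 1;
    for (const Member& m : members) {
        assert(m.align != 0);
        offset = alignUp(offset, m.align) + m.size;
        if (m.align > maxAlign) maxAlign = m.align;
    }
    return {alignUp(offset, maxAlign), maxAlign};
}

// A sample structure whose real layout is decided by the compiler.
struct Sample {
    std::uint8_t  flag;
    std::uint32_t id;
    std::uint16_t tag;
};

// The layout we expect for Sample, computed from its members only.
inline Layout expectedSampleLayout() {
    return computeLayout({
        {sizeof(std::uint8_t),  alignof(std::uint8_t)},
        {sizeof(std::uint32_t), alignof(std::uint32_t)},
        {sizeof(std::uint16_t), alignof(std::uint16_t)},
    });
}
```

A test then only has to compare expectedSampleLayout() with sizeof(Sample) and alignof(Sample): if the two disagree, the calculation has missed a rule.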

As you may have noticed, PVS-Studio runs many tests, including self-testing, to reliably detect errors in users' code bases. If you'd like to try out our static analyzer on your product, you can get a trial version here.

Central business district: the analyzer core

Most of the essential PVS-Studio operations take place here. We're working on a major update to our framework, so two cores are currently present in the code base. This setup ensures a smooth, gradual transition to the new architecture.

We've been developing the new core for about two years, but we can already see that it's bigger than the old one. That said, we still use the same modules as before, such as the preprocessor and lexer.

However, to understand how far we've come, we should start from the beginning. The heart of our city is where static analysis magic happens. Let's take a look inside the old core.

To explore it more thoroughly, I suggest we move to the "construction site", which is the code analysis process.

It all starts with the preprocessor. First, we load the *.i file generated by an external preprocessor and perform its initial preparation:

replace tabs with spaces;
remove the BOM;
write built-in functions;
handle our special comments;
map the original and preprocessed files using #line directives;
track line number mismatches between the files.
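A few of these preparation steps can be sketched in a handful of lines. This is only an illustration under assumptions, not the analyzer's code: the function names are made up, each tab is assumed to become exactly one space, and the line-marker parser accepts the standard #line form plus the GCC/Clang style of marker in preprocessed output.

```cpp
#include <cstddef>
#include <string>

// Remove a leading UTF-8 byte order mark (bytes EF BB BF), if present.
inline std::string stripBom(std::string text) {
    static const std::string bom = "\xEF\xBB\xBF";
    if (text.compare(0, bom.size(), bom) == 0) {
        text.erase(0, bom.size());
    }
    return text;
}

// Assumption: each tab becomes exactly one space, so columns keep their index.
inline std::string replaceTabs(std::string text) {
    for (char& c : text) {
        if (c == '\t') c = ' ';
    }
    return text;
}

// Location in the original source file that a line directive points to.
struct LineMarker {
    unsigned long line = 0;
    std::string file;  // empty when the directive names no file
};

// Parses `#line 42 "file.cpp"` and the GCC/Clang form `# 42 "file.cpp" 1`.
// Returns false for any other line and leaves `out` untouched.
inline bool parseLineDirective(const std::string& s, LineMarker& out) {
    const std::size_t npos = std::string::npos;
    std::size_t i = s.find_first_not_of(" \t");
    if (i == npos || s[i] != '#') return false;
    i = s.find_first_not_of(" \t", i + 1);
    if (i == npos) return false;
    if (s.compare(i, 4, "line") == 0) {
        i = s.find_first_not_of(" \t", i + 4);
        if (i == npos) return false;
    }
    std::size_t end = s.find_first_not_of("0123456789", i);
    if (end == npos) end = s.size();
    // Reject lines without a number; cap the digit count to avoid overflow.
    if (end == i || end - i > 9) return false;

    LineMarker marker;
    marker.line = std::stoul(s.substr(i, end - i));
    const std::size_t open = s.find('"', end);
    const std::size_t close = s.rfind('"');
    if (open != npos && close != npos && close > open) {
        marker.file = s.substr(open + 1, close - open - 1);
    }
    out = marker;
    return true;
}
```

With the markers parsed, mapping a position in the *.i file back to the original source comes down to remembering the most recent marker and counting lines from it.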

Then, the code is sent to the lexer (Lexer) for lexical analysis. Yes, the lexer is rather small, but its task is extremely important: it breaks the continuous program text into basic building blocks, or tokens, such as keywords, operators, and identifiers.
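To make the idea of tokens concrete, here is a toy lexer. It is a sketch under simplifying assumptions, nowhere near a real C++ lexer: the keyword list is tiny, every unrecognized character counts as a one-character operator, and the names TokenKind, Token, and tokenize are invented for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenKind { Keyword, Identifier, Number, Operator };

struct Token {
    TokenKind kind;
    std::string text;
};

// A tiny, illustrative keyword set; a real C++ lexer knows many more.
inline bool isKeyword(const std::string& word) {
    static const char* const keywords[] = {"int", "return", "if", "else", "while"};
    for (const char* k : keywords) {
        if (word == k) return true;
    }
    return false;
}

inline bool isIdentChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Split the text into tokens, skipping whitespace between them.
inline std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;
            continue;
        }
        const std::size_t start = i;
        TokenKind kind;
        if (std::isalpha(c) || c == '_') {
            while (i < src.size() && isIdentChar(src[i])) ++i;
            kind = isKeyword(src.substr(start, i - start)) ? TokenKind::Keyword
                                                           : TokenKind::Identifier;
        } else if (std::isdigit(c)) {
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            kind = TokenKind::Number;
        } else {
            ++i;  // simplification: any other character is a one-character operator
            kind = TokenKind::Operator;
        }
        assert(i > start);  // every iteration consumes at least one character
        tokens.push_back({kind, src.substr(start, i - start)});
    }
    return tokens;
}
```

Calling tokenize("int x = 42;") yields five tokens: a keyword, an identifier, an operator, a number, and another operator.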

These tokens are then passed to the parser (Parser). Its task is to understand the code structure by assembling a complete architectural diagram, or syntax tree, from scattered building blocks. This is the tree that our diagnostic rules will travel along.
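The step from scattered tokens to a tree can be shown with a small recursive descent parser for arithmetic expressions. This is only a sketch of the general technique, assuming a grammar with four binary operators; Node, ExprParser, and toString are hypothetical names and have nothing to do with the analyzer's actual parser.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A syntax tree node: leaves are operands, inner nodes are binary operators.
struct Node {
    std::string text;
    std::unique_ptr<Node> left;
    std::unique_ptr<Node> right;
};

inline std::unique_ptr<Node> makeNode(std::string text,
                                      std::unique_ptr<Node> left = nullptr,
                                      std::unique_ptr<Node> right = nullptr) {
    std::unique_ptr<Node> n(new Node());
    n->text = std::move(text);
    n->left = std::move(left);
    n->right = std::move(right);
    return n;
}

class ExprParser {
public:
    explicit ExprParser(std::vector<std::string> tokens) : tokens_(std::move(tokens)) {}

    // Returns nullptr on a syntax error or if tokens are left over.
    std::unique_ptr<Node> parse() {
        std::unique_ptr<Node> tree = parseExpr();
        if (pos_ != tokens_.size()) return nullptr;
        return tree;
    }

private:
    bool peekIs(const char* a, const char* b) const {
        return pos_ < tokens_.size() && (tokens_[pos_] == a || tokens_[pos_] == b);
    }

    // expr := term (('+' | '-') term)*
    std::unique_ptr<Node> parseExpr() {
        std::unique_ptr<Node> lhs = parseTerm();
        while (lhs && peekIs("+", "-")) {
            std::string op = tokens_[pos_++];
            std::unique_ptr<Node> rhs = parseTerm();
            if (!rhs) return nullptr;
            lhs = makeNode(std::move(op), std::move(lhs), std::move(rhs));
        }
        return lhs;
    }

    // term := operand (('*' | '/') operand)*
    std::unique_ptr<Node> parseTerm() {
        std::unique_ptr<Node> lhs = parseOperand();
        while (lhs && peekIs("*", "/")) {
            std::string op = tokens_[pos_++];
            std::unique_ptr<Node> rhs = parseOperand();
            if (!rhs) return nullptr;
            lhs = makeNode(std::move(op), std::move(lhs), std::move(rhs));
        }
        return lhs;
    }

    // operand := any token that is not an operator
    std::unique_ptr<Node> parseOperand() {
        if (pos_ >= tokens_.size() || peekIs("+", "-") || peekIs("*", "/")) return nullptr;
        return makeNode(tokens_[pos_++]);
    }

    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};

// Print the tree with explicit parentheses to make its shape visible.
inline std::string toString(const Node* n) {
    if (!n) return "";
    if (!n->left) return n->text;
    assert(n->right);  // operator nodes always have two children
    return "(" + toString(n->left.get()) + " " + n->text + " " +
           toString(n->right.get()) + ")";
}
```

For the tokens a + b * c, the tree prints as (a + (b * c)): the multiplication ends up deeper in the tree, which is exactly the kind of structure a diagnostic rule can later walk.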

After the tree is built, the semantic analyzer (Analyzer) comes into play. Its tasks include building a type system, maintaining symbol tables, and establishing semantic relationships between code elements. In the original architecture, the semantic analyzer and parser functioned independently of each other, which created limitations for system development.

However, PVS-Studio 7.38 introduced a new core. It includes revamped components, such as a parser, semantic analyzer, and type system. It's also much bigger:

In the new architecture, the parser and semantic analyzer work closely together to form a syntax tree. Additionally, the new core includes the foundation for an abstract syntax tree (AST), which is still being finalized before it can be fully integrated into the system. For now, the parser builds a tree of the old type using SyntaxTreeBuilder.

The annotation block, which is an internal and external entity annotation system, is an integral part of the analyzer. It helps us understand the code being checked much better. The system contains the semantics of standard library classes, as well as information on how the functions behave, their preconditions, and their side effects. This section also contains the user annotation mechanism in JSON format.

After the tree is built, more details need to be added. To do this, we use data-flow analysis, which tracks variable values as it traverses the program code.

It helps detect tricky errors such as signed integer overflow, division by zero, buffer overflow, and others.
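Here is a deliberately tiny sketch of the idea, assuming straight-line code with no branches or loops. It only tracks variables whose value is a known constant and reports divisions where the divisor is known to be zero. The statement model (Operand, Stmt) and the function findDivisionsByZero are invented for this example; real data-flow analysis tracks ranges and sets of values and handles control flow.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <vector>

struct Operand {
    bool isConst;
    std::int32_t value;  // used when isConst is true
    std::string name;    // used when isConst is false
};
inline Operand constant(std::int32_t v) { return {true, v, ""}; }
inline Operand variable(const std::string& n) { return {false, 0, n}; }

// target = lhs          (op == '=', rhs is ignored)
// target = lhs op rhs   (op is '+', '-' or '/')
struct Stmt {
    int line;
    std::string target;
    Operand lhs;
    char op;
    Operand rhs;
};

// A value that may or may not be known at this point of the program.
struct Known {
    bool known;
    std::int64_t value;
};

inline Known lookup(const std::map<std::string, std::int32_t>& env, const Operand& o) {
    if (o.isConst) return {true, o.value};
    const auto it = env.find(o.name);
    if (it == env.end()) return {false, 0};
    return {true, it->second};
}

// Returns the line numbers of divisions whose divisor is known to be zero.
inline std::vector<int> findDivisionsByZero(const std::vector<Stmt>& program) {
    std::map<std::string, std::int32_t> env;  // variables with a known value
    std::vector<int> warnings;
    for (const Stmt& s : program) {
        assert(s.op == '=' || s.op == '+' || s.op == '-' || s.op == '/');
        const Known a = lookup(env, s.lhs);
        const Known b = s.op == '=' ? Known{false, 0} : lookup(env, s.rhs);
        Known r = {false, 0};
        if (s.op == '=') {
            r = a;
        } else if (s.op == '/' && b.known && b.value == 0) {
            warnings.push_back(s.line);
        } else if (a.known && b.known) {
            // 64-bit arithmetic on 32-bit inputs cannot overflow here.
            r.known = true;
            r.value = s.op == '+' ? a.value + b.value
                    : s.op == '-' ? a.value - b.value
                                  : a.value / b.value;
        }
        // A result outside the 32-bit range is treated as unknown.
        const bool fits = r.known &&
                          r.value >= std::numeric_limits<std::int32_t>::min() &&
                          r.value <= std::numeric_limits<std::int32_t>::max();
        if (fits) {
            env[s.target] = static_cast<std::int32_t>(r.value);
        } else {
            env.erase(s.target);
        }
    }
    return warnings;
}
```

Given x = 5, then y = x - 5, then z = 10 / y, the sketch knows that y is zero by the time it reaches the division and reports that line. A division by a variable it knows nothing about stays silent.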

Once the analyzer has grasped all the code, the diagnostic rules show up on stage.

Control point: static analyzer

This area contains the core components that keep everything up and running:

the command-line and configuration file parsers handle user settings and define analysis parameters;
the analysis configuration system provides flexible tools to tailor the checking process for specific needs;
the diagnostic rule manager keeps track of enabled rules and determines which warning should be issued for a given line;
the code tree walker sequentially traverses the generated tree and triggers diagnostic rules for the appropriate nodes.
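The last two items, a rule manager and a tree traversal that triggers rules for matching nodes, can be sketched together. Everything below is hypothetical: TreeNode, RuleRunner, and the rule IDs used in the usage note are made up for illustration and do not reflect how PVS-Studio stores or dispatches its rules.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct TreeNode {
    std::string kind;  // e.g. "Division", "Literal", "Identifier"
    std::string text;
    int line = 0;
    std::vector<std::unique_ptr<TreeNode>> children;
};

inline std::unique_ptr<TreeNode> makeTreeNode(std::string kind, std::string text, int line) {
    std::unique_ptr<TreeNode> n(new TreeNode());
    n->kind = std::move(kind);
    n->text = std::move(text);
    n->line = line;
    return n;
}

struct Warning {
    std::string rule;
    int line;
};

class RuleRunner {
public:
    // A rule inspects one node and returns true if a warning should be issued.
    using Rule = std::function<bool(const TreeNode&)>;

    // Register a rule for one node kind; disabled rules are kept but never run.
    void addRule(std::string id, const std::string& nodeKind, Rule check, bool enabled = true) {
        assert(check != nullptr);
        rules_.emplace(nodeKind, Entry{std::move(id), std::move(check), enabled});
    }

    // Depth-first, pre-order traversal that triggers the rules for each node.
    std::vector<Warning> run(const TreeNode& root) const {
        std::vector<Warning> out;
        visit(root, out);
        return out;
    }

private:
    struct Entry {
        std::string id;
        Rule check;
        bool enabled;
    };

    void visit(const TreeNode& node, std::vector<Warning>& out) const {
        const auto range = rules_.equal_range(node.kind);
        for (auto it = range.first; it != range.second; ++it) {
            const Entry& e = it->second;
            if (e.enabled && e.check(node)) out.push_back({e.id, node.line});
        }
        for (const auto& child : node.children) {
            if (child) visit(*child, out);
        }
    }

    std::multimap<std::string, Entry> rules_;  // node kind -> rules for it
};
```

For example, a rule registered as "DEMO001" for "Division" nodes could check whether the second child is the literal 0, while a rule registered with enabled set to false is skipped during the walk.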

This compact yet crucial area runs the core and operations center, which we'll explore now.

Operations center: diagnostic rules

This is the most densely populated area of our city, housing over 700 code analysis experts. Indeed, each diagnostic rule is like a highly specialized expert. Together, they form the ideal project review team.

There are plenty of diagnostic rules here, as well as files containing all warning messages, utilities, and function declarations for various diagnostic groups.

You might notice that the V826 diagnostic rule stands out from the rest. That's because it deals with complex container analysis: the rule must detect data access patterns, estimate the algorithmic complexity of operations, and determine a more efficient alternative to the standard container.

Let's wrap up our journey

The PVS-Studio map is more than just a beautiful visualization; it reflects a complex yet well-thought-out architecture where each area plays a role in identifying code defects.

Turn your own code into a city and share spectacular views in the comments :)