Why symbolic execution is the leading-edge method for generating test values

#testing #devops

In the first blog post of our blog series on Symflower's Core Technology, we explained how symbolic execution works and how we apply it to test source code, and generate unit tests. In this article, we will explain why we decided to go with it, and why symbolic execution is superior to other methods that aim to generate test values.

Methods for automated test value generation

When picking a technique to receive input values for different software tests, we can differentiate between white-box and black-box testing:

White-box testing takes the source code into account and picks specific test values to trigger certain paths of code to verify their behavior. If the code is exhaustively analyzed, for instance, by a clever algorithm, one aims to choose the test inputs in a manner that triggers all possible execution paths in order to reveal all the unexpected and incorrect outcomes. Examples for white-box testing techniques include bounded model checking and symbolic execution.

Black-box testing checks the correctness of programs without looking into the code, which means that the project is not analyzed. Instead, many tests are executed with any possible values. There are different methods to decide values such as boundary-value analysis and fuzzing.

To learn more about software testing itself, check out our guide to software testing.

In the following sections, we discuss each method with examples, as well as how they work, and their limitations.

Boundary-value analysis

Boundary-value analysis is commonly used by manual testers and thought for seminars on software test certifications. For generating test values with this method, the input value domains are partitioned. For instance, when testing a function that takes an integer as input, we could split the integer range every hundred steps into intervals from -100 to 0, 0 to 100, 100 to 200, and so on. Then, a minimum, maximum, and middle value are picked to test the code. Additionally, values just below and above the minimum and maximum are used. In our example, we could pick the values -1, 0, 1, 50, 99, 100, and 101 as inputs to test the function in the interval from 0 to 100. The idea behind this technique is to catch likely corner cases.

The following example is a function that takes an integer array and an integer. If the given integer has the value 123, an index access is performed on the array. This access results in an index-out-of-range exception if the array does not contain an element at the desired position 456.

func foobar(list []int, x int) int {
  if x == 123 {
    return list[456]
  }
  return 0
}

The easiest way to trigger the exception is to call the function with an empty list, or just an undefined list (in Go this is nil in Java it is null) and the number 123 for x. This is how such a test could look like, and is one of the tests Symflower generates for the given function:

package tutorial

import (
  "testing"
)

func TestSymflowerFoobar(t *testing.T) {
  list := []int{}
  x := 123

  foobar(array, x)
}

However, with boundary-value analysis, countless tests are performed, tests that are likely to still miss the index-out-of-range exception.

Since the source code is not analyzed, possible misbehavior of a program such as the index-out-of-range exception in the example above is not specifically searched for, and is only revealed with pure luck. Therefore, this method can only achieve poor coverage in real-world source code. For example, the "if" branch of the function above is only covered by boundary-value analysis if we are lucky enough to pick exactly the right value for x.

Fuzzing

With fuzzing, random values are produced automatically to test the code. The function under test is executed repeatedly with the generated input values in order to reveal problematic behavior, such as program failures.

However, finding bugs like the index-out-of-range exception in the example above in a reasonable amount of time depends on how the fuzzer chooses the input values. How long does it take the generator to produce our malicious input 123? Has the random value generation stopped before that? If so, the fuzzer does not expose the bug. Covering the input domain is even trickier when dealing with more complex structures, as the possibilities for different inputs explode. See our blog post about the Go fuzzer to view an analysis of a simple example about strings. Even though Go's native fuzzing is very powerful, the tool does not execute the desired string case of the example in the first 20 (!) minutes. Hence, similarly to boundary-value analysis, the reached coverage of fuzzing depends on luck, while wasting resources for an awful lot of unnecessary program executions.

Additionally, the generated test values are not like a human would pick them. For instance, to reveal an error path that is guarded with the condition x > 123, the assignment x = 124 in the corresponding test is much more intuitive than an arbitrary 32-bit integer greater than 123, let's say x = 274032187. As a result, developers have a hard time understanding and using tests that contain these values for debugging found problems.

Imagine a string structure "Color" that only allows the strings "red", "blue", "green" and "yellow", and the following piece of code:

switch myColorInput {
case "blue":
  return "sky"
case "green":
  return "grass"
case "yellow":
  return "sun"
default:
  return "rainbow"
}

A fuzzer may provide a test that covers the default case, but since the underlying structure is not analyzed, the chance is virtually zero that a value "red" is generated. Consequently, it is not obvious from the produced random string values to the developer that they missed one case.

Bounded model checking

Bounded model checking is a technique that takes the given source code into account, and can therefore produce higher coverage. For analyzing programs, models of the given source code are specified as logic formulas that are checked for satisfiability. Read on to learn about satisfiability problems and how to solve them. For deeper insights into bounded model checking of finite program executions, we recommend this article by Armin Biere et al..

A clear disadvantage of bounded model checking, however, is that the analyses are limited in depth and complexity, e.g. bugs that occur only on deep levels of loops and recursions cannot be found in a scaling manner or even not at all.

In the following example, the given function takes two integers and performs a number of loop iterations, which is determined by the input. If the loop is executed the right amount of times, the program fails with a division by zero. The bug may or may not be found, depending on the implementation of the bounded model checker.

func f(j int, k int) int {
  if k < 10 {
    return 0
  }

  a := 0
  for i := 0; i < j; i++ {
    if i > k {
      a++ // This needs to be executed at least 10 times to reveal the error. Hence, "j" needs to be chosen to be a high enough number.
    }
  }

  if a > k { // The edge case "a == k" was overlooked in this condition.
    return 1
  }

  return j / (a - k) // Here, the division by zero happens for "a == k".
}

Symbolic execution

With symbolic execution, the source code is executed with symbolic values instead of actual ones, meaning that the instances can be picked at the end of the analysis. Conditions on these symbolic values are collected along the way and expressed in mathematical language. Finally, values that fulfill these conditions are computed. In this way, symbolic execution is able to trigger all possible execution paths of the program with the produced input values. This technique is the premier class of program analysis since it is theoretically unlimited in depth and complexity. By utilizing symbolic execution, Symflower searches for all possible bugs in your software and supplies optimized unit tests that are easy to read and ideal for debugging. To see an example and learn more, read our blog post about symbolic execution.

Takeaways

Please note that black-box testing techniques can be helpful to check the behavior of your program. However, lots and lots of tests are generated unnecessarily, resulting in a waste of resources. And the main problem remains, as the same for bounded model checking: Not all computation paths are validated, and dangerous bugs stay hidden. This is why Symflower decided to go with symbolic execution, the champion of program analysis.

Naturally, there are also issues that are problematic for symbolic execution, such as path explosion or calls to the environment like programs that interact with files. Tackling these challenges is, however, worth it to be able to analyze arbitrary projects. Concluding, Symflower picks the best of the best methods, which is the only way of fully autonomously assuring quality for all software there is.

Stay tuned about Symflower's Core Technology and upcoming blog posts of this series. Sign up for our newsletter and follow us on Twitter, Facebook, and LinkedIn for more content.