Feature Engineering for Code Quality Evaluation

#machinelearning #security #softwareengineering

The key here is to transform source code files into something measurable, so that we can extract meaningful insights and find patterns that point to potential vulnerabilities and general bugs caused by coding practices.

Some common measurements are:

Number of code lines
Number of nested functions
Number of recursions
Number of defined functions
Number of defined classes
- Number of defined methods inside a class
Cyclomatic Complexity
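
For Python sources, several of these counts can be sketched with the standard library `ast` module. This is only an illustration under some assumptions: the function name `extract_features`, the feature names and the sample code are hypothetical, "functions" here counts every `def` (methods and nested functions included), and "recursions" only detects a function calling itself directly by name.

```python
import ast


def extract_features(source: str) -> dict:
    """Count a few structural features of one Python source string."""
    tree = ast.parse(source)
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    features = {
        # Non-blank lines only; comments are still counted.
        "code_lines": sum(1 for line in source.splitlines() if line.strip()),
        "functions": 0,
        "nested_functions": 0,
        "classes": 0,
        "methods": 0,
        "recursions": 0,
    }

    def visit(node, parent, enclosing_func):
        if isinstance(node, ast.ClassDef):
            features["classes"] += 1
        elif isinstance(node, func_types):
            features["functions"] += 1
            if isinstance(parent, ast.ClassDef):
                features["methods"] += 1
            elif enclosing_func is not None:
                features["nested_functions"] += 1
            enclosing_func = node.name
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id == enclosing_func):
            # Direct self-call by name, e.g. factorial(...) inside factorial.
            features["recursions"] += 1
        for child in ast.iter_child_nodes(node):
            visit(child, node, enclosing_func)

    visit(tree, None, None)
    return features


# Hypothetical sample file content, used only to exercise the sketch.
SAMPLE = '''
class Cart:
    def total(self, items):
        return sum(items)

def factorial(n):
    def check(value):
        return value <= 1
    return 1 if check(n) else n * factorial(n - 1)
'''

features = extract_features(SAMPLE)
print(features)
```

Running this over every file in a code base gives one row of features per file, which is the shape of dataset the rest of the article works with.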

Cyclomatic complexity is an important metric, published by T.J. McCabe in 1976, that quantifies source code complexity independently of its size, relying only on the structure of the code.
As an example, we can evaluate the following two pieces of Python code.

if price > 200:
    if recurring_customer:
        if season == 'summer':
            discount = 0.3
        else:
            discount = 0.15
else:
    discount = 0.01

The following graph represents the code:

Using the usual definition for cyclomatic complexity,

M = E - N + 2P, where

E is the number of edges of the graph
N is the number of nodes of the graph
P is the number of connected components

we get a cyclomatic complexity number of M = 10 - 8 + 2(1) = 4.

if price > 200:
    if recurring_customer:
        if season == 'summer':
            discount = 0.3
        else: 
            discount = 0.15
    else: 
        discount = 0.01

For this second version, we again get a cyclomatic complexity number of M = 10 - 8 + 2(1) = 4.

Cyclomatic complexity is directly related to the number of decision points (branches) in the code.
Although the structure of the paths has changed, the number of decisions (if statements) has remained the same (3). Therefore, the number of independent paths that need to be tested is still four.
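
The relation between decisions and complexity can be checked with a small sketch that counts decision points with the standard library `ast` module and adds one. The function name `cyclomatic_complexity` and the set of node types counted are assumptions made for this illustration; real tools use more detailed rules (for example for boolean operators), so their numbers may differ on other code.

```python
import ast

# Node types treated as decision points in this simplified sketch.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity as decision points + 1."""
    tree = ast.parse(source)
    decisions = sum(isinstance(node, DECISION_NODES) for node in ast.walk(tree))
    return decisions + 1


# First version: the final else belongs to the outer `if price > 200`.
FIRST = """
if price > 200:
    if recurring_customer:
        if season == 'summer':
            discount = 0.3
        else:
            discount = 0.15
else:
    discount = 0.01
"""

# Second version: the final else belongs to `if recurring_customer`.
SECOND = """
if price > 200:
    if recurring_customer:
        if season == 'summer':
            discount = 0.3
        else:
            discount = 0.15
    else:
        discount = 0.01
"""

first_m = cyclomatic_complexity(FIRST)
second_m = cyclomatic_complexity(SECOND)
print(first_m, second_m)
```

The snippets are only parsed, never executed, so the undefined names (`price`, `season` and so on) are not a problem.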

pylint can be used to calculate cyclomatic complexity automatically in this experiment.

Back to the main discussion, the number of features was kept to a bare minimum so that simple tools like grep, awk, cut, sed and pylint (in the case of Python code) are enough to extract all the needed features.

After creating a dataset from your code base, with columns representing the described features and rows representing your source code files, you can use PCA, as explained in my PCA article, to reduce the number of features.
K-Means (example here) can then be applied to evaluate the result, creating a heat map and grouping similar code to give a better overview of the main structure of your code base.
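
As a rough sketch of that last step, assuming NumPy and scikit-learn are installed, the pipeline could look like the following. The feature table is made-up illustration data rather than measurements from a real code base, and the choices of two principal components and two clusters are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table: one row per source file, columns are
# [code_lines, functions, classes, methods, cyclomatic_complexity].
X = np.array([
    [120, 6, 1, 4, 9],
    [95, 5, 1, 3, 7],
    [110, 7, 1, 5, 8],
    [900, 40, 6, 30, 65],
    [850, 35, 5, 28, 70],
    [40, 2, 0, 0, 3],
], dtype=float)

# Standardize first so that large-valued columns such as code_lines
# do not dominate the principal components.
X_scaled = StandardScaler().fit_transform(X)

# Reduce the five features to two components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Group files with similar profiles; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)

print(pca.explained_variance_ratio_)
print(labels)
```

Scaling before PCA is a design choice worth keeping: without it, the line count alone would likely decide both the components and the clusters.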
