Sam

Programmatically collecting statistics and quality metrics from Python packages

While comparing a large number of machine learning models submitted to data science competitions run at Unearthed, we wanted to go beyond analysing the predictive performance of each model and also compare statistics and quality metrics for the source code files behind it.

There are a number of tools out there for analysing packages, so the design goals were to collect a wide spectrum of information and to use something that could easily be invoked programmatically (rather than parsing output from a CLI), so that it could tie into our data science pipeline. With that in mind, several options were evaluated.

In the end, pylint and radon had the most promising and accessible feature sets.

Radon had documented APIs for programmatically accessing statistics in four categories: cyclomatic complexity, maintainability index, raw metrics and Halstead metrics. An example of collecting some statistics through that API:

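from radon.cli import Config
from radon.cli.harvest import MIHarvester
from radon.complexity import SCORE  # sort-order constant used in the config below


def measure_maintainability(source_dir):
    # Analyse every Python file under source_dir and print its maintainability index.
    harvester = MIHarvester([source_dir], harvester_config())
    for path, raw_maintainability_statistics in harvester.results:
        # Files radon could not parse carry an 'error' entry instead of 'mi'.
        if 'error' in raw_maintainability_statistics:
            continue
        print(raw_maintainability_statistics['mi'])


def harvester_config():
    return Config(
        exclude=None, ignore=None, order=SCORE, no_assert=False,
        show_closures=True, multi=4, by_function=False, min='A', max='F',
        include_ipynb=False,
    )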
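The payload shown at the end of this post reports a min, max and mean for the maintainability index. As a rough sketch of how the harvester output might be reduced to that shape, the snippet below assumes the results arrive as `(path, stats)` pairs where each `stats` dict holds either an `mi` score or an `error` message. The `summarise_mi` name and the sample data are made up for illustration and are not part of radon.

```python
from statistics import mean


def summarise_mi(results):
    """Reduce (path, stats) pairs to the min/max/mean of their 'mi' scores.

    Assumes each stats dict has either an 'mi' key or an 'error' key.
    """
    scores = [stats['mi'] for _, stats in results if 'mi' in stats]
    if not scores:
        return None  # nothing could be analysed
    return {'min': min(scores), 'max': max(scores), 'mean': mean(scores)}


# Hand-written sample standing in for harvester.results (hypothetical paths).
sample_results = [
    ('pkg/a.py', {'mi': 40.0, 'rank': 'A'}),
    ('pkg/b.py', {'mi': 80.0, 'rank': 'A'}),
    ('pkg/broken.py', {'error': 'invalid syntax'}),
]

print(summarise_mi(sample_results))
# → {'min': 40.0, 'max': 80.0, 'mean': 60.0}
```

Skipping the `error` entries keeps one unparsable file from aborting the whole summary, at the cost of silently leaving it out of the numbers.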

Pylint lacked documentation on programmatically collecting information from a source directory, but examining the entry point of the CLI command led to the following code snippet:

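import io
import json
from contextlib import redirect_stdout

from pylint.lint import Run

# Example value only: list whichever pylint checks should be skipped.
disabled_checks = ['fixme']


def lint_directory(source_dir):
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        try:
            Run(['--output-format=json', '--disable=' + ','.join(disabled_checks), source_dir])
        except SystemExit:
            # By default Run() calls sys.exit() when it finishes; swallow that so the output can be read.
            pass
    lint_results = json.loads(buffer.getvalue().replace("\n", ""))
    print(lint_results)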
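The parsed JSON output is a list of message dicts, each carrying fields such as `type` and `symbol`. A small sketch of how those could be tallied into per-type and per-symbol counts follows. The `summarise_lint` helper and the sample messages are illustrative assumptions, not something pylint provides.

```python
from collections import Counter


def summarise_lint(messages):
    """Count lint messages by their 'type' and by their 'symbol'."""
    return {
        'types': dict(Counter(m['type'] for m in messages)),
        'symbols': dict(Counter(m['symbol'] for m in messages)),
    }


# Hand-written sample resembling entries of pylint's JSON output.
sample_messages = [
    {'type': 'convention', 'symbol': 'invalid-name', 'path': 'a.py', 'line': 1},
    {'type': 'convention', 'symbol': 'line-too-long', 'path': 'a.py', 'line': 7},
    {'type': 'warning', 'symbol': 'unused-import', 'path': 'b.py', 'line': 2},
    {'type': 'convention', 'symbol': 'invalid-name', 'path': 'b.py', 'line': 9},
]

print(summarise_lint(sample_messages))
```

Grouping the same list by `path` first would give the per-file view, using the same counting logic.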

After collating each of the metrics, the response ended up looking like the following JSON payload:

{
  "aggregated_analysis": {
    "complexity": {
      "min": 1.0,
      "max": 5.0,
      "mean": 1.7619047619047619
    },
    "maintainability_index": {
      "min": 42.44120650814055,
      "max": 73.1183133154694,
      "mean": 61.16721652696456
    },
    "code_statistics": {
      "loc": 940,
      "lloc": 521,
      "sloc": 566,
      "comments": 183,
      "multi": 31,
      "blank": 186,
      "single_comments": 157
    },
    "halstead_metrics": {
      "h1": {
        "min": 1.0,
        "max": 15.0,
        "mean": 8.0
      },
      "h2": {
        "min": 3.0,
        "max": 137.0,
        "mean": 52.0
      },
      "N1": 110.0,
      "N2": 215.0,
      "vocabulary": {
        "min": 4.0,
        "max": 152.0,
        "mean": 60.0
      },
      "length": 325.0,
      "calculated_length": {
        "min": 4.754887502163469,
        "max": 1031.03375429972,
        "mean": 374.5962139339611
      },
      "volume": {
        "min": 12.0,
        "max": 2101.89897889864,
        "mean": 748.9542971398513
      },
      "difficulty": {
        "min": 0.6666666666666666,
        "max": 10.510948905109489,
        "mean": 5.309205190592052
      },
      "effort": {
        "min": 8.0,
        "max": 22092.952770905413,
        "mean": 7577.510451793251
      },
      "time": 1262.9184086322082,
      "bugs": 0.7489542971398513
    },
    "linting": {
      "types": {
        "convention": 206,
        "warning": 55,
        "refactor": 12
      },
      "symbols": {
        "invalid-name": 85,
        "line-too-long": 53,
        "trailing-whitespace": 33,
        // ...
      }
    }
  },
  "file_analysis": {
    // The same metrics, but for each file in the package.
  }
}
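For the count-style entries under `code_statistics`, one plausible way to get from the per-file numbers to the aggregated ones is plain summation. The post does not spell out how the aggregate was computed, so treat the following as a sketch under that assumption. The `total_code_statistics` name and the sample figures are hypothetical.

```python
from collections import Counter


def total_code_statistics(per_file):
    """Sum count-style metrics (loc, sloc, blank, ...) across all files."""
    totals = Counter()
    for stats in per_file.values():
        totals.update(stats)  # adds each metric's value to the running total
    return dict(totals)


# Hypothetical per-file raw metrics, keyed by file path.
sample_files = {
    'pkg/a.py': {'loc': 10, 'sloc': 7, 'blank': 3},
    'pkg/b.py': {'loc': 5, 'sloc': 4, 'blank': 1},
}

print(total_code_statistics(sample_files))
# → {'loc': 15, 'sloc': 11, 'blank': 4}
```

Summation only makes sense for counts. Ratio-like values such as the maintainability index or Halstead difficulty are better summarised with min, max and mean, as in the payload above.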
