
Sam for Unearthed Solutions



Programmatically collecting statistics and quality metrics from Python packages

While comparing a large number of machine learning models submitted to data science competitions run at Unearthed, we wanted to look beyond the predictive performance of each model and also compare statistics and quality metrics for its source code files.

A number of tools exist for analysing packages, so the design goals were to collect a wide spectrum of information and to pick something that could easily be invoked programmatically (rather than by parsing output from a CLI), so that it could tie into our data science pipeline. With that in mind, several options were evaluated.

In the end, pylint and radon had the most promising and accessible feature sets.

Radon had documented APIs for programmatically accessing statistics in four categories: cyclomatic complexity, maintainability index, raw metrics and Halstead metrics. An example of collecting some statistics through the API is:

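from radon.cli import Config
from radon.cli.harvest import MIHarvester
from radon.complexity import SCORE

def measure_maintainability(source_dir):
    harvester = MIHarvester([source_dir], harvester_config())
    # Each result pairs a file path with that file's raw statistics.
    for path, raw_maintainability_statistics in harvester.results:
        ...  # collate raw_maintainability_statistics for path here

def harvester_config():
    # SCORE orders blocks by descending complexity.
    return Config(exclude=None, ignore=None, order=SCORE, no_assert=False, show_closures=True, multi=4, by_function=False, min='A', max='F', include_ipynb=False)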
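
The per-file results still need to be collapsed into a single summary. The following is a minimal sketch of one way to do that, not the code used at Unearthed. It assumes each result is a dict holding the score under an `mi` key, and that files radon could not parse carry an `error` key instead. The function name `summarise_maintainability` and the sample paths are hypothetical.

```python
from statistics import mean


def summarise_maintainability(results):
    """Collapse (path, stats) pairs into min/max/mean of the MI scores.

    Assumes stats looks like {"mi": 61.2}; entries without an "mi" key
    (for example parse errors) are skipped.
    """
    scores = [stats["mi"] for _path, stats in results if "mi" in stats]
    if not scores:
        return None
    return {"min": min(scores), "max": max(scores), "mean": mean(scores)}


# Hypothetical results, shaped like the pairs yielded by harvester.results.
sample_results = [
    ("model/train.py", {"mi": 42.5}),
    ("model/predict.py", {"mi": 73.0}),
    ("model/broken.py", {"error": "invalid syntax"}),
]

summary = summarise_maintainability(sample_results)
print(summary)
```

Skipping the entries without a score keeps one unparseable file from failing the whole package, at the cost of silently under-reporting. Logging the skipped paths would be a reasonable addition.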

Pylint lacked documentation on programmatically collecting information from a source directory, but examining the entry point of the CLI command led to the following code snippet:

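import io
import json
from contextlib import redirect_stdout

from pylint.lint import Run

def lint_directory(source_dir, disabled_checks):
    # disabled_checks is a list of pylint message names to suppress.
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        # exit=False stops Run from calling sys.exit() once linting finishes
        # (older pylint releases named this argument do_exit).
        Run(['--output-format=json', '--disable=' + ','.join(disabled_checks), source_dir], exit=False)
    return json.loads(buffer.getvalue().replace("\n", ""))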
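
Pylint's JSON output is a flat list of messages, so it has to be tallied before it is useful as a metric. Below is a small sketch of one way to count messages by type and by symbol. It relies on each message being a dict with `type` and `symbol` keys, which is how pylint's JSON reporter shapes its output. The function name `summarise_lint` and the sample messages are made up for illustration.

```python
from collections import Counter


def summarise_lint(lint_results):
    """Count pylint messages by type (convention, warning, ...) and by symbol."""
    types = Counter(message["type"] for message in lint_results)
    symbols = Counter(message["symbol"] for message in lint_results)
    return {
        "types": dict(types),
        # most_common() orders symbols from most to least frequent.
        "symbols": dict(symbols.most_common()),
    }


# Hypothetical messages, trimmed to the keys used above.
sample_messages = [
    {"type": "convention", "symbol": "invalid-name", "path": "model/train.py", "line": 3},
    {"type": "convention", "symbol": "invalid-name", "path": "model/train.py", "line": 9},
    {"type": "convention", "symbol": "line-too-long", "path": "model/predict.py", "line": 12},
    {"type": "warning", "symbol": "unused-import", "path": "model/predict.py", "line": 1},
]

lint_summary = summarise_lint(sample_messages)
print(lint_summary)
```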

After collating each of the metrics, the response ended up looking like the following JSON payload:

{
  "aggregated_analysis": {
    "complexity": {
      "min": 1.0,
      "max": 5.0,
      "mean": 1.7619047619047619
    },
    "maintainability_index": {
      "min": 42.44120650814055,
      "max": 73.1183133154694,
      "mean": 61.16721652696456
    },
    "code_statistics": {
      "loc": 940,
      "lloc": 521,
      "sloc": 566,
      "comments": 183,
      "multi": 31,
      "blank": 186,
      "single_comments": 157
    },
    "halstead_metrics": {
      "h1": {
        "min": 1.0,
        "max": 15.0,
        "mean": 8.0
      },
      "h2": {
        "min": 3.0,
        "max": 137.0,
        "mean": 52.0
      },
      "N1": 110.0,
      "N2": 215.0,
      "vocabulary": {
        "min": 4.0,
        "max": 152.0,
        "mean": 60.0
      },
      "length": 325.0,
      "calculated_length": {
        "min": 4.754887502163469,
        "max": 1031.03375429972,
        "mean": 374.5962139339611
      },
      "volume": {
        "min": 12.0,
        "max": 2101.89897889864,
        "mean": 748.9542971398513
      },
      "difficulty": {
        "min": 0.6666666666666666,
        "max": 10.510948905109489,
        "mean": 5.309205190592052
      },
      "effort": {
        "min": 8.0,
        "max": 22092.952770905413,
        "mean": 7577.510451793251
      },
      "time": 1262.9184086322082,
      "bugs": 0.7489542971398513
    },
    "linting": {
      "types": {
        "convention": 206,
        "warning": 55,
        "refactor": 12
      },
      "symbols": {
        "invalid-name": 85,
        "line-too-long": 53,
        "trailing-whitespace": 33,
        // ...
      }
    }
  },
  "file_analysis": {
    // The same metrics, but for each file in the package.
  }
}
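
The Halstead entries in the payload are related by fixed formulas, which a short sketch can make concrete. The function below derives the remaining values from the four base counts using the standard Halstead definitions as given in radon's documentation. The name `halstead_derived` is hypothetical. The example call assumes that the "min" values above all come from one small file with h1 = 1 and h2 = 3; the per-file totals N1 = 2 and N2 = 4 are inferred from the reported volume and difficulty, not stated in the payload.

```python
import math


def _n_log2_n(n):
    """Return n * log2(n), treating the n = 0 case as 0."""
    return n * math.log2(n) if n > 0 else 0.0


def halstead_derived(h1, h2, N1, N2):
    """Derive Halstead metrics from distinct (h1, h2) and total (N1, N2)
    operator and operand counts."""
    vocabulary = h1 + h2
    length = N1 + N2
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    difficulty = (h1 / 2) * (N2 / h2) if h2 > 0 else 0.0
    effort = difficulty * volume
    return {
        "vocabulary": vocabulary,
        "length": length,
        "calculated_length": _n_log2_n(h1) + _n_log2_n(h2),
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
        "time": effort / 18,    # estimated seconds to write the code
        "bugs": volume / 3000,  # estimated number of delivered bugs
    }


# Assumed counts for the smallest file (N1 and N2 are inferred).
smallest = halstead_derived(h1=1, h2=3, N1=2, N2=4)
print(smallest)
```

Under those assumed counts the sketch gives a vocabulary of 4, a volume of 12 and an effort of about 8, in line with the "min" entries in the payload.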