While comparing a large number of machine learning models submitted to data science competitions run at Unearthed, we wanted to look beyond predictive performance and also compare statistics and quality metrics for the source code files of each model.
Given there are a number of tools out there for analysing packages, the design goals were to collect a wide spectrum of information and to use something that could easily be invoked programmatically (rather than by parsing output from a CLI), so that it could tie into our data science pipeline. With that in mind, the following options were evaluated:
In the end, pylint and radon had the most promising and accessible feature sets.
Radon had documented APIs for programmatically accessing statistics in four categories: cyclomatic complexity, maintainability index, raw metrics and Halstead metrics. An example of programmatically collecting some statistics using the API is:
from radon.cli import Config
from radon.cli.harvest import MIHarvester
from radon.complexity import SCORE  # sort-order constant used in the config below

def measure_maintainability(source_dir):
    harvester = MIHarvester([source_dir], harvester_config())
    # Each result pairs a file path with a dict of maintainability statistics.
    for path, raw_maintainability_statistics in harvester.results:
        print(raw_maintainability_statistics['mi'])

def harvester_config():
    return Config(exclude=None, ignore=None, order=SCORE, no_assert=False,
                  show_closures=True, multi=4, by_function=False, min='A',
                  max='F', include_ipynb=False)
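The loop above assumes every result contains an `mi` value. A minimal sketch of a more defensive collection step follows, assuming the results are `(path, dict)` pairs as in the loop above and that a file radon cannot parse comes back with an `error` key instead of `mi` (an assumption worth checking against your radon version). The helper name `collect_mi_scores` is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def collect_mi_scores(results: Iterable[Tuple[str, dict]]) -> Dict[str, float]:
    """Map each file path to its maintainability index, skipping failures."""
    scores = {}
    for path, stats in results:
        # Assumption: unparseable files carry an 'error' key and no 'mi' value.
        if "error" in stats or "mi" not in stats:
            continue
        scores[path] = float(stats["mi"])
    return scores
```

It could be called as `collect_mi_scores(harvester.results)`, which returns a dictionary that is easier to aggregate than printed output.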
Pylint lacked documentation on programmatically collecting information from a source directory, but examining the entry point of the CLI command led to the following code snippet:
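import io
import json
from contextlib import redirect_stdout

from pylint.lint import Run

def lint_directory(source_dir, disabled_checks):
    # disabled_checks: pylint message names to skip (the original list is not shown).
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        try:
            Run(['--output-format=json', '--disable=' + ','.join(disabled_checks), source_dir])
        except SystemExit:
            # Run() exits the interpreter once linting finishes; swallow that to read the report.
            pass
    lint_results = json.loads(buffer.getvalue())
    print(lint_results)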
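The parsed JSON report is a flat list of messages, so it still needs to be reduced to counts before it is useful for comparing models. A rough sketch of that step, assuming each message is a dictionary with `type` and `symbol` keys (as pylint's JSON reporter emits); the function name `summarise_lint_results` is hypothetical and not part of pylint.

```python
from collections import Counter
from typing import Dict, List


def summarise_lint_results(lint_results: List[dict]) -> Dict[str, Dict[str, int]]:
    """Count pylint messages by category ('type') and by check name ('symbol')."""
    types = Counter(message["type"] for message in lint_results)
    symbols = Counter(message["symbol"] for message in lint_results)
    # most_common() orders by descending count, so frequent problems come first.
    return {
        "types": dict(types.most_common()),
        "symbols": dict(symbols.most_common()),
    }
```

Returning plain dictionaries keeps the summary directly serialisable with `json.dumps`.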
After collating each of the metrics, the response ended up looking like the following JSON payload:
{
  "aggregated_analysis": {
    "complexity": {
      "min": 1.0,
      "max": 5.0,
      "mean": 1.7619047619047619
    },
    "maintainability_index": {
      "min": 42.44120650814055,
      "max": 73.1183133154694,
      "mean": 61.16721652696456
    },
    "code_statistics": {
      "loc": 940,
      "lloc": 521,
      "sloc": 566,
      "comments": 183,
      "multi": 31,
      "blank": 186,
      "single_comments": 157
    },
    "halstead_metrics": {
      "h1": {
        "min": 1.0,
        "max": 15.0,
        "mean": 8.0
      },
      "h2": {
        "min": 3.0,
        "max": 137.0,
        "mean": 52.0
      },
      "N1": 110.0,
      "N2": 215.0,
      "vocabulary": {
        "min": 4.0,
        "max": 152.0,
        "mean": 60.0
      },
      "length": 325.0,
      "calculated_length": {
        "min": 4.754887502163469,
        "max": 1031.03375429972,
        "mean": 374.5962139339611
      },
      "volume": {
        "min": 12.0,
        "max": 2101.89897889864,
        "mean": 748.9542971398513
      },
      "difficulty": {
        "min": 0.6666666666666666,
        "max": 10.510948905109489,
        "mean": 5.309205190592052
      },
      "effort": {
        "min": 8.0,
        "max": 22092.952770905413,
        "mean": 7577.510451793251
      },
      "time": 1262.9184086322082,
      "bugs": 0.7489542971398513
    },
    "linting": {
      "types": {
        "convention": 206,
        "warning": 55,
        "refactor": 12
      },
      "symbols": {
        "invalid-name": 85,
        "line-too-long": 53,
        "trailing-whitespace": 33,
        // ...
      }
    }
  },
  "file_analysis": {
    // The same metrics, but for each file in the package.
  }
}
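The payload above mixes two kinds of aggregate: min/max/mean summaries for scores such as complexity, and plain totals for counts such as `loc`. The collation code itself is not shown here, so the following is only a sketch of how such aggregates could be computed from per-file values; `summarise` and `total` are hypothetical helper names.

```python
from statistics import mean
from typing import Dict, Iterable


def summarise(values: Iterable[float]) -> Dict[str, float]:
    """Reduce per-file (or per-function) scores to a min/max/mean summary."""
    values = [float(v) for v in values]
    if not values:
        raise ValueError("cannot summarise an empty collection")
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def total(per_file_counts: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum count-style metrics (loc, sloc, comments, ...) across files."""
    totals: Dict[str, int] = {}
    for counts in per_file_counts:
        for name, value in counts.items():
            totals[name] = totals.get(name, 0) + value
    return totals
```

For example, `summarise([1, 5, 3])` yields a dictionary with `min`, `max` and `mean` keys in the same shape as the `complexity` entry.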