5% of 666 Python repos had comma typo bugs (including V8, Sentry, Tensorflow, and PyTorch)

#testing #python #webdev #programming

We found that 5% of the 666 Python open source GitHub repositories had the following three comma-related bugs caused by typos:

Too many commas

Typo accidentally adding a comma to end of a value, turning it into a tuple

As far as the Python parser is concerned the parentheses are optional for non-empty tuples. According to the documentation: it is actually the comma which makes a tuple, not the parentheses. The parentheses are optional, except in the empty tuple case.

value = (1)  # evaluates to int
value = 1,  # evaluates to tuple with one element
value = (1,)  # evaluates to tuple with one element
value = 1, 2  # evaluates to tuple with two elements

By accidentally suffixing a comma on value = 1, expect to get a TypeError when the code later attempts to perform integer operations on a variable that evaluated to tuple.

Too few commas

Accidentally missed comma from string in list/tuple/set, resulting in unwanted string concatenation.

Implicit string concatenation that resulted from a typo can change the behaviour of the application. Take for example:

def is_positive(word):
    words = (
        'yes',
        'correct',
        'affirmative'
        'agreed',
    )
    return word in words

is_positive('agreed') is True  # evaluates to False

is_positive('agreed') evaluates to False because a typo resulted in the comma being missed from the end of 'affirmative', resulting in 'affirmative' and 'agreed' being implicitly concatenated to 'affirmativeagreed'.

Accidentally missed comma from 1 element tuple, making it a str instead of a tuple.

Detecting the bugs

We did not go through the repositories line by line. Instead we ran the repositories through our suite of static analysis checks which we run on AWS Lambda, taking a grand total of 90 seconds.

The main difficulty was reducing false positives. Syntactically there is no difference between a missing comma in a list and implicit string concatenation done on purpose. Indeed, when we first wrote the "probably missing a comma" checker we found that 95% of the problems were actually false positives: there are perfectly cromulent reasons a developer would do implicit string concatenation spanning multiple lines:

User agent string
file path
URL
CSV file contents
Very long message
HTML
Sha hash
SQL
JSON

After checking the 666 codebases we understood the key drivers of false positives and added allowances for implicit string concatenation for those types of data. After we gave allowances for these valid reasons to do implicit string concatenation the false positive rate went down to negligible non-annoying level. At this point we were left with a list of 24 repositories that had real bugs that slipped through code review process.

How common?

Overall we detected these bugs in 24 of the 666 repositories - and many of them were big open source projects. We raised the issues and helped fix over the course of a very busy day of managing tickets and PRs. Here's the most interesting ones:

5% of repositories had one of these three bugs, and most of them were "big" well known and well-used projects with many contributors and very robust code review processes. This highlights that for manual processes like code review to detect 100% of bugs 100% of the time then 100% of humans must perform 100% perfectly during code review 100% of the time. See the cognitive dissonance? We do code review because we expect human error when writing the code, but then expect no human error when reviewing the code. There are just some things a human can do well, and there are some things a bot can do better. Noticing commas in the wrong place is one of those things. For example:

Can you see the problem? Line 572 has implicit string concatenation!

It's missing a commas, so should be
'"()<>[]:,;@\\"!#$%&\'-/=?^_{}| ~.a"@example.org', '" "@example.org',

I'm not surprised that was missed by a human!

The impact

Some of the missing commas have little to no impact, but some are rather impactful. Take the Sentry one as an example:

Notice the missing comma in the list:

There is a comma missing between "releases" and "discover" resulting in the two being implicitly concatenated together to form "releasesdiscover".

The impact is the test requesting "/releasesdiscover" which does not exist, and so instead of "/releases" and "/discover" being tested, instead the 404 page is tested. That means for all we know, the organisation switcher on /releases and /discover could be broken. There is some poor dev out there maintaining the organisation switcher relying on this test and using it as a quality gate to give them confidence they have not broken the product but the test is not really working. It's these kind of things that can harm a nights sleep! It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so.