DEV Community: Chun Fei Lung

You should be able to turn off your camera in virtual meetings

Chun Fei Lung — Wed, 29 Sep 2021 20:31:31 +0000

Turn off that camera!

I read and summarise software engineering papers for fun, and today we’re having a look at The fatiguing effects of camera use in virtual meetings: A within-person field experiment (2021) by Shockley et al.

Earlier this year I posted a summary of “Please turn your cameras on: Remote onboarding of software developers during a pandemic”, which included the following suggestion:

Team members should turn their cameras on during video calls, as this makes it easier for new hires to understand the dynamics of the team, and helps them bond and form connections with their team members.

I think this is good advice. But you may want to turn them off again once everyone has settled in…

Virtual meeting fatigue

During 2020 many organisations were forced to transition from office work to remote work in an attempt to reduce the spread of COVID-19. Face-to-face meetings were replaced by video calls and lengthy commutes briefly became a thing of the past.

But workdays didn’t become less exhausting. Workers soon noticed a new kind of tiredness, dubbed “Zoom fatigue” or “virtual meeting fatigue”, that set in especially after a day filled with virtual meetings.

It’s not entirely clear what causes virtual meeting fatigue. While the number of meetings increased during the pandemic, the overall time spent in meetings was actually reduced by 11.5%. This is why some scholars suspected that other properties of virtual meetings, like camera usage, might be the cause of this mysterious fatigue.

Why point the finger at the camera?

The reason why scholars think that camera usage may be the culprit has to do with self-representation; the idea that people want to be viewed positively by others and thus behave in ways that make them look good.

Self-representation plays an important role in social exchanges, in the form of unwritten social cues. Taking care to present yourself in a positive light has practical career benefits, but also comes at a cost: it’s cognitively demanding.

There are several reasons why virtual meetings could be more demanding than face-to-face ones:

Most popular virtual meeting platforms show all participants in a grid layout that gives each participant the feeling that they are constantly being watched by all other participants.
Virtual meeting software also tends to show you your own video image, which makes you more aware of the fact that others can see (and possibly silently judge) you.
Users constantly receive nonverbal cues that are hard to interpret, e.g. because you can’t really tell what someone is looking at. They generally try to compensate for this by sending extra intentional cues (e.g. nodding exaggeratedly), which also requires more cognitive effort.

Self-representation is costly, but it’s costlier for some than for others:

Because women tend to have lower status, are judged more harshly, and are held to higher grooming standards than men, they feel pressured to invest more effort into self-representation than men;
New employees still need to “earn” their reputation and thus have a stronger need to maintain a professional appearance, whereas older employees can kind of do whatever they want because people already know they are qualified.

The authors hypothesise that self-representation leads to fatigue, which might cause employees to perform less effectively. They are interested in two specific indicators:

the ability to voice ideas and
the ability to stay engaged.

Testing the effect of camera usage

The authors of the paper conducted a 4-week study with 103 participants at BroadPath, a US company within the healthcare sector that employs several thousand remote workers throughout the United States. About half of all participants were in managerial roles, although at least some also seem to have more technical roles in IT and software development.

Half of the participants were asked to keep their camera off for the first two weeks and to turn it on for the last two weeks. Conversely, the other half made sure it was on for the first two weeks and kept it off for the last two. All participants also completed a daily survey that asked them how they felt about their workday.

Please turn your cameras off?

The authors found that camera usage is indeed positively related to fatigue. The assumptions that self-representation is more costly for women and new employees also seem to be correct.

This cannot be said for the hours spent in virtual meetings and the number of virtual meetings, which aren’t correlated with fatigue.

With regard to the hypothesised effects of fatigue, the results suggest that camera use has a negative effect on voice and engagement. However, the measured effect is indirect, so this result should be taken with a grain of salt.

Combined, the results suggest that camera usage is particularly fatiguing for women and newer employees and disproportionately hurts their ability to participate in meetings effectively.

Does this mean that cameras should always be turned off in virtual meetings? Not necessarily, but it’s good to at least give people the option to turn off their camera. It would also be more environmentally friendly.

This study also didn’t look at the effect that camera usage has on other participants, which you might want to consider as well.

A comparison of libraries for named entity recognition

Chun Fei Lung — Mon, 27 Sep 2021 10:52:10 +0000

What’s your favourite thing about SpaCy? Mine’s SpaCy.

I read and summarise software engineering papers for fun, and today we’re having a look at A replicable comparison study of NER software: StanfordNLP, NLTK, OpenNLP, SpaCy, GATE (2019) by Schmitt et al.

Natural language processing (NLP) is a subfield of artificial intelligence that is dedicated to the understanding, processing, and generation of natural languages, like French and English.

Named entity recognition (NER) is a subtask of NLP that aims to identify entities (persons, locations) in texts. This can be used for things like machine translation, automated question answering, and automated text summarisation.

Why it matters

If you need NER, there’s no need to implement it yourself. There are several popular libraries that can do this for you nowadays. Five of these libraries, Stanford CoreNLP, NLTK, OpenNLP, SpaCy, and GATE, were already mentioned in the title.

Which library is right for you depends on various criteria, like its performance, cost, documentation, license, and the programming language in which it is implemented.

Many of these libraries have been evaluated in comparison studies, but sadly not in a way that makes it easy to compare findings.

How the study was conducted

This paper describes a comparison between the five aforementioned NER libraries, in a sufficiently clear and complete way, so that its results can be replicated.

The process looks roughly like this:

Selection of two corpora that are not domain-specific, freely available, and in English: the Groningen Meaning Bank (GMB) and the CoNLL 2003 corpus.
Selection of five NER libraries that are free and open-source software, well-documented, available for Linux, and can recognise at least three types of entities: persons, organisations, and locations.
Comparison of each NER library’s generated NER annotations with annotations in the “gold data”, which contains the annotations that we’d expect. This is done by computing the precision, recall, and F-score for each library.
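To make the last step a bit more concrete, here is a minimal sketch of how precision, recall, and F-score can be computed from two sets of annotations. Representing an annotation as a (start, end, type) tuple, the function name score, and the sample data are all assumptions made for this example; this is not the paper's evaluation code.

```python
def score(gold, predicted):
    """Return (precision, recall, f_score) for two sets of annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical annotations: (start offset, end offset, entity type).
gold = {
    (0, 5, "PERSON"),
    (10, 16, "LOCATION"),
    (20, 29, "ORGANISATION"),
    (35, 40, "PERSON"),
}
predicted = {
    (0, 5, "PERSON"),
    (10, 16, "LOCATION"),
    (20, 29, "PERSON"),  # right span, wrong entity type
}

precision, recall, f_score = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f_score={f_score:.2f}")
# precision=0.67 recall=0.50 f_score=0.57
```

In this sketch an annotation only counts as correct when both the span and the entity type match exactly. Real evaluations sometimes use more lenient matching, so scores from different setups are not always comparable.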

What discoveries were made

The table below shows the results of the comparison. Don’t worry too much about its size and all the numbers: I’ve included a hangover-proof summary below the table.

Library	Entity	CoNLL 2003 precision	CoNLL 2003 recall	CoNLL 2003 F-score	GMB precision	GMB recall	GMB F-score
Stanford NLP	Location	91.30	88.73	90.00	83.10	63.64	72.08
	Organisation	86.32	80.92	83.53	71.40	47.42	56.99
	Person	92.72	82.68	87.41	78.59	84.70	81.53
	Overall	90.06	73.67	81.05	79.81	63.74	70.88
NLTK	Location	52.47	65.47	58.26	77.13	77.10	77.12
	Organisation	36.20	24.80	29.44	42.06	35.54	38.53
	Person	61.09	66.11	63.50	38.07	55.87	45.28
	Overall	51.78	45.56	48.47	60.96	63.91	62.40
GATE	Location	59.63	78.63	67.82	79.03	48.16	59.85
	Organisation	50.58	21.29	29.96	45.08	37.68	41.05
	Person	69.53	62.67	65.92	46.53	53.70	49.86
	Overall	61.48	47.44	53.55	61.72	46.78	53.22
OpenNLP	Location	76.54	52.22	62.08	84.34	45.84	59.40
	Organisation	38.06	14.87	21.39	59.27	30.64	40.39
	Person	83.94	37.17	51.52	62.34	41.98	50.17
	Overall	68.68	30.44	42.18	37.35	41.71	39.41
SpaCy	Location	73.38	75.36	74.36	77.04	56.64	65.28
	Organisation	40.95	36.24	38.45	41.20	36.50	38.70
	Person	66.89	56.22	61.09	67.41	69.14	68.27
	Overall	60.94	49.01	54.33	66.15	54.32	59.66

Stanford NLP’s library is the only one that has (somewhat) high scores and blows the other libraries out of the water. The other four libraries have a roughly similar level of performance.

Note that Stanford NLP’s library performs especially well on the CoNLL 2003 dataset. This is because it comes with a classifier that was partially trained on CoNLL 2003! The scores for GMB are therefore more likely to be representative for real-world texts.

The results for Stanford NLP are similar to those from other studies. However, the accuracy for three of the other libraries (NLTK, GATE, and OpenNLP) (*) may differ by as much as 66% from the values reported in existing studies. It is not clear what causes such huge discrepancies.

(*) Apparently there weren’t any studies that evaluated SpaCy’s performance

How do you hide low-quality tweets from Twitter searches?

Chun Fei Lung — Mon, 20 Sep 2021 15:02:42 +0000

I created my current Twitter account in 2009. Back then, the service was still relatively new and no one really knew what to use it for. Consequently nearly all of the “content” produced by individuals was crap: most people probably used it to share dumb status updates about literally everything, as if they were trying to implement some sort of digital real-life version of event sourcing.

Things have improved a lot since then. There’s a lot more interesting content on Twitter, especially for developers… provided that you can find it.

You can follow individual accounts or lists that have been created by other users of course. Or you could subscribe to certain topics of interest.

Tweets about specific things made by casual users (whose tweets have very few likes and retweets) can be found using Twitter’s search functionality. It works for the most part. But it sorely lacks an easy way to filter out (what I think are) uninteresting tweets.

Because everyone probably has their own preferences and definitions of “good” and “bad” tweets, here’s what I mean by uninteresting tweets.

Tweets that only show up in search results because they contain an obscene number of hashtags that aren’t relevant at all:
[Embedded tweet 1438971371963432961]

As far as I know, there’s no easy way to filter out tweets that contain more hashtags than actual content.

Keywords don’t have to be hashtags however. Most people who regularly search for tweets about a popular programming language will likely have seen something like this:
[Embedded tweet 1438971177754578945]

It gets worse when a programming language uses a fairly generic name that has other more common meanings, like PHP.

I wish there was an easy way to filter out accounts from the Philippines (especially those with Korean avatars 😅), as its currency is also abbreviated using PHP:
[Embedded tweet 1438950117185228800]

While you can filter by tweet location, it doesn’t seem possible to exclude locations. Also, very few tweets actually have location data.

Then there are also very personal tweets about partial hospitalisation programmes (also PHP) for patients with mental illnesses. I’m not providing an example of such a tweet here for reasons that should be fairly obvious…

PHP is far from the only language or concept that has this problem. java, go, nlp, and (to a lesser extent) python also have other, more common uses.

How I (try to) filter out bad tweets

I mostly use Twitter’s search for personal (rather than commercial or professional) reasons. I therefore simply “subscribe” to search queries using Twitter’s own freely available TweetDeck web app.

My queries usually look like this:

I start with the keywords that I’m interested in.
I append lang:en so that most of the search results will be in English.
Virtually all advertisements will include links, which I sometimes (but not always) remove from my results using -filter:links.
Because there are no easy ways to filter out spam, I then add a list of words that I want to exclude:

Keyword	Why
hxxps	(Automated?) tweets about security vulnerabilities on specific URLs
threat	(Automated?) tweets about security vulnerabilities on specific URLs
hiring	Recruitment ads
jobs	Recruitment ads
remote	Recruitment ads
talent	Recruitment ads
talents	Recruitment ads
vacancies	Recruitment ads
vacancy	Recruitment ads
retweet	Promotional tweet
rt	Promotional tweet
blockchain	Keyword stuffing
coinbase	Keyword stuffing
crypto	Keyword stuffing
cryptocurrency	Keyword stuffing
digitalmarketing	Keyword stuffing
iot	Keyword stuffing
ml	Keyword stuffing
nft	Keyword stuffing
pytorch	Keyword stuffing
anatomy	Homework/thesis services
assignment	Homework/thesis services
biology	Homework/thesis services
chemistry	Homework/thesis services
course	Homework/thesis services
essay	Homework/thesis services
essayhelp	Homework/thesis services
essaypay	Homework/thesis services
essays	Homework/thesis services
essaysdue	Homework/thesis services
exam	Homework/thesis services
exams	Homework/thesis services
grade	Homework/thesis services
grades	Homework/thesis services
homework	Homework/thesis services
paper	Homework/thesis services
codingpics	Low-quality tweets
meme	Low-quality tweets
programmingjoke	Low-quality tweets
programmingjokes	Low-quality tweets
programmingmemes	Low-quality tweets
doctor	Is likely about hospitalisation
hospital	Is likely about hospitalisation
album	People trying to sell stuff
buy	People trying to sell stuff
buyer	People trying to sell stuff
cost	People trying to sell stuff
costs	People trying to sell stuff
currency	People trying to sell stuff
deal	People trying to sell stuff
discounted	People trying to sell stuff
dm	People trying to sell stuff
dropship	People trying to sell stuff
dropshipping	People trying to sell stuff
fashion	People trying to sell stuff
fee	People trying to sell stuff
free	People trying to sell stuff
gcash	People trying to sell stuff
gift	People trying to sell stuff
items	People trying to sell stuff
kpop	People trying to sell stuff
payment	People trying to sell stuff
paypal	People trying to sell stuff
peso	People trying to sell stuff
pesos	People trying to sell stuff
pm	People trying to sell stuff
price	People trying to sell stuff
prices	People trying to sell stuff
rewards	People trying to sell stuff
sale	People trying to sell stuff
sales	People trying to sell stuff
sell	People trying to sell stuff
shop	People trying to sell stuff
sold	People trying to sell stuff
spend	People trying to sell stuff
unlock	People trying to sell stuff
usd	People trying to sell stuff
wts	People trying to sell stuff
"how much"	People trying to buy stuff
wtb	People trying to buy stuff

As you can see the list is pretty long. This means that there are inevitably going to be interesting tweets that I never see because they are excluded by my queries.
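Typing such a query by hand gets tedious, so here is a minimal sketch of how it could be assembled from a keyword list. The function build_query and its parameters are made up for this example; only the operators themselves (lang:en, -filter:links, and a leading minus to exclude a word or quoted phrase) come from the steps above.

```python
def build_query(keywords, excluded, english_only=True, hide_links=False):
    """Build a Twitter search query string from keywords and exclusions."""
    parts = list(keywords)
    if english_only:
        parts.append("lang:en")
    if hide_links:
        parts.append("-filter:links")
    for word in excluded:
        # Phrases with spaces are quoted so that they are excluded as a whole.
        parts.append(f'-"{word}"' if " " in word else f"-{word}")
    return " ".join(parts)


query = build_query(["php"], ["hiring", "peso", "wts", "how much"], hide_links=True)
print(query)
# php lang:en -filter:links -hiring -peso -wts -"how much"
```

The resulting string can be pasted into a TweetDeck search column. Keeping the exclusions in a plain list also makes it easier to review them now and then.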

What about you?

This approach mostly works for me, but I have the feeling there are much better ways to do this.

How do you separate the (t)wheat from the chaff? Do you use extensions? Alternative Twitter clients? Share your strategies in the comments!

Can Java and C++ devs write better Python code than Python devs?

Chun Fei Lung — Tue, 14 Sep 2021 10:37:35 +0000

Each person has their own, unique style

I read and summarise software engineering papers for fun, and today we’re having a look at Do Java developers write better Python? Studying off-language code quality on GitHub (2018) by Horschig, Mattis, and Hirschfeld.

Things can often be coded in different ways. For instance, you can use different algorithms, use fewer or more lines of code, implement functionality using different libraries or frameworks, or use a certain code style.

Why it matters

Most programming language communities have coding conventions. These conventions ensure that code written by different people looks similar. This can make code more readable, less prone to errors, and more maintainable.

Spend enough time with a language, and you will eventually be able to apply all of a language’s conventions effortlessly.

However, each language has its own coding conventions (*). So what happens when you switch to a different language? You might write code that’s less maintainable or more prone to errors. Or maybe you’re actually able to write better code, because your new language has fewer (or worse) conventions.

(*) And in some cases there are actually multiple sets of conventions!

Well, let’s find out what happens!

How the study was conducted

A very large part of today’s open source development happens on GitHub. GitHub provides an API that can be used to retrieve data about its platform, but there is (or was) also a project called GHTorrent that mirrored the public parts of GitHub’s repositories, user profiles, commits, issues, and other artifacts.

The researchers used the latter to look for developers who have made a large number of contributions in their primary language, and a much smaller number in some secondary language. We can treat these developers as the experimental group. We also need a control group; that one consists of users that only contributed using one programming language.

Then, the researchers mined the dataset for projects that were edited by developers using their secondary language.

For this study, they looked at Python projects that were edited by Java and C++ developers. These are compared to Python projects that were only edited by Python developers.

To study the effect of language switching, all projects were analysed using Pylint, which can find various types of issues in Python code:

fatal errors that result in code that doesn’t work at all;
errors that cause runtime errors when the code is executed;
warnings for code that is error prone or has severe style issues;
refactoring hints for complex or messy code; and
violations of coding conventions.
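Pylint encodes these five categories in the first letter of each message ID (F, E, W, R, and C). The sketch below tallies messages per category from Pylint's default text output. The sample report lines and the file name example.py are made up for illustration, and this is not the tooling used by the researchers.

```python
import re
from collections import Counter

CATEGORIES = {
    "F": "fatal",
    "E": "error",
    "W": "warning",
    "R": "refactor",
    "C": "convention",
}

# Matches message IDs such as "C0301" or "W0612" in Pylint's text output.
MESSAGE_ID = re.compile(r":\s([FEWRC])\d{4}:")


def tally(report_lines):
    """Count Pylint messages per category."""
    counts = Counter()
    for line in report_lines:
        match = MESSAGE_ID.search(line)
        if match:
            counts[CATEGORIES[match.group(1)]] += 1
    return counts


# Hypothetical report lines in Pylint's default output format.
sample_report = [
    "example.py:3:0: C0301: Line too long (120/100) (line-too-long)",
    "example.py:7:0: W0301: Unnecessary semicolon (unnecessary-semicolon)",
    "example.py:9:4: W0612: Unused variable 'total' (unused-variable)",
    "example.py:12:11: E0602: Undefined variable 'totl' (undefined-variable)",
]

print(dict(tally(sample_report)))
# {'convention': 1, 'warning': 2, 'error': 1}
```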

What discoveries were made

The analysis ended up including data for 84 Java developers, 91 C++ developers, and 100 Python developers.

The table below shows how often each issue type occurs in the Java and C++ groups relative to the Python group (lower is better; values below 1 mean fewer issues than Python developers):

Code quality issue	Java group	C++ group
Line too long	3.59	1.44
Invalid name	1.43	1.52
Wrong import order	—	1.83
Ungrouped imports	0.16	0.14
Bad whitespace	—	0.38
Unnecessary semicolon	4.42	20.62
Redefining built-in names	0.57	—
Bad indentation	3.39	3.28
Redefining outer name	1.68	2.21
Undefined loop variable	—	3.28
Unused import	0.63	0.81
Unused variable	1.56	2.25
Complex method/function	0.84	1.48
Too many public methods	0.26	0.46
Too few public methods	0.34	0.58
No else return	—	1.52
Undefined variable	—	1.55
Assignment from no return	28.27	—

What might be surprising is that Java/C++ developers sometimes write better code than Python developers. The researchers provide the following explanations for each individual result:

Line too long: Python lines should not be longer than 80 characters. C++ and Java developers tend to write lines that are longer than that.
Invalid name: Class names in Python should be CamelCased, while method and field names should be snake_cased. Programmers from the other two languages regularly violate these naming conventions.
Wrong import order: Module imports should be ordered such that standard libraries are imported first, followed by third-party libraries, and finally local imports. C++ developers violate this convention a lot more often, but Java developers seem to do the same thing as Python developers.
Ungrouped imports: Multiple imports from the same package should be grouped together. Java and C++ developers do this way more often than Python developers.
Bad whitespace: C++ and Java developers are less likely to miss or add too much whitespace around operators, brackets, and blocks than Python developers.
Unnecessary semicolon: Python doesn’t need semicolons at the end of lines, but Java and (especially) C++ developers tend to add them anyway.
Redefining built-in names: Developers may accidentally use variable names that are already taken by built-ins (e.g. input and str). This may cause unexpected or confusing errors. Java developers do this less often than Python developers, despite being less familiar with the language. This is probably because they use IDEs (which would point out such mistakes) rather than simple text editors.
Bad indentation: Whitespace is important in Python, so it helps if tabs and spaces are used consistently. Java and C++ developers aren’t as good at this as Python developers.
Redefining outer name: Shadowing names from outer scopes is discouraged in Python, but both Java and C++ developers do this more often than Python developers.
Undefined loop variable: Using loop variables outside the loop can be useful in some situations, but only when the loop was actually executed. C++ developers are 3 times more likely to write code with potentially undefined variables.
Unused import: Both Java and C++ developers are less likely to have unused imports in their files.
Unused variable: On the other hand, Java and C++ developers are more likely to forget about previously defined variables.
Complex method/function: C++ developers are more likely to write methods or functions with a cyclomatic complexity above 10.
Too many public methods: Java and C++ developers tend to make smaller classes and thus don’t run into this issue as often.
Too few public methods: The opposite, where classes are merely used as glorified data structures without any behaviour of their own, also occurs less often with Java and C++ developers.
No else return: Having an else branch after an if block that already ends with a return is considered bad style. C++ developers use this more often than Python developers.
Undefined variable: Undefined variables are often not reachable right now, but might become reachable when the code is modified in the future and thus cause errors later. C++ developers are more likely to write code with undefined variables.
Assignment from no return: Java developers are more likely to use “void” functions in assignments or as expressions, possibly because these would have been checked in Java during compilation – but not in Python.
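A few of these issues are easier to recognise in code. The snippet below is a small illustration with made-up function names; the comments show the Pylint message that I would expect for each flagged line, and every flagged function is followed by a more idiomatic variant.

```python
def describe_with_else(number):
    if number < 0:
        return "negative"
    else:  # Pylint: no-else-return
        return "non-negative"


def describe(number):
    if number < 0:
        return "negative"
    return "non-negative"


def last_item_unsafe(items):
    for item in items:
        pass
    return item  # Pylint: undefined-loop-variable (fails if items is empty)


def last_item(items):
    last = None
    for item in items:
        last = item
    return last


def log(message):
    print(message)


# Pylint: unnecessary-semicolon and assignment-from-no-return.
# The code runs, but result is always None.
result = log("done");
```

Note that all of this is valid Python. The flagged variants only cause trouble in specific situations, like calling last_item_unsafe with an empty list.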

Gender diversity makes teams communicate more effectively

Chun Fei Lung — Thu, 09 Sep 2021 18:03:30 +0000

Software teams could use more women, but they aren’t lining up yet

I read and summarise software engineering papers for fun, and today we’re having a look at Gender diversity and women in software teams: How do they affect community smells? (2019) by Catolino et al.

Software development teams in western countries largely consist of men or are even male-only. Hiring policies that favour women over men can help restore the balance somewhat, but are also controversial. Do these policies actually make economic sense? Spoiler alert: yes, they do.

Why it matters

The key to good software development is good communication and collaboration between team members. Women tend to be better at these things, but most software teams don’t have even a single female member.

Since communication is such an important part of software development, one might expect that teams with women are better equipped to avoid so-called “community smells” and thus outperform male-only teams.

How the study was conducted

The authors compared data about communication flows from 20 male-only open source projects with 20 open source projects from teams with at least one female member.

What discoveries were made

Development teams without women indeed suffer from more community smells than teams that have any positive number of women – even if it’s just one. Given what we already know about social group dynamics this is hardly a surprise.

There are many types of community smell, but the authors chose to further analyse four types that are likely to be affected by the presence of women:

The community consists of multiple organisational siloes that don’t communicate much with each other – and when they do, communication is handled by just one or two group members;
Community members are overwhelmed by a black cloud of information due to a lack of structured communication;
There are lone wolves who work on their own and do not collaborate or communicate with others;
One member handles all communication between two or more subcommunities, which results in radio silence.

The presence of women clearly affects whether the black clouds and radio silence smells (which are related to quality of communication) occur.

It’s not as clear for the organisational silo and lone wolf smells (which have to do with organisational structure), as the authors only find a partial relation between gender diversity and the two smells.

The results confirm that gender diversity in teams is a good thing to strive for and underline the importance of team composition as a way to combat community smells.

Refactoring does not solve all problems… right away

Chun Fei Lung — Mon, 06 Sep 2021 19:10:43 +0000

Is it an improvement? I guess wheel never know with these uphill battles

I read and summarise software engineering papers for fun, and today we’re having a look at Old habits die hard: Why refactoring for understandability does not give immediate benefits (2015) by Ammerlaan, Veninga, and Zaidman.

Whenever shortcuts are taken during the development of a software system, it accumulates technical debt.

This debt makes it harder to understand and make changes to the system, so the development speed for a system with a lot of technical debt will eventually come to a grinding halt.

Why it matters

Refactoring is a process where the structure of code is improved without changing the functionality of the system. Many in the software development community argue that well-structured code is easier to understand, and thus easier to modify and less prone to bugs.

Unfortunately there is little empirical evidence that refactoring actually has beneficial effects on developer productivity. This study tries to shed some light on the matter.

How the study was conducted

A comparative experiment was conducted at Exact, a software company that produces business software with development teams that are distributed over multiple continents.

The study consisted of 5 different experiments and included 30 participants (all developers) from 11 different teams in two countries (Malaysia and the Netherlands).

In each experiment, a developer was asked to perform a small coding task on components from a codebase with 2.7 million lines of code: they either had to fix a small bug or make a small change in functionality. Participants in the experimental group were given a refactored version of the code, while those in the control group were given the original code.

The experiment includes three types of refactorings:

small Rename field or variable, and Extract function refactorings;
medium Extract class and Adapter pattern refactorings, accompanied by one or more unit tests;
large refactorings to divide responsibilities, also accompanied by unit tests.
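For readers who haven't come across the Extract function refactoring before, here is a minimal sketch of what a small refactoring of that kind might look like. The invoice example and all names in it are made up; the code used in the experiments is not public as far as I know.

```python
def invoice_total_original(lines, vat_percent):
    # One linear function: everything happens in a single place.
    total_cents = 0
    for quantity, unit_price_cents in lines:
        total_cents += quantity * unit_price_cents
    total_cents += total_cents * vat_percent // 100
    return total_cents


def subtotal(lines):
    return sum(quantity * unit_price_cents for quantity, unit_price_cents in lines)


def add_vat(amount_cents, vat_percent):
    return amount_cents + amount_cents * vat_percent // 100


def invoice_total_refactored(lines, vat_percent):
    # Same behaviour, but the reader now has to follow arguments and
    # return values through two extracted helper functions.
    return add_vat(subtotal(lines), vat_percent)


lines = [(2, 1000), (1, 550)]
print(invoice_total_original(lines, 21), invoice_total_refactored(lines, 21))
# 3085 3085
```

Both versions return the same result. The refactored version has smaller, individually testable parts, but it also asks the reader to jump between functions.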

What discoveries were made

Results were mixed.

Results

In the first (small) experiment some helper methods were extracted from the code. Surprisingly, developers who saw the refactored version needed more time to make the requested change, not less.

The second (small) experiment had a similar setup, but was (apparently) easier to complete. This means that the productivity measurements for this experiment are less noisy. In this case, about 75% of the participants in the experimental group finished before 25% of the developers with the original code.

The third (small) experiment again used similar refactorings and also resulted in lower finishing times for those who saw the original code without refactorings. It’s possible that the flow of method arguments and return values between multiple smaller methods was harder to understand than a linear flow in a large method.

In the fourth (medium) experiment participants were asked to fix a bug. It appears that those in the experimental group had slightly lower finishing times than those in the control group. Another notable finding is that developers who were quite experienced in unit testing performed better than other participants.

In the fifth (large) experiment, developers who saw the original code once again did much better than developers who had to work with the refactored code, presumably because it takes more time to understand the relations between classes that emerge from a large refactoring. However, the quality of solutions also differed: whereas most developers in the control group fixed the bug using a “quick fix”, those in the experimental group managed to fix the root cause.

Discussion

The experimental results show that most of the time the original, unrefactored code was “better” for productivity. However, when the original and refactored code were shown to participants side-by-side, most preferred the refactored code.

The authors argue that this discrepancy can be explained by the habits of developers, who are used to reading long, procedural methods and thus simply need more time to get used to dealing with multiple classes and methods.

However, even if refactorings lead to a (possibly temporary) decrease in understandability, the possible increases in maintainability and testability could still make the refactoring worthwhile.

Does it matter if you write tests before or after you write your code?

Chun Fei Lung — Mon, 30 Aug 2021 10:34:53 +0000

Adding features during refactoring is counterproductive! It’s a fallacy that may blow up in your face.

I read and summarise software engineering papers for fun, and today we’re having a look at A dissection of the test-driven development process: Does it really matter to test-first or to test-last? (2017) by Fucci and others.

Test-driven development is a development practice that involves short, iterative cycles in which the programmer writes tests before adding new functionality or refactoring existing code. It’s commonly believed that writing tests first leads to higher-quality code and improved productivity. This study puts that belief to the test.

Why it matters

Test-driven development (TDD) has multiple characteristics that set it apart from “traditional” programming, but the “tests first, code later” aspect tends to be the thing that most people talk about (and remember).

There’s more to it than that however, so let’s talk definitions first.

TDD is a programming technique which involves cyclic, iterative implementation of new features.

In each cycle a programmer carries out the following tasks:

Writing unit tests for the desired behaviour;
Writing code to make those tests pass;
Strictly refactoring code to improve its design, i.e. without modifying its behaviour (*).

(*) Modifying behaviour while refactoring could nullify or even reverse the benefits of refactoring.

A cycle is finished when all new and existing unit tests pass, and the programmer is content with the program’s design. Ideally, all cycles are short and roughly the same length: around 5 minutes long and never longer than 10 minutes.
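To make the cycle a bit more tangible, here is a minimal sketch of what one iteration could look like in Python. The leap year task and all names in it are my own illustration and do not come from the study.

```python
import unittest


def leap_year(year: int) -> bool:
    """Step 2: the simplest code that makes the tests below pass."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


class LeapYearTest(unittest.TestCase):
    """Step 1: in a test-first cycle these tests are written before leap_year exists."""

    def test_typical_leap_year(self):
        self.assertTrue(leap_year(2024))

    def test_century_is_not_a_leap_year(self):
        self.assertFalse(leap_year(1900))

    def test_fourth_century_is_a_leap_year(self):
        self.assertTrue(leap_year(2000))


def cycle_is_green() -> bool:
    """Step 3: after each refactoring, rerun the unchanged tests.

    Refactoring may not change behaviour, so the tests must keep passing.
    """
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LeapYearTest)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()
```

A test-last developer would write the same function and the same tests, just in the opposite order.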

TDD advocates claim that adherence to these practices will lead to improved quality and productivity.

In a nutshell, TDD has four characteristics:

The sequence in which tests are written; before or after coding
The granularity (length) of cycles
The uniformity of cycle lengths
The amount of effort spent on refactoring

How do these four characteristics affect the external quality (**) of the produced software and the developer’s productivity?

(**) “Does the software do what it’s supposed to do?”

How the study was conducted

The authors held several five-day workshops about unit testing and TDD at two Nordic companies.

During the workshop, participants were asked to individually implement three tasks, of which two were greenfield and one was brownfield. Some participants made use of a test-first sequence, while others used a test-last sequence.

TDD dictates that development is done iteratively using many short cycles. To help participants work on their tasks in small steps, the researchers refined each task into clearly delineated stories and sub-stories. Tasks were then “graded” using acceptance test suites for each user story in order to determine the quality of submitted solutions.

All participants made use of a special Eclipse IDE that collected information about actions that are performed in it, like:

Code modification
Test modification
Code compilation
Test execution

This information is used to determine how participants applied TDD.

Combining timestamps from the IDE logs with the pass rate of the acceptance test suite allows one to calculate the productivity of each developer.
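The summary above doesn’t spell out the exact formula, so the following is only a sketch under an assumption: productivity is taken to be the percentage of passing acceptance assertions divided by the minutes between the first and last logged IDE event. The function and parameter names are hypothetical.

```python
from datetime import datetime


def productivity(first_event: datetime, last_event: datetime,
                 passed_assertions: int, total_assertions: int) -> float:
    """Assumed metric: percentage of acceptance assertions passed per minute worked."""
    minutes = (last_event - first_event).total_seconds() / 60
    if minutes <= 0 or total_assertions <= 0:
        raise ValueError("need a positive duration and at least one assertion")
    pass_rate = 100 * passed_assertions / total_assertions
    return pass_rate / minutes
```

For example, a developer who passes 30 out of 40 assertions in one hour would score 75 / 60 = 1.25 under this assumption.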

What discoveries were made

You probably already guessed by now that Betteridge’s law of headlines strikes again, but in what way?

Correlation

Granularity and uniformity are positively correlated, i.e. developers who use shorter cycles are able to keep them consistently short, while those who use larger cycles tend to have cycles of varying lengths. Both factors also appear to affect external quality: smaller cycles and cycles that have consistent lengths are associated with better external quality.

A small, but statistically significant correlation exists between granularity and refactoring effort: developers who use coarser cycles spend less time on refactoring.
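If you want to get a feel for granularity and uniformity, the sketch below computes both from a list of cycle durations. It assumes that granularity is the median cycle duration and uniformity the median absolute deviation from that median. These definitions are my assumption for illustration, so check the original article before relying on them.

```python
from statistics import median
from typing import Sequence


def granularity(cycle_minutes: Sequence[float]) -> float:
    """Assumed definition: the median cycle duration (lower means finer cycles)."""
    return median(cycle_minutes)


def uniformity(cycle_minutes: Sequence[float]) -> float:
    """Assumed definition: median absolute deviation (lower means more uniform)."""
    centre = median(cycle_minutes)
    return median(abs(duration - centre) for duration in cycle_minutes)


# Made-up cycle durations in minutes for two hypothetical developers.
FINE_CYCLES = [4, 5, 5, 6, 5]
COARSE_CYCLES = [10, 25, 12, 40, 18]
```

With these made-up numbers the developer with short cycles is also the one with the most consistent cycles, which is the pattern described above.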

Regression

To better understand the relation between TDD’s four characteristic factors and the two outcome variables (quality and productivity), the authors constructed two models.

The basic idea here is that each model should predict one of the outcome variables using information about the code-test sequence, cycle granularity and uniformity, and refactoring effort.

A good model is also simple, and should not include superfluous input variables. The process of trimming these variables, feature selection, is described in the original article.

I’ll simply list the most noteworthy discoveries here:

Code-test sequence is not part of either model, which suggests that – at least for external quality and developer productivity – it does not matter whether you write your tests before or after your “real” code (***);
Cycle granularity and uniformity, and refactoring effort are all negatively correlated with both quality and productivity.
The negative correlation between refactoring effort and the two outcome variables is likely due to floss refactoring (****).

(***) This study did not look at the effects on internal quality (i.e. maintainability), which is also pretty important.

(****) This is a form of refactoring that also includes other activities, like implementation of new features. These new features might not be covered by tests and are therefore more likely to introduce regression bugs.

The hidden costs of résumé-driven development

Chun Fei Lung — Thu, 26 Aug 2021 16:44:01 +0000

The war is on hold right now, we’ll resume it later

I read and summarise software engineering papers for fun, and today we’re having a look at Résumé-driven development: A definition and empirical characterization (2021) by Fritzsch and others.

You’ve probably already heard of the agile manifesto, but did you know there’s also a manifesto for resume-driven development?

Specific technologies over working solutions

Hiring buzzwords over proven track records

Creative job titles over technical experience

Reacting to trends over more pragmatic options

Fortunately this one’s just satire, but resume-driven development does exist. The term describes a phenomenon where developers choose tech stacks, architectures, methodologies, and protocols not because they are the best tools for the job, but because they look good on a resume.

Why it matters

Hiring is a process that involves two types of stakeholders: employers and applicants. In an ideal world, the employer lists the skills they need in a job advertisement, while job applicants promote the skills that they have in their resumes.

But we do not live in an ideal world: The hiring process at tech companies is often flawed. Moreover, both employers and applicants tend to oversell the “cool” skills in their advertisements and resumes.

Such overselling may lead to (costly) disappointment for both parties.

How the study was conducted

The researchers conducted an exploratory survey to gain insight from both the hiring and applicant perspectives. Their survey received 591 responses, of which 130 answered for the hiring perspective and 558 for the applicant perspective.

About 90% of the participants stated Germany as their country of residence, while about 7% were from some other European country. You may want to keep this in mind when you see the results.

What discoveries were made

The results tell us there’s some sort of arms race going on between employers and applicants, without actually telling us there’s an arms race going on between employers and applicants.

Employers

Employers generally value both broad (73%) and deep (66%) knowledge and experience in technologies. When they have to decide between the two, 42% prefer applicants with broad knowledge, while only 22% would choose an applicant with specialist knowledge.

When asked whether they believe knowledge and experience in latest/trending technologies or established technologies are important, 85% indicated that the latter are important, while only 59% valued latest/trending technologies. About 39% of respondents prefer applicants who are experienced with established technologies, whereas only 20% prefer applicants who know the latest/trending stuff.

A majority (59%) of the respondents in this group admits that technology trends and hypes affect what they advertise in their job offerings. An even larger majority (71%) believes that applicants like working with the latest/trending technologies.

In other words: employers say they want applicants who have experience with latest/trending technologies, but what they actually want are applicants who know established technologies and who know (or can easily learn) a broad set of different technologies.

Applicants

The employers’ belief that applicants enjoy using latest/trending technologies in their work is largely correct. About 73% of them do, while 18% find it inconvenient or stressful to constantly learn new technologies.

A large majority (82%) is convinced that using latest/trending technologies in their work makes them more attractive for potential future employers.

Curiously enough, only 42% of applicant respondents believe that using these novel technologies actually makes them better developers. Moreover, only half (49%) of them had mostly positive experiences with latest/trending technologies. About 20% reported that they once used latest/trending technologies for a project even though they weren’t ideal for the use case.

Fortunately, in most cases developers tend to select technologies based on a project’s system requirements and the skills that are already available among its developers.

This suggests that while developers believe that latest/trending technologies are very important, this often does not affect actual selection of technologies.

Consequences

Resume-driven development may have several consequences:

If developers choose to use latest/trending technologies, the technological diversity in their company increases. This increases complexity and may negatively impact maintainability and reliability.
False expectations and disappointment about the job may lead to frustrated developers when the actual work turns out to involve different technologies than what was promised.
A strong focus on technologies in hiring criteria may lead to neglect of other (more) important skills and traits, like soft skills, self-motivation and willingness to learn.

Facebook’s attempts to improve mutation testing

Chun Fei Lung — Wed, 25 Aug 2021 17:45:11 +0000

White-box testing is possible when you have access to internal structures

I read and summarise software engineering papers for fun, and today we’re having a look at What it would take to use mutation testing in industry – A study at Facebook (2021) by Beller and others.

Mutation testing is a way to determine the quality of your test suite. It works by generating a large number of changed versions of the code, which are called mutants. Examples of changes include deletions of method calls, disabling if conditions, and replacing magic constants.

If the test suite is good enough, it should be able to “kill” these mutants by having at least one previously succeeding test fail.

Why it matters

The result of mutation tests is a so-called mutation score, which is the ratio of mutants that a test suite manages to kill. Many researchers and developers argue that mutation scores are superior to traditional code coverage, as they’re actually based on a program’s behaviour.
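Here is a small, hand-rolled sketch of the idea. Real mutation testing tools generate mutants automatically, while the three mutants below were written by hand, and every name in this example is hypothetical.

```python
from typing import Callable, Sequence

Predicate = Callable[[int], bool]


def is_adult(age: int) -> bool:
    """The original code under test."""
    return age >= 18


# Three hand-written mutants, each with one small change.
def mutant_boundary(age: int) -> bool:
    return age > 18      # ">=" replaced with ">"


def mutant_negated(age: int) -> bool:
    return age < 18      # condition inverted


def mutant_constant(age: int) -> bool:
    return age >= 21     # magic constant replaced


MUTANTS = [mutant_boundary, mutant_negated, mutant_constant]


def weak_suite(fn: Predicate) -> bool:
    """Passes when fn behaves correctly for two values far from the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn: Predicate) -> bool:
    """Also checks the values right at the boundary."""
    return weak_suite(fn) and fn(18) is True and fn(17) is False


def mutation_score(suite: Callable[[Predicate], bool],
                   mutants: Sequence[Predicate]) -> float:
    """Ratio of mutants that are killed, i.e. make the suite fail."""
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return killed / len(mutants)
```

Both suites execute the single line in `is_adult`, so line coverage is 100% in both cases. The weak suite only kills one of the three mutants, however, while the strong suite kills all of them.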

But mutation testing is not a silver bullet:

Mutants can be generated in many different ways, which means that mutation testing becomes infeasible for anything but the smallest code bases.
It is also not clear to developers what they can do to improve the mutation score, and whether an improved score actually has any practical benefits (other than better-looking metrics).

Can these issues be fixed?

How the study was conducted

The authors of the paper built a tool that they call Mutation Monkey. It comes with two pipelines, a training and an application pipeline.

Mutation testing is often very costly – not only because generating all the different mutants takes a lot of time and processing power, but also because many of the generated mutants are easily killed (or not even syntactically valid) and thus useless.

The training pipeline solves this problem by semi-automatically learning bug-inducing patterns from three sources:

Defects4J, a collection of bugs extracted from popular OSS Java projects;
An internal database of fixes for crashes that happened in the production version of the Facebook app. By “reversing” these fixes it becomes possible to reintroduce crashes;
Commits with modifications that made an originally failing test pass.

This process is only partially automated, because experts are still needed to decide which and how many patterns to implement, and for the creation of patch-like templates that implement the patterns.

The application pipeline applies the mutation templates to the production version of the code. To reduce the number of mutants that have to be generated (remember, building and testing is expensive!), the pipeline tries to avoid “unprofitable” spots, like logging calls, and runs a light-weight syntax checker to catch syntactically invalid mutants.
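The summary doesn’t describe Mutation Monkey’s internals, so the sketch below only illustrates the general idea of such a pipeline on Python source code. The single mutation template, the list of “unprofitable” markers, and all function names are assumptions of mine.

```python
import ast
from typing import List

# Hypothetical markers for spots where mutants are unlikely to be useful.
UNPROFITABLE_MARKERS = ("logging.", "logger.")

EXAMPLE_SOURCE = '''
def check(status):
    logger.info("status == %s", status)
    if status == 200:
        return True
    return False
'''


def apply_template(line: str) -> str:
    """Hypothetical mutation template: flip the first equality check on a line."""
    return line.replace("==", "!=", 1)


def is_valid_python(source: str) -> bool:
    """Light-weight syntax check: parse the code without building or running it."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def generate_mutants(source: str) -> List[str]:
    """Return one mutated copy of the source per line that the template can change."""
    lines = source.splitlines()
    mutants = []
    for index, line in enumerate(lines):
        if any(marker in line for marker in UNPROFITABLE_MARKERS):
            continue  # skip unprofitable spots such as logging calls
        mutated = apply_template(line)
        if mutated == line:
            continue  # the template does not apply to this line
        candidate = "\n".join(lines[:index] + [mutated] + lines[index + 1:])
        if is_valid_python(candidate):
            mutants.append(candidate)
    return mutants
```

For `EXAMPLE_SOURCE` this yields a single mutant in which the `if` condition is flipped. The logging call also contains `==`, but is skipped before any expensive build or test run would be needed.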

The remaining mutants are submitted to the code review system outside of peak (office) hours, which makes scaling easier and is cheaper. Mutants that pass the test suite are then presented to developers. The pipeline also tells developers which tests visited the mutated block of code. This information should make it easier for developers to decide what they want to do.

What discoveries were made

Kill rates were fairly similar across the various mutation patterns. However, some mutations were applied successfully a lot more often than others. For instance, the NULL_DEREFERENCE pattern was applied almost 2,000 times, while the REMOVED_SYNCHRONIZED mutations only occurred 143 times within the same period of time.

Interestingly, REMOVED_SYNCHRONIZED is also the only pattern with a much higher kill rate, which suggests that developers are aware that synchronisation-related bugs are hard to debug and thus spend more time writing tests for them.

The researchers also conducted interviews with 29 developers to learn more about the effectiveness of Mutation Monkey’s approach.

Most – if not all – developers had not heard of mutation testing prior to the experiment, and needed more information than what was provided by Mutation Monkey.

However, after an explanation from the researchers, about 85% believed that Mutation Monkey is a useful tool that could help them write (better) tests. Virtually everyone was also positive about the test coverage information that was included with the test reports.

However, fewer than half of the developers confirmed that they would write a test for the gap that Mutation Monkey had found. When asked why not, developers often gave the following reasons:

they want Mutation Monkey to come up with a test;
the mutated code was of minor importance;
the mutated code was about to be deprecated;
the code was still new and likely to undergo iteration before stabilising; and
the mutated code is in a badly tested part of the code base (😕?!).

In other words, this new approach seems to be better than existing approaches, but still yields too many false positives.

Codes of ethics, do they work?

Chun Fei Lung — Tue, 24 Aug 2021 19:37:31 +0000

Some decisions are ethical, others just Zuck

I read and summarise software engineering papers for fun, and today we’re having a look at Does ACM’s code of ethics change ethical decision making in software development? (2018) by McNamara and others.

Codes of ethics provide guidelines that help you do the right thing, but do they actually work?

Why it matters

Software developers constantly make ethical considerations, e.g. when deciding how much user data to collect or time to spend on mitigating security risks. Sadly, developers are people and thus don’t always make the right decisions.

The Volkswagen emissions scandal (also known as Dieselgate) is a highly publicised example of a case where engineers were told to write software that would cause cars to “lie” about their pollution levels during emission tests. The engineers voiced their concerns about this unethical practice internally, but did not inform the authorities. The scandal eventually cost the company more than $30 billion in fines and led to possibly hundreds of early deaths.

To encourage ethical behaviour many professional organisations, like the ACM, have published a code of ethics that provides guidelines for ethical behaviour. While the effectiveness of such codes of ethics has been studied in the past, no one has done this yet for the computing field.

How the study was conducted

A survey was created that described a fictional company that a respondent had just joined as a lead developer. It presented 11 software-related ethical cases, along with an ethical decision, an unethical decision, and an “unsure” option for each case.

The survey was spread among a large number of software engineering students and professional software engineers. About half of the respondents were simply told that the fictional company had strong ethical standards, while the other half was told that the company followed the ACM code of ethics.

What discoveries were made

No statistically significant difference was found between the control group and the group that saw a brief version of the code of ethics. Responses from students were also very similar to those from professional software engineers.

Two of the cases that were presented in the survey were based on recent news stories: the Waymo v. Uber dispute and the aforementioned Dieselgate scandal. None of the respondents recognised the Waymo dispute, but 20 respondents did mention that they recognised the Dieselgate story.

The researchers found that those who did not recognise the Dieselgate story were more likely to favour the creation of test-evading software, whereas none of the 20 respondents who recognised the story chose to act unethically.

This suggests that engineers can be influenced to make more ethical decisions by providing examples of similar news-worthy decisions that make clear that unethical decisions can have undesirable consequences.

Six ways to mess up your MVC architecture

Chun Fei Lung — Mon, 23 Aug 2021 10:25:57 +0000

Focus on the model

I read and summarise software engineering papers for fun, and today we’re having a look at Code smells for model-view-controller architectures (2018) by Aniche and others.

Why it matters

Code smells are poor design and implementation choices that hinder comprehensibility and maintainability of code.

Many studies have shown that code smells make code less maintainable and more prone to bugs. Some smells cause code to be changed more often due to violations of the single responsibility principle, which states that a class or module should have only one reason to change.

Most of these studies are based on a catalog of code smells that was originally defined by Martin Fowler and Kent Beck in Refactoring. The smells are generally applicable to any system written in an object-oriented manner, but overlook the role that a class or module may have in the system’s architecture.

This is why we have to study smells for specific architectures, like MVC. We must learn about their characteristics and impact, so that developers can understand how to avoid those smells and static analysis tools know how to recognise them.

How the study was conducted

The study makes use of surveys, interviews, and repository mining.

Creating the catalog

The authors started with a three-step data gathering process:

A simple survey that asked respondents to list good and bad practices for dealing with models, views, and controllers. The survey yielded 22 complete responses.
A more comprehensive survey that aimed to elicit good and bad practices for each of the five major MVC roles (controller, entity, service, component, and repository). This survey was completed by 14 respondents.
Unstructured interviews were held about good and bad practices for each of the five roles. The authors interviewed 17 professional developers, all of whom were familiar with the Spring MVC framework.

Two authors performed an open coding process to group good and bad practices into categories. Practices that were not specific to the MVC pattern were discarded.

The coding process resulted in a list of nine possible smells, which were presented to a core Spring MVC maintainer. Three of the smells were removed because they were specific to Spring and therefore not likely to affect users of other MVC frameworks.

Understanding code smells

To understand the characteristics and impact of code smells, the researchers:

analysed 120 Spring MVC projects that are hosted on GitHub;
asked 21 Spring MVC developers to take part in a survey that assessed their perception of the six code smells;
looked for experts in development of MVC applications using frameworks other than Spring.

What discoveries were made

I’ll list the code smells first, and discuss their characteristics and impact afterwards.

Code smells

Six MVC smells were identified.

Promiscuous controller

A controller should provide cohesive operations and endpoints to clients, i.e. it should depend on a limited number of services (at most 3) and handle at most 10 routes (*).

(*) The values 3 and 10 were derived using a formula that can be found in the original article. Don’t think of these (and other numerical values in this section) as absolute thresholds; they’re more like rules of thumb.

Promiscuous controllers can be broken up into two or more classes until each controller is no longer promiscuous.
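As a rough illustration, the rule of thumb could be checked like this. The `ControllerInfo` class is a hypothetical summary of a controller (e.g. produced by a static analysis step) and is not part of the tooling used in the study. The sketch assumes that exceeding either threshold is enough to flag a controller.

```python
from dataclasses import dataclass, field
from typing import List

# Rule-of-thumb thresholds quoted above; not absolute limits.
MAX_SERVICES = 3
MAX_ROUTES = 10


@dataclass
class ControllerInfo:
    """Hypothetical summary of a controller class."""
    name: str
    services: List[str] = field(default_factory=list)
    routes: List[str] = field(default_factory=list)


def is_promiscuous(controller: ControllerInfo) -> bool:
    """Flag controllers that depend on too many services or handle too many routes."""
    return (len(controller.services) > MAX_SERVICES
            or len(controller.routes) > MAX_ROUTES)
```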

Brain controller

Flow control in controllers should be very simple, e.g. ideally a controller shouldn’t interpret input to determine what actions to take. Since a controller with a lot of flow control will be littered with method invocations, the researchers argue that the number of non-framework methods that can be executed by a controller should never exceed 55.

Since business logic is supposed to be implemented in the model layer, it has no place in controllers. Any business logic in a brain controller should be moved to an entity, component, or service class.

Meddling service

A service should contain business logic and/or handle business logic among domain classes, but never contain SQL queries.

Data access must always be handled by repositories instead.
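The study looked at Spring MVC, but the separation is easy to sketch in any language. In the Python example below the service only contains a business rule, while the only SQL query lives in the repository. The classes, table, and discount rule are all made up for illustration.

```python
import sqlite3
from typing import Optional


class OrderRepository:
    """All SQL lives here."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._connection = connection

    def find_total(self, order_id: int) -> Optional[float]:
        row = self._connection.execute(
            "SELECT total FROM orders WHERE id = ?", (order_id,)
        ).fetchone()
        return None if row is None else row[0]


class DiscountService:
    """Business logic only: asks the repository for data instead of running SQL."""

    def __init__(self, orders: OrderRepository) -> None:
        self._orders = orders

    def discounted_total(self, order_id: int) -> float:
        total = self._orders.find_total(order_id)
        if total is None:
            raise LookupError(f"order {order_id} not found")
        # Hypothetical business rule: 10% off orders of 100 or more.
        return total * 0.9 if total >= 100 else total


def demo() -> float:
    """Wire both classes to an in-memory database and apply the rule to one order."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
        connection.execute("INSERT INTO orders (id, total) VALUES (1, 200.0)")
        return DiscountService(OrderRepository(connection)).discounted_total(1)
    finally:
        connection.close()
```

A meddling service would run the `SELECT` itself, which means it would have to change whenever either the discount rule or the database schema changes.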

Brain repository

Repositories are meant to handle anything related to data persistence, but should not contain complicated business logic or complex queries (**).

(**) More specifically, queries that join multiple tables with complex filters, queries that are constructed dynamically, or objects that are manually assembled from query results.

One could therefore argue that a brain repository is a class whose McCabe’s complexity exceeds 24 or SQL complexity exceeds 29. Complex logic and SQL queries should live in different methods. Logic that’s used by multiple repositories should be in a component.

Laborious repository method

Methods should have only one responsibility and do one thing. For a repository method, this means that it should execute only one query.

Methods that execute multiple queries should be split, so that each method only executes one query. Methods can be private or public, depending on whether the persistence action makes sense on its own.

Fat repository

Each repository should only deal with a single entity, otherwise it loses its cohesion and becomes harder to maintain.

If a repository deals with multiple entities, each entity should get its own dedicated repository.

Characteristics and impact

The most common smell in the analysed dataset is the fat repository (20.5%), followed by the promiscuous controller (12.2%) and the brain controller (7.4%).

Brain controllers and laborious repository methods are often also affected by the traditional “complex class” code smell and in 59% of cases a brain controller is also a “God class”. The other smells do not appear to overlap at all with traditional code smells, which suggests that the smells in this catalog really are from a distinct category.

Impact on change- and defect-proneness

Analysis of Spring MVC projects shows that classes affected by MVC and traditional smells are significantly more prone to changes (almost 3 times more likely) and defects (2 times more likely).

These differences become smaller when artifact size is taken into consideration (***): classes affected by smells are still prone to changes, but are not more defect-prone.

(***) More code means more opportunities for bugs to appear, so this isn’t very surprising of course.

There are also differences between MVC smells and traditional smells: the latter have a stronger negative impact on change- and defect-proneness.

Of the six MVC smells, the brain repository and meddling service have the strongest impact on change-proneness, while the meddling service is the only MVC smell that clearly results in more bug-fixing activities.

Perception by developers

Developers clearly perceive classes affected by MVC smells as problematic, particularly in the case of the meddling service, fat repository, and brain controller smells.

Some developers were able to correctly identify and define the smells without prior knowledge of the researchers’ catalog. On the other hand, over half of all participants did not perceive classes affected by the laborious repository method as problematic.

Introduction and survival

Once an MVC smell is introduced in a system, it tends to survive for quite a long time. In general, there’s a more than 50% chance that a smell will survive for longer than 500 days. Fat repositories even have an 80% chance of surviving more than 1,500 days. 69% of smells are never removed at all.

These smells are not always caused by code aging (****): some smells already exist when the code artifact is first committed to the repository. For laborious repository methods this even happens 86.5% of the time!

(****) Do keep in mind that this analysis was done on open source projects. Closed-source projects may have different characteristics.

Generalisability to other frameworks

Most of the identified smells are generalisable to other frameworks (specifically VRaptor, Ruby on Rails, ASP.NET MVC, and Play!).

Applications built using frameworks that use the active record pattern don’t appear to suffer from meddling services and generally don’t use repositories, which would essentially eliminate three of the MVC smells. Instead, they’re more likely to suffer from fat models.

Should you copypaste code from Stack Overflow?

Chun Fei Lung — Sat, 21 Aug 2021 11:12:23 +0000

Hot answers aren’t always the best answers

I read and summarise software engineering papers for fun, and today we’re having a look at Are code examples on an online Q&A forum reliable? A study of API misuse on Stack Overflow (2018) by Zhang and others.

Why it matters

If you’re stuck with a programming problem or have just started experimenting with a new framework or library, code examples on Stack Overflow can be tremendously helpful.

Many of them are short and to the point, which makes them easy to understand and reuse.

Unfortunately, herein also lies the rub: the code examples don’t always show all the code that one should use in a production environment.

For example, a code example might show how to open and read from a file, but neglect to point out that you first need to check whether that file actually exists or that the file handle should be closed afterwards.

This may cause all kinds of issues when software is deployed in a production environment, like resource leaks and program crashes.
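The study looked at Java APIs, but the same problem is easy to show in Python. Both functions below are my own illustration rather than snippets from the paper: the first is what a concise answer might show, and the second adds the checks that production code usually needs.

```python
from pathlib import Path
from typing import Optional


def read_first_line_naive(path: str) -> str:
    """Typical short answer: crashes if the file is missing and never closes it."""
    handle = open(path)
    return handle.readline()


def read_first_line(path: str) -> Optional[str]:
    """More defensive version: checks that the file exists and always closes it."""
    file = Path(path)
    if not file.is_file():
        return None
    # The with statement closes the handle, even if reading raises an exception.
    with file.open(encoding="utf-8") as handle:
        return handle.readline().rstrip("\n")
```

Even the second version is not bulletproof: the file could still disappear between the check and the call to `open`, so code that really cares should also handle `FileNotFoundError`.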

How the study was conducted

The goal of the study is to determine whether and how code examples on Stack Overflow differ from best practices when it comes to using libraries.

Discovering these best practices is far from trivial: there are countless libraries, and each has its own gotchas and best practices.

Mining GitHub

The authors therefore designed a tool called ExampleCheck.

ExampleCheck infers API usage in three steps. More specifically, it:

searches GitHub for snippets in which an API’s method is invoked. It then uses program slicing to filter out statements that are specific to the program. The result is a normalised representation (*) that consists of a sequence of statements that are related to the invoked method.
identifies common patterns in the sequence of statements surrounding calls to the API’s method. Additionally, it filters out calls that are used in only a few outlier examples.
determines which guard conditions should precede API method calls. This is done by first creating canonicalised versions in which project-specific predicates are replaced with true and API-specific variables are given generic names. The conditions are then simplified and merged until only the most frequently appearing patterns remain.

(*) This means that things that are specific to the analysed project, like code style, are converted so that two snippets from two different programs that essentially do the same thing will look identical to each other.

The authors ran ExampleCheck on 380,000 GitHub projects for 100 popular Java API methods from 9 different domains. On average, each method has about 55,000 associated snippets, ranging from 211 to more than 450,000.

ExampleCheck infers 245 API usage patterns. Manual inspection shows that 180 of those patterns are usable for the next part of the study.

Mining Stack Overflow

The authors extract code snippets from all Stack Overflow answer posts that mention one of the 100 Java API methods, and gather some additional information for each post, like the number of votes and whether the post was accepted as a correct answer.

ExampleCheck is used to check whether the sequence of method calls in each snippet is subsumed by one of the identified API usage patterns.
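A heavily simplified sketch of such a check is shown below. It treats a usage pattern as an ordered list of call names and tests whether those calls appear in the snippet in the same order. The real tool also looks at guard conditions and control structures, which this sketch ignores, and the pattern itself is a made-up example.

```python
from typing import Iterable, Sequence


def follows_pattern(calls: Iterable[str], pattern: Sequence[str]) -> bool:
    """True if every call in `pattern` appears in `calls`, in the same order."""
    remaining = iter(calls)
    # Each any() consumes the iterator up to the next match, which preserves order.
    return all(any(call == expected for call in remaining) for expected in pattern)


# Hypothetical pattern: a stream that is opened and read should also be closed.
STREAM_PATTERN = ["new FileInputStream", "read", "close"]

SNIPPET_WITHOUT_CLOSE = ["new FileInputStream", "read"]
SNIPPET_WITH_CLOSE = ["new FileInputStream", "read", "println", "close"]
```

The first snippet would be reported as potential API misuse because the call to `close` is missing, while the second one follows the pattern despite the unrelated `println` call.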

A manual verification of 400 randomly selected posts suggests that about three quarters of the reported posts are true positives.

False positives are generally caused by a lack of deep knowledge about postconditions of methods and usage patterns that are correct, but not used very frequently. Finally, warnings in natural text and examples that are distributed over several <code> blocks also result in false positives.

What discoveries were made

ExampleCheck detects potential API misuse in 31% of Stack Overflow posts that were considered for this study.

If reused without modification, the code in these posts would likely result in crashes (76%), incomplete actions (18%), or resource leaks (2%).

Specifically, APIs for databases, IO, and networking often lack exception handling and proper closing of resources. Examples for cryptography APIs and string manipulation are unreliable for similar reasons: input and output should always be validated, especially if a method might return a null value or throw exceptions.

This wouldn’t really be an issue if it were clear to readers which posts contain API misuse. Unfortunately, that isn’t the case.

For instance, highly voted posts aren’t necessarily more reliable. Moreover, posts with API misuse have more views on average than posts without any misuse. A possible reason for this is that highly voted posts tend to contain concise step-by-step explanations, i.e. they’re written for simplicity and readability rather than real-world circumstances.

It would be helpful if Stack Overflow were to provide some method to show best practices next to code snippets in answer posts. The authors propose a browser extension that adds this functionality. You can find a screenshot and description of the extension in the original article.