Software Best Practice?


Blood letting was probably one of the most commonly practised medical procedures for thousands of years, and didn't fall out of fashion until the late 19th century. The practice involved opening a vein (or, less often, an artery) to remove large quantities of blood, and it was used as a treatment for more or less everything. Medicine, as a body of knowledge, was missing a great many important details that would have discouraged the use of blood letting. Doctors had to rely on experience, and deduction based on the limited knowledge they had.

The prevailing theory of disease for most of recorded history has been humorism. Humorism stated that all diseases were the result of imbalances in the body's four humors (blood, yellow bile, black bile, and phlegm). Although blood letting probably came first, humorism provided theory to justify its use. Even after humorism's popularity came to an end, blood letting was easily justified. Inflamed or infected areas appear engorged with blood, and it was reasonable to assume that the blood was in fact part of the cause. Patients who received the treatment swore by its effects. Even those who were unsure would appear calmer and more comfortable afterwards. Despite almost universal consensus on its effectiveness, it resulted only in additional suffering and death.

Eventually experiment and statistics would convince the world of the dangers of blood letting. Despite the long history of physicians who saw it work, and the logical theories that supported its claims, they would eventually concede that the numbers weren't lying. The proof was there for anyone willing to keep track: those treated with blood letting were significantly more likely to succumb to their illness.

In the world of software development we tend to elevate simple, precise, computable logic and deductions above other types of reasoning. This is hardly surprising given the nature of our day-to-day work. It is also unsurprising that we then tend to apply this type of reasoning to situations where better tools are available. When your only tool is a hammer, every problem begins to look like a nail.

One area of concern is the application of this kind of reasoning to how we (software developers) should practise our trade. Determining how people work best isn't easily achieved with flow charts and algorithms. Even Game Theory has limited real world applications, because it assumes people are rational agents when they are almost certainly not. People are complicated and our behaviour doesn't lend itself well to being analysed this way. We are chaotic in the mathematical sense of the word.

Why then do books on software development try to convince us of particular best practices using deduction? If we codify our requirements in tests, and run those tests whenever we build our applications, we will be alerted quickly to any problematic changes to our code, and therefore can produce software with fewer bugs. This last sentence is typical of the kind of deductive arguments used to convince us of certain best practices. But as with medicine, deduction isn't enough. Although the foundation on which we base our deduction is much more solid, that doesn't mean we're not making the same mistakes. How can we be sure that we're not suffering under the same faults of logic that generations of physicians were?

An Alternative

There are many existing fields that deal with this kind of human level complexity, and they reason very differently to fields like physics, maths, engineering, computer science, etc. Medicine, economics, psychology, and sociology (to name a few) all deal with human level complexity much more successfully. The first set of sciences deal with reliable rules. So reliable that determining some future event can be as simple as plugging values into an equation. If I apply 50 kN to a stationary 1000 kg asteroid it will accelerate at 50 m/s² (a = F/m). No experiment is required. We know this to be true within very small error margins, and when there are other factors they are clear to see and easy to adjust for. Simple deductions don't work when we're dealing with people.

The human sciences in contrast rely much more heavily on experiment and statistical analysis. If we want to know how group X will perform in situation Y, then we put them in that situation, measure their performance, and compare it to a control group. I don't want to imply that experiments of this kind don't occur in the world of software, but the quantity, quality, and availability of this kind of research is drastically lacking.

Take the practice of software testing for example. A systematic review of the empirical, experimental evidence for testing was published in 2014¹ which found that only 5 journal articles were published detailing experiment based research in the previous thirteen years. Not only that but none of these papers were up to scratch with the standards typically expected of this kind of work. Almost none of them even employed any statistical analysis, or published the raw data. In 2016 a meta-systematic review² (detailing all the reviews in software testing) was published that found only 4 systematic reviews ever conducted on the experimental evidence for testing. All four of these reviews could find no evidence reliable enough to draw any reasonable conclusions about the effectiveness of varying software testing practices.¹,³,⁴,⁵

There are, fortunately, some facts to be known about how software developers work best. This is where we ought to be focusing. When we are dealing with people, and in industry we always are, we should be looking to sciences like medicine and sociology for how to conduct research. Ultimately, the question of how software developers react to the wide variety of best practices is a question of sociology. All the following questions are questions about how people react to these best practices, rather than questions about the practice directly.

Do good variable names allow us to understand and change code more quickly?
Do unit tests catch bugs that the system's users actually care about?
Are classes and methods with a single responsibility easier to read and understand?
Is it easier to maintain classes built using composition rather than inheritance?

It's common, when asking someone about some software development best practice, to be given some kind of logical, deductive argument that justifies belief in the practice's effectiveness. Logic, deduction, and anecdote are better than nothing but they're not great. I would be a hypocrite if I told you not to use those kinds of arguments at all. But we shouldn't be justifying our practices, we should be proving that they are effective. I hope to convince you, not only that doing research like this is important, but also that it would be easy for software developers.

What Good Research Looks Like

The fundamentals of a good piece of research are pretty simple. Form a hypothesis first. This way you can't look for other results in your data afterwards and claim you were actually looking for something else. This might not immediately seem like a problem, but it is. Then take two groups of people, apply different practices to each, and compare the results. We've been doing this since biblical times though⁶, so we can do better.

How can we know for certain that the differences we see are because of the change we are testing? Well, we can't know for certain without understanding absolutely everything that happened at every stage. Which, when we're dealing with people, is effectively impossible. But there are ways of being more certain than we might otherwise be. Effectively this boils down to removing anything we can think of that might feasibly be seen as an alternative explanation for the results we get. Here is a short list of common things we account for:

Subject bias.
Researcher bias.
Selection bias.
Flukes.

Subject bias is a term used to describe the fact that participants in research very often change their behaviour when they think they understand the purpose of the research. For example if you tell a group of software developers they are taking part in research to work out whether unit testing is worth doing, the participants in the unit testing group may try harder because they think they're in the "good group". This doesn't have to be a conscious decision on their part. It can be entirely unconscious. To remedy this problem we prevent the subjects from knowing whether they are in the experimental group or the control group. We call this blinding. There are a number of ways of doing this, but one of them might be to tell them very little about the purpose of the research so they can't know whether they are in the control or not.

Researcher bias is a similar problem but affecting the people doing the research themselves. A researcher might need, for example, to classify participant reactions into categories. If the researcher knows whether the participant is in the experimental group or the control group this may influence their decision in unclear instances. We can do a similar thing to solve this problem. We just stop researchers from knowing which participants are in which group until all the data has been collected. We call this double blinding.
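One way blinding might look in practice is to replace the real group names with opaque codes before anyone sees the data, and to keep the key aside until collection is finished. The sketch below is only an illustration under that assumption; the function name `blind_labels`, the participant ids, and the single-letter codes are all hypothetical.

```python
import random
import string


def blind_labels(allocation, rng=None):
    """Replace group names with opaque codes.

    Returns (blinded, key): `blinded` maps participant -> code and is what
    researchers see; `key` maps group -> code and stays sealed until all
    the data has been collected.
    """
    rng = rng or random.Random()
    groups = sorted(set(allocation.values()))
    # sample() draws distinct letters, so no two groups share a code.
    codes = rng.sample(string.ascii_uppercase, k=len(groups))
    key = dict(zip(groups, codes))
    blinded = {pid: key[group] for pid, group in allocation.items()}
    return blinded, key


# Hypothetical allocation, for illustration only.
allocation = {"p01": "control", "p02": "experiment", "p03": "control"}
blinded, key = blind_labels(allocation)
```

Whoever classifies participant responses would work only from `blinded`; `key` is used once, at the end, to unblind the results.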

Selection bias occurs when the subjects for your experiment are selected in such a way that it unwittingly affects the results. This can be due to selecting subjects for the control or experiment groups manually, or it can be simply because the total sample you chose is not representative of the general population. There's no hard and fast way to avoid all selection bias but if you get a large and broad selection of subjects your sample is more likely to be representative. Also, if you ensure that subjects are added to the experiment and control groups randomly, you won't accidentally affect the results that way (we call this randomization). Effectively we want to make sure that there aren't other obvious differences between our two groups except the thing we want to test.
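Random allocation is simple to automate. The following is a minimal sketch, assuming participants are identified by strings and that groups of roughly equal size are wanted; `assign_groups` and the ids are hypothetical names chosen for illustration.

```python
import random


def assign_groups(participant_ids, groups=("control", "experiment"), seed=None):
    """Randomly allocate participants to groups of (near) equal size."""
    rng = random.Random(seed)  # a fixed seed makes the allocation repeatable
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled participants out to the groups in turn, like cards.
    return {pid: groups[i % len(groups)] for i, pid in enumerate(shuffled)}


# Hypothetical participants, for illustration only.
allocation = assign_groups(["p01", "p02", "p03", "p04", "p05", "p06"], seed=42)
```

Because no human chooses who goes where, the allocation cannot be influenced by anyone's expectations, and recording the seed lets others reproduce it exactly.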

How do we account for the fact that our results might just be a fluke? There are two ways. The first is that we commit to publishing the results whatever they happen to be. If we only ever publish positive results then in certain areas only the flukes would get published. If we always publish then our flukes don't matter because they can be compared to the entire body of evidence. The other way we account for flukes is with statistical testing. It is outside the scope of this piece to discuss how we use statistics to determine that our results are unlikely to be due to chance. But it suffices to say that we absolutely should do this, and it is difficult to do correctly so involve a statistician.
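To give a flavour of what such a test does, here is a rough sketch of one simple technique, a permutation test for a difference in means. It is an illustration only and not a substitute for a statistician's advice; the function name and the timing data are invented for the example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Repeatedly shuffles the pooled measurements into two arbitrary groups
    and counts how often the shuffled difference is at least as large as
    the difference actually observed.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Invented timings in seconds, for illustration only.
seconds_descriptive = [41, 38, 45, 40, 39, 43]
seconds_meaningless = [58, 61, 55, 63, 57, 60]
p_value = permutation_test(seconds_descriptive, seconds_meaningless)
```

A small p-value says that a difference this large would rarely arise from random grouping alone. It says nothing about bias in how the data was collected, which is why the earlier safeguards still matter.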

There is a lot more to good research than what I've detailed above, but if you're thinking about all the above then you've made a good start. If you're thinking of conducting research then you're going to want to spend a lot more time reading about good research practices than you've spent so far reading this. Lastly, your research needs to be reproducible. That means that every detail that would be required to do your research again, and get a similar result, should be documented. Software developers are in an extremely good position to write some of the most reproducible research around by the very nature of our profession. More on this later.

Why This Should Be Easy

Computer science and software development research should be getting experimental research right. Not least of all because we are in a uniquely good position to produce experiment based studies of an especially high quality. Why? Because many of the best practices for producing good experimental research can be automated to improve them above what we expect from other fields. Additionally, the work software developers do is on a computer. Most of the experiments we are likely to care about can be performed over the internet without costly, and time consuming, experiment designs. Our experiments can be written as an application whose source code could be publicly hosted on GitHub so that other institutions could reproduce them with a few lines of code or hosting configuration. To illustrate this point, let's consider an example.

Imagine that we want to find out whether good variable names affect the time it takes for a programmer to understand a short code segment, or whether they are able to understand it at all. The study design is simple. We devise a series of functions performing common programming tasks. We also provide arguments for each function and pose a question for each that is designed to test the reader's understanding of that function. We take all these functions and create three groups; one where the functions have descriptive variable names, one with meaningless variable names, and one with descriptive but inaccurate variable names (actively incorrect). In order to get this experiment out to participants and collect the appropriate data we can do something that most other fields do not have the luxury of (without incurring additional expense). We develop a web application to administer the experiment. Let's briefly explore the characteristics of this hypothetical application.
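To make the three groups concrete, here is one invented function written three ways. The task, the names, and the question are assumptions made purely for illustration; a real study would need many such functions. All three variants compute exactly the same thing.

```python
# Variant A: descriptive names.
def total_price(prices, tax_rate):
    subtotal = sum(prices)
    return subtotal * (1 + tax_rate)


# Variant B: meaningless names.
def f(a, b):
    c = sum(a)
    return c * (1 + b)


# Variant C: descriptive but inaccurate names.
def average_age(ages, count):
    oldest = sum(ages)
    return oldest * (1 + count)


# Question posed to every participant, whichever variant they are shown:
# "What does this function return when called with ([10, 20], 0.5)?"
```

Because the behaviour is identical, any difference in speed or accuracy between the groups can only come from the naming.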

It's a web application.
It's hosted without any required configuration by sending it to a hosting service.
It has its own database where it securely stores the collected data.
To start the research you provide a list of emails to send the research to.
Participants are randomly allocated a group when they start their participation.
The time taken, marks, etc, for each question are recorded.
It performs the appropriate statistical analysis and produces digestible results.
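The recording and summarising parts of such an application could be sketched roughly as follows. This is an assumption about how the data might be shaped, not a description of an existing tool; `Response`, `timed_answer`, and `summarise` are hypothetical names.

```python
import time
from dataclasses import dataclass
from statistics import mean


@dataclass
class Response:
    participant_id: str
    group: str
    question_id: str
    seconds_taken: float
    correct: bool


def timed_answer(participant_id, group, question_id, expected, ask):
    """Call `ask()` to obtain an answer and record how long it took."""
    start = time.monotonic()
    answer = ask()
    elapsed = time.monotonic() - start
    return Response(participant_id, group, question_id, elapsed, answer == expected)


def summarise(responses):
    """Per-group sample size, mean time taken, and proportion correct."""
    by_group = {}
    for response in responses:
        by_group.setdefault(response.group, []).append(response)
    return {
        group: {
            "n": len(rs),
            "mean_seconds": mean(r.seconds_taken for r in rs),
            "proportion_correct": mean(1 if r.correct else 0 for r in rs),
        }
        for group, rs in by_group.items()
    }
```

In a web application `ask` would be replaced by the request and response cycle of a page, but the record it produces would look much the same.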

Our hypothetical application above embodies the research methodology. It produces research that is double blinded, randomised, controlled, with the appropriate statistical analysis, and above all it is as reproducible as you could realistically imagine. If the code for this application is hosted openly, under an open source license, on GitHub then the precise details of its methodology can be examined and critiqued directly. And variations of the research method can be declared and similarly examined by simply forking the project.

TL;DR

The lack of experimental evidence for software development best practices is disgraceful, not just because of the lack of research, but because we are in a position to do research of the highest caliber. I hope this reaches the eyes of those of you who, having read this, might be inspired to do better.

References

1. J E González, N Juristo, S Vegas. A systematic mapping study on testing technique experiments. Intl Symp on Emp Software Eng & Measurement 2014 3:1-4

2. V Garousi, M Mäntylä. A systematic literature review of literature reviews in software testing. Information & Software Tech 2016 80:195-216

3. R A Silva, S Souza, P Souza. A systematic review on search based mutation testing. Information & Software Tech 2017 81:19-35

4. N Juristo, A M Moreno, S Vegas. Reviewing 25 Years of Testing Technique Experiments. Empirical Software Engineering 2004 9:7-44

5. M Hays, J F Hayes, A C Bathke. Validation of Software Testing Experiments. ICST 2013

6. The Bible - The Book of Daniel 1:5-16