A/B Testing with No B: You Are Eating Soup with a Fork

Over the years I have reviewed many journal and conference papers and graded several course projects. One of the most common mistakes I have encountered, across many disciplines, is tools being used in the wrong place. Unfortunately, using the wrong tool is not like using 3.45 instead of pi (3.14) for calculating the circumference of a circle and getting a large but consistent error; it is like confusing the circumference of a circle with the volume of a sphere.
It seems to me that these mistakes are becoming more and more frequent, which is alarming. Here I present three examples that I keep seeing.

A/B testing
One of the most commonly used tools in business intelligence, user experience research, medical research, and many other areas is A/B testing: a form of statistical hypothesis testing in which two (or more) variants, groups (treatment and control), versions (A and B), or conditions are compared with each other. This test is suitable for answering questions such as:
– Which interaction technique takes less of the user’s time?
– Does a drug affect a user’s behaviour?
– Are men or women more likely to take risks?
A/B testing is a great tool because it is flexible enough to be used in different fields and very straightforward. However, it does not fit many situations.

For starters, it is a relative measurement tool. You cannot use it to prove something is good or bad. You can only use it to show that something is better or worse than something else. Secondly, it needs at least two variations or groups for comparison.

On many occasions I have seen people try too hard to use A/B testing in the wrong place. For example, they design an interface (let’s call it interface A) and want to show that it is a great interface, so they compare it with another interface (interface B). At this point they have only shown that A is better than B. Without any measurement of the quality of interface B, this gives us no useful information. In some rare cases interface B is a known interface such as Microsoft Word, which still does not mean they can skip evaluating interface B. In many cases, though, interface B is as new and unknown as interface A. They can both be garbage, and one being better than the other does not make it great. When I see cases like this, the first thing that comes to my mind is that whoever designed the experiment probably started with one idea (even a great idea) but then came up with a worse idea just for the sake of having a race between the two. It is similar to a person playing chess against himself: you can never be sure whether he put equal effort into both sides, and even if he did, whether a win says anything about his skill.
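To make the "relative measurement" point concrete, here is a minimal sketch of an A/B comparison using a permutation test on the difference of means. The task-completion times are made-up numbers and the function name `permutation_test` is my own; this is one of several ways to run such a comparison, not the only one. Adding ten minutes to every measurement in both groups leaves the verdict essentially unchanged, so the test can say that A is faster than B but not whether either is fast.

```python
import random
import statistics

def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = statistics.fmean(a) - statistics.fmean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled measurements to two groups.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_resamples + 1)

# Hypothetical task-completion times in seconds (illustrative only).
times_a = [41.2, 38.5, 44.0, 39.9, 42.3, 40.1, 37.8, 43.5]
times_b = [52.7, 49.8, 55.1, 50.4, 53.9, 51.2, 48.6, 54.3]

diff, p_value = permutation_test(times_a, times_b)
print(f"mean(A) - mean(B) = {diff:.2f} s, p = {p_value:.4f}")

# Make BOTH interfaces ten minutes slower: the comparison still favours A
# by the same margin, although both are now clearly terrible.
slow_a = [t + 600 for t in times_a]
slow_b = [t + 600 for t in times_b]
diff_slow, p_slow = permutation_test(slow_a, slow_b)
print(f"mean(A) - mean(B) = {diff_slow:.2f} s, p = {p_slow:.4f}")
```

Judging absolute quality would need something outside the test itself, such as a pre-defined target or an established baseline.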

Likert scale
Another very commonly used tool is the Likert scale, where the responses to a questionnaire take the form of multiple choices between two extremes (1. strongly agree, 2. agree, 3. neither agree nor disagree, 4. disagree, 5. strongly disagree). A Likert scale is ordinal; a good Likert scale is also balanced between negative and positive responses. However, a Likert scale, by itself, is not a continuous, let alone linear, measurement. As a result, one cannot take the average of Likert scale results. In other words, if two respondents agreed with a statement (chose “2. agree”) and one person strongly disagreed with it (chose “5. strongly disagree”), we cannot conclude that on average people neither agree nor disagree based on the simple math that the mean of two 2s and one 5, (2+2+5)/3, is 3. Mr. Achilleas Kostoulas explains this in more detail in his blog. Many others have pointed out the limitations of the Likert scale, yet in most cases (9 out of 10) I see people treating it not only as a continuous but also as a linear measurement tool; they even run statistical analyses unfit for such data. In some cases Likert scales are used as an indirect, subjective method of comparing two or more things. A simple example would look like this:
On a scale of 1 to 5 (1. strongly agree, 2. agree, 3. neither agree nor disagree, 4. disagree, 5. strongly disagree) please rate the following statements:
Q1. I like Apples.
Q2. I like Oranges.
You can replace Apples and Oranges with Interface A and B.
Here the researcher is trying to create two scores for Apples and Oranges in order to compare them, when he could simply ask: which one do you like more, Apples or Oranges? I know that it sounds like I am stating the obvious, but seeing so many similar mistakes from “experts” convinces me that it is not so obvious to some people.
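Going back to the averaging example (two respondents choosing 2 and one choosing 5), the following sketch shows why the mean misleads and what an ordinal-friendly summary might look like. The variable names are my own, and reporting the median together with the response counts is a common suggestion rather than the only acceptable approach.

```python
from collections import Counter
import statistics

LABELS = {
    1: "strongly agree",
    2: "agree",
    3: "neither agree nor disagree",
    4: "disagree",
    5: "strongly disagree",
}

# The three responses from the example in the text.
responses = [2, 2, 5]

# Treating the codes as numbers gives 3, a category that nobody chose.
mean_code = statistics.mean(responses)

# median_low always returns a value that actually occurs in the data,
# which suits ordinal categories.
median_code = statistics.median_low(responses)

# The full distribution is usually the most honest summary.
counts = Counter(responses)

print(f"mean code: {mean_code} ({LABELS[round(mean_code)]})")
print(f"median code: {median_code} ({LABELS[median_code]})")
for code in sorted(LABELS):
    print(f"{LABELS[code]}: {counts[code]}")
```

The counts make it plain that the group is split between agreement and strong disagreement, which the single number 3 hides completely.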

ANOVA
Analysis of Variance (ANOVA) is a set of tools for comparing the means of two or more groups. Its main strength, in my opinion, is that it is relatively easier to learn and teach than other statistical methods of comparison, period. To be able to use ANOVA, three assumptions need to be met:
1. Independence of observations – this is an assumption of the model that simplifies the statistical analysis.
2. Normality – the distributions of the residuals are normal.
3. Equality (or “homogeneity”) of variances, called homoscedasticity – the variance of the data in each group should be the same.
In most cases, the researchers do not even bother to check the first and third assumptions. However, most researchers use a histogram to show the distribution of the data; unfortunately, in many of those cases the histogram shows a non-normal distribution. For data to be normally distributed, at the very least it should be continuous; yet I have even seen ANOVA being used on binomial data (a special case of a discrete probability distribution).
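As a rough illustration of what checking the second and third assumptions can involve, the sketch below computes two quick diagnostics per data set: the ratio of the largest to the smallest group variance, and the number of distinct residual values. The function name `rough_anova_checks` and both data sets are hypothetical, and these are informal heuristics, not replacements for formal procedures such as Levene's test or the Shapiro-Wilk test.

```python
import statistics

def rough_anova_checks(groups):
    """Quick, informal diagnostics for ANOVA's variance and normality assumptions."""
    variances = [statistics.variance(g) for g in groups]
    # Residual = observation minus the mean of its own group.
    residuals = [x - statistics.fmean(g) for g in groups for x in g]
    return {
        # A ratio far above 1 hints that the variances are not equal.
        "variance_ratio": max(variances) / min(variances),
        # Very few distinct residuals means the data are discrete,
        # so the residuals cannot follow a normal distribution.
        "distinct_residuals": len({round(r, 9) for r in residuals}),
        "n": len(residuals),
    }

# Hypothetical continuous measurements (e.g. seconds) for three groups.
continuous = [
    [12.1, 13.4, 11.8, 12.9, 13.0, 12.4],
    [14.2, 13.7, 15.1, 14.8, 13.9, 14.5],
    [11.0, 12.2, 11.6, 10.9, 11.9, 12.5],
]

# Hypothetical binary outcomes (1 = success, 0 = failure) for three groups.
binary = [
    [1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1, 0, 1],
]

continuous_report = rough_anova_checks(continuous)
binary_report = rough_anova_checks(binary)
print("continuous:", continuous_report)
print("binary:", binary_report)
```

With binary outcomes each group can produce at most two distinct residual values, no matter how many observations are collected, which is a quick way to see that the normality assumption cannot hold for such data.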

Mistakes like these cannot be ignored just because a paper or a report is well written or an experiment is hard to conduct. When the methodology is wrong, the results are useless; when the results are useless, the discussions and the conclusions based on them cannot be trusted. It is worse than hammering a screw into a wall. It is like using motor oil for cooking. Regardless of the quality of the beef, I am not going to risk my life eating that steak.
