It was a Monday afternoon. My coworker Jason, a Senior UX Researcher, was talking through a study. Jason had done most of the planning and I had done most of the analysis. There were dozens of people on the Zoom call, most of whom were much higher than me on the totem pole, and many of whom I had never met. Then Jason passed it over to me. It was my turn to talk.
But first, I needed to get over my paralyzing impostor syndrome. Who was I to present findings from my first big quantitative user experience (Quant UXR) study? Yes, I had gone through my analysis with Jason, and another Senior Researcher, Sarah. But surely, the audience would point out all the flaws in my analysis or ask questions that I couldn’t answer.
They didn’t. But if Jeff Sauro and James R. Lewis were in the audience, they would have had a lot to say.
Coaching from the pros
After this Monday afternoon presentation, I followed my own advice from How to get started with Quantitative UX Research.
So, you like stats, do you? Me too! Let’s get nerdy. First, read Quantifying the User Experience by Jeff Sauro and James Lewis.
This book was exactly what I needed. Sauro and Lewis are pros when it comes to all things Quant UXR. Like great coaches, they let me know what I did right, which boosted my confidence. They also gave me some constructive criticism that I can use to make future studies even better.
If you’re also new to Quant UXR, read Quantifying the User Experience. But first, read this post so you can learn what I learned.
Before I jump into my lessons learned, let me give you a little more context.
Jason and I heard buzz about changing something fundamental in our design system. I will leave it to you, the reader, to guess what that fundamental change was. No, if you guess right, I won’t tell you. Anyway, Jason and I thought that this would be a perfect chance to sharpen our Quant UXR skills.
We asked for static high-fidelity designs for all potential choices. We ended up with four variants: one “control” that was identical to our live site and three new “challengers.” In other words, instead of an A/B test, this was an A/B/C/D test.
In this study, I owned analysis, data visualization, and presentation of findings. Jason was primarily responsible for planning and running the study.
Jason and I wanted to see if this fundamental design change would have an effect on our users. We decided to measure the performance of each variant based on accuracy and completion time of a couple of simple tasks. We also measured the subjective appeal of each variant based on the words used to describe the interface (using the Microsoft Desirability Toolkit) and a Likert score for perceived ease of use.
The idea was that if one variant was significantly better than the others in terms of performance and desirability, we should go with that variant. If, on the other hand, they were all about the same, then the designers could confidently recommend one variant based on other factors.
For this study, we used UsabilityHub, an online survey/usability testing tool. We set up four tests, each identical other than the images used, and included two click tests, the Desirability Toolkit question, and the Likert ease of use question. These tests were unmoderated and we had about 100 participants per variant, for a total of about 400 people.
In the interest of time, we used the participants provided by UsabilityHub without filtering based on demographics. We got results back in just a few days.
I’ll go into much more detail on my analysis in the reflection section below, but once the results were in, I exported them to CSV then got to work in Excel. I converted the raw data into scores, then ran ANOVA tests on the scores. I created charts in Excel then tweaked them in Sketch to make them look better.
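The same pipeline can be sketched in Python instead of Excel. This is a minimal illustration with made-up scores (the real data came from the UsabilityHub CSV exports), using SciPy’s one-way ANOVA:

```python
from scipy import stats

# Hypothetical ease-of-use scores (1-5 Likert) for the four variants.
# These numbers are invented for illustration only.
control      = [4, 5, 3, 4, 4, 5, 3, 4]
challenger_a = [3, 4, 4, 5, 3, 4, 4, 3]
challenger_b = [4, 4, 5, 3, 4, 5, 4, 4]
challenger_c = [3, 5, 4, 4, 3, 4, 5, 4]

# One-way ANOVA: is at least one group mean different from the others?
f_stat, p_value = stats.f_oneway(control, challenger_a, challenger_b, challenger_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

If the p-value is below your alpha (typically 0.05), at least one variant differs from at least one other; as discussed below, it takes follow-up tests to say which.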
In the end, the four variants performed almost identically. Even the adjectives that test participants used to describe the variants were shockingly similar. So we recommended that we choose one variant based on other factors then implement it. We haven’t made this change yet and we may end up running another round of testing, but I learned a lot during this study and I’m looking forward to doing more!
Back to my imaginary coaches, Jeff Sauro and James R. Lewis. As I read through the book, I pictured them relating almost everything in the book to this study. They praised me for my good choices and scolded me for things I should have done better. Here are a few of those learnings.
Nailed it! ANOVA for the win
For my analysis, I used a statistical technique called Analysis of Variance, or ANOVA. I only knew about that technique because of this Crash Course Statistics ANOVA video. “Yeah,” I thought, “comparing four design variations is like comparing different potato varieties, so I’ll go with that!” I had my doubts but went for it.
When I read Quantifying the User Experience, though, all doubts went away.
There’s a handy-dandy flow chart in Chapter 1 that clearly says if…
- you are comparing data; and
- there are different users in each group; and
- there are three or more groups
…then you should use ANOVA.
That’s exactly what my study was! I could just picture Sauro and Lewis giving me high fives…
However, when I got to Chapter 10, I realized that I had misunderstood the results.
“If the observed value of F for a given alpha, df1, and df2 is greater than the critical value, this only indicates that at least one of the means is significantly different from at least one of the others.” (Chapter 10, Quantifying the User Experience)
In other words, I gave ANOVA four sets of data and ANOVA said “Yes”. I interpreted that to mean that “Yes, the best variant is better than the other three variants by a statistically significant difference.” What it really meant was “Yes, one or more of the variants are better than one or more of the other variants by a statistically significant difference.”
After the ANOVA, I should have done t-tests between the variants to see which differences were statistically significant. Was there one variant that was clearly better than the other three? Maybe there were two good variants and two that weren’t as good. I can’t know for sure unless I test each pair. I’ll do that in future studies that involve ANOVA.
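Those pairwise follow-up tests are straightforward to script. Here is a sketch with invented scores: six independent-samples t-tests, one per pair of variants:

```python
from itertools import combinations

from scipy import stats

# Hypothetical task-accuracy scores per variant (0-100, made up for illustration).
variants = {
    "control":      [88, 92, 79, 85, 90, 84, 91, 87],
    "challenger_a": [84, 89, 82, 86, 88, 83, 90, 85],
    "challenger_b": [90, 93, 85, 88, 91, 86, 92, 89],
    "challenger_c": [83, 87, 80, 84, 86, 82, 88, 85],
}

# Pairwise independent-samples t-tests between every pair of variants.
results = []
for (name_a, a), (name_b, b) in combinations(variants.items(), 2):
    t, p = stats.ttest_ind(a, b)
    results.append((name_a, name_b, p))
    print(f"{name_a} vs {name_b}: t = {t:.2f}, p = {p:.3f}")
```

With four variants there are six pairs, so this is also exactly the situation where the multiple-comparisons correction discussed below becomes important.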
Averaging discrete (Ordinal) measures
In my study, we had a 5-level Likert question. In my analysis I averaged these scores without thinking twice. Turns out, I should have thought twice because averaging multipoint scales is one of the enduring controversies in Chapter 9 of Sauro and Lewis. In the end, the authors sided with me, but not before they pointed out some very valid objections to averaging Ordinal data.
In other words, Coach Sauro and Coach Lewis gave me a stern talking to for not having my head in the game, but then they patted me on the back because I did the right thing.
Combining metrics into single scores
I also did a bunch of work to come up with a single score for each variant. It felt quite arbitrary but I wanted a single number to report for each variant. This was another “Enduring controversy” mentioned by Sauro and Lewis.
The jury is still out in the statistics world, but the authors again sided with my approach. It’s okay to combine metrics. Next time I do this, I’ll use a method like the one described in A method to standardize usability metrics into a single score (Sauro and Kindlund, 2005). That method uses z scores to combine things like completion rates, completion times, and satisfaction into a single score.
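As a rough sketch of the z-score idea (this is a simplification of the Sauro and Kindlund method, and the metric values are made up): standardize each metric, flip the sign of metrics where lower is better, and average:

```python
import numpy as np

# Hypothetical per-variant summary metrics (invented for illustration).
completion_rate = np.array([0.91, 0.88, 0.93, 0.90])  # higher is better
completion_time = np.array([34.0, 38.5, 33.0, 36.0])  # seconds, lower is better
satisfaction    = np.array([4.1, 3.9, 4.3, 4.0])      # 1-5 Likert, higher is better

def z(x):
    """Standardize a metric to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std(ddof=1)

# Flip completion time so a higher z always means "better", then average.
combined = (z(completion_rate) - z(completion_time) + z(satisfaction)) / 3
for i, score in enumerate(combined):
    print(f"Variant {chr(65 + i)}: combined z = {score:+.2f}")
```

Because each z-scored metric has mean zero, the combined scores are relative: a positive score means a variant is above average across the metrics, a negative score below.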
Watch out for multiple comparisons
Testing at 95% confidence means you’re going to get a false positive about 5% of the time. In other words, the more tests you do, the more likely you are to get a false positive. This is illustrated wonderfully in the xkcd comic “Significant,” in which scientists test 20 colors of jelly beans and one spurious “significant” result inevitably pops out.
I didn’t consider this in my study. Sauro and Lewis describe several ways to deal with this, including the Benjamini-Hochberg method, which I plan to use going forward.
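The Benjamini-Hochberg procedure itself is only a few lines: sort the p-values, find the largest rank k such that p ≤ (k/m)·alpha, and reject every hypothesis up to that rank. A sketch with invented p-values:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected by the Benjamini-Hochberg
    procedure, controlling the false discovery rate at alpha."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    return sorted(order[:max_k])

# Hypothetical p-values from six pairwise comparisons (made up).
p_vals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p_vals))  # → [0, 1]
```

Note that 0.039 and 0.041 would pass an uncorrected 0.05 cutoff but fail here; that is the correction doing its job on a batch of six tests.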
Always report confidence intervals!
My coaches made me do 100 push-ups for this one, then run 100 laps. I should have known better.
When I presented results, I included several simple bar charts with one bar per design variant. The heights of the bars were always quite similar. I kept saying “this was barely statistically significant” or “this was almost statistically significant.” What I should have done was include confidence intervals in my results so my audience (and I) could see what was significant at 95% confidence and what wasn’t.
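Computing a t-based confidence interval per variant is cheap, and the half-width is exactly what you would draw as an error bar on each bar of the chart. A sketch with made-up Likert scores:

```python
import math

from scipy import stats

def mean_ci(scores, confidence=0.95):
    """Mean and t-based confidence-interval half-width for one variant."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    # Two-sided t critical value with n-1 degrees of freedom.
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    half_width = t_crit * sd / math.sqrt(n)
    return mean, half_width

# Hypothetical ease-of-use scores for one variant (invented).
scores = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5]
mean, hw = mean_ci(scores)
print(f"mean = {mean:.2f}, 95% CI = [{mean - hw:.2f}, {mean + hw:.2f}]")
```

If two variants’ intervals clearly don’t overlap, the difference is likely significant; overlapping intervals are more ambiguous and still warrant a proper test, but either way the audience can see the uncertainty at a glance.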
Thanks for reading!
This was long and dense, so if you got this far, thank you. If you enjoyed it, I highly recommend reading Quantifying the User Experience. Let me know how it goes!