I’m a former chemical engineer, software tester, programmer, and UX designer who is transitioning into quantitative UX research. I write here monthly. If you’re new to Quant UXR, check out my article How to get started with Quantitative UX Research. If you’re an expert, leave me a comment to let me know all the mistakes I’ve made in this post!
When I read the book Quantifying the User Experience, the following line caught my eye.
Sauro and Kindlund (2005) described methods for converting different usability metrics to z-scores—another way to get different metrics to a common scale (their Single Usability Metric, or SUM).
This was exciting to me because, as I mentioned in Expert Quant UXR advice from Sauro and Lewis, I had recently combined a variety of metrics into a single score. What I didn’t mention in that post is that I used arbitrary linear formulas for each metric.
It was something like this:
- Desirability score = 10 * (Number of positive adjectives chosen – Number of negative adjectives chosen) + 2
- Accuracy score = 30 * (Percentage of questions answered correctly)
Then I added up my four “scores” to get a single score for each variant. Highest score wins, right?
It felt wrong. Probably because it was so arbitrary. I had a specific weighting in mind before starting the study, which was good. But I wasn’t confident that my formulas preserved that weighting, which wasn’t good.
It’s kind of like when your college professor says 90% is an A, 80% is a B, and so on. If there’s an incredibly hard test with an average grade of 30%, is it fair to give an “F” to everyone? Even the genius who got 50%? No, it’s arbitrary. Which is why exams are often graded “on a curve”. In other words, they’re graded using z-scores.
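To make the exam analogy concrete, here is a minimal sketch of grading on a curve. The class average of 30% and the score of 50% come from the example above. The standard deviation of 10 points is an assumption I made up for illustration, and `z_score` is just a helper name I picked.

```python
from statistics import NormalDist


def z_score(value, mean, sd):
    """Number of standard deviations that `value` sits above (+) or below (-) the mean."""
    return (value - mean) / sd


# The class average comes from the exam example; the SD is an assumed value.
class_mean = 30.0
class_sd = 10.0  # assumption, not from any real exam

genius = z_score(50.0, class_mean, class_sd)

# Convert the z-score to a percentile using the standard normal curve.
percentile = NormalDist().cdf(genius)

print(f"z = {genius:.1f}, percentile = {percentile:.1%}")
# prints: z = 2.0, percentile = 97.7%
```

Under these assumed numbers, a raw 50% is two standard deviations above the class average, which looks a lot more like an A than an F.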
Anyway, what I should have done in my study was use z-scores. Which is what the SUM paper describes.
This post is a summary of that paper. I’ll skip over the statistics that I don’t understand, like principal component analysis and scree plots, and focus on the core idea of standardizing then combining different metrics.
SUM, the Single Usability Metric
Sauro and Kindlund used z-scores to combine four metrics of usability into a single usability metric, or SUM. They started with a standard definition of usability:
There is general agreement as to what the dimensions of usability are: effectiveness, efficiency & satisfaction
And they listed some standard metrics for these dimensions:
- Effectiveness: Completion rates and errors
- Efficiency: Completion time
- Satisfaction: Standardized questionnaires
Then they did principal component analysis (which I honestly don’t yet understand) to get weightings for these four metrics, and found that they were all roughly equal.
Because of that, they decided to convert each metric to a z-score then average the z-scores into a single metric.
I believe that we can extend this same method to combine any set of metrics using the steps below.
It’s like grading on a curve, but for usability experiments instead of college exams.
How to combine different metrics into a single score
1) Decide on your metrics
In my earlier study, I wanted to see which variant was the best, so I measured four things:
- accuracy (answering questions correctly)
- completion time
- adjectives (using the Microsoft Desirability Toolkit)
- perceived ease of use (5-point Likert scale)
These are all different measures: percentages, times, word counts, and Likert scores.
In your study, maybe you only have two metrics that you want to combine, or maybe you have five. No problem. Just pick them before running your study.
2) Gather your data
Run your study to get metrics for your variants.
3) Normalize your data
- Decide on a “benchmark”: What is a “Good” score for that metric?
- Subtract the benchmark from the raw value
- Divide by the standard deviation to convert to a z-score
- Convert the z-score to a percentage (its percentile under the normal curve)
- Average your percentages to get a single score
For example, if you have a 5-point Likert scale for usability, the authors recommend using a score of 4/5 as a benchmark. So if your average is 4.2 and your standard deviation is 1.1, then your z-score is (4.2 - 4.0) / 1.1 = 0.18.
Remember how to convert z-scores to percentiles? If not, check out my post Stats in spreadsheets. A z-score of 0.18 represents a percentile of 57%.
In other words, your 4.2 represents a score of 57%. That score is now ready to be combined with your other metrics.
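Here is a rough Python sketch of the steps above, using only the standard library. The Likert numbers (4.2, 4.0, 1.1) are the ones from this example. The function names, the `higher_is_better` flag (for metrics like completion time, where smaller is better), and the completion-time numbers are my own assumptions for illustration, not something taken from the SUM paper.

```python
from statistics import NormalDist, mean


def metric_to_percent(observed_mean, benchmark, sd, higher_is_better=True):
    """Standardize one metric against its benchmark and return a 0-1 score."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    # Subtract the benchmark, then divide by the standard deviation.
    z = (observed_mean - benchmark) / sd
    # Assumption: for "lower is better" metrics, flip the sign so that
    # beating the benchmark still produces a score above 50%.
    if not higher_is_better:
        z = -z
    # Convert the z-score to a percentile on the standard normal curve.
    return NormalDist().cdf(z)


def combined_score(percentages):
    """Average the per-metric percentages with equal weighting."""
    return mean(percentages)


# Likert example from the text: mean 4.2, benchmark 4.0, SD 1.1.
ease = metric_to_percent(4.2, 4.0, 1.1)
print(f"ease of use: {ease:.0%}")  # prints: ease of use: 57%

# Hypothetical completion time: mean 95 s against a 90 s benchmark, SD 20 s.
time_score = metric_to_percent(95, 90, 20, higher_is_better=False)

overall = combined_score([ease, time_score])
print(f"combined: {overall:.0%}")
```

Equal weighting is the simplest choice. If you decided on a specific weighting before the study, a weighted average would be the place to apply it.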
Completion rates: a special case?
The study authors skipped the “subtract the benchmark” step for completion rates. As described in The Single Usability Metric (SUM) — A completion rate conundrum by Carl J. Pearson, this omission is surprising because it makes SUM inconsistent: every other component follows the easy-to-understand convention of “>50% is good”, whereas completion rates in SUM are more like “>78% is good”.
So no, completion rates shouldn’t be treated as a special case. Standardize them just like all the other metrics by subtracting the 78% benchmark.
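As a sketch of what that could look like, the snippet below treats each task attempt as pass (1) or fail (0). The 78% benchmark is the one mentioned above. Using sqrt(p * (1 - p)) as the standard deviation of pass/fail data, the handling of the all-pass and all-fail cases, and the function name are my own assumptions. I haven't checked them against the original paper or Pearson's article.

```python
from math import sqrt
from statistics import NormalDist


def completion_rate_to_percent(successes, attempts, benchmark=0.78):
    """Standardize a completion rate against a benchmark and return a 0-1 score."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = successes / attempts
    # Assumption: standard deviation of 0/1 (pass/fail) data.
    sd = sqrt(p * (1 - p))
    if sd == 0:
        # Everyone passed or everyone failed, so there is no spread to
        # divide by. Assumption: fall back to the best or worst score.
        return 1.0 if p >= benchmark else 0.0
    z = (p - benchmark) / sd
    return NormalDist().cdf(z)


# Hypothetical study: 9 of 10 participants completed the task.
score = completion_rate_to_percent(9, 10)
print(f"completion: {score:.1%}")
```

With this approach, a completion rate exactly at the 78% benchmark scores 50%, which matches the “>50% is good” convention of the other metrics.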
Handle with care
The original SUM paper did some careful analysis to ensure that their four metrics could be combined with equal weighting without skewing the results. Your results may vary if you skip the careful analysis step.
However, even if you do skip that step, combining metrics using z-scores can’t be worse than combining using arbitrary formulas like I originally did! I’ll definitely use this technique in future studies.