While a lack of recommending seems like it would correlate with actively recommending against, we suspect there may be a difference but are unsure how much.
In our previous article, we looked to the published literature and found little research that assessed how a low likelihood-to-recommend (the Net Promoter Score item) measures active discouragement.
The studies we found looked at positive word of mouth (PWOM) and negative word of mouth (NWOM). Their key findings were:
Despite the bipolar scale’s superior prediction of NWOM, one of the desirable features of the NPS is its public benchmarks. Good benchmarks are not developed overnight, and in most cases, the process takes years (and even then, there is no guarantee of broad adoption).
Rather than replacing the NPS item in UX and CX research, we were curious about the statistical relationship between the standard likelihood-to-recommend item and a separate item designed to measure the likelihood of recommending against—in other words, the likelihood of discouraging friends and colleagues from engaging with a brand or product.
We conduct periodic SUPR-Q® surveys to take the temperature of the user experience of websites and mobile apps for key companies in various sectors. In August 2024, we collected data from 324 participants on their experience with one of the social media platforms they had used in the past year (Facebook, Instagram, LinkedIn, Snapchat, TikTok, or X).
As part of that survey, respondents indicated their likelihood to recommend that platform on the web and/or its mobile app (depending on their past experience) with a standard item format (Figure 1) and their likelihood to discourage others from using the platform in general with a custom item (Figure 2).
We used several methods to gain insight into the relationship between ratings of likelihood-to-recommend and likelihood-to-discourage.
Table 1 shows the correlations among the measurements of likelihood-to-recommend the websites in the social media survey (LTRWeb), likelihood-to-recommend the social media mobile apps (LTRApp), and likelihood-to-discourage others from using the social media platforms (LTDiscourage).
Metric 1 | Metric 2 | Correlation | R² | n
---|---|---|---|---
LTRWeb | LTRApp | 0.70 | 49% | 162 |
LTRWeb | LTDiscourage | −0.52 | 27% | 188 |
LTRApp | LTDiscourage | −0.57 | 32% | 298 |
The coefficient of determination (R²) is the square of the correlation. It can be interpreted in different ways: for example, as the percentage of variance shared between two variables, or as the extent to which variation in one variable accounts for variation in the other.
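For example, the correlation and its square can be computed directly. The ratings below are hypothetical, not our survey data:

```python
import numpy as np

# Hypothetical paired ratings (both on 0-10 scales); the survey data are not public.
ltr_app = np.array([9, 10, 7, 3, 8, 2, 10, 5, 6, 9])      # likelihood to recommend
lt_discourage = np.array([1, 0, 4, 8, 2, 9, 0, 6, 5, 2])  # likelihood to discourage

r = np.corrcoef(ltr_app, lt_discourage)[0, 1]  # Pearson correlation
r_squared = r ** 2                             # coefficient of determination

print(f"r = {r:.2f}, R^2 = {r_squared:.0%}")
```

A negative correlation with a large R² means high recommenders tend to be low discouragers, with R² quantifying how much of the variation in one rating the other accounts for.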
All three correlations were statistically significant (p < .0001). LTRWeb accounted for just over a quarter of the variation in LTDiscourage; LTRApp accounted for just under a third. Although these are very different metrics from the percentage of NWOM, the magnitudes are similar to the 25% of NWOM against currently used products reported in previous research.
Guidelines differ on how high a correlation must be to indicate that two variables are measuring the same thing. Some suggest the appropriate value is ±0.90 (81% shared variance), and others recommend ±0.80 (64% shared variance). To support the claim that two variables aren’t just strongly related but essentially measuring the same thing, the stringent benchmark of ±0.90 or even ±0.95 (90% shared variance) seems reasonable. For example, the ten-item System Usability Scale (SUS) and its single ease of use item correlate at .95, with other research demonstrating that they measure the same underlying construct of perceived ease of use. Although significant, the correlations between LTR and LTDiscourage in Table 1 are far from indicating measurement of the same construct.
Scatterplots are visual representations of correlations. Figures 3 and 4 show the scatterplots between LTR and LTDiscourage for websites and mobile apps. The bounding boxes show the NPS designations of Detractors (LTR ratings from 0 to 6), Passives (LTR ratings from 7 to 8), and Promoters (LTR ratings from 9 to 10).
Figure 3: LTRWeb and LTDiscourage scatterplot (jittered).
The magnitudes of the correlations depicted in Figures 3 and 4 are similar, as are the distributions of points in the scatterplots. The general trend, consistent with negative correlations, is that as likelihood-to-recommend increases, likelihood-to-discourage decreases. But as previously discussed, the correlations are not perfect (or even very high), so the two items are not measuring exactly the same thing.
For Figures 5 and 6, we assigned the two lowest LTDiscourage ratings to one category of extreme intensity (Extremely Low Likelihood-to-Discourage), the two highest ratings to another category of extreme intensity (Extremely High Likelihood-to-Discourage), and the intermediate ratings to a category of moderate intensity (Moderate Likelihood-to-Discourage). We then crossed those categories with the NPS categories of Detractors, Passives, and Promoters.
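This categorization can be sketched as follows. The ratings are hypothetical; the thresholds mirror the standard NPS designations and the intensity buckets described above:

```python
import pandas as pd

# Hypothetical ratings (both on 0-10 scales); the survey data are not public.
df = pd.DataFrame({
    "ltr":        [10, 9, 8, 7, 6, 3, 0, 10, 2, 5],  # likelihood to recommend
    "discourage": [0, 1, 3, 5, 8, 9, 10, 2, 10, 6],  # likelihood to discourage
})

def nps_category(rating):
    # Standard NPS designations: 9-10 Promoter, 7-8 Passive, 0-6 Detractor.
    if rating >= 9:
        return "Promoter"
    if rating >= 7:
        return "Passive"
    return "Detractor"

def discourage_intensity(rating):
    # Two lowest ratings, two highest ratings, everything else moderate.
    if rating <= 1:
        return "Extremely Low"
    if rating >= 9:
        return "Extremely High"
    return "Moderate"

df["nps_cat"] = df["ltr"].apply(nps_category)
df["intensity"] = df["discourage"].apply(discourage_intensity)
print(pd.crosstab(df["nps_cat"], df["intensity"]))
```

The resulting crosstab is the kind of NPS-category-by-intensity breakdown summarized in Figures 5 and 6.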
Again, the results were very similar for web and mobile app ratings. Promoters were much more likely than Detractors (or even Passives) to have an extremely low likelihood-to-discourage. On the other side of the scale, Detractors were responsible for the vast majority (82%) of the extremely high likelihood-to-discourage ratings (and accounted for 64% of the more moderate ratings). This result is consistent with our previous finding that, when given the opportunity, some respondents classified as Detractors did not make negative comments about the brand they rated, but 90% of the negative comments captured in the study were made by Detractors.
An analysis of the likelihood of 324 social media users to recommend and their likelihood to discourage the use of social media platforms found:
Likelihood-to-recommend measures likelihood-to-discourage, but not perfectly. Likelihood-to-recommend accounts for about a quarter to a third of the variation in likelihood-to-discourage. That is significant, but it leaves about two-thirds to three-fourths of variation in likelihood-to-discourage unaccounted for. This suggests that NOT recommending is not a perfect or even a strong substitute for measuring intent to recommend against or discourage others from a brand.
Detractors account for 80%+ of discouragers. Not all Detractors discourage. But almost all those who are extremely likely to discourage are Detractors (very unlikely to recommend). This is similar to our analysis of negative comments where not all Detractors make negative comments, but 90% of negative comments come from Detractors.
NPS Promoters are more likely than Detractors to have low ratings of likelihood-to-discourage. Across ratings for web and mobile app usage, most very low discouragement ratings are from Promoters (42–47%) with the remainder split between Passives (27–29%) and Detractors (27–29%).
NPS Detractors are much more likely than Promoters to have high ratings of likelihood-to-discourage. Across ratings for web and mobile app usage, the vast majority of high discouragement ratings are from Detractors (82%).
Discouragement might not be exactly the same as recommending against. We used a discouragement scale to measure this behavioral intention because it includes more active wording than “recommending against,” and we found it easier to ask. But there could be a difference (likely small) between responses to measures using “discouragement” wording versus the “recommending against” wording used in some previous research.
Bottom line: If researchers can get ratings of only one behavioral intention in contexts where recommendation is a plausible user behavior, it should be likelihood-to-recommend. For a clearer picture of the full range of behavioral intention, there appears to be value in also collecting ratings of likelihood-to-discourage.
Clutter can lead to a poor user experience. Poor experiences repel users.
So how does one measure clutter?
Earlier, we did a deep dive into the literature to see how clutter has been defined and measured. We found that the everyday concept of clutter has two components: a disorganized collection of relevant objects and the presence of irrelevant ones.
But most measures of clutter were objective, based on attributes such as grouping and layout complexity. The only questionnaire we found was developed for airplane cockpit displays, which didn’t seem relevant to website clutter.
So, we began to build our own questionnaire for measuring website clutter.
In this article, we briefly review the exploratory research we conducted and then analyze new data to validate what we found using a statistical technique called confirmatory factor analysis.
In the first iteration of that exploratory research, we started with a preliminary clutter questionnaire that measured two aspects of clutter based on the literature: content clutter (e.g., irrelevant ads and videos) and design clutter (e.g., too much text, illogical layout).
The first iteration of the Perceived Website Clutter Questionnaire (PWCQ) included one item for overall clutter, six for content clutter, and ten for design clutter (see Figure 1 for the entire questionnaire used in our surveys).
The format for overall clutter was an 11-point agreement item (“Overall, I thought the website was too cluttered,” 0: Strongly disagree, 10: Strongly agree). The format for content and design clutter used five-point agreement items (1: Strongly disagree, 5: Strongly agree). The short labels and item wording for the content and design clutter items were:
After applying the exploratory techniques of parallel analysis, factor analysis, item analysis, and item retention, the revised version of the PWCQ had two items for content clutter (Content_ALot, Content_Space) and three for design clutter (Design_UnpleasantLayout, Design_TooMuchText, and Design_VisualNoise). Using regression analysis, these five items accounted for 45% of the variation in the one-item measure of overall clutter (highly significant) with excellent scale reliabilities (ranging from .88 to .91 overall and for the two subscales).
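Scale reliabilities like those reported above are typically computed as Cronbach’s coefficient alpha (we assume that method here). A minimal sketch, with made-up 1–5 ratings for the three design-clutter items:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_respondents x k_items) array of ratings."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of individual item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical 1-5 agreement ratings for three design-clutter items.
design = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [1, 2, 1], [3, 3, 4], [4, 4, 5]]
print(f"alpha = {cronbach_alpha(design):.2f}")
```

When items move together across respondents (as in this toy data), alpha approaches 1; values near .9, like those we observed, indicate excellent internal consistency.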
When developing a standardized questionnaire, however, exploratory research is just the first step. To have confidence in the questionnaire, it’s important to follow exploratory research with confirmatory research.
We used three approaches to validate the clutter questionnaire: confirmatory factor analyses, sensitivity analyses, and range analyses. The data for these analyses came from eight retrospective SUPR-Q® consumer surveys conducted between April 2022 and January 2023. Each survey targeted a specific sector, and, in total, we collected 2,761 responses to questions about the UX of 57 websites. The sample had roughly equal representation of gender and age (split at 35 years old). Table 1 shows the participant gender and age for each survey, with sector names linking to articles with more information about each survey (including the websites selected for the sectors). Participants were members of an online consumer panel, all from the United States.
Sector | n | Date | Websites | Female (%) | Male (%) | Under 35 (%) | 35 or older (%)
---|---|---|---|---|---|---|---
Real Estate | 269 | Apr-2022 | 5 | 48 | 51 | 48 | 52 |
Travel Aggregator | 452 | Apr-2022 | 9 | 48 | 51 | 48 | 52 |
Business Info | 183 | Jul-2022 | 3 | 46 | 53 | 42 | 58 |
Domestic Air | 350 | May-2022 | 7 | 48 | 49 | 58 | 42 |
International Air | 200 | May-2022 | 5 | 53 | 46 | 61 | 39 |
Ticketing | 234 | Jun-2022 | 5 | 45 | 52 | 40 | 60 |
Clothing | 550 | Dec-2022 | 13 | 52 | 45 | 48 | 52 |
Wireless | 523 | Jan-2023 | 10 | 47 | 50 | 40 | 60 |
Overall | 2,761 | – | 57 | 49 | 49 | 48 | 52 |
Some survey content differed according to the nature of the sector being investigated, but all surveys included the SUPR-Q, basic demographic items, and the first iteration of the perceived clutter questionnaire. For each survey, we conducted screeners to identify respondents who had used one or more of the target websites within the past year, then invited those respondents to rate one website with which they had prior experience. On average, respondents completed the surveys in 10–15 minutes (there was no time limit).
To support independent exploratory and confirmatory analysis, we split the sample into two datasets by assigning every other respondent to an exploratory (n = 1,381) or confirmatory (n = 1,380) sample by sector and website in the order in which respondents completed the surveys. These sample sizes ensured that we far exceeded the recommended minimum sample sizes for exploratory and confirmatory factor analysis.
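A sketch of this alternating split, with hypothetical rows standing in for respondents in completion order:

```python
import pandas as pd

# Hypothetical responses; each row is one completed survey, in completion order.
responses = pd.DataFrame({
    "sector":  ["Clothing"] * 4 + ["Wireless"] * 4,
    "website": ["A", "A", "B", "B", "C", "C", "C", "C"],
})

# Alternate respondents into exploratory/confirmatory within each sector x website
# group, preserving the order in which they completed the survey.
responses["split"] = (
    responses.groupby(["sector", "website"]).cumcount() % 2
).map({0: "exploratory", 1: "confirmatory"})

print(responses)
```

Alternating within sector and website keeps the two halves balanced on both variables, so differences between exploratory and confirmatory results can’t be attributed to sampling different sectors or sites.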
Figure 2 shows the item loadings for a confirmatory factor analysis (CFA) assuming no structure in the items (i.e., a one-factor model, left panel) and the same items in a two-factor model (Content and Design, right panel).
There are many ways to assess the quality of CFA. Following the recommendations of Jackson et al. (2009), we focused on Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), and Bayesian Information Criterion (BIC). There are guidelines for good levels of model fit for CFI (> 0.90) and RMSEA (< 0.08), but not for BIC, which is used for the relative comparison of models (smaller is better).
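Common formulations of CFI and RMSEA can be computed from the fitted model’s chi-square and the baseline (independence) model’s chi-square. This is a sketch of those standard formulas with made-up chi-square values, not a reproduction of our analyses:

```python
import math

def cfi(chisq_m, df_m, chisq_b, df_b):
    """Comparative Fit Index from model (m) and baseline (b) chi-square fits."""
    d_m = max(chisq_m - df_m, 0)       # model noncentrality
    d_b = max(chisq_b - df_b, d_m)     # baseline noncentrality (at least d_m)
    return 1.0 if d_b == 0 else 1 - d_m / d_b

def rmsea(chisq_m, df_m, n):
    """Root Mean Square Error of Approximation (one common formulation)."""
    return math.sqrt(max(chisq_m - df_m, 0) / (df_m * (n - 1)))

# Hypothetical chi-square values, not the actual fits reported in this article.
print(f"CFI   = {cfi(chisq_m=30.0, df_m=4, chisq_b=900.0, df_b=10):.3f}")
print(f"RMSEA = {rmsea(chisq_m=30.0, df_m=4, n=1380):.3f}")
```

Both indices shrink the model chi-square by its degrees of freedom, so a model that fits no better than chance is not rewarded for complexity; CFI rises toward 1 and RMSEA falls toward 0 as fit improves.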
For the one-factor model, the CFI was 0.74, RMSEA was 0.20, and BIC was 6,166. For the two-factor model, the CFI was 0.92, RMSEA was 0.11, and BIC was 2,144. Thus, accounting for the Content/Design two-factor structure led to better fit statistics including an acceptable level of CFI, but RMSEA was greater than 0.08.
Figure 3 shows the CFA for the five items retained during the exploratory analyses. The fit statistics for this model were excellent, with a CFI of 0.997, RMSEA of 0.047, and BIC of 96. This CFA model confirmed the construct validity of the two-factor structure identified in the exploratory analyses. For the final version of the PWCQ, see Figure 4.
Using the full dataset (n = 2,761), we conducted ANOVAs to check the sensitivity of the three clutter metrics (the significance of the main effect of website). All three metrics were statistically significant.
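A sensitivity check of this kind can be sketched as a one-way ANOVA with website as the factor. The ratings below are simulated, not our survey data:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Simulated Overall Clutter ratings (0-10) for three hypothetical websites
# with different true clutter levels.
site_a = rng.normal(3.0, 1.5, 50).clip(0, 10)
site_b = rng.normal(4.5, 1.5, 50).clip(0, 10)
site_c = rng.normal(2.0, 1.5, 50).clip(0, 10)

f_stat, p_value = f_oneway(site_a, site_b, site_c)
print(f"F = {f_stat:.1f}, p = {p_value:.2g}")
```

A significant main effect of website indicates the metric can distinguish among websites, which is the sensitivity property a standardized questionnaire needs.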
We assessed the range of these metrics across websites (after rescaling to a common 0–100-point scale for ease of comparison) to get a sense of the extent to which the dataset included websites with different levels of clutter. The distributions are shown in Figure 5 and summarized in Table 2.
Design Clutter scores tended to run lower than Content Clutter scores, with a ten-point difference in medians (50th percentiles). For Content Clutter and Design Clutter, the range of scores was slightly more than half of the possible range of the metric. The range for Overall Clutter was a little more restricted, covering about 40% of the possible range of the metric. The 5th–95th percentiles for the metrics were from 20 to 51 for Content Clutter, 12 to 41 for Design Clutter, and 20 to 45 for Overall Clutter. None of the websites had a mean score on these metrics higher than 65.
Clutter Metric | Min | 5th | 10th | 25th | 50th | 75th | 90th | 95th | Max | Range |
---|---|---|---|---|---|---|---|---|---|---|
Content | 11 | 20 | 21 | 26 | 33 | 37 | 44 | 51 | 62 | 51 |
Design | 9 | 12 | 16 | 20 | 23 | 29 | 36 | 41 | 65 | 56 |
Overall | 11 | 20 | 23 | 29 | 32 | 38 | 44 | 45 | 50 | 39 |
Table 2: Summary of distributions for Content Clutter, Design Clutter, and Overall Clutter after conversion to a 0–100-point scale.
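We assume the rescaling to a common 0–100-point scale is a linear interpolation from each metric’s native scale (1–5 for the content and design items, 0–10 for overall clutter). A minimal sketch:

```python
def rescale(value, low, high):
    """Linearly rescale a rating from its native [low, high] scale to 0-100."""
    return (value - low) / (high - low) * 100

# Content/design items used 1-5 agreement; overall clutter used 0-10.
print(rescale(3.0, 1, 5))   # midpoint of a 1-5 item -> 50.0
print(rescale(6.5, 0, 10))  # 0-10 overall item -> 65.0
```

Putting all three metrics on the same 0–100 footing is what makes the percentile comparisons in Table 2 meaningful.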
For the eight surveys we conducted, our focus was on gathering information about top websites in their sectors, so we did not deliberately include websites with unusually high levels of clutter. It is possible that including very cluttered websites would have led to different analytical solutions. That said, our exploratory and confirmatory analyses are appropriate for the types of websites we typically study in our consumer surveys, and because we saw no evidence of ceiling or floor effects with these clutter metrics, they may also work well when assessing very cluttered websites.
Confirmatory analysis of over 1,000 ratings of the perceived clutter of 57 websites found:
Confirmatory factor analysis of the five-item version of the PWCQ indicated excellent fit. CFA of the five-item version of the clutter questionnaire had excellent fit statistics (CFI = 0.997, RMSEA = 0.047, BIC = 96), better than a similar two-factor CFA of the 16-item version (CFI = 0.92, RMSEA = 0.11, BIC = 2,144).
Clutter questionnaire scores varied across websites but with possible range restrictions. The sensitivity analyses of Content Clutter, Design Clutter, and Overall Clutter showed significant variation in the means of these metrics by website. However, after rescaling to 0–100-point scales, no website had a clutter score greater than 65; observed scores covered about half the possible range for Content Clutter and Design Clutter and about 40% of the range for Overall Clutter.
Bottom line: We expect UX researchers and practitioners to be able to use this version of the clutter questionnaire when the research context is similar to the websites we studied in our consumer surveys. We don’t anticipate serious barriers to using the clutter questionnaire in other similar contexts including task-based studies, mobile apps, and very cluttered web/mobile UIs, but because that research has not yet been conducted, UX researchers and practitioners should exercise due caution.
For more details about this research, see the paper we published in the International Journal of Human-Computer Interaction (Lewis & Sauro, 2024).
Is dissatisfied the opposite of satisfied?
Is discourage the opposite of recommend?
And is not recommending the same as recommending against?
When computing the Net Promoter Score (NPS), people who rate the 0–10-point likelihood-to-recommend item high (a 9 or 10) are categorized as promoters, and those who give low ratings (6 or lower) are described as detractors.
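As a sketch, the NPS computation from these designations (using hypothetical ratings):

```python
def net_promoter_score(ratings):
    """NPS = % promoters (9-10) minus % detractors (0-6), from 0-10 ratings."""
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return 100 * (promoters - detractors) / len(ratings)

# Hypothetical ratings: 5 promoters, 2 passives, 3 detractors.
print(net_promoter_score([10, 9, 9, 9, 8, 7, 6, 5, 10, 0]))  # -> 20.0
```

Note that passives (7–8) drop out of the numerator entirely; only the extremes of the scale move the score.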
The term detractor suggests (intentionally) that these customers play a role in spreading negative word of mouth by discouraging other people from making purchases. Hearing bad things (negative word of mouth) may have as large an impact on company growth as positive word of mouth, if not larger.
But is being less likely to recommend or even saying negative things about a brand or product the same as actively discouraging others from purchasing or using?
We are generally a little skeptical of bold claims and like to see the data or look to replicate. In earlier research, we examined the claim that detractors account for 80% of negative word of mouth. We found most (90%) of the negative comments in our study were, indeed, associated with detractors.
However, as noted in that analysis, it’s important to differentiate between the majority of detractors saying negative things and the majority of negative comments coming from detractors. Not all detractors say bad things. In fact, in our analysis (and in others), some positive comments come from detractors and some negative comments come from promoters. This doesn’t negate the value of knowing where more negative comments are likely to come from.
But making a negative comment isn’t necessarily the same as actively discouraging others from engaging with a brand (regardless of where the person was on the likelihood to recommend scale, high or low). For example, here’s a negative comment about a bad experience on United Airlines from our earlier analysis of detractors:
“I took a flight back from Boston to San Francisco 2 weeks ago on United. It was so terrible. My seat was tiny and the flight attendants were rude. It also took forever to board and deboard.”
While hardly an endorsement of the airline, does this explicitly discourage others from using United?
Comments that suggest actual discouragement (for example, “I would tell my friends to stay away!”) are less common in our data.
In this article, we review data in the relevant literature to understand whether the Net Promoter Score adequately measures discouragement.
We identified five relevant papers, four from the same group of authors led by Robert East. All the papers investigated the frequency of word-of-mouth (WOM) behaviors that are usually defined as statements that advise others to make purchases (positive word-of-mouth, PWOM) or to avoid purchases (negative word-of-mouth, NWOM) from a company or brand.
In The Relative Incidence of Positive and Negative Word of Mouth (2007), East, Hammond, and Wright analyzed 15 categories (e.g., cars, ISPs, and dentists) from a few thousand participants (on average, 153 respondents per brand in the UK between 2001 and 2003). Participants were asked how many times they either recommended or advised against using a brand.
They reported that 78% of PWOM was directed at the participant’s main/current brand, whereas 77% of NWOM was directed toward a former brand. Thus, as a rough approximation, three-fourths of PWOM and one-fourth of NWOM are about the main brand.
After the publication of the 2007 paper, East presented that research at the 2008 ANZMAC conference (Measurement Deficiencies in the Net Promoter Score), augmented with some additional data. With the additional data, their estimates of PWOM and NWOM for the current brand changed slightly to 72% and 25%—still the same approximately three-fourths to one-fourth ratio. He also provided some early results from a study that was published later in 2008, discussed next.
In Measuring the Impact of Positive and Negative Word of Mouth on Brand Purchase Probability (East, Hammond, & Lomax, 2008), the authors conducted 11 new in-person surveys in the UK between 2005 and 2007 with 2,544 respondents answering questions about one or two categories of purchases (e.g., cell phones, credit cards, restaurants). Respondents were asked whether they had received positive or negative advice for any of the brands listed, whether the advice was positive or negative, and whether it affected their brand choice. Their survey used the 11-point Juster Scale to assess purchase intention before and after the WOM.
Technical note: Like the standard item for likelihood to recommend, the Juster scale has eleven response options from 0 to 10. Unlike the typical unlabeled response options of likelihood to recommend (except for its endpoints), the Juster scale is fully labeled with probability statements (e.g., 0: No chance, almost no chance [1 in 100]; 1: Very slight possibility [1 in 10]; … 9: Almost sure [9 in 10]; 10: Certain, practically certain [99 in 100]). Because likelihood to recommend and Juster scale formats differ in ways that tend to be more cosmetic than substantial (e.g., full versus endpoint labeling of response options), we suspect the format differences likely have little effect on respondent behavior (but we have not yet tested this).
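As a sketch of how Juster-style responses translate into probabilities, and how a mean shift in purchase probability could be computed from them. The probability mapping approximates the published scale labels, and the before/after ratings are hypothetical:

```python
# Approximate probabilities for the Juster scale's fully labeled options
# (0: "no chance, 1 in 100" ... 10: "certain, 99 in 100"); intermediate
# labels are assumed to step by tenths.
JUSTER_PROB = {0: 0.01, 10: 0.99, **{k: k / 10 for k in range(1, 10)}}

def mean_shift(before, after):
    """Mean shift in stated purchase probability after receiving WOM."""
    shifts = [JUSTER_PROB[post] - JUSTER_PROB[pre]
              for pre, post in zip(before, after)]
    return sum(shifts) / len(shifts)

# Hypothetical before/after ratings for respondents who received PWOM.
print(round(mean_shift(before=[3, 5, 6, 2], after=[5, 7, 8, 4]), 2))  # -> 0.2
```

Expressing ratings as probabilities is what lets shifts like the reported +0.20 (PWOM) and −0.11 (NWOM) be read directly as changes in purchase likelihood.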
Overall, 64% claimed that PWOM and 48% claimed that NWOM affected their decisions. When impact was measured as the shift in purchase probability, PWOM produced a mean shift of 0.20 and NWOM produced a shift of −0.11. This indicates that PWOM is more influential than NWOM, but NWOM is itself influential and should be measured. Because most NWOM against a brand is produced by consumers who are not current users of a brand, the authors concluded that a major weakness of the typical measurement of NPS is that companies only ask it of their current customers.
In a follow-up paper published three years later (The NPS and the ACSI: A Critique and Alternative Metric), East, Romaniuk, and Lomax (2011) argued that neither the NPS nor the American Customer Satisfaction Index (ACSI) adequately measure NWOM because ex-customers and never-customers aren’t sampled in their methodologies. They further criticized the NPS based on reasoning that the intention to recommend is likely to have less influence on future purchase behaviors than the memory of having received a recommendation in favor of or against a brand.
The authors delivered surveys to homes in the UK between 2007 and 2008 and received 2,254 usable responses. Respondents reported whether they had given or received positive and negative advice across several categories including grocery stores, banks, luxury brands, and cell phones. Respondents completed Juster scales to measure behavioral likelihoods and a version of the three-item ACSI to measure satisfaction.
The authors totaled the negative advice reported given and then determined what percent came from detractors on their NPS-like Juster scale and ACSI-like scale (see Table 1).
Table 1: Percentage of total NWOM coming from detractors (by the NPS-like Juster item and the ACSI-like item) for the Main Supermarket, Main Coffee Shop, and Skin Care Products categories.
Similar to the authors’ earlier published findings, NPS detractors accounted for a minority of the total NWOM across used and unused brands (31%). The authors also showed the NPS correlated highly with the ACSI (which we also found in our earlier analysis of CSAT and NPS), with the ACSI accounting for a similar amount of NWOM (28%). This suggests the inability to fully account for NWOM has less to do with the measure used and more to do with who is measured (the sampling strategy).
In an unpublished manuscript (Measuring Customer Satisfaction and Loyalty: Improving the ‘Net-Promoter’ Score) by a different set of researchers (Schneider, Berent, Thomas, & Krosnick, 2008), the authors conducted two studies in which they manipulated rating scale labels for likelihood-to-recommend items. Although it was not the focus of their research, they did measure the association between the Net Promoter Score (standard likelihood-to-recommend item) and stated past positive and negative recommendations.
As part of their research, Schneider et al. asked 4,883 respondents questions about eight brands (automotive manufacturers and airlines), also asking whether they were familiar with the brands and whether they were customers. This research was highly exploratory, including over 150 regression analyses with varying outcomes, making it difficult to construct a comprehensive narrative that satisfactorily accounts for all the results.
Despite this, we were especially intrigued by the comparison of the standard unipolar likelihood-to-recommend scale with a 7-point bipolar version designed to allow respondents to indicate the extent to which they recommended for or recommended against purchasing from a brand (see Figure 1).
The key regression results for these two items appear in Table 2. Before including variables in regression analyses, Schneider et al. standardized all values to a 0–1 scale, where 0 was the lowest possible value for a rating scale and 1 was the highest, permitting interpretation of the regression weights (the cells in Table 2) as measures of the strength of each regression model.
Table 2: Standardized regression weights for the unipolar and bipolar likelihood-to-recommend items predicting PWOM and NWOM (for all respondents and for customers only).
The table shows the results for regressions modeling the predictive strength of each item (unipolar and bipolar) for PWOM (number of positive recommendations and number of people who made at least one positive recommendation for all respondents and for respondents who were brand customers) and NWOM (number of negative recommendations and number of people who made at least one negative recommendation for all respondents and for respondents who were brand customers). The standard unipolar scale was as good as or better than the bipolar scale when modeling PWOM, especially for the Customers Only condition. In contrast, the bipolar scale was as good as or better than the unipolar scale when modeling NWOM, especially for the All Respondents condition.
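The 0–1 standardization described above can be sketched as follows. The ratings and NWOM counts are hypothetical, and we use a simple least-squares fit rather than the authors’ exact models:

```python
import numpy as np

def to_unit_interval(x, scale_min, scale_max):
    """Rescale ratings so the scale minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    return (x - scale_min) / (scale_max - scale_min)

# Hypothetical data: 0-10 likelihood-to-recommend predicting counts of
# negative recommendations (capped at 3 for this toy example).
ltr = to_unit_interval([0, 2, 4, 5, 7, 9, 10], 0, 10)
nwom = to_unit_interval([3, 3, 2, 2, 1, 0, 0], 0, 3)

slope, intercept = np.polyfit(ltr, nwom, 1)  # regression weight on the 0-1 scale
print(f"weight = {slope:.2f}")
```

Because both variables live on a 0–1 scale, the fitted weight is comparable across models regardless of each item’s original response range, which is what makes the cells of Table 2 comparable.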
The findings of Schneider et al. most relevant to this review are that the standard unipolar item was the better predictor of PWOM, while the bipolar item was the better predictor of NWOM.
Our review of one unpublished and four published papers about the NPS’s ability to measure recommending against a brand (i.e., NWOM, discouragement) found:
NWOM is worth measuring. PWOM appears to be more prevalent and influential than NWOM, but NWOM is itself influential and should be measured.
Surveying only existing customers is problematic for assessing NWOM. Asking only existing customers will almost surely understate the percentage of people likely to discourage or recommend against. Proper measurement of the state of a brand requires surveying customers and noncustomers.
NPS is not necessarily the best measure of recommending against. Consistent with being based on a unipolar measure of likelihood to recommend that is a strong predictor of PWOM, NPS appears to properly measure encouragement/recommendation for. It also appears to significantly measure NWOM, but it may not be the best predictor.
A bipolar scale may better predict NWOM but at a loss of benchmarks. Good benchmarks are not developed overnight, and in most cases the process takes years (and even then, there is no guarantee of broad adoption). Despite the bipolar scale’s superior prediction of NWOM, one of the desirable features of the NPS is its published benchmarks.
Future research: How much might researchers gain by asking a discouragement question in addition to the standard likelihood to recommend? We’ll explore this in an upcoming study.
East, R., Hammond, K., & Wright, M. (2007). The relative incidence of positive and negative word of mouth. International Journal of Research in Marketing, 24, 175–184.
East, R. (2008). Measurement deficiencies in the Net Promoter Score. Sydney, Australia: ANZMAC.
East, R., Hammond, K., & Lomax, W. (2008). Measuring the impact of positive and negative word of mouth on brand purchase probability. International Journal of Research in Marketing, 25, 215–224.
Schneider, D., Berent, M., Thomas, R., & Krosnick, J. (2008). Measuring customer satisfaction and loyalty: Improving the ‘Net-Promoter’ score. Unpublished manuscript.
East, R., Romaniuk, J., & Lomax, W. (2011). The NPS and the ACSI: A critique and an alternative metric. International Journal of Market Research, 53(3), 327–346.
In a previous article, we described our search for a measure of perceived clutter in academic literature and web posts, but our thirst went unquenched.
We found that the everyday conception of clutter includes two components that suggest different decluttering strategies: the extent to which needed objects (e.g., tools in a toolbox) are disorganized and/or the presence of unnecessary objects (e.g., a candy wrapper in a toolbox). The first situation requires reorganizing the needed objects, while the second requires discarding unnecessary objects.
The literature in UI design has mostly focused on objectively measuring information displayed on screens (e.g., local density, grouping, feature congestion). We found a published questionnaire for subjective clutter in advanced cockpit displays, but we did not find any standardized questionnaires developed for the measurement of perceived clutter on websites.
So, we decided to develop our own.
The development process for a standardized questionnaire has two major research activities: exploratory and confirmatory. In this article, we focus on the exploratory research.
Consistent with the literature we reviewed, we hypothesized that at least two factors might contribute to the perceived clutter of websites: content clutter and design clutter.
We expected content clutter to be driven by the presence of irrelevant ads and videos that occupy a considerable percentage of display space and have negative emotional consequences (e.g., they’re annoying). Considering the components of the everyday conception of clutter, these would be the candy wrappers in the toolbox—items that website users would prefer to discard, perhaps by using ad blockers.
Our conception of design clutter is that it is driven by issues with the presentation of potentially relevant content that make it difficult to consume (e.g., insufficient white space, too much text, illogical layout). Analogous to the everyday definition of clutter, this content is similar to a hammer in the toolbox—it should be retained but needs reorganization.
The first iteration of the perceived website clutter questionnaire (PWCQ) included one item for overall clutter, six for content clutter, and ten for design clutter (see Figure 1 for the entire questionnaire used in our surveys). The format for overall clutter was an 11-point agreement item (“Overall, I thought the website was too cluttered,” 0: Strongly disagree, 10: Strongly agree). The format for content and design clutter was five-point agreement items (1: Strongly disagree, 5: Strongly agree). The short labels and item wording for the content and design clutter items were:
The data for these analyses came from eight retrospective SUPR-Q^{®} consumer surveys conducted between April 2022 and January 2023. Each survey targeted a specific sector, and in total, we collected 2,761 responses to questions about the UX of 57 websites. The sample had roughly equal representation of gender and age (split at 35 years old). Table 1 shows the participant gender and age for each survey, with sector names linking to articles with more information about each survey (including the websites selected for the sectors). Participants were members of an online consumer panel, all from the United States.
Sector | n | Date | Websites | Female (%) | Male (%) | Under 35 (%) | 35 or older (%) |
---|---|---|---|---|---|---|---|
Real Estate | 269 | Apr-2022 | 5 | 48 | 51 | 48 | 52 |
Travel Aggregator | 452 | Apr-2022 | 9 | 48 | 51 | 48 | 52 |
Business Info | 183 | Jul-2022 | 3 | 46 | 53 | 42 | 58 |
Domestic Air | 350 | May-2022 | 7 | 48 | 49 | 58 | 42 |
International Air | 200 | May-2022 | 5 | 53 | 46 | 61 | 39 |
Ticketing | 234 | Jun-2022 | 5 | 45 | 52 | 40 | 60 |
Clothing | 550 | Dec-2022 | 13 | 52 | 45 | 48 | 52 |
Wireless | 523 | Jan-2023 | 10 | 47 | 50 | 40 | 60 |
Overall | 2,761 | – | 57 | 49 | 49 | 48 | 52 |
The eight surveys shown in Table 1 were retrospective studies of the UX of websites in their respective sectors. Some survey content differed according to the nature of the sector being investigated, but all surveys included the SUPR-Q, basic demographic items, and the first iteration of the perceived clutter questionnaire. For each survey, we conducted screeners to identify respondents who had used one or more of the target websites within the past year, then invited those respondents to rate one website with which they had prior experience. On average, respondents completed the surveys in 10–15 minutes (there was no time limit).
To support independent exploratory and confirmatory analysis, we split the sample into two datasets by assigning every other respondent to an exploratory (n = 1,381) or confirmatory (n = 1,380) sample by sector and website in the order in which respondents completed the surveys. These sample sizes ensured that we far exceeded the recommended minimum sample sizes for exploratory factor analysis and multiple regression (and for future confirmatory factor analysis and structural equation modeling), even after splitting the sample.
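The every-other-respondent assignment can be sketched in a few lines. This is a simplified illustration (a plain list of respondents); in the actual analysis, the alternation was done within each sector and website in the order of survey completion.

```python
# Simplified sketch of the alternating split into exploratory and
# confirmatory samples. In the real analysis, alternation occurred
# within each sector and website in completion order.
def split_alternating(respondents):
    """Assign every other respondent to the exploratory or confirmatory set."""
    exploratory, confirmatory = [], []
    for i, respondent in enumerate(respondents):
        (exploratory if i % 2 == 0 else confirmatory).append(respondent)
    return exploratory, confirmatory

exploratory, confirmatory = split_alternating(list(range(1, 11)))
```

Splitting this way (rather than randomly) keeps the two halves balanced on any time-ordered drift in who completes the survey.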
A parallel analysis of the clutter items indicated retention of two factors. Table 2 shows the alignment of items (identified with item code) with factors from maximum likelihood factor analysis and Promax rotation (KMO = 0.95). Content and design items aligned as expected with Content and Design factors. The reliabilities (coefficient alpha) were acceptably high (Content and Design factors were both 0.95; their combined reliability was 0.96).
Item | Content | Design |
---|---|---|
Content_ALot | .855 | .011 |
Content_TooMany | .883 | −.034 |
Content_Space | .881 | .035 |
Content_Distracting | .897 | .016 |
Content_Irrelevant | .774 | .033 |
Content_Annoying | .892 | −.015 |
Design_HardToRead | −.114 | .832 |
Design_SmallFont | −.084 | .778 |
Design_DistractingColors | .024 | .765 |
Design_UnpleasantLayout | .039 | .829 |
Design_WhiteSpace | .086 | .723 |
Design_TooMuchText | .063 | .776 |
Design_NotLogical | .061 | .803 |
Design_Disorganized | .035 | .844 |
Design_VisualNoise | .219 | .664 |
Design_HardToStart | .000 | .795 |
Item loadings were especially high for content items due to high item correlations, which is good for scale reliability but indicates an opportunity to improve scale efficiency by removing some items. The situation was similar but not quite as extreme for the design items.
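Coefficient alpha, the reliability statistic reported above, can be computed directly from the item-level ratings. A minimal sketch with made-up ratings (not our survey data):

```python
from statistics import variance

def cronbach_alpha(rows):
    """Coefficient alpha for a list of per-respondent rating tuples.

    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)
    """
    k = len(rows[0])                                   # number of items
    cols = list(zip(*rows))                            # per-item columns
    item_var_sum = sum(variance(col) for col in cols)  # sum of item variances
    total_var = variance([sum(row) for row in rows])   # variance of summed scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Perfectly consistent items (every respondent rates both items identically)
# yield the maximum alpha of 1.0:
alpha = cronbach_alpha([(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)])
```

Highly correlated items drive alpha toward 1, which is exactly why very high alphas (like the .95s here) signal redundancy as well as reliability.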
A common strategy for deleting items is to identify those with lower factor loadings. For example, for the Content factor, the lowest item loading was for Content_Irrelevant (.774), and for the Design factor, the lowest item loading was for Design_VisualNoise (.664). However, because we collected a measure of overall perceived clutter (Overall Clutter), we were able to use an alternative strategy of backward elimination regression to select the subset of content and design items that best accounted for variation in Overall Clutter.
Backward regression (key driver analysis) of the six content items retained three: Content_ALot, Content_Space, and Content_Distracting, accounting for 35.5% of variation (adjusted-R^{2}) in Overall Clutter. Backward regression of the ten design items plus deletion of items with negative beta weights retained three: Design_UnpleasantLayout, Design_TooMuchText, and Design_VisualNoise, accounting for 39% of variation (adjusted-R^{2}) in Overall Clutter.
Backward regression of these six items revealed some evidence of variance inflation, and in this combination, Content_Distracting no longer made a significant contribution to the model. After removing Content_Distracting, the remaining five items accounted for almost half of the variation in Overall Clutter (adjusted-R^{2} = 45%), and all variance inflation factors (VIF) were less than 4. The reliabilities (coefficient alpha) for the revised Content and Design factors were, respectively, 0.91 and 0.88; their combined reliability was 0.90.
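The VIF screen (all values below 4) can be reproduced with ordinary least squares: each predictor is regressed on the remaining predictors, and VIF_j = 1 / (1 − R²_j). A sketch with a toy predictor matrix, not our survey data:

```python
import numpy as np

def vifs(X):
    """Variance inflation factors for the columns of predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on the other columns (with an intercept).
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = ((y - A @ beta) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        out.append(1 / (1 - r2))
    return out

# Two uncorrelated predictors give VIFs of 1 (no variance inflation):
example = vifs([[1, 1], [2, -1], [3, -1], [4, 1]])
```

VIFs near 1 mean the predictors carry independent information; values above roughly 4 or 5 indicate the redundancy that motivated dropping Content_Distracting.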
For the exploratory research, the method of consulting the literature and expert brainstorming to arrive at the initial item set established content validity for the clutter questionnaire (Nunnally, 1978). The expected alignment of items with factors in the factor analysis is evidence of construct validity. Evidence of concurrent validity of the clutter factors comes from their significant correlations with the single-item measure of overall clutter (content clutter: r(1,379) = 0.60, p < 0.0001; design clutter: r(1,379) = 0.61, p < 0.0001).
Based on the exploratory analyses, Figure 2 shows the revised version of the PWCQ, with the overall clutter item, two items for the assessment of content clutter, and three items for the assessment of design clutter.
Exploratory analysis of 1,381 ratings of the perceived clutter of 57 websites found:
The proposed questionnaire items aligned with the expected factors of content and design clutter. A parallel analysis indicated the retention of two factors. Exploratory factor analysis showed that the content clutter items formed one factor, and the design clutter items formed the other.
Scale reliability was very high for the overall and factor scores. The reliability for each factor was .95 with a combined reliability of .96. Reliabilities this high indicate an opportunity to increase scale efficiency by reducing the number of items.
We used multiple regression to increase the efficiency of the questionnaire while keeping its reliability high. The revised questionnaire retained the overall item, two items for content clutter, and three items for design clutter. Reliability coefficients dropped a bit from the original questionnaire but remained high (.91 for content clutter, .88 for design clutter, and .90 combined).
The revised questionnaire had high concurrent validity. Concurrent validity was evident from the highly significant correlations between the factor scores and the single overall clutter item scores.
Bottom line: This exploratory development of a standardized clutter questionnaire for websites produced an efficient two-factor instrument with excellent psychometric properties (high reliability and validity).
The mean is one type of average. Would it be more appropriate to estimate and compare SUS medians instead of means?
To investigate this question, we analyzed and compared SUS means and medians collected from over 18,000 individuals who used the SUS to rate the perceived usability of over 200 products and services. But before we get to those results, it helps to understand the difference between the median and mean and why you might choose one or the other.
One of the first and easiest things to do with a data set is to find the mean. The mean is a measure of central tendency, one way of summarizing the middle of a dataset. To calculate the mean, add up the data points and divide by the total number in the group (the sample size, n). With the mean, every data point contributes to the estimate. The mean is the preferred method for data that are at least roughly symmetrical (i.e., the mean is about midway between the lowest and highest values). Of the many types of symmetrical distributions, the best known is the normal distribution.
When the data aren’t symmetrical, the mean can be sufficiently influenced by a few extreme data points to become a poor measure of the middle value. In these cases, the median, the center point of a distribution, is a better estimate of the most typical value. For example, this often happens with distributions of time data. For samples with an odd number of data points, the median is the central value; for samples with an even number, it’s the average of the two central values. With the median, only the center or center two data points directly contribute to the estimate while the other data points establish the center.
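The pull of extreme values on the mean is easy to see with a small, hypothetical set of task times:

```python
from statistics import mean, median

# Hypothetical task times (seconds) with one extreme observation:
times = [20, 22, 25, 27, 30, 31, 35, 200]

print(mean(times))    # 48.75 -- pulled well above the typical value by the outlier
print(median(times))  # 28.5 -- the average of the two central values (27 and 30)
```

Seven of the eight observations are under 36 seconds, yet the outlier drags the mean to nearly 49; the median stays in the middle of the typical values.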
The median is a problematic measure of central tendency for individual rating scales. Because respondents select one number from a rating scale, the dataset is composed of integers. For a five-point scale, the median can take only the following values no matter how large the sample is: 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, and 5.0. (And it can only take the intermediate values when n is even.)
The mean, on the other hand, can take any value between 1 and 5, and as the sample size increases, it becomes more and more continuous. Because the mean can acquire a larger number of values, it can reflect significant differences between two samples more reliably than the median difference.
When scales are open-ended (have at least one endpoint at infinity, like time data), extreme values can affect the mean but will not affect the median. Rating scales, however, are not open-ended, so the median does not have a compelling advantage over the mean when analyzing individual rating scales.
Things get more complicated when working with metrics that are composites of many individual rating scales, like the SUS. The SUS is made up of ten five-point items with the final score (the mean of the ten items) interpolated to range from 0–100, so it can take 41 values in 2.5-point increments (0, 2.5, 5.0, 7.5, … 97.5, 100). The median can take these values when n is odd. When n is even, 40 intermediate values (such as 1.25 and 3.75) are also possible, for a total of 81 potential median values separated by just 1.25 points. As for individual rating scales, the mean becomes more continuous as the sample size increases, but with so many possible median values, the difference in mean–median sensitivity is reduced for the SUS.
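These counts can be verified by enumeration, under the assumption that any pair of possible SUS scores could be the two central values in an even-n sample:

```python
# Individual SUS scores are multiples of 2.5 from 0 to 100.
scores = [i * 2.5 for i in range(41)]

# With an odd n, the median is one of these scores; with an even n, it is
# the average of two of them, which always lands on a multiple of 1.25.
even_n_medians = {(a + b) / 2 for a in scores for b in scores}

print(len(scores))          # possible odd-n medians
print(len(even_n_medians))  # possible medians when n is even
```

(The multiples of 2.5 and 1.25 are exactly representable as floats, so the set comparison here is safe.)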
Historically, the typical practice has been to compute SUS means rather than medians, even though the distribution of the SUS is known to be asymmetrical. Thus, the methods developed over decades to interpret SUS scores are based on means, not medians.
We would never recommend relying on the median of an individual rating scale as a measure of central tendency, but we were curious about the typical difference between the means and medians of datasets collected with the SUS. Fortunately, we have a LOT of SUS data to answer this question.
We compiled a large set of SUS data with 18,853 individual SUS scores, assessing 210 products and services, studied in 2010 through 2022 (primarily business and consumer software products, websites, and mobile apps). Figure 1 shows the expected asymmetric distribution of the data.
The distribution is left-skewed (with a number of atypically low SUS scores on the left). Spikes are at values of 50, 75, and 100, but not at 0 or 25. The spike at 50 was somewhat expected because one way to get a score of 50 is to rate each SUS item with the same response option. There is always a concern that this straightlining might indicate a respondent who isn’t carefully considering each item. This concern is partially (but not totally) alleviated by the differences in the number of 75s and 100s compared to 0s and 25s. For example, the only pattern that produces a score of 100 is alternating selections of 5 for odd-numbered items and 1 for even-numbered items, and the only way to get 0 is to reverse that pattern. If most respondents were selecting patterns at random, then the number of 100s and 0s would be similar—but they aren’t. Despite this, researchers should be suspicious enough of scores of 50 to investigate other aspects of those respondents’ behaviors to see whether the data should be retained or excluded from analysis. For the following analyses, we retained all the data but focused on analysis at the product rather than the individual level.
Figure 2 shows the scatterplot of the SUS means and medians from all 210 products in the dataset. As expected, the means and medians had a strong linear relationship (r(208) = .97, p < .0001)—an almost perfect correlation.
Figure 3 shows the average difference (with 95% confidence intervals) between means and medians for all data (n = 210) and the data split between studies with relatively low sample sizes (n < 30, 61 products with n ranging from 5 to 26) and those with larger sample sizes (n ≥ 30, 149 products with n ranging from 30 to 1,969).
Inspection of the confidence intervals in Figure 3 shows that, on average, SUS medians were about 2 points higher than SUS means, and there was no significant difference due to sample size. The standard deviation of the difference was larger when n < 30 (4.3 versus 2.5), reflected in the slightly larger range of the confidence interval.
For all products and the sample size splits, the lower limit of the confidence intervals was higher than 0, indicating that the median–mean difference was statistically significant (i.e., a difference of 0 is not plausible). Using the data from all products as the most precise estimate of the difference (smallest confidence interval), the confidence interval ranged from 1.7 to 2.5. This means that a median–mean difference of 2.0 is plausible, but differences lower than 1.7 or higher than 2.5 are not.
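The confidence intervals in Figure 3 are the usual mean ± margin-of-error computation. A sketch using the normal approximation (reasonable at these sample sizes) and illustrative summary numbers, not the actual product-level data:

```python
from statistics import NormalDist

def mean_ci(m, sd, n, conf=0.95):
    """Normal-approximation confidence interval around a sample mean."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # ~1.96 for 95% confidence
    margin = z * sd / n ** 0.5
    return m - margin, m + margin

# Illustrative values: a mean difference near 2.1, sd near 2.9, 210 products.
low, high = mean_ci(2.1, 2.9, 210)
```

Because the lower limit of such an interval stays above 0, a true difference of 0 can be ruled out at the 95% confidence level.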
Perhaps there’s a difference when sample sizes are smaller? To explore the median–mean difference with smaller sample sizes, we focused on the 149 products with at least 30 SUS scores. For each of these products, we assigned a random number to each respondent and then sorted respondents by random number. We created two new datasets, with one containing the first ten randomly selected participants and the other containing the next 20 randomly selected participants.
Figure 4 shows the scatterplots for each new dataset. Consistent with the full dataset shown in Figure 2, the means and medians for both datasets have a strong linear relationship (n = 10: r(147) = .94, p < .0001; n = 20: r(147) = .95, p < .0001). The spread of points in the two graphs is consistent with their standard deviations (n = 10: 4.2; n = 20: 3.6), with slightly more spread for the smaller sample size.
Figure 5 shows the average difference (with 95% confidence intervals) between means and medians for the two datasets.
The results in Figure 5 are consistent with those in Figure 3. SUS medians were about 2 points higher than the means with no significant difference due to sample size. The standard deviation of the difference when n = 10 (4.0) was slightly higher than when n = 20 (3.5).
Consistent with the analysis of all the data, the lower limit of the confidence intervals was higher than 0, indicating a statistically significant median–mean difference. When n = 10, the 95% confidence interval around the difference ranged from 1.3 to 2.6; when n = 20, it ranged from 1.6 to 2.7, both just a bit wider than the best (full-data) estimate of 1.7 to 2.5.
We compared the means and medians of SUS scores from 18,853 individuals who used the SUS to rate the perceived usability of 210 products and services. Our key findings and conclusions are:
We recommend using the mean rather than the median. As described in more detail below, the medians of SUS distributions are slightly but consistently higher than the means, so researchers who use the various tools developed over the past few decades to interpret the SUS would slightly but consistently overestimate the quality of the user experience if they reported the median rather than the mean.
There was a statistically significant but small two-point difference between the means and medians. When a distribution is left-skewed, you expect the median to be larger than the mean because the extremely low scores exert some pull on the location of the mean. For these data, the median was typically about 2 points higher than the mean (between 1.7 and 2.5 with 95% confidence).
The difference between means and medians was consistent across different sample sizes. We estimated the difference using all data for all products, splitting the products into two groups with one containing the products that were assessed with n < 30 and the other with n ≥ 30, and randomly selecting individual participants from products with n ≥ 30 to assess sample sizes of 10 and 20. All analyses had average median–mean differences of about 2 with similar ranges of 95% confidence intervals.
The observed median–mean difference was consistent with the gaps between the possible medians of the SUS. An individual SUS score can take 41 values in 2.5-point increments from 0 to 100. When a sample size is odd, the median is restricted to these values. When a sample size is even, the median can, in some cases, also take 40 values midway between the 2.5-point increments. Of the 210 products in our dataset, 123 (59%) had an even sample size and 87 (41%) had an odd sample size. It is interesting that the observed difference of about 2 points is between the smaller increment of 1.25 points and the larger increment of 2.5 points. So:
This range (1.8 to 2.5) is remarkably close to our best estimate of the plausible range of the median–mean difference (1.7 to 2.5). It could be a coincidence, but it’s still intriguing.
The difference between SUS means and medians is small, but using the median could sometimes be problematic when using existing methods of interpreting the SUS. Given the typical distribution of the SUS, the median will almost always be greater than the mean—by our estimates, usually about 2 points higher. In most cases, this is close enough that the interpretation of both measures of central tendency will be the same. There might be differences, however, at interpretive boundaries (e.g., between B and A on a curved grading scale, or between Excellent and Best Imaginable on an adjective scale like those shown in Figure 6).
In practice, the answer is based on both statistics AND logistics.
A statistical formula will tell you an optimal number to select. But the real-world logistical constraints of budgets, recruiting challenges, and time will often dictate the maximum number of participants you can test with.
In our earlier article, we described the sample size formula for problem discovery studies and how two parameters (likelihood of a problem and problem occurrence) impact the sample size.
In our experience, these logistical constraints lead research teams to set aside a specific number of days and a specific budget to run studies. The problem discovery formula may suggest testing 18 participants, but if you have only two days to collect data, your maximum sample size may be only ten. What do you get with ten? And then what if two participants don’t show up, and you have to toss out the data from another because of prototype issues, leaving you with 7?
A practical approach to handling sample size discussions is to flip the common question of “What sample size do I need?” to “What will I be able to detect given a specific sample size?”
In this article, we present a table (a kind of size chart for sample sizes for discovery studies) and walk through how to use it and the associated graphs to see what you can expect to get with different sample sizes for problem discovery studies.
To help UX researchers plan for a variety of discovery percentages and problem probabilities in formative usability studies, we created Table 1. There is a row in the table for each sample size from 1 to 25 and columns for different possible problem probabilities from 1% to 75%. The values in the table cells are discovery rates (the likelihood of observing the problem at least once) for each combination of sample size (n) and problem probability (p) computed using the formula 1 – (1 – p)^{n}. For easier lookups, all values are shown as percentages. Because 100% discovery is, strictly speaking, not possible, the discovery rates of 100% in the table mean that the expected percentage of discovery is at least 99.5% (and for high problem probabilities and large sample sizes, more like 99.99999%).
n | 1% | 5% | 10% | 15% | 25% | 30% | 50% | 75% |
---|---|---|---|---|---|---|---|---|
1 | 1% | 5% | 10% | 15% | 25% | 30% | 50% | 75% |
2 | 2% | 10% | 19% | 28% | 44% | 51% | 75% | 94% |
3 | 3% | 14% | 27% | 39% | 58% | 66% | 88% | 98% |
4 | 4% | 19% | 34% | 48% | 68% | 76% | 94% | 100% |
5 | 5% | 23% | 41% | 56% | 76% | 83% | 97% | 100% |
6 | 6% | 26% | 47% | 62% | 82% | 88% | 98% | 100% |
7 | 7% | 30% | 52% | 68% | 87% | 92% | 99% | 100% |
8 | 8% | 34% | 57% | 73% | 90% | 94% | 100% | 100% |
9 | 9% | 37% | 61% | 77% | 92% | 96% | 100% | 100% |
10 | 10% | 40% | 65% | 80% | 94% | 97% | 100% | 100% |
11 | 10% | 43% | 69% | 83% | 96% | 98% | 100% | 100% |
12 | 11% | 46% | 72% | 86% | 97% | 99% | 100% | 100% |
13 | 12% | 49% | 75% | 88% | 98% | 99% | 100% | 100% |
14 | 13% | 51% | 77% | 90% | 98% | 99% | 100% | 100% |
15 | 14% | 54% | 79% | 91% | 99% | 100% | 100% | 100% |
16 | 15% | 56% | 81% | 93% | 99% | 100% | 100% | 100% |
17 | 16% | 58% | 83% | 94% | 99% | 100% | 100% | 100% |
18 | 17% | 60% | 85% | 95% | 99% | 100% | 100% | 100% |
19 | 17% | 62% | 86% | 95% | 100% | 100% | 100% | 100% |
20 | 18% | 64% | 88% | 96% | 100% | 100% | 100% | 100% |
21 | 19% | 66% | 89% | 97% | 100% | 100% | 100% | 100% |
22 | 20% | 68% | 90% | 97% | 100% | 100% | 100% | 100% |
23 | 21% | 69% | 91% | 98% | 100% | 100% | 100% | 100% |
24 | 21% | 71% | 92% | 98% | 100% | 100% | 100% | 100% |
25 | 22% | 72% | 93% | 98% | 100% | 100% | 100% | 100% |
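Every cell in Table 1 comes from the formula 1 − (1 − p)^n. For instance, the n = 10 row can be regenerated in a few lines:

```python
def discovery_rate(p, n):
    """Chance of observing, at least once, a problem with occurrence
    probability p in a sample of n participants."""
    return 1 - (1 - p) ** n

# Reproduce the n = 10 row of Table 1, rounded to whole percentages:
problem_probs = [0.01, 0.05, 0.10, 0.15, 0.25, 0.30, 0.50, 0.75]
row_10 = [round(100 * discovery_rate(p, 10)) for p in problem_probs]
print(row_10)  # → [10, 40, 65, 80, 94, 97, 100, 100]
```

The "100%" cells are rounding: for p = 50% and n = 10 the exact rate is 1 − 0.5^10 ≈ 99.9%, which is why the table's 100% entries mean "at least 99.5%."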
There are two ways to use Table 1:
Returning to the question posed at the beginning of the article, what do you expect to get with a sample size of ten or a sample size of seven?
We start in the “n” column and go down till we find 10. That’s the given sample size. We then go across the columns, inspecting the expected discovery rates from the very low problem probability of 1% to the very high probability of 75%. The first discovery rate in the “10” row is 10%. This is low and means we’ll have only a 10% chance of seeing problems that affect 1% of the population at a sample size of 10 (in other words, with ten participants we can expect to discover about 10% of problems that have a 1% probability of occurrence). It’s not a 0% chance (yes, we’re saying there’s a chance), but those are low numbers to count on. Moving to the right, we get to 97% discovery for problems that affect 30% or more of the user population. That means at a sample size of ten, we’ll have a great chance (97% or higher) of seeing the relatively common problems (something that affects about a third or more of all users).
Repeating this now with seven participants shows that our chance of detecting problems (if they exist) that affect 30% of the population drops to 92%. Still, that’s a good chance of seeing these and any more likely problems. Where we had a good chance of seeing problems with a 15% probability with ten participants (80% likelihood of discovery), with only seven participants, the discovery likelihood drops to 68%.
Another interesting pattern the table shows is that for rare problems affecting only 1% of the population, the sample size and chance of detecting these 1%ers track closely. At ten users, you’ll have about a 10% chance of detecting them, a 1% chance with one participant, and about a 22% chance with 25 participants. Not shown in the table, at 100 participants you’d expect to discover about 63% of them. It’s hard to uncover uncommon problems with small sample sizes.
To get started, you need to make two decisions:
For example, suppose you’ve decided to focus on discovering problems that will happen to at least 15% of the population of interest (the problem detection probability), and the desired likelihood of discovery is at least 80%.
The smallest sample size that meets those criteria is n = 10. From Table 1, what you can expect with ten participants is:
This means that with ten participants, you can be reasonably confident that the study, within the limits of its tasks and population of participants (which establish what problems are available for discovery), is almost certain (> 90% likely) to reveal problems for which the problem detection frequency is at least 25%. As planned, the likelihood of discovery of problems with a detection probability of 15% is 80%.
For problems with a detection probability of less than 15%, the rate of discovery will be lower but will not be 0 when n = 10. For example, the expectation is that you will find about 65% of problems for which the detection probability is 10%, and about 40% of the problems available for discovery whose detection probability is 5%. You would even expect to detect 10% of the problems with a detection probability of just 1%. That’s not a bad haul for a small-sample qualitative study.
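The planning question can also be inverted algebraically: solving 1 − (1 − p)^n ≥ d for n gives n = ⌈ln(1 − d) / ln(1 − p)⌉, which finds the smallest sample size directly instead of scanning the table. A sketch:

```python
from math import ceil, log

def min_sample_size(p, goal):
    """Smallest n whose discovery rate 1 - (1 - p)^n reaches the goal."""
    return ceil(log(1 - goal) / log(1 - p))

print(min_sample_size(0.15, 0.80))  # → 10, matching the worked example above
print(min_sample_size(0.30, 0.95))  # → 9 for 95% discovery of 30% problems
```

This is the same formula that produces Table 1, just solved for n rather than for the discovery rate.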
Another way to use the table is to scan from left to right to see whether the patterns of discovery rates for any of the sample sizes are acceptable, but scanning is easier when the data in the table are presented in graphs.
Figures 1 and 2 show different ways to depict the information in Table 1.
Figure 1 shows the trajectory of problem discovery for different problem probabilities for each sample size from 1 to 25. The trajectories are almost linear for the lowest probabilities (1% and 5%) and dramatically nonlinear for the highest probabilities (50% and 75%). The speed with which high-probability problems approach discovery rates over 95% is why we didn’t include any probabilities higher than 75% in the table. A problem with a probability of occurrence over 80% has at least a 96% chance of being discovered with just two participants.
The box in Figure 1 shows the expected discovery rates for each of the different problem probabilities when n = 10, matching the extended example presented in Table 1.
Figure 2 shows the trajectory of problem discovery for different sample sizes (5, 10, 15, 20, and 25) for various problem probabilities (1%, 5%, 10%, 15%, 25%, 30%, 50%, and 75%).
The trajectories are similar to those shown in Figure 1, becoming less linear and smoother as the sample size increases from 5 to 25. The distance between the lines illustrates the diminishing returns in discovery rates associated with increasing sample sizes. For example, increasing the sample size from 5 to 10 shows significant benefits in the middle of the range of problem probabilities, but the benefit achieved from increasing the sample size from 20 to 25 is much smaller.
One sample size doesn’t fit all research needs for problem discovery studies like formative usability studies. Fortunately, tabular and graphic aids can help UX researchers determine and justify sample sizes for these types of studies.
What do you get with a specific sample size in problem discovery studies? For each possible sample size, you are likely to observe (at least once) some of the problems that will happen to only a small percentage of the population of interest, more of the problems that affect a moderate percentage, and most of the problems that affect most of the population. The likelihood of discovery increases as the sample size increases but with diminishing returns.
What drives sample size decisions for problem discovery studies? The appropriate sample size for problem discovery studies depends on two factors—the smallest problem probability you wish to detect and the desired discovery rate. In other words, how rare of an event do you need to be able to detect at least once, and what percentage of those events do you need to discover in the study?
What decision aids are available to guide sample size decisions for problem discovery studies? You can use the table and graphs presented in this article to understand what you can expect to get with different sample sizes for problem discovery studies. This can be useful for initial sample size planning and understanding the consequences of events that lead to the reduction of the initially planned sample size.
Technical note: Some early approaches to sample size decisions for formative usability studies relied on the average observed value of p across a group of discovered problems. This approach does not compute the variability of the mean of p. Also, estimates of mean p from samples consistently overestimate the actual likely value of p. There are some complex mathematical approaches to deal with these issues, but the method we describe in this article avoids the issues because it does not require estimating an average value of p.
Good intentions? Because someone influential said to use it online?
A measure is valid if it can be demonstrated that it measures what it is intended to measure, has the expected alignment of items with factors, and has the expected statistical relationships with other metrics. Its usage also depends on its practicality.
So how do you demonstrate validity? It takes data and disclosure.
At MeasuringU, we originally benchmarked websites using the SUS. Enough data were publicly available that we could generate percentile rankings from raw SUS scores that made the perceived usability data more interpretable.
But we knew that the quality of the website user experience was more than just usability.
We started to develop what’s come to be known as the Standardized User Experience Percentile Rank Questionnaire (SUPR-Q^{®}) in 2011 and published our findings in 2015.
The SUPR-Q is a short (eight-item) questionnaire that measures perceptions of Usability, Trust, Appearance, and Loyalty for websites. The combined score provides an overall measure of the quality of the website user experience.
We wanted to maintain the percentile ranking we had built from the SUS data, so the SUPR-Q also provides relative rankings expressed as percentiles. A SUPR-Q percentile score of 50 is average (roughly half the websites evaluated in the past with the SUPR-Q have received better scores and half received worse). The normative database contains responses from more than 10,000 participants and 150 websites (updated on an ongoing basis, about once per quarter). Its compactness and normed database made it practical, but we needed to show it also had strong psychometric properties.
During its development, the final version of the SUPR-Q was informed by psychometric analysis of over 4,000 responses across 100 website experiences. Iterative item selection led to an efficient questionnaire with two items per construct with validity established using exploratory factor analysis and acceptable levels of reliability (coefficient α > .70) for the overall and most subscales (Overall: α = .86, Usability: α = .88, Trust: α = .85, Appearance: α = .78, Loyalty: α = .64). In a study of 40 websites (n = 2,513), the global SUPR-Q and its subscales discriminated well between the poorest and highest quality websites, providing evidence of its sensitivity.
In this article, we report the results of a confirmatory factor analysis (CFA) to validate the SUPR-Q questionnaire and a multiple regression analysis of the basic SUPR-Q measurement model (how well the Usability, Trust, and Appearance metrics account for variation in the Loyalty metric).
Shown in Figure 1, the SUPR-Q measures four website UX factors with eight questions: Usability (easy to use, easy to navigate), Trust (trustworthy, credible), Appearance (attractive, clean, and simple), and Loyalty (likelihood to revisit, likelihood to recommend). The item scores for each subscale are the averages of the two items (after dividing the 0–10-point LTR rating by 2). The overall scale is the average of the four subscales.
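The scoring rules in the preceding paragraph can be sketched in a few lines; the item names and ratings below are illustrative, not from our data:

```python
# Compute SUPR-Q subscale and overall scores from the eight item ratings.
# Ratings are illustrative. All items use 1-5 scales except likelihood to
# recommend (LTR), which uses the 0-10 NPS scale and is divided by 2.
ratings = {
    "easy_to_use": 4, "easy_to_navigate": 5,   # Usability
    "trustworthy": 4, "credible": 4,           # Trust
    "attractive": 3, "clean_and_simple": 4,    # Appearance
    "likely_to_return": 4, "ltr_0_to_10": 9,   # Loyalty
}

usability = (ratings["easy_to_use"] + ratings["easy_to_navigate"]) / 2
trust = (ratings["trustworthy"] + ratings["credible"]) / 2
appearance = (ratings["attractive"] + ratings["clean_and_simple"]) / 2
loyalty = (ratings["likely_to_return"] + ratings["ltr_0_to_10"] / 2) / 2
overall = (usability + trust + appearance + loyalty) / 4

print(usability, trust, appearance, loyalty, overall)  # 4.5 4.0 3.5 4.25 4.0625
```

The overall score is then converted to a percentile rank against the normative database.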
How did we know the eight items we selected measured our intended constructs? We used a statistical technique called exploratory factor analysis (EFA). This approach shows how well the data we collect (what we can observe) measure what we can’t see but want to measure (e.g., usability, loyalty). As a measure is used and more data are collected, it’s good practice to show that the original factors still provide good measures of the constructs.
Now that we have used the SUPR-Q for over a decade, we decided to conduct a confirmatory factor analysis (CFA). As indicated in their names, researchers use EFA in the early stages of research to explore different plausible factor structures (e.g., items to retain, number of factors), then use CFA on an independent set of data to assess the model fit of the most promising factor structure found during EFA.
There are many ways to assess the quality of fit of a CFA model. We focused on the combination of Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), and Bayesian Information Criterion (BIC). There are guidelines for good levels of model fit for CFI (> 0.90) and RMSEA (< 0.08), but not for BIC, which is used to compare models (smaller is better).
For this analysis, we used SUPR-Q data from eight retrospective consumer surveys conducted between April 2022 and January 2023. Each survey targeted a specific sector, and, in total, we collected 2,761 responses to questions about the UX of 57 websites. The sample had roughly equal representation of gender and age (split at 35 years old). Table 1 shows the participant gender and age for each survey, with sector names linking to articles with more information about each survey (including the websites selected for the sectors).
Sector | n | Date | Websites | Female (%) | Male (%) | Under 35 (%) | 35 or older (%) |
---|---|---|---|---|---|---|---|
Real Estate | 269 | Apr 2022 | 5 | 48 | 51 | 48 | 52 |
Travel Aggregator | 452 | May 2022 | 9 | 48 | 51 | 48 | 52 |
Business Info | 183 | Jul 2022 | 3 | 46 | 53 | 42 | 58 |
Domestic Air | 350 | May 2022 | 7 | 48 | 49 | 58 | 42 |
International Air | 200 | May 2022 | 5 | 53 | 46 | 61 | 39 |
Ticketing | 234 | Jun 2022 | 5 | 45 | 52 | 40 | 60 |
Clothing | 550 | Dec 2022 | 13 | 52 | 45 | 48 | 52 |
Wireless | 523 | Jan 2023 | 10 | 47 | 50 | 40 | 60 |
Overall | 2,761 | – | 57 | 49 | 49 | 48 | 52 |
The eight surveys shown in Table 1 were retrospective studies of the UX of websites in their respective sectors. Some survey content differed according to the nature of the sector being investigated, but all surveys included the SUPR-Q and basic demographic items. For each survey, we conducted screeners to identify respondents who had used one or more of the target websites within the past year, then invited those respondents to rate one website with which they had prior experience. On average, respondents completed the surveys in 10–15 minutes (there was no time limit).
Figure 2 shows the results of the CFA. The loadings (link weights) for each item with respective factors were very strong (from .74 to .89) and statistically significant (p < .0001). The model had excellent fit statistics (CFI: .993, RMSEA: .05, BIC: 284.6). The reliability of the overall and all subscales exceeded .70 (Overall: α = .90, Usability: α = .88, Trust: α = .87, Appearance: α = .80, Loyalty: α = .73).
When we developed the SUPR-Q model, we knew from both the published literature and our own data that the four factors (Usability, Trust, Appearance, and Loyalty) were correlated. Correlation, of course, does not mean causation, and it can be difficult to disentangle causal direction without controlled experimental manipulation. However, some work has shown that attitudes toward usability affect attitudes toward appearance. We had reason to believe that UX quality and its components affect intent to use and likelihood to recommend (Loyalty).
In addition to its usefulness as a single measure of the UX of websites, the components of the SUPR-Q can be used in a framework in which Usability, Trust, and Appearance predict (are antecedents of) Loyalty. The model shown in Figure 3 is based on the data set described in Table 1. Values on the links from Usability, Trust, and Appearance to Loyalty are multiple regression beta weights (all statistically significant with p < .0001, beta weights ranging from .22 to .32), with the three predictors accounting for almost half (46%) of the variation in Loyalty—a highly significant model.
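With a correlation matrix in hand, standardized beta weights like those in Figure 3 are the solution to a small linear system: the betas satisfy Rxx · beta = rxy, and R² is the dot product of the betas with the predictor-criterion correlations. A sketch using a hypothetical correlation matrix (illustrative values only, not our actual data):

```python
# Standardized regression weights from a correlation matrix, solved with
# Cramer's rule (no external dependencies needed for a 3x3 system).
def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)

def solve3(m, v):
    d = det3(m)
    out = []
    for j in range(3):
        mj = [row[:] for row in m]
        for i in range(3):
            mj[i][j] = v[i]
        out.append(det3(mj) / d)
    return out

# Hypothetical intercorrelations of Usability, Trust, Appearance (Rxx)
# and their correlations with Loyalty (rxy) -- illustrative values only.
Rxx = [[1.00, 0.55, 0.60],
       [0.55, 1.00, 0.50],
       [0.60, 0.50, 1.00]]
rxy = [0.55, 0.58, 0.54]

betas = solve3(Rxx, rxy)                          # standardized beta weights
r_squared = sum(b * r for b, r in zip(betas, rxy))  # variance accounted for
print([round(b, 2) for b in betas], round(r_squared, 2))
```

With these made-up correlations, the three predictors account for roughly 45% of the variance in the criterion, close in spirit to the 46% reported for the actual model.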
In the future, we plan to use this basic framework to model additional consequences, such as brand attitude and additional antecedents such as perceived clutter and usefulness.
The SUPR-Q has over a decade of data and usage. Our psychometric analyses (CFA and regression model) of the basic SUPR-Q model using data from retrospective studies of eight sectors (n = 2,761 across 57 websites) found:
The SUPR-Q exhibits strong evidence of validity. The results of the CFA showed that the SUPR-Q items had strong fit. All links between items and their respective factors were statistically significant and strong (ranging from .74 to .89, all p < .0001). The fit statistics were excellent (CFI: .993, RMSEA: .05, BIC: 284.6). These findings strongly support the construct validity of the SUPR-Q.
The SUPR-Q exhibits acceptable to good reliability. For these analyses, the SUPR-Q scale reliabilities, assessed with coefficient alpha, all exceeded the commonly used criterion of .70 (Overall: α = .90, Usability: α = .88, Trust: α = .87, Appearance: α = .80, Loyalty: α = .73). These estimates of reliability were very close to those reported in the original SUPR-Q publication, but this time the estimate for Loyalty, originally .64, was .73. Some analysts have suggested that the Spearman-Brown method provides better reliability estimates than coefficient alpha when scales have just two items, but there were no meaningful differences in the reliability estimates for these data.
SUPR-Q components predict loyalty. The antecedents in the basic SUPR-Q measurement model account for almost half of the variation in Loyalty. All three antecedents are significant key drivers of Loyalty, with beta weights in roughly the same range (.22 for Usability, .26 for Appearance, and .32 for Trust), together accounting for 46% of the variation in Loyalty.
Bottom line: The basic SUPR-Q measurement model is psychometrically strong, making it an excellent starting point for investigating how its components relate to additional constructs such as brand attitude, usefulness, and perceived clutter.
Social media reflects and affects where we work, who we vote for, what we purchase, and what we do in our free time.
Since we last examined the social media space in 2018, social media has deviated from its original photo-sharing roots to incorporate short-form video, AI, gaming, dating, and shopping, which raises a question—is social media still social?
We explore this question and the user experience of social media apps in our latest benchmarking survey.
Our survey comes on the heels of what may become known as the “Social Media Olympics”: viewership of the international competition was up 77%, largely attributed to the loosened social media regulations that allowed athletes to share their experiences of the iconic games.
Social media remains a juggernaut in politics as well, being used heavily by candidates on all sides as things heat up before the 2024 presidential election. Democratic nominee Kamala Harris is leaning heavily into social media trends like “brat summer,” while the Republican nominee, former President Donald Trump, spoke with Elon Musk as millions of viewers tuned into the live feed on X, formerly Twitter.
Social media’s influence isn’t going unnoticed by governments, who are wary of its reach and influence. The European Union sent a warning letter to Elon Musk before his interview with Donald Trump, reminding him that X is subject to the EU’s Digital Services Act (DSA), the bloc’s relatively new law regulating illegal content and disinformation on large social media platforms. In the U.S., details concerning the U.S. government’s role in coercing platforms to censor information have emerged. There is bipartisan support for banning TikTok in the U.S. over concerns the Chinese government may use it to spy on U.S. citizens. Brazil recently banned X, fining its citizens for attempting to access it.
Artificial intelligence is unsurprisingly making its mark in the social media sphere, as fears of deepfakes and misattributed content begin to materialize. Some of the first large-scale deepfake scams have surfaced in recent months, with Prince William seemingly endorsing a crypto platform, and Elon Musk promoting a radical investment opportunity.
Social media titans like Meta are bolstering their response to this type of content through recent policy changes on digitally altered media, requiring new “AI info” labels.
It’s no surprise, then, that the landscape of social media has evolved quickly since our 2018 benchmark, encouraging us to update our benchmark data. We used MUiQ to benchmark the desktop website and app experience of six popular platforms: Facebook, Instagram, LinkedIn, Snapchat, TikTok, and X (formerly Twitter).
We recruited 324 participants in August 2024 to reflect on their most recent experience with one of the platforms they had used in the past year. While the sample size isn’t large relative to the billions (with a b!) of people who use social media, it’s large enough to detect patterns between platforms and to compare to our historical sample.
Participants in the studies answered the eight-item SUPR-Q (including the Net Promoter Score) and questions about their prior experience. In particular, we were interested in exploring the quality of the user experience, users’ reasoning behind social media use, the proliferation of AI, levels of trust, and social media’s impact on mood and self-esteem. (Full details are available in the downloadable report.)
Are social media apps still social? We asked respondents to select which activities they engage in while using the platform. A selection of activities is shown in Table 1. The full activity list is available in the downloadable report.
Platform | Stay informed about other people in your life | Keep in touch with friends | Watch short-form videos |
---|---|---|---|
FB | 66% | 79% | 38% |
IG | 55% | 43% | 57% |
LinkedIn | 37% | 19% | 4% |
Snapchat | 52% | 78% | 36% |
TikTok | 32% | 18% | 91% |
X | 55% | 43% | 57% |
Average | 49% | 46% | 48% |
While participants still report using social media to stay informed about people in their lives (49%) or keep up with friends (46%), platforms seem to be moving more toward impersonal content consumption. As shown in Table 1, across all platforms a substantial 48% of respondents reported watching short-form videos. This aligns with findings that users are primarily using social media for entertainment (60%) and distraction (44%).
There is, however, variation between apps. Facebook and Snapchat are still firmly planted in the “social” aspect of social media, with 79% of Facebook users and 78% of Snapchatters stating they use it for keeping in touch with friends.
Alternatively, Instagram and TikTok lean more towards content, with 91% of TikTok users and 57% of Instagram users reportedly using the app to watch short-form videos (Figure 1). This is an especially large jump for Instagram, which introduced Reels only four years ago in August 2020. In open-ended responses, users expressed dissatisfaction with the lack of posts from friends or non-influencers.
X is by far the platform used most for staying informed about the news (68%), with other platforms like Facebook and Snapchat landing significantly lower at 17% and 10% respectively. While X is more popular for news consumption, respondents mentioned particular dissatisfaction with misinformation, political content, and general negative social interactions. While there was a time when social media was used to promote political discourse, users reported that all social media sites are now seen as a suboptimal medium for expressing political opinions (only 9% reported using it for this purpose). Meta, which once promoted this type of content, has deliberately deprioritized political content.
Similar efforts may be driving low numbers across the board, with only 5% of Instagram users and 2% of LinkedIn users expressing political opinions.
TikTok users rated the platform high across many of the metrics we collected, including usability, preferred content, mental health impact, and brand attitude. Conversely, X scored low in all these categories.
Content is king, and given the high overall ratings, it’s not too surprising that TikTok’s content was rated as both more original and more relevant than other platforms. However, X’s content was rated almost as high in originality as TikTok’s, suggesting that X continues to fill a need.
The generally favorable attitudes toward TikTok likely make the case for banning TikTok in the U.S. harder.
Participants who used apps to access their social media completed a simplified version of the Standardized User Experience Percentile Rank Questionnaire for Mobile (SUPR-Qm) and the Net Promoter Score for mobile.
SUPR-Qm scores for the six social media apps hovered around average (53 on a 100-point scale). TikTok was the only app to significantly surpass the others, with the highest SUPR-Qm score (63), while X fell below average for social media apps with the lowest score (47).
Mirroring the SUPR-Qm results, TikTok led the pack with an NPS of 23%, while X fell significantly behind at −51%.
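As a reminder, the NPS is the percentage of promoters (ratings of 9–10 on the 0–10 likelihood-to-recommend item) minus the percentage of detractors (ratings of 0–6). A minimal sketch with made-up ratings:

```python
# Net Promoter Score: % promoters (9-10) minus % detractors (0-6)
# on the 0-10 likelihood-to-recommend item. Ratings below are made up.
def nps(ratings):
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)

ratings = [10, 9, 8, 7, 7, 6, 5, 3, 2, 0]  # hypothetical respondents
print(nps(ratings))  # 2 promoters - 5 detractors out of 10 -> -30.0
```

A platform with many passives (7–8) and a long tail of detractors, like most of the apps here, ends up deeply negative even when usage is high.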
The average NPS for all apps was a paltry −21%, with TikTok having the only positive Net Promoter Score. Why do apps that are so ubiquitously used have such low Net Promoter Scores?
An examination of the open-ended responses suggests that their ubiquity might be why users do not feel compelled to recommend:
“Most phones have it already.” — Facebook user
“All my friends already use it.” — Snapchat user
There were also concerns about the negative societal effects of social media:
“I don’t think anyone should start using Instagram. It’s best to spend as little time on social media as possible.” — Instagram user
“I don’t like using it if I don’t have to so I wouldn’t recommend it.” — X user
“I mostly dislike LinkedIn for what it stands for and not how it functions.” —LinkedIn user
Most people in our sample reported accessing social media via mobile app, though a smaller contingent still uses the desktop, especially for LinkedIn. Of the respondents, 92% reported using the mobile app, whereas only 58% used the desktop website. This lower usage on desktop may explain some of the low SUPR-Q scores, which averaged in the 4th percentile. Websites generally had lower usability ratings, with SUS scores hovering around the historical average of 68.
LinkedIn participants reported struggling with the app’s navigation and usability. Many reported difficulties accomplishing basic tasks, including searching for jobs, sending messages, and updating profile information. TikTok participants also struggled with video lagging, bugs, and technical issues.
In the age of artificial intelligence, the ability to morph and distort physical, audible, and written information means such content is skyrocketing in popularity (Figure 2). It’s no surprise then that these drastic changes are hurtling into our news feeds as quickly as they’re developing in real time.
More than half of respondents reported seeing AI-generated content (58%) and misleading generated content (57%), and about a third reported seeing deepfakes (32%).
The highest percentage of misleading generated content was on Facebook (70%) and Instagram (66%). AI-generated content was reported as most prevalent on TikTok (72%) and X (79%).
LinkedIn and Snapchat had the lowest reports of GenAI content, with 52% and 42% respectively reporting no experience seeing GenAI content, deepfakes, or bots. This is perhaps due to the professional nature of LinkedIn and the close-friend, ephemeral nature of Snapchat.
While most users are encountering AI-generated content on social media, the majority of respondents (76%) reported not using AI to generate content themselves. The largest percentage of people using AI to generate content was on Snapchat (20%), likely due to its wide array of AI-powered offerings such as AI snaps, AI captions, and AI profile backgrounds.
Users have long struggled with the push-and-pull of a trust/distrust relationship with social media. From ongoing legislative battles to docudramas like The Social Dilemma, users have questioned how much personal information to divulge to these platforms.
Table 2 shows that while respondents don’t necessarily think the government should intervene, they believe most platforms besides LinkedIn should regulate themselves more than they currently do. Interestingly, TikTok had the lowest percentage of users agreeing that the government should regulate the platform.
Respondents also emphasized across the board that user accounts should be banned if they incite violence.
Platform | Govt Should Regulate | Self-Regulate | Regulated More | Ban Violent Accounts |
---|---|---|---|---|
FB | 25% | 49% | 45% | 83% |
IG | 20% | 46% | 38% | 75% |
TikTok | 14% | 51% | 39% | 89% |
X | 23% | 63% | 54% | 84% |
Snapchat | 18% | 50% | 42% | 92% |
LinkedIn | 15% | 19% | 13% | 88% |
Average | 19% | 47% | 39% | 85% |
Table 3 shows the percentage of respondents who agreed with various statements about the platforms. The fewest users reported trusting the content on Facebook (11%) and X (13%). Trust scores have been low since we began benchmarking social media platforms in 2012. Facebook has the highest proportion of users who perceive a significant risk of getting scammed (66%), perhaps due to Facebook Marketplace, the classified ad section of the platform. Likewise, few users trust Facebook (0%), LinkedIn (8%), or X (9%) with their credit card number. A surprisingly large number (40%) of users believe that Snapchat is being used to facilitate illegal activities. (Full details are available in the downloadable report.)
Platform | Giving Credit Card | Content is Trustworthy | Scam Risk | Illegal Activities |
---|---|---|---|---|
FB | 0% | 11% | 66% | 34% |
IG | 14% | 18% | 41% | 27% |
TikTok | 16% | 32% | 47% | 12% |
X | 9% | 13% | 48% | 32% |
Snapchat | 12% | 16% | 46% | 40% |
LinkedIn | 8% | 48% | 35% | 6% |
Average | 10% | 23% | 47% | 25% |
With concerns about trust come questions about parental control over social media, a topic hotly debated nationwide. Respondents generally felt that teenagers are the youngest demographic that should have access to social media, with or without parental consent. Most reported that the appropriate age to access social media without parental consent is 16 (34%) or 18 (27%), suggesting little support for younger children using social media unsupervised.
Social media is often blamed for poor mental health, but our analysis suggests a more nuanced picture of how people believe social media affects their mood and self-esteem. Figure 3 shows that respondents were, on balance, more likely to say that social media has a positive rather than a negative effect on their mood. TikTok users reported positive effects nearly four times as often as negative ones. On the other hand, X participants were almost three times as likely to report that using the platform had a negative rather than a positive effect on mood (34% vs. 13%). Is the bad mood on X due to Musk’s takeover? Our 2018 analysis suggests it may be more than just new-owner headaches. In 2018, Twitter also had the worst negative mood rating (24%), and by 2024 the percentage of users saying it worsened their mood had risen to 34%, a movement in the wrong direction. It’s likely that the type of content, such as political discourse and the platform’s dialog-based format, has a large impact on how users feel its usage affects their mood.
While social media companies could improve their platforms for users in a few distinct ways, it seems that much of the negative experience is driven by the users themselves. Social media acts as both a mirror and a megaphone, reflecting and amplifying the world around us. So, perhaps the platforms are not inherently bad, but rather a reflection of who and where we are today.
Our analysis of the user experience of six social media platforms found:
For more details, see the downloadable report.
How many times have you heard that question?
How many different answers have you heard?
After you sift through the non-helpful ones, probably the most common answer you’ve heard is five. You might have also heard that these “magic 5” users can uncover 85% of a product’s usability issues. Is that true? Are five enough, too few, or too many?
How can you know? Can you really know?
Or are we just resigned to hearing the most dogmatic voices on social media? What are the alternatives?
Perhaps we should average the advice of others or make our lives easier by sidestepping the question altogether.
We’ve seen both approaches taken. But is there a better way to find sample sizes?
And is there a single sample size that is right for all usability studies?
You probably know the answer: One sample size does not fit all studies. Not much of a surprise there. But there is a way to get to a sample size that doesn’t involve democracy or demagoguery.
The first step in finding a sample size is to define the study type. For the purposes of sample size estimation, there are three types of usability studies: Problem Discovery, Estimation, and Comparison (Table 1).
# | Type | Purpose | Example | Formative or Summative |
---|---|---|---|---|
1 | Problem Discovery | Finding Problems and/or Insights | What are the usability problems for the check-out flow? | Formative |
2 | Estimation | Estimating a Value/Parameter | What is the SUS score for all users of a product? | Summative |
3 | Comparison | Making a Comparison | Is there a difference in SUS scores or is the score above average? | Summative |
In contrast to the focus on measurements taken during summative user research (study types 2 and 3), the goal of problem discovery usability studies (type 1) is to discover and enumerate the problems that users have when performing tasks with a product. It’s considered a formative type of evaluation.
So, what’s the sample size for each study type? 5, 50, 100?
While defining the study type helps narrow the proper approach to sample size estimation, it still doesn’t warrant recommending one number. Because there’s math involved, it’s understandable that people seek a simple single number. We’ve been trained to find a single answer to simple math problems: 2+2 always equals 4. The square root of 9 is always 3. The answer is determined because there aren’t any variables—life is great!
As soon as you introduce variables, however, things get more complicated. The hypotenuse of a right triangle is always equal to the square root of the sum of the squares of the other two sides (a² + b² = c²), but the actual length of the hypotenuse depends on the lengths of those sides.
The methods for finding sample sizes for summative studies are typically taught in university statistics classes. Those methods include several variables whose values can differ from study to study, including alpha and beta decision criteria (which control the long-run probability of Type I and Type II errors), the standard deviation of the metric, and the smallest difference that you need to detect to make the necessary decisions (i.e., the critical difference). Changing any of these variables will change the sample size needed to meet the requirements.
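As an illustration of how those variables interact, here is the standard normal-approximation formula for the sample size needed to compare two independent group means. This is a textbook formula, not one specific to this article, and the SUS standard deviation of 17.7 used in the example is an assumption:

```python
from statistics import NormalDist
from math import ceil

def n_per_group(sd, critical_diff, alpha=0.05, beta=0.20):
    """Approximate n per group for a two-sided, two-sample comparison of
    means using the normal approximation (a standard textbook formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Type I criterion
    z_beta = NormalDist().inv_cdf(1 - beta)        # Type II criterion (power)
    return ceil(2 * ((z_alpha + z_beta) * sd / critical_diff) ** 2)

# E.g., to detect a 10-point SUS difference, assuming sd = 17.7:
print(n_per_group(sd=17.7, critical_diff=10))  # 50 per group
```

Tightening alpha, increasing power, assuming a larger standard deviation, or needing to detect a smaller difference each drives the required sample size up, which is why no single number fits all summative studies.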
Problem discovery sample sizes use a less familiar approach. We’ve discussed in previous articles the mathematics commonly used to derive sample sizes for formative problem discovery usability studies and how well that math matches reality.
So, what is the formula for finding sample sizes for problem discovery studies?
While you don’t need to fully understand the derivation of the formula to use it, it helps to know what its elements represent. It has only two elements: n and p.
P (at least once) = 1 − (1 − p)^{n}
Here, p is how likely a problem (or event) is to occur in the tested population, and n is the sample size. The formula computes the probability of seeing the problem at least once in a formative usability study with n participants.
Technical note: We manipulated the binomial probability formula to get to 1 − (1 − p)^{n}, but there are other ways to arrive at this formula, including the Poisson probability formula and capture-recapture models.
The formula above computes the probability of detecting a problem given a sample size and its frequency in the population. It can be rearranged using algebra to solve for the sample size.
Because n is an exponent in the formula, it’s necessary to use logarithms to solve for the sample size instead of the probability of discovering the event of interest at least once. The resulting formula is:

n = ln(1 − P(at least once)) / ln(1 − p)
Don’t worry too much about the formula other than to note that it shows that the sample size for a discovery study is driven by the discovery goal (P(at least once)) and how likely an event is to happen during the discovery (p).
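Both directions of this relationship translate directly into a few lines of code (the function names are ours; rounding n up guarantees the discovery goal is met):

```python
from math import ceil, log

def p_at_least_once(p, n):
    """Probability of seeing an event of likelihood p at least once in n tries."""
    return 1 - (1 - p) ** n

def sample_size(goal, p):
    """Smallest n whose probability of at least one sighting meets the goal."""
    return ceil(log(1 - goal) / log(1 - p))

print(round(p_at_least_once(0.31, 5), 3))  # ~0.844 with 5 users and p = 31%
print(sample_size(0.85, 0.31))             # 6 (5.11 rounds up to meet the goal)
```

Note that strictly rounding up gives six participants rather than the traditional five; 5.11 is usually rounded down in informal discussions of the “magic number 5.”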
As mentioned above, in the best-known rule of thumb for usability study sample sizes, the “magic number 5,” the claim is that five participants are enough for the discovery of 85% of usability problems (strictly speaking, 85% of the problems that are available for discovery given the constraints of the study regarding the sampled population and tasks).
Nothing is inherently right or wrong with a discovery goal of 85%. It deviates from the more expected convention of 95% or 90% used in confidence intervals, but like a confidence level, the discovery goal can take any value from 1% to 99%. So, where did 85% originally come from?
Several early investigations into using these formulas to predict problem discovery rates as a function of sample size (e.g., Virzi, 1990; Nielsen & Landauer, 1993) reported finding that four or five participants discover 80–85% of the problems in large-sample usability studies. Over time, these findings became the simplified “magic number 5” rule.
An early test of the simple goal of 85% discovery was an economic ROI simulation published in 1994 (by Jim) that estimated the costs associated with running additional participants, fixing problems, and failing to discover problems in formative usability studies. Although all the independent variables influenced the sample size at the maximum ROI, the variable with the broadest influence was the average likelihood of problem discovery (p), which also had the strongest influence on the percentage of problems discovered at the maximum ROI. The results indicated that, when the target value of p is small (e.g., 10%), practitioners should plan to discover about 86% of the problems available for discovery in the study. When p is greater (e.g., 25–50%), the appropriate goal is about 98% discovery.
Things get trickier when estimating how often events of interest occur during a study. A common estimate of that likelihood is 31%. But where did that come from?
In the research Jakob Nielsen and Thomas Landauer published in 1993, which was the basis of their recommendation for running formative usability studies with five participants, the value they computed for the likelihood of problem occurrence was .31.
This was the average of the problem discovery rates reported in 11 usability studies they had conducted or had acquired from other researchers at the time (including one from Jim Lewis—see Figure 1 for the correspondence between Nielsen and Lewis in 1991). When they used their version of 1 − (1 − p)^{n} and graphed the expected percentage of discovery for sample sizes from 1 to 15 and p = 31%, their estimated discovery rate was 85% when n = 5.
If you plug .85 and .31 into the sample size formula, you get:
n = ln(1 − .85)/ln(1 − .31) = (−1.897)/(−0.371) = 5.11
So, math supports running five participants in a discovery study if (1) the discovery goal is 85% and (2) the probability of the occurrence of an event of interest is 31%. (You can also use our online calculator, which will do the math for you.)
But as mentioned above, one size does not fit all. What if, in your research context, you need to discover more or fewer than 85% of the events of interest, and what if their probability of occurrence is less or greater than 31%?
In those cases, you need a size chart, analogous to shopping for a men’s dress shirt to fit a given neck size and sleeve length (desired discovery rate and problem likelihood). We’ll publish that size chart in a future article.
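Until then, the formula makes it easy to generate a rough chart of your own; the grid of discovery goals and p values below is just an example:

```python
from math import ceil, log

def sample_size(goal, p):
    """Smallest n with probability >= goal of seeing an event of likelihood p."""
    return ceil(log(1 - goal) / log(1 - p))

goals = [0.85, 0.90, 0.95]        # discovery goals
probs = [0.05, 0.10, 0.31, 0.50]  # event likelihoods (p)

print("p \\ goal  " + "  ".join(f"{g:.0%}" for g in goals))
for p in probs:
    row = "  ".join(f"{sample_size(g, p):3d}" for g in goals)
    print(f"{p:>7.0%}   {row}")
```

For example, detecting a rare event (p = 5%) at least once with 95% confidence takes 59 participants, while the familiar 85%/31% combination needs only about five or six.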
How many participants do you need for a usability study?
It depends first on the study type. There are three study types—discovery, estimation, and comparison. In contrast to estimation and comparison studies, sample size estimation for discovery studies uses a different mathematical approach.
It still depends within study types. Don’t rely on averaging together recommendations or looking for a single number that will always work even when focusing within a study type such as discovery.
What about the “magic number 5”? The controversial claim based on the research of Nielsen and Landauer that “five is enough” turns out to sometimes be true, but only for a limited range of research contexts.
What about any other magic number? Because the appropriate sample size for discovery studies depends on two factors, no one magic number will be appropriate for all research contexts. In fact, there is no magic number for sample sizes for any type of usability study, formative or summative.
Use the formula for problem discovery. The problem discovery formula can be used to find the sample size based on expected problem occurrences (p) and the likelihood of seeing a problem at least once. You can also use the online calculator.
Parameters have defaults but should be changed when necessary to fit the research needs. The typical value for the discovery goal is 85%, but it can be increased or decreased depending on the context. The default of 31% for the probability of problem occurrence came from an average across datasets from the 1990s. It’s not a bad place to start, but it shouldn’t be the only value considered; values of 10%, 20%, or even 5% may make sense depending on how important it is to discover uncommon problems.
If there isn’t a magic number, should we give up on sample size estimation for formative usability studies? Giving up on magic numbers doesn’t mean you have to give up on sample size estimation for formative usability studies (or any other type of discovery study). You just need to be able to make decisions about (1) how rare of an event you need to be able to detect at least once and (2) what percentage of those events you need to discover in the study.
Bottom line: It would be nice if this process were simpler, but unfortunately, one sample size does not fit all research requirements. Fortunately, there is a mathematical model that can guide UX professionals to make reasoned decisions about sample size requirements for formative usability studies.
Clutter can make a space feel stressful and make it hard to find things.
But it’s not just your mother talking about clutter. We often use the same language to describe digital spaces like websites.
In our UX research practice, we have frequently encountered users and designers criticizing website interfaces for being cluttered and stakeholders who worry about the experiential and business consequences of a cluttered website.
But what exactly does it mean for a website to appear cluttered? Is the Wayfair home page (Figure 1) cluttered?
How about the JetBlue home page (Figure 2)?
It’s one thing to casually describe something with a word like clutter; it’s another thing to measure it. In this article, we describe our search for a way to quantify the perception of clutter on websites.
So, what do we mean when we say something is cluttered? Dictionary definitions of clutter tend to equate it with messiness or untidiness. As a transitive verb, the Merriam-Webster definition is “to fill or cover with scattered or disordered things that impede movement or reduce effectiveness” and, as a noun, “a crowded or confused mass or collection.” The Oxford Dictionary’s verb and noun definitions are, respectively, “to crowd (a place or space) with a disorderly assemblage of things” and “a crowded and confused assemblage.”
These definitions do not address two potential components of clutter. One component is the extent to which the disorganized objects are needed but should be better arranged (e.g., tools in a toolbox). The other is the extent to which some objects are unnecessary and should be discarded (e.g., old candy bar wrappers in a toolbox). This distinction is sometimes brought out in definitions of declutter (e.g., “to declutter is to tidy up a mess, especially by getting rid of objects … clean and organize a space”).
Even in this everyday sense, these two components of clutter suggest different decluttering strategies: reorganizing needed objects so they are easier to find and use, and discarding unnecessary objects.
There is a long history of defining and measuring clutter in user interface design, especially for mission-critical applications (e.g., aircraft cockpit displays), drawn from research in disciplines like human factors engineering and perceptual psychology. In most cases, this research focused on objective rather than subjective measurements of clutter.
Tom Tullis (1984) published an early review and analysis on how to objectively measure clutter in the monochrome (green screen) alphanumeric displays used in the 1970s and early 1980s. He identified four basic format characteristics: overall density, local density, grouping, and layout complexity.
He explored different ways to objectively measure these characteristics that, along with the reviewed literature, supported several key design recommendations, many of which remain surprisingly relevant for modern website design.
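As an illustrative sketch of the simplest of these measures, overall density can be computed as the percentage of non-blank character cells. This toy function is ours and omits the weighting schemes Tullis examined for measures like local density:

```python
def overall_density(screen):
    """Overall density: the percentage of character cells that are
    non-blank, for a screen given as a list of equal-length rows."""
    total = sum(len(row) for row in screen)
    filled = sum(1 for row in screen for ch in row if ch != " ")
    return 100.0 * filled / total

# A toy 4-row by 18-column alphanumeric display.
screen = [
    "NAME:  JONES      ",
    "DEPT:  SHIPPING   ",
    "                  ",
    "STATUS: ACTIVE    ",
]
print(overall_density(screen))  # 50.0
```

Local density would instead score each filled cell by how many of its neighbors are also filled, so two screens with the same overall density can differ sharply in local density.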
Using an objective approach to the measurement of clutter based on perceptual psychology, Rosenholtz et al. (2007) evaluated three metrics: feature congestion, subband entropy, and edge density.
They found that these three measures correlated with different empirical measures of search performance (e.g., searching for objects in cluttered maps or on cluttered desks). They also reported that color variability (the number of colors and how different they are) affected visual clutter, suggesting design recommendations such as limiting the number and variability of colors.
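Of these metrics, edge density is the easiest to approximate in a few lines. The sketch below is our simplified stand-in, thresholding gradient magnitude rather than running the full edge detector used in the published work:

```python
import numpy as np

def edge_density(gray, threshold=0.1):
    """Proportion of pixels whose local gradient magnitude exceeds a
    threshold, for a 2-D grayscale image with values in [0, 1]."""
    gy, gx = np.gradient(gray.astype(float))
    return float(np.mean(np.hypot(gx, gy) > threshold))

flat = np.zeros((32, 32))                                  # a blank screen
stripes = np.tile((np.arange(32) // 4 % 2).astype(float),  # bold vertical stripes
                  (32, 1))
print(edge_density(flat))     # 0.0
print(edge_density(stripes))  # 0.4375
```

The intuition carries over to websites: a page dense with borders, text, and images produces many high-gradient pixels and hence a high edge density.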
In the context of advanced cockpit displays, Kaber et al. (2008) developed a subjective clutter questionnaire. Their participants were four expert test pilots with experience using advanced heads-up displays (HUDs). They rated the clutter of images of a flight approach scenario depicting multiple display conditions. The initial version of the clutter questionnaire contained 14 semantic differential items gleaned from a literature review of display clutter (e.g., sparse/dense, monochromatic/colorful, empty/crowded, ungrouped/grouped). After each trial in the experiment, participants provided a single rating of overall clutter (20-point scale from “low clutter” to “high clutter”) and rated each of the 14 semantic differentials regarding their utility for describing clutter (20-point scale from “low” to “high”). Exploratory factor analysis indicated that the 14 items aligned with four components.
In a review of definitions and measurement of display clutter, Moacdieh and Sarter (2015) wrote, “Despite the widespread agreement on the harmful nature of ‘clutter,’ researchers have yet to reach consensus on a definition and a reliable way of manipulating and measuring the phenomenon.”
Their primary goal was to investigate the literature for definitions of clutter and for metrics of its effects on visual search performance. Common definitions involve display density (the number of entities on a screen), display layout (the arrangement, nature, and color of entities), target background/distractor similarity, task irrelevance (both essential and nonessential entities are displayed), and performance/attentional costs. Approaches to measurement include image processing, performance evaluation, eye tracking, and subjective evaluation (perceived clutter).
In the Moacdieh and Sarter (2015) review, most researchers who measured perceived clutter did so with a single rating of overall clutter. A notable exception was the questionnaire developed by Kaber et al. (2008).
Despite the clear value of the Kaber questionnaire in its intended context (professional pilots familiar with aircraft displays and associated technical terminology such as redundant/orthogonal), it does not seem well suited to assessing the perceived clutter of websites.
We now turn to the more familiar domain of website design. The term “clutter” seems to be part of the website design vernacular, evident in online articles that discuss how to declutter websites.
Even though typical user goals and behaviors with websites (e.g., browsing for information, making online purchases) differ from those of pilots using displays to land aircraft, many of the recommendations in these articles are consistent with the design guidance implied by clutter research in other domains.
We conducted a search of the peer-reviewed literature specifically targeting standardized questionnaires for the assessment of perceived website clutter, but there were no relevant results. We did, however, find relevant research in the fields of marketing and advertising regarding the extent to which online ads contribute to the perception of clutter on websites. This is a continuation of lines of research originally conducted on magazines and television (Speck & Elliott, 1997) in which a primary objective metric is the proportion of advertisements in the total space of a medium (Kim & Sundar, 2010).
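That objective metric, the share of a page’s space given over to ads, is simple to sketch. The function below is our illustration and assumes axis-aligned, non-overlapping ad regions measured in pixels:

```python
def ad_proportion(page_w, page_h, ad_boxes):
    """Fraction of page area covered by ads, where ad_boxes is a list
    of (width, height) pairs assumed not to overlap."""
    ad_area = sum(w * h for w, h in ad_boxes)
    return ad_area / (page_w * page_h)

# A 1280 x 2000 px page with a 728x90 leaderboard and a 300x600
# half-page ad: ads cover just under 10% of the page.
print(round(ad_proportion(1280, 2000, [(728, 90), (300, 600)]), 3))  # 0.096
```

Real pages complicate this picture (scrolling changes the effective page area, and ads can overlap content), but the ratio gives a first objective anchor for perceived ad clutter.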
Using a standardized questionnaire developed for assessing consumer reaction to online ads (specifically, the constructs of perceived intrusiveness, irritation, informativeness, and entertainment value), Edwards et al. (2002) reported that ads perceived as intrusive elicited irritation and ad avoidance. Interruptive ads that occur during an online shopping task have been found to increase primary task time, with early interruptions being more disruptive than later interruptions (Xia & Sudharshan, 2002).
Forced presentation of ads irritates users, especially when ads are not skippable; when ad clutter is high, however, skippability doesn’t reduce irritation (Senarathna & Wijetunga, 2023). Experimental manipulation of ad location and relevance found that both factors affect the perception of ad clutter (Kim & Sundar, 2010).
Brinson et al. (2018) investigated why consumers install ad blockers, noting that “To discourage the use of ad blockers, publishers and ad industry leaders have been experimenting with a variety of methods to improve users’ experiences—from decluttering websites to developing less intrusive ad formats.” They found concerns about information privacy influence attitudes toward personalized advertising when messages are hyper-targeted based on too many layers of personal data—ads often described as “creepy.”
Based on this research, web design guidelines relevant to advertisements include limiting the number and intrusiveness of ads, making ads skippable, and avoiding hyper-targeted personalization that users perceive as creepy.
In short, website designers face numerous challenges regarding the management of perceived clutter. An effective ad strategy is critical for many websites, and failing to strike an appropriate balance between corporate and user needs can lead to negative impressions of the website and its parent enterprise. Website designers must also deal with more traditional design elements associated with perceived clutter, such as density, white space, logical grouping, layout complexity, and color.
This literature review serves as a starting point in our search for a measurement of perceived clutter. We will cover our next steps in future articles.
In this literature review of the construct of clutter, we found:
The everyday conception of clutter includes two components. The perception of clutter can be driven by a disorganized collection of needed objects and/or the presence of unnecessary objects. These components suggest different decluttering strategies—reorganizing needed objects and discarding unnecessary objects.
Research on the measurement of clutter in UI design has mostly focused on objective measurement. Early research on alphanumeric displays explored metrics such as overall density, local density, grouping, and layout complexity. Later research evaluated metrics based on perceptual psychology like feature congestion, subband entropy, and edge density.
No standardized questionnaires are currently available for the measurement of perceived clutter on websites. There is a published questionnaire for subjective clutter in advanced cockpit displays, but its technical items and factors do not seem to be appropriate for assessing consumer websites. A more promising line of research comes from the fields of marketing and advertising regarding consumer reaction (positive and negative) to online advertisements.
Bottom line: This literature review covered past work in the measurement of clutter, both objective and subjective, in the research domains of the presentation of information on displays and the influence of advertisements on user experiences. This review is a first step in the search for a clutter metric for websites.
For more details about this research, see the paper we published in the International Journal of Human-Computer Interaction (Lewis & Sauro, 2024).