MeasuringU

What Is the Difference Between Ease and Satisfaction?

Jeff Sauro, PhD • Jim Lewis, PhD — Tue, 02 Dec 2025 22:18:10 +0000

“Satisfaction” is used rather broadly in vernacular speech.

We can feel satisfied with a meal, a movie, or a moment.

Our feeling of satisfaction blends utility (it fed me), affect (I enjoyed it), and expectation (it lived up to or exceeded what I wanted).

The dessert, the movie ending, or the moment can all be satisfying or unsatisfying. And we apply that same idea to measuring satisfaction in consumer and user research.

Customer or consumer satisfaction has a rich history going back almost a century. In the 1940s, satisfaction was used as a marketer’s goal for meeting consumer needs and wants (e.g., Duddy and Revzan, 1947, where they described “Dissatisfaction … because of defects in manufacture or design”).

Day (1977) and his colleagues began a modern tradition of research aimed at understanding satisfaction and related post-choice constructs.

The historical focus of UX research, in contrast, has been more on ease of use than on general satisfaction because ease is a key experiential driver of the broader construct of satisfaction.

However, because modern UX research often overlaps with traditional market research, UX research professionals may need to measure or at least consider customer satisfaction in testing efforts such as benchmarks or surveys.

In this article, at a conceptual level, we discuss the measurement of satisfaction and perceived ease of use in UX research at the study (overall) level and as post-task metrics.

Study Level

At the study level, measuring one or both of overall satisfaction and ease of use is fairly common, but market research emphasizes satisfaction and UX research emphasizes ease of use.

Study-Level (Product) Satisfaction

Satisfaction can be measured for almost anything: a brand, a product, a feature, a website, or a service experience. Depending on the target, it can be referred to as CSAT (customer satisfaction) or product satisfaction.

In our taxonomy of UX metrics, satisfaction is most commonly collected at the study level (Figure 1), meaning it’s collected only once to assess the overall product experience. It can be measured using a simple seven- or eleven-point scale or one of the industry metrics, such as the Microsoft Net Satisfaction score (NSAT) or the American Customer Satisfaction Index (ACSI, which is a proprietary mix of three aspects of satisfaction).

Figure 1: Study-level metrics from the MeasuringU taxonomy of UX metrics include four measures of satisfaction.

Study-Level Ease

Perceived ease of use can also be measured at the product level. We have written extensively about the many ways of measuring ease. Most major standardized UX questionnaires designed to assess user experiences at the study level (e.g., SUPR-Q^®, SUS, PSSUQ, UX-Lite^®, and TAM) have at least one item (and sometimes more) dedicated to measuring perceived ease (Figure 2). However, satisfaction with a product conceptually involves more than just finding a product easy to use. Product usefulness, pricing, integration, and support are some of the other common constructs that can affect satisfaction, but ease is a fundamental driver.

Figure 2: Study-level metrics from the MeasuringU taxonomy of UX metrics include numerous standardized questionnaires that measure perceived ease (e.g., SUS, PSSUQ, SUPR-Q, TAM, UMUX, UMUX-Lite, UX-Lite).

Task Level

At the task level, measuring task ease is more common than measuring task satisfaction, largely because task-level evaluation, as conducted in usability studies (a key UX activity), is relatively rare in market research. That said, nothing prevents researchers from including an item to assess task satisfaction in addition to the perceived ease of task completion. We now include both in our taxonomy of UX metrics at the task level (Figure 3).

Figure 3: Attitudinal task-level metrics from the MeasuringU taxonomy showing measures of perceived ease and general satisfaction (along with confidence and workload effort).

Task-Level Satisfaction

While less common than ease ratings, satisfaction ratings can also be collected at the task level. After a participant attempts a task, they can be asked how satisfied they are with the website, the app or product, or a specific functional area.

Task-Level Ease

The Single Ease Question (SEQ^®) is the most commonly used post-task measure of ease (with over 600 citations). This single-item scale provides a reliable and sensitive measure of the perceived ease or difficulty of a task experience.

Satisfaction vs. Ease

At both the study and task levels, satisfaction is conceptually a broader construct than ease. We often depict satisfaction as a consequence of several variables along with ease, including, to name a few, usefulness, price, and quality. Under this broader construct, people can be satisfied with a product or experience but not necessarily think it’s easy to use. In that case, higher satisfaction scores would have to come from aspects of the product other than ease, such as great features, low price, or wide availability. While the two differ conceptually, we’ll explore empirically how much measures of satisfaction and measures of ease actually differ in a future article.

Historical Role of Satisfaction in UX Research

Although satisfaction is often considered a primary measure for market researchers, it also played an early and fundamental role in the definition of UX metrics, reflected in the high-level constructs of the first international usability standard, ISO-9241 Part 11. In that standard, the specified components of usability were effectiveness, efficiency (objective metrics), and satisfaction (a subjective metric).

The use of these constructs was common before the publication of the ISO standard in 1998, dating back to at least the early 1980s. The first draft of the standard appeared in 1988, and specific metrics were investigated in the European MUSiC project.

ISO-9241 Part 11 and other early UX standards specified measurement of satisfaction but did not usually require specific metrics. The definition of satisfaction from ISO-9241-11 was “freedom from discomfort and positive attitudes towards the use of the product.” Regarding the measurement of satisfaction, the 2001 ANSI standard Common Industry Format for Usability Test Reports stated (p. 11):

A variety of instruments are available for measuring user satisfaction of software interactive products, and many companies create their own. Whether an external, standardized instrument is used or a customized instrument is created, subjective rating dimensions such as Satisfaction, Usefulness, and Ease of Use should be considered for inclusion, as these will be of general interest to customer organizations. A number of questionnaires are available that are widely used. They include: ASQ, CUSI, PSSUQ, QUIS, SUMI, and SUS. While each offers unique perspectives on subjective measures of product usability, most include measurements of Satisfaction, Usefulness, and Ease of Use.

So, in the early days of UX research and measurement, especially in the context of usability testing, the concept of satisfaction was broad and included measures of ease of use as proxies of satisfaction.

The After-Scenario Questionnaire (ASQ)

While the Single Ease Question (SEQ) has become the industry standard for measuring post-task ease, prior to its development, we used the After Scenario Questionnaire (ASQ).

The three-item ASQ was developed by Jim Lewis and colleagues in 1988 as part of an IBM Research system usability metrics project (Figure 4). Although the item format is completely different from the standard SEQ (a lower rating indicates better experience, agreement rather than item-specific endpoints, and satisfaction-based wording), it contains a single-item seven-point measure of perceived ease.

Consistent with the emerging definition of usability at the time of its development, all the items of the ASQ, including the ease item, were framed in the context of satisfaction but were directed at potential drivers of satisfaction at the task level. Over time, the focus in UX practice shifted away from this nod to satisfaction and streamlined the wording of items to focus on their distinctive content (e.g., the SEQ).

Figure 4: The ASQ (from 1990).

Satisfaction and the Single Usability Metric (SUM)

When we wrote about the Single Usability Metric (SUM) in 2005, we incorporated satisfaction into our model along with objective metrics such as task completion rates and times. Our original three-item measure of satisfaction was based on the ASQ, retaining ratings of task ease, task completion time, and overall task satisfaction, then averaging the three ratings for a composite satisfaction score. That’s another broad use of satisfaction as a measure of users’ attitude toward the experience, consistent with the guidance in ISO 9241-11 and other usability standards of the time.
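As a rough illustration of the averaging step described above, the composite is simply the mean of the three ratings for each participant. The ratings and column names below are hypothetical, and the sketch covers only the averaging, not the rest of the SUM calculation.

```r
# Hypothetical post-task ratings for three participants (not data from the article)
ratings <- data.frame(
  ease         = c(5, 3, 4),
  time         = c(4, 2, 4),
  satisfaction = c(5, 3, 3)
)

# Composite satisfaction: average the three ratings within each participant
ratings$composite <- rowMeans(ratings[, c("ease", "time", "satisfaction")])
print(ratings)
```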

Future Role of Satisfaction in UX Research

As we consider the future of satisfaction in UX research, we see no reason for UX researchers to avoid satisfaction just because it has historically been a metric tracked by another department. We recommend, when appropriate, including general satisfaction (e.g., CSAT) in the battery of subjective metrics, at least at the study level (and possibly at the task level).

Whether or not ratings of satisfaction and perceived ease of use have a high correlation and correspondence (which we are investigating and plan to report in future articles), outside of UX, they are conceptually distinct. Ease is influential in UX research; satisfaction is influential in CX and market research. This is likely due to the logic that perceived ease affects satisfaction rather than the reverse.

UX research tends to focus on the precursors to satisfaction because these precursors are closely related to the specific designs of products or services. CX and market research tend to focus on satisfaction as a driver of important business metrics (e.g., behavioral intentions such as likelihood to repurchase, actual repurchasing behavior, or profit).

Including the measurement of satisfaction in UX research (when appropriate) would enable the quantification of the extent to which perceived ease of use (and other UX metrics such as perceived usefulness, appearance, and trust) drives satisfaction, a major step in bridging the gap between UX and business metrics.

Summary and Takeaways

Customer satisfaction is a key metric in market research and business.

Historically, a particular conceptualization of satisfaction has also been important in UX research, but the business and UX conceptualizations are not the same.

From the early days of usability testing to the later development of international usability standards, the UX conceptualization of satisfaction has been any attitude toward a product. Of the various attitudes people might have toward a product, ease of use is a major one.

In this article, we’ve discussed the measurement of satisfaction and ease of use at study and task levels and have presented an argument for measuring both in UX research (at least at the study level) to provide a quantitative connection between UX and business metrics.

Note: Our taxonomy is a living document. Like anything else, the popularity of metrics can rise and fall, new methods make formerly difficult metrics easier to collect, and metrics that do not have benchmarks this year might have some next year. We plan to update this infographic over time, so in the future, you can click here for the latest version.

Rake Weighting: How to Weight Survey Data with Multiple Variables

Jim Lewis, PhD • Jeff Sauro, PhD — Wed, 19 Nov 2025 05:19:57 +0000

Having a representative sample is ideal when making inferences about your customer or user population. In practice, it can be difficult to recruit the right proportion of respondents, leaving your sample out of balance with the population.

One way to adjust for being off balance is to weight the data you collected to get the sample back into proportion with the population percentages (such as for variables like age, geographic region, or experience levels).

In previous articles, we’ve discussed whether you should weight survey data before analysis and how to do simple weighting of means and percentages, with only one variable affected (e.g., duration of product experience).

But what if you need to balance more than one demographic variable?

If you have access to all the cross-tabulations in the reference population, you could use a variation of the simple method to create weights for each combination of demographic variables (e.g., males between 30 and 39 years old, females between 40 and 49 years old, etc.). However, that approach quickly falls apart as the number of variables increases for two reasons: (1) the crosstabs for the reference population are usually not available at that level of detail, and (2) even when detailed crosstabs are available, some combinations typically account for very small proportions of the population, making them problematic to model.

A popular alternative approach is rake weighting (also known as rim weighting or iterative proportional fitting). Rake weighting (or just raking) is a statistical technique that adjusts the sample using multiple weights to match known population characteristics. For example, you may want your survey sample to match not solely the age of your customers, but the age, gender, and geographic region simultaneously.

Think of raking a yard to clean up scattered sample variables and align them to the population (Figure 1). You run the rake horizontally to level the ridges across the columns of data. It looks better, but now some rows of data are a little high or low. So, you switch and rake vertically to even those out, but that nudges a few columns off again. You keep alternating, raking down rows, then across columns, with each repetition getting closer to the reference population, stopping when both directions look right.

Figure 1: Conceptual depiction of raking age and gender variables.
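The alternating adjustment in the yard analogy can be sketched in a few lines of base R. This is a minimal illustration of iterative proportional fitting on a two-variable table, not the anesrake implementation. The counts, the target proportions, and the function name rake_table are all hypothetical.

```r
# Hypothetical sample counts (rows = gender, columns = age band); not from the article
counts <- matrix(c(30, 20,
                   10, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("F", "M"), age = c("young", "old")))

# Hypothetical population proportions for each variable (each sums to 1)
row_target <- c(F = 0.5, M = 0.5)
col_target <- c(young = 0.4, old = 0.6)

rake_table <- function(tab, row_target, col_target, tol = 1e-8, max_iter = 100) {
  fitted <- tab / sum(tab)  # work with proportions rather than counts
  for (i in seq_len(max_iter)) {
    # Adjust each row so the row totals match the row targets
    fitted <- fitted * (row_target / rowSums(fitted))
    # Adjust each column so the column totals match the column targets
    fitted <- sweep(fitted, 2, col_target / colSums(fitted), `*`)
    # Stop when the row totals are still on target after the column pass
    if (max(abs(rowSums(fitted) - row_target)) < tol) break
  }
  fitted
}

fitted <- rake_table(counts, row_target, col_target)

# Weight for each cell: fitted proportion divided by observed proportion
cell_weights <- fitted / (counts / sum(counts))
print(round(cell_weights, 3))
```

Every respondent in a given cell receives that cell's weight, so the weighted margins match both sets of targets even though the population cross-tabulation was never supplied.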

In this article, we describe why and how to use rake weighting.

Why Use Rake Weighting?

While raking leaves and soil can be satisfying (zen garden, anyone?), raking isn’t something you should immediately expect to do with survey data in UX research. When thinking of rake weighting, keep the following in mind:

Weight when samples substantially deviate (10%+). Consider weighting when your sample meaningfully differs from the target population. There are no strict guidelines, but by convention, weighting becomes a consideration when key variables deviate by more than about 10%.

Raking is rarely used in UX research. Rake weighting is rarely necessary in UX research because there’s usually no meaningful reference population, and demographic variables often have little effect on measures of user experience (e.g., The System Usability Scale: Past, Present and Future, pp. 586–587). The most frequent use of raking is in political polling, a research context where there are known demographic distributions. For market research and some types of UX research, there are a few sources for demographics of users of popular apps, such as Snapchat and TikTok, that would make better reference populations than the U.S. census.

Advantages of Rake Weighting

When there is a need to weight data on multiple demographic variables, a compelling advantage of rake weighting is that there is no need to know all the crosstabs, just the sample and population proportions for each variable.

Another advantage is that several computer programs are available that perform rake weighting, including at least three implemented in R: survey, svyweight, and anesrake. (If you need help getting started with R, numerous resources are online.) When we do complex weighting, we generally use anesrake, which is an implementation of the ANES (American National Election Study) weighting method.

How to Use Rake Weighting

To use rake weighting, you first need to demonstrate a need for weighting. If necessary, you next need to get the weights, and then you need to apply them.

In this section, we’ll work with data from our 2024 SUPR-Q^® survey of social media platforms. We recruited 324 participants in August 2024 to reflect on their most recent experience with one of six social media platforms: Facebook, Instagram, LinkedIn, Snapchat, TikTok, and X. We were interested in a wide range of UX topics (e.g., overall quality of experience, levels of trust, impact on mood and self-esteem). In this article, our examples focus on the measurement of brand attitude and reluctance to engage in political discourse on the platforms.

Also, for our example, we use demographic distributions of the adult U.S. population (18 years of age and older) as the reference population for gender, age, and income because it’s commonly used for that purpose in many research contexts. Note that we do not recommend this as good practice for UX research because, as mentioned above, the entire U.S. population is rarely the target audience for a specific product or service, and demographic variables often have little effect on UX metrics. It does, however, work well in our examples here as a quick check of the value (or not) of employing this kind of demographic weighting in future SUPR-Q surveys.

Comparing Survey and U.S. Population Demographics

Tables 1, 2, and 3 show the distributions of gender, age, and income in our sample and the U.S. population.

As shown in Table 1, there were representation discrepancies for female (overrepresented), male (underrepresented), and nonbinary (overrepresented) respondents. The slight overrepresentation of nonbinary respondents is consistent with the greater percentage of younger respondents in the U.S. who identify as something other than traditional gender models.

Gender	U.S.	Sample	Difference
Female	50.5%	61.0%	−10.5%
Male	48.5%	35.0%	13.5%
Nonbinary	1.0%	4.0%	−3.0%

Table 1: Gender demographics.

Table 2 shows that younger participants (18–39) were overrepresented, and older participants (60+) were underrepresented.

Age	U.S.	Sample	Difference
18–24	12.8%	21%	−8%
25–29	8.6%	20%	−11%
30–39	17.1%	31%	−14%
40–49	15.5%	16%	0%
50–59	16.4%	10%	6%
60+	29.7%	2%	28%

Table 2: Age demographics.

Surprisingly, as shown in Table 3, there was relatively little difference in the income distributions.

Income (K$)	U.S.	Sample	Difference
0–24	19%	15%	4%
25–49	20%	22%	−2%
50–99	30%	33%	−3%
100–149	15%	17%	−2%
150–199	8%	8%	0%
200+	8%	5%	3%

Table 3: Income demographics.

The sample and reference population differences in gender and age distributions were large enough to justify weighting (assuming comparison with a suitable reference population). For the purposes of this exercise (to demonstrate rake weighting with several demographic variables), we used all three: age, gender, and income.
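The comparison in Tables 1 and 2 can be reproduced with a short check like the one below. It is a sketch that treats the rough 10% convention as a difference of 10 percentage points, which is our reading rather than a stated rule. The function name flag_deviations is hypothetical, and the proportions are copied from the tables.

```r
# Proportions copied from Tables 1 and 2 (U.S. reference vs. sample)
gender <- data.frame(
  level  = c("Female", "Male", "Nonbinary"),
  us     = c(0.505, 0.485, 0.010),
  sample = c(0.610, 0.350, 0.040)
)
age <- data.frame(
  level  = c("18-24", "25-29", "30-39", "40-49", "50-59", "60+"),
  us     = c(0.128, 0.086, 0.171, 0.155, 0.164, 0.297),
  sample = c(0.21, 0.20, 0.31, 0.16, 0.10, 0.02)
)

# Flag levels whose sample share differs from the reference share by at
# least `threshold` (0.10 = 10 percentage points, an assumed cutoff)
flag_deviations <- function(tab, threshold = 0.10) {
  tab$difference <- tab$us - tab$sample
  tab$flag <- abs(tab$difference) >= threshold
  tab
}

gender_check <- flag_deviations(gender)
age_check <- flag_deviations(age)
print(gender_check)
print(age_check)
```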

Getting Weights

As mentioned above, we use the R package anesrake when we need weights based on multiple demographic or other respondent variables. Using the package itself is reasonably simple, but it requires input data with specific properties. Table 4 shows the first rows of an Excel file with the data we used for these examples.

RespID	Platform	Gender	GenderNum	AgeGroup	AgeNum	Income	IncomeNum	Brand	Pol	PolBot2
3	X	Female	1	25-29	2	$50k-$99k	3	2	1	1
4	X	Female	1	40-49	4	$50k-$99k	3	3	1	1
5	TikTok	Female	1	30-39	3	$25k-$49k	2	6	2	1
6	Facebook	Female	1	30-39	3	$100k-$149k	4	7	3	0
7	Instagram	Female	1	18-24	1	$0-$24k	1	1	4	0
8	Facebook	Male	2	50-59	5	$25k-$49k	2	4	1	1
9	Snapchat	Male	2	18-24	1	$50k-$99k	3	5	1	1
10	LinkedIn	Male	2	30-39	3	$0-$24k	1	5	3	0

Table 4: First eight rows of the sample Excel file.

Our first steps in the process to prepare this data for anesrake were to identify the R packages to install, document the key for matching variable choices to numbers (anesrake likes to work with numbers), then load the libraries with the following R script:

#Install libraries if needed:

install.packages("openxlsx")
install.packages("anesrake")
install.packages("weights")

#Key for matching variables to numbers for this example (optional but recommended)
#GenderNum: 1=Female, 2=Male, 3=Nonbinary
#AgeNum: 1=18-24, 2=25-29, 3=30-39, 4=40-49, 5=50-59, 6=60-69
#IncomeNum: 1=$0-$24k, 2=$25k-$49k, 3=$50k-$99k, 4=$100k-$149k, 5=$150k-$199k, 6=$200k+

#R script starts here after loading libraries
library(openxlsx)
library(anesrake)
library(weights)

Next, we used the read.xlsx function of the openxlsx package to put the Excel data into an R data frame and verified that analysis of the data frame produced the expected percentages of the levels of the demographic variables from the sample (it did):

#Read the data from the Excel file
dat <- read.xlsx("/Users/Jim/Documents/MeasuringU/Benchmarks/2024/Social Media/Social Media 2024 Rake Weights Exercise.xlsx", sheet = "SocialMedia")

#Verify the expected percentages (output in Figure 2)
wpct(dat$GenderNum)
wpct(dat$AgeNum)
wpct(dat$IncomeNum)

Figure 2: Observed percentages from the sample (consistent with Tables 1, 2, and 3).

Next, make sure the data columns in the data frame are defined as factors:

# Make sure data columns are factors
dat$GenderNum <- factor(dat$GenderNum, levels = 1:3,
  labels = c("Female", "Male", "Nonbinary"))
dat$AgeNum <- factor(dat$AgeNum, levels = 1:6,
  labels = c("18-24", "25-29", "30-39", "40-49", "50-59", "60-69"))
dat$IncomeNum <- factor(dat$IncomeNum, levels = 1:6,
  labels = c("$0-$24k", "$25k-$49k", "$50k-$99k", "$100k-$149k", "$150k-$199k", "$200k+"))

Then set the targets obtained from the reference population:

#Set targets
targets <- list(
GenderNum = c(Female = 0.505, Male = 0.485, Nonbinary = 0.010),
AgeNum = c(`18-24` = 0.128, `25-29` = 0.085, `30-39` = 0.171,
`40-49` = 0.155, `50-59` = 0.164, `60-69` = 0.297),
IncomeNum = c(`$0-$24k` = 0.19, `$25k-$49k` = 0.20, `$50k-$99k` = 0.30,
`$100k-$149k` = 0.15, `$150k-$199k` = 0.08, `$200k+` = 0.08)
)
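Before raking, it can help to confirm that each set of targets sums to 1 and that the target names line up with the factor labels. The sketch below is an optional sanity check in base R, not part of the original script. The list example_levels is a hypothetical stand-in for the levels of the factor columns, and requiring identical order is a deliberately strict check.

```r
# Targets copied from the script above
targets <- list(
  GenderNum = c(Female = 0.505, Male = 0.485, Nonbinary = 0.010),
  AgeNum = c(`18-24` = 0.128, `25-29` = 0.085, `30-39` = 0.171,
             `40-49` = 0.155, `50-59` = 0.164, `60-69` = 0.297),
  IncomeNum = c(`$0-$24k` = 0.19, `$25k-$49k` = 0.20, `$50k-$99k` = 0.30,
                `$100k-$149k` = 0.15, `$150k-$199k` = 0.08, `$200k+` = 0.08)
)

# Each set of target proportions should sum to 1
target_sums <- sapply(targets, sum)
print(target_sums)

# Hypothetical stand-in for the levels of the factor columns
example_levels <- list(
  GenderNum = c("Female", "Male", "Nonbinary"),
  AgeNum = c("18-24", "25-29", "30-39", "40-49", "50-59", "60-69"),
  IncomeNum = c("$0-$24k", "$25k-$49k", "$50k-$99k",
                "$100k-$149k", "$150k-$199k", "$200k+")
)

# Strict check: target names equal the factor levels, in the same order
names_match <- mapply(function(target, lv) identical(names(target), lv),
                      targets, example_levels)
print(names_match)
```

With the real data frame, the stand-in list could be replaced by calls such as levels(dat$GenderNum).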

The last step is to define case IDs for the data:

#Define case IDs starting from 1 to the length of any of the variables
dat$caseid <- 1:length(dat$GenderNum)

Now everything is ready to run anesrake and review a summary of its output (see Figure 3), then print the weights for easy copying into the Excel file (see Figure 4). The required arguments for the anesrake function are targets, dat, and caseid (for a full description of the required and optional arguments, see the anesrake specification):

#Calculate weights and print summary of results
outsave <- anesrake(targets, dat, caseid=dat$caseid)
summary(outsave)

Figure 3: Key outputs of the anesrake summary showing the number of iterations to convergence (52) and the resulting weighted and unweighted counts and percentages.

# Add the weights returned by anesrake to the data frame, then print them for each case to review
dat$Weights <- outsave$weightvec
dat["Weights"]

Figure 4: Weights for the first eight cases.

Finally, print the entire data frame to an Excel file to include the weights in the last column (Figure 5).

# Print data frame to Excel
write.xlsx(x = dat, file = "/Users/Jim/Documents/MeasuringU/Benchmarks/2024/Social Media/Social Media 2024 Rake Weights Exercise with Weights.xlsx", sheetName = "Data", colNames = TRUE, rowNames = FALSE, overwrite = TRUE)

Figure 5: First eight rows of the printed Excel file with the rake weights.

The weights ranged from 0.119 to 5.0. The smallest weight was for the most overrepresented combination of demographic variables (nonbinary, 25–29, $150–$199k), and the largest was for the least represented (all respondents, 60–69 years old).

One aspect of the “art” of weighting is deciding whether to limit the range of weights to no less than 0.5 and no more than 2.0. As we’ve discussed in a previous article, this practice has some benefits (e.g., reducing the impact of weighting on the precision of group means), but it reduces the sum of the weights so they no longer match the sample size. For this reason, as in this example, we generally prefer to work with unadjusted weights.
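The effect of trimming on the sum of the weights is easy to see with a small example. The weights below are hypothetical (chosen so they sum to a sample size of eight), and the rescaling step at the end is one possible remedy that we show only for illustration.

```r
# Hypothetical raked weights for eight respondents; they sum to the sample size (8)
w <- c(0.2, 0.4, 0.6, 0.8, 1.0, 1.0, 1.0, 3.0)

# Trim weights to the range 0.5 to 2.0
w_trimmed <- pmin(pmax(w, 0.5), 2.0)

print(sum(w))          # sum of the original weights
print(sum(w_trimmed))  # sum after trimming, which no longer equals 8

# Optional: rescale so the trimmed weights again sum to the sample size.
# Note that rescaling can push the largest weights slightly above the 2.0 limit.
w_rescaled <- w_trimmed * length(w) / sum(w_trimmed)
print(round(w_rescaled, 3))
```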

Using Weights

The specific method for using weights depends on your statistical software. In the following examples, we’ve imported the data into SPSS and used its WEIGHT function (specifically, “WEIGHT BY Weights”) to get unweighted and weighted results for two questions from our social media survey.

Brand Attitude

Figure 6 shows the mean responses by platform for ratings of brand attitude on a seven-point scale (“Overall, how would you rate your attitude towards [social media platform]?”; 1 = Very Unfavorable; 7 = Very Favorable). The weighting slightly affected the mean ratings (by no more than 0.3 points for a given platform, 5% of the range of the scale) but did not change the interpretation of the results. Weighted or unweighted, TikTok had the highest brand attitude, X had the lowest, and the other platforms clustered together in the middle (slightly lower for Facebook).

Figure 6: Mean ratings of brand attitude, unweighted and weighted.
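For readers working in R rather than SPSS, a weighted mean can be computed with the base function weighted.mean. The sketch below uses the eight Brand ratings from Table 4 pooled across platforms, with hypothetical weights standing in for the rake weights (which appear only in Figure 4), so its results will not match Figure 6.

```r
# Brand ratings from the first eight rows of Table 4
brand <- c(2, 3, 6, 7, 1, 4, 5, 5)

# Hypothetical weights for those eight respondents (not the actual rake weights)
w <- c(0.5, 1.0, 0.5, 1.0, 0.5, 2.0, 1.5, 1.0)

unweighted_mean <- mean(brand)
weighted_mean <- weighted.mean(brand, w)

print(unweighted_mean)
print(weighted_mean)
```

To get means by platform, the same calculation would be applied within each platform's rows.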

Reluctance to Engage in Political Discourse

Respondents in the social media survey were asked “How likely are you to share political content on social media?” using a five-point scale from Very Unlikely (1) to Very Likely (5). Most respondents indicated a reluctance to engage in political discourse on their platforms, so one of our analyses of the results was as a bottom-two-box score, where larger percentages indicate greater reluctance to share political content (Figure 7).

Figure 7: Bottom-two-box scores for likelihood of sharing political content, unweighted and weighted.
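A weighted bottom-two-box score is the weighted share of respondents who chose 1 or 2. The sketch below uses the eight Pol responses from Table 4 with hypothetical weights in place of the actual rake weights, so it illustrates the calculation only and does not reproduce Figure 7.

```r
# Likelihood-to-share ratings (1-5) from the first eight rows of Table 4
pol <- c(1, 1, 2, 3, 4, 1, 1, 3)

# Hypothetical weights for those eight respondents (not the actual rake weights)
w <- c(0.5, 1.0, 0.5, 1.0, 0.5, 2.0, 1.5, 1.0)

# Bottom-two-box indicator: TRUE when the response is 1 or 2
bottom2 <- pol <= 2

unweighted_b2b <- mean(bottom2)
weighted_b2b <- sum(w[bottom2]) / sum(w)

print(unweighted_b2b)
print(weighted_b2b)
```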

Like the first example, weighting had some effect on bottom-two-box percentages (the largest change was for Facebook from 72% to 63%), but little effect on the key narrative emerging from the data. With or without weighting, users of LinkedIn are especially reluctant to share political content. Although a majority of Facebook, Instagram, and X users selected a bottom-two response option, users of these platforms are more likely than LinkedIn users to share political content.

Summary and Discussion

When there is a need to weight multiple demographic variables to match a sample to a reference population, rake weighting is a popular option, and the R package anesrake is a popular tool. In this article, we discussed why and how to use rake weighting with a sample R script and examples applied to data from a 2024 survey of the UX of social media platforms. The key points in the article are:

Raking is rare in UX research but has its place. For some research questions, especially national voting estimates, weighting a sample to match the U.S. population is an important analytical step. Weighting is less common in UX research because there usually is no suitable reference population. In addition, UX metrics are more affected by user characteristics such as the extent of experience with a product than by demographic variables such as gender, age, or income.

Rake weighting has significant advantages over other methods. Rake weighting is a popular approach to computing weights based on multiple participant variables because it requires input of only the sample and population proportions for each variable (no need for complex crosstabs), and several open-source R packages perform rake weighting.

As expected, weighting with U.S. census data had little effect on the results from a UX survey of social media. To provide examples of rake weighting and its effect on UX analyses we conducted in 2024, we computed unweighted and weighted results for two items from a survey of the UX of social media platforms (mean ratings of brand attitude and bottom-two-box scores for likelihood to share political content on platforms). Weighting had very little effect on mean ratings of brand attitude. Weighting had slightly more effect on bottom-two-box percentages measuring the reluctance to engage in political discourse (the most political item in the survey) but still had no substantial effect on the interpretation of the results.

Bottom line: UX researchers should exercise caution in analyzing weighted rather than unweighted data unless there is a good reference population for the research questions. When there is a need to weight data based on multiple user characteristics, rake weighting is a popular and effective method.

What Metrics Has MeasuringU Created?

Jeff Sauro, PhD • Jim Lewis, PhD — Wed, 12 Nov 2025 00:52:12 +0000

At MeasuringU^®, we don’t just use UX metrics—we create them.

But what have we created, and what have we just used or extended?

Across our combined careers, we (Jeff and Jim) have published 16 psychometrically qualified UX metrics (both creating original and modifying existing questionnaires) plus a method for combining prototypical usability metrics, and we have made major contributions to a popular standardized UX questionnaire that we did not create, the System Usability Scale (SUS).

In this article, we briefly describe each of these metrics (presented in roughly reverse chronological order by decade) and provide key links to more information about them (so you won’t need to ask ChatGPT and risk hallucinated references).

2020–2025

From 2020 to 2025, we developed and published four standardized UX questionnaires: UX-Lite^®, SUPR-Qm^® V2, TAC-10, and PWCQ.

UX-Lite^®

The UX-Lite has its roots in the UMUX-LITE (more on the UMUX-LITE below). It’s a two-item questionnaire that is essentially a miniature version of the Technology Acceptance Model (TAM), assessing the perceived ease-of-use and perceived usefulness of products and services with two five-point scales. It’s becoming an increasingly popular metric in UX research and practice.

From 2020 to 2024, we published 15 articles on the UX-Lite, many of which explored different ways to phrase the “usefulness” item because its original wording was overly complex. In addition to being reliable and valid, the UX-Lite has proved useful in regression and structural equation modeling of higher-level outcome metrics like ratings of overall experience, behavioral intentions (e.g., likelihood to recommend, likelihood to reuse), and actual user behaviors.

Key Characteristics

Measures: Perceived ease of use and perceived usefulness
Number of items: 2
Reliability: 0.75 (coefficient alpha unless otherwise specified)
Types of Validity: Content, construct, concurrent
Number of subscales: 2 (single-item scales)
Interpretative norms: Yes
Development method: Classical test theory

Key Links & Publications

Lewis, J. R., & Sauro, J. (2023). Effect of Perceived Ease of Use and Usefulness on UX and Behavioral Outcomes. International Journal of Human-Computer Interaction, 40(20), 6676–6683.
Measuring UX: From the UMUX-LITE to the UX-Lite
Evolution of the UX-Lite
How to Score and Interpret the UX-Lite

SUPR-Qm^® V2

The mobile app experience is a unique and defining aspect of our interactions with our devices. While the experience shares many characteristics with using software and websites on a traditional monitor, the mobility, screen size, and interaction style make the experience distinct. Consequently, we developed a questionnaire, the SUPR-Qm, to measure attitudes toward the mobile app user experience. In 2025, we published the second version of the SUPR-Qm, reducing the number of items from the original 16 to five.

Key Characteristics

Measures: Intensity of the UX of mobile apps
Number of items: 5
Reliability: 0.83
Types of Validity: Content, construct, concurrent
Number of subscales: 0
Interpretative norms: Yes
Development method: Rasch scaling

Key Links & Publications

Lewis, J. R., & Sauro, J. (2025). Streamlining the SUPR-Qm: The SUPR-Qm V2. Journal of User Experience, 20(2), 65–88.
Ten Things to Know About the SUPR-Qm
How to Score and Interpret the Five-Item SUPR-Qm V2

TAC-10

We based the TAC-10 on research conducted at MeasuringU from 2015 through 2023 and presented it at UXPA 2024. The TAC-10 is a select-all-that-apply checklist of ten different technical activities. We published six blog articles in 2023 detailing its development, including why there was a need for a measure of tech savviness in UX research (to enable discrimination of interface and participant characteristics when analyzing UX data) and how to use the TAC-10 to classify participants into different levels of tech savviness.

Key Characteristics

Measures: Level of tech savviness
Number of items: 10
Reliability: 0.67 (Spearman–Brown for dichotomous data)
Types of Validity: Content, construct, concurrent
Number of subscales: 0
Interpretative norms: Yes
Development method: Rasch scaling

Key Links & Publications

PWCQ

In our UX research practice, we frequently encounter users and designers who criticize website interfaces for being cluttered and stakeholders who worry about the experiential and business consequences of a cluttered website. But what exactly does it mean for a website to appear cluttered? To answer this question, we developed the Perceived Website Clutter Questionnaire (PWCQ), a five-item questionnaire with two subscales: Content Clutter and Design Clutter.

Key Characteristics

Measures: The perceived clutter of websites
Number of items: 5
Reliability: 0.90
Types of Validity: Content, construct, concurrent
Number of subscales: 2
- Content Clutter: Reliability = 0.91
- Design Clutter: Reliability = 0.88
Interpretative norms: No
Development method: Classical test theory

Key Links & Publications

Lewis, J. R., & Sauro, J. (2024). Measuring the Perceived Clutter of Websites. International Journal of Human-Computer Interaction, 41(9), 5260–5273.
Confirming the Perceived Website Clutter Questionnaire
Incorporating Clutter in the SUPR-Q Measurement Framework

2010–2019

From 2010 through 2019, we (Jeff and Jim) both collaborated and worked separately on the creation and publication of seven standardized UX questionnaires, plus the publication of books, papers, and numerous articles on how to use and interpret the SUS.

SUPR-Q®

At MeasuringU, we originally benchmarked websites using the SUS. But we knew that the quality of the website user experience was more than just usability, so we developed the Standardized User Experience Percentile Rank Questionnaire (SUPR-Q) in 2011 and published our findings in 2015. The SUPR-Q is a short (eight-item) questionnaire that measures perceptions of Usability, Trust, Appearance, and Loyalty for websites. The combined score provides an overall measure of the quality of the website user experience. The normative percentile database contains responses from more than 10,000 participants and 150 websites (updated on an ongoing basis, about once per quarter).

Key Characteristics

Measures: Perceptions of the quality of UX with websites
Number of items: 8
Reliability: 0.90
Types of Validity: Content, construct, concurrent
Number of subscales: 4
- Usability: Reliability = 0.88
- Trust: Reliability = 0.87
- Appearance: Reliability = 0.80
- Loyalty: Reliability = 0.73
Interpretative norms: Yes
Development method: Classical test theory

Key Links & Publications

Sauro, J. (2015). SUPR-Q: A Comprehensive Measure of the Quality of the Website User Experience. Journal of Usability Studies, 10(2), 68–86.
SUPR-Q License & Calculator Package
Validating the Basic SUPR-Q Measurement Model

SUPR-Qm®

Our original version of the mobile app questionnaire had 16 items selected from a larger set using Rasch scaling. We list this here for historical purposes, but our current practice is to use the SUPR-Qm V2 (see above).

Key Characteristics

Measures: Intensity of the UX of mobile apps
Number of items: 16
Reliability: 0.94
Types of Validity: Content, construct, concurrent
Number of subscales: 0
Interpretative norms: Yes
Development method: Rasch scaling

Key Links & Publications

Sauro, J., & Zarolia, P. (2017). SUPR-Qm: A Questionnaire to Measure the Mobile App User Experience. Journal of Usability Studies, 13(1), 17–37.
Lewis, J. R., & Sauro, J. (2025). Streamlining the SUPR-Qm: The SUPR-Qm V2. Journal of User Experience, 20(2), 65–88.
How Stable is the SUPR-Qm After 8 Years?

UMUX-LITE

The UMUX-LITE is a mini-TAM with two seven-point items, assessing perceived ease of use and perceived usefulness. It was derived from the four-item UMUX (Usability Metric for User Experience) when Jim was at IBM (in collaboration with Brian Utesch and Deb Maher) and is the predecessor to the UX-Lite. At MeasuringU, we prefer the UX-Lite (described above) due to its enhanced flexibility, but the UMUX-LITE is also used in current UX research and practice.

Key Characteristics

Measures: Perceived ease of use and usefulness
Number of items: 2
Reliability: 0.83
Types of Validity: Content, construct, concurrent
Number of subscales: 2 (single-item scales)
Interpretative norms: No
Development method: Classical test theory

Key Links & Publications

Lewis, J. R., Utesch, B. S., & Maher, D. E. (2013). UMUX-LITE: When There’s No Time for the SUS. In Proceedings of CHI 2013 (pp. 2099–2102). Association for Computing Machinery.
Measuring Usability: From the SUS to the UMUX-Lite

MOS-X2

As part of his work on speech systems at IBM, Jim and his collaborators developed variants of the Mean Opinion Scale (MOS), which others had first published in the 1990s. The MOS-X2 is the culmination of that research, a four-item questionnaire that assesses four key characteristics of user experiences with synthetic voices: Intelligibility, Naturalness, Prosody, and Social Impression.

Key Characteristics

Measures: The perceived intelligibility, naturalness, prosody, and social impression of synthetic voices
Number of items: 4
Reliability: 0.85
Types of Validity: Content, construct, concurrent
Number of subscales: 4 (single-item scales)
Interpretative norms: Yes
Development method: Classical test theory

Key Links & Publications

Lewis, J. R. (2018). Investigating MOS-X Ratings of Synthetic and Human Voices. Voice Interaction Design, 2(2), 1–22.
The Evolution of the Mean Opinion Scale: From MOS-R to MOS-X2

SUISQ-R

The original version of the Speech User Interface Service Quality (SUISQ) questionnaire was developed at IBM and published by Melanie Polkosky in 2008. During its development, participants rated the quality of recorded interactions rather than interactions in which they participated, leaving open the question of the extent to which the findings would generalize to personal as opposed to observed interactions. Collaborating at State Farm, Jim and Mary Hardzinski collected SUISQ data in a large-sample usability study and (1) replicated the factor structure of the original and (2) used item analysis to reduce the questionnaire from 25 to 14 items (getting the SUISQ-R) while still adequately measuring its four subscales: User Goal Orientation, Customer Service Behaviors, Speech Characteristics, and Verbosity.

Key Characteristics

Measures: Service quality of speech applications
Number of items: 14
Reliability: 0.88
Types of Validity: Content, construct, concurrent
Number of subscales: 4
- User Goal Orientation: Reliability = 0.91
- Customer Service Behaviors: Reliability = 0.88
- Speech Characteristics: Reliability = 0.80
- Verbosity: Reliability = 0.67
Interpretative norms: No
Development method: Classical test theory

Key Links & Publications

Lewis, J. R., & Hardzinski, M. L. (2015). Investigating the Psychometric Properties of the Speech User Interface Service Quality Questionnaire. International Journal of Speech Technology, 18(3), 479–487.

EMO

The Emotional Metric Outcomes (EMO) questionnaire was also developed while Jim was consulting at State Farm. His collaborators at State Farm wanted a standardized questionnaire for assessing the emotional consequences of interaction with a company. They published the EMO in both long (16 items) and short (8 items) versions, measuring four subscales: Positive Relationship Affect, Negative Relationship Affect, Positive Personal Affect, and Negative Personal Affect. The key characteristics below are for the more efficient short version.

Key Characteristics

Measures: Emotional consequence of interacting with a company
Number of items: 8
Reliability: 0.88
Types of Validity: Content, construct, concurrent
Number of subscales: 4
- Positive Relationship Affect: Reliability = 0.89
- Negative Relationship Affect: Reliability = 0.72
- Positive Personal Affect: Reliability = 0.83
- Negative Personal Affect: Reliability = 0.82
Interpretative norms: No
Development method: Classical test theory

Key Links & Publications

Lewis, J. R., & Mayes, D. K. (2014). Development and Psychometric Evaluation of the Emotional Metric Outcomes (EMO) Questionnaire. International Journal of Human-Computer Interaction, 30, 685–702.
Lewis, J. R., Brown, J., & Mayes, D. K. (2015). Psychometric Evaluation of the EMO and the SUS in the Context of a Large-Sample Unmoderated Usability Study. International Journal of Human-Computer Interaction, 31(8), 545–553.

mTAM

The mTAM is a modified version of the TAM (Technology Acceptance Model), a questionnaire developed in the 1990s to assess the drivers of technology acceptance. In its original version, the TAM had 12 items measuring two subscales, Perceived Ease-of-Use and Perceived Usefulness, with items worded to focus on potential future use. For the mTAM, the only modification was to change the focus to ratings of actual use. Note that we do not use this as a practical UX questionnaire, but we have used it when exploring how other standardized metrics work within the Technology Acceptance Model.

Key Characteristics

Measures: Perceived ease of use and perceived usefulness
Number of items: 12
Reliability: 0.95
Types of Validity: Content, construct, concurrent
Number of subscales: 2
- Perceived Ease of Use: Reliability = 0.95
- Perceived Usefulness: Reliability = 0.95
Interpretative norms: No
Development method: Classical test theory

Key Links & Publications

Lah, U., Lewis, J. R., & Šumak, B. (2020). Perceived Usability and the Modified Technology Acceptance Model. International Journal of Human-Computer Interaction, 36(13), 1216–1230.
Lewis, J. R. (2019). Comparison of Four TAM Item Formats: Effect of Response Option Labels and Order. Journal of Usability Studies, 14(4), 224–235.

SUS

No, we didn’t develop the System Usability Scale (SUS)—that honor goes to John Brooke—but between 2010 and 2019, we conducted extensive research to improve its interpretability and flexibility, with publications listed in the Key Links below.

Key Characteristics

Measures: Perceived usability
Number of items: 10
Reliability: 0.91
Types of Validity: Content, construct, concurrent
Number of subscales: 0
Interpretative norms: Yes
Development method: Classical test theory

Key Links & Publications

Sauro, J. (2011). A Practical Guide to the System Usability Scale: Background, Benchmarks & Best Practices. MeasuringU Press. Note: First book about the SUS, focusing on its measurement characteristics and practical use, including the curved grading scale developed at MeasuringU.
Sauro, J., & Lewis, J. R. (2011). When Designing Usability Questionnaires, Does It Hurt to Be Positive? In Proceedings of CHI 2011 (pp. 2215–2223). Association for Computing Machinery. Honorable Mention for Best Paper award.
Sauro, J., & Lewis, J. R. (2012/2016). Quantifying the User Experience: Practical Statistics for User Research. Morgan Kaufmann. Note: Extensive coverage of SUS research and use in Chapter 8.
Lewis, J. R., & Sauro, J. (2017). Revisiting the Factor Structure of the System Usability Scale. Journal of Usability Studies, 12(4), 183–192.
Lewis, J. R., & Sauro, J. (2017). Can I Leave This One Out? The Effect of Dropping an Item from the SUS. Journal of Usability Studies, 13(1), 38–46.
Lewis, J. R., & Sauro, J. (2018). Item Benchmarks for the System Usability Scale. Journal of Usability Studies, 13(3), 158–167.
Lewis, J. R. (2018). The System Usability Scale: Past, Present, and Future. International Journal of Human-Computer Interaction, 34(7), 577–590.
Extensive publication of articles on the SUS at MeasuringU.com.

2000–2009

In this decade, Jeff introduced the Single Ease Question (SEQ) and a method for combining multiple UX metrics into a Single Usability Metric (SUM). Jim and colleagues at IBM investigated and published enhancements to the Mean Opinion Scale (MOS), most notably, the MOS-X.

SEQ®

The concepts of ease of use and usability are deeply intertwined. No one knows who the first person was to ask someone to rate the ease of completing a task in a usability study, but in 2009, Jeff and Joe Dumas were the first to publish a version of the item that is now known as the Single Ease Question (SEQ). Since its initial publication, it has undergone some cosmetic changes, and research has established good norms for its interpretation.

Key Characteristics

Measures: Perceived ease of completing a task in a usability study
Number of items: 1
Reliability: 0.80 (test-retest)
Types of Validity: Content, concurrent
Number of subscales: 0
Interpretative norms: Yes
Development method: Classical test theory

Key Links & Publications

Sauro, J., & Dumas, J. (2009). Comparison of Three One-Question, Post-Task Usability Questionnaires. In Proceedings of CHI 2009 (pp. 1599–1608). Association for Computing Machinery. Nominated for Best Paper Award.
The Evolution of the Single Ease Question (SEQ)

SUM

The Single Usability Metric (SUM) is not a standardized questionnaire, so there is no list of key characteristics in this section. Instead, it is a standardized method for combining prototypical usability metrics such as completion rates, completion times, and subjective ratings (e.g., satisfaction or ease), an important step toward a unified measure of usability that we continue to use in benchmark studies.

Key Links & Publications

Sauro, J., & Kindlund, E. (2005). Using a Single Usability Metric (SUM) to Compare the Usability of Competing Products. In Proceedings of HCII 2005 (pp. 235–244). Human Computer Interaction International.
Sauro, J., & Lewis, J. R. (2009). Correlations among Prototypical Usability Metrics: Evidence for the Construct of Usability. In Proceedings of CHI 2009 (pp. 1609–1618). Association for Computing Machinery. Nominated for Best Paper Award.
10 Things to Know about the Single Usability Metric (SUM)

MOS-X

The Mean Opinion Scale-Expanded (MOS-X) is a 15-item questionnaire developed at IBM to obtain listeners’ subjective assessments of synthetic speech on four dimensions: Intelligibility, Naturalness, Prosody, and Social Impression. In current practice, we have replaced this questionnaire with the four-item MOS-X2 (described above).

Key Characteristics

Measures: The perceived intelligibility, naturalness, prosody, and social impression of synthetic voices
Number of items: 15
Reliability: 0.93
Types of Validity: Content, construct, concurrent
Number of subscales: 4
- Intelligibility: Reliability = 0.88
- Naturalness: Reliability = 0.86
- Prosody: Reliability = 0.86
- Social Impression: Reliability = 0.86
Interpretative norms: Yes
Development method: Classical test theory

Key Links & Publications

Polkosky, M. D., & Lewis, J. R. (2003). Expanding the MOS: Development and Psychometric Evaluation of the MOS-R and MOS-X. International Journal of Speech Technology, 6(2), 161–182.
Lewis, J. R. (2018). Investigating MOS-X Ratings of Synthetic and Human Voices. Voice Interaction Design, 2(2), 1–22.

1990–1999

This was the decade in which Jim published his first three standardized UX questionnaires at IBM: ASQ, PSSUQ, and CSUQ. They continue to be used in UX research and practice, but we don’t use them at MeasuringU because they’re a bit antiquated in style and content (included in this article for historical completeness).

ASQ

The After-Scenario Questionnaire (ASQ) was an early attempt to develop a concise but comprehensive questionnaire to administer after tasks in usability studies. Its three items collect ratings of satisfaction with ease, completion time, and support information.

Key Characteristics

Measures: Task-level satisfaction
Number of items: 3
Reliability: 0.93
Types of Validity: Content, construct, concurrent
Number of subscales: 0
Interpretative norms: No
Development method: Classical test theory

Key Links & Publications

Lewis, J. R. (1991). Psychometric Evaluation of an After-Scenario Questionnaire for Computer Usability Studies: The ASQ. SIGCHI Bulletin, 23(1), 78–81.
Lewis, J. R. (1995). IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use. International Journal of Human-Computer Interaction, 7(1), 57–78.

PSSUQ

The Post-Study System Usability Questionnaire (PSSUQ) was an early standardized usability questionnaire designed for administration at the end of a usability study. It contains three subscales: System Usefulness, Information Quality, and Interface Quality (most recently slightly redesigned as Version 3).

Key Characteristics

Measures: Perceived usability
Number of items: 16
Reliability: 0.96
Types of Validity: Content, construct, concurrent
Number of subscales: 3
- System Usefulness: Reliability = 0.96
- Information Quality: Reliability = 0.92
- Interface Quality: Reliability = 0.83
Interpretative norms: No
Development method: Classical test theory

Key Links & Publications

Lewis, J. R. (1992). Psychometric Evaluation of the Post-Study System Usability Questionnaire: The PSSUQ. In Proceedings of the 36th Annual Meeting of the Human Factors Society (pp. 1259–1263). HFES.
Lewis, J. R. (1995). IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use. International Journal of Human-Computer Interaction, 7(1), 57–78.
Lewis, J. R. (2002). Psychometric Evaluation of the PSSUQ Using Data from Five Years of Usability Studies. International Journal of Human-Computer Interaction, 14(3), 463–488.
Lewis, J. R. (2019). Using the PSSUQ and CSUQ in User Experience Research and Practice. MeasuringU Press.

CSUQ

The Computer System Usability Questionnaire (CSUQ) is a version of the PSSUQ modified for use as a general standardized UX questionnaire outside of the confines of a usability test (achieved primarily by changing references to “tasks and scenarios” to “work”). Its key characteristics are very close to those found for the PSSUQ.

Key Links & Publications

Lewis, J. R. (1995). IBM Computer Usability Satisfaction Questionnaires: Psychometric Evaluation and Instructions for Use. International Journal of Human-Computer Interaction, 7(1), 57–78.
Lewis, J. R. (2019). Measuring Perceived Usability: SUS, UMUX, and CSUQ Ratings for Four Everyday Products. International Journal of Human-Computer Interaction, 35(15), 1404–1419.
Lewis, J. R. (2019). Using the PSSUQ and CSUQ in User Experience Research and Practice. MeasuringU Press.

Summary

From 1990 to 2025, we developed and published 16 standardized UX questionnaires, ranging from the general measurement of perceived usability to specialized measurement of the UX of websites and speech applications. Table 1 lists those questionnaires and their key characteristics in reverse chronological order.

Questionnaire	Measures	Number of Items	Reliability	Number of Subscales	Interpretative Norms	Development Method
UX-Lite	Perceived ease and usefulness	2	0.75	2	Yes	Classical Test Theory
SUPR-Qm V2	Intensity of UX of mobile apps	5	0.83	0	Yes	Rasch Scaling
TAC-10	Level of tech savviness	10	0.67	0	Yes	Rasch Scaling
PWCQ	Perceived website clutter	5	0.90	2	No	Classical Test Theory
SUPR-Q	Quality of UX of websites	8	0.90	4	Yes	Classical Test Theory
SUPR-Qm	Intensity of UX of mobile apps	16	0.94	0	Yes	Rasch Scaling
UMUX-LITE	Perceived ease and usefulness	2	0.83	2	No	Classical Test Theory
MOS-X2	UX of synthetic voices	4	0.85	4	Yes	Classical Test Theory
SUISQ-R	Service quality of speech apps	14	0.88	4	No	Classical Test Theory
EMO	Emotional interaction	8	0.88	4	No	Classical Test Theory
mTAM	Perceived ease and usefulness	12	0.95	2	No	Classical Test Theory
SEQ	Perceived task ease	1	0.80	0	Yes	Classical Test Theory
MOS-X	UX of synthetic voices	15	0.93	4	Yes	Classical Test Theory
ASQ	Task-level usability	3	0.93	0	No	Classical Test Theory
PSSUQ	Study-level usability	16	0.96	3	No	Classical Test Theory
CSUQ	Computer usability	16	0.97	3	No	Classical Test Theory

Table 1: Summary of standardized questionnaires created by MeasuringU researchers (all questionnaires have published evidence of content and concurrent validity; all except the SEQ have construct validity).

In addition to these questionnaires, the SUM, a method for combining prototypical usability metrics, was created at MeasuringU.

And even though we did not create the SUS, we have published numerous studies on making it more flexible and interpretable (e.g., curved grading scale and item benchmarks).

What Makes a Good UX Research Moderator?

Jeff Sauro, PhD • Jim Lewis, PhD — Wed, 05 Nov 2025 03:00:31 +0000

Human research moderators aren’t going away.

Despite technological advancements, such as remote unmoderated testing (with and without thinking aloud) and AI moderators, a live researcher asking questions of a live person will always be needed. Technological innovations are less likely to render things obsolete than to make them more specialized (like Internet > TV > Radio).

Interactions between researchers and participants often require a moderator with good interpersonal skills. The moderator must know how to read the verbal and nonverbal cues from the participant while also getting the needed information—a combination of being professional and being human.

But what makes a good research moderator? In this article, we explore the attributes that differentiate between inadequate and excellent moderators.

Understanding the Research Question(s)

Why are you moderating a session in the first place? Is the research team trying to uncover unmet needs in a market, understand why a new feature isn’t being used, or discover how people react to a new limitation? If you’ve designed the study, it’s a lot easier to know what the research questions are and the reasons behind them. But with teams (and certainly when you hire an external company like MeasuringU), understanding the “why?” behind the test goes a long way to modulating moderation style, such as when to probe and when to go off script (which we’ll cover next).

Tip: Put yourself in the head of the designer, product owner, or other stakeholder who will be on the receiving end of your findings. Do they just need to know broad themes, or do they need specific design guidance?

Understanding the Study Type

One of the first challenges of being a moderator in UX research is knowing what type of study you are working on. UX research includes several methods that employ a moderator, such as usability testing, interviews, contextual inquiries, and in rare cases, focus groups.

The study type will often dictate your moderation style and how you adapt to participant input. A summative benchmark study focuses on tasks and metrics. In contrast, an in-depth interview about a new product idea usually requires a lot of probing and ad hoc pivoting based on what participants say.

But sometimes, even within a method, moderators need to modulate their style. For example, a moderator might be running a large in-person summative benchmark study, but the client still wants to know what’s driving low metrics. This means the research moderator has to find opportunities to probe.

An unstructured in-depth interview requires a different approach than a moderated summative evaluation. Here are common study types and how to adjust:

Formative usability test: Focus on diagnosing problems and encouraging participants to think aloud. Avoid telling participants what to do.
Summative usability test: Keep structure tight; prioritize metrics and consistency across sessions. Minimize interaction with participants.
Exploratory interview: Probe deeply, follow new threads, and explore motivations.
Follow-up interview (post-survey): Clarify unexpected quantitative findings or themes.
Contextual inquiry: Observe real behavior first; question later.

Skilled moderators adjust their styles accordingly, probing only when appropriate in a metric-driven study, or stepping back during an observational one.

Modulating Style to Match the Goals and Study Type

About ten years ago, we identified a spectrum from babysitter to therapist that suggests how to modulate your style. Tedesco and Tranquada (2013) also identified a reasonable set of four styles: Friendly Face, Down to Business, Inquisitive Mind, and By the Book (summarized in Table 1):

Style	Description	Good For	Possible Pitfalls
Friendly Face	Personable and encouraging	Participants who are timid, disengaged, or need reassurance	Being too casual risks going off-topic or influencing responses
Down to Business	Serious, professional, efficient	Structured studies; chatty or informal participants	Tone too harsh, intimidating, or condescending; suppresses openness
Inquisitive Mind	Curious and probing	Exploratory or discovery studies	Could derail study flow/time and lead to irrelevant tangents
By the Book	Scripted, methodical, precise	Studies needing consistency or technical accuracy	May miss spontaneous insights or intimidate participants to silence

Table 1: Tedesco and Tranquada’s four moderation styles.

Fundamental Moderation Skills

There are numerous skills across moderation types that act as indicators of a moderator’s expertise.

Establishes Rapport

A good moderator makes a participant feel comfortable, not to make friends, but to hopefully draw out honest feedback and opinions that provide insights. Simple empathy, small talk, and attentive body language go further than icebreaker questions alone. Rapport lets the participant know the moderator is listening and processing what they are saying.

Probes Appropriately

If we may loosely quote the late, great Kenny Rogers, a good moderator knows when to probe them and when to hold them (hopefully they don’t need to walk away or run). Proper probing is a skill that differentiates effective from ineffective moderation. It requires deeply understanding the research questions. In many cases, there may be a need to probe more than once to drill down to a root cause.

Probe 1: “Can you describe why you rated that task as difficult?”
Participant: “Because I had a hard time with the search bar.”
Probe 2: “What about the search bar was difficult?”

You may need to dig a little to understand the motivation. Some people refer to this as the five whys. In practice, you’re unlikely to have to probe five times (usually one to three will suffice). But on the other hand, you don’t want to belabor an issue if a participant isn’t providing valuable feedback.

Manages Time Well

Participants have scheduled time (usually 30 or 60 minutes), and stakeholders have scheduled time. A good moderator can modulate the probing (even skipping less important tasks) without “rushing” a participant. Sometimes parts of the study may need to be dropped to stay on time, a potentially tricky judgment that requires significant expertise to get right.

Uses Interpersonal Skills Effectively

While it’s easier to naturally have good interpersonal skills, they can still be learned and applied in the context of moderation. Tedesco and Tranquada identified five interpersonal qualities that are important for the moderator:

Empathy: Recognizing and validating participant frustration (intervening if needed).
Flexibility: Adapting to technical issues or unexpected answers.
Creativity: Finding new ways to phrase or demonstrate questions.
Sense of humor: Lightening tension or resetting rapport appropriately.
Authority: Not Stanley Milgram style, but staying on track and answering questions without being distracted.

Detects Misrepresentation (Detects BS)

Having participants misrepresent themselves is a waste of everyone’s time. While it seems improbable that someone would show up claiming to be an insurance agent or DevOps engineer but actually have no idea what those roles are, it unfortunately happens. With remote moderated sessions and AI, participants can easily fake their way past screeners (and even pre-screening interviews). In some cases, we’ve seen participants clearly reading from a prompt after each question (on phones or second computers). A good moderator should look for those signals, politely end the session, and notify the panel.

Uses Silence Strategically

Awkward silences may make conversations with friends and coworkers uncomfortable, but they can be a powerful tool for the inquisitive moderator who needs (and has the time) to uncover motivations.

Avoids Leading Questions

While a good discussion guide helps minimize biased questions, when you need to go off-script or follow up on what a participant says, a good moderator avoids putting words in participants’ mouths or suggesting answers. Don’t you agree?

Takes Notes Discreetly

Noting when something happened, what participants said, or what problem popped up helps you produce reports faster. Even with AI-generated transcripts and videos, having a quick reference leads to efficient reports. On the other hand, you don’t want note-taking to distract the participant (“why did she write that down?”) or slow down the session. Because keyboard clicking can be distracting for in-person and remote participants, some moderators still prefer handwriting for its quiet and speed.

Manages Observers and Stakeholders

We have one-way mirrors in our labs in Denver, and our MUiQ platform supports hidden observers (a digital one-way mirror). Stakeholders commonly want more from participants and the moderator (all while the moderator is watching, listening, and taking notes). Juggling those demands takes serious multitasking skills.

Knows When to Move On

Asking the question, probing, then probing again, may sometimes yield few new insights. Or maybe a tech issue is worth troubleshooting, but only to a point. A good moderator knows when to move on (to stay on schedule or get to other areas of interest).

Knows When to Assist

In usability testing, moderators walk a fine line between observing and intervening. Moderators need to let participants experience the product naturally so real problems emerge. But sometimes, assistance is necessary.

An assist occurs when a moderator helps a participant move forward after getting stuck, ideally to preserve the flow of the session and uncover later issues. The key is to assist sparingly and deliberately: step in only when a participant truly can’t progress, when time is running short, or when continued actions along the current path could lead to uninteresting errors or data loss. Experienced moderators use assistance strategically and transparently, preserving test validity while ensuring productive sessions.

Goes Off Script When Needed

Sometimes it’s obvious when to probe an action or dig deeper into a comment, so those prompts can be put into a good discussion guide. But it’s hard to prescribe all those probes when tasks need to change or be dropped, or entirely new questions need to be asked. Understanding a research question (and the intent of the research) makes effectively going off script easier, but it’s never really easy.

How Do You Know Who’s a Good Moderator?

We know it when we see it, of course. But how do you objectively measure the quality of a moderator? Is it just as difficult as measuring the effectiveness of a leader? As far as we know, there is no objective rubric. As with the quality of a usability study, it’s a lot easier to list factors that should be included (e.g., findings and quotes) than to assess quality objectively. We’ll cover how to assess moderator quality in an upcoming article.

Scatterplot Jitter—Why and How?

Jim Lewis, PhD • Jeff Sauro, PhD — Wed, 29 Oct 2025 05:16:59 +0000

Scatterplots are powerful tools for visualizing data, especially when data is continuous and unbounded (or nearly so). For example, Figure 1 shows the relationship between concurrently collected System Usability Scale (SUS) and UX-Lite® data for 40 consumer software products.

Figure 1: Example of scatterplot of concurrently collected SUS and UX-Lite data.

Examination of the scatterplot shows a strong relationship between the SUS and UX-Lite (r = .94). Even though the points are clustered closely together, because the SUS and UX-Lite scores range from 0 to 100 and there were no ties, each point is visually distinct.

But what happens when plotting responses to multipoint rating scales?

Ties Are Likely with Multipoint Rating Scales

In August 2024, we collected data from 298 participants on their experience with one of the social media apps they had used in the past year (Facebook, Instagram, LinkedIn, Snapchat, TikTok, or X). Figure 2 shows examples of two multipoint rating scales that we used in that survey to investigate the relationship between likelihood to recommend and likelihood to discourage.

Figure 2: Example of two multipoint rating scales.

There are 11 responses in the first item and 10 in the second (110 possible pairs of values). If there is any correlation between the responses to the questions, there is a good chance that there will be ties for any reasonably large sample size. This likelihood is even greater when there are fewer response options, like in standard five- or seven-point scales.

Figure 3 shows the scatterplot for the data we collected with these items from 298 respondents.

Figure 3: Scatterplot for ratings of likelihood to recommend and likelihood to discourage.

The correlation between these variables is statistically significant (r(296) = −.57, p < .0001).

But there’s a problem with this visualization of the relationship. The points in the graph look too scattered for the correlation to be that strong. Without ties, the plot would show 298 dots, but it shows only 77. This means that almost 75% of the data was tied and therefore hidden in this representation. Consequently, rather than showing a downward trend from left to right across the scatterplot (which you would expect for a strongly significant negative correlation), the points seem to be scattered haphazardly. In fact, the correlation of the 77 data points displayed in Figure 3 is just −.19.
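To check how much of a dataset is hidden by ties before plotting, you can count the distinct pairs of values. The following Python sketch is illustrative only: the function name `visible_points` and the sample ratings are hypothetical, not the survey data. Applied to the figures above, the same arithmetic gives 1 − 77/298 ≈ 74% hidden.

```python
from collections import Counter


def visible_points(pairs):
    """Return (total, distinct, hidden_share) for a list of (x, y) rating pairs."""
    counts = Counter(pairs)            # identical pairs land on the same dot
    total = len(pairs)
    distinct = len(counts)             # number of dots an unjittered plot can show
    hidden_share = 1 - distinct / total
    return total, distinct, hidden_share


# Hypothetical (recommend, discourage) ratings, made up for illustration.
sample = [(10, 1), (10, 1), (9, 2), (9, 2), (9, 2), (0, 10), (5, 5), (5, 5)]

total, distinct, hidden = visible_points(sample)
print(f"{total} responses, {distinct} visible dots, {hidden:.0%} hidden")
```

If the hidden share is large, the unjittered scatterplot is probably not showing the relationship that the correlation describes.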

You might wonder if this happens because it’s inappropriate to compute standard (Pearson) correlations with rating scale data, which, for individual ratings, are discontinuous and ordinal. We do not hold this view because, in aggregate, the means of rating scales become increasingly continuous as the sample size increases, and we’ve found that parametric statistical analyses work well with this type of data. Also, a scatterplot is used to visualize data—it’s not a method for statistical analysis. Finally, if we compute Spearman rank correlations (which do not assume continuity in the raw data), we find they closely match the Pearson correlations (for all the data Pearson = −.57, Spearman = −.52; for the partial data in Figure 3 Pearson = −.19, Spearman = −.21).

Use Jitter to Reveal More Points

There are different strategies to indicate the presence of ties in these types of scatterplots (e.g., color, size, or shape coding). One of the most effective methods is to “jitter” the data by randomly increasing or decreasing the raw values. The effect is something like you’d get if you had tried to plot dots on a graph after drinking too much coffee.

Jittering has the effect of keeping tied scores close together but spreading them out just enough to reveal more points in the scatterplot and provide a better visualization of the trend, as shown in Figure 4.

Figure 4: Scatterplot for ratings of likelihood to recommend and likelihood to discourage (jittered).

With the jittered version, it’s easier to see the trend from the upper left corner to the lower right. Some points are still obscured—we count about 186 points in the jittered version. Even so, 62% of the 298 points are revealed compared to 25% in the unjittered version.

How to Jitter in Excel and Google Sheets

Some statistical packages include a jitter setting when creating scatterplots. Notable exceptions include SPSS and spreadsheets like Excel and Google Sheets.

To create a jittered scatterplot in Excel or Google Sheets, the first step is to create jittered versions of the values of the two variables being plotted. Use the following function to create them:

=value + (RAND() - 0.5) / 3

This function uses RAND() to generate a new random number between 0 and 1. Subtracting 0.5 from RAND() shifts the random number to range from −0.5 to +0.5, so the jittering spreads the new values to both sides of the original values. The constant (3 in the example above) controls the magnitude of the jitter, which you can adjust as needed for your data.

Note that whenever you make a change in the spreadsheet, the random numbers will be recalculated. So, at some point, you’ll probably want to copy and paste the jittered data as values to lock them down.
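For readers who work outside spreadsheets, the same jitter can be sketched in Python. This is a minimal illustration that assumes the spreadsheet formula above; the function name `jitter`, its `divisor` and `seed` parameters, and the sample ratings are our own hypothetical choices. Fixing the seed plays the same role as pasting the jittered data as values: the random offsets stop changing between runs.

```python
import random


def jitter(values, divisor=3, seed=None):
    """Add uniform noise in [-0.5/divisor, +0.5/divisor) to each value."""
    rng = random.Random(seed)          # a seeded generator gives repeatable jitter
    # rng.random() is in [0, 1); subtracting 0.5 centers the noise on zero.
    return [v + (rng.random() - 0.5) / divisor for v in values]


ratings = [0, 5, 5, 5, 10, 10]         # hypothetical tied ratings
jittered = jitter(ratings, divisor=3, seed=42)

for raw, new in zip(ratings, jittered):
    print(raw, round(new, 3))
```

With a divisor of 3, each point moves by at most one-sixth of a scale point in either direction, so tied ratings stay visibly grouped around their original value.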

Summary and Discussion

Scatterplots are powerful data visualization tools, but they do not work well when the values being plotted can easily tie (e.g., rating scale data). In this article, we described a popular strategy for breaking ties to reveal otherwise hidden relationships.

Ties are likely in scatterplots of rating scale data. Because rating scales have a fixed number of points, usually 5–11, if two rating scales are correlated, there is a good chance that ties will show up for any reasonably large sample size.

Ties in scatterplots obscure the visualization of structure. When graphing a scatterplot with rating scale data, the maximum number of dots is the product of the number of response options in the two scales. For two five-point scales, there are only 25 possible pairs of values. For the example presented in this article, there were 110 possible pairs. Even with this number of possible pairs, 75% of the data were tied and thus invisible in the unjittered visualization.

Use jitter to reveal more points. Jittering the data, that is, randomly increasing or decreasing the raw values using a function like =value + (RAND() - 0.5) / 3, leads to the presentation of more dots in the plot, improving the visualization of the relationship between the two variables. And drink as much or as little coffee as you’d like!

A Report Card for the Net Promoter Score

Jeff Sauro, PhD • Jim Lewis, PhD — Tue, 21 Oct 2025 21:53:35 +0000

Should you use the Net Promoter Score? Maybe, maybe not.

We’re not here to debate whether you should use it or not (and you may not have a choice). Instead, we want to use data (rather than opinions) to review and grade 13 claims made about the NPS, some from NPS critics and others from NPS proponents. At the end, we give a report card on how well these claims stand up against the evidence.

NPS Background (If You Need It)

If you are somehow not familiar with the NPS, here’s some quick background. It was introduced in 2003 by Fred Reichheld in a Harvard Business Review article, “The One Number You Need to Grow.” The NPS is computed from multiple responses to one question about likelihood to recommend (LTR, Figure 1), usually followed by an open-ended question in which respondents can discuss the reason for their ratings because, on their own, measures of behavioral intention like the LTR cannot diagnose specific problems (and no one has ever credibly claimed that they can).

Figure 1: The likelihood-to-recommend (LTR) rating that is the basis of the Net Promoter Score (NPS).

In NPS terminology, respondents who select 9 or 10 on the LTR question are Promoters, those who select 0 through 6 are Detractors, and all others are Passives. The NPS is the percentage of Promoters minus the percentage of Detractors (the “net” in “Net Promoter”). So, while we often use “LTR” and “NPS” interchangeably, there’s a subtle difference when doing analysis (LTR is an eleven-point rating scale, NPS is a trinomial derived from LTR ratings).
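The scoring rule just described can be sketched in a few lines of Python. The function name `net_promoter_score` and the sample ratings are hypothetical and used only to show the arithmetic; the mean LTR is printed alongside to show that the two summaries are computed differently from the same ratings.

```python
def net_promoter_score(ltr_ratings):
    """Compute the NPS (in percentage points) from 0-10 LTR ratings."""
    if not ltr_ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ltr_ratings if r >= 9)     # ratings of 9 or 10
    detractors = sum(1 for r in ltr_ratings if r <= 6)    # ratings of 0 through 6
    # Passives (7 or 8) count toward the denominator only.
    return 100 * (promoters - detractors) / len(ltr_ratings)


ratings = [10, 9, 9, 8, 7, 7, 6, 5, 3, 10]   # hypothetical responses

print("NPS:", net_promoter_score(ratings))
print("Mean LTR:", sum(ratings) / len(ratings))
```

In this made-up sample, four promoters and three detractors out of ten respondents give an NPS of 10, while the mean LTR is 7.4.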

A substantial number of Fortune 500 companies use the NPS and even report on it in their earnings calls. The NPS is ubiquitous, with some saying it has a cult-like following among CEOs.

It’s also developed an anti-cult, with some calling its use harmful. We have extensively reviewed the published literature and have conducted a lot of our own analyses to generate data to sift through the noise and bluster. We discussed many of these findings in our three-part course on the Net Promoter Score.

Here are 13 claims that we’ve examined using published data and our own analyses:

NPS is predictive of future company growth.
NPS is a consistently better predictor of growth than satisfaction.
NPS is a reliable metric.
NPS’ net-box scoring method can be justified.
Promoters actually recommend.
Detractors actually detract.
The LTR item has an appropriate number of points.
A single item is sufficient for measuring the NPS.
Measuring future recommendation instead of past behavior is better.
NPS predicts likelihood to discourage.
NPS should always be used.
NPS should never be used.
NPS has no value because it doesn’t tell you what to fix.

Now to the grades.

NPS is predictive of future company growth (C+)

The title “The One Number You Need to Grow” is a strong claim, but it’s common practice in business journals to write such titles to draw attention to an article; any caveats, exceptions, or other nuances appear in the body of the article. To support that claim, Reichheld reported that the NPS was the best or second-best predictor of growth in 11 of 14 industries. Others have challenged that claim.

In 2018, we revisited the data Reichheld used to support the claim that NPS was predictive of company growth. Reichheld had actually demonstrated a correlation between the NPS and past growth rates, so for this replication, we tracked down data from seven industries that Reichheld had cited in his 2003 HBR article and 2006 book The Ultimate Question. The resulting regression models did corroborate Reichheld’s findings, with the NPS accounting for a substantial 38% of the changes in growth for the seven industries for the following two years and 30% for the following four years. But were those seven industries cherry-picked?

In our second study, we increased the number of industries to 14 (with 158 companies), deliberately excluding the industries reported by Reichheld. For that expanded dataset, the relationship between NPS and future company growth was weaker but still significant, accounting for 12% (r = .35) of change in growth over two years and 10% (r = .31) over four years.
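The percentages and correlations reported here are two views of the same quantity: the proportion of variance accounted for is the square of the correlation. The short Python sketch below shows the conversion for the second study. The helper names are hypothetical, and the back-calculated correlation magnitudes for the first study are our own arithmetic from the reported 38% and 30%, not figures reported in the source.

```python
from math import sqrt


def variance_explained(r):
    """Proportion of variance accounted for by a correlation r."""
    return r ** 2


def implied_r_magnitude(r_squared):
    """Magnitude of r implied by a proportion of variance (sign is not recoverable)."""
    return sqrt(r_squared)


# Second study: correlations reported in the text.
for label, r in [("two years", 0.35), ("four years", 0.31)]:
    print(f"{label}: r = {r:.2f}, variance explained = {variance_explained(r):.0%}")

# First study: only percentages were reported; the magnitudes are back-calculated.
for label, r2 in [("two years", 0.38), ("four years", 0.30)]:
    print(f"{label}: {r2:.0%} explained implies |r| of about {implied_r_magnitude(r2):.2f}")
```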

This line of research has shown a strong correlation between NPS and historical revenue and a modest ability to predict immediate and distant future growth that is stronger in some industries than others. These findings do not support claims that the NPS has no predictive value. It’s not a perfect predictor (no measure is), but it’s also not a perfectly useless one.

Overall, our grade for the claim that the NPS is predictive of future company growth is C+.

NPS is a consistently better predictor of growth than satisfaction (D)

Satisfaction and behavioral intentions are theoretically different. Satisfaction is a feeling, while NPS is based on the behavioral intention to recommend. Satisfaction is the spark, while behavioral intention is the flame. One ignites the other, but they’re distinct. Without the spark of a positive feeling, there’s no fire to recommend. Nonetheless, in the space of high-stakes competition among consulting companies that promote different metrics, Reichheld made it a point to claim that the NPS is better than satisfaction at predicting company growth despite a long research history showing that satisfaction is linked to future business outcomes, including customer retention and share of wallet. For example, Reichheld wrote:

Our research indicates that satisfaction lacks a consistently demonstrable connection to actual customer behavior and growth.

It is difficult to discern a strong correlation between high customer satisfaction scores and outstanding sales growth.

Broad analyses of the relationship between feelings and behavior usually find modest correlations around .35, while correlations between intentions and behaviors tend to be a bit stronger (around r = .50). But is the relationship between the NPS and behavior consistently stronger than the one between satisfaction and behavior (as measured with business metrics)?

NPS and satisfaction scores are typically highly correlated (averaging around .83), so you would expect only a modest difference between how NPS and satisfaction correlate with the same outcome variable.

And, indeed, that’s what we found in our extensive analysis. Returning to the expanded dataset we collected with NPS and growth numbers for 14 industries and 158 companies, satisfaction data from the American Customer Satisfaction Index (ACSI) was available for 71 companies from 12 industries. For seven of those industries, the NPS had a higher correlation with growth than satisfaction, and for the other five, the reverse was the case.

As you’d expect from their strong correlation, the NPS is not a consistently better predictor of growth than satisfaction, so the grade for this claim is D. (It would have been F if the NPS were consistently worse or A if the NPS were always better.) In practice, it looks like it probably doesn’t matter much which of these two metrics a business uses, and there doesn’t seem to be any obvious upside for a business to change from one to the other.

NPS is a reliable metric (A)

Statistical reliability assesses the consistency of measurement. To estimate the test-retest reliability of the NPS, we conducted an experiment in multiple phases. First, 2,529 U.S.-based online respondents rated their experiences (including brand attitude, satisfaction, and LTR) with one of a selected set of top 50 brands. In the second phase, participants received the same survey from 17 to 47 days later; a final sample size of 259 responded to both surveys. The correlation between the first and second LTR ratings was .75 (with a nonsignificant change in means of just 1.2%). There is little consensus on what constitutes “adequate” or “good” test-retest reliability, but a common criterion for an acceptable level of reliability is .70. With that threshold, the NPS performs well.
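As a rough illustration of the test-retest computation, the sketch below correlates two waves of ratings from the same respondents and compares the result to the .70 criterion. The `pearson_r` helper and both lists of ratings are hypothetical; this is not the study data or the exact analysis we ran.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical LTR ratings from the same eight respondents at two points in time.
first_wave = [10, 9, 7, 6, 8, 3, 10, 5]
second_wave = [9, 9, 8, 5, 8, 4, 10, 7]

r = pearson_r(first_wave, second_wave)
print(f"test-retest r = {r:.2f}; meets .70 criterion: {r >= 0.70}")
```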

Our grade for the claim of acceptable NPS reliability is A.

NPS’ net-box scoring method can be justified (A−)

Despite some criticism that no evidence supports the NPS practice of designating LTR responses from 0 to 6 as detractors, 7–8 as passives, and 9–10 as promoters, Reichheld explained in the 2003 HBR article:

When we examined customer referral and repurchase behaviors along this [0–10-point] scale, we found three logical clusters. “Promoters,” the customers with the highest rates of repurchase and referral, gave ratings of nine or ten to the question. The “passively satisfied” logged a seven or an eight, and “detractors” scored from zero to six.

Specifically, Reichheld has reported that respondents designated as detractors produced 80% of the total negative word-of-mouth comments, and those designated as promoters provided 80% of referrals.

Reichheld described the computation of the NPS as subtracting the percentage of detractors from the percentage of promoters without explicit justification. Is this scoring method better than using means? We’re not so sure. It is, however, common practice in business analysis to use these types of net-box scores, where it’s deemed acceptable to give up some information in the LTR to focus on the extreme ratings because they are often better predictors of future behavior. Importantly, in our earlier analysis of NPS predicting future growth, we did use the net-box scoring, and it performed as well as or better than satisfaction ratings using means.

Because they provide different types of information, we recommend tracking both the NPS and the mean LTR.

Our grade for the claim that the NPS box scoring method is justified is A−.

Promoters actually recommend (A)

We conducted a longitudinal study to see whether we could replicate Reichheld’s estimate that promoters provide about 80% of referrals. We asked participants from an online U.S. panel to rate their likelihood to recommend several common brands (n = 6,026), their most recent purchase (n = 4,686), and their most recently recommended company/product (n = 2,763) using the eleven-point LTR item (between 502 and 1,027 LTR ratings per brand).

We followed up with a second survey at 30, 60, and 90 days with similar brand lists, asking which, if any, respondents had recommended during that time. About 28% of all respondents reported recommending, with 51% of the recommendations coming from promoters. Promoters accounted for 77% of recommendations from the most recently recommended product and 60% for the most recent purchase. Promoters were between 2 and 16 times more likely to recommend than detractors. Our estimates of the percentage of recommendations coming from promoters were lower than Reichheld’s 80%, but not that much lower (51% to 77%). We believe some of the confusion around this claim comes from a belief that most promoters recommend—often, only around 25% do. Rather, when people do recommend, they are much more likely to be promoters.
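The distinction between these two percentages is easy to blur, so here is a small Python sketch that computes both from the same table. All counts are hypothetical and chosen only to make the arithmetic easy to follow; the function name `recommendation_breakdown` is likewise our own.

```python
def recommendation_breakdown(counts):
    """Map category -> (rate within category, share of all recommendations)."""
    total_recommenders = sum(rec for _, rec in counts.values())
    return {
        category: (rec / n, rec / total_recommenders)
        for category, (n, rec) in counts.items()
    }


# Hypothetical counts: category -> (respondents, respondents who recommended).
counts = {
    "promoter": (400, 100),
    "passive": (300, 45),
    "detractor": (300, 15),
}

for category, (rate, share) in recommendation_breakdown(counts).items():
    print(f"{category}: {rate:.0%} recommended, {share:.1%} of all recommendations")
```

In this made-up table only a quarter of promoters recommend, yet promoters still account for well over half of all recommendations, which is the pattern described above.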

Our grade for the claim that promoters are more likely to recommend is A.

Detractors actually detract (A)

We now move to the detractors at the low end of the LTR scale. To investigate Reichheld’s claim that 80% of negative word-of-mouth comes from detractors, we consulted about 500 U.S. customers, who provided LTR ratings and open-ended comments regarding their most recent experiences with one of nine prominent brands and products. Of 452 comments collected in the study, 39% were positive and 21% were negative.

About 24% of negative comments came from people who selected 0 on the LTR. In contrast, 35% of positive comments came from people who gave an LTR rating of 10. As the LTR increased from 0 to 10, the percentage of negative comments went down (r = −.71) and the percentage of positive comments went up (r = .87).

After applying the NPS classification scheme, 90% of negative comments came from detractors, while only 14% of positive comments did (Figure 2). We again think there is some confusion around this claim. It’s not that most detractors make bad comments or say bad things; it’s that the people who do say bad things (negative word of mouth) are much more likely to be detractors.

Figure 2: Percent of positive or negative comments associated with each NPS classification.

Our grade for the claim that detractors are more likely to detract is A.

The LTR item has an appropriate number of points (A)

The NPS has been criticized for using the eleven-point LTR scale and then converting those to three response categories (detractor, passive, promoter). Why not just ask people if they’d recommend, wouldn’t recommend, or are unsure? Or why not use a scale with fewer points?

The main problem with using a three-point scale is the loss of response intensity. With just three options, people are restricted to Yes, Maybe, and No. With eleven options in the NPS framework, people can indicate much finer gradations of intention than they could with just three options, from “NO!!!!” to “No” to “no” to “yes” to “Yes” to “YES!!!” As mentioned earlier, more extreme intentions are associated with a greater likelihood of action, so it’s important to capture the extremes.

When we tested three- and eleven-point versions of the LTR in a survey with 824 respondents, 63% rated three points as easier, but 81% rated eleven points as allowing them to express their feelings more adequately. Respondent behavior was very different when the LTR was three versus eleven points. For example, as shown in Figure 3, “Yes” clearly clusters near the top of the scale, but the eleven-point scale differentiated between the extreme “Yes” responders—53% selected 10, compared to the less extreme 9s (17%)—and more tepid 8s and 7s (11% and 9% respectively). The same is seen with the “No” respondents, who also selected 0 the most (43%), but there were quite a few less intense responses (26% selected between 1 and 4).

Figure 3: Distribution of how Yeses, Maybes, and Noes map to the eleven-point LTR item, showing not all Yeses are created equal, with more than half selecting 10 and 37% selecting 7, 8, and 9.

But what about using a five- or ten-point scale instead of the eleven-point scale? We also looked into those and found only modest differences among scores generated from five-, ten-, and eleven-point scales. We recommend using the eleven-point scale (for comparability to published benchmarks), but if you have historical data using a ten- or five-point version, you should probably stick with it. There’s little benefit in changing to an eleven-point version, but keep in mind your NPS scores will have some comparison error to benchmarks (our analysis suggested around +/− 4%).

Our grade for the claim that the LTR item has an appropriate number of points is A.

A single item is sufficient for measuring the NPS (A)

A typical practice in the development of standardized metrics is to have multiple items per construct because multi-item metrics tend to be more reliable (and it’s easier to measure their reliability). There are, however, numerous single-item metrics in use that work well. One way to increase the reliability of single-item metrics is to increase the number of response options, which is why single-item measures of behavioral intention (like LTR) often have seven to eleven points.

Some NPS critics suggest that a large number of response options is difficult to use or that people just won’t use the full range offered. When we’ve investigated actual respondent behavior with LTR, we’ve found no indication that respondents have trouble selecting a response from an eleven-point scale, are confused by the number of response options, or restrict the range of their selections.

Our grade for the claim that a single item is sufficient for measuring the NPS is A.

Measuring future recommendation instead of past behavior is better (A−)

Another common NPS criticism is the claim that it’s better to ask people what they’ve recently done rather than asking them what they intend to do in the future. After all, who can really accurately predict the future?

Even though it seems like past behavior should indicate a behavior that people habitually repeat and thus should be preferred instead of intentions that can be forgotten or thwarted, a literature review by Ouellette and Wood (1998) found that behavioral intentions (what people think they will do) were better predictors of future behavior (r = .54) than past behaviors (r = .39), especially when the behavior isn’t performed on a frequent basis.

In our own research (n = 2,672), where respondents were asked to give current LTR ratings of products they had recommended in the past, 8% indicated they would not recommend them in the future (usually due to a problem experienced with the product or lack of recommendation opportunity).

Our best estimate from the recommendation literature and our own research is that between 50% and 60% of promoters ultimately make recommendations (about three times the percentage of detractors who ultimately recommended).

Critics might take these percentages as more of a glass-half-empty indicator because not all who intend to recommend ultimately do so. However, it isn’t realistic to expect recommendation behavior to equal recommendation intention. Once someone expresses an intention to recommend, there are obstacles to actual recommendation, including never having an opportunity to recommend or not having a strong enough intention to exert the effort to recommend. A more positive framing of the key finding from these studies is that there is a strong relationship between the intensity of the intention to recommend and future recommendation behavior, so changes to the quality of UX and CX that affect likelihood to recommend are important.

That said, because questions about past recommendations and intention to recommend appear to be measuring different things, there may be value in including an item in surveys that asks about past recommendation behaviors. In other words, past recommendations are a complement, not a replacement, for future intent.

Our grade for the claim that measuring future recommendation is better than focusing on past behavior is A−.

NPS predicts likelihood to discourage (C)

Does being less likely to recommend mean that people will discourage others? There isn’t much research published on the relationship between likelihood to recommend and active discouragement as opposed to simply being unwilling to recommend. In our analysis of the likelihood of 324 social media users to recommend and their likelihood to discourage the use of social media platforms, we found likelihood to recommend predicts about 30% of likelihood to discourage. That is significant, but it leaves about two-thirds to three-fourths of the variation in likelihood to discourage unaccounted for. This suggests that NOT recommending is not a strong substitute for measuring intent to recommend against or discourage others from a brand. If researchers can get ratings of only one behavioral intention in contexts where recommendation is a plausible user behavior, it should be LTR. For a clearer picture of the full range of behavioral intention, there appears to be value in also collecting ratings of likelihood to discourage. This doesn’t fail the claim because if you want to stick with just one item (LTR), our analysis suggests you still get a good, albeit imperfect, read on discouragement.

Our grade for the claim that the NPS predicts likelihood to discourage is C.

NPS should always be used (D)

The NPS is very popular, and some of its proponents push for its near-universal use. Yet even in Reichheld’s original HBR article, despite the claim in the title “The One Number You Need to Grow,” he reported that LTR was best for most but not all companies, ranking first or second in 11 out of 14 case studies. In our experience, LTR works well only when the product or service can reasonably be recommended. For example, it can seem unreasonable to ask employees to use LTR to rate enterprise software they must use for, say, booking travel or doing expense accounting. In our experience, employees asked to rate their likelihood of recommending these types of products either balk at the request or treat the LTR as if it were a satisfaction rating. They can do it, but it’s not ideal.

This claim doesn’t completely fail because even in cases where the NPS is used inappropriately, it often correlates highly with other, more appropriate measures of satisfaction and loyalty. Often, having the same common measure, even an imperfect one, can keep dialogue going when departments speak different languages. While the word “OK” isn’t always the best way to express yourself around the world, it’s almost always understood.

Our grade for the claim that the NPS should always be used is D.

NPS should never be used (F)

Based on our analysis of the other claims, it does not seem reasonable to say that the NPS should never be used, though there are some who assert this. The NPS has its weaknesses, but it has considerable strength when used properly (e.g., together with open-ended questions to get details about the “why” of the rating) in the proper context.

There can be legitimate concerns about Campbell’s Law (also known as Goodhart’s Law)—that using metrics like the NPS for high-stakes incentives can lead to corruption of the process, which then leads to the Sauro and Lewis corollary, mismanagement leads to mismeasurement. The solution to this problem isn’t to drop the NPS, as any metric that replaces it will become subject to Campbell’s Law. Better approaches are to use other metrics (e.g., actual sales) for incentives or to increase governance to reduce corruption and the resulting distortion of the measurements.

Our grade for the claim that the NPS should never be used is F.

NPS has no value because it doesn’t tell you what to fix (F)

This claim is more of a misconception or misunderstanding about the value and purpose of measures like the NPS. The criticism voiced by a few notable UX figures is that a measure should tell you what to do and, worse, that a measure that doesn’t is flawed or has no practical value. Following this argument, the NPS is not diagnostic; therefore, it has no value. We disagree.

Throwing out measures because they don’t tell you what to do means throwing out most measures in most situations. From the NPS and satisfaction to temperature and weight, as well as completion rates and SUS scores, many measures aren’t diagnostic.

But it’s unrealistic to expect a high-level measure to dictate a specific course of action (e.g., “Do we lower prices,” “Should we invest in that new feature,” “What color button do I use,” or “Do we change the top navigation labels”). While some standardized UX questionnaires can help narrow your focus, it’s a stretch to expect them to tell you exactly what to fix.

For example, the SUPR-Q® has items on ease/navigation, trust, and appearance. Scoring lower on these attributes narrows the focus but won’t tell you exactly what’s causing it. Likewise, the UX-Lite has items that assess usefulness and usability. Scoring low on usefulness, for example, narrows in on functionality more than usability, but it doesn’t tell you what’s not useful. This is why leading practice in using the NPS is to include an open-ended question in which respondents can explain their rating. This combination is powerful because tracking the NPS over time enables detection of unexpected declines in the high-level measurement, while the open-ended question provides context and corrective direction.

It’s not choosing between a measure or an observation; it’s choosing the right measures for your observations. This claim has a serious logical flaw because it’s a bit like saying thermometers have no measurement value because they don’t tell you how to react to a fever (e.g., ice bath, pain reliever, antibiotics, some combination of treatments).

Our grade for the claim that the NPS has no value because it doesn’t tell you what to fix is F.

Summary and Report Card

Figure 4 shows the final report card for these claims about the NPS.

Figure 4: The NPS claims report card.

Our review of these claims and the evidence for and against them somewhat favors the ability of the NPS to predict growth, but it’s not consistently better than more traditional measures of satisfaction. It is also partially predictive of discouragement, but not well enough to substitute for specific discouragement metrics when carefully measuring intention to discourage is important.

On the more positive side, the NPS is a reliable metric and, despite seeming “wacky” to some critics, Reichheld’s box scoring produces a reasonable partitioning of LTR responses into the three NPS categories (detractor, passive, promoter) with evidence that promoters actually promote and detractors actually detract. The single LTR item is sufficient to measure recommendation likelihood, the eleven-point scale works well in practice to identify extreme responders, and future recommendation works a little better than past behavior when predicting future recommendation behavior.

Regarding when the NPS should be used, neither of the extreme claims (always use it or never use it) is warranted. The decision about whether to implement the NPS in a company depends on whether it’s possible for their product or service to reasonably be recommended. Companies that already use the NPS should probably continue to do so, while companies that use a standard satisfaction metric should probably continue to do that.

UX Practitioners’ Satisfaction with Pay Transparency

Jim Lewis, PhD • Jeff Sauro, PhD — Wed, 15 Oct 2025 00:30:07 +0000

Is sharing pay information a good idea? What happens when companies share more about how they pay their people?

So-called pay transparency refers to company policies that encourage the sharing of compensation-related information, such as salary ranges, pay scales, and compensation structures. This information may be supplied to current employees, job candidates, or the public. If current trends continue, by 2026, about half of U.S. employees will work for companies that practice some degree of pay transparency.

It seems that greater pay transparency would reduce unfair or discriminatory compensation policies, and there is some evidence that it helps. But the Harvard Business Review article “The Complicated Effects of Pay Transparency” lists a set of potential downsides of increased pay transparency that research has revealed. For example, pay transparency can lower employees’ relative bargaining power and can detach pay from performance, leading to potential consequences for productivity and turnover. Furthermore, when pay transparency reveals unfair compensation decisions, employee morale declines.

How has pay transparency affected the UX profession? We have some data from the 2024 UXPA salary survey to help us assess UX practitioners’ attitudes toward pay transparency.

The 2024 UXPA Salary Survey

In 2024, we continued a decades-long tradition by partnering with the UXPA to collect and understand the UX profession’s compensation, skills, and composition.

The survey data came from respondents recruited through postings on professional networks and websites such as UXPA and LinkedIn. Additional respondents were recruited through snowball sampling. The survey ran from April through October 2024, and the final sample size was 444 respondents from around the world (67% from the U.S., 4% from the UK, 4% from Canada, and smaller percentages from 34 other countries). Most respondents reported being User Researchers (64%) and User Experience Designers (31%).

The core salary questions in the survey remain consistent across administrations, but each survey also includes a set of custom items. In the 2024 survey, those custom items, shown in Figure 1, included three on the topic of pay transparency to estimate the range of pay transparency, satisfaction with pay transparency, and satisfaction with the fairness of compensation decisions (n = 383 who responded to these questions).

Figure 1: The pay transparency items from the 2024 UXPA salary survey (created on the MUiQ® platform).

Pay Transparency Scope

Table 1 shows the percentages of responses to the select-all-that-apply question about different types of pay transparency.

Scope of Pay Transparency	%
Key executives	6%
Anyone in my company	7%
Anyone on my team	10%
New job postings	24%
None of these	49%

Table 1: Scope of UX practitioners’ pay transparency.

Roughly half (51%) of respondents reported working for a company with at least some pay transparency (consistent with published estimates for all U.S. employees), with the most common category being new job postings (24%).

Low Satisfaction with Pay Transparency and Pay Decisions

We next analyzed the responses to the two questions that asked respondents about their satisfaction with pay transparency and compensation decisions. As shown in Figure 2, ratings were low for satisfaction with pay transparency and fairness of compensation decisions, with both failing to pass the midpoint of the scale—3.9 for satisfaction with pay transparency and 4.7 for satisfaction that compensation decisions are fair.

Figure 2: UX practitioners’ satisfaction with pay transparency and fairness.

To dig deeper into how pay transparency might affect these satisfaction ratings, we split the sample between respondents who reported having some type of pay transparency and those who responded “None of these” on the select-all-that-apply item (Figure 3). Those who had some pay transparency had significantly higher satisfaction ratings than those who did not (Pay Transparency: t(381) = 7.0, p < .0001; Fair Compensation Decisions: t(381) = 4.5, p < .0001).

Figure 3: Satisfaction with pay transparency and fairness as a function of pay transparency scope.

Pay Transparency and Fairness Associated with Higher Job Satisfaction

Finally, we computed correlations between satisfaction with pay transparency, compensation fairness, and overall job satisfaction, with the results shown in Table 2.

Correlation between	r	R²	p
Pay Transparency and Fair Compensation	0.704	50%	< .0001
Pay Transparency and Job Satisfaction	0.333	11%	< .0001
Fair Compensation and Job Satisfaction	0.438	23%	< .0001

Table 2: Correlations among pay transparency, compensation fairness, and overall job satisfaction.

For this set of satisfaction ratings, we found:

Satisfaction with pay transparency accounted for 50% of the variation in satisfaction that compensation decisions are fair.

Satisfaction with fair compensation decisions accounted for 23% of the variation in overall job satisfaction.

Note that in our analysis of what drives UX salary, we did not find that pay transparency had a measurable effect on driving higher or lower salaries. There are several potential reasons why this effect wasn’t significant in our regression modeling of UX salaries. It might not affect pay at all, it could be a small effect undetectable with our sample size, or it could be accounted for in other variables that competed with it in the regression model. This last possibility is especially intriguing because variation in pay transparency across U.S. regions, which were in the regression model, may have sufficiently accounted for any variation in salary driven by a more limited binary measure of no transparency versus some transparency (e.g., California has some of the strictest pay transparency laws in the U.S. and also has the highest UX salaries).

Summary

Using data from UX practitioners who answered the pay transparency question in the 2024 UXPA Salary Survey (n = 383, 67% U.S.-based), we examined how common pay transparency is within UX organizations and how it relates to satisfaction with compensation and overall job satisfaction. We found:

About half of UX professionals report at least some level of pay transparency. This is consistent with projections that by 2026, half of U.S. employees will work for companies that have pay transparency policies.

Satisfaction with pay transparency and fairness is low among UX practitioners. Mean satisfaction scores were below the midpoint for satisfaction with pay transparency and satisfaction that compensation decisions are fair, implying there is room for improvement in compensation communication and policies.

Pay transparency is associated with higher job satisfaction scores. UX practitioners who reported having some pay transparency had significantly higher mean satisfaction scores for pay transparency and perceived fairness of compensation than those practitioners who reported no transparency.

Fairness matters. Perceived fairness in pay decisions is more strongly associated with overall job satisfaction than pay transparency.

Transparency doesn’t necessarily lead to higher pay. But pay transparency is likely linked to better morale and trust within organizations and reduced incidence of unfair pay disparity.

Pay transparency needs to be carefully managed. The findings from the 2024 UXPA salary survey strongly suggest that there is plenty of room to improve pay transparency in the companies that hire UX practitioners, but it is critical that pay transparency policies support rather than diminish the belief that compensation is fair.

What UX Hiring Managers Want and What UX Practitioners Report Doing (2025)

Jim Lewis, PhD • Jeff Sauro, PhD — Tue, 07 Oct 2025 22:44:58 +0000

Is there a misalignment in skills in the UX profession? Specifically, when UX managers are hiring, is there a misalignment in what applicants and current practitioners report doing and the skills managers need?

In a previous article, we did a deep dive into what UX practitioners reported doing in the 2024 UXPA salary survey, tracking UX methods usage from 2014 through 2024.

But to what extent do the activities reported by UX practitioners in 2024 match what UX hiring managers looked for in 2025?

The 2024 UXPA Salary Survey

Roughly every two years, MeasuringU assists the User Experience Professionals Association (UXPA) in conducting a salary survey to glean insights from the UX community and track historical patterns in salaries and other aspects of UX work.

For the 2024 survey, data were collected from April through October 2024, receiving 444 responses from 37 countries, with 67% of responses from the U.S., 4% from Canada, and 4% from the UK. In addition to the main topic of salary, the 2024 survey included attitudes about certification, previous and expected use of AI, and planned hiring in 2025. While the data are about a year old now, they still provide a good contemporary snapshot of the industry and will likely remain applicable into 2026.

UX Managers’ Hiring Goals for 2025

For the 71 respondents who planned to hire in 2025, two of the questions they answered concerned the number of planned hires and their desired skills.

Number of Planned Hires in 2025

As shown in Figure 1, of managers who intended to make hires, 72% planned to make one or two UX hires, 83% planned to make one to three hires, and 94% planned to make up to five hires. Only 5% planned to make more than eight new hires.

Figure 1: Number of planned UX hires in 2025.

Comparing What UX Hiring Managers Want with What UX Practitioners Report Doing

Table 1 shows a side-by-side comparison of the skills UX hiring managers wanted to hire in 2025 and the skills that UX practitioners reported using in 2024.

Method	What Managers Want	What Practitioners Do	Absolute Difference
User research/interviews/surveys	77%	75%	2%
Usability testing	75%	69%	6%
Personas/user profiles	59%	59%	0%
Heuristic expert review	51%	50%	1%
Information architecture	49%	42%	7%
Surveys	49%	59%	10%
Online research	48%	34%	14%
Benchmarking competitive studies	48%	44%	4%
Interface interaction design	44%	36%	8%
Card sorting	44%	45%	1%
Task analysis	42%	32%	10%
Creating high fidelity prototypes	42%	33%	9%
UX design workshops	41%	43%	2%
Creating low fidelity prototypes	38%	37%	1%
Requirements gathering	37%	37%	0%
Visual graphic design	35%	21%	14%
Contextual inquiry/ethnography	34%	34%	0%
Focus groups	30%	27%	3%
Conceptual design	30%	38%	8%
Satisfaction surveys	30%	39%	9%
Strategic consulting	28%	30%	2%
Ethnography	25%	20%	5%
Analyze web metrics	25%	31%	6%
Market research	21%	27%	6%
Training in UX	20%	29%	9%
Accessibility testing	18%	21%	3%
Accessibility expert reviews	17%	13%	4%
Content strategy	11%	16%	5%
Web development	10%	9%	1%
Technical writing	8%	10%	2%
Content creation	8%	17%	9%
Eye tracking	6%	7%	1%

Table 1: Skills UX managers wanted to hire in 2025 and what UX practitioners reported doing in 2024 (heat map in the Absolute Differences column goes from green for small differences to red for large differences).

Table 1 shows strong correspondence between the skills UX managers wanted in 2025 and the skills UX practitioners reported using in 2024.

In all but two cases, the difference between manager and practitioner percentages was 10 percentage points or less. The exceptions were Online Research (48% managers, 34% practitioners, difference of 14%) and Visual Graphic Design (35% managers, 21% practitioners, difference of 14%). Of these two, only Online Research was in the top ten wanted skills for managers.

The term “online research” can mean two things to UX practitioners. One is conducting primary research (e.g., running your own unmoderated usability study) with a tool like the MUiQ® platform. The other is conducting secondary research (critically examining the published research of others to see what they have already discovered) using web services like Google Scholar or AI services like Deep Research. Jakob Nielsen recently described secondary research as “one of the most underutilized weapons in the UX arsenal,” so it is encouraging to see online research in the top ten skills sought by hiring UX managers.

Another way to visualize the relationship between the skills UX managers want to hire and what UX practitioners report doing is with a scatterplot (Figure 2). A regression line through the points in the scatterplot shows 87% shared variance between the percentages for managers and practitioners (a highly significant correlation, r(30) = .93, p < .0001). The regression line has a slope of .86 and an intercept of .04, which is close to the slope of 1 and intercept of 0 that perfect agreement would produce.

Figure 2: Scatterplot of skills UX managers wanted to hire in 2025 and those UX practitioners reported using in 2024.
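
As a rough check on the regression summary, the sketch below refits the line from the rounded percentages in Table 1. It assumes practitioner percentages are regressed on manager percentages by ordinary least squares; the names PAIRS and fit_line are ours for illustration, and because the inputs are rounded, the results should only land close to the reported values.

```python
from math import sqrt

# (managers %, practitioners %) pairs transcribed from Table 1, in table order.
PAIRS = [
    (77, 75), (75, 69), (59, 59), (51, 50), (49, 42), (49, 59), (48, 34), (48, 44),
    (44, 36), (44, 45), (42, 32), (42, 33), (41, 43), (38, 37), (37, 37), (35, 21),
    (34, 34), (30, 27), (30, 38), (30, 39), (28, 30), (25, 20), (25, 31), (21, 27),
    (20, 29), (18, 21), (17, 13), (11, 16), (10, 9), (8, 10), (8, 17), (6, 7),
]


def fit_line(pairs):
    """Return (r, slope, intercept) for y regressed on x by least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return sxy / sqrt(sxx * syy), slope, mean_y - slope * mean_x


r, slope, intercept = fit_line(PAIRS)
print(f"r = {r:.2f}, shared variance = {r * r:.0%}, df = {len(PAIRS) - 2}")
# The intercept is in percentage points; divide by 100 to express it as a proportion.
print(f"slope = {slope:.2f}, intercept = {intercept / 100:.2f}")
```

With 32 methods, the correlation has 30 degrees of freedom, which matches the r(30) notation in the text.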

We added five callouts to Figure 2, one for the method least frequently chosen by managers and practitioners (eye tracking), two for the methods most frequently chosen by managers and practitioners (user research and usability testing), and two for the methods with the greatest differences between manager and practitioner selection frequencies (visual graphic design and online research).

Summary and Discussion

Of the 444 respondents in the 2024 UXPA Salary Survey, 71 were UX managers who intended to add staff in 2025. Key findings related to the plans and expectations of these managers were:

Strong expectations of at least one or two new hires: 72% of these UX managers planned for one or two new hires. Only 5% planned to make eight or more new hires, and 22% planned to increase their staff by three to five new hires.

Strong alignment of skills: The three top skills (user research, usability testing, personas/user profiles) were in the same order for managers’ selections of desired skills and the skills practitioners reported doing. Across all skills tracked in the UXPA survey, the correlation of selection percentages by managers and practitioners was very high (r(30) = .93, p < .0001).

Notable gap in online research: While alignment was strong overall, managers more frequently emphasized online research (48%) than practitioners reported doing (34%). This suggests a potential growth area, particularly in secondary research methods.

Bottom line: From a broad perspective, there was little daylight between the skills UX managers wanted to hire in 2025 and the skills of practitioners working in 2024. Of all the challenges UX applicants and employers face, skills mismatch should rarely be one they need to worry about.

Ten Things to Know About the SUPR-Qm

Jim Lewis, PhD • Jeff Sauro, PhD — Wed, 01 Oct 2025 04:14:56 +0000

We use our mobile phones a lot. Planning trips, sending money, following our favorite influencers, keeping in touch with friends and family.

While it seems commonplace now, the capabilities of our mobile phones and their applications provide a high level of convenience and speed (for better or worse) to our leisure and business.

The mobile app experience is a unique and defining aspect of our interactions with our devices. While the experience shares many characteristics with using software and websites on a traditional desktop/monitor, the mobility, screen size, and interaction style make the experience distinct. Consequently, we developed a questionnaire, the SUPR-Qm®, to measure attitudes toward the mobile app user experience.

In this article, we discuss ten things to know about the SUPR-Qm (now in its second version), from the statistical to the practical.

The SUPR-Qm is the most efficient published measure of the UX of mobile apps. We know of three published questionnaires that measure the UX of mobile apps. Two of them (MPUQ and the Aug-MOD) are lengthy diagnostic questionnaires with over 70 items each, neither of which has published norms for interpreting their scores. Furthermore, detailed diagnostic UX questionnaires tend to have items that become less relevant over time (e.g., from the MPUQ, “Are the HOME and MENU buttons sufficiently easy to locate for all operations?”). The third questionnaire (the original 16-item SUPR-Qm from 2017) is experiential (meaning it provides an overall, non-diagnostic score) with normed data available for interpretation. The SUPR-Qm V2 is a new streamlined five-item questionnaire derived from the SUPR-Qm.

The SUPR-Qm is easy to administer and connects to the SUS. The five items in the SUPR-Qm V2 are simple, five-point agreement scales that are easy and quick to administer and score (Figure 1). The items have been extensively tested (demonstrating reliability and validity) and have content that offers continuity with existing questionnaires like the SUS. Specifically, the item “The app is easy to use” can be used to compute an estimated SUS score with our regression equation.

Figure 1: The SUPR-Qm V2 questionnaire.

It’s scored on a 0–100 scale. The traditional way to score a Rasch scale composed of multipoint rating scales is to add a respondent’s ratings. For the SUPR-Qm V2, this would produce a scale from 5 to 25. To get a score that is easier to interpret, we convert the average of the selected ratings to a score from 0 to 100 using linear interpolation. For example, a raw SUPR-Qm V2 mean of 3.0 would be reported as 50 (((3 − 1) / (5 − 1)) × 100 = 50). The SUPR-Qm calculator will do this automatically for you.
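
The 0–100 conversion is a linear rescaling of the mean rating, which can be sketched in a few lines. This is a minimal illustration, not the SUPR-Qm calculator itself; the function name to_0_100 and the example ratings are made up for the sketch.

```python
def to_0_100(mean_rating, scale_min=1, scale_max=5):
    """Linearly rescale a mean rating on [scale_min, scale_max] to 0-100."""
    if not scale_min <= mean_rating <= scale_max:
        raise ValueError("mean rating is outside the response scale")
    return (mean_rating - scale_min) / (scale_max - scale_min) * 100


# Hypothetical responses from one participant to the five items.
ratings = [4, 3, 3, 2, 3]
raw_mean = sum(ratings) / len(ratings)  # 3.0
print(to_0_100(raw_mean))  # the scale midpoint of 3.0 maps to 50.0
```

The same function covers any rating scale with different endpoints by changing scale_min and scale_max.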

It’s easy to interpret with a curved grading scale. Grades are universally loathed by students but universally understood by executives. To develop a curved grading scale for the SUPR-Qm V2, we computed the associated logits, transformed the logits to probabilities, and assigned standard letter grades and grade points to those probabilities using common probability ranges for the grades, as shown in Table 1.

SUPR-Qm V2 Score	Curved Grade	Grade Point	Probability Range
87.0–100.0	A+	4.0	96–100%
79.0–86.9	A	4.0	90–95%
74.5–78.9	A−	3.7	85–89%
70.5–74.4	B+	3.3	80–84%
63.5–70.4	B	3.0	70–79%
60.5–63.4	B−	2.7	65–69%
57.5–60.4	C+	2.3	60–64%
46.5–57.4	C	2.0	41–59%
42.8–46.4	C−	1.7	35–40%
27.5–42.7	D	1.0	15–34%
0.0–27.4	F	0.0	0–14%

Table 1: Curved grading scale for interpreting SUPR-Qm V2 scores.

To illustrate the scoring system with real data, we used SUPR-Qm V2 ratings we collected of Spotify’s free (n = 46) and paid (n = 60) services (results shown in Figure 2). After interpolating the raw scores to a 0–100-point scale, the mean rating for the free service was 62.4 (a grade of B−) and for the paid service was 71.3 (a grade of B+). This difference was statistically significant (t(104) = 2.28, p = .025). For the free service, the 95% confidence interval ranged from 56.3 (C) to 68.5 (B), and for the paid service, the interval ranged from 66.3 (B) to 76.2 (A−). This means the population mean for the free service is unlikely to have a grade lower than C or higher than B, while for the paid service, the population mean is unlikely to have a grade lower than B or higher than A−.
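
Looking up a grade from Table 1 amounts to finding the highest grade whose lower bound the score reaches. The sketch below encodes the table that way; GRADE_FLOORS and curved_grade are illustrative names, and treating each range's lower bound as a cutoff (so a score such as 86.95 would fall into A) is an assumption about how to handle values between the one-decimal ranges.

```python
# (lowest score for the grade, grade) from Table 1, highest grade first.
# ASCII "-" stands in for the minus sign used in the table.
GRADE_FLOORS = [
    (87.0, "A+"), (79.0, "A"), (74.5, "A-"), (70.5, "B+"), (63.5, "B"),
    (60.5, "B-"), (57.5, "C+"), (46.5, "C"), (42.8, "C-"), (27.5, "D"),
    (0.0, "F"),
]


def curved_grade(score):
    """Return the curved letter grade for a 0-100 SUPR-Qm V2 score."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade


# Means and 95% confidence limits reported for the Spotify example.
for label, value in [("free mean", 62.4), ("free lower", 56.3),
                     ("free upper", 68.5), ("paid mean", 71.3),
                     ("paid lower", 66.3), ("paid upper", 76.2)]:
    print(label, value, curved_grade(value))
```

Grading the confidence limits as well as the mean gives the plausible range of grades rather than a single letter.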

Figure 2: SUPR-Qm V2 comparison of Spotify free and paid services (error bars are 95% confidence intervals, with the paid version scoring significantly higher than the free version).

The SUPR-Qm is not meant to be diagnostic. The SUPR-Qm provides an overall score of a mobile app’s user experience, but it won’t tell you what to fix. Its strengths are that it’s quick to take (five items) and easy to interpret. Historically, there have been two approaches to UX questionnaires. One is to have lots of items with the diagnostic goal of using the answers to point to problematic design areas (e.g., the 90-item Version 2 of the QUIS with ratings of system characteristics such as readability, response speed, helpfulness of error messages). The other, more experiential approach is to have rating scales that broadly assess how users feel about the system (e.g., the perceived ease-of-use and perceived usefulness items of the UX-Lite®). In this framework, the SUPR-Qm is an experiential questionnaire.

It evolved from 16 dynamic to five fixed items. The original SUPR-Qm contained 16 items developed with a technique called Rasch scaling, where items get progressively harder to agree with. It was designed to be administered adaptively using software (when people agreed, a more difficult item would be presented, and vice-versa). Under this adaptive approach, most participants would answer only four to seven items.

But it isn’t always practical to administer the questionnaire adaptively using software like MUiQ®, so researchers would often use all 16 items. To streamline the questionnaire, we analyzed responses from thousands of participants and used the Wright map shown in Figure 3 to identify a set of five key items. These items were selected to represent the full range of difficulty, from easiest to hardest, with coverage at key points in between. The result is the SUPR-Qm V2: a shorter, fixed five-item version that maintains the measurement precision of the original scale.

Figure 3: Wright map of the original SUPR-Qm with the selected items for the SUPR-Qm V2 indicated in boxes.

Note: A Wright map is organized as two vertical histograms with the items and respondents (persons) arranged from easiest (most likely to agree) on the bottom to most difficult (least likely to agree) on the top. For example, most participants agreed or strongly agreed (4s and 5s) with the items “Easy” and “EasyNav.” In contrast, few participants highly rated apps as “AppBest.” On the left side, the Wright map shows the mean (M) and two standard deviation points (S = one SD and T = two SD) for the measurement of participants’ tendency to agree. On the right side of the map, the mean difficulty of the items (M) and two standard deviation points (S = one SD and T = two SD) for the items are shown.

Three items aren’t enough. If we’re looking to cut, why not just use three items (easiest, hardest, and in the middle)? To see if three would do, we ran an experiment where we randomly assigned participants to an initial three-item or five-item version. Participants in both groups then answered the remaining 13 or 11 items (order randomized) so we could compute the full SUPR-Qm score for comparison. The score from the five-item version matched the full score, whereas the three-item version didn’t (Figure 4). So, we selected SUPR-Qm05 as the best candidate for SUPR-Qm V2.

Figure 4: Comparison of three- and five-item versions of the SUPR-Qm with concurrently collected 16-item versions.

Scores were stable over five years. An advantage of Rasch scaling is the theoretical stability of scales across changes in time, with some empirical estimates of Rasch scales being stable for as long as 15 years. To investigate the stability of the SUPR-Qm V2, we analyzed data collected with our MUiQ platform from February 2019 through May 2023, dividing the total sample size of 4,149 respondents into groups A (February 2019 through August 2021, n = 2,143) and B (February 2022 through May 2023, n = 2,006). As shown in Figure 5, the locations of scores on the underlying logit scales were nearly identical for both Group A and Group B, demonstrating the scale stability of the SUPR-Qm V2 with varying dates and industries.

Figure 5: Stability of the Rasch scales for SUPR-Qm V2, indicated by the overlap of lines for Groups A and B.

When using the SUPR-Qm V2, be sure to randomize the presentation of the items. Because the items are clearly different in how easy they are for respondents to agree with, it is important to avoid always showing them in either an easiest-to-hardest or hardest-to-easiest order. The best strategy for item presentation is randomization.
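
If your survey tool does not randomize item order for you, a per-respondent shuffle is straightforward to script. The sketch below is only an illustration: the item identifiers are placeholders rather than the actual SUPR-Qm V2 item wording, and presentation_order is a hypothetical helper.

```python
import random

# Placeholder identifiers for the five items (not the actual item wording).
ITEM_IDS = ["item_1", "item_2", "item_3", "item_4", "item_5"]


def presentation_order(item_ids, rng=None):
    """Return a freshly shuffled copy of the items for one respondent."""
    rng = rng or random.Random()
    order = list(item_ids)  # copy so the master list keeps its order
    rng.shuffle(order)
    return order


# Each respondent gets an independent random order.
for respondent in range(3):
    print(respondent, presentation_order(ITEM_IDS))
```

Passing a seeded random.Random instance makes an order reproducible, which can help when auditing what a given respondent saw.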

Use the SUPR-Qm V2 calculator to quickly score your data. We now have an Excel calculator designed specifically for use with the SUPR-Qm V2, making it easy to conduct analyses and produce a variety of useful graphs (Figure 6). To see the calculator in action, view the video tour.

Figure 6: Home page of the SUPR-Qm V2 calculator.

UX Professionals’ Job Satisfaction (2024–2025)

Jeff Sauro, PhD • Jim Lewis, PhD — Wed, 24 Sep 2025 02:44:06 +0000

The last couple of years have not been easy for those in the UX profession. With an increase in layoffs and AI disruption, uncertainty has grown about job security and even whether to leave the profession entirely.

How has this uncertainty affected the current satisfaction that UX professionals feel about their job?

What you do and who you work with impacts job satisfaction (which has been measured extensively for decades across industries).

To measure job satisfaction of UX professionals specifically, we looked to the UXPA salary survey, which has polled practitioners about their salary and job satisfaction since 2014. Participants are asked to rate their overall satisfaction with their current position on a scale of 0 to 100, with 0 being not satisfied at all and 100 being completely satisfied.

Job Satisfaction Declines (A Little)

With the UXPA survey results published in late 2024, we have a new datapoint. As shown in Figure 1, the mean satisfaction score from the 402 respondents who rated their job satisfaction was 70 (±2 with 95% confidence). This drop from the mean of 74 in 2022 is statistically significant, but it is consistent with the overall trend since 2014 (e.g., overlapping confidence intervals with 2016 and 2018).

Figure 1: Mean satisfaction by year (95% confidence intervals).

We conducted some follow-up analyses to dig into the drop in job satisfaction from 2022 to 2024.

Is Job Loss Driving Lower Satisfaction?

In 2022, no respondents out of 625 indicated they had lost their jobs; in 2024, 9 out of the 402 respondents selected “Other” for their employment status and shared that they were currently unemployed. This difference of 2.2% is small but statistically significant (Fisher p = .0002).
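
For readers who want to reproduce this kind of comparison, the sketch below computes a two-sided Fisher exact test from the counts in the paragraph above (0 of 625 in 2022 versus 9 of 402 in 2024). It assumes the common two-sided definition that sums the probabilities of all tables no more likely than the observed one; the function name is ours, and the original analysis may have used different software.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k cases in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more probable than the observed one.
    return sum(p for p in map(prob, range(k_min, k_max + 1))
               if p <= p_observed * (1 + 1e-9))


# Rows are survey years; columns are unemployed versus all other respondents.
p = fisher_exact_two_sided(0, 625, 9, 402 - 9)
print(f"difference = {9 / 402 - 0 / 625:.1%}, Fisher p = {p:.4f}")
```

The small tolerance in the comparison guards against floating-point ties when deciding which tables count as at least as extreme.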

When we compared the job satisfaction means for these two groups (unemployed and employed), the difference was statistically significant (t(8.2 df adjusted for heterogeneity) = 2.6, p = .03). The mean for unemployed respondents was 45 compared to 70.5 for employed respondents. Note, however, that the mean for employed respondents was only half a point higher than the grand mean, so the low scores for unemployed respondents are only part of the story.

We suspect that at least two factors in 2024 were not present in the job landscape of 2022: fear of layoffs and fear of AI. The nature of a salary survey like the one conducted by UXPA is that almost all respondents have job-related income. But even those who have jobs must still be affected by changes in their job marketplaces. In the 2022 UXPA survey, 17% of respondents reported loss of staff at their places of employment, but in 2024, that number was 35%—twice as many as in 2022. And in 2022, there was relatively little chatter about AI replacing UX jobs, but in 2024, over a quarter of respondents (28%) expressed concern that AI would cause UX job losses. In some ways, it’s surprising that the job satisfaction rating in 2024 stayed as high as it did.

UX Job Satisfaction is Comparable to Other Industries

To make any measure more meaningful, it needs to be compared to something. While there isn’t a single well-established database available for comparing UX satisfaction scores, other 100-point scales can be used to assess job satisfaction, as research has shown that single-item measures of job satisfaction highly correlate with multi-item measures. We therefore looked for other single-item measures of job satisfaction across industries. In previous years (2018 and 2022), we used Glassdoor’s job satisfaction report for the best jobs, but this was not available for 2024 or 2025. Our source for this article is the job satisfaction ratings from payscale.com.

To compare the UXPA satisfaction data to this data, the payscale.com average scores (collected using a five-point scale) were converted to a percentage using linear interpolation. For a five-point scale, this is (x − 1) / 4, expressed as a percentage where x is the average on the five-point scale. So, 4 on a five-point scale becomes 75%, which corresponds to 75% on the UXPA satisfaction scale (75 out of 100 becomes 75%). These scores are shown in Table 1 for various (mostly) tech jobs.

Job	Job Satisfaction
UX Designer	76%
Project Manager	75%
UX Researcher	74%
Data Analyst	74%
Solutions Architect	74%
Business Analyst	74%
UI Developer	71%
UXPA Average	70%
Data Scientist	70%
Mobile Engineer	70%
Reliability Engineer	69%
Nurse Practitioner	69%

Table 1: Job satisfaction scores from payscale.com (downloaded 9/2/2025, converted to % scale) and compared to the 2024 UXPA average score.

Satisfaction ratings for Glassdoor’s 2018 list of the top 20 jobs ranged from 3.9 to 4.3 (71% to 83%), making this a reasonable benchmark for establishing a range of good job satisfaction scores. The mean of 70% from the 2024 UXPA survey suggests that jobs within the UX profession are not too far from the top.

Table 1 includes the payscale.com job satisfaction scores for the key UX jobs of UX designer, project manager, UX researcher, and business analyst, all of which were higher than the overall UXPA mean of 70%. The UXPA 2024 job satisfaction means we computed for those jobs were, respectively, 71% (n = 118), 73% (n = 21), 70% (n = 262), and 76% (n = 12). These payscale.com and UXPA 2024 job satisfaction scores aren’t wildly different, but the payscale.com estimates tend to be higher, closer to the 2022 UXPA mean of 74%.

Many variables might contribute to these differences (including random variation), but they may be due in part to the different times these data were collected (from April to October 2024 for the UXPA ratings, September 2025 for the payscale.com ratings). In our previous article on projections for the UX job market in 2025, we concluded, “While the net jobs added and net business outcome metrics [for 2024] are at the lowest point since 2009, the data suggests the worst might be over. It’s hard to know what exactly will happen in tech, but if history is a guide, the subsequent two-year period after the nadir of 2009 had net jobs and business prospects at their highest.” Perhaps the higher UX job scores from payscale.com are related to some recovery in the UX job market over the past year.

Drivers of Satisfaction Are Harder to Model

When we looked to understand the key drivers of salary from the 2024 UXPA Salary Survey, we were able to explain a substantial 72% of the variance (similar to years past). When applying the same multivariate regression technique to explain satisfaction (including how salary may play a role), we could only explain a much lower 8% of the variance. This isn’t surprising because the correlation (the linear relationship) between salary and satisfaction was a significant but weak .11 (p < .03).

Part of the reason could also be the restriction of range (74% of respondents rated their job satisfaction as 65 or greater), and the relationship may not be well modeled by a line (linear relationship).

Historically, we find that the variables that have had a statistically significant impact on satisfaction include country, job, salary, and the number of hours worked per week. To find these relationships, we sometimes have to look at the extremes in some variables and not just the full range, suggesting some nonlinear relationships between satisfaction and job characteristics.

Job Satisfaction Varies by Country

In the analysis of salaries, the biggest predictor of salary was where you live, both the country and, within the U.S., the region. The highest salaries are in places where it’s very expensive to live (e.g., Northern California). To see how satisfaction differs by geography, we compared satisfaction scores by the countries with the most respondents for the last five salary surveys.

Figure 2 shows that respondents from Canada reported a higher mean satisfaction score (80) in 2024 than the other countries (significantly higher than the U.S. at 71 or the UK at 59). Canada was not significantly higher than Germany (68), most likely due to the uncertainty around the German estimate, given its small sample size (n = 12). The most surprising result given historical means was the very low level of job satisfaction in the UK. Even with its small sample size (n = 19), the upper limit of its 95% confidence interval fell below the lower limit of the Canadian interval and almost below the lower limit of the U.S. interval.

Figure 2: Difference in UX job satisfaction by country for 2014, 2016, 2018, 2022, and 2024 (95% confidence intervals).

Some Variation in Job Satisfaction by Job Title

Respondents in the UXPA salary survey were asked, “Which describes your current position? (Select all that apply).” Figure 3 shows the job titles for which the sample size was greater than ten respondents. The jobs with the nominally highest satisfaction (77) were Product Manager and Product Owner, while those with the nominally lowest satisfaction (69) were Usability Practitioner and Design Operations. Because the 95% confidence intervals overlapped and all contained the overall mean of 70, these nominal differences were not statistically significant.

Figure 3: Mean satisfaction scores for a selection of job titles (all titles with n > 10, 95% confidence intervals).

Higher Income Tends to Lead to Higher Satisfaction

Money might not buy happiness, but higher incomes tend to be associated with higher job satisfaction. As shown in Figure 4, respondents who reported salaries of $100K–150K (132 respondents) and above $150K (115 respondents) reported significantly higher average satisfaction scores than those with salaries of less than $50K (38 respondents).

Figure 4: Mean satisfaction levels by reported salary (all countries in USD, 95% confidence intervals).

To control for possible country effects, we examined differences for only the U.S. There were no differences in the means for the top three income groups. Only one U.S. participant reported a salary of less than $50K, so we can’t statistically assess the job satisfaction of that group.

Working More Pays More … but May Lead to Less Satisfaction

Respondents also reported how many hours they worked per week (Figure 5). Most (58%) reported working exactly 40 hours per week, and the bulk of respondents (71%) reported working 40 to 49 hours per week.

The overall main effect of hours worked per week was statistically significant (F(4, 394) = 2.7, p = .03), with a general pattern of falling job satisfaction as the hours increased from less than 40 hours per week (73) to 50–59 hours per week (64).
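One way to gauge the size of this effect is to convert the reported F statistic into eta-squared, the proportion of variance in satisfaction associated with the hours-worked groups. The following sketch assumes a one-way ANOVA (which the degrees of freedom suggest) and uses the rounded F value, so the result is approximate. The function name is ours.

```python
def eta_squared(f_stat, df_between, df_within):
    """Eta-squared for a one-way ANOVA, recovered from F and its df.

    Because F = (SS_between / df_between) / (SS_within / df_within),
    SS_between / SS_total = F * df_between / (F * df_between + df_within).
    """
    return (f_stat * df_between) / (f_stat * df_between + df_within)


# Values reported in the article: F(4, 394) = 2.7
f_stat, df_between, df_within = 2.7, 4, 394

groups = df_between + 1               # number of groups implied by the df
total_n = df_between + df_within + 1  # number of respondents implied by the df

print(f"groups: {groups}, respondents: {total_n}")
print(f"eta-squared: {eta_squared(f_stat, df_between, df_within):.3f}")
```

Under those assumptions, hours worked accounts for roughly 3% of the variance in satisfaction, a small effect even though it is statistically significant.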

However, things changed for those with 60+ hours. The eight respondents (about 2% of the sample) who reported working more than 60 hours had an average satisfaction of 80, very different from the drop in job satisfaction for this group that we observed in 2018 and 2022. This difference, as in our previous data, is not statistically significant because the sample size for 60+ hours was meager, and there was a lot of variability in ratings of satisfaction among those eight respondents (50, 65, 75, 80, 80, 90, 99, 100).

Figure 5: Satisfaction ratings by the number of reported hours worked per week.
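The eight ratings listed for the 60+ hours group make it possible to see how much uncertainty surrounds that mean of 80. The sketch below computes a t-based 95% confidence interval from those eight values. The critical value of t for 7 degrees of freedom is hard-coded, and the interval method is an assumption that may differ from the one used for the figure.

```python
import math
from statistics import mean, stdev

# Ratings reported in the article for respondents working 60+ hours per week
ratings = [50, 65, 75, 80, 80, 90, 99, 100]

T_CRIT_DF7 = 2.365  # two-sided 95% critical value of t with 7 degrees of freedom


def t_confidence_interval(values, t_crit):
    """Return (lower, mean, upper) for a t-based confidence interval."""
    n = len(values)
    m = mean(values)
    standard_error = stdev(values) / math.sqrt(n)  # stdev uses n - 1
    margin = t_crit * standard_error
    return m - margin, m, m + margin


low, avg, high = t_confidence_interval(ratings, T_CRIT_DF7)
print(f"mean = {avg:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

With only eight respondents, the margin of error is about 14 points, so the interval runs from roughly 66 to 94.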

Additional Variables that Didn’t Predict Satisfaction

We examined several other variables based on research in the job satisfaction literature. Although some showed small differences, they don’t improve the prediction of satisfaction (other variables already account for whatever they would contribute). These variables are gender, job level, company size, age, experience, U.S. region, and supervising direct reports. A few notes on some of these:

No gender differences in satisfaction. The mean job satisfaction for males was 70.2 (n = 170) and for females was 69.9 (n = 218), a nonsignificant difference of 0.3 (p = .88).
Job level had no significant effect on satisfaction (p = .21). Those with entry-level and mid-level nonsupervisory jobs had similar satisfaction scores (n = 120; Sat = 68). The mean satisfaction ratings for senior-level jobs (supervisory and nonsupervisory) were both 70 (n = 252).
Company size also doesn’t matter much (p = .14). While the 43 respondents at larger companies (5,001–10,000 employees) reported the highest satisfaction level (74), it wasn’t significantly different from the level at smaller companies with 1,000 or fewer employees (n = 124; Sat = 71).
Age didn’t have much of an impact. The youngest respondents (18–25; n = 15; Sat = 73) reported higher satisfaction scores than the oldest respondents (56+; n = 36; Sat = 70), but the difference wasn’t statistically significant (p = .63).
Experience also didn’t matter much. There was no significant difference across the means for the seven experience groups (p = .34). Those with the most experience (21+ years) had about the same job satisfaction (n = 54; Sat = 71) as those with the least experience (0–2 years, n = 43; Sat = 72). Across the groups, the nominally highest job satisfaction (73) was for those with 8–10 years of experience (n = 61), and the nominally lowest (64) was for those with 16–20 years of experience (n = 75).

Summary

UX professionals report generally high job satisfaction (70 out of 100). However, job satisfaction has declined slightly in 2024 compared to the last decade (and since 2022). The drop is partly explained by a small but significant increase in unemployment and by widespread fears of layoffs and AI replacing UX roles. Compared to other industries, UX satisfaction remains relatively strong. Satisfaction varies somewhat by geography and income: Canadian respondents reported the highest levels, while those earning over $100K tended to be more satisfied. Overall, while salary explains little of the variance, job security concerns and industry shifts appear to be the strongest forces shaping UX professionals’ outlook today.

