<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MeasuringU</title>
	<atom:link href="https://measuringu.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://measuringu.com</link>
	<description>UX Research and Software</description>
	<lastBuildDate>Wed, 04 Mar 2026 00:25:57 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.1</generator>

<image>
	<url>https://measuringu.com/wp-content/uploads/2020/11/site-icon.png</url>
	<title>MeasuringU</title>
	<link>https://measuringu.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Why You Should Not Compute Medians for Individual Rating Scales</title>
		<link>https://measuringu.com/means-vs-medians-for-rating-scales/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=means-vs-medians-for-rating-scales</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Wed, 04 Mar 2026 00:25:57 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Rating Scale]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[mean]]></category>
		<category><![CDATA[median]]></category>
		<category><![CDATA[Rating Scales]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46780</guid>

					<description><![CDATA[Say you collect rating scale data from dozens of users across ten apps. To analyze the data, you compute medians because you learned that rating scale data isn’t interval or ratio data. The medians of all ten apps end up the same. They’re all 4! If you rely on the medians, you’d conclude the apps [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1.png"><img fetchpriority="high" decoding="async" class="alignleft wp-image-46790 size-medium" src="https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-300x169.png" alt="Feature image showing two persons inspecting a rating scale and an infographic" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a>Say you collect rating scale data from dozens of users across ten apps. To analyze the data, you compute medians because you learned that rating scale data isn’t interval or ratio data.</p>
<p>The medians of all ten apps end up the same. They’re all 4!</p>
<p>If you rely on the medians, you’d conclude the apps are essentially equivalent.</p>
<p>But if you compute the means, the ratings range from 3.6 to 4.6, providing a much clearer differentiation.</p>
<p>How can the same dataset produce such different stories? What’s the “right” way?<a href="https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon.png"><img decoding="async" class="alignnone wp-image-46781 size-full" src="https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon.png" alt="Cartoon showing researcher objecting to computing means" width="779" height="346" srcset="https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon.png 779w, https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon-300x133.png 300w, https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon-768x341.png 768w, https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon-600x266.png 600w" sizes="(max-width: 779px) 100vw, 779px" /></a></p>
<p>Why are some researchers so adamant about NOT computing the means of rating scales like the Single Ease Question (Figure 1)? In this article, we explain why taking the median of rating scale data is a poor practice.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/03032026-F1.png"><img decoding="async" class="alignnone wp-image-46782" src="https://measuringu.com/wp-content/uploads/2026/02/03032026-F1.png" alt="The Single Ease Question" width="1200" height="222" srcset="https://measuringu.com/wp-content/uploads/2026/02/03032026-F1.png 1430w, https://measuringu.com/wp-content/uploads/2026/02/03032026-F1-300x56.png 300w, https://measuringu.com/wp-content/uploads/2026/02/03032026-F1-1024x190.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/03032026-F1-768x142.png 768w, https://measuringu.com/wp-content/uploads/2026/02/03032026-F1-600x111.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> The Single Ease Question (SEQ<sup>®</sup>).</p>
<h2><span lang="EN-US">Stevens in 1946 Said Ordinal Data Can’t Be Averaged</span></h2>
<p>Ever since <a href="https://en.wikipedia.org/wiki/Stanley_Smith_Stevens">S. S. Stevens</a> declared in 1946 that <a href="https://pages.gseis.ucla.edu/faculty/richardson/Courses/stevens1946.pdf">numbers are not all created equal</a> by categorizing them as <a href="https://measuringu.com/data-types/">ratio, interval, ordinal, and nominal</a>, analysts have debated whether it’s legitimate to compute the mean of multipoint rating scales such as the SEQ. Based on his “<a href="https://en.wikipedia.org/wiki/Level_of_measurement">principle of invariance</a>,” he argued against doing anything more than counting nominal and ordinal data, which restricts addition, subtraction, multiplication, and division to interval and ratio data. These are exactly the operations needed to compute the mean of a set of data: “Thus, the mean is appropriate to an interval scale and also to a ratio scale (but not, of course, to an ordinal or a nominal scale)” (Stevens, 1959, p. 28).</p>
<h3><span lang="EN-US">But Lord in 1953 Says Numbers Don’t Know They Are Ordinal</span></h3>
<p>It didn’t take long for other statisticians and measurement theorists to craft arguments against the proposed policy of restricting analysis of ordinal and nominal data to counts and medians. Probably the most famous counterargument was by Lord (1953). And we’re not referring to the <a href="https://en.wikipedia.org/wiki/Lorde">“Royals” singer</a> nor a deity, but a late psychologist with a divine name and lasting contributions (including the <a href="https://en.wikipedia.org/wiki/Frederic_M._Lord#:~:text=Frederic%20Mather%20Lord%20(November%2012%2C%201912%20%E2%80%93,TOEFL%20are%20all%20based%20on%20Lord's%20research.">SAT and GRE</a> tests).</p>
<p>In his <a href="https://is.muni.cz/el/fss/jaro2010/PSY454/um/Frederick_Lord_On_the_statistical_treatment_of_football_numbers.pdf">parable of a retired professor</a>, Lord described a machine used to randomly assign football numbers to the jerseys of freshman and sophomore football players at his university … a clear use of numbers as labels (<strong>nominal data</strong>). After receiving their numbers, the freshmen complained that the assignment wasn’t random. They claimed to have received generally smaller numbers than the sophomores and that the sophomores must have tampered with the machine.</p>
<p>The professor consulted with a statistician to investigate how likely it was that the freshmen got their low numbers by chance. Over the professor’s objections, the statistician determined the population mean and standard deviation of the football numbers as 54.3 and 16.0, respectively. He found that the mean of the freshmen’s numbers was too low to have happened by chance, strongly indicating that the sophomores had tampered with the football number machine to get larger numbers. The professor objected to the analysis because the numbers weren’t even ordinal, but the statistician replied, “The numbers don’t know that; since the numbers don’t remember where they came from, they always behave just the same way, regardless.”</p>
<h3><span lang="EN-US">Even Nonparametric Tests Quietly Compute Means</span></h3>
<p>For analyzing ordinal data, some researchers have recommended using statistical methods that are similar to the well-known <em>t</em>- and <em>F</em>-tests, but which replace the original data with ranks before analysis. These are the so-called nonparametric methods (e.g., the Mann–Whitney <em>U</em> test, the Friedman test, or the Kruskal–Wallis test). But here’s the dirty secret: These methods actually compute the means of the ranks (or an equivalent quantity), and ranks are ordinal (not interval or ratio) data! Despite violating Stevens’ rules for permissible data manipulations, those methods work perfectly well.</p>
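<p>To make the point about ranks concrete, here is a minimal sketch of how the Mann–Whitney <em>U</em> statistic can be computed from a rank sum. The two samples and the function names are hypothetical, and a real analysis would normally use a statistics package (which also provides the <em>p</em>-value).</p>

```python
def midranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def mann_whitney_u(sample_a, sample_b):
    """U for sample_a: its rank sum minus the smallest rank sum it could have."""
    ranks = midranks(list(sample_a) + list(sample_b))
    n_a = len(sample_a)
    rank_sum_a = sum(ranks[:n_a])
    return rank_sum_a - n_a * (n_a + 1) / 2


# Hypothetical five-point ratings for two designs.
design_a = [5, 4, 4, 3]
design_b = [3, 2, 4, 1]

u_a = mann_whitney_u(design_a, design_b)
u_b = mann_whitney_u(design_b, design_a)
print(u_a, u_b)  # 13.5 2.5
```

<p>The rank sum is just the mean rank multiplied by the sample size, so the test is doing arithmetic on ordinal numbers. The two <em>U</em> values always add up to the product of the two sample sizes (here, 16).</p>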
<h2><span lang="EN-US">Why Medians Are Poor Estimates of Central Tendency for Rating Scales</span></h2>
<p>When is computing a median a good practice, and why doesn’t it work well with rating scales?</p>
<h3><span lang="EN-US">When to Compute a Median</span></h3>
<p>The mean and median are both common ways to measure the central tendency of a set of data. To calculate the mean, add up the data points and divide by the total number in the group (the sample size, <em>n</em>). With the mean, every data point contributes to the estimate. The median is simply the middle value of the sorted data (or, if there is an even number of points, the average of the two middle values).</p>
<p>The mean usually works well as a measure of central tendency, especially when the distribution is roughly symmetrical. When the data aren’t symmetrical, however, the mean can be influenced so strongly by a few extreme data points (e.g., in time data or currency values) that it’s no longer a good estimate of central tendency. When that happens, the median can be a better estimate of central tendency than the mean.</p>
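<p>A small illustration of that effect, using made-up task times (in seconds) in which one participant took far longer than the rest:</p>

```python
from statistics import mean, median

# Hypothetical task times in seconds; the last participant is an extreme value.
task_times = [30, 35, 40, 45, 300]

print(mean(task_times))    # 90
print(median(task_times))  # 40
```

<p>The single extreme time more than doubles the mean, while the median stays at the value of a typical participant.</p>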
<h3><span lang="EN-US">But Rating Scales Are Bounded and Discrete</span></h3>
<p>The examples of data types that can be better summarized with the median than the mean have two things in common:</p>
<ul>
<li>An unlimited range with a small number of extreme scores</li>
<li>Continuous measurement</li>
</ul>
<p>Rating scales, however, have a limited range and fundamentally discrete measurements. Because the ratings are discrete, the median can take only one of a restricted number of values regardless of the sample size. For a five-point scale, the median can take only the following values, no matter how large the sample: 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, and 5.0. (And it can take the intermediate values only when <em>n</em> is even.)</p>
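<p>This restriction is easy to check by brute force. The sketch below (the function name is ours, chosen for illustration) enumerates every possible sample of a given size on a five-point scale and collects the distinct medians and means that can occur.</p>

```python
from itertools import combinations_with_replacement
from statistics import mean, median


def distinct_summaries(n, points=5):
    """Return the sorted distinct medians and means over all samples of size n."""
    medians, means = set(), set()
    for sample in combinations_with_replacement(range(1, points + 1), n):
        medians.add(median(sample))
        means.add(mean(sample))
    return sorted(medians), sorted(means)


medians_4, means_4 = distinct_summaries(4)
medians_5, means_5 = distinct_summaries(5)

print(len(medians_4), len(means_4))  # 9 17
print(len(medians_5), len(means_5))  # 5 21
```

<p>Even at these tiny sample sizes, the mean already has many more possible values than the median, and the gap widens as <em>n</em> grows because the count of possible medians never exceeds nine.</p>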
<p>The mean, on the other hand, can take any value between 1 and 5, and as the sample size increases, the mean becomes more and more continuous. Because the mean can take a larger number of values, it can reflect differences between two samples more reliably than the median can.</p>
<p>When scales are open-ended (have at least one endpoint at infinity, like time data), extreme values can affect the mean but will not affect medians. Rating scales, however, are not open-ended, so <a href="https://www.researchgate.net/publication/220302331_Multipoint_Scales_Mean_and_Median_Differences_and_Observed_Significance_Levels">the median does not have a compelling advantage over the mean</a> when analyzing individual rating scales. Instead, it is at a distinct disadvantage.</p>
<h2><span lang="EN-US">Eleven Mobile Apps That Look the Same Using Medians (A Real Example)</span></h2>
<p>So, we weren’t making up the story about a bunch of apps having the same median (we just changed the number from eleven to ten). The story comes from our data.</p>
<p>In our 2026 UX benchmark of clothing websites, we asked respondents who used the mobile apps of various companies to rate their usefulness with a five-point scale. Table 1 shows the means, medians, and sample sizes for the companies included in the benchmark.</p>

<table id="tablepress-1021" class="tablepress tablepress-id-1021">
<thead>
<tr class="row-1">
	<th class="column-1"><strong>Mobile App</strong></th><th class="column-2"><strong>Mean</strong></th><th class="column-3"><strong>Median</strong></th><th class="column-4"><strong><i>n</i></strong></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1">Anthropologie</td><td class="column-2">3.94</td><td class="column-3">4.00</td><td class="column-4">18</td>
</tr>
<tr class="row-3">
	<td class="column-1">Athleta</td><td class="column-2">4.26</td><td class="column-3">4.00</td><td class="column-4">23</td>
</tr>
<tr class="row-4">
	<td class="column-1">Banana Republic</td><td class="column-2">4.42</td><td class="column-3">4.00</td><td class="column-4">19</td>
</tr>
<tr class="row-5">
	<td class="column-1">Gap</td><td class="column-2">4.44</td><td class="column-3">4.00</td><td class="column-4">18</td>
</tr>
<tr class="row-6">
	<td class="column-1">H&amp;M</td><td class="column-2">4.64</td><td class="column-3">5.00</td><td class="column-4">11</td>
</tr>
<tr class="row-7">
	<td class="column-1">Lululemon</td><td class="column-2">3.57</td><td class="column-3">4.00</td><td class="column-4">23</td>
</tr>
<tr class="row-8">
	<td class="column-1">Neiman Marcus</td><td class="column-2">4.21</td><td class="column-3">4.00</td><td class="column-4">24</td>
</tr>
<tr class="row-9">
	<td class="column-1">Nordstrom</td><td class="column-2">4.12</td><td class="column-3">4.00</td><td class="column-4">17</td>
</tr>
<tr class="row-10">
	<td class="column-1">Old Navy</td><td class="column-2">4.00</td><td class="column-3">4.00</td><td class="column-4">13</td>
</tr>
<tr class="row-11">
	<td class="column-1">Urban Outfitters</td><td class="column-2">3.91</td><td class="column-3">4.00</td><td class="column-4">22</td>
</tr>
<tr class="row-12">
	<td class="column-1">Zara</td><td class="column-2">4.30</td><td class="column-3">4.00</td><td class="column-4">30</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1021 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Means and medians for usefulness ratings of eleven mobile apps for online clothes shopping.</p>
<p>In this example, all the medians were either 4 or 5. The means, on the other hand, ranged from 3.57 to 4.64 with no duplication, providing a much more nuanced picture of the differences in the ratings (Figure 2).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46794" src="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-300x114.png" alt="Graph of means for usefulness ratings of eleven online clothes shopping apps." width="1200" height="456" srcset="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-300x114.png 300w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-1024x389.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-768x292.png 768w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-1536x583.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-2048x778.png 2048w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-600x228.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a><a href="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46795" src="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-300x114.png" alt="Graph of medians for usefulness ratings of eleven online clothes shopping apps. The profile of the means is more informative than the profile of the medians." width="1200" height="456" srcset="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-300x114.png 300w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-1024x389.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-768x292.png 768w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-1536x584.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-2048x778.png 2048w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-600x228.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Comparison of graphs of means and medians for usefulness ratings of eleven online clothes shopping apps. The profile of the means is more informative than the profile of the medians.</p>
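<p>The same comparison can be made directly from the summary values in Table 1. The sketch below simply counts how many distinct values each statistic produces across the eleven apps. The numbers are copied from the table, and the variable names are ours.</p>

```python
# (mean, median) pairs copied from Table 1: usefulness ratings on a five-point scale.
table_1 = {
    "Anthropologie": (3.94, 4.0),
    "Athleta": (4.26, 4.0),
    "Banana Republic": (4.42, 4.0),
    "Gap": (4.44, 4.0),
    "H&M": (4.64, 5.0),
    "Lululemon": (3.57, 4.0),
    "Neiman Marcus": (4.21, 4.0),
    "Nordstrom": (4.12, 4.0),
    "Old Navy": (4.00, 4.0),
    "Urban Outfitters": (3.91, 4.0),
    "Zara": (4.30, 4.0),
}

means = [app_mean for app_mean, _ in table_1.values()]
medians = [app_median for _, app_median in table_1.values()]

distinct_means = len(set(means))
distinct_medians = sorted(set(medians))

print(distinct_means)           # 11
print(distinct_medians)         # [4.0, 5.0]
print(min(means), max(means))   # 3.57 4.64
```

<p>Eleven apps yield eleven different means but only two different medians, and ten of the eleven apps share a median of 4.</p>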
<p>And when sample sizes are very small, there usually won’t be much difference between rating scale means and medians. The most extreme example is when <em>n</em> = 2, in which case the mean and median will be the same, but that doesn’t happen in the real world.</p>
<h2>Use Means, But Don’t Overinterpret Them</h2>
<p>So, which is it—not all numbers are equal (Stevens, 1946), or the numbers don’t remember where they came from (Lord, 1953)? Given our backgrounds in applied statistics (and personal experiences attempting to act in accordance with Stevens’ reasoning that didn’t work out very well—that’s a story for another day), we fall firmly in the camp that supports the use of statistical techniques (such as the <em>t</em>-test, analysis of variance, and factor analysis) on ordinal data such as multipoint rating scales. However, you can’t just ignore the level of measurement of your data.</p>
<p>When you make claims about the meaning of the outcomes of your statistical tests, you must be careful not to act as if rating scale data are interval rather than ordinal. An average rating of 4 might be better than an average rating of 2, and a <em>t</em>-test might indicate that, across a group of participants, the difference is consistent enough to be statistically significant. Even so, you can’t claim that it’s twice as good (a ratio claim), nor can you claim that the difference between 4 and 2 is equal to the difference between 4 and 6 (an interval claim). You can only claim that there is a reliably consistent difference.</p>
<p>Although it might surprise some researchers who treat the implications of the levels of measurement as if they were laws, Stevens (1946, p. 679) took a more moderate stance on this topic than most people realize:</p>
<p style="margin-left: 40px;"><em>On the other hand, for this &#8220;illegal&#8221; statisticizing there can be invoked a kind of pragmatic sanction: In numerous instances it leads to fruitful results. While the outlawing of this procedure would probably serve no good purpose, it is proper to point out that means and standard deviations computed on an ordinal scale are in error to the extent that the successive intervals on the scale are unequal in size. When only the rank-order of data is known, we should proceed cautiously with our statistics, and especially with the conclusions we draw from them.</em></p>
<p>Fortunately, even if you make the mistake of thinking one product is twice as good as another when the scale doesn’t justify it, it would be a mistake that often would not affect the practical decision of which product is better.</p>
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>Some analysts strongly advise against computing the means of rating scales, often recommending the computation of medians instead. In this article, we explain why reporting the median of rating scale data doesn’t work as well as reporting the mean.</p>
<p><strong>Medians are better than means when outliers skew continuous, unbounded data. </strong>This pattern is common in measures such as time or money, where a few extreme values can substantially shift the mean.</p>
<p><strong>Rating scales are discrete and bounded, making means more informative than medians. </strong>Even though we spend a lot of money and time collecting data, rating scales aren’t like time and money. For data like this, medians are too coarse to capture the meaningful differences that means are sensitive enough to detect.</p>
<p><strong>Compute means of rating scales, but don’t make interval claims from ordinal data. </strong>Differences between rating scale means indicate consistent ordering, not equal intervals or proportional differences. Even so, they are often very useful in practice.</p>
<p><strong>Bottom line</strong>: When analyzing rating scale data, don’t be afraid to compute and compare means as long as your interpretation of results doesn’t exceed what the data support.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>An Intro to Bayesian Thinking for UX Research: Updating Beliefs with Data</title>
		<link>https://measuringu.com/intro-to-bayesian-thinking-in-ux-research/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=intro-to-bayesian-thinking-in-ux-research</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD&nbsp;•&nbsp;Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 25 Feb 2026 00:50:31 +0000</pubDate>
				<category><![CDATA[User Research]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[Bayesian theorem]]></category>
		<category><![CDATA[completion rates]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46598</guid>

					<description><![CDATA[&#8220;That design will never work.&#8221; You may have had that thought before you even ran your first participant in a usability test. If you’ve seen enough users struggle and conducted enough usability tests, then you probably have some idea about how well or poorly a task attempt may go for prototypes or even commercially available [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46615 size-medium" src="https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-300x169.png" alt="Feature image showing a researcher pointing on a math equation using a pointer stick" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a>&#8220;That design will never work.&#8221;</p>
<p>You may have had that thought before you even ran your first participant in a usability test.</p>
<p>If you’ve seen enough users struggle and conducted enough usability tests, then you probably have some idea about how well or poorly a task attempt may go for prototypes or even commercially available software or products.</p>
<p>It’s rare to have <em>no</em> idea about how well things will go before the testing even starts. In fact, an experienced researcher is expected to know of some problems and anticipate the friction. This is one of the foundations behind inspection methods like heuristic evaluation and the PURE method (which puts some numbers to friction).</p>
<p>Expert reviewers, of course, are not a substitute for observing users. But is there a way to build in our a priori knowledge of what’s likely to go wrong and then inform and update our beliefs once we see data? Can we do that systematically or even mathematically?</p>
<h2><span lang="EN-US">Thomas Bayes and Updating Our Beliefs from Data</span></h2>
<p>It turns out that hundreds of years ago, a famous Presbyterian minister named <a href="https://en.wikipedia.org/wiki/Thomas_Bayes">Thomas Bayes</a> was also interested in updating his beliefs with what he observed.</p>
<p>His name has been associated with a formula for updating our beliefs with data (<a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes&#8217; Theorem</a>). It follows a simple iterative process:</p>
<ol>
<li>Start with a belief or hypothesis.</li>
<li>Collect data.</li>
<li>Update the belief.</li>
<li>Repeat.</li>
</ol>
<p>The formula for this process looks like this:</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1.png"><img loading="lazy" decoding="async" class="aligncenter wp-image-46629" src="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1.png" alt="Formula for updating beliefs with Bayes Theorem" width="674" height="70" srcset="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1.png 1078w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1-300x31.png 300w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1-1024x106.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1-768x80.png 768w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1-600x62.png 600w" sizes="(max-width: 674px) 100vw, 674px" /></a><br />
In other words, start with what you expect, check how well the data matches that expectation, and then adjust your belief accordingly.</p>
<p>Bayes’ formula means that beliefs that better predict the data become more credible; beliefs that predict the data poorly lose credibility.</p>
<p>Our original belief is called the prior hypothesis (before). The belief we have after observing data and calculating an update is our posterior belief (after).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2.png"><img loading="lazy" decoding="async" class="aligncenter wp-image-46630" src="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2-300x39.png" alt="Formula for determining a posterior hypothesis" width="536" height="70" srcset="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2-300x39.png 300w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2-768x100.png 768w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2-600x78.png 600w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2.png 857w" sizes="(max-width: 536px) 100vw, 536px" /></a><br />
If we replace words with symbols, we get the more recognizable Bayesian formula. It needs only two symbols: θ (theta) and <em>D</em> (data).</p>
<p>Our prior belief is represented with the Greek symbol theta (θ) and shown in the formula as the probability of theta. <em>D</em> represents the data we observed/collected and is shown in the formula in the denominator (probability of all data). Both θ and <em>D</em> appear in the numerator as a conditional probability of the data given theta (<em>D</em>|θ).</p>
<p>Our posterior (updated belief) is represented with the probability of theta given the data (θ|<em>D</em>). The resulting formula is:</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-3.png"><img loading="lazy" decoding="async" class="aligncenter wp-image-46631" src="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-3-300x77.png" alt="Posterior with theta" width="273" height="70" srcset="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-3-300x77.png 300w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-3.png 437w" sizes="(max-width: 273px) 100vw, 273px" /></a><br />
Interestingly, Bayes himself never published his famous theorem. It was published after his death by his friend <a href="https://www.york.ac.uk/depts/maths/histstat/price.pdf">Richard Price</a> [PDF], who used it to attempt to prove the existence of God by showing that the order in the universe wasn’t accidental. Because Price likely made a substantial contribution to completing Bayes’ work on the theorem, this may be another example of <a href="https://en.wikipedia.org/wiki/Stigler%27s_law_of_eponymy">Stigler’s Law</a> (scientific discoveries are not named after the discoverer or, in this case, do not include the co-discoverer).</p>
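<p>Because the formulas in this article appear as images, here is the symbolic version written out, using the same symbols defined above: θ for the belief and <em>D</em> for the data.</p>

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
```

<p>Reading left to right: the posterior, P(θ|<em>D</em>), equals the probability of the data given theta, P(<em>D</em>|θ), multiplied by the prior, P(θ), and divided by the probability of all data, P(<em>D</em>).</p>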
<p>Formulas, ministers, and theology are interesting and all, but how does this apply to UX research?</p>
<h2><span lang="EN-US">A Simple UX Research Example with Completion Rates</span></h2>
<p>We can use an example of testing a new checkout experience. We want to gauge the completion rate (a fundamental usability metric). How successfully are people able to get through the new flow?</p>
<p>We’ve never tested this checkout flow before, though. But do we really have <em>no idea</em> about what will happen? Is a 0% completion rate really as likely as a 50%, 90%, or 100% completion rate?</p>
<p>Using a <strong>rough</strong> guide from historical data, we know an “average” completion rate is <a href="https://measuringu.com/task-completion/">around 78%</a>. It doesn’t mean we expect this new checkout completion rate to be <strong>exactly</strong> 78% (there is a lot of variability around this average). But values between 50% and 95% seem more plausible than a 5%, 10%, or even 99% completion rate. The lower end would be cause for concern, and the upper end would be desired for such an important flow.</p>
<h3><span lang="EN-US">What Exactly Is Our Prior?</span></h3>
<p>So, following Bayesian thinking, we establish a prior. Our <strong>prior belief</strong> is not a single number (78%), but a <em>range of plausible completion rates</em>, centered near 78% (the most plausible rate). Rates far lower (e.g., 40%) or far higher (e.g., 99%) are possible but less likely. In Bayesian terms, this represents a prior belief with a probability distribution centered at 78% but wide enough to allow for substantial uncertainty (see the appendix for details).</p>
<h3><span lang="EN-US">Collecting Data</span></h3>
<p>As an example of using data to update our initial thinking, assume we’ve collected data from a hypothetical moderated usability test with twenty participants in which eighteen completed the checkout and two failed. That’s a 90% observed completion rate. What does that do to our prior belief?</p>
<p>Using Bayesian thinking, we’d ask which completion rates best explain 18 successes out of 20.</p>
<ul>
<li>Rates near 90% explain it well.</li>
<li>Rates near 78% still explain it reasonably well.</li>
<li>Rates near 50% explain it poorly.</li>
</ul>
<p>Bayes’ theorem formalizes that comparison. It increases the credibility of rates that better predict the data and decreases the credibility of those that don’t.</p>
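<p>One way to put numbers on “explains well” and “explains poorly” is the binomial probability of seeing exactly 18 successes in 20 attempts under each candidate completion rate. The following is a minimal sketch with a function name of our own choosing. It assumes the attempts are independent and share the same completion rate.</p>

```python
from math import comb


def binomial_likelihood(rate, successes=18, trials=20):
    """Probability of exactly `successes` out of `trials` at a given completion rate."""
    failures = trials - successes
    return comb(trials, successes) * rate**successes * (1 - rate) ** failures


for rate in (0.90, 0.78, 0.50):
    print(f"{rate:.2f}: {binomial_likelihood(rate):.4f}")
```

<p>The three likelihoods come out at roughly 0.29, 0.11, and 0.0002, which matches the ordering in the list above: a 90% rate predicts the data best, 78% remains plausible, and 50% is all but ruled out.</p>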
<h3><span lang="EN-US">Updating Our Prior</span></h3>
<p>Before seeing the data, our belief was centered near 78%. After observing 18/20 completions, we conclude (see appendix for the mechanics):</p>
<ul>
<li>Our updated best estimate of the true completion rate is about <strong>86%</strong>.</li>
<li>A 95% credible interval runs from roughly <strong>72% to 96%</strong>.</li>
<li>There’s about an <strong>89% probability </strong>that the true completion rate exceeds 78%.</li>
</ul>
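<p>For readers who want to see how numbers like these can arise, here is a rough sketch of one common approach: a beta prior updated with binomial data. The prior used here, Beta(7.8, 2.2), is our assumption for illustration only. It has a mean of 78% and carries the weight of about ten prior observations, but it is not necessarily the prior behind the figures above.</p>

```python
from bisect import bisect_left
from itertools import accumulate

# ASSUMED prior for illustration: Beta(7.8, 2.2), which has a mean of 0.78.
PRIOR_A, PRIOR_B = 7.8, 2.2
SUCCESSES, FAILURES = 18, 2

# Conjugate update: add successes and failures to the prior parameters.
post_a = PRIOR_A + SUCCESSES
post_b = PRIOR_B + FAILURES
posterior_mean = post_a / (post_a + post_b)

# Grid approximation of the posterior, to avoid needing a statistics library.
steps = 20000
grid = [(i + 0.5) / steps for i in range(steps)]
density = [t ** (post_a - 1) * (1 - t) ** (post_b - 1) for t in grid]
total = sum(density)
weights = [d / total for d in density]
cumulative = list(accumulate(weights))

# Equal-tailed 95% credible interval and the probability of exceeding 78%.
lower = grid[bisect_left(cumulative, 0.025)]
upper = grid[bisect_left(cumulative, 0.975)]
prob_above_78 = sum(w for t, w in zip(grid, weights) if t > 0.78)

print(f"posterior mean: {posterior_mean:.2f}")
print(f"95% credible interval: {lower:.2f} to {upper:.2f}")
print(f"P(rate above 78%): {prob_above_78:.2f}")
```

<p>Under this assumed prior, the posterior mean works out to 25.8 / 30 = 0.86, and the interval and probability land close to the values listed above. A different prior centered at 78% would give somewhat different results, which is exactly why the choice of prior deserves scrutiny.</p>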
<p>A few things to notice:</p>
<ul>
<li>The data pulled our estimate up from 78% toward 90%.</li>
<li>It didn’t go all the way to 90%.</li>
<li>The prior kept us from overreacting to just twenty observations.</li>
</ul>
<p>That’s <strong>Bayesian updating</strong>. We started with an informed expectation, saw new evidence, and adjusted accordingly. Figure 1 illustrates this Bayesian thinking.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46632" src="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4-300x163.png" alt="The posterior distribution (after observing 18/20 completions) shifts upward from the prior centered at 78%, reflecting the influence of new data while retaining uncertainty." width="1200" height="651" srcset="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4-300x163.png 300w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4-1024x556.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4-768x417.png 768w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4-600x326.png 600w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4.png 1284w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> The posterior distribution (after observing 18/20 completions) shifts upward from the prior centered at 78%, reflecting the influence of new data while retaining uncertainty.</p>
<p>So, can we just plug our numbers into the simple formula we showed above? Unfortunately, as working through this Bayesian example shows, it’s not that simple. We describe the approach we used to get those numbers in the appendix.</p>
<p>We’ll cover how to conduct these analyses in upcoming articles, but this provides some idea about using Bayesian thinking in practice without getting swallowed up in the conditional probabilities.</p>
<h2><span lang="EN-US">Updating Our Beliefs with More Questions</span></h2>
<p>Who can argue with updating your beliefs with new data? We like this idea of applying iterative Bayesian thinking and incorporating historical data. Who wants to be stuck in their ways? But while using Bayesian thinking seems both appealing and like sound science, it generates a few questions:</p>
<ul>
<li>How is this different from using the statistics taught in an intro statistics class and <a href="https://measuringu.com/product/practical-statistics-for-ux-and-customer-research-course/">our courses</a>?</li>
<li>What’s the difference between a credible interval and a confidence interval?</li>
<li>Do Bayesian statistics require smaller sample sizes?</li>
<li>What if you don’t have any prior information?</li>
<li>How reliable are priors if they are just our intuition or “conventional wisdom?”</li>
<li>Can a prior steer us in the wrong direction?</li>
<li>How can this concept be extended to assessing the likelihoods of different hypotheses?</li>
</ul>
<p>We’ll dig into these questions in upcoming articles.</p>
<h2><span lang="EN-US">Appendix: How the Posterior Was Computed</span></h2>
<p>Here’s a quick summary of how we computed the values. We used some common modern Bayesian methods that are computationally intense (we’ll cover that in a future article).</p>
<p>We modeled the true completion rate using a Beta distribution and the observed data using a binomial model. We set a weak prior centered at the historical average of 78%, equivalent to about 10 prior observations. This corresponds to a Beta(7.8, 2.2) prior distribution.</p>
<p>For a Beta prior and binomial data, the Bayesian update is simple: the posterior is Beta(α + successes, β + failures). With 18 completions and 2 failures out of 20 participants, substituting the values gives a posterior of Beta(7.8 + 18, 2.2 + 2) = Beta(25.8, 4.2).</p>
<p>From this updated distribution:</p>
<ul>
<li>The posterior mean is 25.8/30 ≈ 86%.</li>
<li>The 95% credible interval is approximately 72% to 96% (2.5<sup>th</sup> and 97.5<sup>th</sup> percentiles of the Beta(25.8, 4.2) distribution).</li>
<li>The probability that the true completion rate exceeds 78% is about 89% (using the upper tail of the Beta(25.8, 4.2) distribution).</li>
</ul>
<p>This update reflects a compromise between prior expectations and observed data, with the new evidence pulling the estimate upward while retaining uncertainty.</p>
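<p>For readers who want to check the arithmetic, here is a minimal sketch of the update using only the Python standard library. It assumes the Beta(7.8, 2.2) prior and the 18/20 result described above. The helper names (<code>beta_pdf</code>, <code>beta_cdf</code>, <code>beta_quantile</code>) are ours, and the simple numerical integration stands in for the Beta functions a statistics package would provide, so treat it as an approximation rather than the exact method we used.</p>

```python
import math

# Prior from the appendix: Beta(7.8, 2.2), i.e., 78% weighted as about 10 observations.
PRIOR_A, PRIOR_B = 7.8, 2.2
SUCCESSES, FAILURES = 18, 2

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for x strictly between 0 and 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_cdf(x, a, b, steps=2000):
    """Approximate the probability that the rate is at most x (midpoint rule)."""
    width = x / steps
    return width * sum(beta_pdf((i + 0.5) * width, a, b) for i in range(steps))

def beta_quantile(q, a, b):
    """Find the rate whose cumulative probability is q, by bisection."""
    low, high = 0.0, 1.0
    for _ in range(40):
        mid = (low + high) / 2
        if beta_cdf(mid, a, b) > q:
            high = mid
        else:
            low = mid
    return (low + high) / 2

# Conjugate update: add successes to alpha and failures to beta.
post_a = PRIOR_A + SUCCESSES  # 25.8
post_b = PRIOR_B + FAILURES   # 4.2

posterior_mean = post_a / (post_a + post_b)
interval = (beta_quantile(0.025, post_a, post_b), beta_quantile(0.975, post_a, post_b))
prob_above_78 = 1 - beta_cdf(0.78, post_a, post_b)

print(f"posterior mean: {posterior_mean:.2f}")
print(f"95% credible interval: {interval[0]:.2f} to {interval[1]:.2f}")
print(f"P(rate > 0.78): {prob_above_78:.2f}")
```

<p>The printed values should land close to the figures reported in the list above. Small differences in the last digit can come from the numerical approximation.</p>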
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>An Introduction to Effect Sizes</title>
		<link>https://measuringu.com/an-introduction-to-effect-sizes/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=an-introduction-to-effect-sizes</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 17 Feb 2026 23:04:00 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Effect Size]]></category>
		<category><![CDATA[effect sizes]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46564</guid>

					<description><![CDATA[The completion rate jumped from 20% to 80%. That’s a large effect size. If it had gone from 20% to 21%? Much smaller effect. It’s easy to get caught up in the mechanics of significance testing and p-values. But even before those tools existed, researchers were measuring effect sizes. Effect sizes remain fundamental to understanding [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46587 size-medium" src="https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-300x169.png" alt="Feature image showing small, medium and large effect sizes." width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a>The completion rate jumped from 20% to 80%. That’s a <strong>large effect size</strong>. If it had gone from 20% to 21%? Much smaller effect.</p>
<p>It’s easy to get caught up in the mechanics of significance testing and <em>p</em>-values. But even before those tools existed, researchers were measuring effect sizes. Effect sizes remain fundamental to understanding whether a result actually matters.</p>
<p>An important outcome of recent debates about significance testing has been increased consensus on reporting effect sizes alongside <em>p</em>-values.</p>
<p>It’s been a bit trendy lately to trash null hypothesis significance testing (NHST) because of how it’s misused. Many critics argue we should abandon it altogether.</p>
<p>But if you know us, you know we think that just because something is misused (like the NPS) doesn’t mean we should throw it out. We’re pragmatic. We actually agree with the critics of significance testing that we shouldn’t rely on just <em>p</em>-values. Effect sizes and confidence intervals should be used more to understand the practical significance of a result.</p>
<p>We’ve written about this earlier. Figure 1 shows a <a href="https://measuringu.com/from-statistical-to-practical-significance/">framework we originally published in 2021</a> that extends the all-or-none decision of statistical significance to considerations of practical significance based on <a href="https://measuringu.com/ci-10things/">confidence intervals</a> (a type of effect size).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46579" src="https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1-300x110.png" alt="Decision tree for assessing statistical and practical significance." width="1200" height="440" srcset="https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1-300x110.png 300w, https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1-1024x375.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1-768x281.png 768w, https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1-600x220.png 600w, https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1.png 1378w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Decision tree for assessing statistical and practical significance.</p>
<p>In this article, we provide a short introduction to effect sizes, extending our thoughts from our <a href="https://measuringu.com/effect-sizes/">earlier article</a>.</p>
<h2>A Short History of Effect Sizes</h2>
<p>Before there were tests of significance, there were effect sizes. Any time two values are compared, you have an estimate of an effect size.</p>
<p>A key difference: the magnitude of a <em>p</em>-value is affected by sample size, but estimates of effect sizes are not. Significance testing separates effect sizes that could plausibly be zero from those that cannot (typically using an alpha criterion), but the effect size itself is independent of sample size.</p>
<p>Early concepts related to effect sizes can be found in the <a href="https://pages.uoregon.edu/stevensj/workshops/huberty2002.pdf">writings of Francis Galton and Karl Pearson</a> on correlation and regression in the late 19<sup>th</sup> and early 20<sup>th</sup> centuries. In 1960, Ronald Fisher added a general statement about the importance of effect sizes to the 7<sup>th</sup> edition of <a href="https://home.iitk.ac.in/~shalab/anova/DOE-RAF.pdf"><em>The Design of Experiments</em></a>, saying researchers should never “lose sight of the exact strength which the evidence has in fact reached” (p. 25).</p>
<p>Interest in effect sizes grew in the second half of the 20th century with Jacob Cohen’s use of the <a href="https://replicationindex.com/wp-content/uploads/2025/09/Cohen.1962.The_statistical_power_of_abnor.pdf">smallest important effect size to detect</a> (i.e., the critical difference) in sample size estimation and the development of <a href="https://en.wikipedia.org/wiki/Meta-analysis">meta-analysis</a> by <a href="https://en.wikipedia.org/wiki/Gene_V._Glass">Gene Glass</a> and <a href="https://en.wikipedia.org/wiki/Larry_V._Hedges">Larry Hedges</a>.</p>
<p>In 1994, the American Psychological Association (APA) first explicitly recommended reporting effect sizes in the 4<sup>th</sup> edition of its publication manual. In the 5<sup>th</sup> through 7<sup>th</sup> editions (2019), they strongly recommend reporting <a href="https://www.psychologicalscience.org/observer/understanding-confidence-intervals-cis-and-effect-size-estimation">confidence intervals around effect size estimates</a> in addition to standard tests of significance.</p>
<h2>Types of Effect Sizes</h2>
<p>There are many different effect sizes, with estimates of the number <a href="https://en.wikipedia.org/wiki/Effect_size">varying from 50 to 100</a>. At a high level, they measure either <strong>differences</strong> (between means or proportions) or <strong>relationships</strong> (correlations, regression) and can also be classified as unstandardized or standardized. More formally, effect sizes reflect the magnitude of a phenomenon and can be described in terms of what is measured (differences or relationships), how it is computed, and the resulting value.</p>
<p>Because any effect size estimate will be wrong to some degree, the current best practice is to report confidence intervals alongside effect sizes. Confidence intervals show the plausible range of values for an effect size, helping distinguish between statistical significance and practical importance.</p>
<h3>Unstandardized Effect Sizes</h3>
<p>Unstandardized effect sizes preserve the original measurement units—inches, seconds, or points on a rating scale. Because they’re directly interpretable, they’re usually easier to understand and apply. Common examples in UX research: mean differences and regression coefficients (<em>B</em> weights).</p>
<h3>Standardized Effect Sizes</h3>
<p>Standardized effect sizes are, for the most part, unstandardized effect sizes divided by a standard deviation. This converts original units into unit-free measures of magnitude, making them easier to compare across studies or combine in meta-analysis. The best-known standardized effect size for the difference between two independent means is Cohen’s <em>d</em> (the mean difference divided by the pooled standard deviation). In UX research, the correlation coefficient is probably the most common standardized effect, possibly because it is more easily interpreted than its unstandardized counterpart, the covariance.</p>
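<p>To make the definition of Cohen’s <em>d</em> concrete, here is a small sketch that divides a mean difference by the pooled standard deviation of two independent groups. The function name <code>cohens_d</code> and the two sets of ratings are hypothetical, made up purely for illustration.</p>

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Mean difference divided by the pooled standard deviation (independent groups)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)

# Hypothetical ratings from two independent groups (not real data).
design_a = [4, 5, 6, 7, 8]
design_b = [2, 3, 4, 5, 6]

unstandardized = mean(design_a) - mean(design_b)  # in original rating points
standardized = cohens_d(design_a, design_b)       # in standard deviation units

print(f"mean difference: {unstandardized:.1f} points")
print(f"Cohen's d: {standardized:.2f}")
```

<p>With these made-up numbers, the unstandardized difference is 2 rating points and the standardized difference is about 1.26 standard deviations. The first is easier to interpret on its own; the second is easier to compare across studies that used different scales.</p>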
<h2>Interpreting Standardized Effect Sizes</h2>
<p>The best-known guidelines for interpreting standardized effect sizes as small, medium, or large were developed by <a href="https://en.wikipedia.org/wiki/Jacob_Cohen_(statistician)">Jacob Cohen</a>. He emphasized the importance of basing effect size comparisons whenever possible on the results of previous studies in the relevant research context, but he provided general conventions to use when relevant research was insufficient (Table 1).</p>

<table id="tablepress-1020" class="tablepress tablepress-id-1020">
<thead>
<tr class="row-1">
	<th class="column-1"><strong>Interpretation</strong></th><th class="column-2"><strong>Mean Difference</strong></th><th class="column-3"><strong>Correlation</strong></th><th class="column-4"><strong>Cohen’s Basis</strong></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1">Small</td><td class="column-2">0.2</td><td class="column-3">0.1</td><td class="column-4">Noticeably smaller than medium but not trivial</td>
</tr>
<tr class="row-3">
	<td class="column-1">Medium</td><td class="column-2">0.5</td><td class="column-3">0.3</td><td class="column-4">Visible to naked eye in real world (e.g., height)</td>
</tr>
<tr class="row-4">
	<td class="column-1">Large</td><td class="column-2">0.8</td><td class="column-3">0.5</td><td class="column-4">Same distance above medium as small is below</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1020 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Cohen’s conventions for interpreting standardized mean differences (standard deviation units) and correlations.</p>
<h3>Are Cohen’s Guidelines Applicable to UX Research?</h3>
<p>Research on the <a href="https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.00813/full">meaningfulness of effect sizes in psychological research</a> has found larger reported effect sizes for conventionally published research (potentially affected by publication bias) compared to pre-registered research, and differences within subdisciplines of psychology. Another line of research, <a href="https://journals.sagepub.com/doi/pdf/10.1177/2515245919847202">evaluating effect sizes in psychological research</a> and focused on correlation, recommended interpreting reliably estimated correlations of .05 as very small, .10 as small, .20 as medium, .30 as large, and .40 as very large.</p>
<p>Although quantitative methods in UX are largely borrowed from psychology, UX research differs in goals, constraints, and decision contexts—making direct adoption of interpretation guidelines problematic. We’re planning to analyze our historical research (unaffected by publication bias) to develop guidelines specific to UX research contexts.</p>
<h2>Summary and Discussion</h2>
<p>In this article, we provided a brief history of effect sizes, two basic types (differences, relationships), and guidelines for interpretation.</p>
<p><strong>Effect sizes predate significance testing. </strong>Early concepts appeared in the late 19<sup>th</sup> and early 20<sup>th</sup> centuries and were further developed for sample size estimation, power analysis, and meta-analysis in the second half of the 20<sup>th</sup> century. Major organizations now strongly recommend reporting them.</p>
<p><strong>Effect sizes can be unstandardized or standardized. </strong>Unstandardized effect sizes preserve original measurement units; standardized effect sizes (unit-free measures based on proportions of standard deviations) enable cross-study comparison. Other types of research, including sample size estimation and meta-analysis, require standardized effect sizes.</p>
<p><strong>Effect sizes measure differences or relationships. </strong>Standardized effect sizes for mean differences include Cohen’s <em>d</em> and Hedges’ <em>g</em>. Standardized effect sizes for relationships include correlations (<em>r</em>) and coefficients of determination (<em>R</em>²).</p>
<p><strong>Report confidence intervals with effect sizes. </strong>Any point estimate will be wrong to some degree. Confidence intervals show the plausible range around a point estimate.</p>
<p><strong>Interpretation guidelines are context-dependent. </strong>Cohen’s conventions provide a starting point, but research has found larger effects in conventionally published research than in pre-registered research (potentially due to publication bias) and variation across psychological subdisciplines. We plan to investigate effect sizes in our historical data to develop better guidelines for UX research.</p>
<p>We will discuss specific effect size formulas and calculations in future articles.</p>
<h2>Additional Reading</h2>
<p>Cohen, J. (1962). <a href="https://replicationindex.com/wp-content/uploads/2025/09/Cohen.1962.The_statistical_power_of_abnor.pdf">The statistical power of abnormal-social psychological research: A review</a>. <em>Journal of Abnormal and Social Psychology</em>, <em>63</em>(3), 145–153.</p>
<p>Cohen, J. (1990). <a href="https://www.stat.cmu.edu/~brian/jdelaney/cohen-learned-so-far-amer-psychologist-1990.pdf">Things I have learned (so far)</a>. <em>American Psychologist</em>, <em>45</em>(12), 1304–1312.</p>
<p>Ferguson, C. J. (2009). <a href="https://www.researchgate.net/profile/Fei-Xin/post/How_to_determine_the_magnitude_of_ORs_in_logistic_regression/attachment/603e2571ce717d0001ee1746/AS%3A996877617078276%401614685553537/download/Ferguson_EffectSizes_2009.pdf">An effect size primer: A guide for clinicians and researchers</a>. <em>Professional Psychology: Research and Practice</em>, <em>40</em>(5), 532–538.</p>
<p>Fisher, R. A. (1971). <a href="https://home.iitk.ac.in/~shalab/anova/DOE-RAF.pdf">The design of experiments</a> (9<sup>th</sup> ed.). Hafner.</p>
<p>Fritz, C. O., Morris, P. E., &amp; Richler, J. J. (2012). <a href="https://www.researchgate.net/publication/51554230_Effect_Size_Estimates_Current_Use_Calculations_and_Interpretation">Effect size estimates: Current use, calculations, and interpretation</a>. <em>Journal of Experimental Psychology: General</em>, <em>141</em>(1), 2–18.</p>
<p>Funder, D. C., &amp; Ozer, D. J. (2019). <a href="https://journals.sagepub.com/doi/pdf/10.1177/2515245919847202">Evaluating effect size in psychological research: Sense and nonsense</a>. <em>Advances in Methods and Practices in Psychological Science</em>, <em>2</em>(2), 156–168.</p>
<p>Galton, F. (1889). <a href="https://galton.org/books/natural-inheritance/pdf/galton-nat-inh-1up-clean.pdf">Natural inheritance</a>. Macmillan.</p>
<p>Huberty, C. J. (2002). <a href="https://pages.uoregon.edu/stevensj/workshops/huberty2002.pdf">A history of effect size indices</a>. <em>Educational and Psychological Measurement</em>, <em>62</em>, 227–240.</p>
<p>Kelley, K., &amp; Preacher, K. J. (2012). <a href="https://www.academia.edu/47837419/On_effect_size">On effect size</a>. <em>Psychological Methods</em>, <em>17</em>(2), 137–152.</p>
<p>Levin, J. R. (1998). <a href="https://www.researchgate.net/publication/238374053_What_If_There_Were_No_More_Bickering_About_Statistical_Significance_Tests">What if there were no more bickering about statistical significance tests?</a> <em>Research in the Schools</em>, <em>5</em>(2), 43–53.</p>
<p>Lewis, J. R., &amp; Sauro, J. (2021, June 15). <a href="https://measuringu.com/from-statistical-to-practical-significance/">From statistical to practical significance</a>. MeasuringU.</p>
<p>Lewis, J. R., &amp; Sauro, J. (2021, September 28). <a href="https://measuringu.com/setting-alpha/">For statistical significance, must p be &lt; .05?</a> MeasuringU.</p>
<p>Onwuegbuzie, A. J., Levin, J. R., &amp; Leech, N. L. (2003). <a href="https://files.eric.ed.gov/fulltext/EJ853084.pdf">Do effect-size measures measure up? A brief assessment</a>. <em>Learning Disabilities: A Contemporary Journal</em>, <em>1</em>(1), 37–40.</p>
<p>Rosnow, R. L., &amp; Rosenthal, R. (1989). <a href="https://wiki.ubc.ca/images/3/3c/Rosnow_%26_Rosenthal._1998._Statistical_Procedures_(aspirin_example).pdf">Statistical procedures and the justification of knowledge in psychological science</a>. <em>American Psychologist</em>, <em>44</em>(10), 1276–1284.</p>
<p>Sauro, J. (2014, March 11). <a href="https://measuringu.com/effect-sizes/">Understanding effect sizes in user research</a>. MeasuringU.</p>
<p>Schäfer, T., &amp; Schwarz, M. A. (2019). <a href="https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.00813/full">The meaningfulness of effect sizes in psychological research: Differences between sub-disciplines and the impact of potential biases</a>. <em>Frontiers in Psychology: Quantitative Psychology and Measurement</em>, <em>10</em>, Article ID: 813.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Sample Sizes for Comparing UX-Lite Scores</title>
		<link>https://measuringu.com/sample-sizes-for-comparison-of-ux-lite-scores/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=sample-sizes-for-comparison-of-ux-lite-scores</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Wed, 11 Feb 2026 17:20:59 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Sample Size]]></category>
		<category><![CDATA[Sample Sizes]]></category>
		<category><![CDATA[UX-Lite]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46525</guid>

					<description><![CDATA[The UX-Lite® is a relatively new metric, but it is versatile, short, and increasingly popular for UX research. It measures perceived usability and usefulness with just two items. But if you’re using the UX-Lite to compare products or to see whether you’ve improved over time, what sample size do you need? Yes, the sample size [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46555 size-medium" src="https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-300x169.png" alt="Feature image showing drivers of sample size estimation table and a group of wooden pawns" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a>The UX-Lite<sup>®</sup> is a relatively new metric, but it is versatile, short, and increasingly popular for UX research. It measures perceived usability and usefulness with just two items.</p>
<p>But if you’re using the UX-Lite to compare products or to see whether you’ve improved over time, what sample size do you need?</p>
<p>Yes, the sample size question we can’t (and shouldn’t) avoid. Fortunately, sample sizes for making comparisons are straightforward and uncontroversial.</p>
<p>In previous articles, we’ve developed sample size tables for studies focused on <a href="https://measuringu.com/ux-lite-sample-sizes-for-confidence-intervals">estimating UX-Lite scores with confidence intervals</a> or <a href="https://measuringu.com/ux-lite-sample-sizes-for-comparison-to-a-benchmark">comparing them to benchmark values</a>.</p>
<p>The <a href="https://measuringu.com/evolution-of-the-ux-lite/">UX-Lite</a>, like the <a href="https://measuringu.com/10-things-sus/">SUS</a>, uses transformed scores that range from 0 to 100, based on responses to its two five-point scales (usability and usefulness). We refer to it as a mini version of the 16-item Technology Acceptance Model (<a href="https://measuringu.com/tam/">TAM</a>). The UX-Lite <a href="https://measuringu.com/accuracy-of-sus-estimation-with-ux-lite/">predicts the SUS</a> with over 95% accuracy, and like the TAM, it <a href="https://measuringu.com/article/effect-of-perceived-ease-of-use-and-usefulness-on-ux-and-behavioral-outcomes/">predicts future product usage</a>.</p>
<p>In this article, we cover how to determine appropriate sample sizes for comparing two mean UX-Lite scores.</p>
<h2><span lang="EN-US">What Drives Sample Size Requirements for Comparison Tests?</span></h2>
<p>You need to know six things to compute the sample size when comparing two means. The first three are the same elements required to compute the sample size for a confidence interval:</p>
<ol>
<li>An estimate of the UX-Lite <a href="https://measuringu.com/reliability-and-variability-of-standardized-ux-scales/">standard deviation</a> (median of 19.3 with an interquartile range from 16.6 [25th percentile] to 21.3 [75th percentile]): <em>s</em></li>
<li>The required level of precision: <em>d</em></li>
<li>The level of confidence (typically 90% or 95%): <em>t<sub>ɑ</sub></em></li>
</ol>
<p>For a more detailed discussion of these three elements, see <a href="https://measuringu.com/ux-lite-sample-sizes-for-confidence-intervals">our previous confidence interval article</a>.</p>
<p>Sample size estimation for benchmark and comparison studies requires two additional considerations:</p>
<ol start="4">
<li>The power of the test (typically 80%): <em>t<sub>β</sub></em></li>
<li>The distribution of the rejection region (one-tailed for benchmark tests, usually two-tailed for comparisons of two means)</li>
</ol>
<p>As a quick recap, the power of a test refers to its capability to detect a specified minimum difference between means (i.e., to control the <a href="https://measuringu.com/hypothesis-testing-what-can-go-wrong/">likelihood of a Type II error</a>). The number of tails refers to the distribution of the rejection region for the statistical test. In most cases, comparisons of two means should use a two-tailed test. For more details on these topics, see the previous article on <a href="https://measuringu.com/ux-lite-sample-sizes-for-comparison-to-a-benchmark">UX-Lite benchmark testing</a>.</p>
<p>The comparison of two means has one more consideration:</p>
<ol start="6">
<li>Within- or between-subjects experimental design</li>
</ol>
<p>In a within-subjects study, you compare the means of scores that are paired because they came from the same person (assuming proper counterbalancing of the order of presentation). In a between-subjects study, you compare the means of scores that came from different (independent) groups of participants. Each experimental design has its <a href="https://measuringu.com/between-within/">strengths and weaknesses</a>, and each has its own formula for sample size estimation.</p>
<p>Figure 1 illustrates how the number of sample size drivers increases and changes from confidence intervals (the simplest with three drivers) to benchmark testing (five drivers) to tests of two means (six drivers).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2022/04/041321-F2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-32362" src="https://measuringu.com/wp-content/uploads/2022/04/041321-F2.png" alt="Drivers of sample size estimation for comparing scores." width="1200" height="511" srcset="https://measuringu.com/wp-content/uploads/2022/04/041321-F2.png 4400w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-300x128.png 300w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-1024x436.png 1024w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-768x327.png 768w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-1536x654.png 1536w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-2048x872.png 2048w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-600x255.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Drivers of sample size estimation for comparing scores.</p>
<h2><span lang="EN-US">UX-Lite Sample Sizes for Within-Subjects Comparison of Two Means</span></h2>
<p>The sample size formula for a within-subjects study is the same as the one used for benchmark tests:</p>
<p><a href="https://measuringu.com/wp-content/uploads/2022/04/Sample-Size-Fomula.png"><img loading="lazy" decoding="async" class="alignnone wp-image-32296" src="https://measuringu.com/wp-content/uploads/2022/04/Sample-Size-Fomula.png" alt="Sample size formula for a within-subjects study " width="243" height="150" srcset="https://measuringu.com/wp-content/uploads/2022/04/Sample-Size-Fomula.png 330w, https://measuringu.com/wp-content/uploads/2022/04/Sample-Size-Fomula-300x185.png 300w" sizes="(max-width: 243px) 100vw, 243px" /></a></p>
<p>where <em>s</em> is the standard deviation (<em>s</em><sup>2</sup> is the variance), <em>t</em> is the <em>t</em>-value for the desired level of confidence AND power, and <em>d</em> is the target for the critical difference (the smallest difference in means that you need to be able to detect).</p>
<p>As in benchmark testing, <em>t</em> in the formula is the sum of two <em>t</em>-values, one for <em>ɑ</em> (related to confidence, two-sided for comparison of means) and one for <em>β</em> (related to power, always one-sided). For a 90% confidence level and 80% power, this works out to be about 1.645 + 0.842 ≈ 2.5.</p>
<p>One way to think of including power in sample size estimation is as an insurance policy: you purchase the policy by increasing your sample size, improving your likelihood of finding statistically significant results if the standard deviation is a little higher than expected or the observed value of <em>d</em> is a bit lower.</p>
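<p>As a rough sketch of how these pieces combine, the code below assumes the formula in the image is <em>n</em> = <em>t</em><sup>2</sup><em>s</em><sup>2</sup>/<em>d</em><sup>2</sup> and substitutes large-sample <em>z</em>-values for the <em>t</em>-values, so it gives a first approximation rather than a final answer. The function name <code>within_subjects_n</code> is hypothetical.</p>

```python
from math import ceil
from statistics import NormalDist

def within_subjects_n(s, d, confidence=0.90, power=0.80):
    """First-pass sample size for a within-subjects comparison of two means.

    Uses z-values in place of t-values (a large-sample simplification).
    """
    standard_normal = NormalDist()
    # Two-sided value for confidence, e.g. about 1.645 at 90%.
    z_alpha = standard_normal.inv_cdf(1 - (1 - confidence) / 2)
    # One-sided value for power, e.g. about 0.842 at 80%.
    z_beta = standard_normal.inv_cdf(power)
    return ceil((s ** 2) * (z_alpha + z_beta) ** 2 / d ** 2)

# Median UX-Lite standard deviation (19.3) and a few critical differences.
for critical_difference in (10, 5, 2.5):
    n = within_subjects_n(19.3, critical_difference)
    print(f"d = {critical_difference}: n is about {n}")
```

<p>For <em>s</em> = 19.3, <em>d</em> = 5, 90% confidence, and 80% power, this sketch gives 93, slightly below the 94 in Table 1. A gap of that size is what you would expect if the table was built with <em>t</em>-values, which are a bit larger than <em>z</em>-values at finite sample sizes.</p>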
<p>Table 1 shows how variations in these components affect sample size estimates for within-subjects comparisons for the median standard deviation of 19.3 and for the 75<sup>th</sup> percentile of 21.3. In most cases, using the median standard deviation is reasonable, but when a sufficient sample size is more important than controlling the cost of sampling, it’s better to plan with the higher value.</p>

<table id="tablepress-1018" class="tablepress tablepress-id-1018">
<thead>
<tr class="row-1">
	<td class="column-1"></td><th colspan="2" class="column-2"><center><strong><i>s</i> = 19.3</strong></th><th colspan="2" class="column-4"><center><strong><i>s</i> = 21.3</strong></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1"><strong><i>d</i></strong></td><td class="column-2"><center><strong>90%</strong></td><td class="column-3"><center><strong>95%</strong></td><td class="column-4"><center><strong>90%</strong></td><td class="column-5"><center><strong>95%</strong></td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>15</strong></td><td class="column-2"><center>12</td><td class="column-3"><center>15</td><td class="column-4"><center>14</td><td class="column-5"><center>18</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>10</strong></td><td class="column-2"><center>25</td><td class="column-3"><center>32</td><td class="column-4"><center>30</td><td class="column-5"><center>38</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>7.5</strong></td><td class="column-2"><center>43</td><td class="column-3"><center>54</td><td class="column-4"><center>52</td><td class="column-5"><center>66</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>5.0</strong></td><td class="column-2"><center>94</td><td class="column-3"><center>119</td><td class="column-4"><center>114</td><td class="column-5"><center>145</td>
</tr>
<tr class="row-7">
	<td class="column-1"><strong>2.5</strong></td><td class="column-2"><center>370</td><td class="column-3"><center>470</td><td class="column-4"><center>451</td><td class="column-5"><center>572</td>
</tr>
<tr class="row-8">
	<td class="column-1"><strong>2.0</strong></td><td class="column-2"><center>578</td><td class="column-3"><center>733</td><td class="column-4"><center>703</td><td class="column-5"><center>893</td>
</tr>
<tr class="row-9">
	<td class="column-1"><strong>1.0</strong></td><td class="column-2"><center>2305</td><td class="column-3"><center>2926</td><td class="column-4"><center>2807</td><td class="column-5"><center>3563</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1018 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Sample size requirements for UX-Lite comparisons within subjects given various standard deviations (<em>s</em>), confidence levels, and critical differences (<em>d</em>), with power set to 80%.</p>
<p>In this table, the “<a href="https://measuringu.com/might-not-be-a-magic-number-but-there-are-magic-ranges/">magic range</a>” for the critical difference is from 2.5 to 5, where the sample sizes are reasonably attainable (<em>n</em> from 94 to 572). The table also illustrates the tradeoff between the ability of a test to detect significant differences and the sample size needed to achieve that goal.</p>
<p>For example, if you want to be able to detect mean differences of 15 with 90% confidence and 80% power in a within-subjects study, you’d need a sample size of 12. At the other end of the table, for 95% confidence, 80% power, and a critical difference of 1 in a within-subjects study, you’d need a sample size of 3,563.</p>
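<p>As a rough check on Table 1, the sketch below computes a first-pass within-subjects estimate, assuming the sample size formula has the standard form <em>n</em> = <em>t</em><sup>2</sup><em>s</em><sup>2</sup>/<em>d</em><sup>2</sup> and substituting <em>z</em>-scores for <em>t</em>. Both the substitution and the function name are assumptions made to keep the example dependency-free; this is not the procedure used to build the table.</p>

```python
from math import ceil
from statistics import NormalDist


def n_within_z(s, d, confidence=0.90, power=0.80):
    """First-pass n for a within-subjects comparison of two means.

    Uses z-scores in place of t-values (an assumption for this sketch),
    so results can fall slightly below estimates based on t.
    """
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided, for confidence
    z_beta = NormalDist().inv_cdf(power)           # one-sided, for power
    return ceil((z_alpha + z_beta) ** 2 * s ** 2 / d ** 2)


# Critical differences of 15, 5, and 1 at s = 19.3, 90% confidence, 80% power
for d in (15, 5, 1):
    print(d, n_within_z(19.3, d))
```

<p>For <em>s</em> = 19.3 at 90% confidence, this gives 11, 93, and 2,303 for critical differences of 15, 5, and 1, a little under the tabled 12, 94, and 2,305. A gap in that direction is what you would expect if the tabled values were computed with <em>t</em>-values, which are larger than <em>z</em>-scores at any finite sample size.</p>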
<h2><span lang="EN-US">UX-Lite Sample Sizes for Between-Subjects Comparison of Two Means</span></h2>
<p>The only change for a between-subjects comparison is to the sample size formula: it roughly doubles the sample size for each group and, because there must be two groups, the total doubles again. This means that to achieve the same level of sensitivity while keeping everything else equal, the sample size for a between-subjects comparison is about four times the sample size required for a within-subjects comparison. As shown in Table 2, this constrains the “magic range” for a reasonable critical difference to no less than 5, for which the sample sizes for the various combinations of standard deviation and confidence level range from 372 to 572.</p>

<table id="tablepress-1019" class="tablepress tablepress-id-1019">
<thead>
<tr class="row-1">
	<td class="column-1"></td><th colspan="2" class="column-2"><center><strong><i>s</i> = 19.3</strong></th><th colspan="2" class="column-4"><center><strong><i>s</i> = 21.3</strong></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1"><strong><i>d</i></strong></td><td class="column-2"><center><strong>90%</strong></td><td class="column-3"><center><strong>95%</strong></td><td class="column-4"><center><strong>90%</strong></td><td class="column-5"><center><strong>95%</strong></td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>15</strong></td><td class="column-2"><center>44</td><td class="column-3"><center>56</td><td class="column-4"><center>52</td><td class="column-5"><center>66</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>10</strong></td><td class="column-2"><center>94</td><td class="column-3"><center>120</td><td class="column-4"><center>116</td><td class="column-5"><center>146</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>7.5</strong></td><td class="column-2"><center>166</td><td class="column-3"><center>212</td><td class="column-4"><center>202</td><td class="column-5"><center>256</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>5.0</strong></td><td class="column-2"><center>372</td><td class="column-3"><center>470</td><td class="column-4"><center>452</td><td class="column-5"><center>572</td>
</tr>
<tr class="row-7">
	<td class="column-1"><strong>2.5</strong></td><td class="column-2"><center>1476</td><td class="column-3"><center>1874</td><td class="column-4"><center>1798</td><td class="column-5"><center>2282</td>
</tr>
<tr class="row-8">
	<td class="column-1"><strong>2.0</strong></td><td class="column-2"><center>2306</td><td class="column-3"><center>2926</td><td class="column-4"><center>2808</td><td class="column-5"><center>3564</td>
</tr>
<tr class="row-9">
	<td class="column-1"><strong>1.0</strong></td><td class="column-2"><center>9214</td><td class="column-3"><center>11698</td><td class="column-4"><center>11222</td><td class="column-5"><center>14248</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1019 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2:</strong> Sample size requirements for UX-Lite comparisons between subjects given various standard deviations (<em>s</em>), confidence levels, and critical differences (<em>d</em>), with power set to 80% and total sample sizes for two independent groups.</p>
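<p>The “about four times” relationship can be sketched in the same way. The example below assumes the per-group formula is <em>n</em> = 2<em>t</em><sup>2</sup><em>s</em><sup>2</sup>/<em>d</em><sup>2</sup> and again substitutes <em>z</em>-scores for <em>t</em>; the function name is hypothetical, and the output is a first approximation rather than a reproduction of Table 2.</p>

```python
from math import ceil
from statistics import NormalDist


def n_between_total_z(s, d, confidence=0.90, power=0.80):
    """First-pass total n (both groups) for a between-subjects comparison.

    Uses z-scores in place of t-values (an assumption for this sketch).
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # The factor of 2 reflects the variance of a difference between
    # two independent means; rounding up is done per group.
    per_group = ceil(2 * z ** 2 * s ** 2 / d ** 2)
    return 2 * per_group  # two independent groups


print(n_between_total_z(19.3, 10))  # critical difference of 10
print(n_between_total_z(19.3, 5))   # critical difference of 5
```

<p>With <em>s</em> = 19.3 and 90% confidence, the sketch returns totals of 94 and 370 for critical differences of 10 and 5, compared with 94 and 372 in Table 2.</p>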
<h3><span lang="EN-US">Technical Note: What to Do for Different Standard Deviations</span></h3>
<p>If your historical UX-Lite data has a very different standard deviation from 19.3 or 21.3, you can do a quick computation to adjust the values in these tables. The first step is to compute a multiplier by dividing the new target variance (the square of the standard deviation, <em>s</em><sup>2</sup>) by the variance used to create the table. Then multiply the tabled value of <em>n</em> by the multiplier and round it to get the revised estimate. To illustrate this, we’ll start with a standard deviation of 19.3 (our typical standard deviation) and show how this works if the target standard deviation (<em>s</em>) is 21.3 (our conservative estimate in Tables 1 and 2). The target variance (21.3<sup>2</sup>) is 453.69. The initial variance is 372.49 (19.3<sup>2</sup>), making the multiplier 453.69/372.49 = 1.218. To use this multiplier to adjust the sample size for 95% confidence and a critical difference of 10 shown in Table 2 when <em>s</em> = 19.3, multiply 120 by 1.218 to get 146.16, then round it off to 146, which matches the tabled value for <em>s</em> = 21.3. For more information, see our article, <a href="https://measuringu.com/how-do-changes-in-sd-affect-n/">How Do Changes in Standard Deviation Affect Sample Size Estimation</a>.</p>
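<p>The adjustment described in this note is a single line of arithmetic. A minimal sketch follows; the function and parameter names are ours, chosen for illustration.</p>

```python
def adjust_tabled_n(tabled_n, table_sd, target_sd):
    """Rescale a tabled sample size to a different standard deviation.

    The multiplier is the ratio of the target variance to the variance
    used to build the table.
    """
    multiplier = target_sd ** 2 / table_sd ** 2
    return round(tabled_n * multiplier)


# Table 2, 95% confidence, d = 10, tabled for s = 19.3, adjusted to s = 21.3
print(adjust_tabled_n(120, 19.3, 21.3))
```

<p>This prints 146, matching the worked example. The same call with a target standard deviation below 19.3 shrinks the estimate instead, because the multiplier drops below 1.</p>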
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>What sample size do you need when comparing two sets of UX-Lite scores? To answer that question, you need several types of information, some common to all sample size estimation (confidence level to establish control of Type I errors, standard deviation, and margin of error or critical difference), others unique to statistical hypothesis testing (one- vs. two-tailed testing, setting a level of power to control Type II errors), and for comparison of means (whether the experimental design will be within- or between-subjects).</p>
<p>We provided two tables based on typical (<em>s</em> = 19.3) and conservative (<em>s</em> = 21.3) standard deviations for the UX-Lite in retrospective UX studies, with values for between- and within-subjects designs, 90% and 95% confidence, power set to 80%, and critical differences from 1 to 15 points.</p>
<p>For UX researchers working in contexts where the typical standard deviation of the UX-Lite might be different, we also provided a simple way to increase or decrease the tabled sample sizes for larger or smaller standard deviations.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>UX and NPS Benchmarks of Clothing Websites (2026)</title>
		<link>https://measuringu.com/ux-nps-benchmark-report-for-retail-clothing-websites-2026/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ux-nps-benchmark-report-for-retail-clothing-websites-2026</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD •&nbsp;Jim Lewis, PhD&nbsp;•&nbsp;Emily Short]]></dc:creator>
		<pubDate>Tue, 03 Feb 2026 23:05:12 +0000</pubDate>
				<category><![CDATA[Benchmarking]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[clothing]]></category>
		<category><![CDATA[NPS]]></category>
		<category><![CDATA[SUPR-Q]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46436</guid>

					<description><![CDATA[It’s hard to beat the convenience of shopping for clothing online. You don’t have to worry about when the store will close or finding parking, and getting a price comparison with other stores is just a few clicks away. On websites, you can easily search for clothing using keywords, and it’s simple to see the [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46475 size-medium" src="https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-300x169.png" alt="Feature image showing an interior of a clothing store" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a> It’s hard to beat the convenience of shopping for clothing online. You don’t have to worry about when the store will close or finding parking, and getting a price comparison with other stores is just a few clicks away. On websites, you can easily search for clothing using keywords, and it’s simple to see the entire catalog. There’s no reason to hunt through aisles or track down a salesperson.</p>
<p>But shopping for clothing online also has its drawbacks. You can’t <a href="https://www.vogue.com/article/sizing-is-stopping-consumers-from-shopping-heres-what-brands-need-to-know">try on the clothing</a>. Not being able to walk the store means you’re <a href="https://baymard.com/blog/current-state-product-list-and-filtering">reliant on the organization of the website</a>. And just because you see it <a href="https://www.opensend.com/post/out-of-stock-rate-statistics-ecommerce">doesn’t mean it’s in stock</a>. If you receive the wrong size, you may have to deal with restocking fees and the <a href="https://retailwire.com/discussion/are-stricter-return-policies-worth-it/">hassle of return shipping</a>.</p>
<p>Despite these issues, <a href="https://www.ecommercenorthamerica.org/2025/08/04/us-apparel-ecommerce-2025/">estimated online clothing spending in 2025 in the U.S.</a> was about $217B (about a fifth of global online apparel spending). However, online clothes spending in 2025 was only about 38% of all clothing purchases, making the improvement of the UX of clothing websites a <strong>high priority for providers and consumers</strong>.</p>
<p>To understand the quality of these online experiences, we collected UX benchmark metrics on 11 popular clothing websites and mobile apps:</p>
<ul>
<li>Anthropologie</li>
<li>Athleta</li>
<li>Banana Republic</li>
<li>Gap</li>
<li>H&amp;M</li>
<li>Lululemon</li>
<li>Neiman Marcus</li>
<li>Nordstrom</li>
<li>Old Navy</li>
<li>Urban Outfitters</li>
<li>Zara</li>
</ul>
<p>We computed <a href="https://measuringu.com/product/suprq/">SUPR-Q</a><sup>®</sup> and <a href="https://measuringu.com/nps-reliability/">Net Promoter</a> scores, measured users’ attitudes regarding their experiences, conducted <a href="https://measuringu.com/key-drivers/">key driver</a> analyses, and analyzed reported usability problems. (Full details are in the <a href="https://measuringu.com/product/ux-nps-benchmark-report-for-clothing-websites-2026/">downloadable report</a>.)</p>
<h2>Benchmark Study Details</h2>
<p>In November to December 2025, we asked 351 users of clothing websites in the U.S. to recall their most recent experience and perceptions of one of these websites on their desktop and mobile app (if applicable) in the past year.</p>
<p>Respondents completed the eight-item <a href="https://measuringu.com/10-things-suprq/">SUPR-Q</a> (which includes the <a href="https://measuringu.com/nps-three-confidence-intervals/">Net Promoter Score</a>), the two-item <a href="https://measuringu.com/evolution-of-the-ux-lite/">UX-Lite</a><sup>®</sup>, and the <a href="https://measuringu.com/article/streamlining-the-supr-qm-the-supr-qm-v2/">SUPR-Qm</a> standardized questionnaires, and they answered questions about their brand attitudes, usage, and prior experiences.</p>
<h2>Quality of the Website User Experience: SUPR-Q</h2>
<p>The SUPR-Q is a standardized questionnaire widely used for measuring attitudes toward the quality of a website user experience. Its norms are computed from a rolling database of around 200 websites across dozens of industries.</p>
<p>SUPR-Q scores are percentile ranks that tell you how a website’s experience ranks relative to the other websites (50<sup>th</sup> percentile is average). The SUPR-Q provides an overall score as well as detailed scores for subdimensions of Usability, Trust, Appearance, and Loyalty.</p>
<p>The mean SUPR-Q across clothing websites in this study was at the <strong>82<sup>nd</sup> percentile</strong> (substantially above average), ranging from the 50<sup>th</sup> percentile for H&amp;M to the 98<sup>th</sup> percentile for Banana Republic.</p>
<h3>Usability Scores</h3>
<p>Overall, usability scores were also well above average for the clothing websites, averaging at the 81<sup>st</sup> percentile. Old Navy had the highest usability score at the 98<sup>th</sup> percentile. Zara had the lowest usability score, falling at the 31<sup>st</sup> percentile.</p>
<p>Comments related to usability on Zara included:</p>
<p style="padding-left: 25px;"><em>“Unnecessary images between products while browsing.” </em></p>
<p style="padding-left: 25px;"><em>“The return process is complicated.” </em></p>
<h3>Loyalty/Net Promoter Scores</h3>
<p>All the clothing websites except H&amp;M (with −2%) had positive Net Promoter Scores, led by Anthropologie (40%). The average NPS for these websites was 19% (more promoters than detractors).</p>
<p>Unsurprisingly, ratings of the intention to keep using these websites correlated with their NPS (<em>r</em> = .70 at the website level). As shown in Figure 1, respondents were more likely to continue using the Neiman Marcus website than to continue using the Old Navy website.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46467" src="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-300x112.png" alt="Likelihood to continue using the websites (90% confidence intervals)." width="1200" height="449" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-300x112.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-1024x383.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-768x287.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-1536x574.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-2048x765.png 2048w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-600x224.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Likelihood to continue using the websites (90% confidence intervals).</p>
<p>Comments related to NPS and loyalty included:</p>
<p style="padding-left: 25px;"><em>“Quality merchandise, great customer service, high-end brands, good sales.” </em>— Neiman Marcus</p>
<p style="padding-left: 25px;"><em>“It&#8217;s a good site but the quality of the clothes are less than they used to be.” </em>— Old Navy</p>
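<p>For readers less familiar with how a Net Promoter Score such as −2% or 40% is derived, here is a minimal sketch. It assumes the conventional scoring of the 0–10 likelihood-to-recommend item (9–10 are promoters, 0–6 are detractors), and the ratings are made up for illustration, not data from this study.</p>

```python
def net_promoter_score(ltr_ratings):
    """Percentage of promoters minus percentage of detractors.

    Assumes the conventional cut points on a 0-10 scale:
    promoters rate 9 or 10, detractors rate 0 through 6.
    """
    promoters = sum(1 for r in ltr_ratings if r >= 9)
    detractors = sum(1 for r in ltr_ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ltr_ratings)


# Hypothetical ratings: 4 promoters, 3 passives, 3 detractors
hypothetical_ratings = [10, 9, 9, 8, 7, 7, 6, 5, 10, 3]
print(net_promoter_score(hypothetical_ratings))
```

<p>Four promoters and three detractors out of ten respondents yield an NPS of 10%. A score goes negative, as it did for H&amp;M, whenever detractors outnumber promoters.</p>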
<h2>Websites and Mobile App Usage</h2>
<p>As part of this benchmark, we asked respondents how they accessed the clothing providers online. All respondents reported using their desktop/laptop computers (this was a requirement for participation in the survey), with 39% also using mobile apps and 73% using mobile websites. Most respondents reported visiting their clothing websites on a desktop or laptop computer a few times a year. Users most frequently reported never using the clothing mobile apps (however, 24% of Athleta users and 33% of Zara users reported using those apps a few times per month).</p>
<h2>Key Drivers of UX Quality</h2>
<p>To better understand what affects SUPR-Q scores and Likelihood-to-Recommend (LTR) ratings, we asked respondents to rate potentially important attributes of the clothing websites on a five-point scale from 1 (Strongly disagree) to 5 (Strongly agree). We conducted key driver analyses (regression modeling) to quantify the extent to which ratings on these items drive (account for) variation in overall SUPR-Q scores and, separately, LTR (the rating from which the NPS is derived; full details are in the <a href="https://measuringu.com/product/ux-nps-benchmark-report-for-clothing-websites-2026/">downloadable report</a>).</p>
<p>The top key driver of the overall SUPR-Q scores was the ease of browsing items (12%), followed by the ease of finding “exactly what I want” (10%). Taken together, 11 significant variables accounted for 74% of the variance in the SUPR-Q scores.</p>
<p>For likelihood-to-recommend (LTR), the top key drivers were the ease of finding “exactly what I want” (10%), finding brands quickly (8%), and trusting sites’ style recommendations (8%). Overall, seven significant drivers accounted for 44% of the variance in LTR.</p>
<p>Figure 2 shows a scatterplot of the importance and opportunity for improvement for seven key drivers. The combination of importance and opportunity for improvement provides a basis for prioritizing which key drivers to improve. The importance score is the greater of the variance accounted for by the driver in the SUPR-Q and NPS analyses, where larger percentages indicate more importance. The opportunity score is the <a href="https://measuringu.com/top-top-two-bottom-net-box/">top-box percentage</a> for the driver, so smaller percentages indicate greater opportunity for improvement (i.e., it would be harder to improve a driver with a top-box percentage of 100% than one with a top-box percentage of 10%).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46468" src="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-300x112.png" alt="Scatterplot of importance and opportunity for improvement of key drivers." width="1200" height="448" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-300x112.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-1024x382.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-768x287.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-1536x573.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-2048x764.png 2048w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-600x224.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2: </strong>Scatterplot of importance and opportunity for improvement of key drivers.</p>
<p>Two of these seven key drivers fell in the FIX quadrant (upper left) with relatively high importance and higher opportunity for improvement (“Exciting to shop on this site” and “Easy to find exactly what I want”). Anthropologie had the highest top-box score for excitement (48%), and Nordstrom had the highest for shoppers being able to find exactly what they want (61%). The websites with the lowest top-box scores, suggesting the most room for improvement, were Old Navy for excitement (16%) and Urban Outfitters for finding “exactly what I want” (25%).</p>
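<p>The two coordinates plotted for each key driver can be sketched as follows. The function names and the sample ratings are hypothetical; only the definitions (importance as the greater of the two variance percentages, opportunity as the top-box percentage on the five-point scale) come from the description above.</p>

```python
def top_box_pct(ratings, top=5):
    """Percentage of respondents choosing the highest response option."""
    return 100 * sum(1 for r in ratings if r == top) / len(ratings)


def importance(suprq_variance_pct, ltr_variance_pct):
    """Greater of the variance accounted for in the two key driver analyses."""
    return max(suprq_variance_pct, ltr_variance_pct)


# Hypothetical five-point agreement ratings for one driver
hypothetical_ratings = [5, 4, 5, 3, 2, 5, 4, 4]
print(top_box_pct(hypothetical_ratings))  # smaller values mean more room to improve

# "Easy to find exactly what I want" accounted for 10% in both analyses
print(importance(10, 10))
```

<p>A driver with high importance and a low top-box percentage lands in the FIX quadrant, which is why it is a natural candidate for attention first.</p>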
<h2>UX Problems</h2>
<p>We examined the verbatim comments to better understand the user experience problems.</p>
<p>The top frustrations were products out of stock, sizing issues, slow loading, and navigation/browsing issues.</p>
<h3>Products Being Out of Stock Was a Major Annoyance</h3>
<p>This issue affected all the websites, but it was a top complaint for Neiman Marcus, Old Navy (Figure 3), and Urban Outfitters (and was in the top five for the others).</p>
<p style="padding-left: 25px;"><em>“Sometimes things go out of stock quickly.” </em>— Neiman Marcus</p>
<p style="padding-left: 25px;"><em>“There have been many times where specific items aren&#8217;t sold in my size, or they are sold out completely.” </em>— Old Navy</p>
<p style="padding-left: 25px;"><em>“Some problems I&#8217;ve had with the Urban Outfitters website are sometimes things aren&#8217;t in stock and they aren&#8217;t clear on when they will be back in stock.” </em>— Urban Outfitters</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-F3.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46440" src="https://measuringu.com/wp-content/uploads/2026/01/020326-F3.png" alt="Old Navy’s spin on products being out of stock (“We knew it was too good to last”)." width="1200" height="458" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-F3.png 1430w, https://measuringu.com/wp-content/uploads/2026/01/020326-F3-300x115.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-F3-1024x391.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-F3-768x293.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-F3-600x229.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 3: </strong>Old Navy’s spin on products being out of stock (“We knew it was too good to last”).</p>
<h3>Sizing Issues Degrade the User Experience</h3>
<p>Eight of the websites had sizing issues, which were the most frequently mentioned complaint for Banana Republic (Figure 4) and Gap.</p>
<p style="padding-left: 25px;"><em>“Sometimes, there are items that are not in stock or I usually get an incorrect size.” </em>— Banana Republic</p>
<p style="padding-left: 25px;"><em>“Sometimes the item I want is not available in both my size and color. It will either be available in my size or my color but not both.” </em>— Gap</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46469" src="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4-300x124.png" alt="Example of product review about sizing issue on the Banana Republic website." width="1200" height="495" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4-300x124.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4-1024x422.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4-768x317.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4-600x247.png 600w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4.png 1220w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 4: </strong>Example of a product review about a sizing issue on the Banana Republic website.</p>
<h3>Slow Loading Times Slow Down Shopping</h3>
<p>Users reported slow loading times for nine of the websites. It was a top complaint for Athleta and Neiman Marcus and was the second-most reported issue for Gap, Nordstrom, and Urban Outfitters.</p>
<p style="padding-left: 25px;"><em>“Requires strong internet connectivity to load otherwise one would experience slow performance.” </em>— Athleta</p>
<p style="padding-left: 25px;"><em>“The website is slow sometimes.” </em>— Neiman Marcus</p>
<p style="padding-left: 25px;"><em>“Sometimes the website can be slow depending on the device I am using.” </em>— Gap</p>
<p style="padding-left: 25px;"><em>“Sometimes it is slow to load once I get to page 3 or 4 of options.” </em>— Nordstrom</p>
<p style="padding-left: 25px;"><em>“Some pages take longer to load than expected, which can be frustrating.” </em>— Urban Outfitters</p>
<h3>Navigation and Browsing Issues Prevent Smooth Shopping</h3>
<p>Users of ten of the websites reported issues with navigation, browsing, or both. These issues were among the two most frequently reported frustrations for Anthropologie, Athleta, Gap, H&amp;M, Lululemon, Urban Outfitters, and Zara.</p>
<p style="padding-left: 25px;"><em>“It is so hard to browse for something or search for something.” </em>— Zara</p>
<p style="padding-left: 25px;"><em>“It can feel a bit overwhelming to look at at first because it looks like it has a lot going on.” </em>— Urban Outfitters</p>
<p style="padding-left: 25px;"><em>“It can feel a bit cluttered or too overwhelming to find exactly what I want.” </em>— Gap</p>
<p>We were particularly intrigued by a comment by an H&amp;M user who wrote, “It&#8217;s sometimes frustrating that the sidebar appears so frequently while I&#8217;m scrolling.” When we investigated this, it was clear that the underlying design issue was an invisible border around the controls that triggered the sidebar. Other websites had similar designs but either visualized the boundary separating menu options from the browsing area (e.g., Anthropologie and most others) or required users to activate dropdowns by clicking the option rather than hovering nearby (Lululemon). Video 1 shows how the H&amp;M design is particularly tricky relative to the other designs.</p>
<div class="ast-oembed-container " style="height: 100%;"><iframe title="ThreeDesignsFinal" src="https://player.vimeo.com/video/1159742485?dnt=1&amp;app_id=122963" width="1170" height="658" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe></div>
<p class="wp-caption-text" style="text-align: left;"><strong>Video 1: </strong>Sidebar/dropdown triggering from main menu options on H&amp;M, Anthropologie, and Lululemon.</p>
<h3>Online Shoppers Have Trouble Getting Exactly What They Want</h3>
<p>The signals we get from both the quantitative and qualitative analyses for this benchmark study clearly indicate that users of these clothing websites have trouble getting exactly what they want.</p>
<p>This is demonstrated quantitatively by the significance of key driver ratings of the ease of browsing, finding brands quickly, trusting a site’s recommendations, easy returns, and confidence in the accuracy of size guides.</p>
<p>The qualitative findings provide more of the “why” behind the numbers, including:</p>
<ul>
<li>The annoyance of products being completely out of stock or out of desired sizes/colors, combined with the occasional surprise of not finding out until well into the checkout process</li>
<li>Uncertainty about sizing chart accuracy and variability in sizing across manufacturers, which leads consumers to experience <a href="https://www.vogue.com/article/sizing-is-stopping-consumers-from-shopping-heres-what-brands-need-to-know">fit/sizing uncertainty</a> that can cause them to abandon the purchase due to poor <a href="https://link.springer.com/article/10.1007/s11747-024-01034-9">fit-risk perception</a></li>
<li>Persistent complaints about slow loading</li>
<li>Numerous navigation and browsing issues (e.g., odd dropdown/sidebar behaviors, large images/videos that sometimes do not resize, intrusive ads, complex checkout and return processes)</li>
</ul>
<p>Looking across the quantitative and qualitative findings, all 11 websites have opportunities to improve. Some websites that would especially benefit from a stronger focus on online shopping experiences are:</p>
<ul>
<li><strong>Urban Outfitters</strong>: Lowest top-box score for finding “exactly what I want” (25%) and relatively high percentage of user comments about products being out of stock (top complaint), slow loading times, and navigation/browsing issues</li>
<li><strong>Old Navy</strong>: Lowest top-box score for excitement while shopping on the site (16%) and relatively high percentage of user comments about products being out of stock (top complaint)</li>
<li><strong>Zara</strong>: Lowest SUPR-Q Usability score (31<sup>st</sup> percentile) and relatively high percentage of user complaints, including browsing/navigation issues</li>
<li><strong>H&amp;M</strong>: Lowest overall SUPR-Q score (50<sup>th</sup> percentile) and NPS (−2%) with a relatively high percentage of user complaints, including browsing/navigation issues</li>
</ul>
<h2>Comparison with the 2022 Clothing Benchmark</h2>
<p>In 2022, we collected SUPR-Q and NPS data for all the same websites. Banana Republic, Gap, and Zara had the most improvement. Zara increased by more than 50 points in the four years since we measured (though still lagging behind the leaders), and H&amp;M had the biggest drop (32 points). These differences, shown in Figure 5, are statistically significant [<em>F</em>(10, 995) = 1.89, <em>p</em> = .04].</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46470" src="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-300x112.png" alt="SUPR-Q scores from the 2022 and 2025 surveys (statistical analysis conducted on raw scores)." width="1200" height="449" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-300x112.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-1024x383.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-768x287.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-1536x574.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-2048x765.png 2048w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-600x224.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 5: </strong>SUPR-Q scores from the 2022 and 2025 surveys (statistical analysis conducted on raw scores).</p>
<h2>Summary and Takeaways</h2>
<p>Online clothing is big business, with estimated U.S. spending of about $217B in 2025 (about a fifth of global online apparel spending). An analysis of the user experience of 11 major clothing websites using data collected in November–December 2025 found:</p>
<ol>
<li><strong>Banana Republic and Anthropologie lead; H&amp;M lags. </strong>Banana Republic had the highest overall SUPR-Q score, at the 98<sup>th</sup> percentile, while H&amp;M had the lowest score (50<sup>th</sup> percentile). Anthropologie had the highest NPS (40%) and H&amp;M had the lowest (−2%).</li>
</ol>
<ol start="2">
<li><strong>Ease of browsing and ease of finding “exactly what I want” drive UX scores. </strong>The top key driver of the overall SUPR-Q scores was the ease of browsing items (12%), followed by the ease of finding “exactly what I want” (10%). Taken together, 11 significant variables accounted for 74% of the variance in the SUPR-Q scores. For likelihood-to-recommend (LTR), the top key drivers were the ease of finding “exactly what I want” (10%), finding brands quickly (8%), and trusting sites’ style recommendations (8%). Overall, seven significant drivers accounted for 44% of the variance in LTR.</li>
</ol>
<ol start="3">
<li><strong>The top opportunities for improvement are increasing the feeling of excitement when shopping and helping shoppers find what they want. </strong>One way to prioritize attention to key drivers is to consider both their importance (beta weights in regression) and how well the websites achieve the stated goal (top-box scores). The two key drivers with the most potential for improvement (high beta weights and low top-box scores) were the extent to which users feel excitement when shopping on the websites (average top-box score: 31%) and the ease of finding exactly what they want (average top-box score: 38%). For excitement, the leader was Anthropologie (Old Navy lagging). For finding “exactly what I want,” the leader was Nordstrom (Urban Outfitters lagging).</li>
</ol>
<ol start="4">
<li><strong>Top frustrations were products out of stock, sizing issues, and slow loading. </strong>The most reported issue, affecting all websites, was products being out of stock, either entirely or for certain sizes. Some users reported not being notified of this until they were checking out. For nine of the websites, users reported issues with sizing (e.g., poor fit, missing sizes) and slow website loading (e.g., many large images and videos). Navigation/browsing issues were also common, affecting ten of the sites.</li>
</ol>
<ol start="5">
<li><strong>Online clothes shoppers have trouble getting exactly what they want. </strong>Quantitative and qualitative signals from our findings point in the same direction: users of these clothing websites have trouble getting exactly what they want. Quantitative signals include difficult navigation/browsing, hard to find brands quickly, lack of trust in product recommendations, difficult returns, and inaccurate size guides. The qualitative “why” behind the numbers includes annoyance of products being out of stock, fit/sizing uncertainty, slow loading times, and various navigation/browsing issues. Considering both quantitative and qualitative findings, websites that would especially benefit from a focus on the UX of their websites are Urban Outfitters, Old Navy, Zara, and H&amp;M.</li>
</ol>
<p>For more details, see the <a href="https://measuringu.com/product/ux-nps-benchmark-report-for-clothing-websites-2026/">downloadable report</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>UX-Lite Sample Sizes for Comparison to a Benchmark</title>
		<link>https://measuringu.com/ux-lite-sample-sizes-for-comparison-to-a-benchmark/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ux-lite-sample-sizes-for-comparison-to-a-benchmark</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 27 Jan 2026 22:56:42 +0000</pubDate>
				<category><![CDATA[Benchmarking]]></category>
		<category><![CDATA[Sample Size]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[SUS]]></category>
		<category><![CDATA[UX-Lite]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46311</guid>

					<description><![CDATA[The UX-Lite® is a relatively new but increasingly popular metric for UX research. Its two items generate an overall score and subscale scores on ease and usefulness from 0 to 100. The UX-Lite predicts future product usage as well as or better than the original and longer Technology Acceptance Model (TAM). The ease score also [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/01/012726-FeatureImage-3.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46413 size-medium" src="https://measuringu.com/wp-content/uploads/2026/01/012726-FeatureImage-3-300x169.png" alt="Feature image showing a group of wooden pawns and two sticky notes" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/01/012726-FeatureImage-3-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/01/012726-FeatureImage-3-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/012726-FeatureImage-3-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/01/012726-FeatureImage-3-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/012726-FeatureImage-3-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/01/012726-FeatureImage-3.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a>The <a href="https://measuringu.com/evolution-of-the-ux-lite/">UX-Lite</a><sup>®</sup> is a relatively new but increasingly popular metric for UX research.</p>
<p>Its two items generate an overall score and subscale scores on ease and usefulness <a href="https://measuringu.com/converting-scales-to-100-points/">from 0 to 100</a>. The UX-Lite <a href="https://measuringu.com/article/effect-of-perceived-ease-of-use-and-usefulness-on-ux-and-behavioral-outcomes/">predicts future product usage</a> as well as or better than the original and longer <a href="https://measuringu.com/tam/">Technology Acceptance Model</a> (TAM). The ease score also <a href="https://measuringu.com/accuracy-of-sus-estimation-with-ux-lite/">predicts the SUS</a> with over 95% accuracy.</p>
<p>One of the benefits of the UX-Lite is its familiar scoring system from 0 to 100 (like the SUS) and its <a href="https://measuringu.com/grading-scales-for-the-ux-lite/">reasonable set of benchmark scores</a>.</p>
<p>Finding the right sample size estimate isn’t about picking a magic number. While there may be some <a href="https://measuringu.com/might-not-be-a-magic-number-but-there-are-magic-ranges/">magic ranges</a>, the right process involves starting with the type of study design.</p>
<p>Three of the most common UX study designs are those with a focus on estimation with confidence intervals (discussed in a <a href="https://measuringu.com/ux-lite-sample-sizes-for-confidence-intervals">previous article</a>), comparison with a benchmark, and comparison of two means.</p>
<p>Across all three study types, a key ingredient is the historical standard deviation of the UX-Lite. Fortunately, we’ve collected enough data to have a good idea about a <a href="https://measuringu.com/reliability-and-variability-of-standardized-ux-scales/">typical UX-Lite standard deviation</a>.</p>
<p>In this article, we demonstrate how to compute the right sample size for comparing UX-Lite scores to a benchmark by controlling the size of the critical difference (i.e., the desired level of precision, specifically, the smallest difference you need to be able to reliably detect).</p>
<h2>What Drives Sample Size Requirements for Benchmark Tests?</h2>
<p>Not to be intentionally confusing, but we often use “benchmark testing” to mean a few things. First, it’s used loosely to refer to the process of collecting metrics (e.g., benchmarking), which we cover extensively in <em><a href="https://measuringu.com/book/benchmarking-the-user-experience/">Benchmarking the User Experience</a></em>.</p>
<p>It also refers to how metrics collected within a study will be used. They can be used to establish the <a href="https://measuringu.com/business-software-ux2020/">current experience of a product</a> (a new benchmark), in which case you would use confidence intervals around the benchmark to assess its precision.</p>
<p>Metrics can also be used to compare against a prior experience or a competitive experience (comparative or competitive) or to compare against established thresholds (a benchmark). We’re focusing on this final case, where a sample of UX-Lite scores is collected, and then the mean UX-Lite score is compared to a set benchmark value.</p>
<p>As shown in Figure 1, you need to know five things to compute the sample size for a comparison to a benchmark. The first three are the same elements required to compute the sample size for a confidence interval (for a comprehensive discussion of these three elements, see <a href="https://measuringu.com/ux-lite-sample-sizes-for-confidence-intervals">our previous article</a>).</p>
<ol>
<li>An estimate of the UX-Lite <a href="https://measuringu.com/reliability-and-variability-of-standardized-ux-scales/">standard deviation</a> (median of 19.3 with an interquartile range from 16.6 [25th percentile] to 21.3 [75th percentile]): <em>s</em></li>
<li>The level of confidence (typically 90% or 95%): <em>t</em></li>
<li>The desired precision of measurement (critical difference): <em>d</em></li>
</ol>
<p>Sample size estimation for benchmark tests also requires two additional considerations:</p>
<ol start="4">
<li>The power of the test</li>
<li>The level of confidence for a one-sided (one-tailed) test</li>
</ol>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/figure1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46409 size-full" src="https://measuringu.com/wp-content/uploads/2026/01/figure1.png" alt="Drivers of sample size estimation for benchmark comparisons." width="1200" height="306" srcset="https://measuringu.com/wp-content/uploads/2026/01/figure1.png 1200w, https://measuringu.com/wp-content/uploads/2026/01/figure1-300x77.png 300w, https://measuringu.com/wp-content/uploads/2026/01/figure1-1024x261.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/figure1-768x196.png 768w, https://measuringu.com/wp-content/uploads/2026/01/figure1-600x153.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Drivers of sample size estimation for benchmark comparisons.</p>
<h3>Power</h3>
<p>The power of a test is its capability to detect a difference between observed measurements and hypothesized values when a difference really exists. Power is not an issue when you’re just estimating the value of a parameter, but it matters when <a href="https://measuringu.com/hypothesis-testing-what-can-go-wrong/">testing a hypothesis</a>. Just as the confidence level is 1 − α (where α is the acceptable rate of Type I errors, or false positives), power is 1 − β (where β is the acceptable rate of Type II errors, or false negatives).</p>
<h3>One-Tailed Testing</h3>
<p>Most statistical comparisons use a strategy known as two-tailed testing. The term “two-tailed” refers to the tails of the distribution of the differences between the two values. The left distribution in Figure 2 illustrates a two-tailed test showing the rejection criterion (α = .05) evenly split between the two tails.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/figure2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46410 size-full" src="https://measuringu.com/wp-content/uploads/2026/01/figure2.png" alt="Two- and one-sided rejection regions for two- and one-sided significance tests." width="1200" height="297" srcset="https://measuringu.com/wp-content/uploads/2026/01/figure2.png 1200w, https://measuringu.com/wp-content/uploads/2026/01/figure2-300x74.png 300w, https://measuringu.com/wp-content/uploads/2026/01/figure2-1024x253.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/figure2-768x190.png 768w, https://measuringu.com/wp-content/uploads/2026/01/figure2-600x149.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Two- and one-sided rejection regions for two- and one-sided significance tests.</p>
<p>For most comparisons, two-tailed tests are appropriate. When you test an estimated value against a benchmark, however, you care only that your estimate is significantly better than the benchmark. When that’s the case, you can conduct a one-tailed test, illustrated by the right distribution in Figure 2. Instead of splitting the rejection region between two tails, it’s all in one tail. The practical consequence is that the bar for declaring significance is lower for a one-tailed test.</p>
<p>The area in one tail for a two-sided test with α = .10 is the same as a one-sided test with α = .05. <strong>This factor decreases the sample size relative to computing a two-tailed confidence interval.</strong></p>
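<p>As a minimal sketch of this equivalence (not part of the original analysis), the following Python snippet uses the standard library’s NormalDist class to compute large-sample critical values. The variable names are illustrative assumptions.</p>

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Two-sided test, alpha = .10: the rejection region is split, .05 in each tail.
z_two_sided_10 = std_normal.inv_cdf(1 - 0.10 / 2)

# One-sided test, alpha = .05: the whole rejection region sits in one tail.
z_one_sided_05 = std_normal.inv_cdf(1 - 0.05)

# Two-sided test, alpha = .05, shown for comparison (a higher bar to clear).
z_two_sided_05 = std_normal.inv_cdf(1 - 0.05 / 2)

print(round(z_two_sided_10, 3), round(z_one_sided_05, 3), round(z_two_sided_05, 3))
```

<p>The first two values are identical (about 1.645), while the two-sided critical value at α = .05 is larger (about 1.960), which is why the one-tailed test needs a smaller sample for the same α.</p>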
<h3>Putting the Values Together</h3>
<p>With the five ingredients ready, we use the following formula:</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/figure3.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46408" src="https://measuringu.com/wp-content/uploads/2026/01/figure3-300x197.png" alt="Formula for calculating a sample size for comparison to a benchmark. " width="229" height="150" srcset="https://measuringu.com/wp-content/uploads/2026/01/figure3-300x197.png 300w, https://measuringu.com/wp-content/uploads/2026/01/figure3.png 355w" sizes="(max-width: 229px) 100vw, 229px" /></a></p>
<p>where <em>s</em> is the standard deviation (<em>s</em><sup>2</sup> is the variance), <em>t</em> is the <em>t</em>-value for the desired level of confidence AND power, and <em>d</em> is the critical difference (the smallest difference from the benchmark you need to detect, analogous to the margin of error of a confidence interval).</p>
<p>A difference from the confidence interval computation is that <em>t</em> is actually the sum of two <em>t</em>-values, one for α (related to confidence) and one for β (related to power, always one-sided). For a one-sided test with 95% confidence (α = .05) and 80% power, the large-sample values sum to 1.645 + 0.842 = 2.487, or about 2.5.</p>
<p>When you don’t need more power, the default power level is 50%, at which <em>t</em> for power = 0, making it the same result as a confidence interval. Any larger value for power (commonly 80%) makes the value of <em>t</em> larger, which will increase the estimated sample size.</p>
<p>One way to think of including power in sample size estimation is as an insurance policy that you purchase by increasing your sample size to increase your likelihood of finding statistically significant results if the standard deviation is a little higher than expected or the observed value of <em>d</em> is a bit lower.</p>
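<p>The computation described above can be sketched in Python. This is a rough illustration under stated assumptions, not the exact procedure behind the tabled values: it assumes the formula has the form <em>n</em> = <em>s</em><sup>2</sup><em>t</em><sup>2</sup>/<em>d</em><sup>2</sup> implied by the definitions above, and it substitutes large-sample <em>z</em>-values for <em>t</em>-values, so its results run slightly below estimates based on <em>t</em>. The function name and parameters are hypothetical.</p>

```python
from math import ceil
from statistics import NormalDist


def benchmark_sample_size_z(s, d, confidence=0.90, power=0.80):
    """Large-sample (z-based) estimate of n for a one-sided benchmark test.

    s: expected standard deviation; d: critical difference;
    confidence and power are proportions, e.g. 0.90 and 0.80.
    Assumption: z-values stand in for the t-values named in the article.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(confidence)  # one-sided confidence
    z_beta = std_normal.inv_cdf(power)        # one-sided power; 0 when power is 50%
    return ceil((s ** 2) * (z_alpha + z_beta) ** 2 / d ** 2)


# Same inputs with 80% power and with the default-like 50% power.
print(benchmark_sample_size_z(19.3, 5.0, confidence=0.95, power=0.80))
print(benchmark_sample_size_z(19.3, 5.0, confidence=0.95, power=0.50))
```

<p>Raising power from 50% to 80% more than doubles the estimate in this example, which illustrates the cost of the “insurance policy.”</p>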
<h3>Sample Size Table for UX-Lite Benchmark Comparisons</h3>
<p>Table 1 shows how variations in the standard deviation, confidence level, and critical difference affect sample size estimates for benchmark comparisons (with power held at 80%). It covers the median standard deviation of 19.3 and the 75<sup>th</sup> percentile of 21.3. In most cases, it’s reasonable to use the median standard deviation, but when a sufficient sample size is more important than the cost of sampling, it’s better to plan with the higher value.</p>

<table id="tablepress-1017" class="tablepress tablepress-id-1017">
<thead>
<tr class="row-1">
	<td class="column-1"></td><th colspan="2" class="column-2"><center><em>s</em> = 19.3</th><td class="column-4"></td><th colspan="2" class="column-5"><center><em>s</em> = 21.3</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1"><strong><i>d</i></td><td class="column-2"><center><strong>90%</td><td class="column-3"><center><strong>95%</td><td class="column-4"></td><td class="column-5"><center><strong>90%</td><td class="column-6"><center><strong>95%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>15</td><td class="column-2"><center>9</td><td class="column-3"><center>12</td><td class="column-4"></td><td class="column-5"><center>11</td><td class="column-6"><center>14</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>10</td><td class="column-2"><center>18</td><td class="column-3"><center>25</td><td class="column-4"></td><td class="column-5"><center>22</td><td class="column-6"><center>30</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>7.5</td><td class="column-2"><center>31</td><td class="column-3"><center>43</td><td class="column-4"></td><td class="column-5"><center>38</td><td class="column-6"><center>52</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>5.0</td><td class="column-2"><center>69</td><td class="column-3"><center>94</td><td class="column-4"></td><td class="column-5"><center>83</td><td class="column-6"><center>114</td>
</tr>
<tr class="row-7">
	<td class="column-1"><strong>2.5</td><td class="column-2"><center>270</td><td class="column-3"><center>370</td><td class="column-4"></td><td class="column-5"><center>329</td><td class="column-6"><center>451</td>
</tr>
<tr class="row-8">
	<td class="column-1"><strong>2.0</td><td class="column-2"><center>421</td><td class="column-3"><center>578</td><td class="column-4"></td><td class="column-5"><center>513</td><td class="column-6"><center>703</td>
</tr>
<tr class="row-9">
	<td class="column-1"><strong>1.0</td><td class="column-2"><center>1681</td><td class="column-3"><center>2305</td><td class="column-4"></td><td class="column-5"><center>2047</td><td class="column-6"><center>2807</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1017 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Sample size requirements for UX-Lite benchmark comparisons given various standard deviations (<em>s</em>), confidence levels, and critical differences (<em>d</em>), with power set to 80% (green shading shows the “magic range” for this table).</p>
<p>For example, to declare that you have significantly beaten a UX-Lite benchmark of 75 with 90% confidence, 80% power, a standard deviation of 19.3, and a critical difference of 15, you will need a sample size of 9, but you will also need the observed UX-Lite mean to be 90 (75 + 15) or higher.</p>
<p>At the other end of the table, if you have the same benchmark (75), 95% confidence, 80% power, a standard deviation of 21.3, and a critical difference of 1, you’ll only need the observed UX-Lite mean to be 76 (75 + 1), but you’ll need a sample size of 2,807.</p>
<p>In this table, the “magic range” for the critical difference is from 2.5 to 5, where the sample sizes are reasonably attainable (<em>n</em> from 69 to 451). The table also illustrates the tradeoff between the ability of a test to detect significant differences and the sample size needed to achieve that goal.</p>
<h3>Technical Note: What to Do for Different Standard Deviations</h3>
<p>If your historical UX-Lite data has a very different standard deviation from 19.3 or 21.3, you can do a quick computation to adjust the values in Table 1. The first step is to compute a multiplier by dividing the new target variance (the square of the standard deviation, <em>s</em><sup>2</sup>) by the variance used to create the table. Then multiply the tabled value of <em>n</em> by the multiplier and round it to get the revised estimate. To illustrate this, we’ll start with a standard deviation of 19.3 (our typical standard deviation) and show how this works if the target standard deviation (<em>s</em>) is 21.3 (our conservative estimate in Table 1). The target variance (21.3<sup>2</sup>) is 453.69. The initial variance is 372.49 (19.3<sup>2</sup>), making the multiplier 453.69/372.49 = 1.218. To use this multiplier to adjust the sample size for 95% confidence and a critical difference of 2.5 shown in Table 1 when <em>s</em> = 19.3, multiply 370 by 1.218 to get 450.66, then round it to 451. For more information, see our article, <a href="https://measuringu.com/how-do-changes-in-sd-affect-n/">How Do Changes in Standard Deviation Affect Sample Size Estimation</a>.</p>
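<p>The adjustment can be expressed as a short Python function. This is a minimal sketch; the function and parameter names are hypothetical, and it uses the unrounded multiplier rather than 1.218.</p>

```python
def adjust_sample_size(tabled_n, tabled_sd, target_sd):
    """Scale a tabled sample size by the ratio of target to tabled variance."""
    multiplier = target_sd ** 2 / tabled_sd ** 2
    return round(tabled_n * multiplier)


# Worked example from the text: n = 370 at s = 19.3, adjusted to s = 21.3.
print(adjust_sample_size(370, tabled_sd=19.3, target_sd=21.3))  # prints 451
```

<p>Because the adjustment is an approximation, results for small tabled values can differ by a participant or so from an estimate computed directly for the new standard deviation.</p>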
<h2>Summary and Takeaways</h2>
<p>What sample size do you need when conducting a UX-Lite benchmark test? To answer that question, you need several types of information, some common to all sample size estimation (confidence level to establish control of Type I errors, standard deviation, margin of error or critical difference) and others unique to statistical hypothesis testing (one- vs. two-tailed testing, setting a level of power to control Type II errors).</p>
<p>We provided a sample size table based on a typical standard deviation for the UX-Lite in retrospective UX studies (<em>s</em> = 19.3) and a more conservative standard deviation (<em>s</em> = 21.3), with examples of its use.</p>
<p>For UX researchers working in contexts where the typical standard deviation of the UX-Lite might differ, we provided a simple way to increase or decrease the tabled sample sizes for larger or smaller standard deviations. While there isn’t a magic number that will always work, in practice, there are ranges that satisfy many requirements. When comparing UX-Lite scores to benchmarks given measurement precision from 2.5 to 5 points and the typical standard deviation of 19.3, the sample sizes range from 69 to 370.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>UX and NPS Benchmarks of Mass Merchant Websites (2026)</title>
		<link>https://measuringu.com/ux-nps-benchmark-report-for-mass-merchant-websites-2026/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ux-nps-benchmark-report-for-mass-merchant-websites-2026</link>
					<comments>https://measuringu.com/ux-nps-benchmark-report-for-mass-merchant-websites-2026/#comments</comments>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD •&nbsp;Jim Lewis, PhD&nbsp;•&nbsp;Emily Short]]></dc:creator>
		<pubDate>Wed, 21 Jan 2026 04:50:47 +0000</pubDate>
				<category><![CDATA[Benchmarking]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[mass merchant]]></category>
		<category><![CDATA[NPS]]></category>
		<category><![CDATA[retail]]></category>
		<category><![CDATA[SUPR-Q]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46326</guid>

					<description><![CDATA[People spend a lot of money (and time) on online purchases, most of that on what we call mass merchant retail websites. The US Census Bureau estimates Q3 2025 retail e-commerce sales at $310B (15.8% of total retail sales that quarter). Spending continues to grow but is tempered by inflation, making shoppers more economically pessimistic [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/01/012026-FeatureImage-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46363 size-medium" src="https://measuringu.com/wp-content/uploads/2026/01/012026-FeatureImage-1-300x169.png" alt="Feature image showing a person holding a smartphone and a credit card" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/01/012026-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/01/012026-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/012026-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/01/012026-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/012026-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/01/012026-FeatureImage-1.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a>People spend a lot of money (and time) on online purchases, most of that on what we call mass merchant retail websites.</p>
<p>The US Census Bureau estimates <a href="https://www.census.gov/retail/ecommerce.html">Q3 2025 retail e-commerce sales</a> at $310B (15.8% of total retail sales that quarter). Spending continues to grow but is tempered by inflation, making shoppers more <a href="https://www.mckinsey.com/industries/consumer-packaged-goods/our-insights/the-state-of-the-us-consumer">economically pessimistic</a> and <a href="https://cbs6albany.com/resources/pdf/1457590f-06e7-4e14-8dec-6822bfb271ab-DI_2025HolidaySurvey.pdf">price-sensitive</a>, and <a href="https://www.emarketer.com/content/extra-costs-are-the-top-reason-consumers-abandon-online-carts">less tolerant of extra fees</a>.</p>
<p>In general, when purchase volume is high, competition tends to be strong and margins thin. To stand out, retailers need to offer online experiences that are <a href="https://theacsi.org/wp-content/uploads/2025/01/25jan_Retail-Study-FINAL.pdf">reliable, frictionless, and high in perceived value</a>.</p>
<p>To understand the quality of these online experiences, we collected UX benchmark metrics on seven popular mass merchant websites and mobile apps:</p>
<ul>
<li>Amazon</li>
<li>JC Penney</li>
<li>Kohl’s</li>
<li>Macy’s</li>
<li>Target</li>
<li>TJ Maxx</li>
<li>Walmart</li>
</ul>
<p>We computed <a href="https://measuringu.com/product/suprq/">SUPR-Q</a><sup>®</sup> and <a href="https://measuringu.com/nps-reliability/">Net Promoter</a> scores, measured users’ attitudes regarding their experiences, conducted <a href="https://measuringu.com/key-drivers/">key driver</a> analyses, and analyzed reported usability problems. Full details are in the <a href="https://measuringu.com/product/ux-nps-benchmark-report-for-mass-merchant-websites-2026/">downloadable report</a>.</p>
<h2>Benchmark Study Details</h2>
<p>In November and December 2025, we asked 351 U.S. users of mass merchant websites to reflect on their most recent experience, within the past year, with one of these websites on desktop and, if applicable, its mobile app.</p>
<p>Respondents completed the eight-item <a href="https://measuringu.com/product/suprq/">SUPR-Q</a> (which includes the <a href="https://measuringu.com/nps-reliability/">Net Promoter Score</a>), the two-item <a href="https://measuringu.com/evolution-of-the-ux-lite/">UX-Lite</a><sup>®</sup>, and the <a href="https://measuringu.com/article/streamlining-the-supr-qm-the-supr-qm-v2/">SUPR-Qm</a><sup>®</sup> standardized questionnaires, and they answered questions about their brand attitudes, usage, and prior experiences.</p>
<h2>Quality of the Website User Experience: SUPR-Q</h2>
<p>The SUPR-Q is a standardized questionnaire widely used for measuring attitudes toward the quality of a website user experience. Its norms are computed from a rolling database of around 200 websites across dozens of industries.</p>
<p>SUPR-Q scores are percentile ranks that tell you how a website’s experience ranks relative to the other websites (50<sup>th</sup> percentile is average). The SUPR-Q provides an overall score as well as detailed scores for subdimensions of Usability, Trust, Appearance, and Loyalty.</p>
<p>The mean SUPR-Q across mass merchant websites in this study was at the <strong>54<sup>th </sup>percentile</strong> (just above average), ranging from the 75<sup>th</sup> percentile for TJ Maxx to the 12<sup>th</sup> percentile for Walmart.</p>
<h3>Usability Scores</h3>
<p>Overall, usability scores were slightly above average for the mass merchant websites, averaging at the 60<sup>th</sup> percentile. Amazon had the highest usability score at the 78<sup>th</sup> percentile. Walmart had the lowest usability score, falling at the 15<sup>th</sup> percentile.</p>
<p>Comments related to usability on Walmart included:</p>
<p style="padding-left: 25px;"><em>“There are times when I can&#8217;t find something that I need or it can only be bought in bulk from the website.”</em></p>
<p style="padding-left: 25px;"><em>“Searching for something as simple as men&#8217;s pants, there&#8217;s no easy way to narrow down choices by specific size availability or other parameters.”</em></p>
<h3>Loyalty/Net Promoter Scores</h3>
<p>All the mass merchant websites except Walmart (with −30%) had positive Net Promoter Scores, led by TJ Maxx (25%) and Amazon (21%). The average NPS for these websites was 6% (slightly more promoters than detractors).</p>
<p>Unsurprisingly, ratings of the intention to keep using these websites correlated with their NPS (<em>r</em> = .55 at the website level). As shown in Figure 1, respondents were more likely to continue using the websites for Amazon, Kohl’s, Macy’s, and TJ Maxx than to continue using Target, Walmart, or JC Penney.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46364 size-full" src="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-1.png" alt="Likelihood to continue using the websites (90% confidence intervals)." width="1200" height="453" srcset="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-1.png 1200w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-1-300x113.png 300w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-1-1024x387.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-1-768x290.png 768w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-1-600x227.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Likelihood to continue using the websites (90% confidence intervals).</p>
<p>Comments related to NPS and loyalty included:</p>
<p style="padding-left: 25px;"><em>“First of all, I love TJ Maxx in general and shop there often. I tend to shop in person mostly, but I do go on the website to check for deals and items. I would recommend the website to a friend who doesn&#8217;t have easy access to a physical store so they could see all the great items they carry.” </em>— TJ Maxx</p>
<p style="padding-left: 25px;"><em>“I&#8217;d be happy to refer Amazon to a friend or colleague as it offers a ton of valuable things including fast shipping, an ability to comparison shop easily, movies to watch, free shipping if you join Prime, and super easy returns. Many, many reasons to recommend Amazon. Also, I have rarely had a problem and when I did it was solved quickly.” </em>— Amazon</p>
<p style="padding-left: 25px;"><em>“The Walmart website has gotten very difficult to use over the years. You have to use a lot of filters in order to see things that Walmart stores actually have available to pick up in store. Many, many items on the website are Chinese products that are available by shipping only. It feels very 3rd party and untrustworthy. Walmart has little affiliation with these products.” </em>— Walmart</p>
<h2>Website and Mobile App Usage</h2>
<p>As a part of this benchmark, we asked respondents how they accessed the mass merchants online. All respondents reported using their desktop/laptop computers (this was a requirement for participation in the survey); 52% of respondents also used mobile apps, and 69% also used mobile websites. Most respondents reported visiting their mass merchant websites on a desktop or laptop computer from a few times a year (Kohl’s, Macy’s, TJ Maxx) to never (Amazon, JC Penney, Target, Walmart). For all but one retailer, most respondents reported never using a mobile app—the exception was Amazon, for which most respondents reported using the mobile app a few times a week.</p>
<h2>Key Drivers of UX Quality</h2>
<p>To better understand what affects SUPR-Q scores and Likelihood-to-Recommend (LTR) ratings, we asked respondents to rate potentially important attributes of the mass merchant websites on a five-point scale from 1 (Strongly disagree) to 5 (Strongly agree). We conducted key driver analyses (regression modeling) to quantify the extent to which ratings on these items drive (account for) variation in overall SUPR-Q scores and, separately, LTR (the rating from which the NPS is derived; full details are in the <a href="https://measuringu.com/product/ux-nps-benchmark-report-for-mass-merchant-websites-2026/">downloadable report</a>).</p>
<p>The top key driver of the overall SUPR-Q scores was the ease of browsing for items (13%), followed by finding inspiration for products (9%) and having clear product images (8%). Taken together, 11 significant variables accounted for 71% of the variance in the SUPR-Q scores.</p>
<p>For likelihood-to-recommend (LTR), the top key drivers were easily finding deals (10%), clear product images (9%), and finding inspiration for products (8%). Overall, six significant drivers accounted for 45% of the variance in LTR.</p>
<p>Figure 2 shows a scatterplot of the importance and opportunity for improvement for seven key drivers (the top SUPR-Q and LTR drivers and drivers that accounted for significant percentages of their variation). The combination of importance and opportunity for improvement provides a basis for prioritizing which key drivers to improve. The importance score is the greater of the variance accounted for by the driver in the SUPR-Q and LTR analyses, where larger percentages indicate more importance.</p>
<p>The opportunity score is the <a href="https://measuringu.com/top-top-two-bottom-net-box/">top-box percentage</a> for the driver, so smaller percentages indicate greater opportunity for improvement (e.g., it would be harder to improve a driver with a top-box percentage of 100% than one with a top-box percentage of 10%).</p>
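<p>The two scores described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used for the report: the function names and the sample ratings are assumptions made for the example, and only the 9% and 8% values for the inspiration driver come from the text.</p>

```python
from typing import Sequence


def importance_score(suprq_pct: float, ltr_pct: float) -> float:
    """Return the greater of the two variance percentages for a driver."""
    return max(suprq_pct, ltr_pct)


def top_box_pct(ratings: Sequence[int], top: int = 5) -> float:
    """Return the percentage of ratings at the highest scale point."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    top_count = sum(1 for r in ratings if r == top)
    return 100.0 * top_count / len(ratings)


# Percentages reported in the text for "finding inspiration for products":
# 9% of SUPR-Q variance and 8% of LTR variance.
inspiration_importance = importance_score(9, 8)

# Hypothetical five-point ratings, for illustration only (not study data).
example_ratings = [5, 4, 5, 3, 2, 5, 1, 4]
example_opportunity = top_box_pct(example_ratings)
```

<p>A driver with a high importance score and a low top-box percentage would be a candidate for priority attention.</p>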
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46365 size-full" src="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-2.png" alt="Scatterplot of importance and opportunity for improvement of key drivers." width="1191" height="443" srcset="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-2.png 1191w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-2-300x112.png 300w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-2-1024x381.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-2-768x286.png 768w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-2-600x223.png 600w" sizes="(max-width: 1191px) 100vw, 1191px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Scatterplot of importance and opportunity for improvement of key drivers.</p>
<p>Two of these seven key drivers fell in the upper left quadrant with relatively high importance and higher opportunity for improvement (&#8220;Can find inspiration for desired products” and “Easy to find good deals/discounts&#8221;). TJ Maxx was the leader on these drivers with respective top-box scores of 52% and 50%, while Walmart had the lowest scores (15% and 21%, respectively).</p>
<h2>UX Problems</h2>
<p>We examined verbatim comments to better understand user experience problems.</p>
<p>The top themes were cluttered designs, stocking issues, and item quality.</p>
<h3>Cluttered Designs</h3>
<p>Cluttered designs dominated the comments for JC Penney (Figure 3), Kohl’s, Macy’s, and TJ Maxx, and clutter was the second-most mentioned problem for Target. Comments related to clutter included:</p>
<p style="padding-left: 25px;"><em>“Like previously mentioned, to me it&#8217;s just a big eyesore and I hate trying to navigate through it. It feels outdated and excessively ugly compared to almost any other online retailer that is popular these days, unfortunately.” </em>— JC Penney</p>
<p style="padding-left: 25px;"><em>“It is far too cluttered with an outdated design.” </em>— Kohl’s</p>
<p style="padding-left: 25px;"><em>“I&#8217;ve never had any specific problems with the website. I wish it were slightly more ‘clean’ looking. Sometimes I find it to look cluttered.” </em>— Macy’s</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-3.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46366 size-full" src="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-3.png" alt="Example of clutter on the JC Penney website." width="1200" height="485" srcset="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-3.png 1200w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-3-300x121.png 300w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-3-1024x414.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-3-768x310.png 768w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-3-600x243.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 3:</strong> Example of clutter on the JC Penney website.</p>
<h3>Out-of-Stock Items</h3>
<p>Items being out of stock was the most cited problem for Target (Figure 4) and was in the top three for TJ Maxx, Walmart, JC Penney, and Kohl’s.</p>
<p>Comments related to out-of-stock items included:</p>
<p style="padding-left: 25px;"><em>“I sometimes have a problem with an item being out of stock when I want to make a purchase.” </em>— Target</p>
<p style="padding-left: 25px;"><em>“My family often looks for items prior to going to the physical store to get the items. Unfortunately, there are often times when the website shows that items are available locally that end up not being in stock.” </em>— JC Penney</p>
<p style="padding-left: 25px;"><em>“All stores have an ever-changing product availability, so it’s hard to know when and where certain items become available.” </em>— TJ Maxx</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-4.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46367 size-full" src="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-4.png" alt="Out-of-stock items on the Target website." width="1200" height="687" srcset="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-4.png 1200w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-4-300x172.png 300w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-4-1024x586.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-4-768x440.png 768w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-4-600x344.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 4:</strong> Out-of-stock items on the Target website.</p>
<h3>Third-Party Sellers Drag Down the Experience on Walmart</h3>
<p>The most frequently mentioned issues for Amazon and Walmart concerned item quality, often stemming from poor prior experiences and the resulting trust issues with third-party sellers. Walmart shoppers also reported difficulty distinguishing products sold by third-party sellers from those sold by Walmart in search results, as this information appears only on the item details pages (Figure 5).</p>
<p>Comments related to item quality included:</p>
<p style="padding-left: 25px;"><em>“Amazon has loads of fake reviews and I have received counterfeit products when buying electronics in the past. I have become a lot more cautious regarding what I buy on Amazon.” </em>— Amazon</p>
<p style="padding-left: 25px;"><em>“One thing I don&#8217;t like is an uptick in 3rd party sellers that sometimes seem shady.” </em>— Walmart</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-5.png"><img loading="lazy" decoding="async" class="wp-image-46368 size-full alignnone" src="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-5.png" alt="Example of third-party seller on the Walmart website." width="1200" height="612" srcset="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-5.png 1200w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-5-300x153.png 300w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-5-1024x522.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-5-768x392.png 768w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-5-600x306.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 5:</strong> Example of a third-party seller on the Walmart website.</p>
<h2>Comparison with the 2021 Mass Merchant Benchmark</h2>
<p>In 2021, we collected SUPR-Q and NPS data for Amazon, Kohl’s, Macy’s, Target, and Walmart. Figure 6 shows a statistically significant decline in the UX of all five websites from 2021 to 2025 and low scores for Walmart compared to other websites in both surveys. The patterns for NPS were similar.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-6.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46369 size-full" src="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-6.png" alt="SUPR-Q scores from the 2021 and 2025 surveys (statistical analysis conducted on raw scores). " width="1200" height="660" srcset="https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-6.png 1200w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-6-300x165.png 300w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-6-1024x563.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-6-768x422.png 768w, https://measuringu.com/wp-content/uploads/2026/01/012026-Figure-6-600x330.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 6:</strong> SUPR-Q scores from the 2021 and 2025 surveys (statistical analysis conducted on raw scores).</p>
<p>UX scores were likely inflated in 2021 by the COVID pandemic, when online shopping wasn’t just convenient but was often the default (or safest) option, with positive sentiments flowing to the companies that were saving the day during the disruption.</p>
<p>The UX scores in 2025 mark a return to normalcy with higher expectations and lower tolerance for friction, exacerbated by poor customer sentiment due to lingering effects (real and perceived) of post-pandemic inflation.</p>
<p>Notably, the problem with third-party seller quality has been a strong drag on Walmart’s UX quality since we measured it in 2021 (the most mentioned problem then and in 2025).</p>
<h2>Summary and Takeaways</h2>
<p>Mass merchant services are big business in the US, with estimates for Q3 2025 retail e-commerce sales at $310B (15.8% of total retail sales that quarter). An analysis of the user experience of seven major mass merchant websites using data collected in November and December 2025 found:</p>
<ol>
<li><strong>TJ Maxx led while Walmart lagged. </strong>The mass merchant websites in this study collectively had SUPR-Q scores at the 54<sup>th</sup> percentile, just above the average 50<sup>th</sup> percentile. SUPR-Q scores ranged from the 12<sup>th</sup> percentile for Walmart to the 75<sup>th</sup> percentile for TJ Maxx. TJ Maxx was most likely to be recommended (NPS of 25%) while Walmart was the least likely (NPS of −30%).</li>
</ol>
<ol start="2">
<li><strong>Ease of browsing and ease of finding deals and discounts drive UX scores. </strong>Our key driver models accounted for 71% of the variation in SUPR-Q scores and 45% of the variation in LTR ratings. The top key driver of the mass merchant website experience was “It’s easy to browse for items on the website” (accounting for 13% of SUPR-Q variation). The top key driver in the modeling of LTR ratings was “I can easily find deals and discounts” (10%).</li>
</ol>
<ol start="3">
<li><strong>There are opportunities to improve influential key drivers. </strong>One way to prioritize attention to key drivers is to consider both their importance (beta weights in regression) and how well the websites achieve the stated goal (top-box scores). The two key drivers with the most potential for improvement (high beta weights and low top-box scores) were the extent to which the websites help users find inspiration for desired products (top-box score: 31%) and the ease of finding good deals/discounts (top-box score: 38%). TJ Maxx was the leader on these drivers with respective top-box scores of 52% and 50%, while Walmart had the lowest scores (15% and 21%, respectively).</li>
</ol>
<ol start="4">
<li><strong>Top UX problems reported by users were cluttered design, stocking issues, and item quality. </strong>Cluttered designs dominated the comments for JC Penney, Kohl’s, Macy’s, and TJ Maxx, and clutter was the second-most mentioned problem for Target. Items being out of stock was the most cited problem for Target and was in the top three for TJ Maxx, Walmart, JC Penney, and Kohl’s. The most frequently mentioned issues for Amazon and Walmart concerned item quality, often stemming from poor prior experiences and the resulting trust issues with third-party sellers.</li>
</ol>
<ol start="5">
<li><strong>Third-party sellers pose a threat to the UX of mass merchant websites (especially Walmart). </strong>Third-party seller quality has been a strong drag on Walmart’s UX quality; it was the top user complaint for Walmart in both 2021 and 2025. It has likely also affected the UX ratings of Amazon and Target, where it wasn’t the top complaint but made it into the top 10.</li>
</ol>
<p>For more details, see the <a href="https://measuringu.com/product/ux-nps-benchmark-report-for-mass-merchant-websites-2026/">downloadable report</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://measuringu.com/ux-nps-benchmark-report-for-mass-merchant-websites-2026/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>UX-Lite Sample Sizes for Confidence Intervals</title>
		<link>https://measuringu.com/ux-lite-sample-sizes-for-confidence-intervals/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ux-lite-sample-sizes-for-confidence-intervals</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 13 Jan 2026 23:31:20 +0000</pubDate>
				<category><![CDATA[Survey]]></category>
		<category><![CDATA[User Experience]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Sample Size]]></category>
		<category><![CDATA[UX-Lite]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46245</guid>

					<description><![CDATA[The UX-Lite® is an increasingly popular UX metric. There’s a reason for its popularity. It’s a simple two-item questionnaire that measures perceptions of the user experience of any interface (product, app, website). Its two five-point items are combined and scaled to generate an overall score and subscale scores on ease and usefulness from 0 to [&#8230;]]]></description>
										<content:encoded><![CDATA[		<div data-elementor-type="wp-post" data-elementor-id="46245" class="elementor elementor-46245" data-elementor-post-type="post">
				<div class="elementor-element elementor-element-4155ba10 e-flex e-con-boxed e-con e-parent" data-id="4155ba10" data-element_type="container" data-e-type="container">
					<div class="e-con-inner">
				<div class="elementor-element elementor-element-17a190db elementor-widget elementor-widget-text-editor" data-id="17a190db" data-element_type="widget" data-e-type="widget" data-widget_type="text-editor.default">
				<div class="elementor-widget-container">
									<p><img loading="lazy" decoding="async" class="alignleft wp-image-46294 size-medium" src="https://measuringu.com/wp-content/uploads/2026/01/011326-FeatureImage-1-300x169.png" alt="Feature image showing a calculator and a group of wooden pawns" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/01/011326-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/01/011326-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/011326-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/01/011326-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/011326-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/01/011326-FeatureImage-1.png 2000w" sizes="(max-width: 300px) 100vw, 300px" />The <a href="https://measuringu.com/evolution-of-the-ux-lite/">UX-Lite</a><sup>®</sup> is an increasingly popular UX metric.</p><p>There’s a reason for its popularity. It’s a simple two-item questionnaire that measures perceptions of the user experience of any interface (product, app, website).</p><p>Its two five-point items are combined and scaled to generate an overall score and subscale scores on ease and usefulness <a href="https://measuringu.com/converting-scales-to-100-points/">from 0 to 100</a>. The UX-Lite <a href="https://measuringu.com/article/effect-of-perceived-ease-of-use-and-usefulness-on-ux-and-behavioral-outcomes/">predicts future product usage</a> as well as or better than the original (and longer) <a href="https://measuringu.com/tam/">Technology Acceptance Model</a> (TAM). 
The ease score also <a href="https://measuringu.com/accuracy-of-sus-estimation-with-ux-lite/">predicts the SUS</a> with over 95% accuracy.</p><p>As with any UX metric, researchers should understand not only how to administer and <a href="https://measuringu.com/grading-scales-for-the-ux-lite/">score the metric</a> but also how to generate sample size estimates.</p><p>Finding the right sample size estimate isn’t about picking a magic number. While there may be some <a href="https://measuringu.com/might-not-be-a-magic-number-but-there-are-magic-ranges/">magic ranges</a>, the right process starts with the type of study design.</p><p>Three of the most common UX study designs are those with a focus on:</p><ul><li><strong>Estimation (Confidence Interval/Margin of Error):</strong> The primary focus of this type of study is to get precise measurements of UX metrics. For example, a company might want to conduct a survey with the UX-Lite to get an initial assessment of how easy to use and useful users find a product. The primary analytical method for this study would be a confidence interval (average UX-Lite plus or minus its margin of error).</li><li><strong>Comparison with a benchmark:</strong> The word “benchmark” is <a href="https://measuringu.com/benchmark-intro/">used in different ways in UX research</a>. In this context, we mean comparison with a benchmark value used as a criterion rather than comparisons of multiple products for high-level benchmarking. For example, a company might want to know if the UX-Lite score for their flagship product is above average, so they could compare their score to a benchmark of 78 (the lower bound of a grade of B on our <a href="https://measuringu.com/grading-scales-for-the-ux-lite/">standard grading scale for the UX-Lite</a>). 
The primary analytical method for this study would be a <em>t</em>-test comparing the observed average to the benchmark criterion.</li><li><strong>Comparison of two means:</strong> These types of studies focus on comparing means in various contexts, including, for example, using the UX-Lite to compare competitive products or the same product measured at multiple points in time. There are different ways to analyze this type of data, but the most commonly used method is the <em>t</em>-test.</li></ul><p>Across all three study types, a key ingredient is the <strong>historical standard deviation</strong> of the UX-Lite. Fortunately, we’ve collected enough data to have a good idea about a <a href="https://measuringu.com/reliability-and-variability-of-standardized-ux-scales/">typical UX-Lite standard deviation</a>.</p><p>In this article, we demonstrate how to compute the right sample size for UX-Lite confidence intervals by controlling the size of the margin of error (i.e., the desired level of precision).</p><h2>UX-Lite Sample Sizes for Estimation (Confidence Intervals/Margins of Error)</h2><p>A confidence interval is a statistical method for expressing the precision of measurement. The width of the confidence interval is twice the margin of error.</p><h3>The Elements of Sample Size Estimation for Confidence Intervals</h3><p>To compute a sample size for a confidence interval, you need to know (Figure 1):</p><ol><li>An estimate of the standard deviation of measurement (<em>s</em>)</li><li>The desired level of confidence (typically 90% or 95%)</li><li>The desired margin of error (MoE) around the average (plus or minus)</li></ol><p><a href="https://measuringu.com/wp-content/uploads/2026/01/Frame-148.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46293 size-full" src="https://measuringu.com/wp-content/uploads/2026/01/Frame-148.png" alt="Drivers of sample size estimation for confidence intervals." 
width="1200" height="406" srcset="https://measuringu.com/wp-content/uploads/2026/01/Frame-148.png 1200w, https://measuringu.com/wp-content/uploads/2026/01/Frame-148-300x102.png 300w, https://measuringu.com/wp-content/uploads/2026/01/Frame-148-1024x346.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/Frame-148-768x260.png 768w, https://measuringu.com/wp-content/uploads/2026/01/Frame-148-600x203.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p><p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Drivers of sample size estimation for confidence intervals.</p><ol><li>Our best estimate of the standard deviation of the UX-Lite when averaged across individuals is <strong>19.3,</strong> with an interquartile range from 16.6 (25<sup>th</sup> percentile) to 21.3 (75<sup>th</sup> percentile). This comes from our internal data collected over the years on many products, websites, and apps.</li><li>In applied UX research, the confidence level is usually 90% or 95%.</li><li>The final (and most influential) component is the desired margin of error that specifies how much uncertainty you can tolerate in your estimate. 
Realistic values for UX-Lite measurement are usually in the range from ±5 (reasonably precise) to ±10 (less precise but still able to distinguish between broad levels of acceptability).</li></ol><p>With these three ingredients, we can now compute the sample size using the formula we walk through in Chapter 6 of our book <a href="https://measuringu.com/book/quantifying-the-user-experience-practical-statistics-for-user-research/"><em>Quantifying the User Experience</em></a> and in the <a href="https://measuringu.com/sample-sizes-for-sus-ci/">previous article on SUS sample sizes</a>.</p><h3>Sample Size Table for UX-Lite Confidence Intervals</h3><p>Table 1 shows how variations in these three components affect sample size estimates for confidence intervals for the median standard deviation of 19.3 and for the 75<sup>th</sup> percentile of 21.3. In most cases, using the median standard deviation is reasonable, but when a sufficient sample size is more important than controlling the cost of sampling, it’s better to plan with the higher value.</p><p>
<table id="tablepress-1016" class="tablepress tablepress-id-1016">
<thead>
<tr class="row-1">
	<td class="column-1"></td><th colspan="2" class="column-2"><strong><center><i>s</i> = 19.3</center></strong></th><td class="column-4"></td><th colspan="2" class="column-5"><strong><center><i>s</i> = 21.3</center></strong></th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1"><strong>MoE</strong></td><td class="column-2"><strong>90%</strong></td><td class="column-3"><strong>95%</strong></td><td class="column-4"></td><td class="column-5"><strong>90%</strong></td><td class="column-6"><strong>95%</strong></td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>15</strong></td><td class="column-2">7</td><td class="column-3">9</td><td class="column-4"></td><td class="column-5">8</td><td class="column-6">11</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>10</strong></td><td class="column-2">13</td><td class="column-3">17</td><td class="column-4"></td><td class="column-5">15</td><td class="column-6">20</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>7.5</strong></td><td class="column-2">20</td><td class="column-3">28</td><td class="column-4"></td><td class="column-5">24</td><td class="column-6">34</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>5.0</strong></td><td class="column-2">43</td><td class="column-3">60</td><td class="column-4"></td><td class="column-5">51</td><td class="column-6">73</td>
</tr>
<tr class="row-7">
	<td class="column-1"><strong>2.5</strong></td><td class="column-2">164</td><td class="column-3">232</td><td class="column-4"></td><td class="column-5">199</td><td class="column-6">282</td>
</tr>
<tr class="row-8">
	<td class="column-1"><strong>2.0</strong></td><td class="column-2">254</td><td class="column-3">361</td><td class="column-4"></td><td class="column-5">309</td><td class="column-6">439</td>
</tr>
<tr class="row-9">
	<td class="column-1"><strong>1.0</strong></td><td class="column-2">1010</td><td class="column-3">1434</td><td class="column-4"></td><td class="column-5">1230</td><td class="column-6">1746</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1016 from cache --></p><p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Sample size requirements for UX-Lite confidence intervals given various standard deviations (<em>s</em>), confidence levels, and margins of error (MoE), with green highlighting for the &#8220;magic range&#8221; in the table.</p><p>For example, if you need an interval to have 90% confidence assuming <em>s</em> = 19.3 and precision of ±15 (a very imprecise estimate), then you need a sample size of only 7. At the other end of the table, if you need 95% confidence assuming <em>s</em> = 21.3 and precision of ±1 (a very precise estimate), you’ll need a sample size of 1,746.</p><p>You can see a sort of <a href="https://exoplanets.nasa.gov/resources/323/goldilocks-zone/">Goldilocks zone</a> or “<a href="https://measuringu.com/might-not-be-a-magic-number-but-there-are-magic-ranges/">magic range</a>” for reasonably precise margins of error (MoE from ±2 to ±5), which have reasonably attainable sample size requirements (<em>n</em> from 43 to 439). The table also shows how sample size estimates balance <a href="https://measuringu.com/sample-size-recommendations/">statistics and logistics</a>. The math for a high level of precision may indicate aiming for a sample size over 1,000, but the feasibility (cost and time) of obtaining that many participants might be prohibitive, even in a retrospective survey or unmoderated usability study. For all four columns in Table 1, moving the desired margin of error from ±2 to ±1 requires nearly quadrupling the sample size (a well-known <a href="https://measuringu.com/inverse-square-relationship/">inverse square relationship</a>).</p><h3>Technical Note: What to Do for Different Standard Deviations</h3><p>If your historical UX-Lite data has a very different standard deviation from 19.3 or 21.3, you can do a quick computation to adjust the values in Table 1. 
The first step is to compute a multiplier by dividing the new target variance (the square of the standard deviation, <em>s</em><sup>2</sup>) by the variance used to create the table. Then multiply the tabled value of <em>n</em> by the multiplier and round it to get the revised estimate. To illustrate this, we’ll start with a standard deviation of 19.3 (our typical standard deviation) and show how this works if the target standard deviation (<em>s</em>) is 21.3 (our conservative estimate in Table 1). The target variance (21.3<sup>2</sup>) is 453.69. The initial variance is 372.49 (19.3<sup>2</sup>), making the multiplier 453.69/372.49 = 1.218. To use this multiplier to adjust the sample size for 95% confidence and precision of ±2.5 shown in Table 1 when <em>s</em> = 19.3, multiply 232 by 1.218 to get 282.576, then round it off to 283 (within one participant of the 282 shown in Table 1 for <em>s</em> = 21.3). For more information, see our article, &#8220;<a href="https://measuringu.com/how-do-changes-in-sd-affect-n/">How Do Changes in Standard Deviation Affect Sample Size Estimation?</a>&#8221;</p><h2>Summary and Discussion</h2><p>In this article, we described how to compute sample size requirements for <strong>UX-Lite scores in studies focused on estimation</strong>. Determining the right sample size is not about selecting a single “magic number,” but about matching statistical requirements to the goals of the study.</p><h3>Sample Size Depends on Study Type</h3><p>We distinguished three common UX study types that generate different sample size requirements: estimation, comparison with a benchmark, and comparison of two means. Each study type implies a different primary analytical method and, therefore, a different approach to sample size estimation. 
This article focused specifically on <strong>estimation studies</strong>, in which the primary analytical method is a confidence interval used to describe the precision of a UX-Lite score.</p><h3>Estimation Requires Three Inputs</h3><p>Computing a sample size for UX-Lite estimation requires three inputs:</p><ol><li><strong>An estimate of the standard deviation: </strong>Based on accumulated UX-Lite data, we recommend generally using a standard deviation of <strong>19.3</strong> (but use a more conservative <strong>21.3</strong> when it’s critical to meet or exceed the precision goal of the study).</li><li><strong>The desired confidence level: </strong>In applied UX research, this is most commonly <strong>90% or 95%</strong>.</li><li><strong>The desired margin of error (MoE): </strong>The margin of error specifies how much uncertainty can be tolerated in the estimate.</li></ol><h3>Precision Is Purchased with Sample Size</h3><p>All other things being equal, setting a <strong>larger margin of error</strong> (less precise measurement) requires a <strong>smaller sample size</strong>, while setting a <strong>smaller margin of error</strong> (more precise measurement) requires a <strong>larger sample size</strong>. In other words, precision is purchased with sample size.</p><p>The table illustrates a practical planning range for UX-Lite estimation, with margins of error between <strong>±2 and ±5</strong> producing sample size requirements that are often feasible in retrospective surveys and unmoderated studies. 
It also illustrates the well-known inverse-square relationship between margin of error and sample size: reducing the margin of error by half requires approximately <strong>four times as many participants</strong>.</p><h3>Adjusting for Different Standard Deviations</h3><p>Because the variability of UX-Lite scores may differ across products and contexts, we also provided a straightforward method for adjusting the tabled sample sizes when the historical standard deviation differs from 19.3. By scaling sample sizes using the ratio of variances, researchers can quickly adapt the estimates to their own data while preserving the underlying statistical assumptions.</p><h3>Bottom Line</h3><p>There is no single “magic” sample size for UX-Lite studies. Effective sample size planning begins with the study goal, followed by explicit decisions about acceptable precision and practical constraints. The table and procedures presented here provide a consistent and defensible framework for making those decisions. For computational details, see Chapter 6 in <a href="https://www.amazon.com/Quantifying-User-Experience-Practical-Statistics/dp/0128023082/ref=sr_1_1?crid=20KG56CTBBSMR&amp;keywords=quantifying+the+user+experience&amp;qid=1640660707&amp;s=books&amp;sprefix=quantifying+the%2Cstripbooks%2C139&amp;sr=1-1"><em>Quantifying the User Experience</em></a> or our previous article on estimating sample sizes for <a href="https://measuringu.com/sample-sizes-for-sus-ci/">confidence intervals around SUS scores</a>.</p>								</div>
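<p>The planning arithmetic in this article can be sketched in Python. This is an illustrative sketch under stated assumptions, not the procedure used to build Table 1: the function names are hypothetical, and the first function uses a normal (z) approximation, so it returns values a few participants smaller than the tabled ones, which is what one would expect if the tabled values include a small-sample refinement (for example, substituting t for z). The second function applies the variance-ratio adjustment described in the technical note.</p>

```python
import math
from statistics import NormalDist


def z_sample_size(s: float, moe: float, confidence: float = 0.95) -> int:
    """First-pass sample size for a confidence interval (z approximation).

    s is the planning standard deviation, moe the desired margin of error
    (plus or minus), and confidence a two-sided level such as 0.90 or 0.95.
    """
    if not (s > 0 and moe > 0):
        raise ValueError("s and moe must be positive")
    if not (confidence > 0 and 1 > confidence):
        raise ValueError("confidence must be between 0 and 1")
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * s / moe) ** 2)


def adjust_sample_size(n_tabled: int, s_tabled: float, s_target: float) -> int:
    """Scale a tabled n by the ratio of target variance to tabled variance."""
    multiplier = (s_target ** 2) / (s_tabled ** 2)
    return round(n_tabled * multiplier)


# Example inputs taken from the article: s = 19.3, MoE of 5, 95% confidence.
first_pass_n = z_sample_size(19.3, 5, 0.95)

# Adjust the tabled n of 232 (s = 19.3) to a target s of 21.3.
adjusted_n = adjust_sample_size(232, 19.3, 21.3)
```

<p>With these inputs, the z-based first pass gives 58 (Table 1 shows 60), and the variance-ratio adjustment, computed without intermediate rounding, gives 283. Both results are close to the tabled values, which is the level of agreement to expect from quick planning approximations.</p>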
				</div>
					</div>
				</div>
				</div>
		]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How the SEQ Correlates with Other Task Metrics</title>
		<link>https://measuringu.com/how-the-seq-correlates-with-other-task-metrics/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-the-seq-correlates-with-other-task-metrics</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD&nbsp;•&nbsp;Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 07 Jan 2026 03:41:33 +0000</pubDate>
				<category><![CDATA[Satisfaction]]></category>
		<category><![CDATA[SEQ]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Task Completion]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46190</guid>

					<description><![CDATA[While task completion and task time are the default choices for measuring task effectiveness and task efficiency, the methods used to capture people’s feelings about an experience certainly seem more varied. But after measuring post-task perceptions for decades, we&#8217;ve found that a simple seven-point item does a good job of capturing not only perceptions of [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/01/010626-FeatureImage-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46265 size-medium" src="https://measuringu.com/wp-content/uploads/2026/01/010626-FeatureImage-1-300x169.png" alt="Feature image showing SEQ rating scale" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/01/010626-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/01/010626-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/010626-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/01/010626-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/010626-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/01/010626-FeatureImage-1.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a>While task completion and task time are the default choices for measuring task effectiveness and task efficiency, the methods used to capture people’s feelings about an experience certainly seem more varied.</p>
<p>But after measuring post-task perceptions for decades, we&#8217;ve found that a simple seven-point item does a good job of capturing not only perceptions of ease but of other adjacent emotions as well.</p>
<p>The <a href="https://measuringu.com/seq10/">Single Ease Question</a> (SEQ<sup>®</sup>) is a single seven-point question asked after participants attempt a task as part of a usability test or benchmark (Figure 1).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2025/12/Figure-1-The-SEQ-built-into-the-MUiQ®-platform.png" rel="attachment wp-att-46191"><img loading="lazy" decoding="async" class="alignnone wp-image-46191 size-full" src="https://measuringu.com/wp-content/uploads/2025/12/Figure-1-The-SEQ-built-into-the-MUiQ®-platform.png" alt="The SEQ (built into the MUiQ® platform). " width="1080" height="203" srcset="https://measuringu.com/wp-content/uploads/2025/12/Figure-1-The-SEQ-built-into-the-MUiQ®-platform.png 1080w, https://measuringu.com/wp-content/uploads/2025/12/Figure-1-The-SEQ-built-into-the-MUiQ®-platform-300x56.png 300w, https://measuringu.com/wp-content/uploads/2025/12/Figure-1-The-SEQ-built-into-the-MUiQ®-platform-1024x192.png 1024w, https://measuringu.com/wp-content/uploads/2025/12/Figure-1-The-SEQ-built-into-the-MUiQ®-platform-768x144.png 768w, https://measuringu.com/wp-content/uploads/2025/12/Figure-1-The-SEQ-built-into-the-MUiQ®-platform-600x113.png 600w" sizes="(max-width: 1080px) 100vw, 1080px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> The SEQ (built into the <a href="https://measuringu.com/muiq/">MUiQ<sup>®</sup> platform</a>).</p>
<p>The SEQ is probably the most popular measure of the perception of post-task ease, having over 600 citations since its <a href="https://measuringu.com/article/comparison-of-three-one-question-post-task-usability-questionnaires/">publication in 2009.</a> It has become a frequently used tool in the toolbox of many UX practitioners and researchers.</p>
<p>For this article, we dug into our historical data to explore the extent to which the SEQ correlates with other task metrics.</p>
<h2>Satisfaction</h2>
<p>In an <a href="https://measuringu.com/how-much-does-satisfaction-correlate-with-ease">analysis of six different sets of task data,</a> we calculated 83 correlations between satisfaction and SEQ ratings collected from 1,768 participants across 21 unique products. Using each individual’s ratings, the correlation between satisfaction and ease was .71 (95% confidence interval from .69 to .73, 50% shared variance). When using means across tasks at the product level, the correlation was even higher (<em>r</em> = .91, 95% confidence interval from .83 to .99, 83% shared variance).</p>
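<p>As a rough illustration of how figures like these relate to each other, the sketch below squares a correlation to get shared variance and builds an approximate 95% confidence interval with the Fisher z-transformation. The Fisher method is one common approach and is an assumption here, not necessarily the method used for the intervals reported in this article. Treating 1,768 as the sample size for the individual-level correlation is also a simplifying assumption, and the function names are hypothetical.</p>

```python
import math


def shared_variance(r):
    """Proportion of variance two measures share: the squared correlation."""
    return r ** 2


def fisher_ci(r, n, z_crit=1.96):
    """Approximate CI for a Pearson r using the Fisher z-transformation.

    r      -- observed correlation (strictly between -1 and 1)
    n      -- number of paired observations (assumed, must exceed 3)
    z_crit -- critical z value (1.96 for a 95% interval)
    """
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Individual-level satisfaction correlation from the text (n is an assumption).
low, high = fisher_ci(0.71, 1768)
print(f"95% CI {low:.2f} to {high:.2f}, shared variance {shared_variance(0.71):.0%}")
```

<p>With these inputs, the sketch gives an interval that rounds to .69 to .73 and a shared variance of about 50%, in line with the individual-level values above. Intervals for very small samples can differ noticeably depending on the method used.</p>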
<h2>NASA Task Load Index (TLX)</h2>
<p>The <a href="https://measuringu.com/nasa-tlx/">NASA TLX</a> is a six-item questionnaire designed to measure perceived workload (mental and physical). Although we rarely investigate tasks where workload is a primary metric, in one study with ten participants, we collected the SEQ alongside the TLX and found the average correlation was quite strong for the overall measure (<em>r</em> = −.89, 95% confidence interval from −1.0 to −.74, 79% shared variance) and for the six dimensions (<em>r</em> = −.57 to <em>r</em> = −.95, 95% confidence intervals from −1.00 to −.07 and −1.00 to −.88, shared variance from 32% to 90%), suggesting good convergent validity.</p>
<h2>Tapping</h2>
<p>Tapping during task performance has been proposed as a real-time <a href="https://dl.acm.org/doi/10.1145/2038476.2038481">measure of cognitive load</a>. Across two lab-based experiments with 28 participants, we compared tapping with SEQ scores across tasks and found a correlation of <em>r</em> = −.40 (95% confidence interval from −.73 to −.07, 16% shared variance). That leaves 84% of the variance unaccounted for, so SEQ scores don’t completely replace the information obtained from tapping. On the other hand, the relationship is strong enough to consider using the SEQ as a proxy for cognitive load when a measurement of tapping is not essential for the research questions.</p>
<h2>Lostness</h2>
<p>Lostness is a measure of how a user navigates an interface (usually a website) relative to the most efficient (&#8220;happy&#8221;) path. One way to measure lostness is by the ratio of unique pages relative to the minimum number of pages and the total number of pages relative to the unique number of pages.</p>
<p>However, computing this lostness measure is time consuming as it requires identifying the minimum number of pages or steps needed to complete a task, as well as counting all screens and the number of unique screens.</p>
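<p>The two ratios described above are commonly combined using Smith's (1996) lostness formula, L = sqrt((N/S − 1)<sup>2</sup> + (R/N − 1)<sup>2</sup>), where N is the number of unique pages visited, S is the total number of pages visited, and R is the minimum number of pages needed. The sketch below assumes that formulation. The function and parameter names are hypothetical, and the example counts are made up for illustration.</p>

```python
import math


def lostness(unique_pages, total_pages, minimum_pages):
    """Lostness score assuming Smith's (1996) formulation.

    unique_pages  -- N, number of different pages visited
    total_pages   -- S, total pages visited (revisits counted)
    minimum_pages -- R, fewest pages needed to complete the task
    Returns 0.0 for a perfectly efficient path; larger values mean more lost.
    """
    if min(unique_pages, total_pages, minimum_pages) < 1:
        raise ValueError("all page counts must be at least 1")
    if unique_pages > total_pages:
        raise ValueError("unique pages cannot exceed total pages")
    revisit_term = unique_pages / total_pages - 1    # N/S - 1
    detour_term = minimum_pages / unique_pages - 1   # R/N - 1
    return math.sqrt(revisit_term ** 2 + detour_term ** 2)


# Illustrative (made-up) path: 8 page views, 6 unique pages, 4 needed.
print(round(lostness(unique_pages=6, total_pages=8, minimum_pages=4), 2))
```

<p>A participant who follows the shortest path exactly (for example, 4 unique pages, 4 total, 4 minimum) gets a score of 0. The hard part in practice is not this arithmetic but establishing the minimum path and the page counts for every participant and task.</p>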
<p>After analyzing 73 users who attempted eight tasks across three products, we found that the correlation between SEQ and <a href="https://measuringu.com/lostness/">lostness</a> is high at the individual level (<em>r</em> = −.52, 95% confidence interval from −.69 to −.35, 27% shared variance) and very high at the task level (<em>r</em> = −.95, 95% confidence interval from −1.00 to −.86, 90% shared variance).</p>
<h2>Completion Rates</h2>
<p>The <a href="https://measuringu.com/seq-prediction/">SEQ and completion rates</a> correlate strongly at the task level (<em>r</em> = .66, 95% confidence interval from .59 to .73, 44% shared variance). <span data-preserver-spaces="true">This strong relationship (Figure 2) shows that participants&#8217; thoughts about a task generally correspond to what actually happened (but not perfectly, similar to what we found with </span><a class="editor-rtfLink" href="https://measuringu.com/task-comp-sus/" target="_blank" rel="noopener"><span data-preserver-spaces="true">SUS scores</span></a><span data-preserver-spaces="true">).</span></p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/Figure-2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46268 size-full" src="https://measuringu.com/wp-content/uploads/2026/01/Figure-2.png" alt="Figure 2: Relationship between SEQ scores (converted to deciles) and completion rates for 286 moderated and unmoderated tasks (collected from 2014 to 2018 with sample sizes from eight to 601 participants)." width="1200" height="600" srcset="https://measuringu.com/wp-content/uploads/2026/01/Figure-2.png 1200w, https://measuringu.com/wp-content/uploads/2026/01/Figure-2-300x150.png 300w, https://measuringu.com/wp-content/uploads/2026/01/Figure-2-1024x512.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/Figure-2-768x384.png 768w, https://measuringu.com/wp-content/uploads/2026/01/Figure-2-600x300.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Relationship between SEQ scores (converted to deciles) and completion rates for 286 moderated and unmoderated tasks (collected from 2014 to 2018 with sample sizes from eight to 601 participants).</p>
<h2>Task Completion Time</h2>
<p>Using the same dataset as the completion rate data, we looked at the relationship between SEQ and concurrently collected task times from 270 tasks (Figure 3). Sample sizes were between 8 and 601. The task times were for all participants, whether or not the task was completed successfully. We found a strong correlation (<em>r</em> = −.53, 95% confidence interval from −.61 to −.45, 28% shared variance) between SEQ and the logarithm of task times (as tasks take longer, people’s perception of ease goes down). This means that perception of ease (SEQ scores) can explain about 28% of the variation in log task times.</p>
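<p>To make the log-time analysis concrete, the following minimal sketch correlates task-level SEQ means with the natural logarithm of mean task times. The five data points are hypothetical values invented for illustration (they are not from the dataset described above), and the helper name is an assumption.</p>

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical task-level data: mean SEQ rating and mean task time (seconds).
mean_seq = [6.4, 5.9, 5.1, 4.6, 3.8]
mean_time_seconds = [35, 60, 95, 180, 420]

# Task times are typically right-skewed, so correlate with log time.
log_times = [math.log(t) for t in mean_time_seconds]

r = pearson_r(mean_seq, log_times)
print(f"r = {r:.2f}, shared variance = {r ** 2:.0%}")
```

<p>The made-up values produce a much stronger negative correlation than the −.53 reported above, which is expected for a tiny, tidy example. The point is only the mechanics: log-transform the times first, then correlate.</p>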
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/Figure-3.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46267 size-full" src="https://measuringu.com/wp-content/uploads/2026/01/Figure-3.png" alt="Figure 3: Relationship between SEQ scores (converted to deciles) and completion times for 286 moderated and unmoderated tasks (collected from 2014 to 2018 with sample sizes from eight to 601 participants)." width="1200" height="567" srcset="https://measuringu.com/wp-content/uploads/2026/01/Figure-3.png 1200w, https://measuringu.com/wp-content/uploads/2026/01/Figure-3-300x142.png 300w, https://measuringu.com/wp-content/uploads/2026/01/Figure-3-1024x484.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/Figure-3-768x363.png 768w, https://measuringu.com/wp-content/uploads/2026/01/Figure-3-600x284.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 3:</strong> Relationship between SEQ scores (converted to deciles) and completion times for 286 moderated and unmoderated tasks (collected from 2014 to 2018 with sample sizes from eight to 601 participants).</p>
<h2>Mental Effort</h2>
<p>The Subjective Mental Effort Questionnaire (SMEQ) is a validated tool used to measure the perceived effort of a task. It was originally developed in the Netherlands in the 1990s, but it hasn’t been used much. However, it has some unusual characteristics that make it an interesting measure: (1) the response ranges from 0 to 150; (2) nine of the interior points (but neither endpoint) are labeled; (3) the intervals between labels are unequal.</p>
<p>In <a href="https://measuringu.com/comparison-of-standard-seq-and-click-smeq-sensitivity/?utm_source=chatgpt.com">several studies</a> comparing the SMEQ and SEQ, we have found consistently strong correlations (averaged across five tasks, <em>r</em> = .80, 95% confidence interval from .30 to 1.0, 64% shared variance) between variations of the SMEQ and SEQ, consistent with the expectation that there should be a relationship between perceived ease and perceived mental effort.</p>
<h2>Relationship of SEQ with ASQ, Errors, and CES</h2>
<p>In the previous sections, we’ve discussed the correlation of the SEQ with seven other task metrics.</p>
<p>This section describes the relationship of the SEQ with three other metrics for which we do not have precise quantitative correlations: the After-Scenario Questionnaire (ASQ), task errors, and the Customer Effort Score (CES).</p>
<h3>ASQ</h3>
<p>The <a href="https://www.researchgate.net/publication/230786769_Psychometric_evaluation_of_an_after-scenario_questionnaire_for_computer_usability_studies_The_ASQ">ASQ</a> was developed in 1988 as part of an <a href="https://www.researchgate.net/publication/221053907_Integrated_office_software_benchmarks_A_case_study">IBM Research usability metrics project</a>. It has three seven-point items measuring satisfaction with perceived ease, task duration, and support documentation. The item format of the ASQ is completely different from the standard SEQ (specifically, the ASQ has agreement endpoints, satisfaction-based wording, and a lower ASQ rating indicates a better experience). For comparison with Figure 1, the wording of the ASQ ease item is, “Overall, I am satisfied with the ease of completing the tasks in this scenario” (with seven numbered response options, 1 anchored with Strongly Agree and 7 with Strongly Disagree).</p>
<p>Due to their similarity, we’ve never concurrently collected the ASQ ease item and the SEQ. However, when we compared two versions of the SEQ that differed in endpoint polarity (manipulating the left/right location of Very Difficult and Very Easy), we found no evidence for a <a href="https://measuringu.com/revisiting-the-left-side-bias/">left-side bias</a>, <a href="https://measuringu.com/reversing-seq-endpoints-means/">no significant difference in means</a>, and <a href="https://measuringu.com/reversing-seq-endpoints-response-distributions/">overall no significant difference in top-box scores</a>. We also <a href="https://measuringu.com/comparison-of-SEQ-with-and-without-numbers/">found no statistically significant differences</a> for SEQ versions with slightly different item stems.</p>
<p>So, we strongly suspect that respondent behaviors with the SEQ and the ease item of the ASQ would be almost identical.</p>
<h3>Task Errors</h3>
<p>The key task metrics for usability studies are successful task completion, task completion time, and subjective ratings of constructs like perceived ease (e.g., the SEQ). A less commonly reported task metric is the <a href="https://measuringu.com/errors-ux/">number of errors</a> committed by participants.</p>
<p>We do not have any recent data with concurrently collected SEQ ratings and error counts. In 2009, however, we published analyses of data from 90 anonymized usability studies contributed by numerous UX practitioners (Jeff at MeasuringU, Jim at IBM, and many others for a total of 1,034 unique tasks). The studies contributed by Jeff included early versions of the SEQ, those contributed by Jim included the ASQ, and those contributed by others were not identical, but all were subjective task ratings. Our best estimate of the correlation between task satisfaction (e.g., SEQ, ASQ, and other subjective ratings) and the number of errors was <em>r</em> = −.44 (95% confidence interval from −.77 to −.11, 19% shared variance).</p>
<p>We don’t have a good estimate of the specific correlation between SEQ and task errors, but we expect it would not be radically different from the correlation of −.44 we published in 2009.</p>
<h3>Customer Effort Score (CES)</h3>
<p>The <a href="https://measuringu.com/customer-effort-score/">CES is a single item</a> developed for customers to rate how easy it was to interact with an organization in the context of a support issue—a specialized, real-world task. From its first to second versions, the scale evolved from a five-point item-specific scale (with endpoints from Very Low Effort to Very High Effort) to a fully labeled seven-point agreement scale (specifically, “The organization made it easy for me to handle my issue” with endpoints from 1: Strongly Disagree to 7: Strongly Agree).</p>
<p>It’s often cited as a good alternative to the Net Promoter Score (<a href="https://measuringu.com/nps-ux/">NPS</a>) despite little data to support this claim. Also, its wording makes it much less generalizable than the NPS, as many customers don’t contact support. We aren’t aware of solid published benchmarks on the CES beyond its original publication and some general advice.</p>
<p>We don’t have correlational data for the SEQ and CES, but we would expect a high correlation because the wording and construction of the CES are so similar to the SEQ (seven-point scales assessing task ease with 1 the poorest and 7 the best rating). We’ve tested <a href="https://measuringu.com/evolution-of-seq/">many variations of SEQ wording/formats</a> and found little difference in means and consistently high correlations. We’ve also seen little impact when scales are changed from <a href="https://uxpajournal.org/survey-item-formats-agreement-specific-endpoints/">item-specific to agreement</a> formats.</p>
<p>The current manifestation of the CES so closely resembles the SEQ that they’re hard to differentiate (the ease of handling an issue is akin to the ease of completing a task). Because the SEQ is likely to be a good proxy for the CES, our <a href="https://measuringu.com/adjective-interpretations-of-seq-scores/">SEQ benchmarks</a> may be helpful for interpreting the CES.</p>
<p>From an experiment in which we manipulated task difficulty and had respondents rate their experience with the SEQ (standard seven-point version) and an adjective scale, we got the following results (range of SEQ means associated with the adjective scale):</p>
<p>1.00–1.49: Most difficult imaginable</p>
<p>1.50–2.69: Very difficult</p>
<p>2.70–4.29: Difficult</p>
<p>4.30–5.59: Easy</p>
<p>5.60–6.49: Very easy</p>
<p>6.50–7.00: Easiest imaginable</p>
<p>Note that our empirically estimated boundaries for the response options of the adjective scale are similar to published heuristics (not empirically validated) for interpreting the CES. For example, a <a href="https://www.intellicon.io/customer-effort-score-cef/">recent online source</a> (Intellicon) recommended interpreting scores below 3.5 as poor, 3.5–4.4 as below average, 4.5–5.4 as average, 5.5–6.4 as good, and 6.5–7.0 as excellent.</p>
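<p>The SEQ ranges listed above can be expressed as a small lookup. This sketch uses the lower bound of each range as a cut point, so a mean that falls between two printed ranges (such as 1.495) is assigned to the lower band. That handling, and the function name, are assumptions made for illustration.</p>

```python
from bisect import bisect_right

# Lower bounds of each band after the first, taken from the ranges above.
_CUT_POINTS = [1.5, 2.7, 4.3, 5.6, 6.5]
_ADJECTIVES = [
    "Most difficult imaginable",  # 1.00-1.49
    "Very difficult",             # 1.50-2.69
    "Difficult",                  # 2.70-4.29
    "Easy",                       # 4.30-5.59
    "Very easy",                  # 5.60-6.49
    "Easiest imaginable",         # 6.50-7.00
]


def seq_adjective(mean_seq):
    """Return the adjective band for a mean SEQ score on the 1-7 scale."""
    if not 1.0 <= mean_seq <= 7.0:
        raise ValueError("mean SEQ must be between 1 and 7")
    # bisect_right counts how many cut points are at or below the score.
    return _ADJECTIVES[bisect_right(_CUT_POINTS, mean_seq)]


for score in (2.1, 4.3, 5.59, 6.8):
    print(score, seq_adjective(score))
```

<p>Because the bands describe means, the lookup is intended for task-level averages rather than individual responses.</p>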
<h2>Summary</h2>
<p>The SEQ is popular because it’s short <strong>and</strong> often provides concurrent validity for other UX metrics. When correlations are high, it can also act as a proxy for other measures, especially those that are challenging to collect in typical UX research contexts. Table 1 summarizes the correlations between the SEQ and other task-level measures.</p>

<table id="tablepress-1015" class="tablepress tablepress-id-1015">
<thead>
<tr class="row-1">
	<th class="column-1">Metric</th><th class="column-2">Level</th><th class="column-3">Correlation (<i>r</i>)</th><th class="column-4">Shared Variance (<i>R</i><sup>2</sup>)</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">Satisfaction</td><td class="column-2">Individual</td><td class="column-3"> 0.71</td><td class="column-4">50%</td>
</tr>
<tr class="row-3">
	<td class="column-1"></td><td class="column-2">Product</td><td class="column-3"> 0.91</td><td class="column-4">83%</td>
</tr>
<tr class="row-4">
	<td class="column-1">NASA TLX</td><td class="column-2">Task</td><td class="column-3">−0.89</td><td class="column-4">79%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Tapping</td><td class="column-2">Individual</td><td class="column-3">−0.40</td><td class="column-4">16%</td>
</tr>
<tr class="row-6">
	<td class="column-1">Lostness</td><td class="column-2">Individual</td><td class="column-3">−0.52</td><td class="column-4">27%</td>
</tr>
<tr class="row-7">
	<td class="column-1"></td><td class="column-2">Task</td><td class="column-3">−0.95</td><td class="column-4">90%</td>
</tr>
<tr class="row-8">
	<td class="column-1">Completion</td><td class="column-2">Task</td><td class="column-3"> 0.66</td><td class="column-4">44%</td>
</tr>
<tr class="row-9">
	<td class="column-1">Time</td><td class="column-2">Task</td><td class="column-3">−0.53</td><td class="column-4">28%</td>
</tr>
<tr class="row-10">
	<td class="column-1">SMEQ</td><td class="column-2">Task</td><td class="column-3"> 0.80</td><td class="column-4">64%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1015 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Summary of SEQ and other task-level metrics (due to reduced variability, correlations at task and product levels tend to be higher than at the individual level; all correlations are statistically significant with <em>p</em> &lt; .05).</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>48 UX Metrics, Methods, &#038; Measurement Articles from 2025</title>
		<link>https://measuringu.com/ux-2025/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ux-2025</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Wed, 31 Dec 2025 04:00:34 +0000</pubDate>
				<category><![CDATA[Measurement]]></category>
		<category><![CDATA[Methods]]></category>
		<category><![CDATA[Metrics]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[MUIQ]]></category>
		<category><![CDATA[SUPR-Q]]></category>
		<category><![CDATA[SUPR-Qm]]></category>
		<category><![CDATA[UX-Lite]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46056</guid>

					<description><![CDATA[Happy New Year from all of us at MeasuringU®! In 2025, we posted 48 articles and continued to add features to our MUiQ® UX testing platform to make it even easier to develop studies and analyze results. We hosted our 12th UX Measurement Bootcamp—a blended virtual event attended by UX practitioners who completed a combination [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><img loading="lazy" decoding="async" class="alignleft wp-image-46236 size-medium" src="https://measuringu.com/wp-content/uploads/2025/12/123025-FeatureImage-4-300x169.png" alt="" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2025/12/123025-FeatureImage-4-300x169.png 300w, https://measuringu.com/wp-content/uploads/2025/12/123025-FeatureImage-4-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2025/12/123025-FeatureImage-4-768x432.png 768w, https://measuringu.com/wp-content/uploads/2025/12/123025-FeatureImage-4-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2025/12/123025-FeatureImage-4-600x338.png 600w, https://measuringu.com/wp-content/uploads/2025/12/123025-FeatureImage-4.png 2000w" sizes="(max-width: 300px) 100vw, 300px" />Happy New Year from all of us at MeasuringU<sup>®</sup>!</p>
<p>In 2025, we posted 48 articles and continued to add features to our <a href="https://measuringu.com/approach/muiq/">MUiQ</a><sup>®</sup> UX testing platform to make it even easier to develop studies and analyze results.</p>
<p>We hosted our 12<sup>th</sup> <a href="https://measuringu.com/events/ux-measurement-bootcamp-2025/">UX Measurement Bootcamp</a>—a <a href="https://www.teachtci.com/blog/types-of-blended-learning-models/#:~:text=Blended%20learning%20models%20usually%20leverage,whether%20online%20or%20in%20person.">blended</a> virtual event attended by UX practitioners who completed a combination of <a href="https://measuringu.com/courses/">MeasuringUniversity</a><sup><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2122.png" alt="™" class="wp-smiley" style="height: 1em; max-height: 1em;" /></sup> online courses and live Zoom sessions. It was a challenging four weeks of intensive training on UX methods, metrics, and measurements; plus, groups worked together to design a study in MUiQ, collect and analyze data, and prepare a report.</p>
<p>Through MeasuringUniversity, we continue to offer access to our webinars plus full courses on:</p>
<ul>
<li><a href="https://measuringu.com/courses/survey-design-and-analysis-for-ux-customer-research/">Survey Design and Analysis for UX &amp; Customer Research</a></li>
<li><a href="https://measuringu.com/courses/practical-statistics-for-ux-and-customer-research/">Practical Statistics for UX and Customer Research</a></li>
<li><a href="https://measuringu.com/courses/ux-metrics/">UX Metrics</a></li>
<li><a href="https://measuringu.com/courses/ux-methods/">UX Methods</a></li>
</ul>
<p>In addition to publishing our latest refereed journal article, &#8220;<a href="https://measuringu.com/article/streamlining-the-supr-qm-the-supr-qm-v2/">Streamlining the SUPR-Qm: The SUPR-Qm V2</a>&#8221; in the <a href="https://uxpajournal.org/"><em>Journal of User Experience</em></a>, we conducted research and went deep into several UX topics, including metrics, methods, statistics, and industry benchmarks.</p>
<p>The annual review of our blog articles is a great way to catch up on what you’ve missed from MeasuringU in 2025!</p>
<h2>20 Years of MeasuringU</h2>
<p>2025 marked the 20<sup>th</sup> anniversary of the founding of MeasuringU. We had a party, of course, and we also published four articles about our history.</p>
<ul>
<li><strong>The Foundational Years (1998–2008).</strong> Jeff incorporated MeasuringU in 2005, but the <a href="https://measuringu.com/20-years-of-measuringu-part-1/">company’s story began in 1998</a> when Jeff was trained in statistical process control at General Electric. This led to an intense interest in usability measurement and the original measuringusability.com website in 2004. Between 2005 and 2008, Jeff had his first collaborations with Jim Lewis on sample size estimation and statistical methods for small samples, and he investigated statistical summarization of usability metrics with Erika Kindlund.</li>
<li><strong>Growth and Change (2009–2015).</strong> In the <a href="https://measuringu.com/20-years-of-measuringu-part-2/">early part of this period</a>, Jeff hired the first MeasuringU employee and hosted the first Lean UX Denver conference (basically, our first UX measurement bootcamp). Jeff published his first two books (<a href="https://measuringu.com/book/a-practical-guide-to-measuring-usability/"><em>A Practical Guide to Measuring Usability</em></a> and <em><a href="https://measuringu.com/book/a-practical-guide-to-the-system-usability-scale/">A Practical Guide to the System Usability Scale</a></em>) and, with Jim Lewis, published the first edition of <a href="https://measuringu.com/book/quantifying-the-user-experience-practical-statistics-for-user-research/"><em>Quantifying the User Experience</em></a>. By the end of this period, the company had a new name (from Measuring Usability to MeasuringU), a new questionnaire for assessing the UX of websites (<a href="https://measuringu.com/product/suprq/">SUPR-Q</a><sup>®</sup>), and the first version of the <a href="https://measuringu.com/muiq/">MUiQ platform</a>.</li>
<li><strong>MUiQ and an Explosion of Research (2016–2025).</strong> In the first part of <a href="https://measuringu.com/20-years-of-measuringu-part-3/">this period</a>, MeasuringU outgrew its original space and added custom labs with one-way mirrors, published <a href="https://measuringu.com/book/benchmarking-the-user-experience/"><em>Benchmarking the User Experience</em></a> and the second edition of <em>Quantifying the User Experience,</em> enhanced MUiQ for data collection and licensing, developed new UX methods and metrics (<a href="https://measuringu.com/article/a-pragmatic-services-for-scoring-product-usability/">PURE</a>, <a href="https://measuringu.com/article/suprqm/">SUPR-Qm</a><sup>®</sup>), and had over a million <a href="https://measuringu.com/blogs/">blog views</a> in 2017. The period between 2020 and 2025 saw the publication of <a href="https://measuringu.com/book/surveying-the-user-experience/"><em>Surveying the User Experience</em></a>, increased research and support staff, development of more UX metrics (<a href="https://measuringu.com/from-umux-lite-to-ux-lite/">UX-Lite</a><sup><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2122.png" alt="™" class="wp-smiley" style="height: 1em; max-height: 1em;" /></sup>, <a href="https://measuringu.com/how-to-use-the-tac/">TAC-10</a><sup><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2122.png" alt="™" class="wp-smiley" style="height: 1em; max-height: 1em;" /></sup>, a streamlined version of <a href="https://measuringu.com/article/streamlining-the-supr-qm-the-supr-qm-v2/">SUPR-Qm</a>), and an enterprise-ready version of MUiQ.</li>
<li><strong>What Metrics Has MeasuringU Created?</strong> <em>So far, we’ve published sixteen, most notably the SUPR-Q and the UX-Lite.</em> At MeasuringU, we don’t just use UX metrics—<a href="https://measuringu.com/what-ux-metrics-has-measuringu-created/">we create them</a>. Working back from the most recent, this article describes the 16 psychometrically qualified UX metrics (both completely original and adapted from existing questionnaires) that Jim and Jeff have published in their careers (so far).</li>
</ul>
<h2>Standardized UX Metrics</h2>
<p>This year, our work on standardized UX metrics included four articles on a taxonomy of 70+ UX metrics, six on the development of the SUPR-Qm V2, two on the relationship between perceived ease and satisfaction, and three on various other topics (grading scales for the UX-Lite, editing standard UX questions, and a report card on claims about the NPS).</p>
<h3>Using a Taxonomy of 70+ UX Metrics</h3>
<p>Users and clients often ask us how to get started with quantitative UX research, so we wrote a series of four articles, starting with a broad view of UX metrics before focusing on key metrics to get started.</p>
<ul>
<li><strong>A Taxonomy (Visual Overview) of 70+ UX Metrics.</strong> Measuring the user experience starts with UX metrics. But there is no single UX measure—no universal gauge that provides a complete view of the user experience, so we rely on multiple metrics, each offering an incomplete yet complementary perspective. <a href="https://measuringu.com/taxonomy-of-70-ux-metrics/">In this article</a>, we identified over 70 UX <a href="https://measuringu.com/task-based-metrics/">task-based</a> and <a href="https://measuringu.com/guide-to-study-based-ux-metrics/">study-level</a> metrics, encompassing action metrics (what people do) and attitude metrics (what people think and how they feel). As shown in Figure 1, we rated each metric by its popular usage, ease of collection, and availability of reference benchmarks.</li>
</ul>
<p><a href="https://measuringu.com/wp-content/uploads/2025/12/OverviewUXMetrics_1-3-scaled-1.webp" rel="attachment wp-att-46058"><img loading="lazy" decoding="async" class="alignnone wp-image-46058" src="https://measuringu.com/wp-content/uploads/2025/12/OverviewUXMetrics_1-3-scaled-1.webp" alt="An overview of 70+ UX metrics." width="1200" height="658" srcset="https://measuringu.com/wp-content/uploads/2025/12/OverviewUXMetrics_1-3-scaled-1.webp 2560w, https://measuringu.com/wp-content/uploads/2025/12/OverviewUXMetrics_1-3-scaled-1-300x164.webp 300w, https://measuringu.com/wp-content/uploads/2025/12/OverviewUXMetrics_1-3-scaled-1-1024x561.webp 1024w, https://measuringu.com/wp-content/uploads/2025/12/OverviewUXMetrics_1-3-scaled-1-768x421.webp 768w, https://measuringu.com/wp-content/uploads/2025/12/OverviewUXMetrics_1-3-scaled-1-1536x842.webp 1536w, https://measuringu.com/wp-content/uploads/2025/12/OverviewUXMetrics_1-3-scaled-1-2048x1122.webp 2048w, https://measuringu.com/wp-content/uploads/2025/12/OverviewUXMetrics_1-3-scaled-1-600x329.webp 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> An overview of 70+ UX metrics (this is a living document with the most recent version available at <a href="https://measuringu.com/an-overview-of-70-ux-metrics/">https://measuringu.com/an-overview-of-70-ux-metrics/</a>).</p>
<ul>
<li><strong>How to Select a UX Metric.</strong> We described <a href="https://measuringu.com/how-to-select-a-ux-metric/">five steps for picking the right UX metric</a>: Define what you’re trying to measure, find a match in our taxonomy of 70+ UX metrics, use widely used metrics, choose metrics that are easier to collect, and look for metrics with known benchmarks for ease of interpretation.</li>
<li><strong>There are 70+ UX Metrics, Start with These 4.</strong> Selecting the right metrics to use from so many can feel overwhelming. When you’re not sure what to use in task-based studies, <a href="https://measuringu.com/get-comfortable-with-four-ux-metrics/">we recommend starting with four</a>: <a href="https://measuringu.com/determine-task-completion/">Task completion</a>, the Single Ease Question (<a href="https://measuringu.com/evolution-of-seq/">SEQ</a><sup>®</sup>), <a href="https://measuringu.com/article/rent-a-car-in-just-0-60-240-or-1217-seconds-comparative-usability-measurement-cue-8/">task time</a>, and the <a href="https://measuringu.com/article/effect-of-perceived-ease-of-use-and-usefulness-on-ux-and-behavioral-outcomes/">UX-Lite</a>. For each of these metrics, we described what it is, how it’s measured, how it’s interpreted, what are good, average, or bad values, and finally, what to watch out for.</li>
<li><strong>How to Get Comfortable with Quantitative UX Research.</strong> Getting <a href="https://measuringu.com/how-to-get-comfortable-with-quantitative-ux-research/">comfortable with quantitative research</a> takes patience and the right mindset. Five ways to get more comfortable are to <a href="https://measuringu.com/qual-quant-words/">look for keywords</a> in research questions that suggest quant over qual methods, get to know the <a href="https://measuringu.com/get-comfortable-with-four-UX-metrics">four key metrics</a> (task completion, task time, SEQ, UX-Lite), understand <a href="https://measuringu.com/product/practical-statistics-for-ux-and-customer-research-course/">binary versus continuous data types</a>, use rating scales to measure attitudes without obsessing over <a href="https://measuringu.com/changes-to-rating-scale-formats-can-matter-but-usually-not-that-much/">minor format differences</a>, and use <a href="https://measuringu.com/might-not-be-a-magic-number-but-there-are-magic-ranges">magic ranges</a> rather than magic numbers for sample size planning.</li>
</ul>
<h3>SUPR-Qm</h3>
<p>The SUPR-Qm is a standardized questionnaire designed to assess the UX of mobile apps. In 2024, we gathered and analyzed SUPR-Qm data collected from 2019 through 2023 to develop a <a href="https://measuringu.com/article/streamlining-the-supr-qm-the-supr-qm-v2/">streamlined version of the questionnaire</a>. This process was documented across six blog articles in 2025.</p>
<ul>
<li><strong>A Review of Mobile App UX Questionnaires.</strong> Mobile apps are a distinct type of product that warrant their own measurement, but work on standardized mobile app questionnaires has been limited. <a href="https://measuringu.com/standardized-questionnaires-for-the-ux-of-mobile-apps/">In this article</a>, we reviewed how mobile UX has historically been measured and then reviewed three mobile-specific questionnaires: the MPUQ, mod-AUG scales, and SUPR-Qm.</li>
<li><strong>How Stable Is the SUPR-Qm After 8 Years? </strong><em>Very stable.</em> We developed the SUPR-Qm in 2017 to focus on the quality of the mobile app experience. But a lot has changed since 2017, and mobile apps aren’t the same. We <a href="https://measuringu.com/how-stable-is-the-suprqm-after-8-years/">reviewed our latest results</a> from a large-scale data collection effort to assess the stability of the SUPR-Qm and found that it demonstrated a high degree of stability over eight years.</li>
<li><strong>Streamlining the SUPR-Qm from 16 to 5 Items.</strong> The original version of the SUPR-Qm from 2017 has 16 items. Analysis of the location of the items on a Wright map indicated opportunities to reduce the <a href="https://measuringu.com/streamlining-the-suprq-m/">number of items</a> while retaining the questionnaire’s good measurement properties. After exploring different numbers of items, we found that five items gave the best balance between accuracy and brevity (published as the SUPR-Qm V2).</li>
<li><strong>Verifying the Stability of the Five-Item SUPR-Qm V2. </strong>With only five items, it was an open research question whether the SUPR-Qm V2 would be <a href="https://measuringu.com/verifying-the-stability-of-the-suprq-m/">stable over time</a>. To check this, we divided our large sample in half to compare Group A (data collected from Feb 2019 through Aug 2021, 11 industries, 58 apps, <em>n</em> = 2,143) with Group B (data collected from Feb 2022 through May 2023, 12 industries, 97 apps, <em>n</em> = 2,006). As shown in Figure 2, the locations of scores on Rasch logit scales were nearly identical, demonstrating the stability of the SUPR-Qm V2 with varying dates and industries.</li>
</ul>
<p><a href="https://measuringu.com/wp-content/uploads/2025/12/Figure22-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46210" src="https://measuringu.com/wp-content/uploads/2025/12/Figure22-scaled.png" alt="Figure 2: Stability of Rasch scale for SUPR-Qm V2, indicated by the overlap of lines for Groups A and B." width="1200" height="805" srcset="https://measuringu.com/wp-content/uploads/2025/12/Figure22-scaled.png 2560w, https://measuringu.com/wp-content/uploads/2025/12/Figure22-300x201.png 300w, https://measuringu.com/wp-content/uploads/2025/12/Figure22-1024x687.png 1024w, https://measuringu.com/wp-content/uploads/2025/12/Figure22-768x515.png 768w, https://measuringu.com/wp-content/uploads/2025/12/Figure22-1536x1030.png 1536w, https://measuringu.com/wp-content/uploads/2025/12/Figure22-2048x1374.png 2048w, https://measuringu.com/wp-content/uploads/2025/12/Figure22-600x402.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Stability of Rasch scale for SUPR-Qm V2, indicated by the overlap of lines for Groups A and B.</p>
<ul>
<li><strong>How to Score and Interpret the Five-Item SUPR-Qm V2. </strong>We used the probability of the location of a SUPR-Qm V2 score (interpolated to 0–100 points) to develop a <a href="https://measuringu.com/how-to-score-and-interpret-the-five-item-supr-qm-v2/">grading scale for its interpretation</a>. Practitioners can use the new curved grading table to interpret SUPR-Qm scores.</li>
<li><strong>Ten Things to Know About the SUPR-Qm. </strong>We listed the <a href="https://measuringu.com/supr-qm-10-things/">ten things to know about the SUPR-Qm</a>, from the statistical to the practical, including a description of our new <a href="https://measuringu.com/product/supr-qm-calculator-package/">SUPR-Qm calculator</a>.</li>
</ul>
<h3>Perceived Ease and Satisfaction</h3>
<p>It’s very common in UX research to measure perceived ease of use. It’s less common to measure satisfaction. To understand whether measuring satisfaction is good practice for UX researchers, it’s important to know the difference between these two constructs, so we published two articles on this topic.</p>
<ul>
<li><strong>What’s the Difference Between Ease and Satisfaction?</strong> <em>The conceptual difference is that ease is a key experiential driver of the broader construct of satisfaction.</em> Customer satisfaction is a key metric in market research and business. In this article, we discussed the measurement of satisfaction and ease at the study and task levels and <a href="https://measuringu.com/what-is-the-difference-between-ease-and-satisfaction">presented an argument for measuring both</a> in UX research (at least at the study level) to provide a quantitative connection between UX and business metrics.</li>
<li><strong>How Much Does Satisfaction Correlate with Ease?</strong> <em>A lot.</em> To <a href="https://measuringu.com/how-much-does-satisfaction-correlate-with-ease">quantify the relationship between satisfaction and ease</a>, we reviewed our recent (2020–2025) retrospective and task-based studies, comparing measures of satisfaction and ease at both the study level (for multiple products) and the task level. We found ratings of ease and satisfaction were strongly correlated; when put on a common scale, their means differed by only 1%.</li>
</ul>
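<p>To make "put on a common scale" concrete, here is a minimal sketch of one common approach: linearly rescaling ratings from different response formats to 0–100 points. The function name and the example means are hypothetical, and this may not be the exact procedure used in the article.</p>

```python
def rescale_to_100(rating: float, scale_min: float, scale_max: float) -> float:
    """Linearly map a rating from [scale_min, scale_max] onto 0-100 points."""
    if scale_max == scale_min:
        raise ValueError("scale_min and scale_max must differ")
    return (rating - scale_min) / (scale_max - scale_min) * 100.0


# Hypothetical means for illustration only (not results from the studies).
ease_mean_7pt = 5.8           # mean ease rating on a 1-7 scale
satisfaction_mean_5pt = 4.24  # mean satisfaction rating on a 1-5 scale

ease_100 = rescale_to_100(ease_mean_7pt, 1, 7)
satisfaction_100 = rescale_to_100(satisfaction_mean_5pt, 1, 5)

# Once both means are on 0-100 points, they can be compared directly.
difference = ease_100 - satisfaction_100
print(round(ease_100, 1), round(satisfaction_100, 1), round(difference, 1))
```

<p>After rescaling, a difference between a seven-point mean and a five-point mean can be read in the same units.</p>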
<h3>Other UX Metrics</h3>
<p>We published three additional articles on various other metrics (UX-Lite grading scales, effect of editing standardized UX questions, NPS).</p>
<ul>
<li><strong>Grading Scales for the UX-Lite.</strong> Early in its development, we used our well-known curved grading scale for the SUS to interpret UX-Lite scores. After collecting several years of UX-Lite data, we started interpreting UX-Lite scores with percentiles. We now have <a href="https://measuringu.com/grading-scales-for-the-ux-lite/">two grading scales for interpreting UX-Lite scores</a>, one standard and one curved. The standard grading scale is appropriate for a general assessment of UX-Lite means, while the curved grading scale is appropriate when new UX-Lite data match the reference group for determining percentiles.</li>
<li><strong>Is It OK to Edit the Wording of Standardized UX Questions?</strong> <em>Usually, but only to a point.</em> You can edit the wording of standardized UX questions, such as clarifying language or inserting a product name, as long as those changes don’t alter the intensity or scope of the original wording. <a href="https://measuringu.com/is-it-ok-to-edit-the-wording-of-standardized-ux-questions/">Across multiple studies</a> using the SUS, SEQ, and UX-Lite, we found that minor wording adjustments rarely affected response patterns, whereas more extreme modifications did. When possible, testing any changes before adopting them is best, but small, thoughtful edits are unlikely to compromise the reliability or validity of your measurements.</li>
<li><strong>A Report Card for the Net Promoter Score.</strong> Should you use the NPS? Maybe, maybe not. In this article, we don’t debate whether you should use it (and you may not have a choice). Instead, we used data (rather than opinions) to <a href="https://measuringu.com/report-card-for-the-nps/">review and grade 13 claims made about the NPS</a>, some from its critics and others from its proponents. At the end, we gave a report card (Figure 3) on how well these claims stand up against the evidence.</li>
</ul>
<p><a href="https://measuringu.com/wp-content/uploads/2025/12/Figure33.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46211" src="https://measuringu.com/wp-content/uploads/2025/12/Figure33.png" alt="Figure 3: A report card for 13 claims made about the NPS." width="1200" height="1251" srcset="https://measuringu.com/wp-content/uploads/2025/12/Figure33.png 1303w, https://measuringu.com/wp-content/uploads/2025/12/Figure33-288x300.png 288w, https://measuringu.com/wp-content/uploads/2025/12/Figure33-983x1024.png 983w, https://measuringu.com/wp-content/uploads/2025/12/Figure33-768x800.png 768w, https://measuringu.com/wp-content/uploads/2025/12/Figure33-600x625.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 3:</strong> A report card for 13 claims made about the NPS.</p>
<h2>UX Methods</h2>
<p>This year, we published five articles on various aspects of UX methods (scoring, recruitment, prototype testing, research moderation, and UX deliverables).</p>
<ul>
<li><strong>Three Ways to Score the UX of Products.</strong> We discussed <a href="https://measuringu.com/three-ways-to-score-the-ux-of-products/">three major types of UX studies</a>: behavioral (task-based), attitudinal (retrospective), and analytical/inspection (PURE). The three methods have various pros and cons with regard to cost and setup effort.</li>
<li><strong>How Much Should You Over-Recruit?</strong> <em>In our experience, by about 20%.</em> To estimate <a href="https://measuringu.com/how-much-should-you-over-recruit/">how much to over-recruit</a> for UX studies, we defined the over-recruitment rate and found that almost all studies needed some amount of over-recruitment (on average, about 20%). We also noted that no-shows aren’t the only factor affecting recruitment; they accounted for only half of unusable sessions.</li>
<li><strong>What Happens When You Test a Mobile Prototype on Desktop?</strong> <em>We found that metrics collected on mobile were slightly better.</em> To compare UX data collected for a <a href="https://measuringu.com/ratings-of-a-mobile-app-prototype-on-desktop-and-mobile-devices/">prototype mobile banking app</a> displayed on a desktop computer versus on a mobile device, we had 100 participants (50 per condition) attempt three tasks: Check Balance, Card Transaction, and Credit Score. The results were generally better for the mobile condition (in some cases, significantly so).</li>
<li><strong>What Makes a Good UX Research Moderator?</strong> <em>To paraphrase </em><a href="https://www.esquire.com/entertainment/movies/a31775/taken-speech/"><em>Liam Neeson</em></a><em>, it’s having “a particular set of skills.”</em> The <a href="https://measuringu.com/what-makes-a-good-ux-research-moderator/">skills of a good UX research moderator</a> include, but are not limited to, establishing rapport, probing appropriately, managing time well, using interpersonal skills effectively, detecting misrepresentation, using silence strategically, avoiding leading questions, knowing when to move on, and knowing when to assist.</li>
<li><strong>What Are UX Deliverables?</strong> <em>The recorded memory of UX research.</em> Unlike UX design deliverables, which are primarily visual and tangible (what the website will look like, what using a prototype feels like), <a href="https://measuringu.com/what-are-ux-deliverables">UX research deliverables</a> must communicate the insights that drive the designs (e.g., what problems people had, how different groups in a sample had different experiences). One way to classify UX deliverables is by stages of research, including interim (e.g., test plans, screeners), final (e.g., presentations, reports, dashboards), and artifacts (e.g., raw data, participant videos).</li>
</ul>
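<p>As a rough illustration of the over-recruitment guidance above, the sketch below turns a target sample size into a number of sessions to schedule. It assumes the over-recruitment rate is expressed as extra sessions relative to the target (0.20 means schedule 20% more); the function name is hypothetical, and the article's formal definition of the rate may differ.</p>

```python
import math


def sessions_to_schedule(target_n: int, over_recruit_rate: float = 0.20) -> int:
    """Sessions to schedule so that roughly target_n usable sessions remain.

    Assumption: over_recruit_rate is the fraction of extra sessions
    relative to the target, so 0.20 means 20% more than target_n.
    """
    if not target_n > 0:
        raise ValueError("target_n must be positive")
    # Round first to avoid floating-point noise pushing ceil up by one.
    return math.ceil(round(target_n * (1 + over_recruit_rate), 9))


# Hypothetical planning targets.
for target in (5, 12, 20):
    print(target, sessions_to_schedule(target))
```

<p>Because the 20% figure is an average, the rate for a particular study may need to be higher or lower depending on the panel and the screening criteria.</p>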
<h2>Statistical Topics</h2>
<p>Our nine articles on statistical topics included four on weighting data, two on interpreting percentages, and three on sample sizes.</p>
<h3>Weighting Data</h3>
<p>The four articles in this section cover the decision to weight data, how to weight means or percentages on a single variable, and how to use rake weighting when you need to weight on multiple variables.</p>
<ul>
<li><strong>To Weight or Not to Weight.</strong> The purpose of weighting is to match the characteristics of a sample to a reference population. In UX research, <a href="https://measuringu.com/to-weight-or-not-to-weight/">we generally recommend against weighting</a> unless there is an appropriate reference population (which is usually not the U.S. population). To avoid the risks associated with weighting, researchers should default to analyzing unweighted data. Even if the plan is to present weighted data, good practice is to compare the weighted and unweighted results to see which conclusions, if any, have been significantly affected by weighting.</li>
<li><strong>How to Weight Means.</strong> We covered how to <a href="https://measuringu.com/how-to-weight-means/">weight means in the basic case</a> of matching against a single reference variable at either group or case levels.</li>
<li><strong>How to Weight Percentages.</strong> We showed how methods for weighting means can apply to the <a href="https://measuringu.com/how-to-weight-percentages/">weighting of proportions and percentages</a> when those measures can be conceptualized as means.</li>
<li><strong>Rake Weighting: How to Weight Survey Data with Multiple Variables.</strong> When there is a need to weight data against multiple user variables, a popular method is <a href="https://measuringu.com/rake-weighting-how-to-weight-survey-data-with-multiple-variables/">rake weighting</a>. For example, you may want your survey sample to match not solely the age of your customers, but the age, gender, and geographic region simultaneously. The article presented a fully worked-out example with an R script using the anesrake package.</li>
</ul>
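<p>The following sketch illustrates the ideas in this section with made-up data. It is written in Python rather than the R/anesrake script from the rake weighting article, and every name, respondent, and population share in it is hypothetical. It shows group-level weights (population share divided by sample share), a weighted mean, a weighted percentage treated as the mean of 0/1 values, and a bare-bones version of raking (iterative proportional fitting).</p>

```python
from collections import Counter


def group_weights(sample_groups, population_shares):
    """Weight per group = population share / sample share."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {g: population_shares[g] / (counts[g] / n) for g in counts}


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def rake(cases, targets, iterations=50):
    """Minimal iterative proportional fitting over several variables.

    cases: list of dicts mapping variable name to category.
    targets: dict mapping variable name to {category: population share}.
    """
    weights = [1.0] * len(cases)
    for _ in range(iterations):
        for var, shares in targets.items():
            total = sum(weights)
            current = Counter()
            for w, case in zip(weights, cases):
                current[case[var]] += w
            # Scale each case so this variable's weighted shares hit the target.
            weights = [w * shares[case[var]] * total / current[case[var]]
                       for w, case in zip(weights, cases)]
    return weights


# Hypothetical sample: six respondents with an age group and a 1-5 rating.
ages = ["younger", "younger", "younger", "younger", "older", "older"]
genders = ["F", "F", "M", "M", "F", "M"]
ratings = [5, 4, 4, 3, 2, 3]

# Single-variable weighting against an assumed 50/50 age split.
by_group = group_weights(ages, {"younger": 0.5, "older": 0.5})
case_weights = [by_group[a] for a in ages]
unweighted_mean = sum(ratings) / len(ratings)
weighted_rating = weighted_mean(ratings, case_weights)

# A top-two-box percentage is the mean of 0/1 indicators, so it weights the same way.
top_box = [int(r in (4, 5)) for r in ratings]
weighted_top_box = weighted_mean(top_box, case_weights)

# Raking on two variables at once (assumed targets).
cases = [{"age": a, "gender": g} for a, g in zip(ages, genders)]
targets = {"age": {"younger": 0.5, "older": 0.5}, "gender": {"F": 0.6, "M": 0.4}}
raked = rake(cases, targets)
print(unweighted_mean, weighted_rating, weighted_top_box)
```

<p>Consistent with the guidance in "To Weight or Not to Weight," the sketch computes the unweighted mean alongside the weighted one so the two can be compared.</p>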
<h3>Types of Percentages and 100-Point Scales</h3>
<p>The two articles in this section explain how three types of percentages differ and how to tell apart the various types of 100-point scales.</p>
<ul>
<li><strong>Three Types of Percentages.</strong> Interpreting percentages can be tricky because the same term is associated with <a href="https://measuringu.com/three-types-of-percentages/">three distinct concepts</a>: absolute percentages, relative percentages, and net percentages. We discussed how they are the same and, more importantly, how they are different.</li>
<li><strong>Understanding Different Types of 100-Point Scales.</strong> Absolute percentages aren’t the only scales that range from 0 to 100. Other <a href="https://measuringu.com/types-of-100-point-scales/">types of 100-point scales</a> are percentiles (e.g., SUPR-Q scores) and 100-point scores (e.g., SUS, UX-Lite), both of which are distinct from absolute percentages and from each other.</li>
</ul>
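<p>A short sketch may help separate the three types of percentages. The function names and data are hypothetical. The net percentage example uses the standard Net Promoter Score calculation (percentage of promoters rating 9–10 minus percentage of detractors rating 0–6).</p>

```python
def absolute_pct(count, total):
    """Share of a total, e.g., 4 of 10 respondents is 40%."""
    return 100.0 * count / total


def relative_pct_change(old_pct, new_pct):
    """Change expressed relative to the starting percentage."""
    return 100.0 * (new_pct - old_pct) / old_pct


def net_pct(favorable_pct, unfavorable_pct):
    """Difference between two absolute percentages (can be negative)."""
    return favorable_pct - unfavorable_pct


# Hypothetical likelihood-to-recommend ratings on a 0-10 scale.
ltr = [10, 9, 9, 8, 7, 7, 6, 5, 10, 3]
promoters = sum(1 for r in ltr if r in (9, 10))
detractors = sum(1 for r in ltr if r in range(0, 7))

promoter_pct = absolute_pct(promoters, len(ltr))
detractor_pct = absolute_pct(detractors, len(ltr))
nps = net_pct(promoter_pct, detractor_pct)

# Hypothetical completion rates: 40% before a redesign, 50% after.
point_difference = 50.0 - 40.0                 # 10 percentage points
relative_gain = relative_pct_change(40.0, 50.0)  # 25% relative increase
print(promoter_pct, detractor_pct, nps, point_difference, relative_gain)
```

<p>The same change from 40% to 50% is a 10-point absolute difference but a 25% relative increase, which is why it matters to say which type of percentage is being reported.</p>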
<h3>Sample Sizes</h3>
<p>The three articles in this section address different aspects of sample sizes, including the inverse square relationship, schools of thought, and magic ranges for sample size planning.</p>
<ul>
<li><strong>Using the Inverse Square Relationship for Sample Sizes.</strong> Although the mathematics behind sample size calculations can look complicated (and they sometimes are), the concept is something you can see and experience. Like many natural phenomena, sample size and precision have an <a href="https://measuringu.com/inverse-square-relationship/">inverse square relationship</a>. For estimation or comparison, to double the precision of measurement, you need to quadruple the sample size (Figure 4).</li>
</ul>
<p><a href="https://measuringu.com/wp-content/uploads/2025/12/Figure44-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46214" src="https://measuringu.com/wp-content/uploads/2025/12/Figure44-scaled.png" alt="Figure 4: Quadrupling the sample size cuts measurement error in half—in other words, doubling the precision." width="1200" height="799" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 4:</strong> Quadrupling the sample size cuts measurement error in half—in other words, doubling the precision.</p>
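<p>The inverse square relationship can be checked numerically. This minimal sketch assumes a simple margin of error for a mean, z × s / √n, with a fixed z of 1.96 and a made-up standard deviation. With t critical values instead of z, the halving is approximate rather than exact.</p>

```python
import math


def margin_of_error(sd, n, z=1.96):
    """Approximate margin of error for a mean: z * sd / sqrt(n)."""
    return z * sd / math.sqrt(n)


def sample_size_for(margin, sd, z=1.96):
    """Smallest n whose margin of error is at most the requested margin."""
    return math.ceil(round((z * sd / margin) ** 2, 9))


sd = 1.2  # hypothetical standard deviation of a rating scale

# Each row quadruples n, and the margin of error is cut in half.
for n in (25, 100, 400):
    print(n, round(margin_of_error(sd, n), 3))

# Halving the target margin quadruples the required sample size.
n_wide = sample_size_for(0.4, sd)
n_narrow = sample_size_for(0.2, sd)
print(n_wide, n_narrow)
```

<p>Because required sample sizes are rounded up to whole participants, the second ratio may be slightly under or over four for other inputs.</p>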
<ul>
<li><strong>Schools of Thought on Sample Sizes in UX Research.</strong> We’ve noticed that there seem to be two <a href="https://measuringu.com/schools-of-thought-on-ux-sample-sizes/">schools of thought</a> when discussing sample sizes in UX. Both are too extreme, but both have a kernel (or more) of truth. One school preaches that small sample sizes are always enough; the other, that you always need large sample sizes. As is often the case with vocal opinions, the truth lies somewhere in between. To find a pragmatic position between them, it’s important to understand the philosophies of each school.</li>
<li><strong>Might Not Be a Magic Number but There Are Magic Ranges.</strong> It’s tempting to rely on “magic numbers” when planning sample sizes for UX research (e.g., five for formative usability tests, 30 for comparison studies). The problem with single magic numbers is that they’re more likely to be wrong than right. There are, however, Goldilocks Zones (“<a href="https://measuringu.com/might-not-be-a-magic-number-but-there-are-magic-ranges/">magic ranges</a>”) that balance the various statistical and logistical forces that drive appropriate sample sizes for different goals (discovery, estimation, and comparison). In this article, we provided guidance for identifying upper and lower bounds for these magic ranges.</li>
</ul>
<h2>Data Visualization</h2>
<p>Our article on data visualization explains why and how to use scatterplot jitter.</p>
<ul>
<li><strong>Scatterplot Jitter—Why and How?</strong> <em>Use it to reveal more points in scatterplots when there are numerous ties.</em> Scatterplots are powerful data visualization tools, but they do not work well when the values being plotted can easily tie (e.g., rating scale data). In this article, we described <a href="https://measuringu.com/scatterplot-jitter/">jittering</a>, a popular strategy for breaking ties to reveal otherwise hidden relationships (Figure 5).</li>
</ul>
<table>
<tbody>
<tr>
<td width="312"><a href="https://measuringu.com/wp-content/uploads/2025/12/Figure-5a.png" rel="attachment wp-att-46063"><img loading="lazy" decoding="async" class="alignnone wp-image-46063 size-full" src="https://measuringu.com/wp-content/uploads/2025/12/Figure-5a.png" alt="Standard scatterplot for ratings of likelihood to recommend and likelihood to discourage." width="647" height="369" srcset="https://measuringu.com/wp-content/uploads/2025/12/Figure-5a.png 647w, https://measuringu.com/wp-content/uploads/2025/12/Figure-5a-300x171.png 300w, https://measuringu.com/wp-content/uploads/2025/12/Figure-5a-600x342.png 600w" sizes="(max-width: 647px) 100vw, 647px" /></a></td>
<td width="312"><a href="https://measuringu.com/wp-content/uploads/2025/12/Figure-5b.png" rel="attachment wp-att-46065"><img loading="lazy" decoding="async" class="alignnone wp-image-46065 size-full" src="https://measuringu.com/wp-content/uploads/2025/12/Figure-5b.png" alt="Jittered scatterplot for ratings of likelihood to recommend and likelihood to discourage." width="650" height="369" srcset="https://measuringu.com/wp-content/uploads/2025/12/Figure-5b.png 650w, https://measuringu.com/wp-content/uploads/2025/12/Figure-5b-300x170.png 300w, https://measuringu.com/wp-content/uploads/2025/12/Figure-5b-600x341.png 600w" sizes="(max-width: 650px) 100vw, 650px" /></a></td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 5:</strong> Standard (left) and jittered (right) scatterplots for ratings of likelihood to recommend and likelihood to discourage.</p>
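<p>One simple way to jitter is to add a small amount of uniform random noise to each coordinate before plotting. The sketch below uses hypothetical ratings and a hypothetical jitter amount of 0.2 scale points. It is not necessarily the method used to produce Figure 5.</p>

```python
import random


def jitter(values, amount=0.2, seed=None):
    """Add uniform noise in [-amount, +amount] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical 0-10 ratings with many ties.
recommend = [10, 10, 9, 9, 9, 8, 8, 2, 2, 0]
discourage = [0, 0, 1, 1, 1, 2, 2, 8, 8, 10]

x = jitter(recommend, seed=1)
y = jitter(discourage, seed=2)

# Tied points collapse in the raw data but separate once jittered.
raw_points = set(zip(recommend, discourage))
jittered_points = set(zip(x, y))
print(len(raw_points), len(jittered_points))
```

<p>The jittered coordinates are for display only. Any statistics, such as a correlation, should still be computed on the original ratings.</p>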
<h2>UX Industry Reports</h2>
<p>We conducted mixed-methods benchmark studies using the SUPR-Q and NPS for three online consumer services and ran our biennial consumer and business software surveys. Thanks to all of you who have purchased our reports. The proceeds from these sales fund the original research we post on MeasuringU. We also published seven articles about the UX profession based on the 2024 UXPA salary survey.</p>
<h3>SUPR-Q Benchmark Studies</h3>
<p>In 2025, we published the results of three UX and NPS benchmark studies; SUPR-Q scores are included in a <a href="https://measuringu.com/product/suprq/">SUPR-Q license</a>.</p>
<ul>
<li><strong>Brokerage.</strong> Our survey (<em>n</em> = 508) of <a href="https://measuringu.com/brokerage-benchmark-2025/">11 brokerage websites</a> (Ally Invest, Charles Schwab, Coinbase, Edward Jones, E-Trade, Fidelity, J.P. Morgan, Merrill Lynch, Robinhood, TD Ameritrade, Vanguard) found substantial variation in the UX of the sites. Users were primarily concerned with website responsiveness, difficult navigation, and issues accessing accounts [<a href="https://measuringu.com/product/ux-nps-benchmark-report-for-brokerage-websites-2025/">full report</a>].</li>
<li><strong>Pets.</strong> A total of 240 users of <a href="https://measuringu.com/pet-benchmark-2025/">pet websites</a> in the U.S. rated their experience with one of four websites (Chewy, Petco, PetMeds, PetSmart). The UX of pet websites was high, ranging from the 82<sup>nd</sup> percentile for Petco to the 94<sup>th</sup> percentile for Chewy. The top UX problems reported by users were items being out of stock and cluttered/overwhelming options [<a href="https://measuringu.com/product/ux-nps-benchmark-report-for-pet-websites-2025/">full report</a>].</li>
<li><strong>International Banks.</strong> For our survey of nine <a href="https://measuringu.com/international-banking-benchmark-2025/">international banking</a> websites (ANZ Banking, BNP Paribas, Commonwealth Bank, HSBC, ING Bank, Itaú Unibanco, National Australia Bank, Royal Bank of Canada, Westpac; <em>n</em> = 462), the UX differed widely by site, with Itaú Unibanco having the highest SUPR-Q score (96<sup>th</sup> percentile), followed by Westpac (73<sup>rd</sup> percentile), while BNP Paribas had the lowest SUPR-Q score (26<sup>th</sup> percentile). Our key driver analysis revealed an industry-wide opportunity to improve the ease of comparing rates and fees (Figure 6) [<a href="https://measuringu.com/product/ux-nps-benchmark-report-for-international-banking-websites-2025/">full report</a>].</li>
</ul>
<p><a href="https://measuringu.com/wp-content/uploads/2025/12/Figure5-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46215" src="https://measuringu.com/wp-content/uploads/2025/12/Figure5-scaled.png" alt="Figure 6: Scatterplot of importance and opportunity for improvement of key drivers of the international bank website experience." width="1200" height="456" srcset="https://measuringu.com/wp-content/uploads/2025/12/Figure5-scaled.png 2560w, https://measuringu.com/wp-content/uploads/2025/12/Figure5-300x114.png 300w, https://measuringu.com/wp-content/uploads/2025/12/Figure5-1024x389.png 1024w, https://measuringu.com/wp-content/uploads/2025/12/Figure5-768x292.png 768w, https://measuringu.com/wp-content/uploads/2025/12/Figure5-1536x584.png 1536w, https://measuringu.com/wp-content/uploads/2025/12/Figure5-2048x779.png 2048w, https://measuringu.com/wp-content/uploads/2025/12/Figure5-600x228.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 6:</strong> Scatterplot of importance and opportunity for improvement of key drivers of the international bank website experience.</p>
<h3>SUS Benchmark Studies</h3>
<p>Our 2025 SUS benchmark studies included reviews of business software and consumer software, plus separate reports on meeting software products used by both businesses and consumers, and a report on the UX of AI-based chat software.</p>
<ul>
<li><strong>Business Software.</strong> Our 2025 report on <a href="https://measuringu.com/business-software-ux-2025/">business software benchmarks</a> covered 23 products (<em>n</em> = 980) with a mix of productivity and communications software. Across the 23 products, the average NPS was −5%, ranging from −38% to 24%. This average is reasonably consistent with the previous averages of the business NPS from 2020 (−3%) and 2022 (−12%). For this report, we investigated the impact of perceived ease and usefulness from the UX-Lite on overall experience, brand attitude, and intentions to reuse and recommend (Figure 7) [<a href="https://measuringu.com/product/net-promoter-ux-benchmark-report-for-business-software-2025">full report</a>].</li>
</ul>
<p><a href="https://measuringu.com/wp-content/uploads/2025/12/Figure6.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46216" src="https://measuringu.com/wp-content/uploads/2025/12/Figure6-300x132.png" alt="Figure 7: Structural equation model demonstrating the impact of perceived ease and usefulness on intentions to reuse and recommend (all beta weights significant, p &lt; .001; excellent fit statistics, 𝜒2(3) = 0.9, p = .82; CFI = 1.0; RMSEA = .00; BIC = 124.9). " width="1200" height="530" srcset="https://measuringu.com/wp-content/uploads/2025/12/Figure6-300x132.png 300w, https://measuringu.com/wp-content/uploads/2025/12/Figure6-1024x452.png 1024w, https://measuringu.com/wp-content/uploads/2025/12/Figure6-768x339.png 768w, https://measuringu.com/wp-content/uploads/2025/12/Figure6-600x265.png 600w, https://measuringu.com/wp-content/uploads/2025/12/Figure6.png 1293w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p><strong>Figure 7: </strong>Structural equation model demonstrating the impact of perceived ease and usefulness on intentions to reuse and recommend (all beta weights significant, <em>p</em> &lt; .001; excellent fit statistics, 𝜒<sup>2</sup>(3) = 0.9, <em>p</em> = .82; CFI = 1.0; RMSEA = .00; BIC = 124.9).</p>
<ul>
<li><strong>Consumer Software.</strong> To understand the current consumer software landscape, in January 2025, we conducted a large-scale retrospective benchmark with 1,896 U.S. respondents on <a href="https://measuringu.com/consumer-software-ux-2025/">40 popular consumer software products</a>, similar to what we’ve done for the last 10+ years. Across the 40 products, the average NPS was 24%, ranging from −12% to 56%. The pattern of relationships and fit statistics for a structural equation model of the impact of perceived ease and usefulness on intentions to reuse and recommend were similar to the business software model above [<a href="https://measuringu.com/product/net-promoter-ux-benchmark-report-for-consumer-software-2025">full report</a>].</li>
<li><strong>Meeting Software.</strong> Following up on our 2023 article on meeting software, we reported on <a href="https://measuringu.com/meeting-software-ux-2025/">five meeting software</a> products (Google Meet, GoToMeeting, Microsoft Teams, Webex, Zoom; <em>n</em> = 233). Google Meet had the top UX scores. Reported issues included performance, usability, and dated UIs [<a href="https://measuringu.com/product/meeting-software-2025/">full report</a>].</li>
<li><strong>AI-Based Chat Software.</strong> In January and February 2025, we conducted a retrospective study on <a href="https://measuringu.com/ai-based-chat-software-ux-2025/">three AI-based chat products</a> (ChatGPT, Claude, Gemini) with 153 U.S.-based panel participants. This study included metrics from our standard UX &amp; NPS survey as part of our larger consumer software data collection effort. All products had high and similar Net Promoter Scores. Reported issues included accuracy, generic content, and limited free versions [<a href="https://measuringu.com/product/net-promoter-ux-benchmark-report-for-consumer-software-2025">full report</a>].</li>
</ul>
<h3>UXPA Salary Survey Analyses</h3>
<p>Every few years, we assist our friends at the <a href="https://uxpa.org/">UXPA</a> in helping the UX community understand the latest compensation, skills, and composition of the UX profession. The most recent effort was their <a href="https://uxpa.org/salary-surveys/">2024 survey</a> (<em>n</em> = 444). Because data collection ended in late 2024, we published seven UXPA survey articles in 2025.</p>
<ul>
<li><strong>How Does the UX Job Market Look for 2025?</strong> <em>Not great at the beginning, but hope for a rebound.</em> Our <a href="https://measuringu.com/how-does-the-ux-job-market-look-for-2025/">analysis of the UX job market</a> using historical data from the UXPA salary survey from 2009 through 2024 revealed significant UX job contraction, reduced tech jobs, and concerns about the future business climate. On the other hand, of those in a position to hire, most respondents (70%) reported plans to hire at least one UX position in 2025.</li>
<li><strong>Does an Advanced Degree Pay Off?</strong> <em>Not much in dollars if you stop working to pursue a degree.</em> We analyzed data from the 2024 UXPA salary survey to see whether there has been any change in <a href="https://measuringu.com/does-an-advanced-degree-pay-off/">the financial value of a PhD</a>, and we expanded our previous analysis to include master’s degrees. Taking opportunity cost into account (lost wages if stopping work for the degree), after 35 years of employment, those with master’s degrees were estimated to make 5% more than those with undergraduate degrees, while PhDs were expected to make 7% more (and about 1% more than those with master’s degrees).</li>
<li><strong>How Much Is AI Used in UX?</strong> <em>More than you might think.</em> Answers from the 444 respondents to three questions in the 2024 UXPA salary survey about <a href="https://measuringu.com/how-much-is-ai-used-in-ux/">past and expected future use of AI</a> in UX design and research revealed a mix of optimistic and pessimistic views. The usage of AI in 2023 had a healthy start, with about half of respondents having tried AI (20% were not impressed). More companies supported using AI than discouraged it (by about 6 to 1). Most respondents expected to use AI more in 2025, but expectations over the next five years were mixed.</li>
<li><strong>The Methods UX Professionals Use.</strong> We looked for trends in 33 <a href="https://measuringu.com/ux-methods-2024/">methods that UX professionals reported using</a> in UXPA salary surveys from 2014 through 2024. Relative to 2022, the results of the 2024 UXPA salary survey revealed declines in the reported percentages of use for most of the tracked UX methods, with 11 dropping by more than ten percentage points. Despite the drops in the reported frequencies, the rank order of methods changed only slightly (<em>r</em> = .94), demonstrating the consistency of the relative importance of the UX methods tracked in the UXPA salary surveys.</li>
<li><strong>UX Professionals’ Job Satisfaction.</strong> We found that UX professionals report <a href="https://measuringu.com/ux-ux-jobsatisfaction-2024/">generally high job satisfaction</a> (70 out of 100). However, job satisfaction declined slightly in 2024 compared to the last decade (and since 2022). The drop is partially explained by a small but significant increase in unemployment and by widespread fears of layoffs and AI replacing UX roles. Compared to other industries, UX satisfaction remains relatively strong. Salary explains little of the variance in job satisfaction. Job security concerns and industry shifts appear to be the strongest forces shaping UX professionals’ outlook today.</li>
<li><strong>What UX Hiring Managers Want and What UX Practitioners Report Doing.</strong> In the 2024 UXPA salary survey, 71 respondents planned to hire in 2025. We compared <a href="https://measuringu.com/what-hiring-managers-want-and-what-ux-practitioners-do/">the practitioner skills they reported wanting</a> in new hires with the skills that practitioners in the survey reported using. As shown in Figure 8, there was a high correlation (<em>r</em> = .93) between the skills UX hiring managers want and what UX practitioners do.</li>
</ul>
<p><a href="https://measuringu.com/wp-content/uploads/2025/12/Figure7.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46218" src="https://measuringu.com/wp-content/uploads/2025/12/Figure7-300x225.png" alt="Figure 8: Scatterplot of skills UX managers wanted to hire in 2025 and the skills UX practitioners reported using in 2024." width="1200" height="899" srcset="https://measuringu.com/wp-content/uploads/2025/12/Figure7-300x225.png 300w, https://measuringu.com/wp-content/uploads/2025/12/Figure7-1024x767.png 1024w, https://measuringu.com/wp-content/uploads/2025/12/Figure7-768x575.png 768w, https://measuringu.com/wp-content/uploads/2025/12/Figure7-600x449.png 600w, https://measuringu.com/wp-content/uploads/2025/12/Figure7.png 1290w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p><strong>Figure 8: </strong>Scatterplot of skills UX managers wanted to hire in 2025 and the skills UX practitioners reported using in 2024.</p>
<ul>
<li><strong>UX Professionals’ Satisfaction with Pay Transparency.</strong> <a href="https://measuringu.com/ux-pay-transparency-2024/">Pay transparency</a> has been a hot topic, so UXPA included three related questions in its 2024 salary survey. About half of UX professionals reported having at least some level of pay transparency. Satisfaction with pay transparency and fairness was low among UX practitioners. Pay transparency was associated with higher job satisfaction scores.</li>
</ul>
<h2>Coming Up in 2026</h2>
<p>For 2026, stay tuned for a year’s worth of new articles, industry reports, webinars, MUiQ enhancements, and our annual boot camp.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>

