<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MeasuringU</title>
	<atom:link href="https://measuringu.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://measuringu.com</link>
	<description>UX Research and Software</description>
	<lastBuildDate>Wed, 08 Apr 2026 17:53:14 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://measuringu.com/wp-content/uploads/2020/11/site-icon.png</url>
	<title>MeasuringU</title>
	<link>https://measuringu.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Credible vs. Confidence Intervals: Different Meanings but Similar Decisions</title>
		<link>https://measuringu.com/credible-vs-confidence-intervals/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=credible-vs-confidence-intervals</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD&nbsp;•&nbsp;Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 08 Apr 2026 06:49:35 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[Bayesian theorem]]></category>
		<category><![CDATA[confidence interval]]></category>
		<category><![CDATA[Confidence Intervals]]></category>
		<category><![CDATA[credible interval]]></category>
		<category><![CDATA[credible intervals]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47234</guid>

					<description><![CDATA[We’ve written a lot about confidence intervals for the last two decades. We especially encourage them for small sample studies. Some of you even bought into our recommendation and use them yourselves (a decision we continue to support). But maybe you’ve heard about Bayesian credible intervals and wonder if you should be using them instead. [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1.png"><img fetchpriority="high" decoding="async" class="alignleft wp-image-47288 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-300x169.png" alt="Feature image shows two researchers, each examining a measuring tool, with a specific interval highlighted." width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a>We’ve <a href="https://measuringu.com/article/estimating-completion-rates-from-small-samples-using-binomial-confidence-intervals-comparisons-and-recommendations/">written a lot about confidence intervals</a> for the last two decades.</p>
<p>We especially encourage them for small sample studies.</p>
<p>Some of you even bought into our recommendation and use them yourselves (a decision we continue to support).</p>
<p>But maybe you’ve heard about <a href="https://en.wikipedia.org/wiki/Credible_interval">Bayesian credible intervals</a> and wonder if you should be using them instead.</p>
<p>In this article, we return to an <a href="https://measuringu.com/bayes-law-in-ux-research-the-power-and-perils-of-priors">example used in our previous articles</a> on Bayesian methods applied to UX research and compare analyses of that example with confidence and credible intervals.</p>
<h2><span lang="EN-US">Confidence Interval Analysis</span></h2>
<p>In our recurring example, 18 of 20 participants successfully completed a checkout task (a 90% completion rate). But if we were to test hundreds, thousands, or (somehow) all potential users, would the completion rate be exactly 90%? Almost surely not. But instead of trying to nail down an exact single number, a likely range is usually sufficient for decision making and surprisingly easy to compute and accurate even for small sample sizes.</p>
<p>For this type of data (binary), the likely range can be computed using an adjusted-Wald confidence interval with 95% confidence. That interval is 68.7% to 98.4%.</p>
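<p>For readers who want to check the arithmetic, here is a minimal Python sketch of the adjusted-Wald computation. It assumes z = 1.96 for 95% confidence; the function name is illustrative, and this is not the code behind the online calculator.</p>

```python
import math

def adjusted_wald(successes, n, z=1.96):
    """Adjusted-Wald (Agresti-Coull) interval for a binomial proportion."""
    # Add z^2/2 successes and z^2/2 failures (about 2 of each at 95% confidence).
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp the endpoints to the valid 0-1 range for proportions.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

low, high = adjusted_wald(18, 20)
print(f"{low:.1%} to {high:.1%}")  # 68.7% to 98.4%
```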
<p>We’ve made it easy to compute binomial confidence intervals with <a href="https://measuringu.com/calculators/wald/">our online calculator</a>. But how do you interpret or explain what it means? How about:</p>
<ul>
<li>There’s a 95% probability the true completion rate is between 68.7% and 98.4%.</li>
<li>There’s a 95% chance the true completion rate falls within 68.7% and 98.4%.</li>
<li>95% of future tests will have completion rates between 68.7% and 98.4%.</li>
</ul>
<p>Strictly speaking, all three of those statements are wrong. A stats professor or Bayesian enthusiast will be happy to point out that error.</p>
<p>The more technically correct way to describe the interval is:</p>
<ul>
<li>If we ran many tests, each with 20 users from the same population and computed confidence intervals each time, on average, 95 out of 100 confidence intervals will contain the unknown population completion rate.</li>
</ul>
<p>Strictly speaking, we are 95% confident <em>in the method </em>of generating confidence intervals and not in any given interval. The confidence interval we generated from the sample data either does or does not contain the population completion rate.</p>
<p>We don’t know if our sample of 20 is one of those five whose confidence interval doesn&#8217;t contain the completion rate. So, it’s best to avoid using “probability” or “chance” when describing a confidence interval and remember that we’re 95% confident in the process of generating confidence intervals rather than a given interval.</p>
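<p>The "many tests" interpretation can be illustrated with a small simulation. The sketch below assumes a true completion rate of 90% (in a real study this value is unknown), draws repeated samples of 20, and counts how often the adjusted-Wald interval contains that rate. The seed, number of simulated studies, and function names are arbitrary choices for illustration.</p>

```python
import math
import random

def adjusted_wald(successes, n, z=1.96):
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - margin, p_adj + margin

def coverage(true_rate, n, studies=10_000, seed=1):
    """Share of simulated studies whose interval contains the true rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        # One simulated study: n participants, each succeeding with true_rate.
        successes = sum(rng.random() < true_rate for _ in range(n))
        low, high = adjusted_wald(successes, n)
        if low <= true_rate <= high:
            hits += 1
    return hits / studies

share = coverage(true_rate=0.90, n=20)  # 0.90 is an assumed population rate
print(f"{share:.1%} of intervals contained the true rate")
```

<p>The share should land near, but not exactly at, 95%. With only 20 binary outcomes per study, there are just 21 possible intervals, so actual coverage depends on the assumed rate.</p>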
<p>So, we have just one study, and we computed only one interval. What does that mean? What are we “allowed” to say other than that cumbersome statement? We have a couple of recommendations suitable for practical decision making:</p>
<ul>
<li><strong>Likely range</strong>: “68.7% to 98.4% is the most likely range for the unknown completion rate from all users.”</li>
<li><strong>Plausible range</strong> (from <a href="https://www.amazon.com/Confidence-Intervals-Quantitative-Applications-Sciences/dp/076192499X">Smithson, 2002</a>): “Given this data, values inside the confidence interval are plausible while those outside are implausible. The observed completion rate of 90% is plausible but rates lower than 68.7% or higher than 98.4% are implausible.”</li>
</ul>
<p>This is where the precision of numbers meets the imprecision of language. Although confidence, probability, likely, and plausible all sound about the same, they have more precise usage when it comes to statistics and probability.</p>
<p>This rigidity in language makes confidence intervals less than ideal for communicating results to stakeholders, who are unlikely to have a sophisticated understanding of them (although <a href="https://link.springer.com/article/10.3758/s13423-013-0572-3">even professors sometimes struggle with the concept</a>).</p>
<h2><span lang="EN-US">Credible Interval Analysis</span></h2>
<p>One proposed alternative is the Bayesian credible interval.</p>
<p>Credible intervals are designed to allow for the interpretation people naturally want to use. A 95% credible interval can be interpreted as having a 95% probability of containing the true value.</p>
<p>Like with confidence intervals, there are different computations used to generate credible intervals on binary data. And like with confidence intervals, there are debates about which method is optimal. We won’t get into that debate here. Instead, we’ve provided in Table 1 three Bayesian credible intervals for our example that differ in <a href="https://measuringu.com/bayes-law-in-ux-research-the-power-and-perils-of-priors">their priors</a> (all of which are <a href="https://nvlpubs.nist.gov/nistpubs/TechnicalNotes/NIST.TN.2119.pdf">commonly used in practice</a>).</p>

<table id="tablepress-1026" class="tablepress tablepress-id-1026">
<thead>
<tr class="row-1">
	<th class="column-1"><strong>Method</strong></th><th class="column-2"><strong>Prior/Setup</strong></th><th class="column-3"><strong>95% Interval</strong></th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">Adjusted-Wald</td><td class="column-2">Add ~2 successes &amp; ~2 failures</td><td class="column-3"><strong>68.7% to 98.4%</strong></td>
</tr>
<tr class="row-3">
	<td class="column-1">Bayesian credible interval</td><td class="column-2">Beta(1,1)—Uniform prior</td><td class="column-3"><strong>69.6% to 97.0%</strong></td>
</tr>
<tr class="row-4">
	<td class="column-1">Bayesian credible interval</td><td class="column-2">Beta(0.5, 0.5)—Jeffreys prior</td><td class="column-3"><strong>71.6% to 97.9%</strong></td>
</tr>
<tr class="row-5">
	<td class="column-1">Bayesian credible interval</td><td class="column-2">Beta(2, 2)—Agresti prior</td><td class="column-3"><strong>66.4% to 95.0%</strong></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1026 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Four 95% interval estimates, one confidence and three credible.</p>
<p>For example, a 95% Bayesian credible interval using a uniform prior for 18 successes and 2 failures generates a credible interval of 69.6% to 97.0%.</p>
<p>We can say there’s a 95% probability that the true and unknown completion rate is between 69.6% and 97%.</p>
<p>Stats professors are happy with that statement. Bayesian purists are happy with that statement. And your stakeholders probably understand that statement too!</p>
<p>So, should we all start using credible intervals and abandon confidence intervals? Not necessarily.</p>
<p>Credible intervals require more complex calculations and usually don’t have the simple closed-form solution of the adjusted-Wald interval. In practice, however, this difference is negligible because modern software handles the computation (e.g., we used the binom.bayes function in the R package binom).</p>
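<p>For readers without R, the following Python sketch approximates the three credible intervals using only the standard library. It assumes equal-tailed intervals from a Beta posterior (prior parameters plus observed successes and failures) and uses simple numerical integration and bisection in place of binom.bayes, so the function names and numerical settings are illustrative rather than the method used for Table 1.</p>

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density; the endpoint shortcut assumes a > 1 and b > 1."""
    if x >= 1.0 or 0.0 >= x:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_cdf(x, a, b, steps=1000):
    """P(rate <= x) by composite Simpson integration of the density."""
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3

def beta_quantile(p, a, b):
    """Invert the CDF by bisection on the 0-1 range."""
    low, high = 0.0, 1.0
    for _ in range(40):
        mid = (low + high) / 2
        if beta_cdf(mid, a, b) > p:
            high = mid
        else:
            low = mid
    return (low + high) / 2

def credible_interval(successes, failures, prior_a, prior_b, level=0.95):
    """Equal-tailed interval from the Beta posterior."""
    a, b = prior_a + successes, prior_b + failures
    tail = (1 - level) / 2
    return beta_quantile(tail, a, b), beta_quantile(1 - tail, a, b)

priors = {"Uniform": (1, 1), "Jeffreys": (0.5, 0.5), "Agresti": (2, 2)}
intervals = {name: credible_interval(18, 2, *ab) for name, ab in priors.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.1%} to {high:.1%}")
```

<p>The printed endpoints should fall close to the Table 1 values, though small differences in the last digit are possible because the exact binom.bayes settings are not stated here.</p>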
<p>But did you notice anything about the values in Table 1? The intervals are all very similar, as shown in the graph in Figure 1.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-scaled.png"><img decoding="async" class="alignnone wp-image-47293" src="https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-300x175.png" alt="Graph of the four intervals " width="1200" height="698" srcset="https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-300x175.png 300w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-1024x596.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-768x447.png 768w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-1536x894.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-2048x1192.png 2048w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-600x349.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Graph of the four intervals (Green: adjusted-Wald, Blue: Bayesian Uniform, Orange: Bayesian Jeffreys, Black: Bayesian Agresti); dashed green line shows limits of adjusted-Wald interval across the three Bayesian intervals.</p>
<p>The intervals differ only slightly. The width of the adjusted-Wald interval is 29.7 percentage points. The Uniform and Jeffreys intervals lie within the adjusted-Wald interval (with respective widths of 27.4 and 26.3 points), while the Agresti interval has about the same width as the adjusted-Wald (28.6 points), with its upper and lower endpoints shifted down relative to the adjusted-Wald interval by 3.4 and 2.3 points, respectively.</p>
<p>If the output is roughly the same, does it really matter? The numbers don’t know where they come from.</p>
<p>This is similar to the debate about ordinal versus interval data. As Lord (1951) noted, even <a href="https://is.muni.cz/el/fss/jaro2010/PSY454/um/Frederick_Lord_On_the_statistical_treatment_of_football_numbers.pdf">nominal values like football jersey numbers can be averaged</a>. The math works, but proper interpretation is critical.</p>
<p>Confidence intervals and credible intervals can yield nearly identical results, especially for this type of data. In many cases, <strong>they will lead to the same practical decision</strong>, even though the interpretation differs.</p>
<p>So, what should you do?</p>
<p>The results here suggest that, at least for this type of data, traditional confidence intervals and Bayesian credible intervals can produce very similar ranges. The main difference is not in the numbers, but in how we interpret and communicate them.</p>
<p>That’s one reason we continue to recommend confidence intervals. They are well understood, widely taught, and, when used appropriately, provide accurate estimates of the range of plausible values.</p>
<p>At the same time, we understand the appeal of credible intervals. The interpretation is more natural and often aligns better with how stakeholders think about uncertainty.</p>
<p>In practice, either approach can be effective. What matters most is understanding what the interval represents and communicating it clearly. Decisions are made by inspecting the endpoints of the intervals. If you’d make the same decision for both endpoints, then you have enough information to make the decision. Otherwise, you need more data. In this example, it seems unlikely that the slight variation in endpoint values would affect real-world decision making.</p>
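<p>The endpoint rule can be sketched in a few lines of Python. The benchmarks here are assumptions for illustration: 0.60 is hypothetical, and 0.78 is borrowed from the historical average discussed in the companion article on priors. The interval values are the rounded endpoints from Table 1.</p>

```python
def decide(low, high, benchmark):
    """Compare both interval endpoints with a benchmark completion rate."""
    if low > benchmark:
        return "above benchmark"
    if benchmark > high:
        return "below benchmark"
    # The endpoints point to different decisions, so the data are inconclusive.
    return "collect more data"

# 95% intervals from Table 1, expressed as proportions.
intervals = {
    "Adjusted-Wald": (0.687, 0.984),
    "Uniform": (0.696, 0.970),
    "Jeffreys": (0.716, 0.979),
    "Agresti": (0.664, 0.950),
}

results = {}
for benchmark in (0.60, 0.78):  # illustrative benchmarks, not from this study
    results[benchmark] = {
        name: decide(low, high, benchmark)
        for name, (low, high) in intervals.items()
    }
    print(benchmark, results[benchmark])
```

<p>For these two benchmarks, all four intervals lead to the same decision. A benchmark that falls between the four lower endpoints (for example, 0.70) would split them, which is the kind of case where the choice of method could matter.</p>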
<p>Notably, in this example, the confidence interval encompassed two of the Bayesian intervals (Uniform and Jeffreys), so not only did it have 95% confidence from a frequentist perspective, but under those two priors it also had at least 95% credibility from a Bayesian perspective.</p>
<p>We’ll continue to explore where these approaches differ more meaningfully in future articles, including whether these similarities extend beyond this example to different proportions and to other statistics such as means.</p>
<h2>Key Takeaways</h2>
<p>In this latest article on Bayesian methods, we covered:</p>
<ul>
<li>Confidence intervals are harder to explain than most people think.</li>
<li>Credible intervals match how people want to interpret uncertainty.</li>
<li>In this example, both methods produce very similar ranges.</li>
<li>The difference is less about the numbers and more about what we can say about them.</li>
<li>Use either approach thoughtfully, but focus on clear communication.</li>
</ul>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Bayes’ Law in UX Research: The Power and Perils of Priors</title>
		<link>https://measuringu.com/bayes-law-in-ux-research-the-power-and-perils-of-priors/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=bayes-law-in-ux-research-the-power-and-perils-of-priors</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD&nbsp;•&nbsp;Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 01 Apr 2026 03:35:38 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[Bayesian theorem]]></category>
		<category><![CDATA[Statistics]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47171</guid>

					<description><![CDATA[&#8220;That confirms what I expected.&#8221; The same data, two different conclusions. A 90% completion rate from 20 participants on a usability test of a checkout flow. Is that completion rate better than the historical average of 78%? One researcher says yes, definitely. Another says no, it’s in line with the historical average. Both are using [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage.png"><img decoding="async" class="alignleft wp-image-47227 size-medium" src="https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-300x169.png" alt="Feature image showing two balance scales with urns on each side" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a>&#8220;That confirms what I expected.&#8221;</p>
<p>The same data, two different conclusions.</p>
<p>A 90% completion rate from 20 participants on a usability test of a checkout flow. Is that completion rate better than the <a href="https://measuringu.com/task-completion/">historical average of 78%</a>?</p>
<p>One researcher says yes, definitely. Another says no, it’s in line with the historical average.</p>
<p>Both are using the same <a href="https://measuringu.com/intro-to-bayesian-thinking-in-ux-research/">Bayesian method</a>. How can the same data produce opposite conclusions?</p>
<p>The answer lies in <em>priors</em>, the assumptions you bring to the analysis before the data impact the decision.</p>
<p>In our previous article, we assumed equal priors when <a href="https://measuringu.com/bayes-law-in-ux-research-from-urns-to-users">analyzing completion rate data</a> to simplify the analysis. But what happens when those priors change?</p>
<p>In this article, we explore the consequences of manipulating those prior probabilities in different ways.</p>
<h2><span lang="EN-US">The Effect of Priors on the Outcome</span></h2>
<p>In Bayesian analysis, we assign numerical probabilities to prior beliefs about competing hypotheses. Priors reflect how plausible we think each explanation is before seeing the current data.</p>
<p>If a prior belief is well supported, we give it more weight. If it’s less credible, we give it less weight. When we don’t have strong prior information, we can assign roughly equal weights, allowing the observed data to play a larger role in the conclusion.</p>
<p>In our example, 18 of 20 participants successfully completed a task (a 90% completion rate). We want to understand how different prior beliefs affect our interpretation of this result when compared to a historical completion rate of 78%.</p>
<p>To do this, we compare two hypotheses: that the true completion rate is 78% (historical) or 90% (based on the observed data), under different prior assumptions. We could also test other values (e.g., 85% or 95%), but we use 90% as a convenient reference based on the sample, recognizing that this is a simplifying modeling choice.</p>
<p>So, which is more plausible: a 78% or 90% completion rate?</p>
<p>We examine five scenarios that vary the strength and direction of the prior belief:</p>
<ol>
<li>Neutral prior (no preference)</li>
<li>Weak prior favoring a 78% completion rate</li>
<li>Weak prior favoring a 90% completion rate</li>
<li>Strong prior favoring a 78% completion rate</li>
<li>Strong prior favoring a 90% completion rate</li>
</ol>
<p>So how do we quantify the strength of our prior beliefs? What values should we use to represent neutral, weak, and strong preferences for one hypothesis over another?</p>
<p>A neutral prior is straightforward: 50/50, reflecting no preference. But once we move beyond that, the choice of “weak” or “strong” priors becomes less clear.</p>
<p>If we move slightly off a neutral stance, values like 60/40 seem reasonable. But whether we use 60/40, 70/30, or 80/20 is somewhat arbitrary. We use 0.6 and 0.8 to represent weak and strong prior preferences, respectively.</p>
<p>To avoid confusion between completion rates (e.g., 90%) and prior probabilities (e.g., 0.8), we use decimal values for the priors.</p>
<p>When we apply these values to the Bayesian formula (see the appendix), we obtain the results shown in Table 1.</p>
<p>Each row represents a different prior scenario. The second and third columns show the prior beliefs assigned to each hypothesis. The next two columns show how those beliefs are updated after observing 18 of 20 participants complete the task. The final two columns show which hypothesis is more likely after the update and the odds of the 90% hypothesis relative to the 78% hypothesis.</p>
<p>For example, with neutral priors, the 90% completion rate is 2.7 times more likely than the 78% completion rate. In contrast, with a strong prior favoring 78%, the 78% completion rate becomes more likely than the 90% completion rate.</p>

<table id="tablepress-1025" class="tablepress tablepress-id-1025">
<thead>
<tr class="row-1">
	<th rowspan="2" class="column-1"><center>Prior Scenario</th><th colspan="2" class="column-2">Prior Belief in</th><th colspan="2" class="column-4">Updated Belief in</th><th rowspan="2" class="column-6"><center>Which Is<br>More Likely?</th><th rowspan="2" class="column-7"><center>Odds<br>(90% vs. 78%)</th>
</tr>
<tr class="row-2">
	<th class="column-2">90%</th><th class="column-3">78%</th><th class="column-4">90%</th><th class="column-5">78%</th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-3">
	<td class="column-1">Neutral prior (no preference)</td><td class="column-2">0.5</td><td class="column-3">0.5</td><td class="column-4">0.732</td><td class="column-5">0.268</td><td class="column-6"><center>90%</td><td class="column-7"><center>2.7×</td>
</tr>
<tr class="row-4">
	<td class="column-1">Weak prior favoring 78%</td><td class="column-2">0.4</td><td class="column-3">0.6</td><td class="column-4">0.645</td><td class="column-5">0.355</td><td class="column-6"><center>90%</td><td class="column-7"><center>1.8×</td>
</tr>
<tr class="row-5">
	<td class="column-1">Weak prior favoring 90%</td><td class="column-2">0.6</td><td class="column-3">0.4</td><td class="column-4">0.804</td><td class="column-5">0.196</td><td class="column-6"><center>90%</td><td class="column-7"><center>4.1×</td>
</tr>
<tr class="row-6">
	<td class="column-1">Strong prior favoring 78%</td><td class="column-2">0.2</td><td class="column-3">0.8</td><td class="column-4">0.405</td><td class="column-5">0.595</td><td class="column-6"><center>78%</td><td class="column-7"><center><strong>0.68×<br>(≈1.5× for 78%)</strong></td>
</tr>
<tr class="row-7">
	<td class="column-1">Strong prior favoring 90%</td><td class="column-2">0.8</td><td class="column-3">0.2</td><td class="column-4">0.916</td><td class="column-5">0.084</td><td class="column-6"><center>90%</td><td class="column-7"><center><strong>10.9×</strong></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1025 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Effect of different priors on updated beliefs.</p>
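<p>The values in Table 1 can be approximated with a short Python sketch of the two-hypothesis update described in the appendix. The function and scenario names are illustrative. Because the appendix rounds the two likelihoods to 0.0015 and 0.00055 before dividing, the unrounded results below can differ from the table in the third decimal place.</p>

```python
def posterior_90(prior_90, successes=18, failures=2):
    """Posterior probability of the 90% hypothesis versus the 78% hypothesis."""
    like_90 = 0.90 ** successes * 0.10 ** failures
    like_78 = 0.78 ** successes * 0.22 ** failures
    weighted_90 = like_90 * prior_90
    weighted_78 = like_78 * (1 - prior_90)
    return weighted_90 / (weighted_90 + weighted_78)

# Prior probability assigned to the 90% hypothesis in each scenario.
scenarios = {
    "Neutral": 0.5,
    "Weak favoring 78%": 0.4,
    "Weak favoring 90%": 0.6,
    "Strong favoring 78%": 0.2,
    "Strong favoring 90%": 0.8,
}

table = {}
for name, prior_90 in scenarios.items():
    post_90 = posterior_90(prior_90)
    odds = post_90 / (1 - post_90)  # odds of 90% relative to 78%
    table[name] = (post_90, odds)
    print(f"{name}: P(90%|D) = {post_90:.3f}, odds = {odds:.2f}")
```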
<h2><span lang="EN-US">How Our Conclusions Change Based on Priors</span></h2>
<p>Across all five scenarios, a 90% completion rate is more likely in four of them. In one case, it’s more than ten times as likely as the 78% completion rate. Only when we strongly favor the historical data does the conclusion shift, making the 78% completion rate more likely despite the observed results.</p>
<p>Changing only the prior belief can shift the conclusion from favoring 78% to strongly favoring 90%. No new data were added. In this example, changing the prior assumption had a larger effect on the conclusion than a modest increase in sample size would. This raises a natural question: how much additional data would be needed to overcome a strong prior?</p>
<p>This highlights an important property of Bayesian analysis. The conclusions are influenced not only by the observed data, but also by the strength and direction of the prior beliefs. When priors are strong, they can reinforce or counteract the data. When priors are weak or neutral, the data play a larger role.</p>
<p>Who decides what the historical data is and how relevant it is? And how strongly do you weight the priors? There isn’t a Bayesian rule book we can reference. Instead, it comes down to making informed and good judgments. But is that judgment always clear, and does it lead to better conclusions?</p>
<p>Understanding how priors affect the decision (under one scenario) is the easy part. Teasing out the pros and cons of this approach with more Bayesian methods and real-world scenarios is the harder one. And the subject of some upcoming articles.</p>
<p>This illustrates both the potential power and the potential risk of Bayesian analysis. It can incorporate prior knowledge in a principled way, but when priors are uncertain, subjective, or weakly supported, the results may reflect assumptions as much as evidence.</p>
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>In a <a href="https://measuringu.com/bayes-law-in-ux-research-from-urns-to-users">previous article</a>, we extended a classical problem in Bayesian comparison of the likelihoods of two hypotheses to a UX research context using an approach that required only simple algebra.</p>
<p>In this article, we showed how variation in prior belief can affect the posterior probabilities of competing UX hypotheses, potentially having a larger impact than small changes in the observed data. For this example, varying the priors had a large effect on the posterior probabilities of the hypotheses (from <strong>0.405 to 0.916</strong> for the 90% hypothesis). This sensitivity may be due, in part, to the relatively small difference between the competing hypotheses (78% vs. 90%, just a 12-point difference).</p>
<h3><span lang="EN-US">What Should Researchers Do About Priors?</span></h3>
<p>In practice, researchers should:</p>
<ul>
<li>Be explicit about the priors they use and how they were chosen.</li>
<li>Test multiple plausible priors to understand how sensitive the conclusions are to variation in priors (e.g., <a href="https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2020.608045/full">prior sensitivity analysis</a>).</li>
<li>Be cautious when priors are uncertain or weakly supported.</li>
<li>Consider collecting more data when conclusions depend heavily on prior assumptions.</li>
</ul>
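<p>A simple form of prior sensitivity analysis for this example is to sweep the prior across a range of values and see where the conclusion changes. The sketch below assumes the same two hypotheses (78% vs. 90%) and the same data (18 of 20); the grid of priors and the names are illustrative choices.</p>

```python
def posterior_odds(prior_90, successes=18, failures=2):
    """Posterior odds of 90% versus 78%: prior odds times likelihood ratio."""
    likelihood_ratio = (0.90 / 0.78) ** successes * (0.10 / 0.22) ** failures
    return prior_90 / (1 - prior_90) * likelihood_ratio

# Sweep the prior probability of the 90% hypothesis from 0.1 to 0.9.
sweep = {p / 10: posterior_odds(p / 10) for p in range(1, 10)}
for prior_90, odds in sweep.items():
    favored = "90%" if odds > 1 else "78%"
    print(f"P(90%) = {prior_90:.1f}: odds = {odds:.2f}, favors {favored}")

# Prior at which the two hypotheses end up equally likely (odds of exactly 1).
tipping_point = 1 / (1 + posterior_odds(0.5))
print(f"Conclusion flips near P(90%) = {tipping_point:.2f}")
```

<p>If the conclusion flips within the range of priors you consider plausible, that is a signal to report the sensitivity and consider collecting more data.</p>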
<p>Understanding how priors influence results is an important step in using Bayesian methods effectively. It does not mean avoiding Bayesian analysis, but it does mean using it thoughtfully and transparently.</p>
<h2><span lang="EN-US">Appendix</span></h2>
<p>For this example, we assumed 20 participants attempted an online checkout task with 18 successes and 2 failures (90% success). With that result, we want to understand whether it’s more likely that the true successful completion rate is 78% (historical) or our observed 90% (better than historical).</p>
<p>To get the updated beliefs and odds displayed in Table 1, we used the following Bayesian formula.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula.png"><img loading="lazy" decoding="async" class="size-full wp-image-47183 aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula.png" alt="Bayesian formula comparing 78% and 90% completion rates" width="390" height="55" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula.png 390w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-300x42.png 300w" sizes="auto, (max-width: 390px) 100vw, 390px" /></a></p>
<p>where:</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) is the probability of getting this sample (the data, D) if the true completion rate is 90%.</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) is the probability of getting this sample if the true completion rate is 78%.</p>
<p style="padding-left: 25px;"><em>P</em>(90%) is our expected (prior) probability that the true completion rate is 90%.</p>
<p style="padding-left: 25px;"><em>P</em>(78%) is our expected (prior) probability that the true completion rate is 78%.</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) is the conditional probability of the completion rate being 90% given the sample.</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) is 1 – <em>P</em>(90%|<em>D</em>).</p>
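<p>For readers who cannot view the formula image, the definitions above combine in the standard two-hypothesis form of Bayes' rule. This is a reading consistent with those definitions, not a transcription of the image:</p>

```latex
P(90\% \mid D) = \frac{P(D \mid 90\%)\,P(90\%)}{P(D \mid 90\%)\,P(90\%) + P(D \mid 78\%)\,P(78\%)}

\frac{P(90\% \mid D)}{P(78\% \mid D)} = \frac{P(D \mid 90\%)}{P(D \mid 78\%)} \times \frac{P(90\%)}{P(78\%)}
```

<p>The second line expresses the same update as odds: the posterior odds equal the likelihood ratio multiplied by the prior odds.</p>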
<p>Using the binomial probability formula, we can compute the probabilities of getting this sample for each of the hypothesized true completion rates:</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) is (0.9)<sup>18</sup> × (0.1)<sup>2</sup> = 0.0015.</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) is (0.78)<sup>18</sup> × (0.22)<sup>2</sup> = 0.00055.</p>
<p>Next, we apply this formula to the five sets of priors (Neutral, Weak Favoring 78%, Weak Favoring 90%, Strong Favoring 78%, Strong Favoring 90%).</p>
<p><strong><em>Technical note</em></strong><em><strong>:</strong> We used binomial probabilities throughout this article because they allow us to illustrate the mechanics of the Bayesian analyses with simple algebra. The downside of this simplification is that we had to assign specific prior probabilities rather than following the current practice of using </em><a href="https://bookdown.org/pbaumgartner/bayesian-fun/05-beta-distribution.html"><em>beta distributions for priors</em></a><em>, but this does not affect the logic of the discussion. Also, we excluded the factorial component (the binomial coefficient) of the binomial probability formula because it is the same for both hypotheses and cancels out of the computations.</em></p>
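<p>A quick way to confirm that dropping the factorial component is harmless is to compute the posterior both ways. This Python sketch assumes a neutral prior and uses illustrative function names; it needs Python 3.8 or later for math.comb.</p>

```python
import math

n, successes = 20, 18
coefficient = math.comb(n, successes)  # 190 ways to choose which 18 of 20 succeed

def kernel(rate):
    """Binomial probability without the coefficient, as used in this appendix."""
    return rate ** successes * (1 - rate) ** (n - successes)

def posterior_90(like_90, like_78, prior_90=0.5):
    return like_90 * prior_90 / (like_90 * prior_90 + like_78 * (1 - prior_90))

without_coefficient = posterior_90(kernel(0.90), kernel(0.78))
with_coefficient = posterior_90(coefficient * kernel(0.90), coefficient * kernel(0.78))

print(f"P(D|90%) = {kernel(0.90):.4f}, P(D|78%) = {kernel(0.78):.5f}")
print(f"Posterior without coefficient: {without_coefficient:.6f}")
print(f"Posterior with coefficient:    {with_coefficient:.6f}")
```

<p>The two posteriors agree because the coefficient multiplies both the numerator and every term of the denominator.</p>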
<h3><span lang="EN-US">Neutral Prior</span></h3>
<p>If we decide there is no basis for weighting the priors unequally, the values for the formula are:</p>
<p style="padding-left: 25px;"><em>P</em>(90%) = 0.5</p>
<p style="padding-left: 25px;"><em>P</em>(78%) = 0.5</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) = 0.0015</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) = 0.00055</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-neutral-priors.png"><img loading="lazy" decoding="async" class="size-full wp-image-47184 aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-neutral-priors.png" alt="Bayesian formula with neutral priors" width="324" height="211" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-neutral-priors.png 324w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-neutral-priors-300x195.png 300w" sizes="auto, (max-width: 324px) 100vw, 324px" /></a></p>
<p>So:</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) = 0.732</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) = 0.268</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) / <em>P</em>(78%|<em>D</em>) = 2.73</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) / <em>P</em>(90%|<em>D</em>) = 0.37</p>
<p>Conclusion: There is a substantial likelihood that the historical hypothesis (78%) might be true (0.268 isn’t anywhere near 0), but the alternative hypothesis (90%) is <strong>2.7 times more likely</strong>.</p>
<h3><span lang="EN-US">Weak Prior Favoring 78%</span></h3>
<p>If we decide to give a little more weight to the historical hypothesis (78%) and a little less to the alternative hypothesis (90%), we get:</p>
<p style="padding-left: 25px;"><em>P</em>(90%) = 0.4</p>
<p style="padding-left: 25px;"><em>P</em>(78%) = 0.6</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) = 0.0015</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) = 0.00055</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-weak-priors.png"><img loading="lazy" decoding="async" class="wp-image-47185 size-full aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-weak-priors.png" alt="Bayesian formula with weak prior favoring 78%" width="307" height="213" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-weak-priors.png 307w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-weak-priors-300x208.png 300w" sizes="auto, (max-width: 307px) 100vw, 307px" /></a></p>
<p>So:</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) = 0.645</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) = 0.355</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) / <em>P</em>(78%|<em>D</em>) = 1.82</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) / <em>P</em>(90%|<em>D</em>) = 0.55</p>
<p>Conclusion: There is a substantial probability that the historical hypothesis (78%) is true (0.355 isn’t anywhere near 0), but the alternative hypothesis (90%) is <strong>1.8 times more likely</strong>.</p>
<h3><span lang="EN-US">Weak Prior Favoring 90%</span></h3>
<p>If we decide to give a little more weight to the alternative hypothesis (90%) and a little less to the historical hypothesis (78%), we get:</p>
<p style="padding-left: 25px;"><em>P</em>(90%) = 0.6</p>
<p style="padding-left: 25px;"><em>P</em>(78%) = 0.4</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) = 0.0015</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) = 0.00055</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-favoring-90.png"><img loading="lazy" decoding="async" class="wp-image-47186 size-full aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-favoring-90.png" alt="Bayesian formulas with weak prior favoring 90%" width="320" height="207" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-favoring-90.png 320w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-favoring-90-300x194.png 300w" sizes="auto, (max-width: 320px) 100vw, 320px" /></a></p>
<p>So:</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) = 0.804</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) = 0.196</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) / <em>P</em>(78%|<em>D</em>) = 4.09</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) / <em>P</em>(90%|<em>D</em>) = 0.24</p>
<p>Conclusion: There is a decent probability that the historical hypothesis (78%) is true (0.196 isn’t that close to 0), but the alternative hypothesis (90%) is <strong>4.1 times more likely</strong>.</p>
<h3><span lang="EN-US">Strong Prior Favoring 78%</span></h3>
<p>If we decide to give a lot more weight to the historical hypothesis (78%) and a lot less to the alternative hypothesis (90%), we get:</p>
<p style="padding-left: 25px;"><em>P</em>(90%) = 0.2</p>
<p style="padding-left: 25px;"><em>P</em>(78%) = 0.8</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) = 0.0015</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) = 0.00055</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-78.png"><img loading="lazy" decoding="async" class="size-full wp-image-47188 aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-78.png" alt="Bayesian formula with strong prior favoring 78%" width="307" height="213" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-78.png 307w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-78-300x208.png 300w" sizes="auto, (max-width: 307px) 100vw, 307px" /></a></p>
<p>So:</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) = 0.405</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) = 0.595</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) / <em>P</em>(78%|<em>D</em>) = 0.68</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) / <em>P</em>(90%|<em>D</em>) = 1.47</p>
<p>Conclusion: The historical hypothesis (78%) is <strong>about 1.5 times more likely</strong> than the alternative hypothesis (90%), but not by much (neither posterior probability is far from 50%).</p>
<h3><span lang="EN-US">Strong Prior Favoring 90%</span></h3>
<p>If we decide to give a lot more weight to the alternative hypothesis (90%) and a lot less to the historical hypothesis (78%), we get:</p>
<p style="padding-left: 25px;"><em>P</em>(90%) = 0.8</p>
<p style="padding-left: 25px;"><em>P</em>(78%) = 0.2</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) = 0.0015</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) = 0.00055</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-90.png"><img loading="lazy" decoding="async" class="size-full wp-image-47189 aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-90.png" alt="Bayesian formula with strong priors favoring 90%" width="318" height="222" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-90.png 318w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-90-300x209.png 300w" sizes="auto, (max-width: 318px) 100vw, 318px" /></a></p>
<p>So:</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) = 0.916</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) = 0.084</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) / <em>P</em>(78%|<em>D</em>) = 10.91</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) / <em>P</em>(90%|<em>D</em>) = 0.09</p>
<p>Conclusion: The probability that the historical hypothesis (78%) is true is relatively small (0.084 is getting close to 0), and the alternative hypothesis (90%) is <strong>10.9 times more likely</strong>.</p>
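<p>All five scenarios share the same two likelihoods, so they can also be summarized in odds form: posterior odds equal prior odds multiplied by the likelihood ratio (0.0015 / 0.00055, about 2.73). The Python sketch below is one way to tabulate this, not the method used to produce the figures above; the names are ours, and the likelihoods are the rounded values from this appendix.</p>

```python
LIKE_90 = 0.0015   # P(D|90%), rounded value from this appendix
LIKE_78 = 0.00055  # P(D|78%), rounded value from this appendix
LIKELIHOOD_RATIO = LIKE_90 / LIKE_78  # how strongly the data favor 90% over 78%

SCENARIOS = [
    ("Neutral", 0.5),
    ("Weak, favoring 78%", 0.4),
    ("Weak, favoring 90%", 0.6),
    ("Strong, favoring 78%", 0.2),
    ("Strong, favoring 90%", 0.8),
]


def posterior_odds(prior_90):
    """Posterior odds of 90% vs. 78% = prior odds x likelihood ratio."""
    prior_odds = prior_90 / (1 - prior_90)
    return prior_odds * LIKELIHOOD_RATIO


def odds_to_probability(odds):
    """Convert odds in favor of a hypothesis to its probability."""
    return odds / (1 + odds)


results = {}
for label, prior_90 in SCENARIOS:
    odds = posterior_odds(prior_90)
    results[label] = (round(odds_to_probability(odds), 3), round(odds, 2))
    print(f"{label:22s} P(90%|D) = {results[label][0]:.3f}  odds = {results[label][1]:.2f}")
```

<p>Seen this way, the data contribute the same evidence in every scenario; only the prior odds change, which is why the conclusion flips only under the strong prior favoring 78%.</p>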
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Use Banner Tables to Present Survey Results</title>
		<link>https://measuringu.com/how-to-use-banner-tables-to-present-survey-results/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-to-use-banner-tables-to-present-survey-results</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Wed, 25 Mar 2026 00:13:04 +0000</pubDate>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Survey]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[banner table]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[table]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47120</guid>

					<description><![CDATA[Surveys are a common way to measure attitudes, behaviors, and intentions related to products and services. But large surveys can include dozens of questions and multiple demographic segments, which can mean hundreds of potential comparisons. How do you present all those results in a way stakeholders can quickly scan? You can use a slide deck [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47161 size-medium" src="https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-300x169.png" alt="Feature image showing a researcher using banner table to present survey results" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>Surveys are a common way to measure attitudes, behaviors, and intentions related to products and services.</p>
<p>But large surveys can include dozens of questions and multiple demographic segments, which can mean hundreds of potential comparisons. How do you present all those results in a way stakeholders can quickly scan?</p>
<p>You can use a slide deck with charts for every question and segment, but that can easily lead to dozens of slides.</p>
<p>Another option is a <em>banner table</em>. While it sounds like something you might see at a trade show, a banner table is a compact way to display many cross-tabulated survey results in a single view.</p>
<p>Banner tables are widely used in market research, but they are less commonly seen in UX research. In a <a href="https://measuringu.com/what-are-ux-deliverables/">previous article</a>, we listed 18 UX research deliverables classified as interim, final, and artifacts; the banner table was one of the least familiar.</p>
<p>When used appropriately, banner tables provide an efficient way to summarize survey results across multiple segments.</p>
<p>Below is an example showing a banner table displaying brand attitude and reluctance to share political content by two social media platforms and gender (Table 1).</p>

<table id="tablepress-1024" class="tablepress tablepress-id-1024">
<thead>
<tr class="row-1">
	<td class="column-1"></td><td class="column-2"></td><th colspan="2" class="column-3"><center><strong>Facebook</strong></center></th><th colspan="2" class="column-5"><center><strong>TikTok</strong></center></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1"><strong>Metric</strong></td><td class="column-2"><strong>Total</strong></td><td class="column-3"><strong>Female</strong></td><td class="column-4"><strong>Male</strong></td><td class="column-5"><strong>Female</strong></td><td class="column-6"><strong>Male</strong></td>
</tr>
<tr class="row-3">
	<td class="column-1">Brand attitude (Top-2 Box %)</td><td class="column-2">30%</td><td class="column-3">28%</td><td class="column-4">7%</td><td class="column-5">50%</td><td class="column-6">35%</td>
</tr>
<tr class="row-4">
	<td class="column-1">Reluctance to share political content (Bottom-2 Box %)</td><td class="column-2">70%</td><td class="column-3">58%</td><td class="column-4">73%</td><td class="column-5">72%</td><td class="column-6">93%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Sample size (<i>n</i>)</td><td class="column-2">123</td><td class="column-3">49</td><td class="column-4">25</td><td class="column-5">29</td><td class="column-6">20</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Example of a banner table.</p>
<p>In this article, we provide more detail about the why and how of banner tables, plus we display an example created with an R script.</p>
<p>Before diving into how banner tables work, it helps to understand where they came from and why they became a standard deliverable in market (and UX) research.</p>
<h2><span lang="EN-US">Banner Tables: Common in Market Research, Less Known in UX Research</span></h2>
<p>For large-scale surveys with multiple segments, a banner table is a good way to present results cross-tabulated by key segments (e.g., demographics, personas, behaviors), revealing group differences in a form that is easy to scan (Figure 1).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47163" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-300x20.png" alt="High-level view of a banner table." width="1200" height="81" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-300x20.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-1024x69.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-768x52.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-1536x104.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-600x41.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1.png 2005w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> High-level view of a banner table.</p>
<p>Later in this article, we’ll zoom in on the different parts of this table and dig into its details. What you can see from the high-level view in Figure 1 is a set of metrics in the first column followed by a Totals column. The empty green columns separate crosstabs of the metrics with, in order, six social media platforms, three gender designations, six age groups, and six income levels. When presented as a spreadsheet, it’s common to freeze the top row and the first one or two columns to support easily browsing the contents.</p>
<h3><span lang="EN-US">Banner Tables Can Be Traced Back to U.S. Census Practices in 1949</span></h3>
<p>There’s no clear historical record of when the first banner table was published, but it likely coincided with the emergence of large-scale surveys in the mid-20<sup>th</sup> century. In banner tables, the rows are sometimes called the <em>stub</em> and the columns the <em>banner</em>, and older names for banner tables include stub-and-banner tables or stub-and-boxhead (as in the 1949 U.S. Census Bureau publication, <a href="https://www2.census.gov/library/publications/1949/general/tabular-presentation.pdf"><em>Bureau of the Census Manual of Tabular Presentation</em></a>). Regardless of the nomenclature, the key to its success is compressing a large number of crosstabs into one wide table.</p>
<h3><span lang="EN-US">Banner Tables Are Widely Used in Market Research</span></h3>
<p>Often considered a core piece of survey reporting for market research projects, a “banner run” shows every key question broken out by key segments (e.g., demographics, usage, brand, region). This is a common practice because the sample sizes in market research are often large enough to support a large number of data splits, the format is standardized and repeatable, and it serves the needs of stakeholders who want the same results sliced in different ways.</p>
<h3><span lang="EN-US">Banner Tables Are Less Common in UX Research but Have Their Place</span></h3>
<p>It’s possible for a UX researcher to spend decades in the field and never be asked to produce a banner table (we know this from personal experience). Nonetheless, banner tables can play a role when the research methodology is a large-scale survey (especially when focused on segmentation analysis). Even then, in UX research, banner tables will usually be more of a <a href="https://measuringu.com/what-are-ux-deliverables/">supporting deliverable</a> than the core deliverable they are in market research.</p>
<h3><span lang="EN-US">Banner Tables Provide a Quick Way to Compare Weighted and Unweighted Results</span></h3>
<p>In our previous article on <a href="https://measuringu.com/rake-weighting-how-to-weight-survey-data-with-multiple-variables/">rake weighting</a>, we demonstrated the use of the <a href="https://cran.r-project.org/web/packages/anesrake/index.html">anesrake R package</a> to weight data on multiple demographic variables. The practice of demographic weighting is more common in market research and political polling because they have clearer and more accessible reference populations than is typical in UX research.</p>
<p>If the decision has been made to weight data, a banner table provides a convenient way to check on the effect of weighting on research outcomes.</p>
<p>In practice, market research banner tables usually treat weighted results as the “official” estimates but commonly include unweighted bases and percentages for quality control and transparency. UX research tends to follow that practice when weights are known to be based on a strong reference population; otherwise, unweighted results may take precedence over weighted results when reviewing the banner tables.</p>
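<p>To make the weighted-versus-unweighted comparison concrete, here is a minimal Python sketch of the four quantities a banner table typically reports for one cell: unweighted <em>n</em>, weighted <em>n</em>, and the unweighted and weighted percentages. The respondents and weights are hypothetical (not the survey data), and the function name <code>summarize</code> is ours.</p>

```python
# Hypothetical respondents: (top-two-box flag, case weight). Not the survey data.
respondents = [
    (1, 0.6),
    (1, 0.6),
    (0, 1.4),
    (0, 1.4),
    (1, 1.0),
    (0, 1.0),
]


def summarize(rows):
    """Return unweighted/weighted bases and percentages for a 0/1 metric."""
    unweighted_n = len(rows)
    weighted_n = sum(weight for _, weight in rows)
    # Unweighted: every respondent counts once.
    pct_unweighted = 100 * sum(flag for flag, _ in rows) / unweighted_n
    # Weighted: every respondent counts in proportion to the case weight.
    pct_weighted = 100 * sum(flag * weight for flag, weight in rows) / weighted_n
    return {
        "unweighted_n": unweighted_n,
        "weighted_n": weighted_n,
        "pct_unweighted": pct_unweighted,
        "pct_weighted": pct_weighted,
    }


summary = summarize(respondents)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

<p>In this made-up example, the respondents who chose a top-two-box rating carry smaller weights, so the weighted percentage falls below the unweighted one. That gap is what a side-by-side banner table makes easy to spot.</p>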
<h2><span lang="EN-US">Banner Table Example</span></h2>
<p>For this example, we return to the data we used in our article on rake weighting so we can produce banner tables with both unweighted and weighted outcomes (for R scripting details, see the appendix).</p>
<h3><span lang="EN-US">Social Media: Weighting Brand Attitude and Reluctance to Engage in Political Discourse</span></h3>
<p>The data for this example came from our <a href="https://measuringu.com/the-ux-of-social-media-in-2024/">2024 SUPR-Q survey</a> of social media platforms. We recruited 324 participants in August 2024 to reflect on their most recent experience with one of six social media platforms: Facebook, Instagram, LinkedIn, Snapchat, TikTok, and X. We were interested in a wide range of UX topics (e.g., overall quality of experience, levels of trust, impact on mood and self-esteem). For the rake weighting article, our examples focused on measuring brand attitude and reluctance to engage in political discourse on the platforms.</p>
<p>We used demographic distributions of the adult U.S. population (18 years of age and older) as the reference population for <a href="https://news.gallup.com/poll/656708/lgbtq-identification-rises.aspx">gender</a>, <a href="https://www2.census.gov/library/publications/decennial/2020/census-briefs/c2020br-06.pdf">age</a>, and <a href="https://www.census.gov/library/publications/2025/demo/p60-286.html">income</a> because it’s commonly used for that purpose in many research contexts. Note that <strong>we do not recommend this as good practice for UX research</strong>, as the entire U.S. population is rarely the target audience for a specific product or service, and demographic variables often have little effect on UX metrics. It did, however, work well in our example as a quick check of the value (or lack of value) of employing this kind of demographic weighting in future SUPR-Q retrospective benchmarks.</p>
<p>Figure 2 shows the first ten rows of the source data with the respondent number, the platform, gender, age group, income range, brand attitude (seven-point scale and top-two box), rating of likelihood to share political content (five-point scale and bottom-two-box score), and the case weight determined by the previous rake weighting exercise. The item stems were, respectively, “Overall, how would you rate your attitude toward &lt;platform&gt;?” and “How likely are you to share political content on &lt;platform&gt;?”</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47154" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2.png" alt="Portion of source data with weights from previous rake weighting exercise." width="1200" height="287" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2.png 1200w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2-300x72.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2-1024x245.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2-768x184.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2-600x144.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Portion of source data with weights from previous rake weighting exercise.</p>
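<p>The top-two and bottom-two box scores in the source data are simple recodes of the raw ratings. The Python sketch below shows one way to compute them, assuming the scales are coded from 1 upward (1 to 7 for brand attitude, 1 to 5 for likelihood to share political content) with higher numbers more favorable or more likely. That coding and the example ratings are assumptions, not details taken from the survey file.</p>

```python
def top_two_box(rating, scale_max):
    """Return 1 if the rating is one of the two highest scale points, else 0."""
    return 1 if rating >= scale_max - 1 else 0


def bottom_two_box(rating, scale_min=1):
    """Return 1 if the rating is one of the two lowest scale points, else 0."""
    return 1 if rating <= scale_min + 1 else 0


# Hypothetical ratings for five respondents (not the survey data).
brand_attitude = [7, 6, 5, 3, 1]        # seven-point scale
share_political = [1, 2, 3, 5, 1]       # five-point scale

attitude_t2b = [top_two_box(r, scale_max=7) for r in brand_attitude]
political_b2b = [bottom_two_box(r) for r in share_political]

print("Top-two box flags:   ", attitude_t2b)
print("Bottom-two box flags:", political_b2b)
print("Top-two box %:   ", 100 * sum(attitude_t2b) / len(attitude_t2b))
print("Bottom-two box %:", 100 * sum(political_b2b) / len(political_b2b))
```

<p>Averaging the 0/1 flags gives the percentage shown in a banner table cell.</p>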
<h3><span lang="EN-US">Banner Table Results</span></h3>
<p>To produce the banner table, we used three R packages:</p>
<ul>
<li><a href="https://www.r-bloggers.com/2024/11/creating-professional-excel-reports-with-r-a-comprehensive-guide-to-openxlsx-package/">openxlsx</a>: Get data from an Excel file and produce formatted results</li>
<li><a href="https://cengel.github.io/R-data-wrangling/dplyr.html">dplyr</a>: Manipulation of data frames</li>
<li><a href="https://tidyr.tidyverse.org/reference/tidyr-package.html">tidyr</a>: Simplified creation of tidy data formats</li>
</ul>
<p>The complete R script for creating this banner table is in the appendix.</p>
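<p>For readers who want to see the basic logic without R, the following Python sketch assembles two banner rows (base size and percentage) across a Total column and every level of each banner variable. It is an illustration only and is not a translation of the R script in the appendix; the records, field names, and function names are all hypothetical.</p>

```python
# Hypothetical respondent records (not the survey data).
records = [
    {"platform": "Facebook", "gender": "Female", "top2": 1},
    {"platform": "Facebook", "gender": "Male", "top2": 0},
    {"platform": "Facebook", "gender": "Female", "top2": 0},
    {"platform": "TikTok", "gender": "Female", "top2": 1},
    {"platform": "TikTok", "gender": "Male", "top2": 1},
    {"platform": "TikTok", "gender": "Male", "top2": 0},
]


def banner_columns(rows, banner_vars):
    """Column specs: Total first, then each level of each banner variable."""
    columns = [("Total", None, None)]
    for var in banner_vars:
        for level in sorted({row[var] for row in rows}):
            columns.append((f"{var.title()}: {level}", var, level))
    return columns


def banner_rows(rows, banner_vars, metric):
    """Return two dicts keyed by column label: base size and percentage."""
    n_row, pct_row = {}, {}
    for label, var, level in banner_columns(rows, banner_vars):
        # The Total column keeps everyone; other columns filter on one level.
        subset = [row for row in rows if var is None or row[var] == level]
        n_row[label] = len(subset)
        pct_row[label] = 100 * sum(row[metric] for row in subset) / len(subset)
    return n_row, pct_row


n_row, pct_row = banner_rows(records, ["platform", "gender"], "top2")
for label in n_row:
    print(f"{label:20s} n = {n_row[label]}  top-2 box = {pct_row[label]:.1f}%")
```

<p>Each banner variable is tabulated separately against the metric, which matches the side-by-side layout of crosstabs described above rather than a nested platform-by-gender breakdown.</p>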
<p>Figures 3 through 6 show each of the crosstabs in the table for the overall effects of Platform, Gender, Age Group, and Income. Because the brand attitude metric in the table is a top-two box, larger percentages reflect a more favorable attitude. In contrast, because the item measuring likelihood to engage in political discourse is a bottom-two box (the top boxes were too sparse to provide a meaningful signal), larger percentages indicate greater reluctance to engage.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47155" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-300x49.png" alt="Effect of platform (TikTok had highest brand satisfaction; LinkedIn highest reluctance to engage in political discourse)." width="1170" height="191" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-300x49.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-1024x167.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-768x125.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-1536x251.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-600x98.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3.png 1776w" sizes="auto, (max-width: 1170px) 100vw, 1170px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 3:</strong> Effect of platform (TikTok had the highest brand attitude; LinkedIn had the highest reluctance to engage in political discourse).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47156" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4-300x70.png" alt="Effect of gender (female users had substantially higher brand attitudes than male users; nonbinary users least likely to engage in political discourse). " width="813" height="191" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4-300x70.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4-1024x241.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4-768x180.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4-600x141.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4.png 1230w" sizes="auto, (max-width: 813px) 100vw, 813px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 4:</strong> Effect of gender (female users had substantially higher brand attitudes than male users; nonbinary users were the least likely to engage in political discourse).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47157" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-300x52.png" alt="Effect of age (50-59 had higher brand attitude; 18-24 least likely to engage in political discourse). " width="1092" height="191" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-300x52.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-1024x179.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-768x134.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-1536x269.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-600x105.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5.png 1647w" sizes="auto, (max-width: 1092px) 100vw, 1092px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 5:</strong> Effect of age (50–59 had the highest brand attitude; 18–24 were the least likely to engage in political discourse).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47158" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-300x48.png" alt="Effect of income ($25k-$49k had highest brand attitude; $200k+ were least likely to engage in political discourse). " width="1201" height="191" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-300x48.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-1024x163.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-768x122.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-1536x244.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-600x95.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6.png 1811w" sizes="auto, (max-width: 1201px) 100vw, 1201px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 6:</strong> Effect of income ($25k–$49k had the highest brand attitude; $200k+ were the least likely to engage in political discourse).</p>
<h3><span lang="EN-US">Try It!</span></h3>
<p>Table 2 is a <a href="https://tablepress.org/">TablePress</a> version of the banner table with the top row and the left two columns frozen.</p>
<p>To use the table, click on any row below the header, then use the slider or arrow keys to scroll horizontally.</p>
<p>To switch between the brand attitude and political discourse rows, toggle the 1 and 2 below the right corner of the table. To resume horizontal scrolling, click on any row below the header.</p>

<table id="tablepress-1022" class="tablepress tablepress-id-1022">
<thead>
<tr class="row-1">
	<th class="column-1"><strong>Metric</strong></th><th class="column-2"><strong>Total</strong></th><th class="column-3"><strong> </strong></th><th class="column-4"><strong>Platform: Facebook</strong></th><th class="column-5"><strong>Platform: Instagram</strong></th><th class="column-6"><strong>Platform: LinkedIn</strong></th><th class="column-7"><strong>Platform: Snapchat</strong></th><th class="column-8"><strong>Platform: TikTok</strong></th><th class="column-9"><strong>Platform: X</strong></th><th class="column-10"><strong> </strong></th><th class="column-11"><strong>Gender: Female</strong></th><th class="column-12"><strong>Gender: Male</strong></th><th class="column-13"><strong>Gender: Nonbinary</strong></th><th class="column-14"><strong> </strong></th><th class="column-15"><strong>Age: 18-24</strong></th><th class="column-16"><strong>Age: 25-29</strong></th><th class="column-17"><strong>Age: 30-39</strong></th><th class="column-18"><strong>Age: 40-49</strong></th><th class="column-19"><strong>Age: 50-59</strong></th><th class="column-20"><strong>Age: 60-69</strong></th><th class="column-21"><strong> </strong></th><th class="column-22"><strong>Income: $0-$24k</strong></th><th class="column-23"><strong>Income: $25k-$49k</strong></th><th class="column-24"><strong>Income: $50k-$99k</strong></th><th class="column-25"><strong>Income: $100k-$149k</strong></th><th class="column-26"><strong>Income: $150k-$199k</strong></th><th class="column-27"><strong>Income: $200k+</strong></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1">Brand attitude (Top-2 Box): Unweighted n</td><td class="column-2">324</td><td class="column-3"></td><td class="column-4">53</td><td class="column-5">56</td><td class="column-6">52</td><td class="column-7">50</td><td class="column-8">57</td><td class="column-9">56</td><td class="column-10"></td><td class="column-11">198</td><td class="column-12">114</td><td class="column-13">12</td><td class="column-14"></td><td class="column-15">68</td><td class="column-16">64</td><td class="column-17">102</td><td class="column-18">51</td><td class="column-19">31</td><td class="column-20">8</td><td class="column-21"></td><td class="column-22">50</td><td class="column-23">70</td><td class="column-24">107</td><td class="column-25">56</td><td class="column-26">25</td><td class="column-27">16</td>
</tr>
<tr class="row-3">
	<td class="column-1">Brand attitude (Top-2 Box): Weighted n</td><td class="column-2">324.0</td><td class="column-3"></td><td class="column-4">74.1</td><td class="column-5">52.6</td><td class="column-6">49.4</td><td class="column-7">39.4</td><td class="column-8">50.4</td><td class="column-9">58.0</td><td class="column-10"></td><td class="column-11">163.6</td><td class="column-12">157.1</td><td class="column-13">3.2</td><td class="column-14"></td><td class="column-15">51.7</td><td class="column-16">34.3</td><td class="column-17">69.1</td><td class="column-18">62.6</td><td class="column-19">66.3</td><td class="column-20">40.0</td><td class="column-21"></td><td class="column-22">56.8</td><td class="column-23">70.8</td><td class="column-24">91.2</td><td class="column-25">50.6</td><td class="column-26">22.3</td><td class="column-27">32.3</td>
</tr>
<tr class="row-4">
	<td class="column-1">Brand attitude (Top-2 Box): % (unweighted)</td><td class="column-2">27.2%</td><td class="column-3"></td><td class="column-4">24.5%</td><td class="column-5">25.0%</td><td class="column-6">23.1%</td><td class="column-7">26.0%</td><td class="column-8">50.9%</td><td class="column-9">12.5%</td><td class="column-10"></td><td class="column-11">35.4%</td><td class="column-12">15.8%</td><td class="column-13">0.0%</td><td class="column-14"></td><td class="column-15">22.1%</td><td class="column-16">31.2%</td><td class="column-17">22.5%</td><td class="column-18">33.3%</td><td class="column-19">35.5%</td><td class="column-20">25.0%</td><td class="column-21"></td><td class="column-22">20.0%</td><td class="column-23">30.0%</td><td class="column-24">29.9%</td><td class="column-25">28.6%</td><td class="column-26">24.0%</td><td class="column-27">18.8%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Brand attitude (Top-2 Box): % (weighted)</td><td class="column-2">24.6%</td><td class="column-3"></td><td class="column-4">21.0%</td><td class="column-5">27.0%</td><td class="column-6">31.4%</td><td class="column-7">21.3%</td><td class="column-8">43.1%</td><td class="column-9">7.3%</td><td class="column-10"></td><td class="column-11">32.4%</td><td class="column-12">17.0%</td><td class="column-13">0.0%</td><td class="column-14"></td><td class="column-15">16.2%</td><td class="column-16">27.8%</td><td class="column-17">18.8%</td><td class="column-18">26.4%</td><td class="column-19">33.6%</td><td class="column-20">25.0%</td><td class="column-21"></td><td class="column-22">21.9%</td><td class="column-23">32.4%</td><td class="column-24">26.0%</td><td class="column-25">28.0%</td><td class="column-26">17.0%</td><td class="column-27">8.0%</td>
</tr>
<tr class="row-6">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td><td class="column-7"></td><td class="column-8"></td><td class="column-9"></td><td class="column-10"></td><td class="column-11"></td><td class="column-12"></td><td class="column-13"></td><td class="column-14"></td><td class="column-15"></td><td class="column-16"></td><td class="column-17"></td><td class="column-18"></td><td class="column-19"></td><td class="column-20"></td><td class="column-21"></td><td class="column-22"></td><td class="column-23"></td><td class="column-24"></td><td class="column-25"></td><td class="column-26"></td><td class="column-27"></td>
</tr>
<tr class="row-7">
	<td class="column-1">Political discourse reluctance (Bottom-2 Box): Unweighted n</td><td class="column-2">324</td><td class="column-3"></td><td class="column-4">53</td><td class="column-5">56</td><td class="column-6">52</td><td class="column-7">50</td><td class="column-8">57</td><td class="column-9">56</td><td class="column-10"></td><td class="column-11">198</td><td class="column-12">114</td><td class="column-13">12</td><td class="column-14"></td><td class="column-15">68</td><td class="column-16">64</td><td class="column-17">102</td><td class="column-18">51</td><td class="column-19">31</td><td class="column-20">8</td><td class="column-21"></td><td class="column-22">50</td><td class="column-23">70</td><td class="column-24">107</td><td class="column-25">56</td><td class="column-26">25</td><td class="column-27">16</td>
</tr>
<tr class="row-8">
	<td class="column-1">Political discourse reluctance (Bottom-2 Box): Weighted n</td><td class="column-2">324.0</td><td class="column-3"></td><td class="column-4">74.1</td><td class="column-5">52.6</td><td class="column-6">49.4</td><td class="column-7">39.4</td><td class="column-8">50.4</td><td class="column-9">58.0</td><td class="column-10"></td><td class="column-11">163.6</td><td class="column-12">157.1</td><td class="column-13">3.2</td><td class="column-14"></td><td class="column-15">51.7</td><td class="column-16">34.3</td><td class="column-17">69.1</td><td class="column-18">62.6</td><td class="column-19">66.3</td><td class="column-20">40.0</td><td class="column-21"></td><td class="column-22">56.8</td><td class="column-23">70.8</td><td class="column-24">91.2</td><td class="column-25">50.6</td><td class="column-26">22.3</td><td class="column-27">32.3</td>
</tr>
<tr class="row-9">
	<td class="column-1">Political discourse reluctance (Bottom-2 Box): % (unweighted)</td><td class="column-2">77.5%</td><td class="column-3"></td><td class="column-4">71.7%</td><td class="column-5">73.2%</td><td class="column-6">94.2%</td><td class="column-7">84.0%</td><td class="column-8">75.4%</td><td class="column-9">67.9%</td><td class="column-10"></td><td class="column-11">75.8%</td><td class="column-12">78.9%</td><td class="column-13">91.7%</td><td class="column-14"></td><td class="column-15">80.9%</td><td class="column-16">78.1%</td><td class="column-17">78.4%</td><td class="column-18">74.5%</td><td class="column-19">74.2%</td><td class="column-20">62.5%</td><td class="column-21"></td><td class="column-22">76.0%</td><td class="column-23">77.1%</td><td class="column-24">78.5%</td><td class="column-25">78.6%</td><td class="column-26">68.0%</td><td class="column-27">87.5%</td>
</tr>
<tr class="row-10">
	<td class="column-1">Political discourse reluctance (Bottom-2 Box): % (weighted)</td><td class="column-2">76.0%</td><td class="column-3"></td><td class="column-4">63.2%</td><td class="column-5">70.8%</td><td class="column-6">92.8%</td><td class="column-7">85.1%</td><td class="column-8">80.2%</td><td class="column-9">73.0%</td><td class="column-10"></td><td class="column-11">71.6%</td><td class="column-12">80.3%</td><td class="column-13">90.7%</td><td class="column-14"></td><td class="column-15">80.7%</td><td class="column-16">76.8%</td><td class="column-17">80.5%</td><td class="column-18">79.1%</td><td class="column-19">72.4%</td><td class="column-20">62.5%</td><td class="column-21"></td><td class="column-22">63.8%</td><td class="column-23">79.2%</td><td class="column-24">74.2%</td><td class="column-25">83.0%</td><td class="column-26">73.5%</td><td class="column-27">86.1%</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2: </strong>Working version of the banner table.</p>
<h2><span lang="EN-US">Summary</span></h2>
<p>In this article, we went through the why and how of banner tables, ending with an example created with an R script from data collected in a retrospective benchmark study of attitudes toward social media platforms. We discussed that banner tables:</p>
<p><strong>Compress many crosstabs into one viewable table.</strong><br />
The compression of many crosstabs into a single banner table allows stakeholders to quickly scan results without having to flip between multiple slides.</p>
<p><strong>Are created with common analysis tools like R and AI.</strong><br />
Numerous software tools can create banner tables; in our example, we used R packages to generate the table. You can also easily have AI create these for you.</p>
<p><strong>Are ideal for large surveys with segmentation.</strong><br />
Use banner tables to summarize survey results (especially with large sample sizes) when comparing metrics across multiple segments.</p>
<p><strong>Are common in market research but useful in UX.</strong><br />
Banner tables are widely used in market research and, while less frequently requested in UX research, can be the right deliverable when you want to convey cross-tab metrics compactly.</p>
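<p>To make the rows in Table 2 more concrete, here is a minimal Python sketch (not the R script from the Appendix) of how one row of unweighted and weighted Top-2 Box percentages could be computed across segments. The respondents, segment names, weights, and the 5-point scale are all hypothetical and chosen only for illustration.</p>

```python
def top2_box_percent(ratings, weights=None, top_values=(4, 5)):
    """Percent of respondents choosing one of the top two scale points.

    Assumes a 5-point scale; pass weights for the weighted version.
    """
    if weights is None:
        weights = [1.0] * len(ratings)
    total = sum(weights)
    if total == 0:
        return None  # empty segment: nothing to report
    hits = sum(w for r, w in zip(ratings, weights) if r in top_values)
    return 100.0 * hits / total


# Hypothetical respondents: (segment, rating on a 5-point scale, weight).
respondents = [
    ("Platform A", 5, 1.2), ("Platform A", 2, 0.8), ("Platform A", 4, 1.0),
    ("Platform B", 3, 1.5), ("Platform B", 5, 0.5), ("Platform B", 1, 1.0),
]

# One banner row: a Total column followed by one column per segment.
banner_row = {}
for segment in ["Total", "Platform A", "Platform B"]:
    rows = [r for r in respondents if segment == "Total" or r[0] == segment]
    ratings = [r[1] for r in rows]
    weights = [r[2] for r in rows]
    banner_row[segment] = (
        top2_box_percent(ratings),           # unweighted
        top2_box_percent(ratings, weights),  # weighted
    )

for segment, (unweighted, weighted) in banner_row.items():
    print(f"{segment}: {unweighted:.1f}% unweighted, {weighted:.1f}% weighted")
```

<p>A full banner table repeats this calculation for each metric (row) and each segment (column), which is why a script or other tool is usually used to generate it.</p>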
<h2><span lang="EN-US">Appendix</span></h2>
<p>Use the link below to download a PDF of the R script. It’s specific to this example but could certainly be edited for use with other data. The script is formatted so you can select all, copy, modify, and then paste the code into R or RStudio.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/rscriptForBannerTables.pdf">Click here for the R script</a></p>
<p><strong>AI note:</strong> We used ChatGPT 5.2 to create and iterate the R script until it worked as desired (which took about six hours, including debugging some weird roundoff errors). For the final table, we did a little additional formatting by hand (e.g., making the empty columns smaller with light green fill, freezing the top row and the left two columns for easier browsing of the crosstab sections).</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Assistant, Analyst, and User: How We’re Examining AI in UX</title>
		<link>https://measuringu.com/ai-as-uxr-assistant-user-and-analyst/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ai-as-uxr-assistant-user-and-analyst</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD&nbsp;•&nbsp;Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 18 Mar 2026 00:48:07 +0000</pubDate>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47094</guid>

					<description><![CDATA[It seems like AI is almost everywhere. For many people, it is. From the moment we wake up, AI increasingly shapes our daily experiences. Music playlists are generated automatically. Our computers prompt us to use AI assistants. Internet searches are now often preceded by AI-generated summaries. Call a doctor’s office after hours, and an AI [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47108 size-medium" src="https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-300x169.png" alt="Header image showing a person communicating with 3 AI robots" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>It seems like AI is almost everywhere. For many people, it is.</p>
<p>From the moment we wake up, AI increasingly shapes our daily experiences. Music playlists are generated automatically. Our computers prompt us to use AI assistants. Internet searches are now often preceded by AI-generated summaries.</p>
<p>Call a doctor’s office after hours, and an AI voice assistant may help schedule your appointment. Chat with customer support, and you’ll likely interact with a chatbot before reaching a human. Write an email, and AI offers suggestions. Start a meeting, and AI software generates notes and summaries. Need an image to make a point? Use AI to generate one from a textual description (e.g., Figure 1).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03172026.jpg"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47095" src="https://measuringu.com/wp-content/uploads/2026/03/03172026.jpg" alt="Image showing the ubiquity of AI." width="881" height="588" srcset="https://measuringu.com/wp-content/uploads/2026/03/03172026.jpg 881w, https://measuringu.com/wp-content/uploads/2026/03/03172026-300x200.jpg 300w, https://measuringu.com/wp-content/uploads/2026/03/03172026-768x513.jpg 768w, https://measuringu.com/wp-content/uploads/2026/03/03172026-600x400.jpg 600w" sizes="auto, (max-width: 881px) 100vw, 881px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> The ubiquity of AI.</p>
<p>And of course, AI’s influence affects what we do in UX research.</p>
<p>But is AI helping? Is it making us more efficient, more accurate? Or is it actually just making us <a href="https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies-it">work more intensely</a>?</p>
<p>Of course, there are voices who overhype its efficacy in UX Research and Design. There are also voices who dismiss it as a fad. Increasingly, the latter is becoming a less tenable position.</p>
<p>We’re more pragmatic at MeasuringU and have an aversion to extreme attitudes. The <a href="https://en.wikipedia.org/wiki/Golden_mean_(philosophy)">Aristotelian golden mean</a> between extremes is part of our company DNA.</p>
<p>We lean into empiricism and judge the efficacy of claims using data. We also critically evaluate the quality of the evidence. An anecdote about improved productivity from a software company is not the same as a controlled study.</p>
<p>As is often the case with fast-changing technology, there’s a dearth of high-quality studies that allow us to separate the hype from the hypothesis testing. We’re actively conducting studies and literature reviews to quantify the extent to which different applications of AI to UX research are useful.</p>
<p>A good way to assess the evidence and group our research is to think about AI’s impact in UX research in three categories: AI as Research Assistant, AI as (Synthetic) User, and AI as Researcher.</p>
<h2><span lang="EN-US">AI as Research Assistant</span></h2>
<p>Let’s start with something less controversial and rather commonplace. That is, researchers using AI tools to assist (usually expedite) research.</p>
<p>Many UX research teams use AI for the following tasks, and researchers appear to receive these AI assistants well as a way to increase research speed or improve research quality. Questions remain about measurable quality criteria, failure modes, and the role of the human in the loop.</p>
<ul>
<li>Coding comments into categories</li>
<li>Cleaning data</li>
<li>Translation and localization</li>
<li>Analyzing interviews to find themes</li>
<li>Developing insights from transcripts</li>
<li>Building and modifying participant screeners</li>
<li>Writing and editing survey questions</li>
<li>Detecting bias and other quality issues in questions</li>
<li>Identifying categories from card sort results</li>
<li>Developing and editing task scenarios</li>
<li>Developing and editing test plans</li>
</ul>
<p>There’s more to do, but we’ve already made some progress investigating the role of AI as a research assistant in comment classification.</p>
<h3><span lang="EN-US">AI and Human Classification of Comments</span></h3>
<p>One of the first analyses we conducted on using AI to code comments was promising. We used three runs of <a href="https://measuringu.com/classification-agreement-between-ux-researchers-and-chatgpt/">ChatGPT-4 to classify comments</a> in UX research and compared its results (in 2023!) to three human coders. We found only slightly lower interrater reliabilities between human coders and ChatGPT than between human coders alone, with three caveats:</p>
<ul>
<li>Human coders were more likely to assign single comments to their own themes.</li>
<li>Different prompts had different levels of effectiveness (prompt specificity matters).</li>
<li>AI outputs with the same prompt were similar but not identical, and the run-to-run variation was large enough to make it necessary to run AI analyses multiple times.</li>
</ul>
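<p>As a rough illustration of what an interrater reliability comparison involves, the Python sketch below computes Cohen’s kappa between one human coder and one AI run. Kappa is assumed here as the agreement statistic (the summary above doesn’t name one), and the theme codes are hypothetical, not data from the study.</p>

```python
from collections import Counter


def cohen_kappa(codes_1, codes_2):
    """Chance-corrected agreement between two coders over the same comments."""
    if len(codes_1) != len(codes_2) or not codes_1:
        raise ValueError("Both coders must rate the same non-empty set of comments.")
    n = len(codes_1)
    # Proportion of comments where the two coders chose the same theme.
    observed = sum(a == b for a, b in zip(codes_1, codes_2)) / n
    # Agreement expected by chance, from each coder's theme frequencies.
    counts_1, counts_2 = Counter(codes_1), Counter(codes_2)
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)
    if expected == 1:
        return 1.0  # both coders used one identical theme throughout
    return (observed - expected) / (1 - expected)


# Hypothetical theme codes for eight comments (illustration only).
human = ["price", "price", "bugs", "design", "bugs", "design", "price", "bugs"]
ai_run = ["price", "bugs", "bugs", "design", "bugs", "design", "price", "price"]

kappa = cohen_kappa(human, ai_run)
print(f"kappa = {kappa:.2f}")
```

<p>Given the run-to-run variation noted above, one reasonable approach would be to repeat this calculation for each AI run and each human coder, then compare the human–AI values with the human–human values.</p>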
<p>We plan to investigate how well newer AI products perform this task.</p>
<h2><span lang="EN-US">AI as Synthetic User: Synthetic Attitudes vs. Synthetic Actions</span></h2>
<p>Now we move into a category that gets <a href="https://quantuxblog.com/synthetic-survey-data-its-not-data">a lot of people fired up</a>, and for good reason. Any time you take the user out of UX, it becomes objectionable as a matter of principle. But again, we try to be open-minded. After all, inspection methods like <a href="https://measuringu.com/effective-he/">heuristic evaluation</a>, <a href="https://measuringu.com/ux-metrics-pure/">PURE</a>, and <a href="https://measuringu.com/inspection-methods/">guideline reviews</a> are part of the UX research toolbox even though users aren’t directly involved.</p>
<p>We see an important distinction between synthetic user attitudes and synthetic user behaviors, both of which have yet to be fully explored.</p>
<ul>
<li>Synthetic survey respondents (attitudes and reported behaviors): AI-generated responses to rating scales that measure things like satisfaction, intention, and usability, and to questions about behaviors like product ownership and usage</li>
<li>Synthetic users of task-based studies (behaviors): AI-generated responses to task-based scenarios used in usability testing</li>
<li>Synthetic users of information architecture tasks (tree tests, card sorts)</li>
</ul>
<p>We have not conducted studies that use data from individually crafted synthetic users, but we have experimented with comparing AI predictions of user behaviors and attitudes for card sorting and tree testing, with mixed success.</p>
<h3><span lang="EN-US">AI and Human Analysis of Card Sorting Results</span></h3>
<p>AI’s ability <a href="https://measuringu.com/comparing-chatgpt-to-card-sorting-results/">to sort items into groups</a>, as in a card sort, was actually reasonably good. When we compared ChatGPT-4’s groupings and group names with the groups synthesized by human researchers from a standard open card sort, we found strong similarity in the numbers and names of categories. Items matched most of the time, the interrater reliability between the two methods was moderate to substantial, and there weren’t any obviously bad ChatGPT placements.</p>
<h3><span lang="EN-US">AI and Human Tree Testing Results</span></h3>
<p>Our tree testing results were mixed. Based on data collected with <a href="https://measuringu.com/chatgpt4-tree-test/">multiple iterations of ChatGPT-4 and 33 participants</a> finding the location of target items in a tree structure based on the IRS website, and using the SEQ to assess perceived task difficulty, we found that ChatGPT performed too well and <strong>was not suitable</strong> for estimating how well humans will find items in a tree test. However, ChatGPT predicted people’s ease ratings of the search tasks <strong>with reasonable accuracy</strong>.</p>
<h2><span lang="EN-US">AI as Researcher</span></h2>
<p>This category covers more advanced tasks where AI might take a more central role in the research itself, but it isn’t clear how AI output compares to human output in accuracy or in the amount of time saved (if any). Two ways in which AI might replace researchers are as analysts and moderators.</p>
<h3><span lang="EN-US">AI as Analyst</span></h3>
<p>A lot of human data analysis is repetitive, making it attractive for replacement with AI (Figure 2).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03172026-F2.jpg"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47096" src="https://measuringu.com/wp-content/uploads/2026/03/03172026-F2.jpg" alt="Cartoon showing a robot applying for a mindless and repetitive job." width="1043" height="561" srcset="https://measuringu.com/wp-content/uploads/2026/03/03172026-F2.jpg 1043w, https://measuringu.com/wp-content/uploads/2026/03/03172026-F2-300x161.jpg 300w, https://measuringu.com/wp-content/uploads/2026/03/03172026-F2-1024x551.jpg 1024w, https://measuringu.com/wp-content/uploads/2026/03/03172026-F2-768x413.jpg 768w, https://measuringu.com/wp-content/uploads/2026/03/03172026-F2-600x323.jpg 600w" sizes="auto, (max-width: 1043px) 100vw, 1043px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Robot applying for a job.</p>
<p>Other human data analysis is less repetitive and more dependent on contextual knowledge and human judgment (e.g., identification of usability problems). Some of the opportunities we envision for AI as an analyst (but which need development and validation) are:</p>
<ul>
<li>Validating screenshots to determine task success</li>
<li>Identifying usability problems from image analysis</li>
<li>Identifying usability problems from videos</li>
<li>Heuristic evaluation from analysis of videos, images, and websites</li>
<li>Advanced inspection analyses (<a href="https://measuringu.com/pure/">PURE</a>, <a href="https://measuringu.com/predicted-times/">KLM/GOMS</a>)</li>
<li>Analyzing datasets</li>
</ul>
<h3><span lang="EN-US">AI as Moderator</span></h3>
<p>Research moderation seems like a quintessentially human activity. However, advances in AI avatars, LLM dialog management, and synthetic speech production have led to the development of AI agents that could be applied to a <a href="https://www.nngroup.com/articles/ai-interviewers/">variety of moderation tasks</a>. Research in this area should focus on understanding when it works, when it fails, and how to validate quality.</p>
<ul>
<li>Simple interviews</li>
<li>Complex interviews</li>
<li>Moderated usability tests</li>
</ul>
<h2><span lang="EN-US">AI Adoption and Attitudes</span></h2>
<p>We have conducted studies of attitudes toward AI usage by the general public and by UX researchers, and we plan to run follow-ups.</p>
<p>We’ve already published research on attitudes of UX researchers regarding the use of AI in UX (in association with UXPA) and attitudes of a general population of users toward three AI chat products.</p>
<p>In addition to examining how AI may function as an assistant, analyst, or synthetic user in UX research, it’s useful to understand how widely AI tools are already being used and how people perceive them. Some recent studies provide insight into both adoption and user experience with AI-based systems.</p>
<h3><span lang="EN-US">How Much Is AI Used in UX? </span></h3>
<p><em>More than you might think.</em> While our <a href="https://measuringu.com/how-much-is-ai-used-in-ux/">industry data</a> from 2024 is due for a refresh, we found that about half of UX professionals had used AI (but 20% were not impressed). More companies supported using AI than discouraged it (by about 6 to 1). Most respondents expected to use AI more in 2025, but expectations over the next five years were mixed.</p>
<h3><span lang="EN-US">Retrospective Benchmark of ChatGPT, Claude, and Gemini</span></h3>
<p>In January and February 2025, we conducted a retrospective study on <a href="https://measuringu.com/ai-based-chat-software-ux-2025/">three AI-based chat products</a> (ChatGPT, Claude, Gemini) with 153 U.S.-based panel participants. This study included metrics from our standard UX &amp; NPS survey as part of our larger consumer software data collection effort. All products had high and similar Net Promoter Scores. Reported issues included accuracy, generic content, and limited free versions.</p>
<h2><span lang="EN-US">Summary</span></h2>
<p>It can be easy to be seduced into extreme views about emerging technologies. They can be cast as the best thing ever or a complete waste of time. Our recommendation is a more pragmatic, empirical approach. Rather than relying on anecdotes or hype, we encourage evaluating the role of AI in UX research with data.</p>
<p>One useful way to think about AI in UX research is to group its applications into three roles:</p>
<ul>
<li><strong>AI as Research Assistant.</strong> Tools that improve the quality and quantity of the work that UX researchers already do, such as coding comments, summarizing interviews, and generating study materials.</li>
</ul>
<ul>
<li><strong>AI as Synthetic User.</strong> Systems that simulate user attitudes or behaviors. An important distinction is between synthetic attitudes and synthetic actions. Our early work suggests some promise in modeling behavior, but much less evidence for synthetic attitudes.</li>
</ul>
<ul>
<li><strong>AI as Research Analyst.</strong> Applications where AI plays a more central role in analysis—identifying usability issues from images or videos, evaluating task success, or even assisting with research moderation.</li>
</ul>
<p>There is still much to learn. In the coming year, we plan to continue studying these areas and revisit both the usage of AI tools and attitudes toward them. Our goal is not to promote or dismiss AI, but to understand, through evidence, where it genuinely improves UX research.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Bayes&#8217; Law in UX Research: From Urns to Users</title>
		<link>https://measuringu.com/bayes-law-in-ux-research-from-urns-to-users/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=bayes-law-in-ux-research-from-urns-to-users</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 10 Mar 2026 23:16:59 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[Bayesian theorem]]></category>
		<category><![CDATA[Statistics]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46809</guid>

					<description><![CDATA[&#8220;Follow the data. Update your beliefs.&#8221; We like the idea of applying iterative Bayesian thinking to how we test hypotheses and conduct UX research. The idea is simple, but modern Bayesian math can be opaque and hard to understand. We have questions about how well Bayesian analysis works relative to frequentist analysis. We are also [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/03/031026-FeatureImage.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46937 size-medium" src="https://measuringu.com/wp-content/uploads/2026/03/031026-FeatureImage-300x169.png" alt="Feature image showing an urn, a math equation and a bust memorial" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/03/031026-FeatureImage-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/03/031026-FeatureImage-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/031026-FeatureImage-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/03/031026-FeatureImage-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/031026-FeatureImage-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/03/031026-FeatureImage.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>&#8220;Follow the data. Update your beliefs.&#8221;</p>
<p>We like the idea of applying iterative Bayesian thinking to how we test hypotheses and conduct UX research.</p>
<p>The idea is simple, but modern Bayesian math can be opaque and hard to understand.</p>
<p>We have questions about how well Bayesian analysis works relative to <a href="https://www.statsig.com/perspectives/bayesian-or-frequentist-choosing-your-statistical-approach">frequentist analysis</a>. We are also intrigued by the possibility of <a href="https://measuringu.com/intro-to-bayesian-thinking-in-ux-research">Bayesian thinking in UX research</a>.</p>
<p>The best way to understand how to apply Bayesian thinking and math to UX is to match the original Bayesian examples from hundreds of years ago to modern problems in UX. And it starts with urns.</p>
<h2><span lang="EN-US">Statistics, Probability, and Urns</span></h2>
<p>Statistics is abstract. Probability is hard to understand. And conditional probability is harder still.</p>
<p>One way to make probability more concrete is through cards, dice, and coins. We’re not trying to make people compulsive gamblers. Historically, however, many of our modern formulas come from games of chance. If you can understand how to win a card game or avoid a costly mistake at the roulette table, you tend to pay more attention.</p>
<p>And this is where Thomas Bayes’ <a href="https://bayes.wustl.edu/Manual/an.essay.pdf">famous essay</a> on the logic of updating beliefs from observed outcomes (e.g., successes and failures) comes into play. This work was published in 1763, two years after his death, a reminder that sometimes the private scraps of an idea can publicly change the world.</p>
<p>While cards and dice work for teaching basic probability, conditional probability is often clearer with a different classic tool: the urn problem. In its simplest form, two urns contain different proportions of colored balls. After drawing a sample, the task is to determine which urn most likely produced it.</p>
<p>The advantage of urn problems is flexibility. Unlike coins (two sides), dice (six sides), or cards (fixed suits and ranks), urns allow probabilities to vary in ways that make Bayesian comparisons easier to illustrate.</p>
<p>We modified an example from Cowles’ excellent book, <a href="https://www.amazon.com/Statistics-Psychology-Perspective-Michael-Cowles/dp/080580031X"><em>Statistics in Psychology: An Historical Perspective</em></a>, to demonstrate the use of Bayesian analysis to assess the relative likelihood of competing urn hypotheses (as Cowles noted on p. 75, “Traditionally, writers on Bayes make heavy use of ‘urn problems’”).</p>
<p>For the example depicted in Figure 1, suppose you have a sample of 20 balls where 18 are green and 2 are red. Is this sample more likely to have come from an urn with 90% green balls (Urn A) or one with 78% green balls (Urn B)?</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03102026-F1.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-46810" src="https://measuringu.com/wp-content/uploads/2026/03/03102026-F1.png" alt="Depiction of a classical urn problem." width="399" height="267" srcset="https://measuringu.com/wp-content/uploads/2026/03/03102026-F1.png 399w, https://measuringu.com/wp-content/uploads/2026/03/03102026-F1-300x201.png 300w" sizes="auto, (max-width: 399px) 100vw, 399px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Depiction of a classical urn problem.</p>
<p>Before we go through the math to compare the likelihoods, let’s update the narrative. After all, UX researchers don’t work with urns. We work with users.</p>
<h2><span lang="EN-US">From Balls in Urns to Users Checking Out</span></h2>
<p>In our <a href="https://measuringu.com/intro-to-bayesian-thinking-in-ux-research">earlier article</a>, we introduced Bayesian thinking using a checkout completion example with 20 participants, 18 of whom successfully completed the task and two of whom failed.</p>
<p>So instead of asking which urn produced a sample of green and red balls, let’s ask a question more relevant to UX research: <strong>Is this checkout experience performing at a historical level, or is it consistent with a more aspirational goal?</strong></p>
<p>That’s the same math, just with a better context.</p>
<p>Instead of trying to figure out which urn a sample of balls came from, we want to know which of two possible completion rates is more likely.</p>
<ol>
<li>Historical Completion Rate of 78% (Hypothesis H): This comes from <a href="https://measuringu.com/task-completion/">a historical average with an overall 78% completion</a>.</li>
<li>Aspirational Completion Rate of 90% (Hypothesis A): It’s aspirational because in this hypothetical example, we have a reason to believe (or at least hope) the checkout flow is better than average.</li>
</ol>
<p>Because the observed percentage of success from the sample is exactly 90%, this seems more consistent with the aspirational hypothesis. But let’s work through the math using Bayes’ theorem.</p>
<p>Comparing these two hypotheses, the exact probability of the <em>aspirational 90% hypothesis</em> is:</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/031026-Formula1.png"><img loading="lazy" decoding="async" class="aligncenter wp-image-46940" src="https://measuringu.com/wp-content/uploads/2026/03/031026-Formula1-300x51.png" alt="Bayesian formula to calculate probability" width="411" height="70" srcset="https://measuringu.com/wp-content/uploads/2026/03/031026-Formula1-300x51.png 300w, https://measuringu.com/wp-content/uploads/2026/03/031026-Formula1-1024x174.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/031026-Formula1-768x131.png 768w, https://measuringu.com/wp-content/uploads/2026/03/031026-Formula1-1536x262.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/031026-Formula1-2048x349.png 2048w, https://measuringu.com/wp-content/uploads/2026/03/031026-Formula1-600x102.png 600w" sizes="auto, (max-width: 411px) 100vw, 411px" /></a></p>
<p>where:</p>
<p><em>P(D|A) </em>is the probability of getting this sample (the data, <em>D</em>) if the aspirational hypothesis is true.</p>
<p><em>P(D|H) </em>is the probability of getting this sample if the historical hypothesis is true.</p>
<p><em>P(A) </em>is our expected (prior) probability that the aspirational hypothesis is true.</p>
<p><em>P(H) </em>is our expected (prior) probability that the historical hypothesis is true.</p>
<p><em>P(A|D)</em> is the conditional probability of the aspirational hypothesis given the sample.</p>
<p>Using binomial probabilities, <em>P</em>(<em>D</em>|<em>A</em>) is (0.9)<sup>18</sup> × (0.1)<sup>2</sup> = 0.0015 and <em>P</em>(<em>D</em>|<em>H</em>) is (0.78)<sup>18</sup> × (0.22)<sup>2</sup> = 0.00055. (The binomial coefficient is the same under both hypotheses, so it cancels in the formula and is omitted here.) Assuming we have no prior belief favoring either hypothesis (so <em>P</em>(<em>A</em>) = <em>P</em>(<em>H</em>) = 0.5), we get:</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/031026-Formula2.png"><img loading="lazy" decoding="async" class="aligncenter wp-image-46941" src="https://measuringu.com/wp-content/uploads/2026/03/031026-Formula2-300x49.png" alt="Filled in formula for calculating Bayesian probability." width="432" height="70" srcset="https://measuringu.com/wp-content/uploads/2026/03/031026-Formula2-300x49.png 300w, https://measuringu.com/wp-content/uploads/2026/03/031026-Formula2-1024x166.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/031026-Formula2-768x124.png 768w, https://measuringu.com/wp-content/uploads/2026/03/031026-Formula2-1536x249.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/031026-Formula2-2048x332.png 2048w, https://measuringu.com/wp-content/uploads/2026/03/031026-Formula2-600x97.png 600w" sizes="auto, (max-width: 432px) 100vw, 432px" /></a></p>
<p>which equals 0.00075/(0.00075 + 0.000275) = 0.00075/0.001025 = 0.732 (73.2%). Because <em>P</em>(<em>A</em>|<em>D</em>) + <em>P</em>(<em>H</em>|<em>D</em>) = 1, <em>P</em>(<em>H</em>|<em>D</em>) = 0.268 (26.8%).</p>
<p>We conclude there is a substantial probability that the historical hypothesis might be true (26.8% isn’t anywhere near 0%), but the aspirational hypothesis is <strong>2.7 times more likely</strong>.</p>
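<p>For readers who want to check the arithmetic, here is a minimal Python sketch of the same two-hypothesis calculation. The function names are ours and purely illustrative, and the sketch assumes the aspirational and historical hypotheses are the only two under consideration.</p>

```python
def likelihood(successes, failures, rate):
    """Probability of one specific sequence of outcomes at a given success rate.

    The binomial coefficient is left out because it is identical for both
    hypotheses and cancels in Bayes' theorem.
    """
    return rate ** successes * (1 - rate) ** failures


def posterior_aspirational(successes, failures, rate_a, rate_h, prior_a=0.5):
    """P(A|D) when A and H are the only two hypotheses considered."""
    prior_h = 1 - prior_a
    weighted_a = likelihood(successes, failures, rate_a) * prior_a
    weighted_h = likelihood(successes, failures, rate_h) * prior_h
    return weighted_a / (weighted_a + weighted_h)


# 18 successes and 2 failures; aspirational rate 90%, historical rate 78%.
p_a = posterior_aspirational(18, 2, rate_a=0.90, rate_h=0.78)
p_h = 1 - p_a
print(f"P(A|D) = {p_a:.3f}, P(H|D) = {p_h:.3f}, ratio = {p_a / p_h:.1f}")
```

<p>Because the code doesn’t round the likelihoods before dividing, it gives about 0.731 for <em>P</em>(<em>A</em>|<em>D</em>) rather than the 0.732 obtained from the rounded values above. The ratio of about 2.7 is unchanged.</p>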
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>In this article, we extended a classic Bayesian urn exercise to illustrate one way to apply Bayesian analysis to a UX research context.</p>
<p>We showed how you can use a relatively simple version of Bayes’ theorem to compare the likelihoods of two hypotheses from observed completion rate data.</p>
<p>Even though we couldn’t say with certainty that either hypothesis was implausible, the data were clearly a better fit for the aspirational hypothesis.</p>
<p>But this leaves more questions:</p>
<ol>
<li>Is comparing two hypotheses in this way a better approach than just using a confidence (or credible) interval, or using binomial tests to compare the sample against the two benchmarks (78%, 90%)?</li>
<li>How do we come up with a solid prior? We avoided that question in this example by using the <a href="https://en.wikipedia.org/wiki/Principle_of_indifference">principle of indifference</a> (setting both priors to 0.5). But what if we had a good reason to believe that the historical value of 78% should receive more (or less) weight in the calculation? How much could that change the <a href="https://en.wikipedia.org/wiki/Posterior_probability">posterior belief</a> (<em>P</em>(<em>A</em>|<em>D</em>)) and consequent decisions?</li>
</ol>
<p>We’ll address these questions in future articles.</p>
<p>Note that we aren’t recommending this specific method in UX analysis. One of our primary goals in this article is to illustrate the mechanics of Bayesian analysis with simple algebra and binomial probabilities. The downside of this is that we had to assign specific prior probabilities rather than following the current practice of <a href="https://bookdown.org/pbaumgartner/bayesian-fun/05-beta-distribution.html">using beta distributions for priors</a>. This does not, however, affect the logic of the discussion.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Why You Should Not Compute Medians for Individual Rating Scales</title>
		<link>https://measuringu.com/means-vs-medians-for-rating-scales/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=means-vs-medians-for-rating-scales</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Wed, 04 Mar 2026 00:25:57 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Rating Scale]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[mean]]></category>
		<category><![CDATA[median]]></category>
		<category><![CDATA[Rating Scales]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46780</guid>

					<description><![CDATA[Say you collect rating scale data from dozens of users across ten apps. To analyze the data, you compute medians because you learned that rating scale data isn’t interval or ratio data. The medians of all ten apps end up the same. They’re all 4! If you rely on the medians, you’d conclude the apps [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46790 size-medium" src="https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-300x169.png" alt="Feature image showing two persons inspecting a rating scale and an infographic" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/02/030326-FeatureImage-1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>Say you collect rating scale data from dozens of users across ten apps. To analyze the data, you compute medians because you learned that rating scale data isn’t interval or ratio data.</p>
<p>The medians of all ten apps end up the same. They’re all 4!</p>
<p>If you rely on the medians, you’d conclude the apps are essentially equivalent.</p>
<p>But if you compute the means, the ratings range from 3.6 to 4.6, providing a much clearer differentiation.</p>
<p>How can the same dataset produce such different stories? What’s the “right” way?<a href="https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46781 size-full" src="https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon.png" alt="Cartoon showing researcher objecting to computing means" width="779" height="346" srcset="https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon.png 779w, https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon-300x133.png 300w, https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon-768x341.png 768w, https://measuringu.com/wp-content/uploads/2026/02/03032026-Cartoon-600x266.png 600w" sizes="auto, (max-width: 779px) 100vw, 779px" /></a></p>
<p>Why are some researchers so adamant about NOT computing the means of rating scales like the Single Ease Question (Figure 1)? In this article, we explain why taking the median of rating scale data is a poor practice.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/03032026-F1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46782" src="https://measuringu.com/wp-content/uploads/2026/02/03032026-F1.png" alt="The Single Ease Question" width="1200" height="222" srcset="https://measuringu.com/wp-content/uploads/2026/02/03032026-F1.png 1430w, https://measuringu.com/wp-content/uploads/2026/02/03032026-F1-300x56.png 300w, https://measuringu.com/wp-content/uploads/2026/02/03032026-F1-1024x190.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/03032026-F1-768x142.png 768w, https://measuringu.com/wp-content/uploads/2026/02/03032026-F1-600x111.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> The Single Ease Question (SEQ<sup>®</sup>).</p>
<h2><span lang="EN-US">Stevens in 1946 Said Ordinal Data Can’t Be Averaged</span></h2>
<p>Ever since <a href="https://en.wikipedia.org/wiki/Stanley_Smith_Stevens">S. S. Stevens</a> declared in 1946 that <a href="https://pages.gseis.ucla.edu/faculty/richardson/Courses/stevens1946.pdf">numbers are not all created equal</a> by categorizing them as <a href="https://measuringu.com/data-types/">ratio, interval, ordinal, and nominal</a>, analysts have debated whether it’s legitimate to compute the mean of multipoint rating scales such as the SEQ. Based on his “<a href="https://en.wikipedia.org/wiki/Level_of_measurement">principle of invariance</a>,” he argued against doing anything more than counting nominal and ordinal data, which restricts addition, subtraction, multiplication, and division to interval and ratio data. These are exactly the operations needed to compute the mean of a set of data: “Thus, the mean is appropriate to an interval scale and also to a ratio scale (but not, of course, to an ordinal or a nominal scale)” (Stevens, 1959, p. 28).</p>
<h3><span lang="EN-US">But Lord in 1953 Says Numbers Don’t Know They Are Ordinal</span></h3>
<p>It didn’t take long for other statisticians and measurement theorists to craft arguments against the proposed policy of restricting analysis of ordinal and nominal data to counts and medians. Probably the most famous counterargument was by Lord (1953). And we’re not referring to the <a href="https://en.wikipedia.org/wiki/Lorde">“Royals” singer</a> nor a deity, but a late psychologist with a divine name and lasting contributions (including the <a href="https://en.wikipedia.org/wiki/Frederic_M._Lord#:~:text=Frederic%20Mather%20Lord%20(November%2012%2C%201912%20%E2%80%93,TOEFL%20are%20all%20based%20on%20Lord's%20research.">SAT and GRE</a> tests).</p>
<p>In his <a href="https://is.muni.cz/el/fss/jaro2010/PSY454/um/Frederick_Lord_On_the_statistical_treatment_of_football_numbers.pdf">parable of a retired professor</a>, Lord described a machine used to randomly assign football numbers to the jerseys of freshman and sophomore football players at his university … a clear use of numbers as labels (<strong>nominal data</strong>). After receiving their numbers, the freshmen complained that the assignment wasn’t random. They claimed to have received generally smaller numbers than the sophomores and that the sophomores must have tampered with the machine.</p>
<p>The professor consulted with a statistician to investigate how likely it was that the freshmen got their low numbers by chance. Over the professor’s objections, the statistician determined the population mean and standard deviation of the football numbers as 54.3 and 16.0, respectively. He found that the mean of the freshmen’s numbers was too low to have happened by chance, strongly indicating that the sophomores had tampered with the football number machine to get larger numbers. The professor objected to the analysis because the numbers weren’t even ordinal, but the statistician replied, “The numbers don’t know that; since the numbers don’t remember where they came from, they always behave just the same way, regardless.”</p>
<h3><span lang="EN-US">Even Nonparametric Tests Quietly Compute Means</span></h3>
<p>For analyzing ordinal data, some researchers have recommended using statistical methods that are similar to the well-known <em>t</em>- and <em>F</em>-tests, but which replace the original data with ranks before analysis. These are the so-called nonparametric methods (e.g., the Mann–Whitney <em>U</em> test, the Friedman test, or the Kruskal–Wallis test). But here’s the dirty secret: These methods actually compute the means of the ranks (or an equivalent process), which are ordinal (not interval or ratio) data! Despite these violations of permissible data manipulations from Stevens’ point of view, those methods work perfectly well.</p>
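<p>To make the point about ranks concrete, here is a small Python sketch that ranks two sets of ratings together and then derives the Mann–Whitney <em>U</em> statistic from a rank sum. The ratings are hypothetical and the helper function is ours. Because the rank sum is just the mean rank times the group size, <em>U</em> is arithmetic on mean ranks.</p>

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


group_a = [5, 4, 4, 5, 3]  # hypothetical ratings for design A
group_b = [3, 2, 4, 3, 2]  # hypothetical ratings for design B

ranks = average_ranks(group_a + group_b)
ranks_a = ranks[:len(group_a)]
ranks_b = ranks[len(group_a):]

mean_rank_a = sum(ranks_a) / len(ranks_a)
mean_rank_b = sum(ranks_b) / len(ranks_b)

# U for group A: its rank sum minus the smallest rank sum it could have had.
u_a = sum(ranks_a) - len(group_a) * (len(group_a) + 1) / 2

print(mean_rank_a, mean_rank_b, u_a)
```

<p>This sketch stops at the test statistic. Getting a <em>p</em>-value would require the sampling distribution of <em>U</em>, which is not shown here.</p>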
<h2><span lang="EN-US">Why Medians Are Poor Estimates of Central Tendency for Rating Scales</span></h2>
<p>When is computing a median a good practice, and why doesn’t it work well with rating scales?</p>
<h3><span lang="EN-US">When to Compute a Median</span></h3>
<p>The mean and median are both common ways to measure the central tendency of a set of data. To calculate the mean, add up the data points and divide by the total number in the group (the sample size, <em>n</em>). With the mean, every data point contributes to the estimate. The median is simply the center point of a distribution (or, if there is an even number of points, the average of the two center points).</p>
<p>The mean usually works well as a measure of central tendency, especially when the distribution is roughly symmetrical. When the data aren’t symmetrical, however, a few extreme data points can pull the mean so far toward them (as often happens with time data and currency values) that it’s no longer a good estimate of central tendency. When that happens, the median can be a better estimate of central tendency than the mean.</p>
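<p>A quick Python illustration with made-up task times shows the effect of one extreme value. The numbers are hypothetical and chosen only to make the contrast obvious.</p>

```python
from statistics import mean, median

# Hypothetical task times in seconds; one participant took far longer than the rest.
task_times = [32, 35, 38, 41, 44, 47, 300]

time_mean = mean(task_times)      # pulled upward by the single extreme time
time_median = median(task_times)  # the middle value, unaffected by the extreme time

print(round(time_mean, 1), time_median)
```

<p>Six of the seven times are under 50 seconds, yet the mean lands well above all of them, while the median stays in the middle of the typical times.</p>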
<h3><span lang="EN-US">But Rating Scales Are Bounded and Discrete</span></h3>
<p>The examples of data types that can be better summarized with the median than the mean have two things in common:</p>
<ul>
<li>An unlimited range with a small number of extreme scores</li>
<li>Continuous measurement</li>
</ul>
<p>Rating scales, however, have a limited range and fundamentally discrete measurements. Because the ratings are discrete, the median can take only one of a restricted number of values regardless of the sample size. For a five-point scale, the median can take only the following values, no matter how large the sample: 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, and 5.0. (And it can take the intermediate values only when <em>n </em>is even.)</p>
<p>The mean, on the other hand, can take any multiple of 1/<em>n</em> between 1 and 5, so as the sample size increases, the mean becomes more and more continuous. Because the mean can take many more values than the median, a difference between two means can reflect differences between two samples more reliably than a difference between two medians.</p>
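<p>The contrast can be checked by brute force for small samples. The following Python sketch enumerates every possible set of <em>n</em> ratings on a five-point scale and collects the distinct medians and means. The helper name is ours, and the approach is practical only for small <em>n</em>.</p>

```python
from itertools import combinations_with_replacement
from statistics import mean, median

SCALE = range(1, 6)  # five-point rating scale


def possible_values(stat, n):
    """All distinct values a statistic can take for n ratings on SCALE."""
    samples = combinations_with_replacement(SCALE, n)
    return sorted({stat(sample) for sample in samples})


for n in (5, 6):
    medians = possible_values(median, n)
    means = possible_values(mean, n)
    print(n, len(medians), len(means))
```

<p>The median never has more than nine possible values, while the number of possible means is 4<em>n</em> + 1 (one for each possible sum from <em>n</em> to 5<em>n</em>) and keeps growing with the sample size.</p>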
<p>When scales are open-ended (have at least one endpoint at infinity, like time data), extreme values can affect the mean but will not affect medians. Rating scales, however, are not open-ended, so <a href="https://www.researchgate.net/publication/220302331_Multipoint_Scales_Mean_and_Median_Differences_and_Observed_Significance_Levels">the median does not have a compelling advantage over the mean</a> when analyzing individual rating scales. Instead, it is at a distinct disadvantage.</p>
<h2><span lang="EN-US">Eleven Mobile Apps That Look the Same Using Medians (A Real Example)</span></h2>
<p>So, we weren’t making up the story about a bunch of apps having the same median (we just changed the number from eleven to ten). The story comes from our data.</p>
<p>In our 2026 UX benchmark of clothing websites, we asked respondents who used the mobile apps of various companies to rate their usefulness with a five-point scale. Table 1 shows the means, medians, and sample sizes for the companies included in the benchmark.</p>

<table id="tablepress-1021" class="tablepress tablepress-id-1021">
<thead>
<tr class="row-1">
	<th class="column-1"><strong>Mobile App</strong></th><th class="column-2"><strong>Mean</strong></th><th class="column-3"><strong>Median</strong></th><th class="column-4"><strong><i>n</i></strong></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1">Anthropologie</td><td class="column-2">3.94</td><td class="column-3">4.00</td><td class="column-4">18</td>
</tr>
<tr class="row-3">
	<td class="column-1">Athleta</td><td class="column-2">4.26</td><td class="column-3">4.00</td><td class="column-4">23</td>
</tr>
<tr class="row-4">
	<td class="column-1">Banana Republic</td><td class="column-2">4.42</td><td class="column-3">4.00</td><td class="column-4">19</td>
</tr>
<tr class="row-5">
	<td class="column-1">Gap</td><td class="column-2">4.44</td><td class="column-3">4.00</td><td class="column-4">18</td>
</tr>
<tr class="row-6">
	<td class="column-1">H&amp;M</td><td class="column-2">4.64</td><td class="column-3">5.00</td><td class="column-4">11</td>
</tr>
<tr class="row-7">
	<td class="column-1">Lululemon</td><td class="column-2">3.57</td><td class="column-3">4.00</td><td class="column-4">23</td>
</tr>
<tr class="row-8">
	<td class="column-1">Neiman Marcus</td><td class="column-2">4.21</td><td class="column-3">4.00</td><td class="column-4">24</td>
</tr>
<tr class="row-9">
	<td class="column-1">Nordstrom</td><td class="column-2">4.12</td><td class="column-3">4.00</td><td class="column-4">17</td>
</tr>
<tr class="row-10">
	<td class="column-1">Old Navy</td><td class="column-2">4.00</td><td class="column-3">4.00</td><td class="column-4">13</td>
</tr>
<tr class="row-11">
	<td class="column-1">Urban Outfitters</td><td class="column-2">3.91</td><td class="column-3">4.00</td><td class="column-4">22</td>
</tr>
<tr class="row-12">
	<td class="column-1">Zara</td><td class="column-2">4.30</td><td class="column-3">4.00</td><td class="column-4">30</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1021 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Means and medians for usefulness ratings of eleven mobile apps for online clothes shopping.</p>
<p>In this example, all the medians were either 4 or 5. The means, on the other hand, ranged from 3.57 to 4.64 with no duplication, providing a much more nuanced picture of the differences in the ratings (Figure 2).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46794" src="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-300x114.png" alt="Graph of means for usefulness ratings of eleven online clothes shopping apps." width="1200" height="456" srcset="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-300x114.png 300w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-1024x389.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-768x292.png 768w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-1536x583.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-2048x778.png 2048w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-1-600x228.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a><a href="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46795" src="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-300x114.png" alt="Graph of medians for usefulness ratings of eleven online clothes shopping apps. The profile of the means is more informative than the profile of the medians." width="1200" height="456" srcset="https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-300x114.png 300w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-1024x389.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-768x292.png 768w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-1536x584.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-2048x778.png 2048w, https://measuringu.com/wp-content/uploads/2026/03/030326-Figure2-2-600x228.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Comparison of graphs of means and medians for usefulness ratings of eleven online clothes shopping apps. The profile of the means is more informative than the profile of the medians.</p>
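<p>The same comparison can be made directly from the summary values in Table 1. This short Python sketch (the variable names are ours) counts the distinct medians and means and ranks the apps by mean.</p>

```python
# (app, mean, median, n) copied from Table 1.
table_1 = [
    ("Anthropologie", 3.94, 4.0, 18),
    ("Athleta", 4.26, 4.0, 23),
    ("Banana Republic", 4.42, 4.0, 19),
    ("Gap", 4.44, 4.0, 18),
    ("H&M", 4.64, 5.0, 11),
    ("Lululemon", 3.57, 4.0, 23),
    ("Neiman Marcus", 4.21, 4.0, 24),
    ("Nordstrom", 4.12, 4.0, 17),
    ("Old Navy", 4.00, 4.0, 13),
    ("Urban Outfitters", 3.91, 4.0, 22),
    ("Zara", 4.30, 4.0, 30),
]

distinct_medians = sorted({row[2] for row in table_1})
distinct_means = sorted({row[1] for row in table_1})

# Rank the apps from highest to lowest mean usefulness rating.
ranked = sorted(table_1, key=lambda row: row[1], reverse=True)

print(len(distinct_medians), len(distinct_means))
for app, app_mean, app_median, n in ranked:
    print(app, app_mean, app_median, n)
```

<p>The medians collapse the eleven apps into two groups, while the means give each app its own position. Whether any particular pair of means differs reliably is a separate question that depends on the sample sizes and variability.</p>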
<p>And when sample sizes are very small, there usually won’t be much difference between rating scale means and medians. The most extreme example is when <em>n</em> = 2, in which case the mean and median will be the same, but that doesn’t happen in the real world.</p>
<h2>Use Means, But Don’t Overinterpret Them</h2>
<p>So, which is it—not all numbers are equal (Stevens, 1946), or the numbers don’t remember where they came from (Lord, 1953)? Given our backgrounds in applied statistics (and personal experiences attempting to act in accordance with Stevens’ reasoning that didn’t work out very well—that’s a story for another day), we fall firmly in the camp that supports the use of statistical techniques (such as the <em>t</em>-test, analysis of variance, and factor analysis) on ordinal data such as multipoint rating scales. However, you can’t just ignore the level of measurement of your data.</p>
<p>When you make claims about the meaning of the outcomes of your statistical tests, you must be careful not to act as if rating scale data are interval rather than ordinal. An average rating of 4 might be better than an average rating of 2, and a <em>t</em>-test might indicate that, across a group of participants, the difference is consistent enough to be statistically significant. Even so, you can’t claim that it’s twice as good (a ratio claim), nor can you claim that the difference between 4 and 2 is equal to the difference between 4 and 6 (an interval claim). You can only claim that there is a reliably consistent difference.</p>
<p>Although it might surprise some researchers who treat the implications of the levels of measurement as if they were laws, Stevens (1946, p. 679) took a more moderate stance on this topic than most people realize:</p>
<p style="margin-left: 40px;"><em>On the other hand, for this &#8220;illegal&#8221; statisticizing there can be invoked a kind of pragmatic sanction: In numerous instances it leads to fruitful results. While the outlawing of this procedure would probably serve no good purpose, it is proper to point out that means and standard deviations computed on an ordinal scale are in error to the extent that the successive intervals on the scale are unequal in size. When only the rank-order of data is known, we should proceed cautiously with our statistics, and especially with the conclusions we draw from them.</em></p>
<p>Fortunately, even if you make the mistake of thinking one product is twice as good as another when the scale doesn’t justify it, it would be a mistake that often would not affect the practical decision of which product is better.</p>
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>Some analysts strongly advise against computing the means of rating scales, often recommending the computation of medians instead. In this article, we explain why reporting the median of rating scale data doesn’t work as well as reporting the mean.</p>
<p><strong>Medians are better than means when outliers skew continuous, unbounded data. </strong>This pattern is common in measures such as time or money, where a few extreme values can substantially shift the mean.</p>
<p><strong>Rating scales are discrete and bounded, making means more informative than medians. </strong>Even though we spend a lot of money and time collecting data, rating scales aren’t like time and money. For data like this, medians are too coarse to capture the meaningful differences that means are sensitive enough to detect.</p>
<p><strong>Compute means of rating scales, but don’t make interval claims from ordinal data. </strong>Differences between rating scale means indicate consistent ordering, not equal intervals or proportional differences. Even so, they are often very useful in practice.</p>
<p><strong>Bottom line</strong>: When analyzing rating scale data, don’t be afraid to compute and compare means as long as your interpretation of results doesn’t exceed what the data says.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>An Intro to Bayesian Thinking for UX Research: Updating Beliefs with Data</title>
		<link>https://measuringu.com/intro-to-bayesian-thinking-in-ux-research/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=intro-to-bayesian-thinking-in-ux-research</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD&nbsp;•&nbsp;Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 25 Feb 2026 00:50:31 +0000</pubDate>
				<category><![CDATA[User Research]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[Bayesian theorem]]></category>
		<category><![CDATA[completion rates]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46598</guid>

					<description><![CDATA[&#8220;That design will never work.&#8221; You may have had that thought before you even ran your first participant in a usability test. If you’ve seen enough users struggle and conducted enough usability tests, then you probably have some idea about how well or poorly a task attempt may go for prototypes or even commercially available [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46615 size-medium" src="https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-300x169.png" alt="Feature image showing a researcher pointing on a math equation using a pointer stick" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/02/022426-FeatureImage-4.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>&#8220;That design will never work.&#8221;</p>
<p>You may have had that thought before you even ran your first participant in a usability test.</p>
<p>If you’ve seen enough users struggle and conducted enough usability tests, then you probably have some idea about how well or poorly a task attempt may go for prototypes or even commercially available software or products.</p>
<p>It’s rare to have <em>no</em> idea about how well things will go before the testing even starts. In fact, an experienced researcher is expected to know of some problems and anticipate the friction. This is one of the foundations behind inspection methods like heuristic evaluation and the PURE method (which puts some numbers to friction).</p>
<p>Expert reviewers, of course, are not a substitute for observing users. But is there a way to build in our a priori knowledge of what’s likely to go wrong and then inform and update our beliefs once we see data? Can we do that systematically or even mathematically?</p>
<h2><span lang="EN-US">Thomas Bayes and Updating Our Beliefs from Data</span></h2>
<p>It turns out that hundreds of years ago, a famous Presbyterian minister named <a href="https://en.wikipedia.org/wiki/Thomas_Bayes">Thomas Bayes</a> was also interested in updating his beliefs with what he observed.</p>
<p>His name has been associated with a formula for updating our beliefs with data (<a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes&#8217; Theorem</a>). It follows a simple iterative process:</p>
<ol>
<li>Start with a belief or hypothesis.</li>
<li>Collect data.</li>
<li>Update the belief.</li>
<li>Repeat.</li>
</ol>
<p>The formula for this process looks like this:</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1.png"><img loading="lazy" decoding="async" class="aligncenter wp-image-46629" src="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1.png" alt="Formula for updating beliefs with Bayes Theorem" width="674" height="70" srcset="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1.png 1078w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1-300x31.png 300w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1-1024x106.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1-768x80.png 768w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-1-600x62.png 600w" sizes="auto, (max-width: 674px) 100vw, 674px" /></a><br />
In other words, start with what you expect, check how well the data matches that expectation, and then adjust your belief accordingly.</p>
<p>Bayes’ formula means that beliefs that better predict the data become more credible; beliefs that predict the data poorly lose credibility.</p>
<p>Our original belief is called the prior hypothesis (before). The belief we have after observing data and calculating an update is our posterior belief (after).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2.png"><img loading="lazy" decoding="async" class="aligncenter wp-image-46630" src="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2-300x39.png" alt="Formula for determining a posterior hypothesis" width="536" height="70" srcset="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2-300x39.png 300w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2-768x100.png 768w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2-600x78.png 600w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-2.png 857w" sizes="auto, (max-width: 536px) 100vw, 536px" /></a><br />
If we replace words with symbols, we get the more recognizable Bayesian formula. We need only two symbols to express it: θ (theta) and <em>D</em> (data).</p>
<p>Our prior belief is represented with the Greek symbol theta (θ) and shown in the formula as the probability of theta. <em>D</em> represents the data we observed/collected and is shown in the formula in the denominator (probability of all data). Both θ and <em>D</em> appear in the numerator as a conditional probability of the data given theta (<em>D</em>|θ).</p>
<p>Our posterior (updated belief) is represented with the probability of theta given the data (θ|<em>D</em>). The resulting formula is:</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-3.png"><img loading="lazy" decoding="async" class="aligncenter wp-image-46631" src="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-3-300x77.png" alt="Posterior with theta" width="273" height="70" srcset="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-3-300x77.png 300w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-3.png 437w" sizes="auto, (max-width: 273px) 100vw, 273px" /></a><br />
Interestingly, Bayes himself never published his famous theorem. It was published after his death by his friend <a href="https://www.york.ac.uk/depts/maths/histstat/price.pdf">Richard Price</a> [PDF], who used it to attempt to prove the existence of God by showing that the order in the universe wasn’t accidental. Because Price likely made a substantial contribution to completing Bayes’ work on the theorem, this may be another example of <a href="https://en.wikipedia.org/wiki/Stigler%27s_law_of_eponymy">Stigler’s Law</a> (scientific discoveries are not named after the discoverer or, in this case, do not include the co-discoverer).</p>
<p>Formulas, ministers, and theology are interesting and all, but how does this apply to UX research?</p>
<h2><span lang="EN-US">A Simple UX Research Example with Completion Rates</span></h2>
<p>We can use an example of testing a new checkout experience. We want to gauge the completion rate (a fundamental usability metric). How successfully are people able to get through the new flow?</p>
<p>We’ve never tested this checkout flow before, though. But do we really have <em>no idea</em> about what will happen? Is a 0% completion rate really as likely as a 50%, 90%, or 100% completion rate?</p>
<p>Using a <strong>rough</strong> guide from historical data, we know an “average” completion rate is <a href="https://measuringu.com/task-completion/">around 78%</a>. It doesn’t mean we expect this new checkout completion rate to be <strong>exactly</strong> 78% (there is a lot of variability around this average). But values between 50% and 95% seem more plausible than a 5%, 10%, or even 99% completion rate. The lower end would be cause for concern, and the upper end would be desired for such an important flow.</p>
<h3><span lang="EN-US">What Exactly Is Our Prior?</span></h3>
<p>So, following Bayesian thinking, we establish a prior. Our <strong>prior belief</strong> is not a single number (78%), but a <em>range of plausible completion rates</em>, centered near 78% (the most plausible rate). Rates far lower (e.g., 40%) or far higher (e.g., 99%) are possible but less likely. In Bayesian terms, this represents a prior belief with a probability distribution centered at 78% but wide enough to allow for substantial uncertainty (see the appendix for details).</p>
<h3><span lang="EN-US">Collecting Data</span></h3>
<p>As an example of using data to update our initial thinking, assume we’ve collected data from a hypothetical moderated usability test with twenty participants in which eighteen completed the checkout and two failed. That’s a 90% observed completion rate. What does that do to our prior belief?</p>
<p>Using Bayesian thinking, we’d ask which completion rates best explain 18 successes out of 20.</p>
<ul>
<li>Rates near 90% explain it well.</li>
<li>Rates near 78% still explain it reasonably well.</li>
<li>Rates near 50% explain it poorly.</li>
</ul>
<p>Bayes’ theorem formalizes that comparison. It increases the credibility of rates that better predict the data and decreases the credibility of those that don’t.</p>
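<p>To make that comparison concrete, here is a minimal illustrative sketch in Python (not part of the original analysis) that computes the binomial likelihood of 18 successes in 20 attempts under each of the three candidate rates. The function name <code>binomial_likelihood</code> is hypothetical.</p>

```python
from math import comb

def binomial_likelihood(successes, n, rate):
    """Probability of exactly `successes` out of `n` if the true rate is `rate`."""
    failures = n - successes
    return comb(n, successes) * rate**successes * (1 - rate)**failures

# Candidate completion rates from the list above
for rate in (0.90, 0.78, 0.50):
    likelihood = binomial_likelihood(18, 20, rate)
    print(f"rate {rate:.2f}: likelihood of 18/20 = {likelihood:.4f}")
```

<p>The likelihood is roughly 0.29 at a 90% rate, about 0.11 at 78%, and well under 0.001 at 50%, which is the ordering described in the list.</p>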
<h3><span lang="EN-US">Updating Our Prior</span></h3>
<p>Before seeing the data, our belief was centered near 78%. After observing 18/20 completions, we conclude (see appendix for the mechanics):</p>
<ul>
<li>Our updated best estimate of the true completion rate is about <strong>86%</strong>.</li>
<li>A 95% credible interval runs from roughly <strong>72% to 96%</strong>.</li>
<li>There’s about an <strong>89% probability</strong> that the true completion rate exceeds 78%.</li>
</ul>
<p>A few things to notice:</p>
<ul>
<li>The data pulled our estimate up from 78% toward 90%.</li>
<li>It didn’t go all the way to 90%.</li>
<li>The prior kept us from overreacting to just twenty observations.</li>
</ul>
<p>That’s <strong>Bayesian updating</strong>. We started with an informed expectation, saw new evidence, and adjusted accordingly. Figure 1 illustrates this Bayesian thinking.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46632" src="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4-300x163.png" alt="The posterior distribution (after observing 18/20 completions) shifts upward from the prior centered at 78%, reflecting the influence of new data while retaining uncertainty." width="1200" height="651" srcset="https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4-300x163.png 300w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4-1024x556.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4-768x417.png 768w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4-600x326.png 600w, https://measuringu.com/wp-content/uploads/2026/02/022426-formula-4.png 1284w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> The posterior distribution (after observing 18/20 completions) shifts upward from the prior centered at 78%, reflecting the influence of new data while retaining uncertainty.</p>
<p>So, can we just plug our numbers into the simple formula we showed above? Unfortunately, as we found while working through this example, it’s not that simple. The appendix below describes the approach we used to get those numbers.</p>
<p>We’ll cover how to conduct these analyses in upcoming articles, but this provides some idea about using Bayesian thinking in practice without getting swallowed up in the conditional probabilities.</p>
<h2><span lang="EN-US">Updating Our Beliefs with More Questions</span></h2>
<p>Who can argue with updating your beliefs with new data? We like this idea of applying iterative Bayesian thinking and incorporating historical data. Who wants to be stuck in their ways? But while using Bayesian thinking seems both appealing and like sound science, it generates a few questions:</p>
<ul>
<li>How is this different from using the statistics taught in an intro statistics class and <a href="https://measuringu.com/product/practical-statistics-for-ux-and-customer-research-course/">our courses</a>?</li>
<li>What’s the difference between a credible interval and a confidence interval?</li>
<li>Do Bayesian statistics require smaller sample sizes?</li>
<li>What if you don’t have any prior information?</li>
<li>How reliable are priors if they are just our intuition or “conventional wisdom”?</li>
<li>Can a prior steer us in the wrong direction?</li>
<li>How can this concept be extended to assessing the likelihoods of different hypotheses?</li>
</ul>
<p>We’ll dig into these questions in upcoming articles.</p>
<h2><span lang="EN-US">Appendix: How the Posterior Was Computed</span></h2>
<p>Here’s a quick summary of how we computed the values. We used some common modern Bayesian methods that are computationally intense (we’ll cover that in a future article).</p>
<p>We modeled the true completion rate using a Beta distribution and the observed data using a binomial model. We set a weak prior centered at the historical average of 78%, equivalent to about 10 prior observations. This corresponds to a Beta(7.8, 2.2) prior distribution.</p>
<p>With 18 completions out of 20 participants, the Bayesian update is: for a Beta prior and binomial data, the posterior is Beta(α + successes, β + failures). Substituting the values gives a posterior of Beta(25.8, 4.2).</p>
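<p>The arithmetic of this update can be sketched in a few lines of Python. This is an illustration under the assumptions stated above (a prior mean of 78% worth about 10 observations); the helper names <code>beta_prior_from_mean</code> and <code>update_beta</code> are hypothetical.</p>

```python
def beta_prior_from_mean(mean, prior_n):
    """Turn a prior mean and an equivalent number of observations into (alpha, beta)."""
    return mean * prior_n, (1 - mean) * prior_n

def update_beta(alpha, beta, successes, failures):
    """Conjugate update: a Beta prior plus binomial data gives a Beta posterior."""
    return alpha + successes, beta + failures

prior = beta_prior_from_mean(0.78, 10)  # approximately Beta(7.8, 2.2)
posterior = update_beta(*prior, successes=18, failures=2)
posterior_mean = posterior[0] / sum(posterior)

print(f"prior:     Beta({prior[0]:.1f}, {prior[1]:.1f})")
print(f"posterior: Beta({posterior[0]:.1f}, {posterior[1]:.1f})")
print(f"posterior mean: {posterior_mean:.2f}")
```

<p>Because the prior counts for only about 10 observations against 20 new ones, the posterior mean sits closer to the observed 90% than to the prior 78%.</p>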
<p>From this updated distribution:</p>
<ul>
<li>The posterior mean is 25.8/30 ≈ 86%.</li>
<li>The 95% credible interval is approximately 72% to 96% (2.5<sup>th</sup> and 97.5<sup>th</sup> percentiles of the Beta(25.8, 4.2) distribution).</li>
<li>The probability that the true completion rate exceeds 78% is about 89% (using the upper tail of the Beta(25.8, 4.2) distribution).</li>
</ul>
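<p>One way to approximate these summaries without a statistics package is simulation. The sketch below is an assumption about how such a check could be done, not necessarily the method used for the numbers above: it draws from Beta(25.8, 4.2) with Python’s standard <code>random.betavariate</code> and reads the summaries off the sorted draws. The seed and the number of draws are arbitrary choices.</p>

```python
import random

random.seed(2026)  # arbitrary seed so the sketch is repeatable
draws = sorted(random.betavariate(25.8, 4.2) for _ in range(100_000))

mean = sum(draws) / len(draws)
lower = draws[int(0.025 * len(draws))]  # approximate 2.5th percentile
upper = draws[int(0.975 * len(draws))]  # approximate 97.5th percentile
prob_above_78 = sum(1 for x in draws if x > 0.78) / len(draws)

print(f"posterior mean:        {mean:.3f}")
print(f"95% credible interval: {lower:.3f} to {upper:.3f}")
print(f"P(rate above 0.78):    {prob_above_78:.3f}")
```

<p>With this many draws, the simulated summaries should land close to the values listed above (a mean near 86%, an interval of roughly 72% to 96%, and a tail probability near 89%), with small differences due to simulation noise.</p>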
<p>This update reflects a compromise between prior expectations and observed data, with the new evidence pulling the estimate upward while retaining uncertainty.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>An Introduction to Effect Sizes</title>
		<link>https://measuringu.com/an-introduction-to-effect-sizes/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=an-introduction-to-effect-sizes</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 17 Feb 2026 23:04:00 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Effect Size]]></category>
		<category><![CDATA[effect sizes]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46564</guid>

					<description><![CDATA[The completion rate jumped from 20% to 80%. That’s a large effect size. If it had gone from 20% to 21%? Much smaller effect. It’s easy to get caught up in the mechanics of significance testing and p-values. But even before those tools existed, researchers were measuring effect sizes. Effect sizes remain fundamental to understanding [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46587 size-medium" src="https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-300x169.png" alt="Feature image showing small, medium and large effect sizes." width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/02/021726-FeatureImage-1-1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>The completion rate jumped from 20% to 80%. That’s a <strong>large effect size</strong>. If it had gone from 20% to 21%? Much smaller effect.</p>
<p>It’s easy to get caught up in the mechanics of significance testing and <em>p</em>-values. But even before those tools existed, researchers were measuring effect sizes. Effect sizes remain fundamental to understanding whether a result actually matters.</p>
<p>An important outcome of recent debates about significance testing has been increased consensus on reporting effect sizes alongside <em>p</em>-values.</p>
<p>It’s been a bit trendy lately to trash null hypothesis significance testing (NHST) because of how it’s misused. Many critics argue we should abandon it altogether.</p>
<p>But if you know us, you know we think that just because something is misused (like the NPS) doesn’t mean we should throw it out. We’re pragmatic. We actually agree with the critics of significance testing that we shouldn’t rely on just <em>p</em>-values. Effect sizes and confidence intervals should be used more to understand the practical significance of a result.</p>
<p>We’ve written about this earlier. Figure 1 shows a <a href="https://measuringu.com/from-statistical-to-practical-significance/">framework we originally published in 2021</a> that extends the all-or-none decision of statistical significance to considerations of practical significance based on <a href="https://measuringu.com/ci-10things/">confidence intervals</a> (a type of effect size).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46579" src="https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1-300x110.png" alt="Decision tree for assessing statistical and practical significance." width="1200" height="440" srcset="https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1-300x110.png 300w, https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1-1024x375.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1-768x281.png 768w, https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1-600x220.png 600w, https://measuringu.com/wp-content/uploads/2026/02/021726-Figure1.png 1378w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Decision tree for assessing statistical and practical significance.</p>
<p>In this article, we provide a short introduction to effect sizes, extending our thoughts from our <a href="https://measuringu.com/effect-sizes/">earlier article</a>.</p>
<h2>A Short History of Effect Sizes</h2>
<p>Before there were tests of significance, there were effect sizes. Any time two values are compared, you have an estimate of an effect size.</p>
<p>A key difference: the magnitude of a <em>p</em>-value is affected by sample size, but estimates of effect sizes are not. Significance testing separates effect sizes that could plausibly be zero from those that cannot (typically using an alpha criterion), but the effect size itself is independent of sample size.</p>
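<p>A small Python sketch illustrates the point. It holds a hypothetical standardized mean difference constant at 0.3 and varies only the sample size. To stay within the standard library, it uses a normal approximation to the two-sample test rather than an exact <em>t</em>-test, so the <em>p</em>-values are approximate; the function name <code>approx_two_sided_p</code> is hypothetical.</p>

```python
from math import sqrt
from statistics import NormalDist

def approx_two_sided_p(d, n_per_group):
    """Normal approximation to the two-sided p-value for two equal-sized independent groups."""
    z = d * sqrt(n_per_group / 2)  # test statistic grows with the square root of n
    return 2 * (1 - NormalDist().cdf(abs(z)))

d = 0.3  # hypothetical standardized mean difference, held constant
for n in (10, 50, 200):
    p = approx_two_sided_p(d, n)
    print(f"n per group = {n:3d}, d = {d}, approximate p = {p:.4f}")
```

<p>The effect size is the same in every row, while the approximate <em>p</em>-value falls from about .50 with 10 per group to about .003 with 200 per group.</p>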
<p>Early concepts related to effect sizes can be found in the <a href="https://pages.uoregon.edu/stevensj/workshops/huberty2002.pdf">writings of Francis Galton and Karl Pearson</a> on correlation and regression in the late 19<sup>th</sup> and early 20<sup>th</sup> centuries. In 1960, Ronald Fisher added a general statement about the importance of effect sizes to the 7<sup>th</sup> edition of <a href="https://home.iitk.ac.in/~shalab/anova/DOE-RAF.pdf"><em>The Design of Experiments</em></a>, saying researchers should never “lose sight of the exact strength which the evidence has in fact reached” (p. 25).</p>
<p>Interest in effect sizes grew in the second half of the 20<sup>th</sup> century with Jacob Cohen’s use of the <a href="https://replicationindex.com/wp-content/uploads/2025/09/Cohen.1962.The_statistical_power_of_abnor.pdf">smallest important effect size to detect</a> (i.e., the critical difference) in sample size estimation and the development of <a href="https://en.wikipedia.org/wiki/Meta-analysis">meta-analysis</a> by <a href="https://en.wikipedia.org/wiki/Gene_V._Glass">Gene Glass</a> and <a href="https://en.wikipedia.org/wiki/Larry_V._Hedges">Larry Hedges</a>.</p>
<p>In 1994, the American Psychological Association (APA) first explicitly recommended reporting effect sizes in the 4<sup>th</sup> edition of its publication manual. In the 5<sup>th</sup> through 7<sup>th</sup> editions (the 7<sup>th</sup> was published in 2019), the APA strongly recommends reporting <a href="https://www.psychologicalscience.org/observer/understanding-confidence-intervals-cis-and-effect-size-estimation">confidence intervals around effect size estimates</a> in addition to standard tests of significance.</p>
<h2>Types of Effect Sizes</h2>
<p>There are many different effect sizes, with estimates of the number <a href="https://en.wikipedia.org/wiki/Effect_size">varying from 50 to 100</a>. At a high level, they measure either <strong>differences</strong> (between means or proportions) or <strong>relationships</strong> (correlations, regression) and can also be classified as unstandardized or standardized. More formally, effect sizes reflect the magnitude of a phenomenon and can be described in terms of what is measured (differences or relationships), how it is computed, and the resulting value.</p>
<p>Because any effect size estimate will be wrong to some degree, the current best practice is to report confidence intervals alongside effect sizes. Confidence intervals show the plausible range of values for an effect size, helping distinguish between statistical significance and practical importance.</p>
<h3>Unstandardized Effect Sizes</h3>
<p>Unstandardized effect sizes preserve the original measurement units—inches, seconds, or points on a rating scale. Because they’re directly interpretable, they’re usually easier to understand and apply. Common examples in UX research: mean differences and regression coefficients (<em>B</em> weights).</p>
<h3>Standardized Effect Sizes</h3>
<p>Standardized effect sizes are, for the most part, unstandardized effect sizes divided by a standard deviation. This converts original units into unit-free measures of magnitude, making them easier to compare across studies or combine in meta-analysis. The best-known standardized effect size for the difference between two independent means is Cohen’s <em>d</em> (the mean difference divided by the pooled standard deviation). In UX research, the correlation coefficient is probably the most common standardized effect, possibly because it is more easily interpreted than its unstandardized counterpart, the covariance.</p>
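<p>As a minimal sketch of that definition, the Python below computes both the unstandardized mean difference and Cohen’s <em>d</em> for two independent groups. The task-time data are invented for illustration, and <code>cohens_d</code> is a hypothetical helper name.</p>

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Mean difference divided by the pooled standard deviation of two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical task times in seconds for two designs
design_a = [52, 61, 48, 70, 66, 58]
design_b = [45, 50, 41, 62, 55, 47]

raw_difference = mean(design_a) - mean(design_b)  # unstandardized, in seconds
print(f"mean difference: {raw_difference:.1f} seconds")
print(f"Cohen's d:       {cohens_d(design_a, design_b):.2f}")
```

<p>The first number keeps its original units (seconds); the second expresses the same difference in standard deviation units, which is what makes it comparable across studies.</p>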
<h2>Interpreting Standardized Effect Sizes</h2>
<p>The best-known guidelines for interpreting standardized effect sizes as small, medium, or large were developed by <a href="https://en.wikipedia.org/wiki/Jacob_Cohen_(statistician)">Jacob Cohen</a>. He emphasized the importance of basing effect size comparisons whenever possible on the results of previous studies in the relevant research context, but he provided general conventions to use when relevant research was insufficient (Table 1).</p>

<table id="tablepress-1020" class="tablepress tablepress-id-1020">
<thead>
<tr class="row-1">
	<th class="column-1"><strong>Interpretation</strong></th><th class="column-2"><strong>Mean Difference</strong></th><th class="column-3"><strong>Correlation</strong></th><th class="column-4"><strong>Cohen’s Basis</strong></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1">Small</td><td class="column-2">0.2</td><td class="column-3">0.1</td><td class="column-4">Noticeably smaller than medium but not trivial</td>
</tr>
<tr class="row-3">
	<td class="column-1">Medium</td><td class="column-2">0.5</td><td class="column-3">0.3</td><td class="column-4">Visible to naked eye in real world (e.g., height)</td>
</tr>
<tr class="row-4">
	<td class="column-1">Large</td><td class="column-2">0.8</td><td class="column-3">0.5</td><td class="column-4">Same distance above medium as small is below</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1020 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Cohen’s conventions for interpreting standardized mean differences (standard deviation units) and correlations.</p>
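<p>For readers who want to apply Table 1 programmatically, the following Python sketch maps a standardized mean difference or a correlation to Cohen’s labels. Treating each convention as a lower bound (for example, anything at or above 0.5 but below 0.8 counts as “medium”) is our assumption for illustration, as are the names <code>COHEN_THRESHOLDS</code> and <code>cohen_label</code>.</p>

```python
# Cohen's conventions from Table 1, largest threshold first
COHEN_THRESHOLDS = {
    "d": [(0.8, "large"), (0.5, "medium"), (0.2, "small")],  # standardized mean difference
    "r": [(0.5, "large"), (0.3, "medium"), (0.1, "small")],  # correlation
}

def cohen_label(value, kind="d"):
    """Return the label of the largest convention that the absolute value reaches."""
    magnitude = abs(value)
    for threshold, label in COHEN_THRESHOLDS[kind]:
        if magnitude >= threshold:
            return label
    return "below small"

for value, kind in [(0.25, "d"), (-0.85, "d"), (0.25, "r"), (0.05, "r")]:
    print(f"{kind} = {value:+.2f}: {cohen_label(value, kind)}")
```

<p>Note that the same value can earn different labels depending on the metric: 0.25 is a small mean difference and also a small correlation here, but 0.5 would be medium for a mean difference and large for a correlation.</p>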
<h3>Are Cohen’s Guidelines Applicable to UX Research?</h3>
<p>Research on the <a href="https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.00813/full">meaningfulness of effect sizes in psychological research</a> has found larger reported effect sizes for conventionally published research (potentially affected by publication bias) compared to pre-registered research, and differences within subdisciplines of psychology. Another line of research, <a href="https://journals.sagepub.com/doi/pdf/10.1177/2515245919847202">evaluating effect sizes in psychological research</a> and focused on correlation, recommended interpreting reliably estimated correlations of .05 as very small, .10 as small, .20 as medium, .30 as large, and .40 as very large.</p>
<p>Although quantitative methods in UX are largely borrowed from psychology, UX research differs in goals, constraints, and decision contexts—making direct adoption of interpretation guidelines problematic. We’re planning to analyze our historical research (unaffected by publication bias) to develop guidelines specific to UX research contexts.</p>
<h2>Summary and Discussion</h2>
<p>In this article, we provided a brief history of effect sizes, two basic types (differences, relationships), and guidelines for interpretation.</p>
<p><strong>Effect sizes predate significance testing. </strong>Early concepts appeared in the late 19<sup>th</sup> and early 20<sup>th</sup> centuries and were further developed for sample size estimation, power analysis, and meta-analysis in the second half of the 20<sup>th</sup> century. Major organizations now strongly recommend reporting them.</p>
<p><strong>Effect sizes can be unstandardized or standardized. </strong>Unstandardized effect sizes preserve original measurement units; standardized effect sizes (unit-free measures based on proportions of standard deviations) enable cross-study comparison. Other types of research, including sample size estimation and meta-analysis, require standardized effect sizes.</p>
<p><strong>Effect sizes measure differences or relationships. </strong>Standardized effect sizes for mean differences include Cohen’s <em>d</em> and Hedges’ <em>g</em>. Standardized effect sizes for relationships include correlations (<em>r</em>) and coefficients of determination (<em>R</em>²).</p>
<p><strong>Report confidence intervals with effect sizes. </strong>Any point estimate will be wrong to some degree. Confidence intervals show the plausible range around a point estimate.</p>
<p><strong>Interpretation guidelines are context-dependent. </strong>Cohen’s conventions provide a starting point, but research has found larger effects in conventionally published research than in pre-registered research (potentially due to publication bias) and variation across psychological subdisciplines. We plan to investigate effect sizes in our historical data to develop better guidelines for UX research.</p>
<p>We will discuss specific effect size formulas and calculations in future articles.</p>
<h2>Additional Reading</h2>
<p>Cohen, J. (1962). <a href="https://replicationindex.com/wp-content/uploads/2025/09/Cohen.1962.The_statistical_power_of_abnor.pdf">The statistical power of abnormal-social psychological research: A review</a>. <em>Journal of Abnormal and Social Psychology</em>, <em>63</em>(3), 145–153.</p>
<p>Cohen, J. (1990). <a href="https://www.stat.cmu.edu/~brian/jdelaney/cohen-learned-so-far-amer-psychologist-1990.pdf">Things I have learned (so far)</a>. <em>American Psychologist</em>, <em>45</em>(12), 1304–1312.</p>
<p>Ferguson, C. J. (2009). <a href="https://www.researchgate.net/profile/Fei-Xin/post/How_to_determine_the_magnitude_of_ORs_in_logistic_regression/attachment/603e2571ce717d0001ee1746/AS%3A996877617078276%401614685553537/download/Ferguson_EffectSizes_2009.pdf">An effect size primer: A guide for clinicians and researchers</a>. <em>Professional Psychology: Research and Practice</em>, <em>40</em>(5), 532–538.</p>
<p>Fisher, R. A. (1971). <a href="https://home.iitk.ac.in/~shalab/anova/DOE-RAF.pdf">The design of experiments</a> (9<sup>th</sup> ed.). Hafner.</p>
<p>Fritz, C. O., Morris, P. E., &amp; Richler, J. J. (2012). <a href="https://www.researchgate.net/publication/51554230_Effect_Size_Estimates_Current_Use_Calculations_and_Interpretation">Effect size estimates: Current use, calculations, and interpretation</a>. <em>Journal of Experimental Psychology: General</em>, <em>141</em>(1), 2–18.</p>
<p>Funder, D. C., &amp; Ozer, D. J. (2019). <a href="https://journals.sagepub.com/doi/pdf/10.1177/2515245919847202">Evaluating effect size in psychological research: Sense and nonsense</a>. <em>Advances in Methods and Practices in Psychological Science</em>, <em>2</em>(2), 156–168.</p>
<p>Galton, F. (1889). <a href="https://galton.org/books/natural-inheritance/pdf/galton-nat-inh-1up-clean.pdf">Natural inheritance</a>. Macmillan.</p>
<p>Huberty, C. J. (2002). <a href="https://pages.uoregon.edu/stevensj/workshops/huberty2002.pdf">A history of effect size indices</a>. <em>Educational and Psychological Measurement</em>, <em>62</em>, 227–240.</p>
<p>Kelley, K., &amp; Preacher, K. J. (2012). <a href="https://www.academia.edu/47837419/On_effect_size">On effect size</a>. <em>Psychological Methods</em>, <em>17</em>(2), 137–152.</p>
<p>Levin, J. R. (1998). <a href="https://www.researchgate.net/publication/238374053_What_If_There_Were_No_More_Bickering_About_Statistical_Significance_Tests">What if there were no more bickering about statistical significance tests?</a> <em>Research in the Schools</em>, <em>5</em>(2), 43–53.</p>
<p>Lewis, J. R., &amp; Sauro, J. (2021, June 15). <a href="https://measuringu.com/from-statistical-to-practical-significance/">From statistical to practical significance</a>. MeasuringU.</p>
<p>Lewis, J. R., &amp; Sauro, J. (2021, September 28). <a href="https://measuringu.com/setting-alpha/">For statistical significance, must p be &lt; .05?</a> MeasuringU.</p>
<p>Onwuegbuzie, A. J., Levin, J. R., &amp; Leech, N. L. (2003). <a href="https://files.eric.ed.gov/fulltext/EJ853084.pdf">Do effect-size measures measure up? A brief assessment</a>. <em>Learning Disabilities: A Contemporary Journal</em>, <em>1</em>(1), 37–40.</p>
<p>Rosnow, R. L., &amp; Rosenthal, R. (1989). <a href="https://wiki.ubc.ca/images/3/3c/Rosnow_%26_Rosenthal._1998._Statistical_Procedures_(aspirin_example).pdf">Statistical procedures and the justification of knowledge in psychological science</a>. <em>American Psychologist</em>, <em>44</em>(10), 1276–1284.</p>
<p>Sauro, J. (2014, March 11). <a href="https://measuringu.com/effect-sizes/">Understanding effect sizes in user research</a>. MeasuringU.</p>
<p>Schäfer, T., &amp; Schwarz, M. A. (2019). <a href="https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.00813/full">The meaningfulness of effect sizes in psychological research: Differences between sub-disciplines and the impact of potential biases</a>. <em>Frontiers in Psychology: Quantitative Psychology and Measurement</em>, <em>10</em>, Article ID: 813.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Sample Sizes for Comparing UX-Lite Scores</title>
		<link>https://measuringu.com/sample-sizes-for-comparison-of-ux-lite-scores/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=sample-sizes-for-comparison-of-ux-lite-scores</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD&nbsp;•&nbsp;Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Wed, 11 Feb 2026 17:20:59 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Sample Size]]></category>
		<category><![CDATA[Sample Sizes]]></category>
		<category><![CDATA[UX-Lite]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46525</guid>

					<description><![CDATA[The UX-Lite® is a relatively new metric, but it is versatile, short, and increasingly popular for UX research. It measures perceived usability and usefulness with just two items. But if you’re using the UX-Lite to compare products or to see whether you’ve improved over time, what sample size do you need? Yes, the sample size [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46555 size-medium" src="https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-300x169.png" alt="Feature image showing drivers of sample size estimation table and a group of wooden pawns" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/02/021026-FeatureImage-2.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>The UX-Lite<sup>®</sup> is a relatively new metric, but it is versatile, short, and increasingly popular for UX research. It measures perceived usability and usefulness with just two items.</p>
<p>But if you’re using the UX-Lite to compare products or to see whether you’ve improved over time, what sample size do you need?</p>
<p>Yes, the sample size question we can’t (and shouldn’t) avoid. Fortunately, sample sizes for making comparisons are straightforward and uncontroversial.</p>
<p>In previous articles, we’ve developed sample size tables for studies focused on <a href="https://measuringu.com/ux-lite-sample-sizes-for-confidence-intervals">estimating UX-Lite scores with confidence intervals</a> or <a href="https://measuringu.com/ux-lite-sample-sizes-for-comparison-to-a-benchmark">comparing them to benchmark values</a>.</p>
<p>The <a href="https://measuringu.com/evolution-of-the-ux-lite/">UX-Lite</a>, like the <a href="https://measuringu.com/10-things-sus/">SUS</a>, uses transformed scores that range from 0 to 100, based on responses to its two five-point scales (usability and usefulness). We refer to it as a mini version of the 16-item Technology Acceptance Model (<a href="https://measuringu.com/tam/">TAM</a>). The UX-Lite <a href="https://measuringu.com/accuracy-of-sus-estimation-with-ux-lite/">predicts the SUS</a> with over 95% accuracy, and like the TAM, it <a href="https://measuringu.com/article/effect-of-perceived-ease-of-use-and-usefulness-on-ux-and-behavioral-outcomes/">predicts future product usage</a>.</p>
<p>In this article, we cover how to determine appropriate sample sizes for comparing two mean UX-Lite scores.</p>
<h2><span lang="EN-US">What Drives Sample Size Requirements for Comparison Tests?</span></h2>
<p>You need to know six things to compute the sample size when comparing two means. The first three are the same elements required to compute the sample size for a confidence interval:</p>
<ol>
<li>An estimate of the UX-Lite <a href="https://measuringu.com/reliability-and-variability-of-standardized-ux-scales/">standard deviation</a> (median of 19.3 with an interquartile range from 16.6 [25th percentile] to 21.3 [75th percentile]): <em>s</em></li>
<li>The required level of precision: <em>d</em></li>
<li>The level of confidence (typically 90% or 95%): <em>t<sub>ɑ</sub></em></li>
</ol>
<p>For a more detailed discussion of these three elements, see <a href="https://measuringu.com/ux-lite-sample-sizes-for-confidence-intervals">our previous confidence interval article</a>.</p>
<p>Sample size estimation for benchmark and comparison studies requires two additional considerations:</p>
<ol start="4">
<li>The power of the test (typically 80%): <em>t<sub>β</sub></em></li>
<li>The distribution of the rejection region (one-tailed for benchmark tests, two-tailed for comparisons of two means)</li>
</ol>
<p>As a quick recap, the power of a test refers to its capability to detect a specified minimum difference between means (i.e., to control the <a href="https://measuringu.com/hypothesis-testing-what-can-go-wrong/">likelihood of a Type II error</a>). The number of tails refers to the distribution of the rejection region for the statistical test. In most cases, comparisons of two means should use a two-tailed test. For more details on these topics, see the previous article on <a href="https://measuringu.com/ux-lite-sample-sizes-for-comparison-to-a-benchmark">UX-Lite benchmark testing</a>.</p>
<p>The comparison of two means has one more consideration:</p>
<ol start="6">
<li>Within- or between-subjects experimental design</li>
</ol>
<p>In a within-subjects study, you compare the means of scores that are paired because they came from the same person (assuming proper counterbalancing of the order of presentation). In a between-subjects study, you compare the means of scores that came from different (independent) groups of participants. Each experimental design has its <a href="https://measuringu.com/between-within/">strengths and weaknesses</a>, and each has its own formula for sample size estimation.</p>
<p>Figure 1 illustrates how the number of sample size drivers increases and changes from confidence intervals (the simplest with three drivers) to benchmark testing (five drivers) to tests of two means (six drivers).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2022/04/041321-F2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-32362" src="https://measuringu.com/wp-content/uploads/2022/04/041321-F2.png" alt="Drivers of sample size estimation for comparing scores." width="1200" height="511" srcset="https://measuringu.com/wp-content/uploads/2022/04/041321-F2.png 4400w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-300x128.png 300w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-1024x436.png 1024w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-768x327.png 768w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-1536x654.png 1536w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-2048x872.png 2048w, https://measuringu.com/wp-content/uploads/2022/04/041321-F2-600x255.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Drivers of sample size estimation for comparing scores.</p>
<h2><span lang="EN-US">UX-Lite Sample Sizes for Within-Subjects Comparison of Two Means</span></h2>
<p>The sample size formula for a within-subjects study is the same as the one used for benchmark tests:</p>
<p><a href="https://measuringu.com/wp-content/uploads/2022/04/Sample-Size-Fomula.png"><img loading="lazy" decoding="async" class="alignnone wp-image-32296" src="https://measuringu.com/wp-content/uploads/2022/04/Sample-Size-Fomula.png" alt="Sample size formula for a within-subjects study " width="243" height="150" srcset="https://measuringu.com/wp-content/uploads/2022/04/Sample-Size-Fomula.png 330w, https://measuringu.com/wp-content/uploads/2022/04/Sample-Size-Fomula-300x185.png 300w" sizes="auto, (max-width: 243px) 100vw, 243px" /></a></p>
<p>where <em>s</em> is the standard deviation (<em>s</em><sup>2</sup> is the variance), <em>t</em> is the <em>t</em>-value for the desired level of confidence AND power, and <em>d</em> is the target for the critical difference (the smallest difference in means that you need to be able to detect).</p>
<p>As in benchmark testing, <em>t</em> in the formula is the sum of two <em>t</em>-values, one for <em>ɑ</em> (related to confidence, two-sided for comparison of means) and one for <em>β</em> (related to power, always one-sided). For a 90% confidence level and 80% power, this works out to be 1.645 + 0.842 = 2.487, or about 2.5.</p>
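<p>The calculation can be sketched in Python using only the standard library. Several things here are assumptions rather than details taken from the article: we read the formula as <em>n</em> = <em>t</em>²<em>s</em>²/<em>d</em>², start from large-sample <em>z</em>-values, and then iterate with <em>t</em>-values for <em>n</em> − 1 degrees of freedom until the estimate stops changing. The helpers <code>t_cdf</code>, <code>t_quantile</code>, and <code>within_subjects_n</code> are hypothetical names, and the <em>t</em> quantiles come from a simple numerical integration rather than a statistics package.</p>

```python
import math
from statistics import NormalDist

def t_cdf(x, df, steps=400):
    """CDF of Student's t for x >= 0, using Simpson's rule on the density."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def t_quantile(prob, df):
    """Upper-half quantile of Student's t (prob above 0.5), found by bisection."""
    lo, hi = 0.0, 20.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) > prob:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def within_subjects_n(s, d, confidence=0.90, power=0.80, max_iter=25):
    """Sample size from n = t^2 * s^2 / d^2, with t = t_alpha (two-sided) + t_beta (one-sided)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n = max(2, math.ceil((z * s / d) ** 2))  # z-based starting estimate
    for _ in range(max_iter):
        df = n - 1
        t = t_quantile(1 - alpha / 2, df) + t_quantile(power, df)
        new_n = max(2, math.ceil((t * s / d) ** 2))
        if new_n == n:  # estimate is stable, so stop iterating
            break
        n = new_n
    return n

print(within_subjects_n(s=19.3, d=10, confidence=0.90, power=0.80))
```

<p>With <em>s</em> = 19.3 and <em>d</em> = 10 at 90% confidence and 80% power, the <em>z</em>-based starting point is 24 and the iteration should settle on 25, which is the value reported for that combination in Table 1 below.</p>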
<p>One way to think of including power in sample size estimation is as an insurance policy: you purchase the policy by increasing your sample size, improving your likelihood of finding statistically significant results if the standard deviation is a little higher than expected or the observed value of <em>d</em> is a bit lower.</p>
<p>Table 1 shows how variations in these components affect sample size estimates for within-subjects comparisons for the median standard deviation of 19.3 and for the 75<sup>th</sup> percentile of 21.3. In most cases, using the median standard deviation is reasonable, but when a sufficient sample size is more important than controlling the cost of sampling, it’s better to plan with the higher value.</p>

<table id="tablepress-1018" class="tablepress tablepress-id-1018">
<thead>
<tr class="row-1">
	<td class="column-1"></td><th colspan="2" class="column-2"><center><strong><i>s</i> = 19.3</strong></th><th colspan="2" class="column-4"><center><strong><i>s</i> = 21.3</strong></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1"><strong><i>d</i></strong></td><td class="column-2"><center><strong>90%</strong></td><td class="column-3"><center><strong>95%</strong></td><td class="column-4"><center><strong>90%</strong></td><td class="column-5"><center><strong>95%</strong></td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>15</strong></td><td class="column-2"><center>12</td><td class="column-3"><center>15</td><td class="column-4"><center>14</td><td class="column-5"><center>18</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>10</strong></td><td class="column-2"><center>25</td><td class="column-3"><center>32</td><td class="column-4"><center>30</td><td class="column-5"><center>38</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>7.5</strong></td><td class="column-2"><center>43</td><td class="column-3"><center>54</td><td class="column-4"><center>52</td><td class="column-5"><center>66</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>5.0</strong></td><td class="column-2"><center>94</td><td class="column-3"><center>119</td><td class="column-4"><center>114</td><td class="column-5"><center>145</td>
</tr>
<tr class="row-7">
	<td class="column-1"><strong>2.5</strong></td><td class="column-2"><center>370</td><td class="column-3"><center>470</td><td class="column-4"><center>451</td><td class="column-5"><center>572</td>
</tr>
<tr class="row-8">
	<td class="column-1"><strong>2.0</strong></td><td class="column-2"><center>578</td><td class="column-3"><center>733</td><td class="column-4"><center>703</td><td class="column-5"><center>893</td>
</tr>
<tr class="row-9">
	<td class="column-1"><strong>1.0</strong></td><td class="column-2"><center>2305</td><td class="column-3"><center>2926</td><td class="column-4"><center>2807</td><td class="column-5"><center>3563</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1018 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Sample size requirements for UX-Lite comparisons within subjects given various standard deviations (<em>s</em>), confidence levels, and critical differences (<em>d</em>), with power set to 80%.</p>
<p>In this table, the “<a href="https://measuringu.com/might-not-be-a-magic-number-but-there-are-magic-ranges/">magic range</a>” for the critical difference is from 2.5 to 5, where the sample sizes are reasonably attainable (<em>n</em> from 94 to 572). The table also illustrates the tradeoff between the ability of a test to detect significant differences and the sample size needed to achieve that goal.</p>
<p>For example, if you want to be able to detect mean differences of 15 with 90% confidence and 80% power in a within-subjects study, you’d need a sample size of 12. At the other end of the table, for 95% confidence, 80% power, and a critical difference of 1 in a within-subjects study, you’d need a sample size of 3,563.</p>
<h2><span lang="EN-US">UX-Lite Sample Sizes for Between-Subjects Comparison of Two Means</span></h2>
<p>The only change for a between-subjects comparison is to the sample size formula, which roughly doubles the sample size for each group. Because there must be two groups, the total doubles again. This means that to achieve the same level of sensitivity while keeping everything else equal, the total sample size for a between-subjects comparison is about four times the sample size required for a within-subjects comparison. As shown in Table 2, this constrains the “magic range” for a reasonable critical difference to no less than 5, for which the sample sizes for the various combinations of standard deviation and confidence level range from 372 to 572.</p>

<table id="tablepress-1019" class="tablepress tablepress-id-1019">
<thead>
<tr class="row-1">
	<td class="column-1"></td><th colspan="2" class="column-2"><center><strong><i>s</i> = 19.3</strong></th><th colspan="2" class="column-4"><center><strong><i>s</i> = 21.3</strong></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1"><strong><i>d</i></strong></td><td class="column-2"><center><strong>90%</strong></td><td class="column-3"><center><strong>95%</strong></td><td class="column-4"><center><strong>90%</strong></td><td class="column-5"><center><strong>95%</strong></td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>15</strong></td><td class="column-2"><center>44</td><td class="column-3"><center>56</td><td class="column-4"><center>52</td><td class="column-5"><center>66</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>10</strong></td><td class="column-2"><center>94</td><td class="column-3"><center>120</td><td class="column-4"><center>116</td><td class="column-5"><center>146</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>7.5</strong></td><td class="column-2"><center>166</td><td class="column-3"><center>212</td><td class="column-4"><center>202</td><td class="column-5"><center>256</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>5.0</strong></td><td class="column-2"><center>372</td><td class="column-3"><center>470</td><td class="column-4"><center>452</td><td class="column-5"><center>572</td>
</tr>
<tr class="row-7">
	<td class="column-1"><strong>2.5</strong></td><td class="column-2"><center>1476</td><td class="column-3"><center>1874</td><td class="column-4"><center>1798</td><td class="column-5"><center>2282</td>
</tr>
<tr class="row-8">
	<td class="column-1"><strong>2.0</strong></td><td class="column-2"><center>2306</td><td class="column-3"><center>2926</td><td class="column-4"><center>2808</td><td class="column-5"><center>3564</td>
</tr>
<tr class="row-9">
	<td class="column-1"><strong>1.0</strong></td><td class="column-2"><center>9214</td><td class="column-3"><center>11698</td><td class="column-4"><center>11222</td><td class="column-5"><center>14248</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1019 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2:</strong> Sample size requirements for UX-Lite comparisons between subjects given various standard deviations (<em>s</em>), confidence levels, and critical differences (<em>d</em>), with power set to 80% and total sample sizes for two independent groups.</p>
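<p>As a quick check of the “about four times” relationship, the following sketch divides the Table 2 values by the corresponding Table 1 values for <em>s</em> = 19.3 at 90% confidence. The variable names are ours. The numbers are copied from the two tables.</p>

```python
# Tabled n for s = 19.3 at 90% confidence, keyed by critical difference d.
within_n = {15: 12, 10: 25, 7.5: 43, 5.0: 94, 2.5: 370, 2.0: 578, 1.0: 2305}
between_n = {15: 44, 10: 94, 7.5: 166, 5.0: 372, 2.5: 1476, 2.0: 2306, 1.0: 9214}

# Ratio of the between-subjects total to the within-subjects sample size.
ratios = {d: between_n[d] / within_n[d] for d in within_n}

for d, ratio in ratios.items():
    print(f"d = {d}: between/within = {ratio:.2f}")
```

<p>The ratios fall a little short of four at the smallest sample sizes and get closer to four as the sample sizes grow.</p>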
<h3><span lang="EN-US">Technical Note: What to Do for Different Standard Deviations</span></h3>
<p>If your historical UX-Lite data has a very different standard deviation from 19.3 or 21.3, you can do a quick computation to adjust the values in these tables. The first step is to compute a multiplier by dividing the new target variance (the square of the standard deviation, <em>s</em><sup>2</sup>) by the variance used to create the table. Then multiply the tabled value of <em>n</em> by the multiplier and round it to get the revised estimate. To illustrate this, we’ll start with a standard deviation of 19.3 (our typical standard deviation) and show how this works if the target standard deviation (<em>s</em>) is 21.3 (our conservative estimate in Tables 1 and 2). The target variance (21.3<sup>2</sup>) is 453.69. The initial variance (19.3<sup>2</sup>) is 372.49, making the multiplier 453.69/372.49 = 1.218. To use this multiplier to adjust the sample size for 95% confidence and a critical difference of 10 shown in Table 2 when <em>s</em> = 19.3, multiply 120 by 1.218 to get 146.16, then round it off to 146, which matches the tabled value for <em>s</em> = 21.3. For more information, see our article, <a href="https://measuringu.com/how-do-changes-in-sd-affect-n/">How Do Changes in Standard Deviation Affect Sample Size Estimation</a>.</p>
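<p>The adjustment described in this note can be sketched as a short Python function. The function and parameter names are hypothetical. The logic is the variance-ratio multiplier followed by rounding.</p>

```python
def adjust_sample_size(tabled_n, tabled_sd, target_sd):
    """Scale a tabled n by the ratio of target variance to tabled variance."""
    multiplier = target_sd ** 2 / tabled_sd ** 2
    return round(tabled_n * multiplier)


# Worked example from the text: Table 2, 95% confidence, d = 10, s = 19.3.
print(adjust_sample_size(120, tabled_sd=19.3, target_sd=21.3))  # 146
```

<p>Treat the result as a quick estimate. For very small tabled values, rounding can leave it a point or so away from a directly computed sample size.</p>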
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>What sample size do you need when comparing two sets of UX-Lite scores? To answer that question, you need several types of information, some common to all sample size estimation (confidence level to establish control of Type I errors, standard deviation, and margin of error or critical difference), others unique to statistical hypothesis testing (one- vs. two-tailed testing, setting a level of power to control Type II errors), and for comparison of means (whether the experimental design will be within- or between-subjects).</p>
<p>We provided two tables based on typical (<em>s</em> = 19.3) and conservative (<em>s</em> = 21.3) standard deviations for the UX-Lite in retrospective UX studies, with values for between- and within-subjects designs, 90% and 95% confidence, power set to 80%, and critical differences from 1 to 15 points.</p>
<p>For UX researchers working in contexts where the typical standard deviation of the UX-Lite might be different, we also provided a simple way to increase or decrease the tabled sample sizes for larger or smaller standard deviations.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>UX and NPS Benchmarks of Clothing Websites (2026)</title>
		<link>https://measuringu.com/ux-nps-benchmark-report-for-retail-clothing-websites-2026/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ux-nps-benchmark-report-for-retail-clothing-websites-2026</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD •&nbsp;Jim Lewis, PhD&nbsp;•&nbsp;Emily Short]]></dc:creator>
		<pubDate>Tue, 03 Feb 2026 23:05:12 +0000</pubDate>
				<category><![CDATA[Benchmarking]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[clothing]]></category>
		<category><![CDATA[NPS]]></category>
		<category><![CDATA[SUPR-Q]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=46436</guid>

					<description><![CDATA[It’s hard to beat the convenience of shopping for clothing online. You don’t have to worry about when the store will close or finding parking, and getting a price comparison with other stores is just a few clicks away. On websites, you can easily search for clothing using keywords, and it’s simple to see the [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2.png"><img loading="lazy" decoding="async" class="alignleft wp-image-46475 size-medium" src="https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-300x169.png" alt="Feature image showing an interior of a clothing store" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/01/020326-FeatureImage-2.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a> It’s hard to beat the convenience of shopping for clothing online. You don’t have to worry about when the store will close or finding parking, and getting a price comparison with other stores is just a few clicks away. On websites, you can easily search for clothing using keywords, and it’s simple to see the entire catalog. There’s no reason to hunt through aisles or track down a salesperson.</p>
<p>But shopping for clothing online also has its drawbacks. You can’t <a href="https://www.vogue.com/article/sizing-is-stopping-consumers-from-shopping-heres-what-brands-need-to-know">try on the clothing</a>. Not being able to walk the store means you’re <a href="https://baymard.com/blog/current-state-product-list-and-filtering">reliant on the organization of the website</a>. And just because you see it <a href="https://www.opensend.com/post/out-of-stock-rate-statistics-ecommerce">doesn’t mean it’s in stock</a>. If you receive the wrong size, you may have to deal with restocking fees and the <a href="https://retailwire.com/discussion/are-stricter-return-policies-worth-it/">hassle of return shipping</a>.</p>
<p>Despite these issues, <a href="https://www.ecommercenorthamerica.org/2025/08/04/us-apparel-ecommerce-2025/">estimated online clothing spending in 2025 in the U.S.</a> was about $217B (about a fifth of global online apparel spending). However, online clothes spending in 2025 was only about 38% of all clothing purchases, making the improvement of the UX of clothing websites a <strong>high priority for providers and consumers</strong>.</p>
<p>To understand the quality of their online experiences, we collected UX benchmark metrics on 11 popular clothing websites and mobile apps.</p>
<ul>
<li>Anthropologie</li>
<li>Athleta</li>
<li>Banana Republic</li>
<li>Gap</li>
<li>H&amp;M</li>
<li>Lululemon</li>
<li>Neiman Marcus</li>
<li>Nordstrom</li>
<li>Old Navy</li>
<li>Urban Outfitters</li>
<li>Zara</li>
</ul>
<p>We computed <a href="https://measuringu.com/product/suprq/">SUPR-Q</a><sup>®</sup> and <a href="https://measuringu.com/nps-reliability/">Net Promoter</a> scores, measured users’ attitudes regarding their experiences, conducted <a href="https://measuringu.com/key-drivers/">key driver</a> analyses, and analyzed reported usability problems. (Full details are in the <a href="https://measuringu.com/product/ux-nps-benchmark-report-for-clothing-websites-2026/">downloadable report</a>.)</p>
<h2>Benchmark Study Details</h2>
<p>In November–December 2025, we asked 351 users of clothing websites in the U.S. to recall their most recent experience and perceptions of one of these websites on their desktop and mobile app (if applicable) in the past year.</p>
<p>Respondents completed the eight-item <a href="https://measuringu.com/10-things-suprq/">SUPR-Q</a> (which includes the <a href="https://measuringu.com/nps-three-confidence-intervals/">Net Promoter Score</a>), the two-item <a href="https://measuringu.com/evolution-of-the-ux-lite/">UX-Lite</a><sup>®</sup>, and the <a href="https://measuringu.com/article/streamlining-the-supr-qm-the-supr-qm-v2/">SUPR-Qm</a> standardized questionnaires and they answered questions about their brand attitudes, usage, and prior experiences.</p>
<h2>Quality of the Website User Experience: SUPR-Q</h2>
<p>The SUPR-Q is a standardized questionnaire widely used for measuring attitudes toward the quality of a website user experience. Its norms are computed from a rolling database of around 200 websites across dozens of industries.</p>
<p>SUPR-Q scores are percentile ranks that tell you how a website’s experience ranks relative to the other websites (50<sup>th</sup> percentile is average). The SUPR-Q provides an overall score as well as detailed scores for subdimensions of Usability, Trust, Appearance, and Loyalty.</p>
<p>The mean SUPR-Q across clothing websites in this study was at the <strong>82<sup>nd</sup> percentile</strong> (substantially above average), ranging from the 50<sup>th</sup> percentile for H&amp;M to the 98<sup>th</sup> percentile for Banana Republic.</p>
<h3>Usability Scores</h3>
<p>Overall, usability scores were also well above average for the clothing websites, averaging at the 81<sup>st</sup> percentile. Old Navy had the highest usability score at the 98<sup>th</sup> percentile. Zara had the lowest usability score, falling at the 31<sup>st</sup> percentile.</p>
<p>Comments related to usability on Zara included:</p>
<p style="padding-left: 25px;"><em>“Unnecessary images between products while browsing.” </em></p>
<p style="padding-left: 25px;"><em>“The return process is complicated.” </em></p>
<h3>Loyalty/Net Promoter Scores</h3>
<p>All the clothing websites except H&amp;M (with −2%) had positive Net Promoter Scores, led by Anthropologie (40%). The average NPS for these websites was 19% (more promoters than detractors).</p>
<p>Unsurprisingly, ratings of the intention to keep using these websites correlated with their NPS (<em>r</em> = .70 at the website level). As shown in Figure 1, respondents were more likely to continue using the Neiman Marcus website than to continue using the Old Navy website.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46467" src="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-300x112.png" alt="Likelihood to continue using the websites (90% confidence intervals)." width="1200" height="449" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-300x112.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-1024x383.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-768x287.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-1536x574.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-2048x765.png 2048w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure1-600x224.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Likelihood to continue using the websites (90% confidence intervals).</p>
<p>Comments related to NPS and loyalty included:</p>
<p style="padding-left: 25px;"><em>“Quality merchandise, great customer service, high-end brands, good sales.” </em>— Neiman Marcus</p>
<p style="padding-left: 25px;"><em>“It&#8217;s a good site but the quality of the clothes are less than they used to be.” </em>— Old Navy</p>
<h2>Websites and Mobile App Usage</h2>
<p>As part of this benchmark, we asked respondents how they accessed the clothing providers online. All respondents reported using their desktop/laptop computers (this was a requirement for participation in the survey), with 39% also using mobile apps and 73% using mobile websites. Most respondents reported visiting their clothing websites on a desktop or laptop computer a few times a year. Users most frequently reported never using the clothing mobile apps (however, 24% of Athleta users and 33% of Zara users reported using those apps a few times per month).</p>
<h2>Key Drivers of UX Quality</h2>
<p>To better understand what affects SUPR-Q scores and Likelihood-to-Recommend (LTR) ratings, we asked respondents to rate potentially important attributes of the clothing websites on a five-point scale from 1 (Strongly disagree) to 5 (Strongly agree). We conducted key driver analyses (regression modeling) to quantify the extent to which ratings on these items drive (account for) variation in overall SUPR-Q scores and, separately, LTR (the rating from which the NPS is derived; full details are in the <a href="https://measuringu.com/product/ux-nps-benchmark-report-for-clothing-websites-2026/">downloadable report</a>).</p>
<p>The top key driver of the overall SUPR-Q scores was the ease of browsing items (12%), followed by the ease of finding “exactly what I want” (10%). Taken together, 11 significant variables accounted for 74% of the variance in the SUPR-Q scores.</p>
<p>For likelihood-to-recommend (LTR), the top key drivers were the ease of finding “exactly what I want” (10%), finding brands quickly (8%), and trusting sites’ style recommendations (8%). Overall, seven significant drivers accounted for 44% of the variance in LTR.</p>
<p>Figure 2 shows a scatterplot of the importance and opportunity for improvement for seven key drivers. The combination of importance and opportunity for improvement provides a basis for prioritizing which key drivers to improve. The importance score is the greater of the variance accounted for by the driver in the SUPR-Q and NPS analyses, where larger percentages indicate more importance. The opportunity score is the <a href="https://measuringu.com/top-top-two-bottom-net-box/">top-box percentage</a> for the driver, so smaller percentages indicate greater opportunity for improvement (i.e., it would be harder to improve a driver with a top-box percentage of 100% than one with a top-box percentage of 10%).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46468" src="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-300x112.png" alt="Scatterplot of importance and opportunity for improvement of key drivers." width="1200" height="448" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-300x112.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-1024x382.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-768x287.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-1536x573.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-2048x764.png 2048w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure2-600x224.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2: </strong>Scatterplot of importance and opportunity for improvement of key drivers.</p>
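<p>To make the two scores concrete, here is a minimal Python sketch of how they could be computed for a single driver. The function names and the ratings list are hypothetical illustrations, not data from this study. The only values taken from the article are the two 10% figures for “exactly what I want.”</p>

```python
def top_box_pct(ratings, top=5):
    """Percentage of ratings at the top scale point (5 on a five-point scale)."""
    return 100 * sum(1 for r in ratings if r == top) / len(ratings)


def importance_pct(suprq_variance_pct, ltr_variance_pct):
    """Importance is the greater of the variance accounted for in the two analyses."""
    return max(suprq_variance_pct, ltr_variance_pct)


# Hypothetical ratings for one driver (not data from this study).
hypothetical_ratings = [5, 4, 5, 3, 2, 5, 4, 4, 1, 5]
print(top_box_pct(hypothetical_ratings))  # 40.0

# "Exactly what I want" accounted for 10% in both the SUPR-Q and LTR analyses.
print(importance_pct(10, 10))  # 10
```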
<p>Two of these seven key drivers fell in the FIX quadrant (upper left) with relatively high importance and higher opportunity for improvement (“Exciting to shop on this site” and “Easy to find exactly what I want”). Anthropologie had the highest top-box score for excitement (48%), and Nordstrom had the highest for shoppers being able to find exactly what they want (61%). The websites with the lowest top-box scores, suggesting the most room for improvement, were Old Navy for excitement (16%) and Urban Outfitters for finding “exactly what I want” (25%).</p>
<h2>UX Problems</h2>
<p>We examined the verbatim comments to better understand the user experience problems.</p>
<p>The top frustrations were products out of stock, sizing issues, slow loading, and navigation/browsing issues.</p>
<h3>Products Being Out of Stock Was a Major Annoyance</h3>
<p>This issue affected all the websites, but it was a top complaint for Neiman Marcus, Old Navy (Figure 3), and Urban Outfitters (and was in the top five for the others).</p>
<p style="padding-left: 25px;"><em>“Sometimes things go out of stock quickly.” </em>— Neiman Marcus</p>
<p style="padding-left: 25px;"><em>“There have been many times where specific items aren&#8217;t sold in my size, or they are sold out completely.” </em>— Old Navy</p>
<p style="padding-left: 25px;"><em>“Some problems I&#8217;ve had with the Urban Outfitters website are sometimes things aren&#8217;t in stock and they aren&#8217;t clear on when they will be back in stock.” </em>— Urban Outfitters</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-F3.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46440" src="https://measuringu.com/wp-content/uploads/2026/01/020326-F3.png" alt="Old Navy’s spin on products being out of stock (“We knew it was too good to last”)." width="1200" height="458" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-F3.png 1430w, https://measuringu.com/wp-content/uploads/2026/01/020326-F3-300x115.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-F3-1024x391.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-F3-768x293.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-F3-600x229.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 3: </strong>Old Navy’s spin on products being out of stock (“We knew it was too good to last”).</p>
<h3>Sizing Issues Degrade the User Experience</h3>
<p>Users reported sizing issues for eight of the websites. It was the most frequently mentioned complaint for Banana Republic (Figure 4) and Gap.</p>
<p style="padding-left: 25px;"><em>“Sometimes, there are items that are not in stock or I usually get an incorrect size.” </em>— Banana Republic</p>
<p style="padding-left: 25px;"><em>“Sometimes the item I want is not available in both my size and color. It will either be available in my size or my color but not both.” </em>— Gap</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46469" src="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4-300x124.png" alt="Example of product review about sizing issue on the Banana Republic website." width="1200" height="495" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4-300x124.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4-1024x422.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4-768x317.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4-600x247.png 600w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure4.png 1220w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 4: </strong>Example of a product review about a sizing issue on the Banana Republic website.</p>
<h3>Slow Loading Times Slow Down Shopping</h3>
<p>Users reported slow loading times for nine of the websites. It was a top complaint for Athleta and Neiman Marcus and was the second-most reported issue for Gap, Nordstrom, and Urban Outfitters.</p>
<p style="padding-left: 25px;"><em>“Requires strong internet connectivity to load otherwise one would experience slow performance.” </em>— Athleta</p>
<p style="padding-left: 25px;"><em>“The website is slow sometimes.” </em>— Neiman Marcus</p>
<p style="padding-left: 25px;"><em>“Sometimes the website can be slow depending on the device I am using.” </em>— Gap</p>
<p style="padding-left: 25px;"><em>“Sometimes it is slow to load once I get to page 3 or 4 of options.” </em>— Nordstrom</p>
<p style="padding-left: 25px;"><em>“Some pages take longer to load than expected, which can be frustrating.” </em>— Urban Outfitters</p>
<h3>Navigation and Browsing Issues Prevent Smooth Shopping</h3>
<p>Users of ten of the websites reported issues with navigation, browsing, or both. These issues ranked among the two most frequently reported frustrations for Anthropologie, Athleta, Gap, H&amp;M, Lululemon, Urban Outfitters, and Zara.</p>
<p style="padding-left: 25px;"><em>“It is so hard to browse for something or search for something.” </em>— Zara</p>
<p style="padding-left: 25px;"><em>“It can feel a bit overwhelming to look at at first because it looks like it has a lot going on.” </em>— Urban Outfitters</p>
<p style="padding-left: 25px;"><em>“It can feel a bit cluttered or too overwhelming to find exactly what I want.” </em>— Gap</p>
<p>We were particularly intrigued by a comment by an H&amp;M user who wrote, “It&#8217;s sometimes frustrating that the sidebar appears so frequently while I&#8217;m scrolling.” When we investigated this, it was clear that the underlying design issue was an invisible border around the controls that triggered the sidebar. Other websites had similar designs but either visualized the boundary separating menu options from the browsing area (e.g., Anthropologie and most others) or required users to activate dropdowns by clicking the option rather than hovering nearby (Lululemon). Video 1 shows how the H&amp;M design is particularly tricky relative to the other designs.</p>
<div class="ast-oembed-container " style="height: 100%;"><iframe loading="lazy" title="ThreeDesignsFinal" src="https://player.vimeo.com/video/1159742485?dnt=1&amp;app_id=122963" width="1170" height="658" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe></div>
<p class="wp-caption-text" style="text-align: left;"><strong>Video 1: </strong>Sidebar/dropdown triggering from main menu options on H&amp;M, Anthropologie, and Lululemon.</p>
<h3>Online Shoppers Have Trouble Getting Exactly What They Want</h3>
<p>The signals we get from both the quantitative and qualitative analyses for this benchmark study clearly indicate that users of these clothing websites have trouble getting exactly what they want.</p>
<p>This is demonstrated quantitatively by the significance of key driver ratings of the ease of browsing, finding brands quickly, trusting a site’s recommendations, easy returns, and confidence in the accuracy of size guides.</p>
<p>The qualitative findings provide more of the “why” behind the numbers, including:</p>
<ul>
<li>The annoyance of products being completely out of stock or out of desired sizes/colors, combined with the occasional surprise of not finding out until well into the checkout process</li>
<li>Uncertainty about sizing chart accuracy and variability in sizing across manufacturers, which leads consumers to experience <a href="https://www.vogue.com/article/sizing-is-stopping-consumers-from-shopping-heres-what-brands-need-to-know">fit/sizing uncertainty</a> that can cause them to abandon the purchase due to poor <a href="https://link.springer.com/article/10.1007/s11747-024-01034-9">fit-risk perception</a></li>
<li>Persistent complaints about slow loading</li>
<li>Numerous navigation and browsing issues (e.g., odd dropdown/sidebar behaviors, large images/videos that sometimes do not resize, intrusive ads, complex checkout and return processes)</li>
</ul>
<p>Looking across the quantitative and qualitative findings, all 11 websites have opportunities to improve. Some websites that would especially benefit from a stronger focus on online shopping experiences are:</p>
<ul>
<li><strong>Urban Outfitters</strong>: Lowest top-box score for finding “exactly what I want” (25%) and relatively high percentage of user comments about products being out of stock (top complaint), slow loading times, and navigation/browsing issues</li>
<li><strong>Old Navy</strong>: Lowest top-box score for excitement while shopping on the site (16%) and relatively high percentage of user comments about products being out of stock (top complaint)</li>
<li><strong>Zara</strong>: Lowest SUPR-Q Usability score (31<sup>st</sup> percentile) and relatively high percentage of user complaints, including browsing/navigation issues</li>
<li><strong>H&amp;M</strong>: Lowest overall SUPR-Q score (50<sup>th</sup> percentile) and NPS (−2%) with a relatively high percentage of user complaints, including browsing/navigation issues</li>
</ul>
<h2>Comparison with the 2022 Clothing Benchmark</h2>
<p>In 2022, we collected SUPR-Q and NPS data for all the same websites. Banana Republic, Gap, and Zara had the most improvement. Zara increased by more than 50 points in the four years since we measured (though still lagging behind the leaders), and H&amp;M had the biggest drop (32 points). These differences, shown in Figure 5, are statistically significant [<em>F</em>(10, 995) = 1.89, <em>p</em> = .04].</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-46470" src="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-300x112.png" alt="SUPR-Q scores from the 2022 and 2025 surveys (statistical analysis conducted on raw scores)." width="1200" height="449" srcset="https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-300x112.png 300w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-1024x383.png 1024w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-768x287.png 768w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-1536x574.png 1536w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-2048x765.png 2048w, https://measuringu.com/wp-content/uploads/2026/01/020326-Figure5-600x224.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 5: </strong>SUPR-Q scores from the 2022 and 2025 surveys (statistical analysis conducted on raw scores).</p>
<h2>Summary and Takeaways</h2>
<p>Clothing is big business: estimated online clothing spending in the U.S. was about $217B in 2025 (about a fifth of global online apparel spending). An analysis of the user experience of 11 major clothing websites using data collected in November–December 2025 found:</p>
<ol>
<li><strong>Banana Republic and Anthropologie lead; H&amp;M lags. </strong>Banana Republic had the highest overall SUPR-Q score, falling in the 98th percentile, while H&amp;M had the lowest score (50th percentile). Anthropologie had the highest NPS (40%) and H&amp;M had the lowest (−2%).</li>
</ol>
<ol start="2">
<li><strong>Ease of browsing and ease of finding “exactly what I want” drive UX scores. </strong>The top key driver of the overall SUPR-Q scores was the ease of browsing items (12%), followed by the ease of finding “exactly what I want” (10%). Taken together, 11 significant variables accounted for 74% of the variance in the SUPR-Q scores. For likelihood-to-recommend (LTR), the top key drivers were the ease of finding “exactly what I want” (10%), finding brands quickly (8%), and trusting sites’ style recommendations (8%). Overall, seven significant drivers accounted for 44% of the variance in LTR.</li>
</ol>
<ol start="3">
<li><strong>The top opportunities for improvement are increasing the feeling of excitement when shopping and helping shoppers find what they want. </strong>One way to prioritize attention to key drivers is to consider both their importance (beta weights in regression) and how well the websites achieve the stated goal (top-box scores). The two key drivers with the most potential for improvement (high beta weights and low top-box scores) were the extent to which users feel excitement when shopping on the websites (average top-box score: 31%) and the ease of finding exactly what they want (average top-box score: 38%). For excitement, the leader was Anthropologie (Old Navy lagging). For finding “exactly what I want,” the leader was Nordstrom (Urban Outfitters lagging).</li>
</ol>
<ol start="4">
<li><strong>Top frustrations were products out of stock, sizing issues, and slow loading. </strong>The most reported issue, affecting all websites, was products being out of stock, either entirely or for certain sizes. Some users reported not being notified of this until they were checking out. For nine of the websites, users reported issues with sizing (e.g., poor fit, missing sizes) and slow website loading (e.g., many large images and videos). Navigation/browsing issues were also common, affecting ten of the sites.</li>
</ol>
<ol start="5">
<li><strong>Online clothes shoppers have trouble getting exactly what they want. </strong>Quantitative and qualitative signals from our findings point in the same direction: users of these clothing websites have trouble getting exactly what they want. Quantitative signals include difficult navigation/browsing, difficulty finding brands quickly, lack of trust in product recommendations, difficult returns, and inaccurate size guides. The qualitative “why” behind the numbers includes annoyance at products being out of stock, fit/sizing uncertainty, slow loading times, and various navigation/browsing issues. Considering both quantitative and qualitative findings, the brands that would especially benefit from a focus on website UX are Urban Outfitters, Old Navy, Zara, and H&amp;M.</li>
</ol>
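<p>The prioritization described in takeaway 3 (high importance combined with low achievement) can be sketched in a few lines of Python. This is only an illustration, not the analysis used in the report: the scoring rule (beta weight multiplied by the remaining top-box headroom) is an assumed heuristic, the beta values are hypothetical placeholders, and only the two average top-box scores (31% and 38%) come from the findings above.</p>

```python
from dataclasses import dataclass


@dataclass
class Driver:
    name: str
    beta: float     # regression beta weight (HYPOTHETICAL values below)
    top_box: float  # average top-box score as a proportion from 0 to 1


def opportunity(driver: Driver) -> float:
    """Assumed heuristic: importance times headroom.

    A driver scores high when its beta weight is high and its
    top-box score is low, matching the prioritization idea in the text.
    """
    return driver.beta * (1.0 - driver.top_box)


def rank_drivers(drivers):
    """Return drivers ordered from most to least improvement potential."""
    return sorted(drivers, key=opportunity, reverse=True)


drivers = [
    # Top-box scores for the first two rows are from the article;
    # every beta value here is a placeholder, not a reported result.
    Driver("feel excitement when shopping", beta=0.20, top_box=0.31),
    Driver("find exactly what I want", beta=0.20, top_box=0.38),
    Driver("hypothetical well-served driver", beta=0.20, top_box=0.70),
    Driver("hypothetical low-importance driver", beta=0.02, top_box=0.31),
]

for d in rank_drivers(drivers):
    print(f"{d.name}: {opportunity(d):.3f}")
```

<p>With equal placeholder beta weights, the driver with the lower top-box score ranks first, and a driver with a low beta weight drops to the bottom even when its top-box score is poor. Other weighting schemes are equally defensible; the point is only that both inputs are considered together.</p>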
<p>For more details, see the <a href="https://measuringu.com/product/ux-nps-benchmark-report-for-clothing-websites-2026/">downloadable report</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>

