<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MeasuringU</title>
	<atom:link href="https://measuringu.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://measuringu.com</link>
	<description>UX Research and Software</description>
	<lastBuildDate>Tue, 09 Jun 2026 21:37:45 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://measuringu.com/wp-content/uploads/2020/11/site-icon.png</url>
	<title>MeasuringU</title>
	<link>https://measuringu.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Do Statistics Really Require 30 Participants?</title>
		<link>https://measuringu.com/do-statistics-really-require-30-participants/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=do-statistics-really-require-30-participants</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 09 Jun 2026 21:37:45 +0000</pubDate>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Sample Size]]></category>
		<category><![CDATA[Usability Testing]]></category>
		<category><![CDATA[User Experience]]></category>
		<category><![CDATA[User Research]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[confidence interval]]></category>
		<category><![CDATA[Confidence Intervals]]></category>
		<category><![CDATA[t-distribution]]></category>
		<category><![CDATA[t-test]]></category>
		<category><![CDATA[z-distribution]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47696</guid>

					<description><![CDATA[Should the sample size n be greater than 30? If you’ve taken any introductory statistics course or an AP statistics class (or helped your child with it), you’ve encountered the n ≥ 30 rule. The “magic number 5” rule we’ve written extensively about applies (with its important caveats) to problem discovery for usability testing. But the [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-scaled.png"><img fetchpriority="high" decoding="async" class="alignleft wp-image-47752 size-medium" src="https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-300x169.png" alt="Feature image showing an icon representing participants and &quot;≥&quot;" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-600x338.png 600w" sizes="(max-width: 300px) 100vw, 300px" /></a>Should the sample size <em>n</em> be greater than 30?</p>
<p>If you’ve taken any introductory statistics course or an AP statistics class (or helped your child with it), you’ve encountered the <em>n</em> ≥ 30 rule.</p>
<p>The “<a href="https://measuringu.com/specific-sample-sizes-in-discovery-studies/">magic number 5</a>” rule we’ve written extensively about applies (with its important caveats) to problem discovery for usability testing.</p>
<p>But the <em>n</em> ≥ 30 rule goes beyond usability testing, coming up across disciplines and even in classrooms. It will often be mentioned by skeptical stakeholders and during the peer review process (probably from <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC8505560/">Reviewer Number 2</a>). Violating it can feel like a methodological sin.</p>
<p>But where does it actually come from? And does it hold up in general and in UX research in particular?</p>
<p>The short answer is that the rule has real statistical roots, but they’re often misunderstood and misapplied.</p>
<h2>On One Hand: Arguments for Why Researchers Need <em>n</em> ≥ 30</h2>
<p>The <em>n</em> ≥ 30 rule is grounded in two related concerns: whether your statistical analyses will perform accurately with smaller sample sizes (1) when your raw data is normally distributed and (2) when raw data is not normally distributed.</p>
<h3>The <em>t</em>-Distribution Converges to <em>Z</em> at about <em>n</em> = 30 for Continuous Data</h3>
<p>When learning statistics, you’ll often start with the normal <em>z</em>-distribution for statistical tests. You can use tables or simple formulas to look up <em>z</em> values when making computations. But using the normal <em>z</em>-distribution and tables means you need to know the population standard deviation.</p>
<p>Unfortunately, in applied settings, we rarely know the population standard deviation! Fortunately, the alternative is to use <em>t</em>-distribution tables and computations. They have one additional input compared to <em>z</em> computations, which is the sample size (strictly speaking, <em>n</em> − 1, the <a href="https://sites.utexas.edu/sos/degreesfreedom/">degrees of freedom</a> for the <em>t</em>-distributions). However, when the sample size gets to about 30, <em>z</em> and <em>t</em> values converge, so they are roughly the same (see Figure 1).</p>
<p>Consequently, when <em>n</em> is at least 30, you don’t have to deal with somewhat more complicated small-sample statistics. Over time, this statistical footnote calcified into a general-purpose sample size rule.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/06/Figure1-scaled.png"><img decoding="async" class="alignnone wp-image-47754" src="https://measuringu.com/wp-content/uploads/2026/06/Figure1-300x225.png" alt="Figure 1: Approach of t to z as a function of degrees of freedom (n - 1)." width="800" height="599" srcset="https://measuringu.com/wp-content/uploads/2026/06/Figure1-300x225.png 300w, https://measuringu.com/wp-content/uploads/2026/06/Figure1-1024x767.png 1024w, https://measuringu.com/wp-content/uploads/2026/06/Figure1-768x575.png 768w, https://measuringu.com/wp-content/uploads/2026/06/Figure1-1536x1150.png 1536w, https://measuringu.com/wp-content/uploads/2026/06/Figure1-2048x1534.png 2048w, https://measuringu.com/wp-content/uploads/2026/06/Figure1-600x449.png 600w" sizes="(max-width: 800px) 100vw, 800px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Approach of <em>t</em> to <em>z</em> as a function of degrees of freedom (<em>n</em> − 1).</p>
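<p>The convergence in Figure 1 can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (<code>t_pdf</code>, <code>t_cdf</code>, <code>t_critical</code>) are our own illustrative choices, and the numerical integration and bisection are just one simple way to obtain two-sided 95% critical values without a statistics package.</p>

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: Simpson's rule on [0, x] plus 0.5 (symmetry about 0)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(df, confidence=0.95):
    """Two-sided critical value, found by bisection on the CDF."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


z_critical = NormalDist().inv_cdf(0.975)  # about 1.96

for n in (5, 10, 15, 20, 30, 100):
    t = t_critical(n - 1)
    print(f"n = {n:3d}  t = {t:.3f}  z = {z_critical:.3f}  t - z = {t - z_critical:.3f}")
```

<p>At <em>n</em> = 5 the <em>t</em> critical value is about 2.78, well above the <em>z</em> value of 1.96; at <em>n</em> = 30 it is about 2.05, so the two are close but not identical.</p>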
<h3>Traditional “Wald” Confidence Intervals Are Less Accurate at Small Sample Sizes</h3>
<p>The <em>n</em> ≥ 30 rule isn’t just for continuous data. It’s also been applied to binary (0/1, yes/no) data. Much like the convergence of <em>t</em> to <em>z</em> shown in Figure 1, the binomial distribution converges on the normal <em>z</em>-distribution, becoming approximately normal when <em>n</em> is about 30 and the expected proportion is not very close to 0 or 1. The most widely taught method for calculating binomial confidence intervals (the Wald method) grossly understates the width of the true interval when sample sizes are small because it’s based on the <em>z</em>-distribution. We demonstrated the inaccuracy of the Wald method in <a href="https://measuringu.com/article/estimating-completion-rates-from-small-samples-using-binomial-confidence-intervals-comparisons-and-recommendations/">our 2005 paper</a> using real-world completion rate data. For example, a 95% confidence interval around a completion rate with <em>n</em> = 15 constructed with the Wald method behaves more like a 72% confidence interval (wildly inaccurate; see below for other findings from that paper).</p>
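<p>To see why the Wald interval falls short, it can help to compute its actual coverage for one scenario. The sketch below is illustrative only: it assumes a hypothetical true completion rate of 0.9 with <em>n</em> = 15 and sums exact binomial probabilities, which is a different approach from the empirical analysis in the paper, so the result should not be expected to match the 72% figure. The function names are our own.</p>

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # about 1.96 for 95% confidence


def wald_interval(successes, n, z=Z):
    """Standard Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def exact_coverage(interval_fn, n, true_p):
    """Probability that the interval contains true_p, summed over all outcomes."""
    coverage = 0.0
    for x in range(n + 1):
        low, high = interval_fn(x, n)
        if low <= true_p <= high:
            coverage += math.comb(n, x) * true_p**x * (1 - true_p) ** (n - x)
    return coverage


# Assumed scenario (hypothetical): true completion rate 0.9, 15 participants.
coverage = exact_coverage(wald_interval, n=15, true_p=0.9)
print(f"Actual coverage of the nominal 95% Wald interval: {coverage:.1%}")
```

<p>In this scenario, a large part of the shortfall comes from samples in which every participant succeeds: the Wald interval then collapses to a single point at 100% and cannot contain the true rate.</p>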
<h3>UX Data Generally Isn’t Normally Distributed</h3>
<p>If you’ve ever stared at a histogram of task completion rates, time-on-task, or even SUS scores, you know that raw UX data rarely forms the classic bell curve. Completion rates are binary (0 or 1), usually with more successes than failures. Task times are often right-skewed due to a long tail of slow participants. Likert-scale items such as the Single Ease Question (<a href="https://measuringu.com/evolution-of-seq/">SEQ</a><sup>®</sup>) tend to cluster toward the top (more scores above the median than below it). As we showed in our article <a href="https://measuringu.com/is-ux-data-normal/">Is UX Data Normally Distributed?</a>, none of these distributions look remotely normal (see Figure 2).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions.png" rel="attachment wp-att-47698"><img decoding="async" class="alignnone wp-image-47698 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions.png" alt="Figure 2: Distributions of four UX metrics showing their non-normal raw distributions. " width="1041" height="1100" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions.png 1041w, https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions-284x300.png 284w, https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions-969x1024.png 969w, https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions-768x812.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions-600x634.png 600w" sizes="(max-width: 1041px) 100vw, 1041px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Distributions of four UX metrics showing their non-normal raw distributions.</p>
<p>That might sound alarming. Many statistical procedures, such as confidence intervals, <em>t</em>-tests, and ANOVA, assume normality. If the raw data isn’t normal, are those analyses invalid?</p>
<h2>On the Other Hand: Arguments Why Sometimes <em>n</em> &lt; 30 Is OK</h2>
<p>Fortunately, neither the convergence of <em>t</em> to <em>z</em> nor the non-normality of raw data means confidence intervals or statistical comparison tests are invalid below <em>n</em> = 30. For continuous data, the <em>t</em>-distribution works correctly at any sample size. For binary data, the standard-Wald method can be replaced with better-performing alternatives. And for the normality concern, the Central Limit Theorem (CLT) means the distribution of your raw data matters far less than most researchers assume. This is where the distinction between the distribution of your <em>raw data</em> and the distribution of your <em>sample means</em> becomes critical. We’ll start with the normality issue.</p>
<h3>The Central Limit Theorem Solves Most Normality Issues</h3>
<p>One of the most important concepts in all of statistics is the CLT. According to the CLT, as the sample size increases, the distribution of the mean becomes more and more normal, <a href="https://measuringu.com/is-ux-data-normal/">regardless of the normality of the underlying distribution</a>.</p>
<p>How quickly does the CLT kick in? Considering a wide variety of distributions, most achieve a normal or near-normal distribution of the means when <em>n</em> is 30, making that a reasonably safe bet when you don’t know anything about the distributions.</p>
<p>For many common UX metrics, however, the distributions of the means approach normality sooner than you’d expect. In bootstrap simulations on real UX datasets (repeatedly drawing sub-samples and computing means), the sampling distributions of SEQ ratings and SUS scores approach normality by <em>n</em> = 10. Even binomial completion rates (which are maximally non-normal) and completion times approach a normal sampling distribution by <em>n</em> = 20–30 (see Figure 3). But it turns out there’s a fix for even smaller binomial sample sizes (see below).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes.png" rel="attachment wp-att-47699"><img loading="lazy" decoding="async" class="alignnone wp-image-47699 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes.png" alt="Figure 3: Distributions of the means for four UX metrics with varying sample sizes." width="1185" height="1445" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes.png 1185w, https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes-246x300.png 246w, https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes-840x1024.png 840w, https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes-768x937.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes-600x732.png 600w" sizes="auto, (max-width: 1185px) 100vw, 1185px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 3:</strong> Distributions of the means for four UX metrics with varying sample sizes.</p>
<p>This is why means, <em>t</em>-tests, and confidence intervals work reasonably well even when your raw responses are skewed (e.g., completion times) or bounded (e.g., rating scales).</p>
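<p>The resampling idea behind Figure 3 can be sketched in a few lines. This example does not use our datasets: it generates hypothetical right-skewed task times from an exponential distribution (an assumption for illustration), repeatedly draws sub-samples of size <em>n</em>, and uses skewness as a rough indicator of how far the distribution of means is from symmetric. The function names are our own.</p>

```python
import random
import statistics


def skewness(values):
    """Moment-based skewness (0 for a perfectly symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def skew_of_sample_means(population, n, resamples=5000, rng=None):
    """Skewness of the means of repeated sub-samples drawn with replacement."""
    rng = rng or random.Random(0)
    means = [statistics.fmean(rng.choices(population, k=n)) for _ in range(resamples)]
    return skewness(means)


rng = random.Random(42)
# Hypothetical right-skewed "task times" in seconds (not real data).
task_times = [20 + rng.expovariate(1 / 40) for _ in range(2000)]

raw_skew = skewness(task_times)
skews = {n: skew_of_sample_means(task_times, n, rng=rng) for n in (5, 10, 30)}

print(f"raw data skewness: {raw_skew:.2f}")
for n, s in skews.items():
    print(f"skewness of means at n = {n:2d}: {s:.2f}")
```

<p>The raw values are strongly skewed, while the skewness of the sample means shrinks as <em>n</em> grows, which is the CLT at work.</p>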
<h3>The <em>t</em>-Distribution Was Built for Small Samples</h3>
<p>The deep irony of being advised that “you need 30 to use a <em>t</em>-test” is that the <em>t</em>-distribution was invented specifically for small samples.</p>
<p>In 1899, William S. Gossett (Figure 4), a recent graduate of New College, Oxford with degrees in chemistry and mathematics, became one of the first scientists to join the Guinness brewery.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett.jpg" rel="attachment wp-att-47700"><img loading="lazy" decoding="async" class="alignnone wp-image-47700 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett.jpg" alt="Figure 4: An anachronistic interpretation of William S. Gossett (Student), adapted from his Wikipedia page photo (public domain) with AI assistance." width="837" height="1075" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett.jpg 837w, https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett-234x300.jpg 234w, https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett-797x1024.jpg 797w, https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett-768x986.jpg 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett-600x771.jpg 600w" sizes="auto, (max-width: 837px) 100vw, 837px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 4:</strong> An anachronistic interpretation of William S. Gossett (Student), adapted from his Wikipedia page photo (public domain) with AI assistance.</p>
<p>As Michael Cowles wrote in his book <em>Statistics in Psychology: An Historical Perspective</em>, “Compared with the giants of his day, he published very little, but his contribution is of critical importance. … The nature of the process of brewing, with its variability in temperature and ingredients, means that it is not possible to take large samples over a long run” (pp. 108–109).</p>
<p>Gossett couldn’t use <em>z</em>-scores because they don’t perform well with small samples. After analyzing the deficiencies of the <em>z</em>-distribution for small-sample statistical tests, he worked out the necessary adjustments as a function of degrees of freedom to produce the <em>t</em>-distribution—published in 1908 under the pseudonym “Student,” because Guinness prohibited employees from publishing. In the work that led to his tables, Gossett performed an early version of <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo simulations</a>. He prepared 3,000 cards labeled with physical measurements taken on criminals, shuffled them, then dealt them into 750 groups of size 4 (<em>n</em> much smaller than 30).</p>
<p>The point is that the <em>t</em>-distribution was designed precisely to handle small samples correctly. The idea that you need <em>n</em> ≥ 30 even when using the <em>t</em>-distribution contradicts both the history and purpose of the statistic.</p>
<p>Historians of statistics widely regard Gossett&#8217;s publication of Student&#8217;s <em>t</em>-test as a landmark event. In a letter to <a href="https://en.wikipedia.org/wiki/Ronald_Fisher">Ronald A. Fisher</a> containing an early copy of the <em>t</em>-tables, Gossett wrote, &#8220;<a href="https://www.physoc.org/magazine-articles/the-strange-origins-of-the-students-t-test/">You are the only man that&#8217;s ever likely to use them</a>.&#8221;</p>
<p>Gossett got a lot of things right. He certainly got that wrong.</p>
<h3>The Wald Method Can Be Fixed with a Simple Adjustment</h3>
<p>While binary data generates inaccurate confidence intervals with the standard-Wald method when <em>n</em> &lt; 20–30, there&#8217;s a fix. A slight adjustment makes even small binomial samples generate accurate confidence intervals. In the same <a href="https://measuringu.com/article/estimating-completion-rates-from-small-samples-using-binomial-confidence-intervals-comparisons-and-recommendations/">2005 paper</a> where we demonstrated the problem with the Wald method, we also showed that a simple adjustment brings 95% confidence intervals back to accurate coverage. The adjustment is very close to just adding two successes and two failures to your observed data, then computing a standard-Wald interval around that adjusted proportion. This is now often called the adjusted-Wald method, formalized by <a href="https://math.unm.edu/~james/Agresti1998.pdf">Agresti and Coull</a>. A small tweak to the math turns an unreliable method into a reliable one, even with small samples.</p>
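<p>As a minimal sketch of the adjustment described above: add <em>z</em>²/2 to the number of successes and <em>z</em>² to the sample size (for 95% confidence, roughly two successes and two failures), then compute a Wald interval around the adjusted proportion. The function name and the clamping of the result to the 0–1 range are our illustrative choices.</p>

```python
import math
from statistics import NormalDist


def adjusted_wald(successes, n, confidence=0.95):
    """Adjusted-Wald interval for a proportion.

    Adds z^2 / 2 successes and z^2 trials before applying the Wald formula.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = n + z**2
    p_adj = (successes + z**2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the valid range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# Example: 7 of 10 participants completed the task.
low, high = adjusted_wald(successes=7, n=10)
print(f"95% adjusted-Wald interval: {low:.1%} to {high:.1%}")
```

<p>For 7 of 10 successes, this gives an interval from roughly 39% to 90%. Unlike the standard-Wald method, it still produces an interval with nonzero width when all participants succeed or all fail.</p>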
<h2>Our Recommendation: Stats Work Below <em>n</em> = 30 with the Right Approach</h2>
<p>When the cost of sampling is high, as it often is, always insisting on at least 30 users regardless of study goals is wasteful at best and infeasible at worst. A more appropriate approach is to use sample size formulas derived from the specific statistical analysis (e.g., confidence interval estimation, significance test), accounting for the data type, expected variability, desired confidence, and target effect size. We’ve published articles on this for several common UX scenarios, for example:</p>
<ul>
<li><a href="https://measuringu.com/ux-lite-sample-sizes-for-confidence-intervals/">UX-Lite Sample Sizes for Confidence Intervals</a></li>
<li><a href="https://measuringu.com/ux-lite-sample-sizes-for-comparison-to-a-benchmark/">UX-Lite Sample Sizes for Comparison to a Benchmark</a></li>
<li><a href="https://measuringu.com/sample-sizes-for-comparison-of-ux-lite-scores/">Sample Sizes for Comparing UX-Lite Scores</a></li>
</ul>
<p>Knowing that small samples <em>can</em> work statistically doesn&#8217;t tell you how to handle them. The right approach depends on the type of data you&#8217;re analyzing:</p>
<ul>
<li><strong>Rating scales</strong> (SUS, SEQ, SUPR-Q): Use the <em>t</em>-distribution with the correct degrees of freedom for confidence intervals and tests of significance. It was designed for exactly this situation.</li>
<li><strong>Binary data</strong> (completion rates, yes/no): Use adjusted methods (e.g., <a href="https://measuringu.com/article/estimating-completion-rates-from-small-samples-using-binomial-confidence-intervals-comparisons-and-recommendations/">adjusted-Wald</a> for confidence intervals, <a href="https://measuringu.com/what-is-the-n-1-two-proportion-test/"><em>N</em>−1 two-proportion</a> method for significance tests), which perform accurately at small samples where standard methods based on the <em>z</em>-distribution break down.</li>
<li><strong>Time data</strong>: For confidence intervals, log-transform the raw data to correct for right-skew, then transform back to the original scale (usually no need to transform for tests of significance, but it is always an option).</li>
</ul>
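<p>For the time-data recommendation, the log-transform procedure can be sketched as follows. The task times are hypothetical, the function name is our own, and the <em>t</em> critical value (2.262 for 9 degrees of freedom at 95% confidence) is taken from a standard <em>t</em>-table rather than computed.</p>

```python
import math
import statistics


def log_time_interval(times, t_crit):
    """Confidence interval for task time computed on the log scale.

    Returns (low, center, high) after transforming back to the original
    scale; the center is the geometric mean of the times.
    """
    logs = [math.log(t) for t in times]
    mean_log = statistics.fmean(logs)
    se_log = statistics.stdev(logs) / math.sqrt(len(logs))
    return (math.exp(mean_log - t_crit * se_log),
            math.exp(mean_log),
            math.exp(mean_log + t_crit * se_log))


# Hypothetical task times in seconds for 10 participants (not real data).
times = [34, 41, 45, 52, 58, 63, 71, 88, 120, 210]
T_CRIT_DF9 = 2.262  # two-sided 95% t critical value for n - 1 = 9 degrees of freedom

low, center, high = log_time_interval(times, T_CRIT_DF9)
print(f"geometric mean: {center:.1f} s, 95% CI: {low:.1f} to {high:.1f} s")
print(f"arithmetic mean for comparison: {statistics.fmean(times):.1f} s")
```

<p>Because the interval is computed on the log scale, it is asymmetric around the center after back-transformation, and the center (the geometric mean) sits below the arithmetic mean for right-skewed times.</p>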
<p>When samples are small, the concern shifts from normality and sampling distributions to sensitivity and power—the accuracy of an estimate (confidence intervals) or your ability to detect a true difference when one exists (hypothesis testing).</p>
<p>The procedures work correctly with small samples, but you&#8217;re limited to relatively imprecise estimates (confidence intervals) or reliably detecting only large differences (hypothesis tests). Subtle or moderate effects will likely go undetected, not because the statistics are broken but because small samples carry more uncertainty. We cover how to plan for adequate sensitivity and power in all our articles on sample size estimation (e.g., <a href="https://measuringu.com/sample-sizes-for-rating-scale-comparisons/">Sample Sizes for Comparing Rating Scale Means</a>). Sometimes a small sample is all you need to achieve your research goal.</p>
<p>This controversy is similar to the &#8220;magic number 5&#8221; controversy but applied to <a href="https://measuringu.com/three-goals/">summative rather than formative</a> research. The &#8220;magic number 30&#8221; has a real empirical rationale, as it&#8217;s roughly where the CLT kicks in for a wide variety of distributions and where <em>t</em> converges on <em>z</em>. In practice, however, it&#8217;s applied far too rigidly. The appropriate sample size depends on the distribution, the expected variability, the desired confidence and power, and the minimum <a href="https://measuringu.com/an-introduction-to-effect-sizes/">effect size</a> you need to detect. A sample of 30 is almost never exactly right for any specific situation.</p>
<p>It isn&#8217;t much more complicated to use the <em>t</em>-distribution than the <em>z</em>-distribution, or the adjusted-Wald instead of the Wald method (you just need to account for the sample size). The entire reason the <em>t</em>-distribution was developed was to enable the analysis of small samples. This is just one of the less obvious ways usability practitioners benefit from the science and practice of beer brewing.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using the TAC-10 for Screening and Data Cleaning</title>
		<link>https://measuringu.com/using-the-tac10-for-screening-and-data-cleaning/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-the-tac10-for-screening-and-data-cleaning</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 02 Jun 2026 23:09:08 +0000</pubDate>
				<category><![CDATA[Methods]]></category>
		<category><![CDATA[User Research]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Online Panels]]></category>
		<category><![CDATA[TAC-10]]></category>
		<category><![CDATA[Tech savviness]]></category>
		<category><![CDATA[UX Research]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47676</guid>

					<description><![CDATA[It’s hard to collect data for UX research, and once you have it, you have to clean it. In a simpler world, all respondents would be honest and focused on providing high-quality information rather than maximizing income, but that’s not the world we live in. From past research, we estimate the prevalence of cheating on [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1.png" rel="attachment wp-att-24969"><img loading="lazy" decoding="async" class="alignleft wp-image-47712 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-300x169.png" alt="Feature image showing TAC-10 being used for screening and data cleaning" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>It’s hard to collect data for UX research, and once you have it, you have to clean it.</p>
<p>In a simpler world, all respondents would be honest and focused on providing high-quality information rather than maximizing income, but that’s not the world we live in. From <a href="https://measuringu.com/cheat-survey/">past research</a>, we estimate the prevalence of cheating on paid panels to be about 10% of respondents (ranging from 3–20%).</p>
<p>UX researchers can use <a href="https://measuringu.com/cleaning-data/">numerous strategies</a> for screening (stopping bad actors before they get to the actual study) and cleaning (finding and removing poor quality respondents after study completion). These include:</p>
<ul>
<li>Identification of speeders</li>
<li>Disqualifying questions</li>
<li>Attention checks</li>
<li>Review of open-ended responses</li>
<li>Internal consistency</li>
<li>Straightlining</li>
<li>Review of session recordings (when available)</li>
<li>Duplicate and bot detection</li>
</ul>
<p>AI complicates all these approaches. Modern AI can mimic attentive respondent behavior well enough to slip past most of these detection methods. We are encouraged, however, that many panel operators have taken active steps to restrict AI fraud at the source.</p>
<p>When those safeguards are in place, or when participants come from a verified human population such as a customer list, we propose another dual-purpose and quick approach.</p>
<p>In this article, we demonstrate how to use TAC-10™ response patterns not only for the checklist’s primary purpose as a measure of tech savviness but also as a type of internal consistency check for screening and detecting inattentive or misrepresenting human respondents.</p>
<h2>TAC-10 Basics</h2>
<p>In a <a href="https://measuringu.com/how-to-use-the-tac/">series of articles</a>, we reviewed the findings of eight years of research into measuring tech savviness. In that research program, we explored <a href="https://measuringu.com/in-search-of-tech-savvy-measures/">several methods</a> for measuring tech savviness, including quizzes (what people know), self-assessment questionnaires (what people feel), and technical activity checklists (what people are confident doing).</p>
<p>After analyzing thousands of participants’ data to see how measures of tech savviness predict task performance, we determined that technical activity checklists had better measurement properties than quizzes or questionnaires. Of the various versions of checklists that we studied, we determined that the one with ten activities and a none-of-the-above option (the TAC-10 shown in Figure 1) has the best balance between conciseness and completeness.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/1.png" rel="attachment wp-att-47677"><img loading="lazy" decoding="async" class="alignnone wp-image-47677 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/1.png" alt="Figure 1: The current version of the TAC-10 (image from the MUiQ® platform). " width="683" height="625" srcset="https://measuringu.com/wp-content/uploads/2026/05/1.png 683w, https://measuringu.com/wp-content/uploads/2026/05/1-300x275.png 300w, https://measuringu.com/wp-content/uploads/2026/05/1-600x549.png 600w" sizes="auto, (max-width: 683px) 100vw, 683px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> The current version of the TAC-10 (image from the <a href="https://measuringu.com/muiq/">MUiQ<sup>®</sup> platform</a>).</p>
<p>The TAC-10 score for a person is the number of selected items. It’s a reliable (consistent) and valid (predictive) measure of tech savviness that, for its primary purpose, can be used (1) to classify participants into groups with low, medium, or high levels of tech savviness and (2) as a tech savviness predictor or covariate in advanced statistical analysis.</p>
<h2>Some TAC-10 Response Patterns Are More Plausible than Others</h2>
<p>In addition to being a tool to measure and classify tech savviness, TAC-10 response patterns can also be used to identify potentially problematic respondents based on the plausibility of the pattern.</p>
<p>In May 2023, we collected a large sample of completed TAC-16 checklists as part of a screening survey (<em>n</em> = 4,731) to acquire enough data for <a href="https://measuringu.com/rasch-analysis-of-three-technical-activity-checklists/">Rasch analysis</a> of three versions of the TAC (TAC-9, TAC-10, and TAC-16). In this new analysis, we applied various methods to classify response patterns as plausible, implausible, or indeterminate. Examples of plausible patterns are those consistent with perfect or near-perfect <a href="https://en.wikipedia.org/wiki/Guttman_scale">Guttman scaling</a>. Response patterns that are logically inconsistent are implausible. Patterns that are not clearly plausible or implausible are indeterminate. For these analyses, we coded each TAC-10 response as a binary string of 0s (not selected) and 1s (selected) for activities in the order shown in Figure 1. For example, 1100000000 indicates a user who selected &#8220;installing a new app on your phone&#8221; and &#8220;setting up a new phone,&#8221; but no other activities.</p>
<h3>Responses Consistent with Guttman Scaling Are Plausible</h3>
<p>Guttman scaling, which dates back to the 1940s, is a deterministic predecessor of probabilistic Rasch scaling. The goal of a Guttman scale is to develop a set of distinctive items, from easy to difficult, that form a unidimensional scale. The range from easy to difficult can refer to characteristics like easy to solve to difficult to solve for math problems or easy to agree with to difficult to agree with for attitudinal scales.</p>
<p>For 10 binary (yes/no) items like the TAC-10, there are 2<sup>10</sup> (1,024) possible arrangements of selected (1) or unselected (0) items. Only 11, however, are consistent with a perfect Guttman scale (all 1s toward the left side of the pattern, all 0s to the right): 0000000000, 1000000000, 1100000000, 1110000000, 1111000000, 1111100000, 1111110000, 1111111000, 1111111100, 1111111110, and 1111111111. We categorized these patterns as plausible.</p>
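<p>The count above is easy to verify by enumeration. This short sketch (the helper name <code>is_guttman</code> is our own) generates all 10-item patterns and keeps those in which every 1 comes before every 0.</p>

```python
from itertools import product


def is_guttman(pattern):
    """True when every 1 precedes every 0, e.g. '1110000000'."""
    return "01" not in pattern


all_patterns = ["".join(bits) for bits in product("01", repeat=10)]
guttman = sorted((p for p in all_patterns if is_guttman(p)), reverse=True)

print(len(all_patterns), "possible patterns")
print(len(guttman), "perfect Guttman patterns:")
for p in guttman:
    print(p)
```

<p>In general, a checklist with <em>k</em> ordered items has 2<sup><em>k</em></sup> possible patterns but only <em>k</em> + 1 perfect Guttman patterns.</p>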
<h3>Other Plausible Response Patterns</h3>
<p>In practice, other patterns that are close to Guttman patterns are also likely to be plausible. For example, if someone is comfortable with all activities except HTML, the pattern would be 1111111101. Although it’s unlikely that someone who programs efficiently in C knows nothing about HTML, it’s possible that they would lack sufficient practical or deep familiarity with it to be comfortable selecting it. In most cases, Guttman-like patterns with one or two discontinuities are plausible.</p>
<h3>Implausible Response Patterns</h3>
<p>Patterns that are the inverse of Guttman patterns (1 and 0 swapped) are categorized as implausible (except for 0000000000 and 1111111111). For example, the pattern 0000000001 indicates someone who programs efficiently but isn’t comfortable with anything else in the list—possible but highly unlikely.</p>
<p>Other problematic patterns are those that start with 01 because, for this to be plausible, the respondent would have to be comfortable setting up a new phone but uncomfortable adding an app to that phone.</p>
<p>Patterns that contain just one 1 (other than 1000000000) are implausible and may indicate a respondent who misunderstood the instruction to select all that apply.</p>
<h3>Indeterminate Response Patterns</h3>
<p>Patterns not categorized as plausible or implausible are provisionally defined as indeterminate.</p>
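<p>The rules above can be sketched as a small Python function. This is an illustration under stated assumptions, not the code used for the analysis: the function name <code>classify</code> and the optional <code>other_plausible</code> argument are hypothetical, and near-Guttman patterns count as plausible only if the caller supplies them (for example, from a frequency table), because the rules alone do not define which near-Guttman patterns qualify.</p>

```python
def classify(pattern: str, other_plausible=frozenset()) -> str:
    """Label a 10-character TAC-10 response string with the rules described above."""
    if len(pattern) != 10 or set(pattern) - {"0", "1"}:
        raise ValueError("expected a 10-character string of 0s and 1s")
    # Perfect Guttman pattern: every 1 precedes every 0.
    if "01" not in pattern:
        return "plausible"
    # Inverse Guttman pattern: every 0 precedes every 1
    # (all-0 and all-1 were already returned as plausible above).
    if "10" not in pattern:
        return "implausible"
    # Comfortable setting up a new phone but not installing an app on it.
    if pattern.startswith("01"):
        return "implausible"
    # Exactly one activity selected, and it is not the easiest one.
    if pattern.count("1") == 1:
        return "implausible"
    # Assumption: the caller decides which near-Guttman patterns are plausible.
    if pattern in other_plausible:
        return "plausible"
    return "indeterminate"


print(classify("1100000000"))                  # plausible
print(classify("0000000001"))                  # implausible
print(classify("1111111101"))                  # indeterminate
print(classify("1111111101", {"1111111101"}))  # plausible
```

<p>The implausibility checks run before the <code>other_plausible</code> lookup, so a pattern that breaks one of the rules stays implausible even if it is passed in by mistake.</p>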
<h2>Plausible TAC-10 Patterns Are Much More Likely in Practice than Implausible Patterns</h2>
<p>We investigated the frequency of occurrence of plausible, implausible, and indeterminate patterns in our large sample of TAC-10 scores. Of the 1,024 possible patterns, only 199 appeared at least once in our dataset of 4,731 cases.</p>
<h3>Guttman Patterns</h3>
<p>Table 1 shows the frequency of Guttman patterns in the large TAC-10 database. Together, these patterns account for 56.4% of cases.</p>

<table id="tablepress-1048" class="tablepress tablepress-id-1048">
<thead>
<tr class="row-1">
	<th class="column-1">Guttman Patterns</th><th class="column-2">Freq</th><th class="column-3">Percent</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">1111111111</td><td class="column-2">365</td><td class="column-3"> 7.7%</td>
</tr>
<tr class="row-3">
	<td class="column-1">1111111110</td><td class="column-2">633</td><td class="column-3">13.4%</td>
</tr>
<tr class="row-4">
	<td class="column-1">1111111100</td><td class="column-2">764</td><td class="column-3">16.1%</td>
</tr>
<tr class="row-5">
	<td class="column-1">1111111000</td><td class="column-2">474</td><td class="column-3">10.0%</td>
</tr>
<tr class="row-6">
	<td class="column-1">1111110000</td><td class="column-2">268</td><td class="column-3"> 5.7%</td>
</tr>
<tr class="row-7">
	<td class="column-1">1111100000</td><td class="column-2">104</td><td class="column-3"> 2.2%</td>
</tr>
<tr class="row-8">
	<td class="column-1">1111000000</td><td class="column-2"> 27</td><td class="column-3"> 0.6%</td>
</tr>
<tr class="row-9">
	<td class="column-1">1110000000</td><td class="column-2"> 20</td><td class="column-3"> 0.4%</td>
</tr>
<tr class="row-10">
	<td class="column-1">1100000000</td><td class="column-2">  9</td><td class="column-3"> 0.2%</td>
</tr>
<tr class="row-11">
	<td class="column-1">1000000000</td><td class="column-2">  5</td><td class="column-3"> 0.1%</td>
</tr>
<tr class="row-12">
	<td class="column-1">0000000000</td><td class="column-2">  0</td><td class="column-3"> 0.0%</td>
</tr>
<tr class="row-13">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td>
</tr>
<tr class="row-14">
	<td class="column-1">Total</td><td class="column-2">2669</td><td class="column-3">56.4%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1048 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Distribution of Guttman patterns in the large database.</p>
<h3>Other Plausible Patterns</h3>
<p>Table 2 shows other frequently occurring plausible patterns (each accounting for at least 0.5% of cases in the database). The 21 patterns in the table accounted for 30.7% of cases. In combination, the percentage of the 32 Guttman and other high-frequency plausible patterns in the database is 87.1%.</p>

<table id="tablepress-1049" class="tablepress tablepress-id-1049">
<thead>
<tr class="row-1">
	<th class="column-1">Other Plausible Patterns</th><th class="column-2"> Freq</th><th class="column-3">Percent</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">1111110100</td><td class="column-2"> 271</td><td class="column-3"> 5.7%</td>
</tr>
<tr class="row-3">
	<td class="column-1">1111111010</td><td class="column-2"> 179</td><td class="column-3"> 3.8%</td>
</tr>
<tr class="row-4">
	<td class="column-1">1111101000</td><td class="column-2"> 152</td><td class="column-3"> 3.2%</td>
</tr>
<tr class="row-5">
	<td class="column-1">1111110110</td><td class="column-2"> 127</td><td class="column-3"> 2.7%</td>
</tr>
<tr class="row-6">
	<td class="column-1">1111010100</td><td class="column-2"> 104</td><td class="column-3"> 2.2%</td>
</tr>
<tr class="row-7">
	<td class="column-1">1111010000</td><td class="column-2">  99</td><td class="column-3"> 2.1%</td>
</tr>
<tr class="row-8">
	<td class="column-1">1111110010</td><td class="column-2">  72</td><td class="column-3"> 1.5%</td>
</tr>
<tr class="row-9">
	<td class="column-1">1111101100</td><td class="column-2">  66</td><td class="column-3"> 1.4%</td>
</tr>
<tr class="row-10">
	<td class="column-1">1111011100</td><td class="column-2">  52</td><td class="column-3"> 1.1%</td>
</tr>
<tr class="row-11">
	<td class="column-1">1111010110</td><td class="column-2">  44</td><td class="column-3"> 0.9%</td>
</tr>
<tr class="row-12">
	<td class="column-1">1111111101</td><td class="column-2">  32</td><td class="column-3"> 0.7%</td>
</tr>
<tr class="row-13">
	<td class="column-1">1110100000</td><td class="column-2">  30</td><td class="column-3"> 0.6%</td>
</tr>
<tr class="row-14">
	<td class="column-1">1111111011</td><td class="column-2">  30</td><td class="column-3"> 0.6%</td>
</tr>
<tr class="row-15">
	<td class="column-1">1111100100</td><td class="column-2">  28</td><td class="column-3"> 0.6%</td>
</tr>
<tr class="row-16">
	<td class="column-1">1111011110</td><td class="column-2">  26</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-17">
	<td class="column-1">1111101010</td><td class="column-2">  26</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-18">
	<td class="column-1">1111011000</td><td class="column-2">  25</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-19">
	<td class="column-1">1110101000</td><td class="column-2">  23</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-20">
	<td class="column-1">1100100000</td><td class="column-2">  22</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-21">
	<td class="column-1">1101100000</td><td class="column-2">  22</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-22">
	<td class="column-1">1110111000</td><td class="column-2">  22</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-23">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td>
</tr>
<tr class="row-24">
	<td class="column-1">Total</td><td class="column-2">1452</td><td class="column-3">30.7%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1049 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2:</strong> Distribution of other patterns in the large database that are plausible and had frequencies of at least 0.5%.</p>
<h3>Implausible Patterns</h3>
<p>The database did not contain any cases matching an inverse Guttman pattern.</p>
<p>There were 17 implausible patterns that started with 01, each having a frequency of 1 or 2 for a total of 21, accounting for just 0.4% of the data.</p>
<p>There were only four cases (0.1% of the data) in which a single activity was chosen past the phone activities (three cases with 0010000000 and one with 0001000000, two additional implausible patterns).</p>
<h3>Indeterminate Patterns</h3>
<p>Because there were 32 plausible and 19 implausible patterns (51) out of a total of 199 patterns, the remaining 148 patterns are indeterminate.</p>
<p>Combined, the indeterminate patterns account for 12.4% of the data, with no individual indeterminate pattern accounting for more than 0.4% of cases.</p>
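<p>The bookkeeping behind these percentages can be reproduced with a few lines of Python. The case counts come from the tables and text above; the count of indeterminate cases is not reported directly, so this sketch derives it by subtraction, and the variable names are our own.</p>

```python
N_CASES = 4731      # completed checklists in the database
N_PATTERNS = 199    # distinct patterns that appeared at least once

guttman_cases = 2669          # Table 1 total (11 patterns)
other_plausible_cases = 1452  # Table 2 total (21 patterns)
starts_with_01_cases = 21     # 17 patterns
single_activity_cases = 4     # 2 patterns (3 cases + 1 case)

plausible_cases = guttman_cases + other_plausible_cases
implausible_cases = starts_with_01_cases + single_activity_cases
# Derived by subtraction, not reported directly in the text.
indeterminate_cases = N_CASES - plausible_cases - implausible_cases
indeterminate_patterns = N_PATTERNS - (11 + 21) - (17 + 2)


def pct(count: int) -> float:
    """Percentage of all cases, rounded to one decimal place."""
    return round(100 * count / N_CASES, 1)


print(pct(guttman_cases), pct(other_plausible_cases))  # 56.4 30.7
print(pct(plausible_cases))                            # 87.1
print(pct(implausible_cases))                          # 0.5
print(pct(indeterminate_cases))                        # 12.4
print(indeterminate_patterns)                          # 148
```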
<h2>Summary and Discussion</h2>
<p>In addition to its use as a measure of tech savviness, we investigated how well the TAC-10 might be used to identify plausible and implausible response patterns for the purpose of identifying potentially problematic respondents in screening and data cleaning.</p>
<p>Based on our large database of TAC-10 cases (<em>n</em> = 4,731), using two criteria for identifying plausible response patterns (matching Guttman patterns and/or frequently occurring patterns), we found that 56.4% of cases matched Guttman patterns; an additional 21 frequently occurring patterns that slightly deviated from Guttman patterns accounted for 30.7%, for a total of 87.1%. Clearly implausible patterns accounted for only 0.5% of cases, leaving the others indeterminate.</p>
<p>Our key conclusions from these analyses were:</p>
<p><strong>Plausible patterns made up the vast majority (87%) of TAC-10 cases. </strong>This suggests that most respondents were attending to the items rather than carelessly checking boxes, especially because we randomized the order of presentation of the items.</p>
<p><strong>Implausible patterns were rare. </strong>There were no occurrences of inverse Guttman patterns, and only about 0.5% of the responses had a problematic pattern starting with 01 or containing a single 1 (aside from a single 1 for the easiest activity).</p>
<p><strong>TAC-10 responses can be used for screening and data cleaning. </strong>These results (a large percentage of plausible and a low percentage of implausible response patterns) are encouraging regarding the application of TAC-10 to identify potentially problematic (implausible or indeterminate) response patterns as part of a battery of approaches used to identify potential cheaters (along with other strategies such as examination of open-ended responses, implausible completion times, distractors in multiple choice items, attention checks, and straightlining).</p>
<p><strong>The TAC-10 is not going to solve AI fraud. </strong>We don’t think the TAC-10 is necessarily a solution to AI fraud. More sophisticated AI methods can convincingly mimic either a low- or high-skilled human respondent, possibly by training on the articles we’ve published on the TAC-10. However, the TAC-10 remains a valuable screening tool in contexts where respondents come from a known population, such as a customer list, or where other panel-level methods have already confirmed that participants are human.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Does AI Find Real UI Problems or Just Hallucinations?</title>
		<link>https://measuringu.com/does-ai-find-real-ui-problems-or-just-hallucinations/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=does-ai-find-real-ui-problems-or-just-hallucinations</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD •&nbsp;Jeff Sauro, PhD •&nbsp;Will Schiavone, PhD&nbsp;•&nbsp;Lucas Plabst, PhD]]></dc:creator>
		<pubDate>Wed, 27 May 2026 00:41:28 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[Gemini]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<category><![CDATA[Usability Problem]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47650</guid>

					<description><![CDATA[In a previous experiment, AI identified roughly half the usability problems that trained researchers found in a video of a usability test session. That sounds promising. If AI can find usability issues, it can substantially increase the amount of usability testing that research teams can conduct. But in our analysis of that video, AI generated nearly [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47672 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-300x169.png" alt="Feature image showing an AI robot, three documents each labeled &quot;Verified&quot;, &quot;Fake&quot;, or &quot;False alarm&quot;, and a researcher." width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>In a <a href="https://measuringu.com/ai-vs-human-usability-problem-analysis-of-a-video/">previous experiment</a>, AI identified roughly half the usability problems that trained researchers found in a video of a usability test session.</p>
<p>That sounds promising. If AI can find usability issues, it can substantially increase the amount of usability testing that research teams can conduct.</p>
<p>But in our analysis of that video, AI generated nearly as many <em>additional</em> problems that humans never flagged. Are these problems hidden gems missed by multiple researchers, or just AI hallucinations?</p>
<p>For this article, we classified all the unique problems the AIs generated into one of three categories:</p>
<ol>
<li>a real problem humans missed</li>
<li>a false alarm (a true observation misread as a usability problem)</li>
<li>a hallucination (something the AI reported that simply never happened)</li>
</ol>
<p>What we found suggests that the new AI problems are mostly false alarms, but there are some notable exceptions.</p>
<h2>Experimental Design: Four Researchers, Two LLMs, and One Video</h2>
<p>For this study, we had four humans (professional UX researchers working at MeasuringU) review a video from a previous usability benchmark study of online dining reservation websites. Each researcher independently created a list of the usability issues they observed in the six-minute video.</p>
<p>We ran the video through each of two LLMs (ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking) four times, <a href="https://measuringu.com/ai-usability-problem-analysis-of-a-video/">using the same prompt</a> each time.</p>
<p>So in this study, we held constant the video, the key elements of the prompt, and the LLM versions/settings—variables that we plan to vary in future studies. This time, we varied only the type of analyst: human, ChatGPT, and Gemini.</p>
<h2>Gemini Finds a Jewel; ChatGPT Goes on a Tangent</h2>
<p>Using the human-generated and -verified problem lists as the “gold standard,” Figure 1 shows a summary of what we found.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47542" src="https://measuringu.com/wp-content/uploads/2026/05/Figure2-1024x901.png" alt="Venn diagram of usability problem discovery by humans, ChatGPT, and Gemini." width="700" height="616" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure2-1024x901.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-300x264.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-768x676.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-1536x1351.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-2048x1802.png 2048w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-600x528.png 600w" sizes="auto, (max-width: 700px) 100vw, 700px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1: </strong>Venn diagram of usability problem discovery by humans, ChatGPT, and Gemini.</p>
<p>We know Venn diagrams can generate some bad high school math memories, so here’s a summary for all our sanity:</p>
<ul>
<li>Four human researchers found nine problems (3 + 2 + 3 + 1).</li>
<li>Two AIs combined found 17 problems (6 + 1 + 4 + 3 + 2 + 1).</li>
<li>Only three problems were found by researchers and both AIs (the 3 in the middle of the circles).</li>
<li>ChatGPT matched five of the nine researcher-identified problems (3 + 2).</li>
<li>Gemini matched four (3 + 1) of the researcher-identified problems.</li>
<li>That leaves 11 problems the AIs flagged that no researcher identified (6 + 1 + 4).</li>
<li>Of those 11 problems, six were unique to ChatGPT, four were unique to Gemini, and one was identified by both AIs.</li>
</ul>
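<p>For readers who prefer arithmetic to circles, the same tallies can be sketched in Python. The region counts below are read from the bullet list above; the dictionary layout and the helper name <code>total</code> are our own illustration.</p>

```python
# Number of problems in each Venn region, keyed by who found them.
regions = {
    frozenset({"human"}): 3,
    frozenset({"human", "chatgpt"}): 2,
    frozenset({"human", "chatgpt", "gemini"}): 3,
    frozenset({"human", "gemini"}): 1,
    frozenset({"chatgpt"}): 6,
    frozenset({"chatgpt", "gemini"}): 1,
    frozenset({"gemini"}): 4,
}


def total(predicate) -> int:
    """Sum the counts of every region whose set of finders satisfies the predicate."""
    return sum(count for finders, count in regions.items() if predicate(finders))


human_total = total(lambda f: "human" in f)
chatgpt_matched = total(lambda f: f.issuperset({"human", "chatgpt"}))
gemini_matched = total(lambda f: f.issuperset({"human", "gemini"}))
ai_only = total(lambda f: "human" not in f)

print(human_total)      # 9
print(chatgpt_matched)  # 5
print(gemini_matched)   # 4
print(ai_only)          # 11
```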
<p>So, <strong>AIs generated 11 new problems</strong> not identified by any of the human researchers. Table 1 has details of those 11 problems, listed in chronological order using problem number codes from the previous article. Of the 11 problems no human flagged, one was a genuine find, seven were false alarms, and three were hallucinations. Here&#8217;s more detail about each category.</p>

<table id="tablepress-1047" class="tablepress tablepress-id-1047">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">Problem Description</th><th class="column-3">Source</th><th class="column-4">Classification</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>4b</strong></td><td class="column-2">Filters not helpful</td><td class="column-3">ChatGPT</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>5b</strong></td><td class="column-2">Participant used Ctrl-F to search for "sushi" when it wasn't in the 86-cuisine list</td><td class="column-3">Gemini</td><td class="column-4"><strong>Genuine find</strong></td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>6b</strong></td><td class="column-2">Search results for sushi included many non-sushi restaurants</td><td class="column-3">ChatGPT</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>6c</strong></td><td class="column-2">Weak presentation of cuisine information in search results</td><td class="column-3">ChatGPT</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>7b-Gem</strong></td><td class="column-2">Participant chose the highest price tier</td><td class="column-3">Gemini</td><td class="column-4"><strong><font color="red">Hallucination</font></strong></td>
</tr>
<tr class="row-7">
	<td class="column-1"><strong>7b-GPT</strong></td><td class="column-2">Sorting by highest rated surfaced many non-sushi restaurants</td><td class="column-3">ChatGPT</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-8">
	<td class="column-1"><strong>8b</strong></td><td class="column-2">UI pushes browsing without good decision support</td><td class="column-3">ChatGPT</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-9">
	<td class="column-1"><strong>9b</strong></td><td class="column-2">Seating options only presented after selecting reservation time</td><td class="column-3">Gemini</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-10">
	<td class="column-1"><strong>9c</strong></td><td class="column-2">Participant set time to 5:10 instead of 5:00</td><td class="column-3">Gemini</td><td class="column-4"><strong><font color="red">Hallucination</font></strong></td>
</tr>
<tr class="row-11">
	<td class="column-1"><strong>10a</strong></td><td class="column-2">Selected restaurant labeled "seafood" rather than "sushi" by OpenTable</td><td class="column-3">Both</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-12">
	<td class="column-1"><strong>10b</strong></td><td class="column-2">Task not completed—participant never reached the reservation form</td><td class="column-3">ChatGPT</td><td class="column-4"><strong><font color="red">Hallucination</font></strong></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1047 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1: </strong>The 11 AI-generated problems not identified by any human researcher, classified by type.</p>
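<p>The counts by classification and by source can be tallied directly from Table 1. The following is a minimal Python sketch; the tuples transcribe the table rows, and the variable names are our own.</p>

```python
from collections import Counter

# (problem number, source, classification), transcribed from Table 1.
ai_only_problems = [
    ("4b", "ChatGPT", "False alarm"),
    ("5b", "Gemini", "Genuine find"),
    ("6b", "ChatGPT", "False alarm"),
    ("6c", "ChatGPT", "False alarm"),
    ("7b-Gem", "Gemini", "Hallucination"),
    ("7b-GPT", "ChatGPT", "False alarm"),
    ("8b", "ChatGPT", "False alarm"),
    ("9b", "Gemini", "False alarm"),
    ("9c", "Gemini", "Hallucination"),
    ("10a", "Both", "False alarm"),
    ("10b", "ChatGPT", "Hallucination"),
]

by_class = Counter(label for _, _, label in ai_only_problems)
by_source = Counter(source for _, source, _ in ai_only_problems)
# Share of the 11 AI-only problems in each class, as a whole-number percentage.
share = {label: round(100 * n / len(ai_only_problems)) for label, n in by_class.items()}

print(dict(by_class))   # {'False alarm': 7, 'Genuine find': 1, 'Hallucination': 3}
print(dict(by_source))  # {'ChatGPT': 6, 'Gemini': 4, 'Both': 1}
print(share)            # {'False alarm': 64, 'Genuine find': 9, 'Hallucination': 27}
```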
<h3>Gemini’s Genuine Find</h3>
<p>Let’s start with the good news: All four Gemini runs identified that after the participant expanded the cuisine filter to show all 86 cuisines, she used Ctrl-F to search the page for “sushi” (5b-Gem)—an event not reported by any of the human evaluators. It happened quickly, so it’s possible that the search field was not in the visual focus of the humans who were likely examining the list of cuisines (Figure 2). We consider this a true usability problem because (1) this behavior was driven by poor filter design and (2) it was unsuccessful—the word “sushi” was not on the page even though the cuisine filter was fully expanded.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/052626-F2V2.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47654" src="https://measuringu.com/wp-content/uploads/2026/05/052626-F2.png" alt="Frame from video showing Ctrl-F search field with first few letters of “sushi” typed at the top right of the screen and 28 of the 86 cuisine types on the left. " width="825" height="648" srcset="https://measuringu.com/wp-content/uploads/2026/05/052626-F2.png 825w, https://measuringu.com/wp-content/uploads/2026/05/052626-F2-300x236.png 300w, https://measuringu.com/wp-content/uploads/2026/05/052626-F2-768x603.png 768w, https://measuringu.com/wp-content/uploads/2026/05/052626-F2-600x471.png 600w" sizes="auto, (max-width: 825px) 100vw, 825px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Frame from video showing Ctrl-F search field with first few letters of “sushi” typed at the top right of the screen and 28 of the 86 cuisine types on the left.</p>
<h3>Seven False Alarms</h3>
<p>Next, the not-so-good news. When a researcher identifies something that happened, but it’s not really considered a problem, it’s referred to as a false alarm. Sometimes things are literally a feature and not a bug! From our interpretation, AIs generated seven false alarms (not too different from what <a href="https://measuringu.com/false-positives/">you sometimes see</a> with a group of human evaluators).</p>
<h4><strong>The seafood/sushi labeling issue (6b, 6c, 7b-GPT, 8b, 10a)</strong></h4>
<p>Five of the seven false alarms (6b, 6c, 7b-GPT, 8b, 10a) were derived from ChatGPT taking the search for sushi restaurants too literally. After searching for sushi, many OpenTable results were labeled &#8220;seafood.&#8221; ChatGPT flagged this repeatedly across multiple runs in different ways (e.g., weak cuisine presentation, non-sushi results surfacing, poor decision support), but they all trace back to the same fundamental observation. ChatGPT treated as acceptable only those restaurants that OpenTable labeled as sushi restaurants, excluding restaurants that serve sushi but carry a different OpenTable label.</p>
<p>The restaurant the participant ultimately selected was labeled seafood, which led ChatGPT to declare task failure in three of four runs. The human reviewers took a more pragmatic view: the restaurant served sushi, so the participant successfully completed the task. Gemini flagged the same seafood/sushi labeling issue once (10a) but didn&#8217;t spiral into multiple variations of it.</p>
<h4><strong>Seating options not shown until after time selection (9b-Gem)</strong></h4>
<p>OpenTable withholds seating options until you pick a time. Given the range of possible seating configurations (inside, patio, bar, banquette, communal, high top, private, counter), showing them before a time is selected isn&#8217;t really feasible. And if a seating option doesn&#8217;t work out, the recovery path is low friction. Gemini flagged this as a problem. The human researchers recognized this as a design tradeoff rather than a usability problem.</p>
<h4><strong>Filters not helpful (4b-GPT)</strong></h4>
<p>We categorized this as a false alarm because it was overly vague. It’s true that there were issues with some filters (e.g., cuisine), but that was not true of all filters.</p>
<h3>Three Hallucinations</h3>
<p>In contrast to false alarms, which we consider misinterpretations of events that happened, a hallucination is when a problem is associated with something that just didn’t happen. We saw three of these.</p>
<h4><strong>AI claimed the participant incorrectly selected the highest price tier (7b-Gem)</strong></h4>
<p>From the narrative of the second Gemini run:</p>
<blockquote><p><em>The task required selecting a restaurant that was not the lowest or highest price point.</em></p>
<p><em>Problem: The participant chose Ocean Prime, which is a restaurant (the highest tier on the platform). At 05:13, the participant verbally identified this as &#8220;mid-range.&#8221;</em></p>
<p><em>User Impact: The participant technically failed this part of the task constraints.</em></p></blockquote>
<p><strong><em>This didn’t happen</em></strong>. Ocean Prime had a mid-range price designation.</p>
<h4><strong>AI claimed the participant set the reservation time for 5:10 pm (9c-Gem) </strong></h4>
<p>From the narrative of the second Gemini run:</p>
<blockquote><p><em>The participant selected Ocean Prime at 05:10</em></p></blockquote>
<p><strong><em>This didn’t happen</em></strong>. The participant, in accordance with the task instructions, selected 5:00 pm.</p>
<h4><strong>AI claimed the participant did not reach the reservation form (10b-ChatGPT)</strong></h4>
<p>From the narrative of the second ChatGPT run:</p>
<blockquote><p><em>By the end of the clip, they are still comparing list items and time slots; they do not appear to reach the restaurant detail/reservation form step.</em></p></blockquote>
<p><strong><em>This isn’t accurate</em></strong>. The clip ended with the participant selecting the reservation time, then standard dining room seating, then stopping before entering her personal information.</p>
<p>The good news is that there were only three hallucinations out of 11 AI-generated problems. The bad news is you can&#8217;t know which AI-generated problem descriptions were hallucinated without watching the video and reviewing all the problems yourself.</p>
<h2>Summary and Discussion</h2>
<p>In this article, we focused on qualitative similarities and differences in the usability problems listed by professional human UX researchers and two AIs (Gemini 3 Flash Thinking and ChatGPT-5.4 Thinking) after reviewing a video in which a participant made a restaurant reservation.</p>
<p>Our key findings were:</p>
<p><strong>False alarms and hallucinations dominate.</strong> Of the 11 problems the AIs generated that no human flagged, seven (64%) were false alarms, three (27%) were hallucinations, and one (9%) was a genuine find. That&#8217;s a useful number to keep in mind: roughly nine out of ten AI-only problems in this study required either correction or dismissal.</p>
<p><strong>AI adds value as a junior researcher, not a trusted expert.</strong> AI was able to find one problem (a participant had to use Ctrl-F) that was real and useful and not found by humans. But getting to it required reviewing ten other problems that ranged from technically true but irrelevant to simply fabricated. The ROI depends on how much that review costs you.</p>
<p><strong>Most false alarms came from a single fixation.</strong> Five of the seven traced back to ChatGPT interpreting &#8220;sushi restaurant&#8221; more literally than any human would. At least for this video and under our criteria for what constitutes a problem, this is a systematic bias worth knowing about if you&#8217;re using these models for task-based evaluations.</p>
<p><strong>Hallucinations were infrequent but consequential.</strong> Three of the problems (27%) were hallucinations. Although nominally low, this is probably too high for most applications. You can&#8217;t catch those without going back to the video, which means human review isn&#8217;t optional.</p>
<p><strong>Like human reviews, AI usability reviews of videos are prone to the “evaluator effect.” </strong>Multiple runs of an AI usability evaluation of the same video are not perfectly consistent, just as multiple human evaluators are not. Two of the three hallucinations came from the same Gemini run. Running the evaluation multiple times and looking for consistency across runs is a practical filter before any human review.</p>
<p><strong>Bottom line: AI usability reviews of videos require human oversight. </strong>In their current form (what we tested), these AI products can add value to this type of UX research, but more as junior researchers whose actions and conclusions require expert human oversight rather than as trusted experts themselves.</p>
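<p>The consistency check across runs mentioned above could be sketched as follows. This is a hypothetical Python illustration, not a tool we used: the function name <code>consistent_problems</code>, the threshold, and the run contents (P1 through P6) are invented placeholders, and the sketch assumes someone has already matched problem descriptions across runs and assigned them shared IDs.</p>

```python
from collections import Counter


def consistent_problems(runs, min_runs):
    """Return the problem IDs flagged in at least `min_runs` of the runs, sorted."""
    # set(run) ensures a problem repeated within one run is counted once for that run.
    counts = Counter(pid for run in runs for pid in set(run))
    return sorted(pid for pid, n in counts.items() if n >= min_runs)


# Invented placeholder data: four runs of one model on the same video.
runs = [
    ["P1", "P2", "P3"],
    ["P1", "P4", "P5"],
    ["P1", "P2"],
    ["P1", "P2", "P6"],
]

print(consistent_problems(runs, min_runs=3))  # ['P1', 'P2']
```

<p>Consistency is a filter, not proof of validity. ChatGPT flagged the seafood/sushi labeling issue repeatedly across runs, and we still classified it as a false alarm, so problems that survive this check still need human review.</p>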
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How Many Years Does It Take to Become a Senior UX Researcher?</title>
		<link>https://measuringu.com/how-many-years-does-it-take-to-become-a-senior-ux-researcher/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-many-years-does-it-take-to-become-a-senior-ux-researcher</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 19 May 2026 22:15:45 +0000</pubDate>
				<category><![CDATA[Methods]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[UX Maturity]]></category>
		<category><![CDATA[experience]]></category>
		<category><![CDATA[Salary Survey]]></category>
		<category><![CDATA[UX Salary Survey]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47607</guid>

					<description><![CDATA[What does it take to become a senior UX researcher? An advanced degree? Particular experience and skills, like the number of moderated studies conducted or a variety of methods employed? While all those play a role, the type of job (in-house small-team, in-house large-team, solo researcher, or agency) can affect what you are exposed to. [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47633 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-300x169.png" alt="Feature image showing an entry-level UX researcher becoming a senior UX researcher over several years" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>What does it take to become a senior UX researcher?</p>
<p>An advanced degree? Particular experience and skills, like the number of moderated studies conducted or a variety of methods employed?</p>
<p>While all those play a role, the type of job (in-house small-team, in-house large-team, solo researcher, or agency) can affect what you are exposed to.</p>
<p>Certainly, most would agree that one to two years of experience seems too little time to demonstrate senior-level performance in UX research. We thought that something around five years of experience was a good benchmark. But is that warranted? What is a good number of years of experience?</p>
<p>There is no official rule book on titles. As is often the case when making decisions about jobs, we can use a few approaches:</p>
<ol>
<li>Principle-based: Set a rule based on a principle that disregards what people do.</li>
<li>Tradition and trends: Look to broader workforce trends, what others report, or guidance online.</li>
<li>Data: See what’s happening in practice if you have access to data.</li>
</ol>
<h2>Principle Based</h2>
<p>Even though there isn’t an official designation, we can look broadly at how long it takes to master a skill or job like UX researcher. One guideline from popular culture (based on some research) and popularized by Malcolm Gladwell is the <a href="https://jamesclear.com/deliberate-practice-strategy">10,000-hour rule</a>. That is, after about 10k hours of practice, you master a skill. At roughly 2,000 working hours per year (40 hours a week for 50 weeks), that works out to about five years of full-time work. This is a very rough guideline and <a href="https://www.bbc.com/future/article/20121114-gladwells-10000-hour-rule-myth">definitely has its critics</a>.</p>
<h2>Tradition and Trends</h2>
<p>Seniority levels can <a href="https://www.indeed.com/career-advice/career-development/seniority-level">differ by industry type</a>. For example:</p>
<ul>
<li>The <a href="https://www.asce.org/career-growth/early-career-engineers/asce-guidelines-for-engineering-grades">American Society of Civil Engineers</a> has grades (from I to VIII) with <a href="https://www.asce.org/-/media/asce-images-and-files/career-and-growth/early-career-engineer/engineering-grades.pdf">detailed descriptions</a> of expected skills and responsibilities. For example, an engineer at Grade IV has at least four years of experience.</li>
<li><a href="https://hrsimple.com/law-firm-hierarchy-roles-and-career-paths/">Associates in a law firm</a> can be junior (1–3 years of experience), mid-level (4–6 years), or senior (7–10 years).</li>
<li>In <a href="https://strategycase.com/big-4-salaries/">large consulting firms</a>, senior associates typically have 2–5 years of experience.</li>
</ul>
<p>For the expected minimum number of years of experience for UX researchers, it makes sense to start with personal experience. In our decades of experience at large companies (IBM, Oracle, GE, Intuit, PeopleSoft), something like five years was a loose criterion. Below that, people would question the designation.</p>
<p>We carry a similar tradition at MeasuringU, and those with five years’ experience are considered senior. But at a tech-enabled agency, UX researchers typically conduct hundreds of moderated sessions and use a wide variety of methods such as unmoderated benchmarking, eye-tracking, in-depth interviews, diary studies, and surveys. A couple of years working here usually exposes a researcher to significantly more UX-related tasks than in a typical in-house role. At the same time, they are much less exposed to the very real job of navigating the politics of competing stakeholders and corporate hierarchies.</p>
<h2>Data: Salary Surveys, LinkedIn Profiles, and Job Posts</h2>
<p>Our preferred method is looking for data to guide decisions. We have three sources. The first is the biennial UXPA Salary Survey, which was last conducted in <a href="https://uxpa.org/salary-surveys/">2024</a>. The second is LinkedIn, which provides access to job titles and a crude way of determining years of experience. The third is requirements from recent job postings.</p>
<h3>UXPA Senior User Researcher Data</h3>
<p>The 2024 Salary Survey had 444 responses. Of those, 62% (276) described themselves as user researchers. Respondents could pick one of five employment levels. Table 1 shows that about half (130) of the user researchers classified themselves as “Senior-level, non-supervisory.”</p>

<table id="tablepress-1043" class="tablepress tablepress-id-1043">
<thead>
<tr class="row-1">
	<th class="column-1">Employment Level</th><th class="column-2">Number</th><th class="column-3"> %</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">Entry</td><td class="column-2"> 18</td><td class="column-3"> 7%</td>
</tr>
<tr class="row-3">
	<td class="column-1">Mid-level, non-supervisory</td><td class="column-2"> 73</td><td class="column-3">26%</td>
</tr>
<tr class="row-4">
	<td class="column-1">Mid-level, supervisory</td><td class="column-2"> 10</td><td class="column-3"> 4%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Senior-level, non-supervisory</td><td class="column-2">130</td><td class="column-3">47%</td>
</tr>
<tr class="row-6">
	<td class="column-1">Senior-level, supervisory</td><td class="column-2"> 45</td><td class="column-3">16%</td>
</tr>
<tr class="row-7">
	<td class="column-1">Total</td><td class="column-2">276</td><td class="column-3"></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1043 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Distribution of user researchers by employment level (2024 UXPA Salary Survey).</p>
<p>Respondents also selected their years of experience in response to the question “How long have you worked in this field (please round to the nearest year)” using the pre-determined buckets shown in Table 2.</p>

<table id="tablepress-1044" class="tablepress tablepress-id-1044">
<thead>
<tr class="row-1">
	<th class="column-1">Years of Experience</th><th class="column-2"># Senior</th><th class="column-3">% Senior</th><th class="column-4">% With More Experience</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1"> 0–2 yrs</td><td class="column-2"> 1</td><td class="column-3"> 1%</td><td class="column-4">99%</td>
</tr>
<tr class="row-3">
	<td class="column-1"> 3–4 yrs</td><td class="column-2">17</td><td class="column-3">13%</td><td class="column-4">86%</td>
</tr>
<tr class="row-4">
	<td class="column-1"> 5–7 yrs</td><td class="column-2">28</td><td class="column-3">22%</td><td class="column-4">65%</td>
</tr>
<tr class="row-5">
	<td class="column-1"> 8–10 yrs</td><td class="column-2">21</td><td class="column-3">16%</td><td class="column-4">48%</td>
</tr>
<tr class="row-6">
	<td class="column-1">11–15 yrs</td><td class="column-2">21</td><td class="column-3">16%</td><td class="column-4">32%</td>
</tr>
<tr class="row-7">
	<td class="column-1">16–20 yrs</td><td class="column-2">15</td><td class="column-3">12%</td><td class="column-4">21%</td>
</tr>
<tr class="row-8">
	<td class="column-1">21+ yrs</td><td class="column-2">27</td><td class="column-3">21%</td><td class="column-4"> 0%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1044 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2:</strong> Distribution of 130 non-supervisory senior-level user researchers by years of experience (2024 UXPA Salary Survey).</p>
<p>For example, only one person who reported being a senior user researcher had two years or fewer of experience. That means 99% had more than two years. The second row of the table shows that 17 had between three and four years of experience. Adding those to the one respondent with less experience gives 18 of the 130 respondents with fewer than five years, leaving 112. That means <strong>86% of non-supervisory senior user researchers reported 5 or more years of experience</strong>. Using the center of each experience bucket as a rough estimate, the average across the sample was 12–13 years. Of course, people may inflate their years of experience on an anonymous survey.</p>
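<p>The running tally behind the last column of Table 2 can be sketched in a few lines of Python. This is a minimal sketch that assumes the bucket counts shown in Table 2; the function name <code>pct_with_more_experience</code> is hypothetical and used only for illustration.</p>

```python
# Counts of non-supervisory senior user researchers per experience bucket,
# copied from Table 2. The function name below is hypothetical.
buckets = [
    ("0-2 yrs", 1),
    ("3-4 yrs", 17),
    ("5-7 yrs", 28),
    ("8-10 yrs", 21),
    ("11-15 yrs", 21),
    ("16-20 yrs", 15),
    ("21+ yrs", 27),
]


def pct_with_more_experience(rows):
    """Return (label, percent in bucket, percent in later buckets) per row."""
    total = sum(count for _, count in rows)
    running = 0
    out = []
    for label, count in rows:
        running += count  # respondents in this bucket or an earlier one
        pct_here = round(100 * count / total)
        pct_more = round(100 * (total - running) / total)
        out.append((label, pct_here, pct_more))
    return out


for label, pct_here, pct_more in pct_with_more_experience(buckets):
    print(f"{label:10s} {pct_here:3d}% in bucket, {pct_more:3d}% with more experience")
```

<p>The second row of the output corresponds to the 86% figure: everyone not in the first two buckets has five or more years of experience.</p>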
<p>We also looked at UX designers in the UXPA dataset and found a similar pattern. Of the 56 UX designers who self-identified as senior, 87% had at least five years of experience (Table 3).</p>

<table id="tablepress-1045" class="tablepress tablepress-id-1045">
<thead>
<tr class="row-1">
	<th class="column-1">Years of Experience</th><th class="column-2"># Senior</th><th class="column-3">% Senior</th><th class="column-4">% With More Experience</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1"> 0–2 yrs</td><td class="column-2"> 1</td><td class="column-3"> 2%</td><td class="column-4">98%</td>
</tr>
<tr class="row-3">
	<td class="column-1"> 3–4 yrs</td><td class="column-2"> 6</td><td class="column-3">11%</td><td class="column-4">87%</td>
</tr>
<tr class="row-4">
	<td class="column-1"> 5–7 yrs</td><td class="column-2">13</td><td class="column-3">23%</td><td class="column-4">64%</td>
</tr>
<tr class="row-5">
	<td class="column-1"> 8–10 yrs</td><td class="column-2">14</td><td class="column-3">25%</td><td class="column-4">39%</td>
</tr>
<tr class="row-6">
	<td class="column-1">11–15 yrs</td><td class="column-2"> 8</td><td class="column-3">14%</td><td class="column-4">25%</td>
</tr>
<tr class="row-7">
	<td class="column-1">16–20 yrs</td><td class="column-2"> 6</td><td class="column-3">11%</td><td class="column-4">14%</td>
</tr>
<tr class="row-8">
	<td class="column-1">21+ yrs</td><td class="column-2">8</td><td class="column-3">14%</td><td class="column-4">0%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1045 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 3:</strong> Distribution of 56 non-supervisory senior-level UX designers by years of experience (2024 UXPA Salary Survey).</p>
<h3>LinkedIn Profiles</h3>
<p>Another approach is to look at how many years of experience senior UX researchers on LinkedIn have in their job history. While job dates can always be padded a bit, it’s a lot harder to claim unearned experience on a public professional forum. We did an informal examination, searching for “Senior UX Researcher” and hand-counting the years of non-supervisory experience for the first 50 profiles.</p>
<p>Of the 50 profiles, the average years of experience was a bit over nine years (Table 4). The minimum was just shy of five years at 4.75. Only four profiles (8%) had less than five years of experience. In other words, this crude estimate suggests 92% of senior user researchers have at least five years of experience.</p>

<table id="tablepress-1046" class="tablepress tablepress-id-1046">
<tbody>
<tr class="row-1">
	<td class="column-1"><strong>Mean Years of Experience</strong></td><td class="column-2">9.1</td>
</tr>
<tr class="row-2">
	<td class="column-1"><strong>Min Years</strong></td><td class="column-2">4.75</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong># &lt; 5</strong></td><td class="column-2">4</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>% &lt; 5</strong></td><td class="column-2">8%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Total #</strong></td><td class="column-2">50</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>% ≥ 5</strong></td><td class="column-2">92%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1046 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 4:</strong> Analysis of 50 LinkedIn profiles of senior-level non-supervisory UX researchers.</p>
<h3>Job Posts</h3>
<p>Finally, we did another (very) informal search for senior UX researcher job postings (as of May 3, 2026) that were posted on Indeed. Of the five we found, all explicitly required five or more years of experience.</p>
<h2>Discussion and Summary</h2>
<p>There’s no official rule for what makes a UX researcher senior, but multiple approaches point to a consistent answer: at least five years.</p>
<ul>
<li><strong>Principle-based heuristics are consistent with five.</strong> Guidelines loosely based on research (like the 10,000-hour rule) suggest it takes about <strong>five years of focused experience</strong> to develop expertise. This is a weak rationale, but it&#8217;s a starting point.</li>
<li><strong>Tradition and trends suggest five.</strong> In our experience in the industry, it’s common to use <strong>five years as a minimum threshold</strong>. Other industries fall close to the five-year threshold as well.</li>
<li><strong>Salary survey data supports five.</strong> In the 2024 UXPA Salary Survey, <strong>86% of senior UX researchers reported five or more years of experience</strong>, with an average of around 12–13 years. Of the senior UX designers, an adjacent role in the UX industry, 87% reported five or more years of experience.</li>
<li><strong>Existing profiles and open jobs show five+ years.</strong> Our LinkedIn sample of 50 senior UX researchers showed similar results, with <strong>92% having at least five years of experience</strong> and an average of about nine years. Finally, a selection of five currently open senior UX researcher jobs on Indeed all explicitly require at least five years of experience.</li>
</ul>
<p>If you’re looking to set a threshold for becoming senior, five years seems like a good rule.</p>
<p>Of course, years alone don’t define seniority, but if someone has fewer than five years of experience, the <em>senior</em> title should be the exception, not the rule.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Interpret a Rating Scale Without Historical Data</title>
		<link>https://measuringu.com/how-to-interpret-a-rating-scale-without-historical-data/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-to-interpret-a-rating-scale-without-historical-data</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 12 May 2026 20:19:38 +0000</pubDate>
				<category><![CDATA[Benchmarking]]></category>
		<category><![CDATA[Rating Scale]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Rating Scales]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47556</guid>

					<description><![CDATA[UX researchers use a lot of rating scales. We recommend using standardized rating scales when possible. One of the benefits of some standardized scales, such as the SUS, SUPR-Q®, and UX-Lite®, is that you have a reference database of historical data. But there’s not always a standardized questionnaire for everything you’re hoping to measure, so [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47595 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-300x169.png" alt="Feature image showing a UX researcher interpreting a rating scale without historical data" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>UX researchers use a lot of rating scales. We recommend using standardized rating scales when possible. One of the benefits of some standardized scales, such as the SUS, SUPR-Q<sup>®</sup>, and UX-Lite<sup>®</sup>, is that you have a reference database of historical data.</p>
<p>But there’s not always a standardized questionnaire for everything you’re hoping to measure, so researchers need to create <a href="https://en.wikipedia.org/wiki/Ad_hoc">ad hoc</a> ones.</p>
<p>Data collected with ad hoc rating scales can be difficult to interpret, especially if you don’t have any historical data (e.g., from past product performance or competitors).</p>
<p>If you’re comparing multiple conditions (e.g., ratings on attributes for two or more websites), then you can check for significant differences in rating scale means.</p>
<p>But even clear differences in means don’t answer the question about whether a given mean indicates a poor or good user experience.</p>
<p>In this article, we provide a way to interpret five- and seven-point UX rating scales when you don’t have enough historical data for custom benchmarks. We use the well-known distribution of the System Usability Scale (<a href="https://measuringu.com/10-things-sus/">SUS</a>) as the basis for our recommendation.</p>
<h2>UX Rating Scales Tend to Be Negatively Skewed</h2>
<p>If you’ve never plotted your distributions of rating scale response options, you should. But don’t be surprised when you see a negatively skewed distribution (tail of data points to the left).</p>
<p>Most UX rating scales have this negative skew because (1) most item stems have a positive tone (e.g., “I felt very confident using this website”) and (2) respondents are <a href="https://dl.acm.org/doi/pdf/10.1145/175276.175282">generally more likely to agree</a> (selecting higher responses). This means that the middle value (e.g., a 3 on a five-point scale) isn’t a good measure of the “average.” This skew doesn’t make the responses necessarily bad or not useful. It just means you need to account for that skew when interpreting them.</p>
<p>For example, you can see the skew in distributions of SUS scores, for which 50 is the middle of the scale (Figure 1), but is not the middle of the distribution (68 is the median).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure1-1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47597 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/Figure1-1.png" alt="Figure 1: Distribution of 3,187 individual SUS scores (50 is the middle of the scale, but the median is 68)." width="1200" height="719" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure1-1.png 1200w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-300x180.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-1024x614.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-768x460.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-600x360.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Distribution of 3,187 individual SUS scores (50 is the middle of the scale, but the median is 68).</p>
<h2>Default Benchmarks Based on Historical SUS Distribution</h2>
<p>Taking advantage of the well-known distribution of the SUS, we created a curved grading scale that is <a href="https://www.researchgate.net/publication/324116412_The_System_Usability_Scale_Past_Present_and_Future">widely used in UX research</a> (Table 1). We’ll use this as a basis for interpreting ad hoc scales.</p>

<table id="tablepress-1040" class="tablepress tablepress-id-1040">
<thead>
<tr class="row-1">
	<th class="column-1">SUS Score Range</th><th class="column-2">Grade</th><th class="column-3">Percentile Range</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">84.1–100</td><td class="column-2">A+</td><td class="column-3">96–100</td>
</tr>
<tr class="row-3">
	<td class="column-1">80.8–84.0</td><td class="column-2">A</td><td class="column-3">90–95</td>
</tr>
<tr class="row-4">
	<td class="column-1">78.9–80.7</td><td class="column-2">A−</td><td class="column-3">85–89</td>
</tr>
<tr class="row-5">
	<td class="column-1">77.2–78.8</td><td class="column-2">B+</td><td class="column-3">80–84</td>
</tr>
<tr class="row-6">
	<td class="column-1">74.1–77.1</td><td class="column-2">B</td><td class="column-3">70–79</td>
</tr>
<tr class="row-7">
	<td class="column-1">72.6–74.0</td><td class="column-2">B−</td><td class="column-3">65–69</td>
</tr>
<tr class="row-8">
	<td class="column-1">71.1–72.5</td><td class="column-2">C+</td><td class="column-3">60–64</td>
</tr>
<tr class="row-9">
	<td class="column-1">65.0–71.0</td><td class="column-2">C</td><td class="column-3">41–59</td>
</tr>
<tr class="row-10">
	<td class="column-1">62.7–64.9</td><td class="column-2">C−</td><td class="column-3">35–40</td>
</tr>
<tr class="row-11">
	<td class="column-1">51.7–62.6</td><td class="column-2">D</td><td class="column-3">15–34</td>
</tr>
<tr class="row-12">
	<td class="column-1"> 0.0–51.6</td><td class="column-2">F</td><td class="column-3"> 0–14</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1040 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Curved grading scale for the SUS.</p>
<p>The 50<sup>th</sup> percentile of this scale is a SUS score of 68, a solid C. Another important benchmark commonly used in practice is an aspirational score of 80 (the upper end of an A−, a bit higher than the 85<sup>th</sup> percentile). Scores lower than 51.7 are in the F range (just below the 15<sup>th</sup> percentile).</p>
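<p>Looking up a grade from Table 1 can be sketched as a search over the lower bound of each band. This is a minimal sketch, not an official scoring tool: the function name <code>sus_grade</code> is hypothetical, grades are written with an ASCII hyphen, and a score that falls between two printed bands (e.g., 84.05) is assumed to belong to the lower band.</p>

```python
from bisect import bisect_right

# Lower bound of each grade band from Table 1, in ascending order.
LOWER_BOUNDS = [0.0, 51.7, 62.7, 65.0, 71.1, 72.6, 74.1, 77.2, 78.9, 80.8, 84.1]
GRADES = ["F", "D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]


def sus_grade(score):
    """Map a mean SUS score (0 to 100) to a letter on the curved grading scale.

    Hypothetical helper: scores between two printed bands are assigned
    to the lower band, which is an assumption rather than a published rule.
    """
    if not 0 <= score <= 100:
        raise ValueError("SUS scores range from 0 to 100")
    # Index of the last lower bound that does not exceed the score.
    return GRADES[bisect_right(LOWER_BOUNDS, score) - 1]


for score in (51.6, 68, 80):
    print(score, sus_grade(score))
```

<p>With these thresholds, 68 lands in the C band and 80 in the A− band, matching the benchmarks described above.</p>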
<p>Based on the SUS research, when we consult with clients who need a benchmark for five- or seven-point scales and there is no historical data, we usually recommend setting the benchmark for average at about 70% of the range of the scale, good at 80%, and poor at 50%, similar to the historical benchmarks for the SUS. For example, this is what we did when we created our <a href="https://measuringu.com/grading-scales-for-the-ux-lite/">standard grading scale for the UX-Lite</a>.</p>
<p>Table 2 shows those values for five- and seven-point scales (the midpoint for a five-point scale is 3, and for a seven-point scale is 4).</p>

<table id="tablepress-1042" class="tablepress tablepress-id-1042">
<thead>
<tr class="row-1">
	<th class="column-1">Location on Scale</th><th class="column-2">Interpretation</th><th class="column-3">Five-point</th><th class="column-4">Seven-point</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">80%</td><td class="column-2">Good</td><td class="column-3">4.2</td><td class="column-4">5.8</td>
</tr>
<tr class="row-3">
	<td class="column-1">70%</td><td class="column-2">Average</td><td class="column-3">3.8</td><td class="column-4">5.2</td>
</tr>
<tr class="row-4">
	<td class="column-1">60%</td><td class="column-2">Below Average</td><td class="column-3">3.4</td><td class="column-4">4.6</td>
</tr>
<tr class="row-5">
	<td class="column-1">50%</td><td class="column-2">Poor</td><td class="column-3">3.0</td><td class="column-4">4.0</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1042 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2:</strong> Initial benchmarks at 50%, 60%, 70%, and 80% of the range of five- and seven-point scales.</p>
<p>The formula for computing these values comes from the <a href="https://measuringu.com/converting-scales-to-100-points/">methods for interpolating rating scale scores</a> from scales that start at 1 to a 0–100-point scale. Rearranged algebraically, it computes the rating-scale benchmark from the target location (e.g., 80% of the range, entered as 80) and the maximum possible rating (typically 5 or 7 for scales that start at 1):</p>
<p>Benchmark = Target / (100 / (MaxRating − 1)) + 1</p>
<p>For example, to find 70% of the range of a five-point scale, the benchmark would be:</p>
<p>70 / (100 / (5 − 1)) + 1 = 70 / 25 + 1 = 3.8</p>
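<p>The benchmark formula translates directly into code. This is a minimal sketch that assumes a scale starting at 1; the function name <code>benchmark</code> and its parameter names are hypothetical.</p>

```python
def benchmark(target_pct, max_rating):
    """Rating at target_pct of the range of a scale running from 1 to max_rating.

    Hypothetical helper implementing
    Benchmark = Target / (100 / (MaxRating - 1)) + 1.
    """
    return target_pct / (100 / (max_rating - 1)) + 1


# Print the benchmarks for five- and seven-point scales, rounded to one decimal.
for max_rating in (5, 7):
    values = [round(benchmark(t, max_rating), 1) for t in (80, 70, 60, 50)]
    print(max_rating, values)
```

<p>Rounded to one decimal place, the results should line up with the five- and seven-point columns of Table 2.</p>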
<p>An <a href="https://measuringu.com/types-of-100-point-scales/">alternative approach</a> is to convert five- or seven-point ratings to a 0–100-point scale. John Brooke, the developer of the SUS, <a href="https://uxpajournal.org/sus-a-retrospective/">described the value of this approach</a>: “Project managers, product managers, and engineers were more likely to understand a scale that went from 0 to 100 than one that went from 10 to 50, and the important thing was to be able to grab their attention in the short space of time they were likely to spend thinking about usability, without having to go into a detailed explanation.”</p>
<p>The general formula for converting a five- or seven-point scale to 0–100 points is:</p>
<p>Rating100 = (Rating − 1) * 100 / (MaxRating − 1)</p>
<p>For example, a five-point mean rating of 4.2 would become 80:</p>
<p>(4.2 − 1) * (100 / (5 − 1)) = 3.2(25) = 80</p>
<p>A seven-point mean rating of 4.0 would become 50:</p>
<p>(4 − 1) * (100 / (7 − 1)) = 3(16.67) ≈ 50</p>
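<p>The conversion to a 0–100-point scale can be sketched the same way. This is a minimal sketch that assumes ratings start at 1; the function name <code>to_100_point</code> is hypothetical.</p>

```python
def to_100_point(rating, max_rating):
    """Convert a rating on a 1-to-max_rating scale to a 0-100 scale.

    Hypothetical helper implementing
    Rating100 = (Rating - 1) * 100 / (MaxRating - 1).
    """
    return (rating - 1) * 100 / (max_rating - 1)


# The two worked examples from the text, rounded to absorb float error.
print(round(to_100_point(4.2, 5), 1))
print(round(to_100_point(4.0, 7), 1))
```

<p>Rounding before display is a design choice here: binary floating point cannot represent values like 4.2 exactly, so the unrounded result can differ from 80 in the last decimal places.</p>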
<p><strong><em>Caveat</em></strong><em><strong>:</strong> Note that these are initial benchmarks to use when UX researchers lack a more grounded rationale for interpreting mean rating scale scores. After a reasonable amount of data collection with the scale, it’s a good idea to revisit the initial benchmarks to see whether they should be adjusted.</em></p>
<h2>Summary</h2>
<p>When you&#8217;re working with an ad hoc rating scale and have no historical data to lean on, the SUS distribution gives you a principled starting point. Because UX rating scales share a consistent negative skew (driven by positive item wording and respondent agreement bias), benchmarks derived from the SUS translate reasonably well to other five- and seven-point scales. It’s not that there’s something magic about the SUS. It works well because it’s a composite of ten five-point UX rating scales that share the tendency of other UX rating scales to be negatively skewed (more favorable than unfavorable). This means that benchmarks informed by the SUS provide a good initial approximation for other UX rating scales.</p>
<p>This pattern supports two anchor points for the benchmarks:</p>
<ul>
<li>Setting “Poor” at or below the midpoint of the scale (50% of the range), because means of positive-tone UX rating scales are consistently higher than the scale midpoint.</li>
<li>Setting “Good” at or above 80% of the scale range, which is the <a href="https://www.uslanguageservices.com/guides-resources/understanding-the-u-s-grading-system/">traditional score for a B</a> (above average).</li>
</ul>
<p>Placing other cut points between 50% and 80% leads to these initial benchmarks:</p>
<ul>
<li><strong>Good</strong>: Located at <strong>80%</strong> of the range of the scale</li>
<li><strong>Average</strong>: Located at <strong>70%</strong> of the range of the scale</li>
<li><strong>Below average</strong>: Located at <strong>60%</strong> of the range of the scale</li>
<li><strong>Poor</strong>: Located at <strong>50%</strong> of the range of the scale (the midpoint)</li>
</ul>
<p>It’s important to keep in mind that these are reasonable best guesses without a strong normative database. For UX rating scale items that will be used frequently over time, researchers should plan to build normative databases and use them to tune the benchmarks (like we have <a href="https://measuringu.com/evolution-of-seq/">done with the SEQ<sup>®</sup></a>).</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Can AI Detect Usability Problems Like Researchers?</title>
		<link>https://measuringu.com/ai-vs-human-usability-problem-analysis-of-a-video/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ai-vs-human-usability-problem-analysis-of-a-video</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD • Jeff Sauro, PhD • Will Schiavone, PhD • Lucas Plabst, PhD]]></dc:creator>
		<pubDate>Wed, 06 May 2026 04:25:45 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[Gemini]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47502</guid>

					<description><![CDATA[AI can “watch” videos. It can even generate a list of problems. In some cases, these problem lists seem to be reasonably consistent (reliable). But consistency is not accuracy. Are these real problems or just sophisticated AI slop generated consistently by autocorrect for video? How can we know? One way to find out is to [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Feature050526.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47551 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/Feature050526-300x169.png" alt="Feature image showing a count of problems found by AI versus human researchers" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/Feature050526-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526.png 1280w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>AI can <a href="https://measuringu.com/can-ai-detect-usability-problems">“watch” videos</a>.</p>
<p>It can even generate a list of problems. In some cases, these problem lists seem to be <a href="https://measuringu.com/ai-usability-problem-analysis-of-a-video">reasonably consistent (reliable)</a>.</p>
<p>But consistency is not accuracy. Are these real problems or just sophisticated AI slop generated consistently by autocorrect for video?</p>
<p>How can we know? One way to find out is to compare the AI problem lists to those created by trained UX researchers.</p>
<p><strong>Are the problems an AI finds the same problems a UX researcher would find?</strong></p>
<p>In this article, we move from reliability to <strong>validity</strong> by comparing the problems identified by AI to those found by human UX researchers reviewing the same video.</p>
<h2>Humans vs. AI: Same Video, Same Task</h2>
<p>For this study, four UX researchers at MeasuringU independently reviewed a roughly six-minute usability test video and created lists of observed usability problems. The primary evaluator had over 40 years of experience coding usability problems, while the other three, at the time of the study, each had less than a year of experience. The video they watched was the same one reviewed by two LLMs in our previous assessment of AI reliability (ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking, four runs per LLM to assess reliability). The participant’s task was to use OpenTable.com to book a reservation:</p>
<blockquote><p>“Make a reservation for four people at a sushi restaurant in Denver, CO tomorrow anytime after 5:00pm. Make sure the restaurant you select is not at the lowest or highest price point. Of the restaurants that fit these criteria, look at their overall rating, customer reviews, and photos to select the one that is the most appealing to you. Go as far as you can in the reservation process until you are asked for your personal information or account details. DO NOT fully confirm the reservation. Write down the restaurant name and the time of the reservation. You will be asked about this information after the task.”</p></blockquote>
<p>The directions for the human evaluators matched the prompt given to the LLMs:</p>
<blockquote><p>“During a usability test, the facilitator must keep track of participant behaviors as they navigate through tasks on a website, mobile app, software program, etc. We’d like you to watch a video of a usability test where participants were asked to book a table at a Sushi restaurant. As you&#8217;re watching, please look for problems the participant has while attempting to complete the task. For example, you can document the path users take, describe issues they encounter as well as what on the website might be causing problems.”</p></blockquote>
<h3>Coding and Matching Problems</h3>
<p>The four researchers independently created their problem lists. The senior researcher (Evaluator 1) then reviewed, matched, and consolidated the problems as shown in Table 1.</p>

<table id="tablepress-1034" class="tablepress tablepress-id-1034">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">Human Evaluators Problem List</th><th class="column-3">Eval 1</th><th class="column-4">Eval 2</th><th class="column-5">Eval 3</th><th class="column-6">Eval 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-3">
	<td class="column-1">2</td><td class="column-2">Odd display of email addresses upon click in search field</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-4">
	<td class="column-1">3</td><td class="column-2">Avoided search field and looked in filters to try to change location</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-5">
	<td class="column-1">4</td><td class="column-2">Entering Denver in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-6">
	<td class="column-1">5</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-7">
	<td class="column-1">6</td><td class="column-2">Surprised when typing sushi into search field did not lose current location</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-8">
	<td class="column-1">7</td><td class="column-2">Participant seemed to miss price point filter; sorted on ratings and examined price points in descriptions</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-9">
	<td class="column-1">8</td><td class="column-2">Participant wanted to change sort to lowest rating first but not an option</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-10">
	<td class="column-1">9</td><td class="column-2">Despite having selected 5pm at start of process user needed to reselect it later</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-11">
	<td class="column-1"><i>Total</td><td class="column-2"></td><td class="column-3"><i><center>6</td><td class="column-4"><i><center>5</td><td class="column-5"><i><center>8</td><td class="column-6"><i><center>7</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Human evaluators&#8217; problem list.</p>
<p>Nine total problems were identified, none of which was classified as a false alarm by Evaluator 1. Four problems (3, 4, 5, and 6) were identified by all four UX researchers. Two problems (1, 8) were identified by three evaluators, one problem (7) by two evaluators, and two problems (2, 9) by one evaluator.</p>
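<p>The tally above can be reproduced from Table 1. The following is a minimal sketch, not the authors&#8217; analysis code; the dictionary layout and the function name are assumptions made for illustration, and the problem numbers are transcribed from the table.</p>

```python
from collections import defaultdict

# Problem numbers reported by each human evaluator, transcribed from Table 1.
human_lists = {
    "Eval 1": {1, 3, 4, 5, 6, 9},
    "Eval 2": {3, 4, 5, 6, 8},
    "Eval 3": {1, 2, 3, 4, 5, 6, 7, 8},
    "Eval 4": {1, 3, 4, 5, 6, 7, 8},
}


def problems_by_detection_count(lists):
    """Group problem numbers by how many evaluators reported each one."""
    counts = defaultdict(int)
    for problems in lists.values():
        for problem in problems:
            counts[problem] += 1
    grouped = defaultdict(list)
    for problem, n in counts.items():
        grouped[n].append(problem)
    return {n: sorted(found) for n, found in grouped.items()}


grouped = problems_by_detection_count(human_lists)
for n in sorted(grouped, reverse=True):
    print(f"Found by {n} evaluator(s): {grouped[n]}")
```

<p>Keeping each evaluator&#8217;s list as a set makes the later agreement calculations simple intersections and unions.</p>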
<h3><span lang="EN-US">High Reliability for Humans</span></h3>
<p>With the consolidated problem list, we computed the <a href="https://measuringu.com/assessing-interrater-reliability-in-ux-research/">any-2 agreement</a> across all pairs of evaluators as shown in Table 2. For comparing problem lists, any-2 agreement captures interrater reliability better than kappa.</p>

<table id="tablepress-1035" class="tablepress tablepress-id-1035">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Eval 1</th><th class="column-3">Eval 2</th><th class="column-4">Eval 3</th><th class="column-5">Eval 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Eval 1</td><td class="column-2"> x</td><td class="column-3">57%</td><td class="column-4">56%</td><td class="column-5">63%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Eval 2</td><td class="column-2">57%</td><td class="column-3"> x</td><td class="column-4">63%</td><td class="column-5">71%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Eval 3</td><td class="column-2">56%</td><td class="column-3">63%</td><td class="column-4"> x</td><td class="column-5">88%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Eval 4</td><td class="column-2">63%</td><td class="column-3">71%</td><td class="column-4">88%</td><td class="column-5"> x</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2: </strong>Any-2 agreement for the human evaluators.</p>
<p>The average any-2 agreement across all pairs was 66%. Based on our data, the general rule of thumb for interpreting any-2 agreement is that 50% is typical, 25% is low, and 75% is high.</p>
<p>That means the reliability of the human evaluators was <strong>relatively high</strong>, likely because some of the usability problems in the list were quite salient (4/9 identified by all four evaluators, 6/9 identified by at least three evaluators).</p>
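<p>For readers who want to check the pairwise values in Table 2, here is a minimal sketch that treats each evaluator&#8217;s list as a set and computes any-2 agreement as intersection over union. The function names (<code>any2</code>, <code>pct</code>) are illustrative assumptions, and rounding halves up is an assumption about how the table percentages were rounded.</p>

```python
from itertools import combinations

# Problem numbers reported by each human evaluator, transcribed from Table 1.
human_lists = {
    "Eval 1": {1, 3, 4, 5, 6, 9},
    "Eval 2": {3, 4, 5, 6, 8},
    "Eval 3": {1, 2, 3, 4, 5, 6, 7, 8},
    "Eval 4": {1, 3, 4, 5, 6, 7, 8},
}


def any2(a, b):
    """Any-2 agreement: problems in both lists divided by problems in either list."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pct(x):
    """Whole-number percentage, rounding halves up (assumed rounding rule)."""
    return int(x * 100 + 0.5)


# One agreement value per unordered pair of evaluators.
pairwise = {
    (x, y): any2(human_lists[x], human_lists[y])
    for x, y in combinations(human_lists, 2)
}
mean_any2 = sum(pairwise.values()) / len(pairwise)

for (x, y), value in pairwise.items():
    print(f"{x} vs {y}: {pct(value)}%")
print(f"Mean any-2 agreement: {pct(mean_any2)}%")
```

<p>The mean is taken over the six unrounded pairwise values, so it can differ slightly from an average of the rounded table cells.</p>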
<p>In our <a href="https://measuringu.com/ai-usability-problem-analysis-of-a-video">previous study</a> of AI analysis, the reliability of ChatGPT was relatively low (31%) while Gemini was above average (57%).</p>
<h2><span lang="EN-US">Agreement Was Low Between AIs and Humans</span></h2>
<p>We created consolidated problem lists for ChatGPT and Gemini by combining results across four runs and matching them to the human-identified problems. Problems labeled “ChatGPT” or “Gem” were reported only by AI; problems without labels were also found by humans.</p>
<h3><span lang="EN-US">ChatGPT Validity</span></h3>
<p>Table 3 shows the combined problem list for the four runs of ChatGPT. It included five problems from the human list and seven unique problems. Table 4 shows the human by ChatGPT any-2 agreement.</p>

<table id="tablepress-1036" class="tablepress tablepress-id-1036">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">ChatGPT Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-3">
	<td class="column-1">4</td><td class="column-2">Entering Denver in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-4">
	<td class="column-1">4b-ChatGPT</td><td class="column-2">Filters not helpful</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-5">
	<td class="column-1">5</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-6">
	<td class="column-1">6</td><td class="column-2">Surprised when typing sushi into search field did not lose current location</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">6b-ChatGPT</td><td class="column-2">Search results for sushi included many non-sushi restaurants</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</td>
</tr>
<tr class="row-8">
	<td class="column-1">6c-ChatGPT</td><td class="column-2">Weak presentation of cuisine information in search results</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-9">
	<td class="column-1">7</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">7b-ChatGPT</td><td class="column-2">Sorting by highest rated put many non-sushi restaurants at the top of the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</td>
</tr>
<tr class="row-11">
	<td class="column-1">8b-ChatGPT</td><td class="column-2">UI pushes browsing without good decision support</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-12">
	<td class="column-1">10a-ChatGPT</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-13">
	<td class="column-1">10b-ChatGPT</td><td class="column-2">Task not completed because did not reach reservation form</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-14">
	<td class="column-1"><i>Total</td><td class="column-2"></td><td class="column-3"><i><center>7</td><td class="column-4"><i><center>6</td><td class="column-5"><i><center>6</td><td class="column-6"><i><center>4</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 3: </strong>ChatGPT evaluations problem list (problems tagged with ChatGPT were not reported by humans).</p>

<table id="tablepress-1037" class="tablepress tablepress-id-1037">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">ChatGPT 1</th><th class="column-3">ChatGPT 2</th><th class="column-4">ChatGPT 3</th><th class="column-5">ChatGPT 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Eval 1</td><td class="column-2">18%</td><td class="column-3">33%</td><td class="column-4">33%</td><td class="column-5">0%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Eval 2</td><td class="column-2"> 9%</td><td class="column-3">22%</td><td class="column-4">22%</td><td class="column-5">0%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Eval 3</td><td class="column-2">25%</td><td class="column-3">27%</td><td class="column-4">27%</td><td class="column-5">0%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Eval 4</td><td class="column-2">27%</td><td class="column-3">30%</td><td class="column-4">30%</td><td class="column-5">0%</td>
</tr>
<tr class="row-6">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tbody>
<tfoot>
<tr class="row-7">
	<th class="column-1">Mean:</th><th class="column-2">19%</th><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tfoot>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 4: </strong>Any-2 agreement for human with ChatGPT evaluations.</p>
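<p>The human-by-ChatGPT cells in Table 4 can be reproduced by pairing every human list from Table 1 with every ChatGPT run from Table 3. This is a sketch under stated assumptions: problem IDs are stored as strings so the AI-only labels (such as &quot;6b&quot;) fit alongside the human problem numbers, and the helper names are hypothetical.</p>

```python
# Human problem lists from Table 1 (IDs as strings).
human_lists = {
    "Eval 1": {"1", "3", "4", "5", "6", "9"},
    "Eval 2": {"3", "4", "5", "6", "8"},
    "Eval 3": {"1", "2", "3", "4", "5", "6", "7", "8"},
    "Eval 4": {"1", "3", "4", "5", "6", "7", "8"},
}

# ChatGPT runs from Table 3; IDs with a letter suffix are the AI-only problems.
chatgpt_runs = {
    "ChatGPT 1": {"1", "4", "6b", "6c", "7", "8b", "10a"},
    "ChatGPT 2": {"1", "4", "4b", "6", "8b", "10b"},
    "ChatGPT 3": {"1", "4", "5", "6c", "8b", "10a"},
    "ChatGPT 4": {"6b", "6c", "7b", "10a"},
}


def any2(a, b):
    """Any-2 agreement: problems in both lists divided by problems in either list."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pct(x):
    """Whole-number percentage, rounding halves up (assumed rounding rule)."""
    return int(x * 100 + 0.5)


# One cell per (human evaluator, ChatGPT run) pair, as in Table 4.
cells = {
    (human, run): any2(human_lists[human], chatgpt_runs[run])
    for human in human_lists
    for run in chatgpt_runs
}
mean_cross = sum(cells.values()) / len(cells)

for human in human_lists:
    row = "  ".join(f"{pct(cells[(human, run)])}%" for run in chatgpt_runs)
    print(f"{human}: {row}")
print(f"Mean: {pct(mean_cross)}%")
```

<p>Because AI-only problems stay in each run&#8217;s set, they enlarge the union and pull agreement down; Run 4 shares no problems with any human list, which is why its column is 0%.</p>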
<h3><span lang="EN-US">Gemini Validity</span></h3>
<p>Table 5 shows the combined problem list for the four runs of Gemini. It included four problems from the human list and five unique problems. Table 6 shows the human by Gemini any-2 agreement.</p>

<table id="tablepress-1038" class="tablepress tablepress-id-1038">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">Gemini Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">4</td><td class="column-2">Entering Denver in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-3">
	<td class="column-1">5</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-4">
	<td class="column-1">5b-Gem</td><td class="column-2">Participant used Ctrl-F to search page for "sushi"—not found</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-5">
	<td class="column-1">7</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-6">
	<td class="column-1">7b-Gem</td><td class="column-2">Participant chose highest price tier</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">8</td><td class="column-2">Participant wanted to change sort to lowest rating first but not an option</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-8">
	<td class="column-1">9b-Gem</td><td class="column-2">Seating options only presented after selecting time</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-9">
	<td class="column-1">9c-Gem</td><td class="column-2">Set time to 5:10</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">10a-Gem</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-11">
	<td class="column-1"><i>Total</td><td class="column-2"></td><td class="column-3"><i><center>5</td><td class="column-4"><i><center>6</td><td class="column-5"><i><center>5</td><td class="column-6"><i><center>4</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 5: </strong>Gemini evaluations problem list (problems tagged with Gem were not reported by humans).</p>

<table id="tablepress-1039" class="tablepress tablepress-id-1039">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Gem 1</th><th class="column-3">Gem 2</th><th class="column-4">Gem 3</th><th class="column-5">Gem 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">Eval 1</td><td class="column-2">29%</td><td class="column-3">29%</td><td class="column-4">25%</td><td class="column-5">29%</td>
</tr>
<tr class="row-3">
	<td class="column-1">Eval 2</td><td class="column-2">33%</td><td class="column-3">33%</td><td class="column-4">50%</td><td class="column-5">33%</td>
</tr>
<tr class="row-4">
	<td class="column-1">Eval 3</td><td class="column-2">38%</td><td class="column-3">22%</td><td class="column-4">50%</td><td class="column-5">38%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Eval 4</td><td class="column-2">43%</td><td class="column-3">25%</td><td class="column-4">57%</td><td class="column-5">43%</td>
</tr>
<tr class="row-6">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tbody>
<tfoot>
<tr class="row-7">
	<th class="column-1">Mean:</th><th class="column-2">36%</th><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tfoot>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 6: </strong>Any-2 agreement for human with Gemini evaluations.</p>
<p>We found the <strong>poorest agreement between human and ChatGPT evaluations</strong> (19%). Agreement between human and Gemini evaluations (36%) was substantially higher but still relatively low.</p>
<p>These agreement rates account for problems AI found that humans didn’t. We treated AI-discovered problems as if they were real (for now), but they could have been false positives (an error humans make, too). What is a “real” problem? That’s been a long-standing research question. For now, we’re relying on the senior researcher to determine the real problems. That human-verified problem list is how we’ll evaluate the AIs.</p>
<h3><span lang="EN-US">Did AI Find the Same Problems as Researchers?</span></h3>
<p>We can use the human-generated and verified problem list as the “gold standard” and assess AI’s “hit rate” as another measure of validity beyond any-2 agreement. The four human evaluators identified nine usability problems. <strong>ChatGPT identified five, and Gemini identified four.</strong> Figure 1 shows the problems identified by human evaluators and how well both AI models identified them. We consolidated the runs, counting a problem if it was found in at least one of the four runs.</p>
<p>Four of the problems were found by all four researchers, suggesting they were more salient problems. ChatGPT uncovered three of these four, and Gemini uncovered two.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47541" src="https://measuringu.com/wp-content/uploads/2026/05/Figure1-300x83.png" alt="How well AI models found usability problems identified by researchers. " width="1200" height="331" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure1-300x83.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1024x282.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-768x212.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1536x423.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-600x165.png 600w, https://measuringu.com/wp-content/uploads/2026/05/Figure1.png 1971w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1: </strong>How well AI models found usability problems identified by researchers.</p>
<p>Comparing AI to a pooled set of problems from four researchers may not be fair, so we should also consider how well AI does against each individual researcher. Figure 1 shows that, for example, ChatGPT identified four of the six problems found by the senior evaluator, while Gemini uncovered two of the six. Across the four evaluators, ChatGPT identified between <strong>60% and 71%</strong> of each evaluator’s usability problems, and Gemini identified between <strong>33% and 60%</strong> (see right side of Figure 1).</p>
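<p>The per-evaluator hit rates can be checked with a short sketch. It assumes the hit rate is the share of an evaluator&#8217;s problems that an AI reported in at least one run; the consolidated AI sets are read from the unlabeled rows of Tables 3 and 5, and the names used here are illustrative.</p>

```python
# Problem numbers reported by each human evaluator, transcribed from Table 1.
human_lists = {
    "Eval 1": {1, 3, 4, 5, 6, 9},
    "Eval 2": {3, 4, 5, 6, 8},
    "Eval 3": {1, 2, 3, 4, 5, 6, 7, 8},
    "Eval 4": {1, 3, 4, 5, 6, 7, 8},
}

# Human-list problems each AI reported in at least one of its four runs.
ai_found = {
    "ChatGPT": {1, 4, 5, 6, 7},
    "Gemini": {4, 5, 7, 8},
}


def hit_rate(found, reference):
    """Share of the reference problems that also appear in the found set."""
    return len(found & reference) / len(reference)


def pct(x):
    """Whole-number percentage, rounding halves up (assumed rounding rule)."""
    return int(x * 100 + 0.5)


# Pooled list: every problem reported by at least one human evaluator.
pooled = set().union(*human_lists.values())

rates = {
    model: {name: hit_rate(found, problems) for name, problems in human_lists.items()}
    for model, found in ai_found.items()
}

for model, per_eval in rates.items():
    low, high = min(per_eval.values()), max(per_eval.values())
    print(f"{model}: pooled {pct(hit_rate(ai_found[model], pooled))}%, "
          f"per evaluator {pct(low)}% to {pct(high)}%")
```

<p>Unlike any-2 agreement, this measure ignores AI-only problems, so it describes coverage of the human lists rather than overall agreement.</p>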
<p>Figure 2 is a Venn diagram that shows the overlap in problems found between both AIs and between AIs and humans. AIs generated eleven problems not identified by any of the four researchers, and there were three problems identified by humans only. ChatGPT came up with seven new ones and Gemini five (they agreed on one of the problems).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47542" src="https://measuringu.com/wp-content/uploads/2026/05/Figure2-1024x901.png" alt="Venn diagram of usability problem discovery by humans, ChatGPT, and Gemini. " width="600" height="528" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure2-1024x901.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-300x264.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-768x676.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-1536x1351.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-2048x1802.png 2048w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-600x528.png 600w" sizes="auto, (max-width: 600px) 100vw, 600px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2: </strong>Venn diagram of usability problem discovery by humans, ChatGPT, and Gemini.</p>
<p>AIs generated more new problems (eleven) than the total list generated by four humans (nine). It’s not clear whether these additional AI-identified problems represent true usability issues that humans missed or are false positives/hallucinations. We’ll dig into the qualitative difference on those problems in an upcoming article. What is clear is that all these additional problems likely require a human’s time to review them.</p>
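<p>The counts behind the Venn diagram follow from set operations on the three consolidated lists. The sketch below makes one labeled assumption: the problem tagged 10a in both Table 3 and Table 5 (“Selected result labeled seafood instead of sushi”) is treated as the single problem the two AIs shared, so it carries the same ID in both sets. Other AI-only IDs keep their model suffix because, for example, 7b means different things in the two tables.</p>

```python
# Consolidated human list: problems 1 through 9 from Table 1.
humans = {str(n) for n in range(1, 10)}

# Consolidated ChatGPT list from Table 3 (any of the four runs).
chatgpt = {
    "1", "4", "5", "6", "7",
    "4b-ChatGPT", "6b-ChatGPT", "6c-ChatGPT", "7b-ChatGPT",
    "8b-ChatGPT", "10b-ChatGPT",
    "10a",  # assumption: same problem as Gemini's 10a (identical description)
}

# Consolidated Gemini list from Table 5 (any of the four runs).
gemini = {
    "4", "5", "7", "8",
    "5b-Gem", "7b-Gem", "9b-Gem", "9c-Gem",
    "10a",  # assumption: same problem as ChatGPT's 10a
}

ai_any = chatgpt | gemini

ai_only = ai_any - humans                   # reported by AI, not by any human
human_only = humans - ai_any                # reported by humans, missed by both AIs
shared_ai_only = (chatgpt & gemini) - humans
found_by_all = humans & chatgpt & gemini

print(f"AI-only problems: {len(ai_only)}")
print(f"  ChatGPT new: {len(chatgpt - humans)}, Gemini new: {len(gemini - humans)}, "
      f"shared: {len(shared_ai_only)}")
print(f"Human-only problems: {sorted(human_only)}")
print(f"Found by humans and both AIs: {sorted(found_by_all)}")
```

<p>The eleven AI-only problems are the seven from ChatGPT plus the five from Gemini, minus the one they share.</p>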
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>Building on our previous research into the reliability of AI usability problem discovery, we investigated the validity of AI evaluations by seeing whether AI and human evaluators agree on <em>which</em> problems they find. Using the same video, task, and prompt, four UX researchers and two LLMs (ChatGPT and Gemini, four runs each) independently produced problem lists. Our key findings:</p>
<p><strong>Humans had higher within-group reliability than the LLMs.</strong> Any-2 agreement among human evaluators was 66%, well above the 31% we previously reported for ChatGPT and somewhat above Gemini&#8217;s 57%.</p>
<p><strong>Agreement between humans and AI was low.</strong> The human-ChatGPT any-2 agreement was just 19%, the lowest we observed. Human-Gemini agreement was better at 36%, but still below the typical human baseline of 47%. Low agreement means AI and humans often flag different problems when watching the same video with known usability issues.</p>
<p><strong>AI identifies roughly half the problems humans find.</strong> ChatGPT identified five of the nine human-verified problems, and Gemini identified four. Of the four problems that were identified by all human evaluators, ChatGPT identified three and Gemini identified two. The nine problems were a vetted compilation from all four human evaluators. When we limited the comparison to individual evaluators, ChatGPT matched 60–71% of each researcher&#8217;s list, and Gemini matched 33–60%. The AIs didn’t find all the problems reported by humans, but depending on the evaluator(s), they can find more than half of them.</p>
<p><strong>AI generates more new problems than humans do.</strong> The two AIs together produced eleven problems that no human identified (at least for this one video), which is more than the entire human problem list of nine. ChatGPT contributed seven unique problems and Gemini five, with one shared between them. It&#8217;s not yet clear whether these represent real usability issues that trained researchers missed or are false positives (we’ll explore these possibilities in an upcoming article).</p>
<p><strong>AI-only problems create a new validation burden.</strong> Someone has to determine which AI-generated problems are real, and that means a human reviewing each one. If AI is being used to save time, the volume of unverified AI-generated problems may offset much of those savings. Whether the tradeoff is worth it likely depends on how many of those problems turn out to be real, again something we&#8217;ll examine in a follow-up.</p>
<p>In our next article, we shift from this quantitative comparison to a qualitative examination: using the video as ground truth, we ask whether the AI-only problems reflect events that actually happened or whether the LLMs hallucinated issues that never occurred.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How Reliable Is AI at Finding UI Problems?</title>
		<link>https://measuringu.com/ai-usability-problem-analysis-of-a-video/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ai-usability-problem-analysis-of-a-video</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD • Jeff Sauro, PhD • Will Schiavone, PhD • Lucas Plabst, PhD]]></dc:creator>
		<pubDate>Tue, 28 Apr 2026 22:04:59 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[Gemini]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47402</guid>

					<description><![CDATA[It looks like AI can “watch” videos. And if AI can watch videos, it can likely extract UI problems. That suggests it has the potential to support UX research. So maybe AI can “watch” a video and detect some problems. But if you run the same video through AI multiple times, do you get the [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47475 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-300x169.png" alt="Feature image showing two AI robots, each holding a clipboard with a UI problems list" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>It looks like AI can “<a href="https://measuringu.com/can-ai-detect-usability-problems">watch” videos</a>. And if AI can watch videos, it can likely extract UI problems. That suggests it has the <a href="https://measuringu.com/ai-as-uxr-assistant-user-and-analyst/">potential to support UX research</a>.</p>
<p>So maybe AI can “watch” a video and detect some problems. But if you run the same video through AI multiple times, do you get the same results?</p>
<p>Reliability matters. If AI produces different results each time, it becomes untrustworthy, no matter how convincing its reasoning sounds.</p>
<p>Many variables can affect our assessment of reliability, including:</p>
<ul>
<li>AI models (Gemini, ChatGPT, Claude, Grok)</li>
<li>Versions (models are changing monthly)</li>
<li>LLM settings like <em>temperature</em>, which affect the randomness of the output</li>
<li>Prompts: What you ask (and even how many times you ask)</li>
</ul>
<p>There are a lot of variables to consider, but we have to start somewhere. So we did. In this article, we take a first step in assessing the reliability of AI problem detection. We examined how consistent two popular AI chatbots are at identifying usability problems from the same video.</p>
<h2><span lang="EN-US">Study Setup</span></h2>
<p>We had two LLMs, ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking, review the video and list the usability issues they discovered (four runs per LLM to check for within-LLM consistency; default settings only). Both are general-purpose LLMs for which MeasuringU has paid “pro” subscriptions (i.e., not free versions). Video 1 shows a 15-second clip of the full six-minute video.</p>
<div class="ast-oembed-container " style="height: 100%;"><iframe loading="lazy" title="MU_opentable_reservation first 15 sec" src="https://player.vimeo.com/video/1184232223?h=3252bb61a5&amp;dnt=1&amp;app_id=122963" width="1170" height="658" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe></div>
<p class="wp-caption-text" style="text-align: left;"><strong>Video 1: </strong>First 15 seconds of a participant searching for a sushi restaurant on the OpenTable website.</p>
<p>The task (visible at the bottom of the video) was to use OpenTable.com to:</p>
<blockquote><p>“Please think aloud. Make a reservation for four people at a sushi restaurant in Denver, CO tomorrow anytime after 5:00pm. Make sure the restaurant you select is not at the lowest or highest price point. Of the restaurants that fit these criteria, look at their overall rating, customer reviews, and photos to select the one that is the most appealing to you. Go as far as you can in the reservation process until you are asked for your personal information or account details. DO NOT fully confirm the reservation. Write down the restaurant name and the time of the reservation. You will be asked about this information after the task.”</p></blockquote>
<p>We used the following prompt:</p>
<blockquote><p>“During a usability test, the facilitator must keep track of participant behaviors as they navigate through tasks on a website, mobile app, software program, etc. We’d like you to watch a video of a usability test where participants were asked to book a table at a Sushi restaurant. As you&#8217;re watching, please look for problems the participant has while attempting to complete the task. For example, you can document the path users take, describe issues they encounter as well as what on the website might be causing problems. If you understand these instructions, let me know and I&#8217;ll drag the video in for you to review. Are you ready for the video?”</p></blockquote>
<p>The LLM response to this question was always some version of “Yes.”</p>
<p>In this study, we varied only the type of AI: ChatGPT and Gemini. The video, the prompt, and the LLM versions and settings were held constant, but we plan to vary them in future studies.</p>
<h2><span lang="EN-US">Assessing Reliability</span></h2>
<p>If you ask, AI will deliver (something). For each run, we compiled a list of usability problems that the AI model “discovered.”</p>
<p>For example, a problem noticeable in the video (and on the current OpenTable website) is that when entering “Denver” in the search field, the previously selected cuisine (sushi) was removed, making for a clumsy filter and search experience.</p>
<p>To assess the reliability (consistency) of their problem discovery, we computed the <a href="https://measuringu.com/assessing-interrater-reliability-in-ux-research/">any-2 agreement</a> between ChatGPT and Gemini and within each model. We treated the models like evaluators.</p>
<p>Any-2 agreement is a UX context-specific version of the <a href="https://link.springer.com/article/10.1186/s12859-019-3118-5">Jaccard similarity coefficient (<em>J</em>)</a>: the number of items two evaluators have in common (the intersection) divided by the number of items reported by either evaluator (the union). When there are more than two evaluators, the overall any-2 agreement is the average of the any-2 agreements for each pair of evaluators.</p>
<h3><span lang="EN-US">Computing Any-2 Agreement</span></h3>
<p>Imagine that two evaluators (Y and C) have independently created lists of usability issues where Y’s list has 14 issues, C’s list has 17, and their two lists have ten issues in common (Figure 1). Their any-2 agreement is the intersection (the ten issues they both discovered) divided by the union of both lists (14 + 17 − 10 = 21), which is 48% (10/21).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47477" src="https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-300x172.png" alt="Venn diagram of problem discovery by two evaluators. " width="436" height="250" srcset="https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-300x172.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-768x440.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-600x344.png 600w, https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1.png 909w" sizes="auto, (max-width: 436px) 100vw, 436px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1: </strong>Venn diagram of problem discovery by two evaluators.</p>
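<p>The arithmetic in this example can be written as a short Python sketch. The function name <code>any2_agreement</code> is our own label for illustration, not part of any library, and it assumes at least one issue was found.</p>

```python
def any2_agreement(n_first, n_second, n_shared):
    """Any-2 agreement (Jaccard) from two list sizes and their overlap."""
    # Issues found by either evaluator, counting shared issues only once.
    union = n_first + n_second - n_shared
    return n_shared / union


# Counts from the example: Y found 14 issues, C found 17, 10 in common.
agreement = any2_agreement(14, 17, 10)
print(f"{agreement:.0%}")  # 10/21, which prints as 48%
```

<p>Subtracting the shared issues keeps them from being counted twice in the union.</p>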
<p>Due to the <a href="https://measuringu.com/evaluator-effect/">well-documented evaluator effect</a>, we do not expect perfect agreement among UX researchers. In a controlled study like this (evaluators watching the same participants do the same tasks), our best estimate of typical any-2 agreement across multiple human evaluators (<a href="https://measuringu.com/examining-the-evaluator-effect-in-unmoderated-usability-testing/">based on 12 evaluations</a>) is 47%. (When studies are not controlled, the expected any-2 agreement is about 27%.)</p>
<p>This gives us a <strong>rough</strong> benchmark for assessing whether an any-2 agreement is typical (around 50%), relatively low (around 25%), or relatively high (around 75%).</p>
<h2><span lang="EN-US">Within-Group Results</span></h2>
<p>The first step in our analysis was to compute the mean any-2 agreement for each group of “evaluators” (ChatGPT, Gemini) to estimate the levels of within-group reliability.</p>
<h3><span lang="EN-US">ChatGPT Reliability Was Fair</span></h3>
<p>Table 1 shows the combined problem list for the four runs of ChatGPT. Table 2 shows the any-2 agreements for each pair of runs.</p>

<table id="tablepress-1029" class="tablepress tablepress-id-1029">
<thead>
<tr class="row-1">
	<th class="column-1">GPT #</th><th class="column-2">GPT Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-3">
	<td class="column-1">2</td><td class="column-2">Entering "Denver" in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-4">
	<td class="column-1">3</td><td class="column-2">Filters not helpful</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-5">
	<td class="column-1">4</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-6">
	<td class="column-1">5</td><td class="column-2">Surprised when typing sushi into search field did not lose current location</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">6</td><td class="column-2">Search results for sushi included many non-sushi restaurants</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-8">
	<td class="column-1">7</td><td class="column-2">Weak presentation of cuisine information in search results</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-9">
	<td class="column-1">8</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">9</td><td class="column-2">Sorting by highest rated put many non-sushi restaurants at the top of the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-11">
	<td class="column-1">10</td><td class="column-2">UI pushes browsing without good decision support</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-12">
	<td class="column-1">11</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-13">
	<td class="column-1">12</td><td class="column-2">Task not completed because participant did not reach reservation form</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1029 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> ChatGPT evaluations problem list.</p>

<table id="tablepress-1030" class="tablepress tablepress-id-1030">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Run 1</th><th class="column-3">Run 2</th><th class="column-4">Run 3</th><th class="column-5">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Run 1</strong></td><td class="column-2"> x</td><td class="column-3">30%</td><td class="column-4">63%</td><td class="column-5">38%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Run 2</strong></td><td class="column-2">30%</td><td class="column-3"> x</td><td class="column-4">33%</td><td class="column-5"> 0%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Run 3</strong></td><td class="column-2">63%</td><td class="column-3">33%</td><td class="column-4"> x</td><td class="column-5">25%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Run 4</strong></td><td class="column-2">38%</td><td class="column-3"> 0%</td><td class="column-4">25%</td><td class="column-5"> x</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1030 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2: </strong>Any-2 agreement for the ChatGPT evaluations.</p>
<p>With an <strong>overall any-2 agreement of 31%</strong>, the reliability of the ChatGPT evaluations was <strong>fair</strong>. None of the problems was identified on all four runs (5/12 were identified on three runs). Runs 2 and 4 had no problems in common.</p>
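<p>As a rough check, the pairwise and overall values can be recomputed from Table 1. The sketch below transcribes each run’s detected problem numbers into Python sets; the variable and function names are ours, chosen for illustration.</p>

```python
from itertools import combinations

# Problem numbers detected on each ChatGPT run, transcribed from Table 1.
runs = {
    "Run 1": {1, 2, 6, 7, 8, 10, 11},
    "Run 2": {1, 2, 3, 5, 10, 12},
    "Run 3": {1, 2, 4, 7, 10, 11},
    "Run 4": {6, 7, 9, 11},
}


def jaccard(a, b):
    """Shared problems divided by all problems found by either run."""
    return len(a.intersection(b)) / len(a.union(b))


# Any-2 agreement for each of the six pairs of runs.
pairwise = {
    (x, y): jaccard(runs[x], runs[y])
    for x, y in combinations(runs, 2)
}

# Overall any-2 agreement is the mean of the pairwise values.
overall = sum(pairwise.values()) / len(pairwise)

for pair, value in pairwise.items():
    print(pair, f"{value:.1%}")
print("Overall any-2:", f"{overall:.1%}")
```

<p>Printing to one decimal place shows the unrounded pairwise values (for example, 62.5% for Runs 1 and 3, shown as 63% in Table 2), and the overall mean comes out at about 31%.</p>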
<h3><span lang="EN-US">Gemini Reliability Was Better</span></h3>
<p>Table 3 shows the combined problem list for the four runs of Gemini. Table 4 shows the any-2 agreements for each pair of runs.</p>

<table id="tablepress-1031" class="tablepress tablepress-id-1031">
<thead>
<tr class="row-1">
	<th class="column-1">Gem #</th><th class="column-2">Gemini Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Entering "Denver" in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-3">
	<td class="column-1">2</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-4">
	<td class="column-1">3</td><td class="column-2">Participant used Ctrl-F to search page for "sushi"—not found</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-5">
	<td class="column-1">4</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-6">
	<td class="column-1">5</td><td class="column-2">Participant chose highest price tier</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">6</td><td class="column-2">Participant wanted to change sort to lowest rating first but not an option</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-8">
	<td class="column-1">7</td><td class="column-2">Seating options only presented after selecting time</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-9">
	<td class="column-1">8</td><td class="column-2">Set time to 5:10</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">9</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1031 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 3: </strong>Gemini evaluations problem list.</p>

<table id="tablepress-1032" class="tablepress tablepress-id-1032">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Run 1</th><th class="column-3">Run 2</th><th class="column-4">Run 3</th><th class="column-5">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Run 1</strong></td><td class="column-2"> x</td><td class="column-3">38%</td><td class="column-4">67%</td><td class="column-5">80%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Run 2</strong></td><td class="column-2">38%</td><td class="column-3"> x</td><td class="column-4">38%</td><td class="column-5">43%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Run 3</strong></td><td class="column-2">67%</td><td class="column-3">38%</td><td class="column-4"> x</td><td class="column-5">80%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Run 4</strong></td><td class="column-2">80%</td><td class="column-3">43%</td><td class="column-4">80%</td><td class="column-5"> x</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1032 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 4: </strong>Any-2 agreement for the Gemini evaluations.</p>
<p>With an <strong>overall any-2 agreement of 57%</strong>, the reliability of the Gemini evaluations was <strong>good</strong> (3/9 problems identified in all four runs, 4/9 identified by at least three runs).</p>
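<p>The detection counts in parentheses can be tallied from Table 3. This is a minimal sketch using the problem numbers transcribed from that table; the variable names are our own.</p>

```python
from collections import Counter

# Problem numbers detected on each Gemini run, transcribed from Table 3.
gemini_runs = [
    {1, 2, 3, 4, 7},     # Run 1
    {1, 2, 3, 5, 8, 9},  # Run 2
    {1, 2, 3, 4, 6},     # Run 3
    {1, 2, 3, 4},        # Run 4
]

# Number of runs that detected each problem.
detections = Counter(p for run in gemini_runs for p in run)

all_four = sorted(p for p, n in detections.items() if n == 4)
three_plus = sorted(p for p, n in detections.items() if n in (3, 4))
only_once = sorted(p for p, n in detections.items() if n == 1)

print(len(all_four), "of", len(detections), "found on all four runs")
print(len(three_plus), "of", len(detections), "found on at least three runs")
print(len(only_once), "of", len(detections), "found on only one run")
```

<p>The same tally shows that the remaining five problems each appeared on a single run, most of them on Run 2, which is why Run 2 has the lowest agreement with the other runs in Table 4.</p>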
<h2><span lang="EN-US">Between-Group Results</span></h2>
<p>The second step in our analysis was to compute the mean any-2 agreement across LLMs to estimate the between-group reliability, shown in Table 5.</p>

<table id="tablepress-1033" class="tablepress tablepress-id-1033">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Gem 1</th><th class="column-3">Gem 2</th><th class="column-4">Gem 3</th><th class="column-5">Gem 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>GPT 1</strong></td><td class="column-2">40%</td><td class="column-3">40%</td><td class="column-4">33%</td><td class="column-5">40%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>GPT 2</strong></td><td class="column-2">20%</td><td class="column-3">20%</td><td class="column-4">17%</td><td class="column-5">20%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>GPT 3</strong></td><td class="column-2">40%</td><td class="column-3">75%</td><td class="column-4">33%</td><td class="column-5">40%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>GPT 4</strong></td><td class="column-2"> 0%</td><td class="column-3">33%</td><td class="column-4"> 0%</td><td class="column-5"> 0%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1033 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 5: </strong>Any-2 agreement between ChatGPT and Gemini evaluations.</p>
<p>With an <strong>overall any-2 agreement of 28%</strong>, the between-AI reliability was <strong>low</strong> (closer to 25% than to 50%).</p>
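<p>The overall between-group figure is the mean of the 16 cells in Table 5. The sketch below uses the rounded percentages as printed in the table, so its result is approximate.</p>

```python
# Any-2 agreements (in percent) transcribed from Table 5.
# Rows are ChatGPT runs 1-4; columns are Gemini runs 1-4.
table5 = [
    [40, 40, 33, 40],
    [20, 20, 17, 20],
    [40, 75, 33, 40],
    [0, 33, 0, 0],
]

# Every ChatGPT run is paired with every Gemini run, giving 16 pairs.
cells = [value for row in table5 for value in row]
overall = sum(cells) / len(cells)

# Distance to the rough benchmarks for low (25%) and typical (50%).
print(f"Between-group any-2: {overall:.0f}%")
print("Points from 25%:", round(abs(overall - 25), 1))
print("Points from 50%:", round(abs(overall - 50), 1))
```

<p>Unlike the within-group tables, this matrix is not symmetric, so all 16 cells count toward the mean rather than only the cells on one side of the diagonal.</p>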
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>Along with the rest of the UX researcher community, we have a strong interest in the roles that AI might play in facilitating our work. Watching participants attempt to complete tasks is a fundamental but labor-intensive UX research activity, so any relief AI assistance might offer would be welcome.</p>
<p>As a first step in investigating how well ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking can find usability problems in videos, we collected evaluations of a single video (summarized as lists of usability problems), performing four runs with each LLM.</p>
<p>In this article, we evaluated any-2 agreement within each group of evaluations (ChatGPT, Gemini) and between the AIs. Our key findings were:</p>
<p><strong>Gemini had good reliability, and ChatGPT’s was fair. </strong>The average any-2 agreement for ChatGPT was 31%. We expect this level of reliability when comparing different evaluators, <a href="https://www.dialogdesign.dk/cue-studies/">different methods</a>, or different users. It’s certainly lower than you’d want, but still at a level considered acceptable in our industry.</p>
<p>For Gemini, the average any-2 agreement was good at 57%. From the literature and our own research with human evaluators, 57% is above the mean of 47% and on the higher side of acceptability.</p>
<p><strong>Between-group reliability for Gemini and ChatGPT was low. </strong>The any-2 agreement between ChatGPT and Gemini was low at 28%. That’s about 20 points below the average when different people examine the same video. This result is not great.</p>
<p><strong>Reliability isn’t accuracy. </strong>Are the problems identified by the LLMs as relevant as those discovered by a human evaluator? This question hasn’t been answered yet (a future analysis will). But to have accuracy (validity), we need to establish consistent (reliable) results, and at least for this video and prompt, the Gemini performance was sufficiently reliable.</p>
<p><strong>Humans vs. AI coming soon. </strong>We’re just getting started with our analyses. In an upcoming article, we’ll compare any-2 agreement between these LLMs and a problem list generated by professional human UX researchers. Stay tuned.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Can AI Detect Usability Problems?</title>
		<link>https://measuringu.com/can-ai-detect-usability-problems/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=can-ai-detect-usability-problems</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD •&nbsp;Lucas Plabst, PhD •&nbsp;Jim Lewis, PhD&nbsp;•&nbsp;Will Schiavone, PhD]]></dc:creator>
		<pubDate>Wed, 22 Apr 2026 03:39:35 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[error]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47333</guid>

					<description><![CDATA[You may have become numb to the overhyped headlines about AI. But it’d be wrong to dismiss the impact AI can have on our industry, not only because of job displacement but also because it could help us do our jobs more effectively (hopefully). To separate the hype and hysteria, we at MeasuringU think about AI’s impact [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47352 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-300x169.png" alt="Feature image showing an AI robot observing the user flow to detect usability issues" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>You may have become numb to the overhyped headlines about AI.</p>
<p>But it’d be wrong to dismiss the impact AI can have on our industry, not only because of job displacement but also because it could help us do our jobs more effectively (hopefully).</p>
<p>To separate the hype and hysteria, we at MeasuringU think about AI’s impact in UX research in <a href="https://measuringu.com/ai-as-uxr-assistant-user-and-analyst/">three ways</a>: AI as Research Assistant, AI as (Synthetic) User, and AI as Researcher.</p>
<p>One of the more valuable activities we do in UX research, whether as a researcher or an assistant, is to find (and recommend fixes for) usability problems in an interface.</p>
<p>Finding problems typically comes from researchers observing people interacting with a product, either live in a lab (like ours at MeasuringU, Figure 1), remotely using tools like <a href="https://measuringu.com/muiq/">MUiQ<sup>®</sup></a>, or by reviewing recordings of moderated or unmoderated sessions.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042126-F1.jpg"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47336" src="https://measuringu.com/wp-content/uploads/2026/04/042126-F1.jpg" alt="AI-generated image of a robot observing a usability test session." width="602" height="602" srcset="https://measuringu.com/wp-content/uploads/2026/04/042126-F1.jpg 602w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-300x300.jpg 300w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-150x150.jpg 150w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-70x70.jpg 70w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-600x600.jpg 600w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-100x100.jpg 100w" sizes="auto, (max-width: 602px) 100vw, 602px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> AI-generated image of a robot observing a usability test session.</p>
<p>Watching lots of videos can be time-consuming and tedious. Usability problems often aren’t obvious defects in an interface. They can come from perfectly compiled code and a beautiful interface. Due to the <a href="https://measuringu.com/assessing-interrater-reliability-in-ux-research/">well-known evaluator effect</a>, even trained professionals do not detect usability problems with perfect consistency.</p>
<p>Can AI really uncover usability problems like UX researchers?</p>
<p>A few years ago, this sort of question seemed like science fiction. Now it’s not <em>if</em> but <em>how well</em> AI can watch videos like a researcher or research assistant. Can human researchers use it and trust the results?</p>
<h2><span lang="EN-US">How Does AI “Watch” a Video?</span></h2>
<p>Here’s a 15-second clip of a participant starting the task of finding a sushi restaurant on the OpenTable website. It’s part of a six-minute video of an unmoderated usability test conducted on the MUiQ platform.</p>
<div class="ast-oembed-container " style="height: 100%;"><iframe loading="lazy" title="MU_opentable_reservation first 15 sec" src="https://player.vimeo.com/video/1184232223?h=3252bb61a5&amp;dnt=1&amp;app_id=122963" width="1170" height="658" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe></div>
<p class="wp-caption-text" style="text-align: left;"><strong>Video 1: </strong>First 15 seconds of the OpenTable.com video.</p>
<p>We asked Gemini 3.1 Pro to “Please tell me in extreme detail what you see in the video from start to finish going second by second.” Here&#8217;s how it responded:</p>
<ul>
<li><strong>00:00–00:04:</strong> The video begins on the OpenTable.com homepage. The top navigation bar shows options for &#8220;For Business,&#8221; &#8220;Mobile,&#8221; &#8220;Help,&#8221; language selection (&#8220;EN&#8221;), and buttons for &#8220;Sign up&#8221; and &#8220;Sign in.&#8221; The main search area displays default parameters: a date of &#8220;Jul 19, 2022,&#8221; a time of &#8220;7:00 PM,&#8221; and a party size of &#8220;2 people.&#8221; The search bar says &#8220;Location, Restaurant, or Cuisine&#8221; with a red &#8220;Let&#8217;s go&#8221; button. Below this is a message reading, &#8220;It looks like you&#8217;re in Dallas Suburbs. Not correct? Get current location.&#8221; A section titled &#8220;Book for dinner tonight&#8221; shows several restaurants, including Fogo de Chao, Lombardi Cucina Italiana, Seasons 52, and Perry&#8217;s Steakhouse. At the bottom of the screen, a task description box outlines the user&#8217;s objective: booking a sushi restaurant in Denver, CO for four people tomorrow after 5:00 PM. The user begins speaking, stating they are looking for a sushi restaurant.</li>
<li><strong>00:05–00:09:</strong> The user&#8217;s mouse cursor idles on the left side of the screen. The user continues speaking, noting they need a reservation for tomorrow anytime after 5:00 PM. They observe that today is the 19th, so they need to book for the 20th.</li>
<li><strong>00:10–00:11:</strong> The cursor moves to the date selector and clicks on &#8220;Jul 19, 2022.&#8221; A calendar dropdown appears. The user clicks on the number &#8220;20&#8221; to change the date to Jul 20, 2022.</li>
<li><strong>00:12–00:15:</strong> The user moves the cursor to the time selector, currently set at &#8220;7:00 PM,&#8221; and clicks it. A dropdown menu of times in 30-minute increments appears. The user scrolls up and clicks on &#8220;5:00 PM.&#8221;</li>
</ul>
<p>Even though the summary of the onscreen text describing the user task does not include all task requirements (which we did not specifically prompt for), this is a reasonably accurate description of what’s on the screen and the events.</p>
<p>Gemini didn’t actually watch the video the way we do. It doesn’t experience a continuous stream of motion or notice subtle hesitation in real time. Instead, it samples only a few frames per second of video (often just one or two of the 30 frames in each second).</p>
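<p>To get a feel for how much is dropped, here is a minimal sketch of frame sampling for the 15-second clip. It assumes a 30-frames-per-second recording and evenly spaced sampling; the function name is ours, and the actual sampling strategy a given model uses may differ.</p>

```python
def sampled_frame_indices(duration_s, native_fps, sample_fps):
    """Indices of the frames kept when sampling evenly at sample_fps."""
    # Keep one frame out of every `step` frames (assumes even division).
    step = native_fps // sample_fps
    total_frames = duration_s * native_fps
    return list(range(0, total_frames, step))


# A 15-second clip at an assumed 30 frames per second.
total = 15 * 30
for rate in (1, 2):
    kept = sampled_frame_indices(15, 30, rate)
    print(f"{rate} fps: keeps {len(kept)} of {total} frames")
```

<p>At one frame per second, anything that starts and ends between two kept frames (a quick hover, a brief flash of an error message) never reaches the model.</p>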
<p>Each frame is then broken into smaller regions and converted into numerical representations that the model can process. If there’s audio, spoken words are transcribed into text and aligned with those frames. By the time the model begins “analyzing,” the video has already been reduced to a combination of image fragments and text.</p>
<p>From there, the model treats those inputs similarly to how it processes language. The visual and textual information is converted into tokens (small chunks of data) and passed through a neural network that looks for patterns (things such as interface elements, changes in screens, or sequences of actions).</p>
<h2>Autocorrect for Video Watching</h2>
<p>Because AI is working from snapshots rather than continuous playback, it doesn’t directly see motion. Instead, it infers what likely happened between frames (for example, that a user scrolled, tapped, or navigated to a new page). This makes the process efficient, but it also means short or subtle behaviors can be missed.</p>
<p>Based on the sampled frames and any accompanying text, it generates the most likely description of what happened, much like how it predicts the next word in a sentence. Basically, it’s like autocorrect on steroids for videos.</p>
<p>That’s why the output can sound surprisingly natural and insightful, even when it’s not entirely accurate. It’s less like a researcher watching a session and more like a system generating a plausible narrative from partial information.</p>
<h3>Losing Frames</h3>
<p>As long as there’s been autocorrect, there have been, well, mistakes (often <a href="https://www.huffpost.com/entry/funniest-autocorrect-faiils-2014_n_6391880">hilarious ones</a>). The sampling that makes AI fast also makes it “<a href="https://cs.stanford.edu/people/eroberts/courses/soco/projects/data-compression/lossy/index.htm">lossy</a>.” By looking at only a fraction of the frames, the model can miss brief moments of hesitation, confusion, or micro-interactions that are often critical in usability analysis. What’s efficient for processing might not always be sufficient for insight.</p>
<h3>Probabilistic Output</h3>
<p>But unlike autocorrect, which works the same each time it’s presented with a partial word, AI outputs aren’t always the same. They’re probabilistic rather than deterministic. Even with the same video and the same prompt, the model may generate slightly different descriptions each time. That’s because it’s not retrieving a fixed answer but generating the most likely sequence of words from a range of possibilities. The results can be consistent in general themes, but not identical in wording or even emphasis. And with current systems, there is always the possibility of <a href="https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)">hallucination</a>. For researchers, these concerns mean that AI outputs should be treated less like definitive observations and more like plausible interpretations that still need validation.</p>
<h3>Temperature</h3>
<p>Part of this variability comes from a setting called <em>temperature</em>, which controls how much randomness the model uses when generating responses. Temperature typically ranges from 0 (close to deterministic) to around 2 (much more variable). Most models use a middle setting by default, which balances consistency and variation. Higher temperatures can surface a wider range of interpretations (sometimes useful for exploratory analysis), while lower temperatures produce more consistent outputs—but even then, results aren’t perfectly repeatable.</p>
<p>Figure 2 illustrates this process.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042126-F2.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47337" src="https://measuringu.com/wp-content/uploads/2026/04/042126-F2.png" alt="Visual overview of how an AI “watches” a video. " width="1182" height="788" srcset="https://measuringu.com/wp-content/uploads/2026/04/042126-F2.png 1182w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-300x200.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-1024x683.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-768x512.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-600x400.png 600w" sizes="auto, (max-width: 1182px) 100vw, 1182px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2: </strong>Visual overview of how an AI “watches” a video.</p>
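<p>The temperature setting described above can be illustrated with a small sketch of temperature-scaled softmax, a common way to turn a model’s raw scores into word probabilities. The three scores are hypothetical values for illustration, and this version requires a temperature above zero; it is not a description of how any particular product implements the setting.</p>

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical raw scores for three candidate next words.
scores = [2.0, 1.0, 0.5]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 2) for p in probs])
```

<p>At the low temperature, nearly all the probability goes to the top-scoring word, so repeated runs tend to agree. At the high temperature, the probabilities flatten out and less likely words are chosen more often.</p>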
<p>This gives you an idea about how AI reviews videos for usability problems. But what does it look like when you ask an AI to perform a usability evaluation of a video?</p>
<h2><span lang="EN-US">Problem List from ChatGPT</span></h2>
<p>We uploaded the full six-minute video of a person attempting to find a sushi restaurant in Denver on the OpenTable website into ChatGPT (model 5.4 Thinking). We selected the video because it has several known usability problems that humans consistently detect. We prompted ChatGPT with:</p>
<blockquote><p>“During a usability test, the facilitator must keep track of participant behaviors as they navigate through tasks on a website, mobile app, software program, etc. We’d like you to watch a video of a usability test where participants were asked to book a table at a Sushi restaurant. As you&#8217;re watching, please look for problems the participant has while attempting to complete the task. For example, you can document the path users take, describe issues they encounter as well as what on the website might be causing problems. If you understand these instructions, let me know and I&#8217;ll drag the video in for you to review. Are you ready for the video?”</p></blockquote>
<p>ChatGPT indicated “yes,” then took only three minutes to process the six-minute video (half its running time, because it sampled only a fraction of the frames to piece together its visual autocorrect narrative).</p>
<p>From its output, we derived a list of seven usability problems (Table 1).</p>

<table id="tablepress-1028" class="tablepress tablepress-id-1028">
<thead>
<tr class="row-1">
	<th class="column-1">Problem List</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td>
</tr>
<tr class="row-3">
	<td class="column-1">Entering Denver in search field lost previous selection of sushi as cuisine</td>
</tr>
<tr class="row-4">
	<td class="column-1">Search results for sushi included many non-sushi restaurants</td>
</tr>
<tr class="row-5">
	<td class="column-1">Weak presentation of cuisine information in search results</td>
</tr>
<tr class="row-6">
	<td class="column-1">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td>
</tr>
<tr class="row-7">
	<td class="column-1">UI pushes browsing without good decision support</td>
</tr>
<tr class="row-8">
	<td class="column-1">Selected restaurant was categorized as Seafood instead of Sushi, so participant failed the task</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1028 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> List of problems “discovered” by ChatGPT review of usability test video.</p>
<h2><span lang="EN-US">Looks Good, However …</span></h2>
<p>On the surface, that looks pretty good. It’s plausible, specific, and aligned with what a researcher might note. But it leaves us with a few questions:</p>
<ul>
<li>How many of these are <em>actual</em> usability problems versus plausible-sounding interpretations (autocorrect) or hallucinations?</li>
<li>How consistent are the results across multiple runs (reliability)?</li>
<li>How closely do these match what human UX researchers would identify (validity)?</li>
</ul>
<p>We’ll explore these important questions in future articles.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>A Review of Experiments with Synthetic Users</title>
		<link>https://measuringu.com/review-of-experiments-with-synthetic-users/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=review-of-experiments-with-synthetic-users</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Wed, 15 Apr 2026 05:22:48 +0000</pubDate>
				<category><![CDATA[UX]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47300</guid>

					<description><![CDATA[One of the hardest parts of conducting user and market research is recruiting participants. It takes time, costs money, and on top of that, there are no-shows and fraudsters. Now imagine being able to conduct UX research without the hassle of recruiting the “U.” Enter the idea of AI-generated synthetic users that offer the promise [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47329 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-300x169.png" alt="Feature image showing two researchers examining an AI synthetic user robot" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>One of the hardest parts of conducting user and market research is recruiting participants. It takes time, costs money, and on top of that, there are <a href="https://measuringu.com/typical-no-show-rate-for-moderated-studies/">no-shows</a> and fraudsters.</p>
<p>Now imagine being able to conduct UX research without the hassle of recruiting the “U.” Enter the idea of AI-generated synthetic users that offer the promise of participant input being:</p>
<ul>
<li>Simpler (no need to deal with humans)</li>
<li>Less costly (no panel/respondent fees)</li>
<li>Faster (finish in hours or days instead of weeks or months)</li>
<li>Scalable (get data from thousands of synthetic users instead of a relative handful of participants)</li>
<li>Broader in reach (access to user groups that are hard to find or very expensive to recruit)</li>
<li>More secure (no need for nondisclosure agreements and no risk of human disclosure)</li>
</ul>
<p>At least, that’s the dream of research with synthetic users.</p>
<p>Others view synthetic users as more of a nightmare. They are concerned that research with synthetic users can lead to:</p>
<ul>
<li>Plausible-looking data that’s just wrong</li>
<li>Shallow qualitative responses because synthetic users have no real lived experience</li>
<li>Reinforced biases driven by large language models (LLMs)</li>
<li>Artificially low variability (quantitative or qualitative)</li>
</ul>
<p>We’ve seen these conflicting attitudes about synthetic users play out in online posts and conversations over the past few years, most recently with the promotion of proprietary models of synthetic users by companies like Qualtrics and Aaru, followed by criticism of that promotion by influential UX researchers.</p>
<h2>Pro-Synthetic Voices</h2>
<p>Qualtrics is the dominant (and <a href="https://www.pymnts.com/acquisitions/2026/wall-street-snubs-qualtrics-debt-over-ai-disruption-risks/">debt-loaded</a>) survey platform that’s made a big bet on synthetic users. Their synthetic dataset was trained on millions of survey responses, and they reported that it can <a href="https://www.greenbook.org/insights/data-science/testing-synthetic-data-against-academic-benchmarks-a-replication-study">realistically mimic human survey patterns</a>. The variability and correlations mirror human response patterns better than data from general LLMs, at least for the attitudinal survey questions they used.</p>
<p>Aaru is another synthetic user simulation platform that has gotten attention. The global consulting firm EY used Aaru’s proprietary multi-agent simulation to replicate a 3,600-person global wealth survey. They <a href="https://www.ey.com/en_us/insights/wealth-asset-management/how-ai-simulation-accelerates-growth-in-wealth-and-asset-management">reported strong agreement across multiple statistical metrics</a> (high correlation, modest error), suggesting that synthetic data approximated real survey results at scale (done in one day versus six months!).</p>
<h2>Anti-Synthetic Voices</h2>
<p>First from the anti-synthetic camp is Chris Chapman, a longtime quantitative UX researcher (Amazon, Google, Microsoft) and co-chair of the Quant UX Conference. His most recent presentation clearly elaborates that <a href="https://quantuxblog.com/synthetic-survey-data-its-not-data">synthetic users are not users</a>. His blunt conclusion is that <strong>synthetic data has no place in survey research</strong>.</p>
<p>Another voice is John Mecke, a SaaS and product strategy writer who argues that <a href="https://developmentcorporate.com/saas/synthetic-responses-market-research-2025/">synthetic users face five core limitations</a>: no lived experience, misleading “too-accurate” results, cultural bias, weak statistical reliability, and narrower real-world usefulness than claimed.</p>
<p>And there’s also Constantine Papas, a UX research strategist and writer of <em>The Voice of User</em>. Papas argues that <a href="https://www.thevoiceofuser.com/aaa-billion-dollar-ai-startup-is-selling-you-a-survey-the-wall-street-journal-wrote-a-love-letter-about-it/">synthetic research is being oversold</a> largely by cherry-picking favorable results from financially interested parties. When describing the EY study from Aaru’s algorithms, he argues that the correlations are largely because the LLMs were already trained on this data. That’s hardly predicting.</p>
<p>Finally, a recent preprint of a comprehensive literature review of 182 papers also casts strong doubt on the ability of synthetic users to do more than mimic already collected data. We recommend reading the <a href="https://www.researchgate.net/publication/401777396_Synthetic_Participants_Generated_by_Large_Language_Models_A_Systematic_Literature_Review">preprint</a> (not yet peer reviewed) and a <a href="https://www.thevoiceofuser.com/the-largest-review-of-synthetic-participants-ever-conducted-found-exactly-what-youd-expect-synthetic-users-dont-work/">discussion of the research</a> by Papas.</p>
<p>As interesting as these online conversations are, they have not been formally reviewed for scientific quality. In this article, we briefly review 12 <strong>peer-reviewed research papers</strong> on the use of synthetic users in UX and UX-adjacent research. For full details on experimental designs and results (e.g., experimental comparisons, models, prompting, settings, metrics), see the links to the papers and articles in the appendix.</p>
<h2>Our Inclusion Criteria for Papers on Synthetic Users</h2>
<p>We searched the literature for peer-reviewed research that had been published no earlier than 2023 and used LLMs no earlier than GPT-3.5. This turned up 12 papers that can be broadly categorized as attempts to replicate:</p>
<ul>
<li>Psychological experiments (five papers)</li>
<li>Survey results (three papers)</li>
<li>Social research (three papers)</li>
<li>UX research (one paper)</li>
</ul>
<p>We’ll now review the evidence in each of these four categories.</p>
<h3>Psychological Experiments: Sometimes Human-Like, Often Inconsistent</h3>
<p>The idea that digital data can replicate people predates LLMs. From the mid-1990s through the 2000s, a popular research program in social psychology was the &#8220;<a href="https://en.wikipedia.org/wiki/Computers_are_social_actors">computers are social actors</a>&#8221; paradigm, recreating classical psychological experiments in which one of the human participants was replaced by a computer to investigate how this affected human behavior.</p>
<p>Several researchers have adapted this approach to one in which there are <em>no</em> human participants, exploring the extent to which LLMs mimic humans in psychological experiments.</p>
<p>If synthetic users act like humans in experiments, maybe we can use them instead of humans in some studies. But why would anyone think this would be possible? Well, because modern LLMs are trained on huge amounts of human-generated content, <a href="https://www.sciencedirect.com/science/article/pii/S2590198226000825">the models may include latent social information</a>. Depending on the quality of this latent information, with appropriate prompts, they might produce human-like outputs.</p>
<p>The results from these experiments were <strong>mixed</strong>, with the following key findings from the five papers:</p>
<ul>
<li>Using GPT-3.5, Dillion et al. (<a href="https://static1.squarespace.com/static/671011231a30d401349ce94c/t/67c7999a4348db3e090ea128/1741134235231/can-AI-language-models-replace-human-participants.pdf">2023</a>) found significant correlation (<em>r</em> = .95) between synthetic and human moral judgments (encouraging), but there were many points with large differences between human and synthetic mean ratings (discouraging).</li>
<li>Goli and Singh (<a href="https://pubsonline.informs.org/doi/10.1287/mksc.2023.0306">2024</a>) used GPT-3.5 and GPT-4 in a replication of experiments in which synthetic users were presented with a choice between a certain number of tokens in a month versus waiting for a larger number of tokens later. GPT-3.5 ignored differences in reward amounts (discouraging), while GPT-4 had some sensitivity to the differences (encouraging), but its discount rates were larger than those observed with humans (discouraging).</li>
<li>Attempting to replicate 14 classic social science studies using GPT-3.5, Park et al. (<a href="https://link.springer.com/article/10.3758/s13428-023-02307-x">2024</a>) reported that six had unanalyzable data (too little variability), five failed replication, and three were successfully replicated. So, 21% of the attempts were successful (encouraging), but 79% were unsuccessful (discouraging).</li>
<li>Using GPT-4, de Winter et al. (<a href="https://www.sciencedirect.com/science/article/pii/S0191886924001892">2024</a>) created 2000 text-based personas that completed a short form of the <a href="https://www.sciencedirect.com/topics/social-sciences/big-five">Big Five Inventory</a>. The synthetic data matched the expected factor structure and had high correlation with human data (encouraging) but significant deviation from the humans’ item means (discouraging).</li>
<li>Almeida et al. (<a href="https://arxiv.org/pdf/2308.01264">2024</a>) replicated eight psychology studies of legal and moral reasoning using Gemini Pro (1.0), Claude 2.1, GPT-4, and Llama 2 Chat 70b. They found differing levels of alignment with human responses, with the closest match for GPT-4. “Nonetheless, even when LLM-generated responses are highly correlated to human responses [encouraging], there are still systematic differences, with a tendency for models to exaggerate effects that are present among humans, in part by reducing variance” (discouraging).</li>
</ul>
<h3>Surveys: Match on Averages, Fail on Details</h3>
<p>Even if synthetic users are inconsistent in how they react to classical psychology experiments, they might be able to match human response patterns in surveys. However:</p>
<ul>
<li>Bisbee et al. (<a href="https://www.cambridge.org/core/journals/political-analysis/article/synthetic-replacements-for-human-survey-data-the-perils-of-large-language-models/B92267DC26195C7F36E63EA04A47D2FE">2024</a>) used GPT-3.5 Turbo (with some replication by GPT-4 and Falcon-40B-Instruct) to reproduce the 2016–2020 American National Election Studies (ANES). They encountered numerous statistical issues, with synthetic respondents somewhat matching high-level means (encouraging) but having inaccurate subgroup means, small standard deviations, inaccurate regression coefficients, and failure to meet even basic requirements for replication (discouraging).</li>
<li>Using GPT-4, Shrestha et al. (<a href="https://journals.sagepub.com/doi/10.1177/23794607241311793">2024</a>) compared synthetic and human responses to 43 policy questions on topics like climate, spending, and labor in the U.S., Saudi Arabia, and the UAE. The means for the 43 questions indicated that the responses of human and synthetic participants were reasonably aligned (encouraging) but not precisely the same, with about 70% significantly different (discouraging).</li>
<li>Tjuatja et al. (<a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00685/124261/Do-LLMs-Exhibit-Human-like-Response-Biases-A-Case">2024</a>) used variants of Llama, GPT-3.5 Turbo, and GPT-3.5 Turbo Instruct to investigate whether synthetic responses to different item formats matched expected human response behavior biases. “Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior” (discouraging).</li>
</ul>
<h3>Social Research: Trends Match Humans, but the Details Don’t</h3>
<p>The goal of studies in social research is similar to psychological experimentation, though with more focus on interpersonal behaviors and attitudes.</p>
<ul>
<li>In experiments with GPT-4 and Llama3, Yu et al. (<a href="https://www.sciencedirect.com/science/article/pii/S2949882125001173">2025</a>) compared synthetic user and human responses to standardized psychological questionnaires measuring empathy. The expected factor structure of the questionnaires was produced by GPT-4 (encouraging), but the magnitudes of the synthetic scores did not match those of humans (discouraging). Responses from Llama3 synthetic users did not match the expected factor structure (discouraging).</li>
<li>Wang et al. (<a href="https://arxiv.org/pdf/2402.01908">2025</a>) showed that the LLMs they investigated (Llama-2-Chat 7B, Wizard Vicuna Uncensored 7B, GPT-3.5 Turbo, GPT-4) may not be able to distinguish between text written about a group of people by outsiders and text written by members of that group. This inherent bias makes them unsuitable for creating synthetic users that can replace actual users in social research (discouraging).</li>
<li>Rafikova and Voronin (<a href="https://ideas.repec.org/a/spr/jcsosc/v9y2026i1d10.1007_s42001-025-00452-1.html">2026</a>) used GPT-4 to investigate synthetic responses to complex social issues (e.g., immigration, gender stereotypes). Synthetic users matched the direction and magnitude of human attitudinal trends (encouraging) but had weak correspondence with deeper models of attitudinal variance (discouraging).</li>
</ul>
<h3>UX Interviews: Convincing at First, Limited on Follow-Up</h3>
<p>We didn’t turn up studies directly related to quantitative UX research (although that is informed by psychological, survey, and social research). We did, however, find one related to researcher experiences interviewing humans and synthetic users.</p>
<ul>
<li>Kapania et al. (<a href="https://dl.acm.org/doi/full/10.1145/3706598.3713220">2025</a>) had 19 UX researchers use GPT-4-Turbo to recreate one of their recent projects that had originally been conducted with human participants. &#8220;Initially skeptical, researchers were surprised to see similar narratives emerge in the LLM-generated data when using the interview probe. However, over several conversational turns, they went on to identify fundamental limitations, such as how LLMs foreclose participants’ consent and agency, produce responses lacking in palpability and contextual depth, and risk delegitimizing qualitative research methods&#8221; (discouraging).</li>
</ul>
<h2>Summary and Discussion</h2>
<p>We reviewed 12 papers describing recent research comparing synthetic users and humans in four contexts of interest to UX researchers. In our summaries, we tagged 9 findings as encouraging and 14 as discouraging. So, the results aren’t universally bad, but they definitely aren’t great. We summarized those in Table 1.</p>

<table id="tablepress-1027" class="tablepress tablepress-id-1027">
<thead>
<tr class="row-1">
	<th class="column-1">Theme</th><th class="column-2">Encouraging Findings</th><th class="column-3">Discouraging Findings</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1"><strong>Matched means/percents</strong></td><td class="column-2"><strong>4</strong> (B, P, R, S)</td><td class="column-3"><strong>7</strong> (A, D, G, P, S, W, Y)</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Correlated</strong></td><td class="column-2"><strong>4</strong> (A, D, G, W)</td><td class="column-3"><strong>1</strong> (R)</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Matched expected variance</strong></td><td class="column-2"><strong>0</strong></td><td class="column-3"><strong>3</strong> (A, B, P)</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Matched factor structure</strong></td><td class="column-2"><strong>2</strong> (W, Y)</td><td class="column-3"><strong>1</strong> (Y)</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>Matched expected replication</strong></td><td class="column-2"><strong>1</strong> (P)</td><td class="column-3"><strong>2</strong> (B, P)</td>
</tr>
<tr class="row-7">
	<td class="column-1"><strong>Good qualitative depth</strong></td><td class="column-2"><strong>0</strong></td><td class="column-3"><strong>3</strong> (K, T, Wa)</td>
</tr>
<tr class="row-8">
	<td class="column-1"><strong>Unbiased/representative</strong></td><td class="column-2"><strong>0</strong></td><td class="column-3"><strong>1</strong> (Wa)</td>
</tr>
<tr class="row-9">
	<td class="column-1"><strong>Matched regression weights</strong></td><td class="column-2"><strong>0</strong></td><td class="column-3"><strong>1</strong> (B)</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1027 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Summary of the number of encouraging and discouraging findings. The letters in parentheses indicate the sources for the findings. Letters are the first letter of the last name of the lead author; W = de Winter, Wa = Wang. Some studies produced both encouraging and discouraging findings in the same themes (e.g., Yu found both matching and mismatching factor structures), and some findings matched multiple themes.</p>
<h3>Correlation Does Not Mean Equivalence</h3>
<p>Some results were promising, but most found discrepancies between synthetic and human results.</p>
<p>For example, Dillion et al. (2023) found significant alignment between synthetic and human moral judgments, but Almeida et al. (2024) reported that even when synthetic and human moral judgments correlated, there were systematic differences with synthetic data exaggerating effects seen with humans.</p>
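<p>To see how a near-perfect correlation can coexist with mismatched means, consider this minimal Python sketch. The ratings are invented purely for illustration (they come from none of the reviewed papers), and the names <code>pearson_r</code>, <code>human</code>, and <code>synthetic</code> are our own assumptions. The synthetic values are built to exaggerate each human effect and shift it upward.</p>

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical mean ratings for five scenarios (invented for illustration).
human = [2.0, 2.5, 3.0, 3.5, 4.0]
# Hypothetical synthetic ratings: every human effect doubled and shifted up.
synthetic = [3.5 + 2.0 * (h - 3.0) for h in human]

r = pearson_r(human, synthetic)
mean_gap = sum(synthetic) / len(synthetic) - sum(human) / len(human)
largest_gap = max(abs(s - h) for s, h in zip(synthetic, human))

print(f"r = {r:.2f}, mean gap = {mean_gap:.2f}, largest item gap = {largest_gap:.2f}")
```

<p>In this made-up case the correlation is perfect, yet the synthetic mean is half a point higher and one item is off by a point and a half. Correlation captures only whether the two sets of scores rise and fall together, not whether they agree in value.</p>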
<h3>Superficial Agreement, Deeper Errors</h3>
<p>Issues with synthetic data included reduced variance, misalignment of means/percentages, distorted correlations, inaccurate regression coefficients, and shallow experiential narratives.</p>
<p>Different studies reported different issues. Sometimes high-level means matched but deeper correlational metrics were distorted (e.g., Bisbee et al., 2024); at other times, correlations were high, but there were significant differences among means (e.g., de Winter et al., 2024).</p>
<h3>Rapid Model Changes Make Findings Quickly Outdated</h3>
<p>Research on synthetic users is complicated by variation in contexts, models, prompting, and settings.</p>
<p>Controlled experimentation relies on being able to control the experimental environment. Different researchers use different models with different prompting and settings. Even the papers published in 2026 used older models than are currently available because research necessarily precedes peer-reviewed publication. Next year’s models will be different from this year’s.</p>
<h3>Proprietary Models May Work but Lack Validation</h3>
<p>Further complicating the research landscape is the emergence of proprietary models incorporating extensive amounts of survey data. Proprietary models from companies like Qualtrics and Aaru might perform better than general LLM chatbots in the production of synthetic samples that match human attitudes and performance. It’s just too early to tell. To date, we have not seen any peer-reviewed publications using these platforms.</p>
<h3>Directional When Answers Are Unknown</h3>
<p>The encouraging results regarding occasionally high correlation of human and synthetic data suggest that the results from synthetic users can provide directional signals, but synthetic estimates are often imprecise and inconsistent from study to study. The promise of synthetic users is alluring, but until there is strong evidence of consistently good matching with human data, it seems premature to rely on research with synthetic users for critical decision-making.</p>
<h3>Potentially Useful When Answers Are Known and Stable</h3>
<p>We’re not done yet with this topic and are planning our own analysis. But right now, it seems the most promising use of synthetic users is deriving insights from already collected data. Why ask a survey question if the answer is already known and stable? Most attitudes aren’t stable and are highly dependent on the audience. But if you have surveyed the same type of population repeatedly and have relatively stable results (possibly like the EY study), then you may know the answer. In that case, an LLM is just an easier way to query your database. Just don’t think it’s predicting something that’s not already known.</p>
<h2>Appendix: Links to Papers and Articles</h2>
<p><a href="https://arxiv.org/pdf/2308.01264">Almeida, G. F. C. F., Nunes, J. L., Engelmann, N., Wiegmann, A., &amp; de Araújo, M. (2024). Exploring the psychology of LLMs’ moral and legal reasoning. <em>Artificial Intelligence</em>, 333.</a></p>
<p><a href="https://www.ey.com/en_us/insights/wealth-asset-management/how-ai-simulation-accelerates-growth-in-wealth-and-asset-management">Babcic, S., Hamaloglu, U., &amp; Munshi, S. (2025, Oct 25). <em>How AI simulation accelerates growth in wealth and asset management</em>. EY.</a></p>
<p><a href="https://www.cambridge.org/core/journals/political-analysis/article/synthetic-replacements-for-human-survey-data-the-perils-of-large-language-models/B92267DC26195C7F36E63EA04A47D2FE">Bisbee, J., Clinton, J. D., Dorff, C., Kenkel, B., &amp; Larson, J. M. (2024). Synthetic replacements for human survey data? The perils of large language models. <em>Political Analysis</em>, <em>32</em>(4), 401–416.</a></p>
<p><a href="https://quantuxblog.com/synthetic-survey-data-its-not-data">Chapman, C. (2025, Jun 18). <em>Synthetic survey data? It’s not data</em>. Quantitative UX Research Blog. </a></p>
<p><a href="https://www.sciencedirect.com/science/article/pii/S0191886924001892">de Winter, J. C. F., Driessen, T., &amp; Dodou, D. (2024). The use of ChatGPT for personality research: Administering questionnaires using generated personas. <em>Personality and Individual Differences</em>, <em>228</em>, #112729.</a></p>
<p><a href="https://static1.squarespace.com/static/671011231a30d401349ce94c/t/67c7999a4348db3e090ea128/1741134235231/can-AI-language-models-replace-human-participants.pdf">Dillion, D., Tandon, N., Gu, Y., &amp; Gray, K. (2023). Can AI language models replace human participants? <em>Trends in Cognitive Sciences</em>, <em>27</em>(7), 597–600. </a></p>
<p><a href="https://pubsonline.informs.org/doi/10.1287/mksc.2023.0306">Goli, A., &amp; Singh, A. (2024). Can large language models capture human preferences? <em>Marketing Science</em>, <em>43</em>(4), 709–722. [Abstract only]. </a></p>
<p><a href="https://dl.acm.org/doi/10.1145/3706598.3713220">Kapania, S., Agnew, W., Eslami, M., Heidari, H., &amp; Fox, S. E. (2025). Simulacrum of stories: Examining large language models as qualitative research participants. <em>Proceedings of CHI ‘25</em>, #489, 1–17. </a></p>
<p><a href="https://www.researchgate.net/publication/401777396_Synthetic_Participants_Generated_by_Large_Language_Models_A_Systematic_Literature_Review">Kuric, E., Demcak, P., &amp; Krajcovic, M. (2026). Synthetic participants generated by large language models: A systematic literature review. [Preprint—Not yet peer reviewed]. </a></p>
<p><a href="https://www.greenbook.org/insights/data-science/testing-synthetic-data-against-academic-benchmarks-a-replication-study">McLean, D. (2026, Feb 2). <em>Testing synthetic data against academic benchmarks: A replication study</em>. Greenbook. </a></p>
<p><a href="https://developmentcorporate.com/saas/synthetic-responses-market-research-2025/">Mecke, J. (2025, Oct 31). Synthetic responses in market research: Promise vs. reality in 2025. Development Corporate.</a></p>
<p><a href="https://www.quirks.com/articles/exploring-the-challenges-and-potential-of-synthetic-data-and-survey-participants">Millman, S. (2025, Feb 25). <em>Exploring the challenges and potential of synthetic data and survey participants</em>. Quirk&#8217;s Media.</a></p>
<p><a href="https://www.thevoiceofuser.com/aaa-billion-dollar-ai-startup-is-selling-you-a-survey-the-wall-street-journal-wrote-a-love-letter-about-it/">Papas, C. (2026, Mar 15). <em>Question: Is Aaru actually proving that synthetic research can predict human behavior and replace real user research?</em> The Voice of User.</a></p>
<p><a href="https://link.springer.com/article/10.3758/s13428-023-02307-x">Park, P. S., Schoenegger, P., &amp; Zhu, C. (2024). Diminished diversity-of-thought in a standard large language model. <em>Behavior Research Methods</em>, <em>56</em>(6), 5754–5770.</a></p>
<p><a href="https://ideas.repec.org/a/spr/jcsosc/v9y2026i1d10.1007_s42001-025-00452-1.html">Rafikova, A., &amp; Voronin, A. (2026). ChatGPT as a research proxy: simulating human attitudes in social science research. <em>Journal of Computational Social Science</em>, <em>9</em>(17). [Abstract only]</a></p>
<p><a href="https://journals.sagepub.com/doi/10.1177/23794607241311793">Shrestha, P., Krpan, D., Koaik, F., Schnider, R., Sayess, D., &amp; Binbaz, M. S. (2024). Beyond WEIRD: Can synthetic survey participants substitute for humans in global policy research? <em>Behavioral Science &amp; Policy</em>, <em>10</em>(2), 26–45.</a></p>
<p><a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00685/124261/Do-LLMs-Exhibit-Human-like-Response-Biases-A-Case">Tjuatja, L., Chen, V., Wu, T., Talwalkar, A., &amp; Neubig, G. (2024). Do LLMs exhibit human-like response biases? A case study in survey design. <em>Transactions of the Association for Computational Linguistics</em>, <em>12</em>, 1011–1026.</a></p>
<p><a href="https://www.sciencedirect.com/science/article/pii/S2590198226000825">Wallius, E., &amp; Lehtonen, E. (2026). Beyond human proxies: The roles and usefulness of large language models in user research for mobility service development. <em>Transportation Research Interdisciplinary Perspectives</em>, <em>36</em>, #101917.</a></p>
<p><a href="https://arxiv.org/pdf/2402.01908">Wang, A., Morgenstern, J., &amp; Dickerson, J. P. (2025). Large language models that replace human participants can harmfully misportray and flatten identity groups. <em>Nature Machine Intelligence</em>, <em>7</em>, 400–411.</a></p>
<p><a href="https://www.sciencedirect.com/science/article/pii/S2949882125001173">Yu, T., Pan, S., Fan, C., Luo, S., Jin, Y., &amp; Zhao, B. (2025). Can large language models exhibit cognitive and affective empathy as humans? <em>Computers in Human Behavior: Artificial Humans</em>, <em>6</em>, #100233.</a></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Credible vs. Confidence Intervals: Different Meanings but Similar Decisions</title>
		<link>https://measuringu.com/credible-vs-confidence-intervals/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=credible-vs-confidence-intervals</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD&nbsp;•&nbsp;Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 08 Apr 2026 06:49:35 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[Bayesian theorem]]></category>
		<category><![CDATA[confidence interval]]></category>
		<category><![CDATA[Confidence Intervals]]></category>
		<category><![CDATA[credible interval]]></category>
		<category><![CDATA[credible intervals]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47234</guid>

					<description><![CDATA[We’ve written a lot about confidence intervals for the last two decades. We especially encourage them for small sample studies. Some of you even bought into our recommendation and use them yourselves (a decision we continue to support). But maybe you’ve heard about Bayesian credible intervals and wonder if you should be using them instead. [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47288 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-300x169.png" alt="Feature image shows two researchers, each examining a measuring tool, with a specific interval highlighted." width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>We’ve <a href="https://measuringu.com/article/estimating-completion-rates-from-small-samples-using-binomial-confidence-intervals-comparisons-and-recommendations/">written a lot about confidence intervals</a> for the last two decades.</p>
<p>We especially encourage them for small sample studies.</p>
<p>Some of you even bought into our recommendation and use them yourselves (a decision we continue to support).</p>
<p>But maybe you’ve heard about <a href="https://en.wikipedia.org/wiki/Credible_interval">Bayesian credible intervals</a> and wonder if you should be using them instead.</p>
<p>In this article, we return to an <a href="https://measuringu.com/bayes-law-in-ux-research-the-power-and-perils-of-priors">example used in our previous articles</a> on Bayesian methods applied to UX research and compare analyses of that example with confidence and credible intervals.</p>
<h2><span lang="EN-US">Confidence Interval Analysis</span></h2>
<p>In our recurring example, 18 of 20 participants successfully completed a checkout task (a 90% completion rate). But if we were to test hundreds, thousands, or (somehow) all potential users, would the completion rate be exactly 90%? Almost surely not. But instead of trying to nail down an exact single number, a likely range is usually sufficient for decision making and surprisingly easy to compute and accurate even for small sample sizes.</p>
<p>For this type of data (binary), the likely range can be computed using an adjusted-Wald confidence interval with 95% confidence. That interval is 68.7% to 98.4%.</p>
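<p>For readers who want to see the arithmetic, here is a minimal Python sketch of the adjusted-Wald calculation. It assumes the common formulation that adds z²/2 successes and z² trials before applying the standard Wald formula, with z = 1.96 for 95% confidence. The function name <code>adjusted_wald_interval</code> is our own and not taken from any particular library.</p>

```python
import math

def adjusted_wald_interval(successes, trials, z=1.96):
    """Adjusted-Wald interval for a binomial proportion (a sketch)."""
    # Add z^2/2 successes and z^2 trials before applying the Wald formula.
    adj_trials = trials + z ** 2
    adj_rate = (successes + z ** 2 / 2) / adj_trials
    margin = z * math.sqrt(adj_rate * (1 - adj_rate) / adj_trials)
    # Clamp to the valid 0-to-1 range for a proportion.
    return max(0.0, adj_rate - margin), min(1.0, adj_rate + margin)

lower, upper = adjusted_wald_interval(18, 20)
print(f"{lower:.1%} to {upper:.1%}")  # prints 68.7% to 98.4%
```

<p>The adjustment pulls the estimate slightly toward 50%, which is why the interval is centered a little below the observed 90%.</p>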
<p>We’ve made it easy to compute binomial confidence intervals with <a href="https://measuringu.com/calculators/wald/">our online calculator</a>. But how do you interpret or explain what it means? How about:</p>
<ul>
<li>There’s a 95% probability the true completion rate is between 68.7% and 98.4%.</li>
<li>There’s a 95% chance the true completion rate falls within 68.7% and 98.4%.</li>
<li>95% of future tests will produce completion rates between 68.7% and 98.4%.</li>
</ul>
<p>Strictly speaking, all three of those statements are wrong. A stats professor or Bayesian enthusiast will be happy to point out that error.</p>
<p>The more technically correct way to describe the interval is:</p>
<ul>
<li>If we ran many tests, each with 20 users from the same population and computed confidence intervals each time, on average, 95 out of 100 confidence intervals will contain the unknown population completion rate.</li>
</ul>
<p>Strictly speaking, we are 95% confident <em>in the method </em>of generating confidence intervals and not in any given interval. The confidence interval we generated from the sample data either does or does not contain the population completion rate.</p>
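<p>The long-run interpretation can be illustrated with a small simulation. This Python sketch assumes a hypothetical population completion rate of 90% (in practice, this value is unknown), draws many samples of 20, and counts how often the adjusted-Wald interval contains the assumed rate. The function names and the seed are our own choices for illustration.</p>

```python
import math
import random

def adjusted_wald_interval(successes, trials, z=1.96):
    """Adjusted-Wald interval: add z^2/2 successes and z^2 trials."""
    adj_trials = trials + z ** 2
    adj_rate = (successes + z ** 2 / 2) / adj_trials
    margin = z * math.sqrt(adj_rate * (1 - adj_rate) / adj_trials)
    return adj_rate - margin, adj_rate + margin

def simulate_coverage(true_rate, trials, studies, seed=1):
    """Share of simulated studies whose interval contains true_rate."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _study in range(studies):
        # Each simulated participant succeeds with probability true_rate.
        successes = sum(
            1 for _participant in range(trials) if true_rate > rng.random()
        )
        lower, upper = adjusted_wald_interval(successes, trials)
        if upper >= true_rate >= lower:
            hits += 1
    return hits / studies

# Hypothetical population rate of 0.90, chosen only for this illustration.
coverage = simulate_coverage(true_rate=0.90, trials=20, studies=10_000)
print(f"{coverage:.1%} of intervals contained the assumed population rate")
```

<p>Under these assumptions the share should land close to 95%. Any single interval either contains the population rate or it doesn’t. The 95% describes how the procedure performs across many repetitions.</p>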
<p>We don’t know if our sample of 20 is one of the roughly five in 100 whose confidence interval doesn&#8217;t contain the population completion rate. So, it’s best to avoid using “probability” or “chance” when describing a confidence interval and remember that we’re 95% confident in the process of generating confidence intervals rather than a given interval.</p>
<p>So, we have just one study, and we computed only one interval. What does that mean? What are we “allowed” to say other than that cumbersome statement? We have a couple of recommendations suitable for practical decision making:</p>
<ul>
<li><strong>Likely range</strong>: “68.7% to 98.4% is the most likely range for the unknown completion rate from all users.”</li>
<li><strong>Plausible range</strong> (from <a href="https://www.amazon.com/Confidence-Intervals-Quantitative-Applications-Sciences/dp/076192499X">Smithson, 2002</a>): “Given this data, values inside the confidence interval are plausible while those outside are implausible. The observed completion rate of 90% is plausible but rates lower than 68.7% or higher than 98.4% are implausible.”</li>
</ul>
<p>This is where the precision of numbers meets the imprecision of language. Although confidence, probability, likely, and plausible all sound about the same, they have more precise usage when it comes to statistics and probability.</p>
<p>This rigidity in language makes confidence intervals less than ideal when communicating results to stakeholders, who are unlikely to have a sophisticated understanding of them (although <a href="https://link.springer.com/article/10.3758/s13423-013-0572-3">even professors sometimes struggle with the concept</a>).</p>
<h2>Credible Interval Analysis</h2>
<p>One proposed alternative is the Bayesian credible interval.</p>
<p>Credible intervals are designed to allow for the interpretation people naturally want to use. A 95% credible interval can be interpreted as having a 95% probability of containing the true value.</p>
<p>As with confidence intervals, there are different computations used to generate credible intervals on binary data. And, as with confidence intervals, there are debates about which method is optimal. We won’t get into that debate here. Instead, we’ve provided in Table 1 three Bayesian credible intervals for our example that differ in <a href="https://measuringu.com/bayes-law-in-ux-research-the-power-and-perils-of-priors">their priors</a> (all of which are <a href="https://nvlpubs.nist.gov/nistpubs/TechnicalNotes/NIST.TN.2119.pdf">commonly used in practice</a>).</p>

<table id="tablepress-1026" class="tablepress tablepress-id-1026">
<thead>
<tr class="row-1">
	<th class="column-1"><strong>Method</strong></th><th class="column-2"><strong>Prior/Setup</strong></th><th class="column-3"><strong>95% Interval</strong></th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">Adjusted-Wald</td><td class="column-2">Add ~2 successes &amp; ~2 failures</td><td class="column-3"><strong>68.7% to 98.4%</strong></td>
</tr>
<tr class="row-3">
	<td class="column-1">Bayesian credible interval</td><td class="column-2">Beta(1,1)—Uniform prior</td><td class="column-3"><strong>69.6% to 97.0%</strong></td>
</tr>
<tr class="row-4">
	<td class="column-1">Bayesian credible interval</td><td class="column-2">Beta(0.5, 0.5)—Jeffreys prior</td><td class="column-3"><strong>71.6% to 97.9%</strong></td>
</tr>
<tr class="row-5">
	<td class="column-1">Bayesian credible interval</td><td class="column-2">Beta(2, 2)—Agresti prior</td><td class="column-3"><strong>66.4% to 95.0%</strong></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1026 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Four 95% interval estimates, one confidence and three credible.</p>
<p>For example, a 95% Bayesian credible interval using a uniform prior for 18 successes and 2 failures generates a credible interval of 69.6% to 97.0%.</p>
<p>We can say there’s a 95% probability that the true and unknown completion rate is between 69.6% and 97.0%.</p>
<p>Stats professors are happy with that statement. Bayesian purists are happy with that statement. And your stakeholders probably understand that statement too!</p>
<p>So, should we all start using credible intervals and abandon confidence intervals? Not necessarily.</p>
<p>Credible intervals require more complex calculations and usually don’t have the simple closed-form solution of the adjusted-Wald interval. In practice, however, this difference is negligible because modern software handles the computation (e.g., we used the binom.bayes function in the R package binom).</p>
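<p>To show what that software is doing, here is a rough Python sketch that uses only the standard library. It is not the R function mentioned above. It assumes equal-tailed intervals (the 2.5th and 97.5th percentiles of the Beta posterior), which is our assumption about how the Table 1 values were produced, and it approximates the Beta distribution by numerical integration and bisection. All function names are hypothetical.</p>

```python
import math


def beta_cdf(x, a, b, steps=2000):
    """Beta(a, b) CDF at x via Simpson's rule; assumes a > 1 and b > 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0  # density is zero at both ends when a > 1 and b > 1
        return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3


def beta_quantile(q, a, b):
    """Invert the CDF by bisection on the interval (0, 1)."""
    low, high = 0.0, 1.0
    for _ in range(40):
        mid = (low + high) / 2
        if beta_cdf(mid, a, b) < q:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def credible_interval(successes, failures, prior_a, prior_b, level=0.95):
    """Equal-tailed credible interval from the Beta posterior."""
    # Beta prior + binomial data gives a Beta posterior with updated shapes.
    a, b = prior_a + successes, prior_b + failures
    tail = (1 - level) / 2
    return beta_quantile(tail, a, b), beta_quantile(1 - tail, a, b)


priors = {"Uniform": (1, 1), "Jeffreys": (0.5, 0.5), "Agresti": (2, 2)}
intervals = {name: credible_interval(18, 2, *ab) for name, ab in priors.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {low:.1%} to {high:.1%}")
```

<p>Under the equal-tailed assumption, the results should land close to the corresponding rows of Table 1. A highest-density interval, another common choice, would give somewhat different endpoints for a skewed posterior like this one.</p>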
<p>But did you notice anything about the values in Table 1? The intervals are all very similar, as shown in the graph in Figure 1.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47293" src="https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-300x175.png" alt="Graph of the four intervals " width="1200" height="698" srcset="https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-300x175.png 300w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-1024x596.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-768x447.png 768w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-1536x894.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-2048x1192.png 2048w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-600x349.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Graph of the four intervals (Green: adjusted-Wald, Blue: Bayesian Uniform, Orange: Bayesian Jeffreys, Black: Bayesian Agresti); dashed green line shows limits of adjusted-Wald interval across the three Bayesian intervals.</p>
<p>The differences between the intervals are small. The width of the adjusted-Wald interval is 29.7 percentage points. The Uniform and Jeffreys intervals lie within the adjusted-Wald interval (with respective widths of 27.4 and 26.3 points), while the Agresti interval has about the same width as the adjusted-Wald (28.6 points), with its upper and lower endpoints shifted down relative to the adjusted-Wald interval by 3.4 and 2.3 points, respectively.</p>
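<p>The width and shift comparisons are simple arithmetic on the Table 1 endpoints. A short Python sketch (variable names are ours) makes the comparison explicit:</p>

```python
# 95% interval endpoints (in percent) copied from Table 1.
intervals = {
    "Adjusted-Wald": (68.7, 98.4),
    "Uniform": (69.6, 97.0),
    "Jeffreys": (71.6, 97.9),
    "Agresti": (66.4, 95.0),
}

wald_low, wald_high = intervals["Adjusted-Wald"]
summary = {}
for name, (low, high) in intervals.items():
    summary[name] = {
        "width": round(high - low, 1),
        # Negative shifts mean the endpoint sits below the adjusted-Wald endpoint.
        "lower_shift": round(low - wald_low, 1),
        "upper_shift": round(high - wald_high, 1),
        "inside_wald": wald_low <= low and high <= wald_high,
    }
    print(name, summary[name])
```

<p>The Uniform and Jeffreys intervals are flagged as lying inside the adjusted-Wald interval, while the Agresti interval is not because both of its endpoints sit lower.</p>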
<p>If the output is roughly the same, does it really matter? The numbers don’t know where they come from.</p>
<p>This is similar to the debate about ordinal versus interval data. As Lord (1953) noted, even <a href="https://is.muni.cz/el/fss/jaro2010/PSY454/um/Frederick_Lord_On_the_statistical_treatment_of_football_numbers.pdf">nominal values like football jersey numbers can be averaged</a>. The math works, but proper interpretation is critical.</p>
<p>Confidence intervals and credible intervals can yield nearly identical results, especially for this type of data. In many cases, <strong>they will lead to the same practical decision</strong>, even though the interpretation differs.</p>
<p>So, what should you do?</p>
<p>The results here suggest that, at least for this type of data, traditional confidence intervals and Bayesian credible intervals can produce very similar ranges. The main difference is not in the numbers, but in how we interpret and communicate them.</p>
<p>That’s one reason we continue to recommend confidence intervals. They are well understood, widely taught, and, when used appropriately, provide accurate estimates of the range of plausible values.</p>
<p>At the same time, we understand the appeal of credible intervals. The interpretation is more natural and often aligns better with how stakeholders think about uncertainty.</p>
<p>In practice, either approach can be effective. What matters most is understanding what the interval represents and communicating it clearly. Decisions are made by inspecting the endpoints of the intervals. If you’d make the same decision for both endpoints, then you have enough information to make the decision. Otherwise, you need more data. In this example, it seems unlikely that the slight variation in endpoint values would affect real-world decision making.</p>
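<p>The endpoint rule can be written down in a few lines. The Python sketch below is an illustration only: the benchmark values are hypothetical (the article does not set a target completion rate), and the function name <code>decision</code> and its labels are ours.</p>

```python
def decision(interval, benchmark):
    """Apply the endpoint rule: act only if both endpoints lead to the same call."""
    low, high = interval
    if low >= benchmark:
        return "meets benchmark"
    if high < benchmark:
        return "misses benchmark"
    return "collect more data"  # the endpoints disagree about the benchmark


# 95% interval endpoints from Table 1, as proportions.
intervals = {
    "Adjusted-Wald": (0.687, 0.984),
    "Uniform": (0.696, 0.970),
    "Jeffreys": (0.716, 0.979),
    "Agresti": (0.664, 0.950),
}

# HYPOTHETICAL benchmarks, chosen only to illustrate the rule.
results = {}
for benchmark in (0.65, 0.80):
    results[benchmark] = {
        name: decision(iv, benchmark) for name, iv in intervals.items()
    }
    print(benchmark, results[benchmark])
```

<p>For both hypothetical benchmarks, all four intervals lead to the same call. They would disagree only if the benchmark fell in the narrow gap between their endpoints, for example between the lowest and highest lower limits (66.4% and 71.6%).</p>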
<p>Notably, in this example, the confidence interval encompassed two of the Bayesian intervals (Uniform and Jeffreys), so not only did it have 95% confidence from a frequentist perspective, but it also had at least 95% credibility from a Bayesian perspective under those two priors.</p>
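<p>One way to check the "at least 95% credibility" point is to ask how much posterior probability falls between the adjusted-Wald endpoints. The sketch below does this for the uniform prior only, because its posterior, Beta(19, 3), has whole-number parameters and its CDF can be computed exactly from a binomial tail sum. The function name is ours, and the endpoints are the rounded values from Table 1.</p>

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) for whole-number a and b, via the binomial tail identity."""
    n = a + b - 1
    # P(Beta(a, b) <= x) equals P(Binomial(n, x) >= a) when a and b are integers.
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k) for k in range(a, n + 1))


# Uniform prior Beta(1, 1) plus 18 successes and 2 failures gives Beta(19, 3).
a, b = 1 + 18, 1 + 2
wald_low, wald_high = 0.687, 0.984  # adjusted-Wald endpoints from Table 1
credibility = beta_cdf_int(wald_high, a, b) - beta_cdf_int(wald_low, a, b)
print(f"Posterior probability inside the adjusted-Wald interval: {credibility:.1%}")
```

<p>Because the adjusted-Wald interval contains the 95% uniform-prior credible interval, the posterior probability it captures comes out above 95%.</p>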
<p>We’ll continue to explore where these approaches differ more meaningfully in future articles, including whether these similarities extend beyond this example to different proportions and to other statistics such as means.</p>
<h2>Key Takeaways</h2>
<p>In this latest article on Bayesian methods, we covered:</p>
<ul>
<li>Confidence intervals are harder to explain than most people think.</li>
<li>Credible intervals match how people want to interpret uncertainty.</li>
<li>In this example, both methods produce very similar ranges.</li>
<li>The difference is less about the numbers and more about what we can say about them.</li>
<li>Use either approach thoughtfully, but focus on clear communication.</li>
</ul>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>

