<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MeasuringU</title>
	<atom:link href="https://measuringu.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://measuringu.com</link>
	<description>UX Research and Software</description>
	<lastBuildDate>Tue, 23 Jun 2026 21:54:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://measuringu.com/wp-content/uploads/2020/11/site-icon.png</url>
	<title>MeasuringU</title>
	<link>https://measuringu.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>What Are the Different Types of Synthetic Users?</title>
		<link>https://measuringu.com/what-are-the-different-types-of-synthetic-users/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-are-the-different-types-of-synthetic-users</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 23 Jun 2026 21:54:24 +0000</pubDate>
				<category><![CDATA[Survey]]></category>
		<category><![CDATA[User Experience]]></category>
		<category><![CDATA[User Research]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[Synthetic user]]></category>
		<category><![CDATA[Synthetic users]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47761</guid>

					<description><![CDATA[Recruiting participants for research is expensive. It’s also rife with problems: Are these people really who they say they are? Are they actually paying attention? Or is the data from some survey farm where people click through and make money? AI is disrupting UX research. But the disruption is leading to more software, not less. [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/06/062323-FeatureImage-1-scaled.jpg"><img fetchpriority="high" decoding="async" class="alignleft wp-image-47827 size-medium" src="https://measuringu.com/wp-content/uploads/2026/06/062323-FeatureImage-1-300x169.jpg" alt="Feature image showing 5 different AI bots representing 5 synthetic user types" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/06/062323-FeatureImage-1-300x169.jpg 300w, https://measuringu.com/wp-content/uploads/2026/06/062323-FeatureImage-1-1024x576.jpg 1024w, https://measuringu.com/wp-content/uploads/2026/06/062323-FeatureImage-1-768x432.jpg 768w, https://measuringu.com/wp-content/uploads/2026/06/062323-FeatureImage-1-1536x864.jpg 1536w, https://measuringu.com/wp-content/uploads/2026/06/062323-FeatureImage-1-2048x1152.jpg 2048w, https://measuringu.com/wp-content/uploads/2026/06/062323-FeatureImage-1-600x338.jpg 600w" sizes="(max-width: 300px) 100vw, 300px" /></a>Recruiting participants for research is expensive. It’s also rife with problems: Are these people really who they say they are? Are they actually paying attention? Or is the data from some <a href="https://www.intellisurvey.com/blog/fighting-survey-farms">survey farm</a> where people click through and make money?</p>
<p>AI is disrupting UX research. But the disruption is leading to more software, not less. The need for insights into how people will use that software isn’t going away.</p>
<p>But can AI help? Can we use AI to synthesize people’s attitudes, beliefs, and behaviors? Instead of trying to find the right people to take surveys, could these <strong>synthetic users</strong> generate insights faster and at almost no cost? News about synthetic users would certainly make headlines. <a href="https://measuringu.com/review-of-experiments-with-synthetic-users/">And they do</a>.</p>
<p>But what exactly <em>is</em> a synthetic user? Is that the same as a digital twin? Or a synthetic persona?</p>
<p>To properly assess the effectiveness of AI tools, we think it’s important to have a good understanding of the terms and how they fit together.</p>
<p>In this article, we propose a preliminary taxonomy of five distinct types of synthetic users, organized by how grounded they are in real human data. Before we get to the taxonomy, though, it helps to ask a question that sounds simpler than it is.</p>
<h2>What Birds Can Teach Us about Synthetic Users</h2>
<p>How do we know that a bird is a bird?</p>
<p>Is it because a bird can fly? Well, bats are mammals that can fly, while penguins are birds that can’t fly.</p>
<p>Is it because they lay eggs? Platypuses are mammals that lay eggs (as do most reptiles, amphibians, fish, and arthropods).</p>
<p>Maybe it’s because birds have feathers rather than scales or fur? That might be true in the present, but in the past, many dinosaurs not in the lineage leading to birds are <a href="https://www.smithsonianmag.com/science-nature/dinosaurs-evolved-feathers-for-far-more-than-flight-180985012/">now known to have had feathers</a>.</p>
<p>The answer, as <a href="https://www.linnean.org/learning/who-was-linnaeus/career-and-legacy">Linnaeus worked out in the 1700s</a>, is that no single feature is sufficient. A bird is defined by a <em>constellation</em> of characteristics (feathers, beak, two wings, two feet, warm blood, hard-shelled eggs) organized within a hierarchy of kingdom, class, genus, and species. Figure 1 shows how that plays out from the animal kingdom down to a single species. And this is our guide for classifying and understanding synthetic users.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/06/Figure-1-Example-of-classification-scaled.png" rel="attachment wp-att-47768"><img decoding="async" class="alignnone wp-image-47831 size-full" src="https://measuringu.com/wp-content/uploads/2026/06/Figure-1-Example-of-classification-scaled.png" alt="Figure 1: Example of classification from the animal kingdom to species of stork." width="2560" height="1349" srcset="https://measuringu.com/wp-content/uploads/2026/06/Figure-1-Example-of-classification-scaled.png 2560w, https://measuringu.com/wp-content/uploads/2026/06/Figure-1-Example-of-classification-300x158.png 300w, https://measuringu.com/wp-content/uploads/2026/06/Figure-1-Example-of-classification-1024x540.png 1024w, https://measuringu.com/wp-content/uploads/2026/06/Figure-1-Example-of-classification-768x405.png 768w, https://measuringu.com/wp-content/uploads/2026/06/Figure-1-Example-of-classification-1536x809.png 1536w, https://measuringu.com/wp-content/uploads/2026/06/Figure-1-Example-of-classification-2048x1079.png 2048w, https://measuringu.com/wp-content/uploads/2026/06/Figure-1-Example-of-classification-600x316.png 600w" sizes="(max-width: 2560px) 100vw, 2560px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Example of classification from the animal kingdom to species of stork.</p>
<h2>Synthetic Users: More of a Genus than a Species</h2>
<p>The topic of classifying different types of synthetic users is in a state of flux (lots of labels, overlapping meanings, vendor-specific definitions). Despite this, in Figure 2, we attempt a preliminary classification scheme similar to Figure 1 for five types (species) under the genus of Synthetic User.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/06/Figure-2-Example-of-classification.png" rel="attachment wp-att-47769"><img decoding="async" class="alignnone wp-image-47832 size-full" src="https://measuringu.com/wp-content/uploads/2026/06/Figure-2-Example-of-classification.png" alt="Classification of different types (species) of synthetic users." width="1185" height="868" srcset="https://measuringu.com/wp-content/uploads/2026/06/Figure-2-Example-of-classification.png 1185w, https://measuringu.com/wp-content/uploads/2026/06/Figure-2-Example-of-classification-300x220.png 300w, https://measuringu.com/wp-content/uploads/2026/06/Figure-2-Example-of-classification-1024x750.png 1024w, https://measuringu.com/wp-content/uploads/2026/06/Figure-2-Example-of-classification-768x563.png 768w, https://measuringu.com/wp-content/uploads/2026/06/Figure-2-Example-of-classification-600x439.png 600w" sizes="(max-width: 1185px) 100vw, 1185px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Classification of different types (species) of synthetic users.</p>
<p>Table 1 lists the identifying characteristics for each of these types of synthetic users, primarily focusing on the type of data used to create the synthetic user and how grounded the synthetic user is in actual user data.</p>

<table id="tablepress-1052" class="tablepress tablepress-id-1052">
<thead>
<tr class="row-1">
	<th class="column-1">Synthetic User Type</th><th class="column-2">Identifying Characteristics/Descriptions</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">AI Proto Persona</td><td class="column-2">This is the weakest (least grounded) type of synthetic user, generated with simple role-playing prompts (e.g., “You are a world-class Python programmer”). This method produces preliminary user profiles based on broad assumptions rather than research.</td>
</tr>
<tr class="row-3">
	<td class="column-1">Demographic Based</td><td class="column-2">Prompts specify age, gender, occupation, region, etc. to approximate group-level tendencies. This method is somewhat more grounded than a proto persona but is still limited in the quality of its output, especially when demographics have only weak relationships with research topics (e.g., much UX research).</td>
</tr>
<tr class="row-4">
	<td class="column-1">Persona Based</td><td class="column-2">Prompts focus on richer persona paragraphs (e.g., “Bill G. is a 27-year-old male graphic designer who always has his sketchbook at hand, has a track record of being creative and innovative, and is up-to-date on current design trends. How would he complete the following questionnaire?”). Because these synthetic users are still weakly grounded, they are limited to approximate group-level tendencies.</td>
</tr>
<tr class="row-5">
	<td class="column-1">Research Grounded</td><td class="column-2">Prompts refer to actual research artifacts with traceable sources but do not attempt to model individual human responses. These are based on actual interviews, survey results, analytics, customer-support logs, or other user data that are typically not accessible to publicly available LLMs.</td>
</tr>
<tr class="row-6">
	<td class="column-1">Digital Twins</td><td class="column-2">Prompts refer to rich individual-level data for the purpose of modeling each individual in a dataset. This approach has the strongest grounding in actual user data but its accuracy in real-world deployments is still an open research question.</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1052 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1</strong>: Brief descriptions of types of synthetic users.</p>
<h2>Discussion</h2>
<p>In this preliminary taxonomy, we’ve defined five types of synthetic users: AI proto persona, demographic based, persona based, research grounded, and digital twins, based on differences in the types of data (e.g., demographic, persona) and the strength of the relationship between the synthetic user and human user data.</p>
<p>Preliminary taxonomies change over time. In this article, we’ve used the levels originally defined by Linnaeus because they were adequate for our purpose. Modern biological taxonomies have eight levels (Domain, Kingdom, Phylum, Class, Order, Family, Genus, Species), and the number of kingdoms has increased to six (Bacteria, Archaea, Protista, Fungi, Plants, Animals).</p>
<p>We fully expect changes to our classification scheme over time, but it’s a start.</p>
<p>For example, we have not included generative agents in this taxonomy because they are qualitatively different from synthetic users that simulate responses and are more like simulated actors, trying to model what people might do over time. This may eventually become its own branch from the genus of synthetic users, separate from the synthetic respondents. Time will tell.</p>
<p>Just as there are hybrids in the animal kingdom (e.g., mules, ligers), in practice, there may be hybrids of different types of synthetic users. For example, in Bisbee et al.’s “Synthetic Replacements for Human Survey Data? The Perils of Large Language Models” (<a href="https://www.cambridge.org/core/journals/political-analysis/article/synthetic-replacements-for-human-survey-data-the-perils-of-large-language-models/B92267DC26195C7F36E63EA04A47D2FE">2024</a>), the researchers used the following prompt to elicit 30 synthetic responses for each of the 7,350 human respondents in the 2016–2020 ANES survey to get a final dataset with 3,614,400 responses:</p>
<p style="padding-left: 25px;"><em>It is [YEAR]. You are a [AGE] year-old, [MARST], [RACETH] [GENDER] with [EDUCATION] making [INCOME] per year, living in the United States. You are [IDEO], [REGIS] [PID] who [INTEREST] pays attention to what’s going on in government and politics. Provide responses from this person’s perspective. Use only knowledge about politics that they would have.</em></p>
<p>Each bracketed item is a variable with values corresponding to a real respondent in the two waves of the ANES. For example, [YEAR] was 2016 or 2020, [AGE] matched the selected respondent’s age, [MARST] was marital status (e.g., married, divorced, single), [IDEO] was political ideology from extremely liberal to extremely conservative, [REGIS] was voter registration status, and [PID] was party membership (Democrat, Independent, Republican).</p>
<p>Thus, this is a hybrid between demographic- and persona-based types with a light sprinkle of digital twinning. It’s less than a fully research-grounded respondent or digital twin because the prompt doesn’t include access to the respondent’s prior answers, interview transcript, open-ended comments, voting history, occupation, or religion. It uses selected ANES variables as conditioning attributes and asks the LLM to answer from that perspective (multiple times for each human respondent).</p>
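<p>To make the mechanics of this kind of prompt concrete, here is a minimal Python sketch of how a bracketed template like the one above could be filled from one respondent's record. The function name <code>fill_prompt</code>, the dictionary layout, and the example values are our own assumptions for illustration. They are not taken from the Bisbee et al. paper or from ANES data, and the step of sending the prompt to an LLM is left out.</p>

```python
import re

# Prompt template quoted in the article (from Bisbee et al., 2024).
PROMPT_TEMPLATE = (
    "It is [YEAR]. You are a [AGE] year-old, [MARST], [RACETH] [GENDER] "
    "with [EDUCATION] making [INCOME] per year, living in the United States. "
    "You are [IDEO], [REGIS] [PID] who [INTEREST] pays attention to what's "
    "going on in government and politics. Provide responses from this "
    "person's perspective. Use only knowledge about politics that they "
    "would have."
)


def fill_prompt(template, respondent):
    """Replace each [NAME] placeholder with the respondent's value."""
    def lookup(match):
        key = match.group(1)
        if key not in respondent:
            raise KeyError(f"missing value for [{key}]")
        return str(respondent[key])
    return re.sub(r"\[([A-Z]+)\]", lookup, template)


# Hypothetical respondent: values invented for illustration, not ANES data.
example_respondent = {
    "YEAR": 2016, "AGE": 45, "MARST": "married", "RACETH": "white",
    "GENDER": "woman", "EDUCATION": "a college degree", "INCOME": "$60,000",
    "IDEO": "a moderate", "REGIS": "registered", "PID": "Independent",
    "INTEREST": "sometimes",
}

prompt = fill_prompt(PROMPT_TEMPLATE, example_respondent)

# The same conditioning prompt is reused to elicit repeated synthetic
# responses for one human respondent (30 per respondent in the study).
prompts = [prompt] * 30
```

<p>Note what the sketch does not contain: no prior answers, transcripts, or open-ended comments from the respondent. Only a handful of attributes condition the model, which is why we place this approach between the demographic- and persona-based types.</p>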
<h2>Summary</h2>
<p>Our key conclusions from this exercise are:</p>
<p><strong>“Synthetic users” is more of an umbrella term (like a genus) than a type (like a species). </strong></p>
<p>All five types we’ve described can fall under the umbrella of synthetic users. In practice, that means when we talk about synthetic users, it’s like talking about storks. There are a variety of storks, so knowing which bird we’re talking about helps move the conversation forward.</p>
<p><strong>Key criteria for discriminating among types of synthetic users include data type and grounding. </strong></p>
<p>The types we’ve defined differ in the kind of data used to model responses (e.g., demographic, persona) and the extent to which they are grounded in real user data.</p>
<p><strong>Taxonomies change over time. </strong></p>
<p>We consider this article a necessary exercise in a preliminary taxonomy of synthetic users, but fully expect it to evolve over time, maybe quickly due to rapid changes in these technologies.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Should a PhD Count as Years of Experience?</title>
		<link>https://measuringu.com/should-a-phd-count-as-years-of-experience/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=should-a-phd-count-as-years-of-experience</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD and Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 17 Jun 2026 02:29:37 +0000</pubDate>
				<category><![CDATA[UX]]></category>
		<category><![CDATA[PhD]]></category>
		<category><![CDATA[UX Research]]></category>
		<category><![CDATA[work]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47740</guid>

					<description><![CDATA[You put in the grueling hours. You know what it takes to learn, to persevere, to deliver. There’s nothing quite like the experience of a PhD program. But perhaps a career in academia isn’t what you’re looking for. You decide to leave academia and apply your skills in the UX industry. You look for jobs [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/06/061626-FeatureImage1-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47811 size-medium" src="https://measuringu.com/wp-content/uploads/2026/06/061626-FeatureImage1-300x169.png" alt="Feature image showing a graduate and an infographic" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/06/061626-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/06/061626-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/06/061626-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/06/061626-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/06/061626-FeatureImage1-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/06/061626-FeatureImage1-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>You put in the grueling hours. You know what it takes to learn, to persevere, to deliver.</p>
<p>There’s nothing quite like the experience of a PhD program. But perhaps a career in academia isn’t what you’re looking for.</p>
<p>You decide to leave academia and apply your skills in the UX industry. You look for jobs and see that most require a minimum of three to five years of experience.</p>
<p>The PhD program took five years. Don’t those years in the program count as years of job experience? Should they count?</p>
<p>Probably not.</p>
<h2>The Value of a PhD</h2>
<p>First, we value what a PhD offers. We know. We both went through the experience—while working and raising kids! There’s nothing quite like it. The classwork, teaching, papers, dissertation, defense, and the pile of academic rules.</p>
<p>Averaged over the past decade, about 10% of respondents to the UXPA salary survey <a href="https://uxpa.org/salary-surveys/">had a PhD</a>. That’s seven times higher than the general U.S. population, so it’s a bit of an exclusive club. Although intellectually rewarding, it’s not clear that a PhD pays off financially if entering industry after being a full-time student.</p>
<p>While PhDs do get paid more than their peers with master’s and bachelor’s degrees, the delay before getting into the workforce <a href="https://measuringu.com/does-an-advanced-degree-pay-off/">offsets much of the gain</a>. It takes about 24 years of employment to break even (Figure 1). The value of the PhD to us is not in the clear financial payoff.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/06/061626-Figure1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47809" src="https://measuringu.com/wp-content/uploads/2026/06/061626-Figure1-1024x815.png" alt="Figure 1: “Lifetime” earnings of PhD and master’s degree recipients vs. not having an advanced degree. " width="800" height="637" srcset="https://measuringu.com/wp-content/uploads/2026/06/061626-Figure1-1024x815.png 1024w, https://measuringu.com/wp-content/uploads/2026/06/061626-Figure1-300x239.png 300w, https://measuringu.com/wp-content/uploads/2026/06/061626-Figure1-768x611.png 768w, https://measuringu.com/wp-content/uploads/2026/06/061626-Figure1-600x478.png 600w, https://measuringu.com/wp-content/uploads/2026/06/061626-Figure1.png 1226w" sizes="auto, (max-width: 800px) 100vw, 800px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> “Lifetime” earnings of PhD and master’s degree recipients vs. not having an advanced degree.</p>
<h2>PhD Skills Are Transferable</h2>
<p>At MeasuringU, we hire PhDs and probably have one of the highest percentages of PhDs among UX research companies and related fields (Figure 2).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/06/061626-Figure2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47810" src="https://measuringu.com/wp-content/uploads/2026/06/061626-Figure2-300x234.png" alt="Figure 2: Doctorates at MeasuringU – we like to hire them!" width="800" height="625" srcset="https://measuringu.com/wp-content/uploads/2026/06/061626-Figure2-300x234.png 300w, https://measuringu.com/wp-content/uploads/2026/06/061626-Figure2-1024x800.png 1024w, https://measuringu.com/wp-content/uploads/2026/06/061626-Figure2-768x600.png 768w, https://measuringu.com/wp-content/uploads/2026/06/061626-Figure2-600x469.png 600w, https://measuringu.com/wp-content/uploads/2026/06/061626-Figure2.png 1184w" sizes="auto, (max-width: 800px) 100vw, 800px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Doctorates at MeasuringU—we like to hire them!</p>
<p>We do this because many of the skills learned in a PhD program are transferable. Table 1 shows examples of the demands of PhD programs and industrial UX research that lead to similar skills.</p>

<table id="tablepress-1050" class="tablepress tablepress-id-1050">
<thead>
<tr class="row-1">
	<th class="column-1">Skill / Demand</th><th class="column-2">PhD Program</th><th class="column-3">Industry UX Research</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">Designing and running studies</td><td class="column-2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Extensive</td><td class="column-3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Extensive</td>
</tr>
<tr class="row-3">
	<td class="column-1">Statistical analysis and interpretation</td><td class="column-2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Rigorous</td><td class="column-3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Applied</td>
</tr>
<tr class="row-4">
	<td class="column-1">Communicating findings to a skeptical audience</td><td class="column-2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Committees, advisors</td><td class="column-3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Stakeholders, leadership</td>
</tr>
<tr class="row-5">
	<td class="column-1">Working under deadlines</td><td class="column-2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Milestones, defenses</td><td class="column-3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Sprint cycles, launches</td>
</tr>
<tr class="row-6">
	<td class="column-1">Managing ambiguity in data</td><td class="column-2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Common</td><td class="column-3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Constant</td>
</tr>
<tr class="row-7">
	<td class="column-1">Writing clearly and persuasively</td><td class="column-2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Dissertations, papers</td><td class="column-3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Reports, readouts</td>
</tr>
<tr class="row-8">
	<td class="column-1">Defending methodology under scrutiny</td><td class="column-2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Peer review, committee</td><td class="column-3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Cross-functional critique</td>
</tr>
<tr class="row-9">
	<td class="column-1">Literature synthesis and benchmarking</td><td class="column-2"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Deep</td><td class="column-3"><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Applied</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1050 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Comparisons of skills and demands in PhD programs and industrial UX research.</p>
<h2>Where the Comparison Breaks Down</h2>
<p>Beyond the overlap shown in Table 1, industry experience adds something different: critical understanding of company politics, constraints, and compromises of conducting research inside a business.</p>
<p>PhD programs rarely teach these skills because they’re not primarily what graduate school is for. Years of professional UX experience expose practitioners to these important skills for navigating the workforce:</p>
<p><strong>Communicating with upper management.</strong> Academic communication is oriented toward depth, precision, and scholarly audiences. It’s an entirely different skill to effectively present findings to a vice president who has eight minutes until the next meeting and who has little interest in methods, much less your confidence intervals.</p>
<p><strong>Cost-justifying research.</strong> In industry, research doesn&#8217;t happen because it&#8217;s intellectually interesting. You have to make the business case for why this study is worth three weeks of execution, headcount, and recruiting budget.</p>
<p><strong>Making recommendations. </strong>This is probably one of the harder challenges for fresh PhDs. In academic work, published research is the product, and the interpretive challenge is to show how the data support or fail to support competing theories. The publication process has its own timeline, which can span years. In industrial research, the timeline is much faster, and the interpretive challenge is to make connections between results and decisions. You’re expected to be the expert and to use the data as a guardrail when making specific recommendations.</p>
<p><strong>Making decisions with incomplete data.</strong> Making recommendations is hard, but making them with incomplete, partial data and small sample sizes is even harder, and it&#8217;s pretty much guaranteed to be part of the job. Academic research is built around the idea that you don&#8217;t publish until the evidence is sufficient. Industry doesn&#8217;t wait. Researchers routinely must make actionable calls with <em>n</em> = 6, two weeks of data, and a product manager asking, &#8220;So what do we do?&#8221; Being comfortable with that is a skill that only comes from real-world practice.</p>
<p><strong>Navigating organizational dynamics.</strong> Whose priorities take precedence? What happens when engineering ignores your findings? How do you influence a roadmap when you don&#8217;t own it? These questions don&#8217;t appear on qualifying exams.</p>
<p><strong>Learning to say “no.”</strong> As a student, you’re expected to do everything you’re asked to do without any significant pushback. When you join an industrial team as a UX researcher, you can count on being asked to do more than you can. Learning how to say “no” in socially and politically savvy ways becomes an important skill.</p>
<p><strong>Learning to deal with the “PowerPoint problem.”</strong> Translating nuanced research into a compelling, executive-ready slide deck (one that lands the insight without losing the truth) is a learned, practiced skill. Many new PhDs have never had to do it, and it shows. This is not writing an academic paper where findings and conclusions come after the methodological details. Stakeholders assume you know how to do the work. You need to learn to start with the findings and recommendations and then provide the supporting details.</p>
<p><strong>Being okay with &#8220;good enough&#8221; research.</strong> Graduate training optimizes for rigor. Industry optimizes for decision-making. Sometimes the right study is a fast-and-dirty five-session usability test, not a mixed-methods longitudinal study. Knowing when to go &#8220;good enough&#8221; (and not feeling bad about it) takes time and context that academia doesn&#8217;t provide.</p>
<h2>Do Years of Experience Count Toward a PhD?</h2>
<p>While we’re sympathetic to PhD applicants who want their years of program experience to count as industry experience, should the inverse also hold?</p>
<p>That is, if a PhD counts as five years of industry experience, should five years of UX research experience earn an honorary doctorate?</p>
<p>Most PhDs would immediately bristle at this suggestion, and they&#8217;d be right to do so. Years of industry work, however excellent, don&#8217;t replicate what a doctoral program demands: the years of independent scholarship, the depth of the literature review, the grueling process of designing and defending original research, the personal reckoning of being solely responsible for a body of work over years. Nobody gets a PhD by accident or by accumulating time. It&#8217;s a specific, demanding thing.</p>
<p>But the fact that this inverse is obviously false signals that the original equivalence is weaker than it sounds. Despite some overlapping skills, the demands of the two research contexts are very different.</p>
<h2>The Bottom Line</h2>
<p>A PhD and years of industry experience are not the same thing. They produce overlapping but distinct skill sets, so treating them as interchangeable flattens what is genuinely valuable about each.</p>
<p>But that&#8217;s not the same as saying a PhD doesn&#8217;t matter in industry. For certain roles, particularly in applied research firms where methodological depth, statistical rigor, and the ability to defend findings under scrutiny are core to the work (like MeasuringU!), a PhD is a genuine competitive advantage. It&#8217;s not a substitute for industry experience, but it&#8217;s a strong foundation to build on, and the gap closes quickly for PhDs who are self-aware about what they still need to learn.</p>
<p>If you have a PhD, be proud of it. Use it. The methodological foundation you built is real and hard to replace.</p>
<p>The PhD is a head start on the craft. Experience is a head start on the context. The best industrial researchers, eventually, have both.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Do Statistics Really Require 30 Participants?</title>
		<link>https://measuringu.com/do-statistics-really-require-30-participants/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=do-statistics-really-require-30-participants</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 09 Jun 2026 21:37:45 +0000</pubDate>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Sample Size]]></category>
		<category><![CDATA[Usability Testing]]></category>
		<category><![CDATA[User Experience]]></category>
		<category><![CDATA[User Research]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[confidence interval]]></category>
		<category><![CDATA[Confidence Intervals]]></category>
		<category><![CDATA[t-distribution]]></category>
		<category><![CDATA[t-test]]></category>
		<category><![CDATA[z-distribution]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47696</guid>

					<description><![CDATA[Should the sample size n be greater than 30? If you’ve taken any introductory statistics course or an AP statistics class (or helped your child with it), you’ve encountered the n ≥ 30 rule. The “magic number 5” rule we’ve written extensively about applies (with its important caveats) to problem discovery for usability testing. But the [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47752 size-medium" src="https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-300x169.png" alt="Feature image showing an icon representing participants and &quot;≥&quot;" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/06/060926-FeatureImage1-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>Should the sample size <em>n</em> be greater than 30?</p>
<p>If you’ve taken any introductory statistics course or an AP statistics class (or helped your child with it), you’ve encountered the <em>n</em> ≥ 30 rule.</p>
<p>The “<a href="https://measuringu.com/specific-sample-sizes-in-discovery-studies/">magic number 5</a>” rule we’ve written extensively about applies (with its important caveats) to problem discovery for usability testing.</p>
<p>But the <em>n</em> ≥ 30 rule goes beyond usability testing, coming up across disciplines and even in classrooms. It will often be mentioned by skeptical stakeholders and during the peer review process (probably from <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC8505560/">Reviewer Number 2</a>). Violating it can feel like a methodological sin.</p>
<p>But where does it actually come from? And does it hold up in general and in UX research in particular?</p>
<p>The short answer is that the rule has real statistical roots, but they’re often misunderstood and misapplied.</p>
<h2>On One Hand: Arguments for Why Researchers Need <em>n</em> ≥ 30</h2>
<p>The <em>n</em> ≥ 30 rule is grounded in two related concerns: whether your statistical analyses will perform accurately with smaller sample sizes (1) when your raw data is normally distributed and (2) when raw data is not normally distributed.</p>
<h3>The <em>t</em>-Distribution Converges to <em>Z</em> at about <em>n</em> = 30 for Continuous Data</h3>
<p>When learning statistics, you’ll often start with the normal <em>z</em>-distribution for statistical tests. You can use tables or simple formulas to look up <em>z</em> values when making computations. But using the normal <em>z</em>-distribution and tables means you need to know the population standard deviation.</p>
<p>Unfortunately, in applied settings, we rarely know the population standard deviation! Fortunately, the alternative is to use <em>t</em>-distribution tables and computations. They have one additional input compared to <em>z</em> computations, which is the sample size (strictly speaking, <em>n</em> − 1, the <a href="https://sites.utexas.edu/sos/degreesfreedom/">degrees of freedom</a> for the <em>t</em>-distribution). However, when the sample size gets to about 30, <em>z</em> and <em>t</em> values converge, so they are roughly the same (see Figure 1).</p>
<p>Consequently, when <em>n</em> is at least 30, you don’t have to deal with somewhat more complicated small-sample statistics. Over time, this statistical footnote calcified into a general-purpose sample size rule.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/06/Figure1-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47754" src="https://measuringu.com/wp-content/uploads/2026/06/Figure1-300x225.png" alt="Figure 1: Approach of t to z as a function of degrees of freedom (n - 1)." width="800" height="599" srcset="https://measuringu.com/wp-content/uploads/2026/06/Figure1-300x225.png 300w, https://measuringu.com/wp-content/uploads/2026/06/Figure1-1024x767.png 1024w, https://measuringu.com/wp-content/uploads/2026/06/Figure1-768x575.png 768w, https://measuringu.com/wp-content/uploads/2026/06/Figure1-1536x1150.png 1536w, https://measuringu.com/wp-content/uploads/2026/06/Figure1-2048x1534.png 2048w, https://measuringu.com/wp-content/uploads/2026/06/Figure1-600x449.png 600w" sizes="auto, (max-width: 800px) 100vw, 800px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Approach of <em>t</em> to <em>z</em> as a function of degrees of freedom (<em>n</em> − 1).</p>
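<p>To make the convergence in Figure 1 concrete, here is a minimal Python sketch that compares two-sided 95% critical values of <em>t</em> and <em>z</em>. It is an illustration rather than the method used to draw the figure. The helper names (<code>t_pdf</code>, <code>t_cdf</code>, <code>t_critical</code>) are our own, and the numerical integration and bisection are assumptions made so the example needs only the standard library (statistical packages such as SciPy provide a <em>t</em> quantile function directly).</p>

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: 0.5 plus the Simpson's-rule area from 0 to x."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(df, confidence=0.95):
    """Two-sided critical value, found by bisection on the CDF."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


z_critical = NormalDist().inv_cdf(0.975)  # about 1.96

for n in (5, 10, 20, 30, 100):
    t_val = t_critical(n - 1)
    print(f"n = {n:3d}  t = {t_val:.3f}  z = {z_critical:.3f}  "
          f"ratio = {t_val / z_critical:.3f}")
```

<p>At <em>n</em> = 30 the critical <em>t</em> is about 2.045 against 1.960 for <em>z</em>, a difference of roughly 4%, while at <em>n</em> = 5 the gap is over 40%.</p>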
<h3>Traditional “Wald” Confidence Intervals Are Less Accurate at Small Sample Sizes</h3>
<p>The <em>n</em> ≥ 30 rule isn’t just for continuous data. It’s also been applied to binary (0/1, yes/no) data. Like the convergence of <em>t</em> to <em>z</em> shown in Figure 1, the binomial distribution converges to the normal (<em>z</em>) distribution, becoming approximately normal at about <em>n</em> = 30 when the expected proportion is not very close to 0 or 1. Because the most widely taught method for calculating binomial confidence intervals (the Wald method) is based on the <em>z</em>-distribution, it grossly understates the width of the true interval when sample sizes are small. We demonstrated the inaccuracy of the Wald method in <a href="https://measuringu.com/article/estimating-completion-rates-from-small-samples-using-binomial-confidence-intervals-comparisons-and-recommendations/">our 2005 paper</a> using real-world completion rate data. For example, a nominal 95% confidence interval around a completion rate with <em>n</em> = 15 constructed with the Wald method is more like a 72% confidence interval (wildly inaccurate; see below for other findings from that paper).</p>
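<p>One way to see this undercoverage is to compute the exact coverage of the Wald interval for an assumed true completion rate. The sketch below is a theoretical illustration, not a reproduction of the 2005 analysis (which used real-world completion data), so its numbers will differ from the 72% figure. The function names and the true rates in the loop are assumptions chosen for the example.</p>

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value


def wald_interval(successes, n):
    """Standard Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = Z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def wald_coverage(true_p, n):
    """Exact probability that the Wald interval contains true_p."""
    covered = 0.0
    for x in range(n + 1):
        low, high = wald_interval(x, n)
        if high >= true_p >= low:
            # Binomial probability of observing exactly x successes
            covered += math.comb(n, x) * true_p**x * (1 - true_p) ** (n - x)
    return covered


# Hypothetical true completion rates; nominal coverage is 0.95 in every case
for p in (0.5, 0.8, 0.9, 0.95):
    print(f"true rate = {p:.2f}  actual coverage at n = 15: "
          f"{wald_coverage(p, 15):.3f}")
```

<p>When every participant succeeds (or every participant fails), the Wald interval collapses to zero width, which is one reason coverage drops so sharply for high completion rates.</p>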
<h3>UX Data Generally Isn’t Normally Distributed</h3>
<p>If you’ve ever stared at a histogram of task completion rates, time-on-task, or even SUS scores, you know that raw UX data rarely forms the classic bell curve. Completion rates are binary (0 or 1), usually with more successes than failures. Task times are often right-skewed due to a long tail of slow participants. Likert-scale items such as the Single Ease Question (<a href="https://measuringu.com/evolution-of-seq/">SEQ</a><sup>®</sup>) tend to cluster toward the top (more scores above the scale midpoint than below it). As we showed in our article <a href="https://measuringu.com/is-ux-data-normal/">Is UX Data Normally Distributed?</a>, none of these distributions look remotely normal (see Figure 2).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions.png" rel="attachment wp-att-47698"><img loading="lazy" decoding="async" class="alignnone wp-image-47698 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions.png" alt="Figure 2: Distributions of four UX metrics showing their non-normal raw distributions. " width="1041" height="1100" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions.png 1041w, https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions-284x300.png 284w, https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions-969x1024.png 969w, https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions-768x812.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure-2-Distributions-of-four-UX-metrics-showing-their-non-normal-raw-distributions-600x634.png 600w" sizes="auto, (max-width: 1041px) 100vw, 1041px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Distributions of four UX metrics showing their non-normal raw distributions.</p>
<p>That might sound alarming. Many statistical procedures, such as confidence intervals, <em>t</em>-tests, and ANOVA, assume normality. If the raw data isn’t normal, are those analyses invalid?</p>
<h2>On the Other Hand: Arguments Why Sometimes <em>n</em> &lt; 30 Is OK</h2>
<p>Fortunately, neither the convergence of <em>t</em> to <em>z</em> nor the non-normality of raw data means confidence intervals or statistical comparison tests are invalid below <em>n</em> = 30. For continuous data, the <em>t</em>-distribution works correctly at any sample size. For binary data, the standard-Wald method can be replaced with better-performing alternatives. And for the normality concern, the Central Limit Theorem (CLT) means the distribution of your raw data matters far less than most researchers assume. This is where the distinction between the distribution of your <em>raw data</em> and the distribution of your <em>sample means</em> becomes critical. We’ll start with the normality issue.</p>
<h3>The Central Limit Theorem Solves Most Normality Issues</h3>
<p>One of the most important concepts in all of statistics is the CLT. According to the CLT, as the sample size increases, the distribution of the mean becomes more and more normal, <a href="https://measuringu.com/is-ux-data-normal/">regardless of the normality of the underlying distribution</a>.</p>
<p>How quickly does the CLT kick in? Considering a wide variety of distributions, most achieve a normal or near-normal distribution of the means when <em>n</em> is 30, making that a reasonably safe bet when you don’t know anything about the distributions.</p>
<p>For many common UX metrics, however, the distributions of the means approach normality sooner than you’d expect. Using bootstrap simulations on real UX datasets (repeatedly drawing sub-samples and computing means), the sampling distributions of SEQ ratings and SUS scores approach normality by <em>n</em> = 10. Even binomial completion rates (which are maximally non-normal) and completion times approach a normal sampling distribution by <em>n</em> = 20–30 (see Figure 3). But it turns out there’s a fix for even smaller binomial sample sizes (see below).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes.png" rel="attachment wp-att-47699"><img loading="lazy" decoding="async" class="alignnone wp-image-47699 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes.png" alt="Figure 3: Distributions of the means for four UX metrics with varying sample sizes." width="1185" height="1445" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes.png 1185w, https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes-246x300.png 246w, https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes-840x1024.png 840w, https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes-768x937.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure-3-Distributions-of-the-means-for-four-UX-metrics-with-varying-sample-sizes-600x732.png 600w" sizes="auto, (max-width: 1185px) 100vw, 1185px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 3:</strong> Distributions of the means for four UX metrics with varying sample sizes.</p>
<p>This is why means, <em>t</em>-tests, and confidence intervals work reasonably well even when your raw responses are skewed (e.g., completion times) or bounded (e.g., rating scales).</p>
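<p>The effect of the CLT can be sketched with a small simulation. The example below does not use the real UX datasets behind Figure 3. It assumes a right-skewed lognormal distribution as a stand-in for task times (the parameters 4.0 and 0.5 are arbitrary), draws repeated samples, and reports how the skewness of the sample means shrinks as <em>n</em> grows. The function names are our own.</p>

```python
import random
import statistics


def skewness(values):
    """Sample skewness: mean cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


def simulate_mean_skewness(n, reps=5000, seed=1):
    """Skewness of the means of `reps` simulated samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        # Hypothetical task times in seconds (assumed lognormal, right-skewed)
        sample = [rng.lognormvariate(4.0, 0.5) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return skewness(means)


# n = 1 is the raw data itself; larger n shows the distribution of means
for n in (1, 5, 10, 30):
    print(f"n = {n:2d}  skewness of sample means = "
          f"{simulate_mean_skewness(n):.2f}")
```

<p>In theory, the skewness of the mean falls in proportion to 1/√<em>n</em>, so the means of a strongly skewed metric become much closer to symmetric well before <em>n</em> = 30.</p>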
<h3>The <em>t</em>-Distribution Was Built for Small Samples</h3>
<p>The deep irony of being advised that “you need 30 to use a <em>t</em>-test” is that the <em>t</em>-distribution was invented specifically for small samples.</p>
<p>In 1899, William S. Gosset (Figure 4), a recent graduate of New College, Oxford with degrees in chemistry and mathematics, became one of the first scientists to join the Guinness brewery.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett.jpg" rel="attachment wp-att-47700"><img loading="lazy" decoding="async" class="alignnone wp-image-47700 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett.jpg" alt="Figure 4: An anachronistic interpretation of William S. Gosset (Student), adapted from his Wikipedia page photo (public domain) with AI assistance." width="837" height="1075" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett.jpg 837w, https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett-234x300.jpg 234w, https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett-797x1024.jpg 797w, https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett-768x986.jpg 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure-4-An-anachronistic-interpretation-of-William-S.-Gossett-600x771.jpg 600w" sizes="auto, (max-width: 837px) 100vw, 837px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 4:</strong> An anachronistic interpretation of William S. Gosset (Student), adapted from his Wikipedia page photo (public domain) with AI assistance.</p>
<p>As Michael Cowles wrote in his book <em>Statistics in Psychology: An Historical Perspective</em>, “Compared with the giants of his day, he published very little, but his contribution is of critical importance. … The nature of the process of brewing, with its variability in temperature and ingredients, means that it is not possible to take large samples over a long run” (pp. 108–109).</p>
<p>Gosset couldn’t use <em>z</em>-scores because they don’t perform well with small samples. After analyzing the deficiencies of the <em>z</em>-distribution for small-sample statistical tests, he worked out the necessary adjustments as a function of degrees of freedom to produce the <em>t</em>-distribution, which he published in 1908 under the pseudonym “Student” because Guinness prohibited employees from publishing. In the work that led to his tables, Gosset performed an early version of <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo simulations</a>. He prepared 3,000 cards labeled with physical measurements taken on criminals, shuffled them, then dealt them into 750 groups of size 4 (<em>n</em> much smaller than 30).</p>
<p>The point is that the <em>t</em>-distribution was designed precisely to handle small samples correctly. The idea that you need <em>n</em> ≥ 30 even when using the <em>t</em>-distribution contradicts both the history and purpose of the statistic.</p>
<p>Historians of statistics widely regard Gosset&#8217;s publication of Student&#8217;s <em>t</em>-test as a landmark event. In a letter to <a href="https://en.wikipedia.org/wiki/Ronald_Fisher">Ronald A. Fisher</a> containing an early copy of the <em>t</em>-tables, Gosset wrote, &#8220;<a href="https://www.physoc.org/magazine-articles/the-strange-origins-of-the-students-t-test/">You are the only man that&#8217;s ever likely to use them</a>.&#8221;</p>
<p>Gosset got a lot of things right. He certainly got that wrong.</p>
<h3>The Wald Method Can Be Fixed with a Simple Adjustment</h3>
<p>While binary data generates inaccurate confidence intervals with the standard-Wald method when <em>n</em> &lt; 20–30, there&#8217;s a fix: a slight adjustment makes even small binomial samples generate accurate confidence intervals. In the same <a href="https://measuringu.com/article/estimating-completion-rates-from-small-samples-using-binomial-confidence-intervals-comparisons-and-recommendations/">2005 paper</a> where we demonstrated the problem with the Wald method, we also showed that a simple adjustment brings 95% confidence intervals back to accurate coverage. The adjustment is very close to just adding two successes and two failures to your observed data, then computing a standard-Wald interval around that adjusted proportion. This is now often called the adjusted-Wald method, and it builds on work by <a href="https://math.unm.edu/~james/Agresti1998.pdf">Agresti and Coull</a>. A small tweak to the math turns an unreliable method into a reliable one, even with small samples.</p>
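<p>As a rough sketch of the adjustment described above, the Python function below adds <em>z</em>²/2 to the successes and <em>z</em>² to the sample size before computing a Wald interval. For 95% confidence, <em>z</em>²/2 is about 1.92, which is why the shortcut of adding two successes and two failures is so close. The function name and the clipping of the interval to the 0 to 1 range are our own choices for this illustration.</p>

```python
import math
from statistics import NormalDist


def adjusted_wald(successes, n, confidence=0.95):
    """Adjusted-Wald interval: add z^2/2 successes and z^2/2 failures,
    then compute a Wald interval around the adjusted proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = n + z**2
    p_adj = (successes + z**2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to the valid range for a proportion (an illustrative choice)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# Hypothetical study: all 15 participants completed the task
low, high = adjusted_wald(15, 15)
print(f"15 of 15 completed: {low:.3f} to {high:.3f}")

# Hypothetical study: 7 of 10 participants completed the task
low, high = adjusted_wald(7, 10)
print(f" 7 of 10 completed: {low:.3f} to {high:.3f}")
```

<p>With 15 of 15 successes, the standard-Wald interval would be a zero-width interval at 100%, whereas the adjusted interval has a lower bound of roughly 76%, which is a far more honest statement of what 15 participants can tell you.</p>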
<h2>Our Recommendation: Stats Work Below <em>n</em> = 30 with the Right Approach</h2>
<p>When the cost of sampling is high, as it often is, always insisting on at least 30 users regardless of study goals is wasteful at best and infeasible at worst. A more appropriate approach is to use sample size formulas derived from the specific statistical analysis (e.g., confidence interval estimation, significance test), accounting for the data type, expected variability, desired confidence, and target effect size. We’ve published articles on this for several common UX scenarios, for example:</p>
<ul>
<li><a href="https://measuringu.com/ux-lite-sample-sizes-for-confidence-intervals/">UX-Lite Sample Sizes for Confidence Intervals</a></li>
<li><a href="https://measuringu.com/ux-lite-sample-sizes-for-comparison-to-a-benchmark/">UX-Lite Sample Sizes for Comparison to a Benchmark</a></li>
<li><a href="https://measuringu.com/sample-sizes-for-comparison-of-ux-lite-scores/">Sample Sizes for Comparing UX-Lite Scores</a></li>
</ul>
<p>Knowing that small samples <em>can</em> work statistically doesn&#8217;t tell you how to handle them. The right approach depends on the type of data you&#8217;re analyzing:</p>
<ul>
<li><strong>Rating scales</strong> (SUS, SEQ, SUPR-Q): Use the <em>t</em>-distribution with the correct degrees of freedom for confidence intervals and tests of significance. It was designed for exactly this situation.</li>
<li><strong>Binary data</strong> (completion rates, yes/no): Use adjusted methods (e.g., <a href="https://measuringu.com/article/estimating-completion-rates-from-small-samples-using-binomial-confidence-intervals-comparisons-and-recommendations/">adjusted-Wald</a> for confidence intervals, <a href="https://measuringu.com/what-is-the-n-1-two-proportion-test/"><em>N</em>−1 two-proportion</a> method for significance tests), which perform accurately at small samples where standard methods based on the <em>z</em>-distribution break down.</li>
<li><strong>Time data</strong>: For confidence intervals, log-transform the raw data to correct for right-skew, then transform back to the original scale (usually no need to transform for tests of significance, but it is always an option).</li>
</ul>
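<p>The time-data recommendation can be sketched in a few lines of Python. The task times below are hypothetical, the function name is our own, and the critical value 2.262 is the standard two-sided 95% <em>t</em> value for 9 degrees of freedom (<em>n</em> = 10). The interval is computed on the log scale and then exponentiated, so its center is the geometric mean rather than the arithmetic mean.</p>

```python
import math
import statistics


def log_time_interval(times, t_critical):
    """Confidence interval for task time computed on the log scale.

    Returns (lower, geometric_mean, upper) in the original units.
    """
    logs = [math.log(t) for t in times]
    mean_log = statistics.fmean(logs)
    se_log = statistics.stdev(logs) / math.sqrt(len(logs))
    margin = t_critical * se_log
    return (math.exp(mean_log - margin),
            math.exp(mean_log),
            math.exp(mean_log + margin))


# Hypothetical task times in seconds, with a long right tail
task_times = [34, 41, 45, 52, 58, 63, 71, 88, 120, 210]
T_CRIT_DF9 = 2.262  # two-sided 95% critical t for 9 degrees of freedom

low, center, high = log_time_interval(task_times, T_CRIT_DF9)
print(f"geometric mean = {center:.1f} s, 95% CI = {low:.1f} to {high:.1f} s")
print(f"arithmetic mean = {statistics.fmean(task_times):.1f} s")
```

<p>Because the back-transformed interval is asymmetric (wider above the center than below it), it reflects the right skew of the raw times instead of ignoring it.</p>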
<p>When samples are small, the concern shifts from normality and sampling distributions to sensitivity and power: the precision of an estimate (confidence intervals) or your ability to detect a true difference when one exists (hypothesis testing).</p>
<p>The procedures work correctly with small samples, but you&#8217;re limited to relatively imprecise estimates (confidence intervals) or reliably detecting only large differences (hypothesis tests). Subtle or moderate effects will likely go undetected, not because the statistics are broken but because small samples carry more uncertainty. We cover how to plan for adequate sensitivity and power in all our articles on sample size estimation (e.g., <a href="https://measuringu.com/sample-sizes-for-rating-scale-comparisons/">Sample Sizes for Comparing Rating Scale Means</a>). Sometimes a small sample is all you need to achieve your research goal.</p>
<p>This controversy is similar to the &#8220;magic number 5&#8221; controversy but applied to <a href="https://measuringu.com/three-goals/">summative rather than formative</a> research. The &#8220;magic number 30&#8221; has a real empirical rationale, as it&#8217;s roughly where the CLT kicks in for a wide variety of distributions and where <em>t</em> converges on <em>z</em>. In practice, however, it&#8217;s applied far too rigidly. The appropriate sample size depends on the distribution, the expected variability, the desired confidence and power, and the minimum <a href="https://measuringu.com/an-introduction-to-effect-sizes/">effect size</a> you need to detect. A sample of 30 is almost never exactly right for any specific situation.</p>
<p>It isn&#8217;t much more complicated to use the <em>t</em>-distribution than the <em>z</em>-distribution, or the adjusted-Wald instead of the Wald method (you just need to account for the sample size). The entire reason the <em>t</em>-distribution was developed was to enable the analysis of small samples. This is just one of the less obvious ways usability practitioners benefit from the science and practice of beer brewing.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Using the TAC-10 for Screening and Data Cleaning</title>
		<link>https://measuringu.com/using-the-tac10-for-screening-and-data-cleaning/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=using-the-tac10-for-screening-and-data-cleaning</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 02 Jun 2026 23:09:08 +0000</pubDate>
				<category><![CDATA[Methods]]></category>
		<category><![CDATA[User Research]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Online Panels]]></category>
		<category><![CDATA[TAC-10]]></category>
		<category><![CDATA[Tech savviness]]></category>
		<category><![CDATA[UX Research]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47676</guid>

					<description><![CDATA[It’s hard to collect data for UX research, and once you have it, you have to clean it. In a simpler world, all respondents would be honest and focused on providing high-quality information rather than maximizing income, but that’s not the world we live in. From past research, we estimate the prevalence of cheating on [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1.png" rel="attachment wp-att-24969"><img loading="lazy" decoding="async" class="alignleft wp-image-47712 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-300x169.png" alt="Feature image showing TAC-10 being used for screening and data cleaning" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/05/060226-FeatureImage-1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>It’s hard to collect data for UX research, and once you have it, you have to clean it.</p>
<p>In a simpler world, all respondents would be honest and focused on providing high-quality information rather than maximizing income, but that’s not the world we live in. From <a href="https://measuringu.com/cheat-survey/">past research</a>, we estimate the prevalence of cheating on paid panels to be about 10% of respondents (ranging from 3–20%).</p>
<p>UX researchers can use <a href="https://measuringu.com/cleaning-data/">numerous strategies</a> for screening (stopping bad actors before they get to the actual study) and cleaning (finding and removing poor quality respondents after study completion). These include:</p>
<ul>
<li>Identification of speeders</li>
<li>Disqualifying questions</li>
<li>Attention checks</li>
<li>Review of open-ended responses</li>
<li>Internal consistency</li>
<li>Straightlining</li>
<li>Review of session recordings (when available)</li>
<li>Duplicate and bot detection</li>
</ul>
<p>AI complicates all these approaches. Modern AI can mimic attentive respondent behavior well enough to slip past most of these detection methods. We are encouraged, however, that many panel operators have taken active steps to restrict AI fraud at the source.</p>
<p>When those safeguards are in place, or when participants come from a verified human population such as a customer list, we propose another dual-purpose and quick approach.</p>
<p>In this article, we demonstrate how to use the TAC-10™ not only for its primary purpose as a measure of tech savviness but also, through its response patterns, as a type of internal consistency check for screening and detecting inattentive or misrepresenting human respondents.</p>
<h2>TAC-10 Basics</h2>
<p>In a <a href="https://measuringu.com/how-to-use-the-tac/">series of articles</a>, we reviewed the findings of eight years of research into measuring tech savviness. In that research program, we explored <a href="https://measuringu.com/in-search-of-tech-savvy-measures/">several methods</a> for measuring tech savviness, including quizzes (what people know), self-assessment questionnaires (what people feel), and technical activity checklists (what people are confident doing).</p>
<p>After analyzing thousands of participants’ data to see how measures of tech savviness predict task performance, we determined that technical activity checklists had better measurement properties than quizzes or questionnaires. Of the various versions of checklists that we studied, we determined that the one with ten activities and a none-of-the-above option (the TAC-10 shown in Figure 1) has the best balance between conciseness and completeness.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/1.png" rel="attachment wp-att-47677"><img loading="lazy" decoding="async" class="alignnone wp-image-47677 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/1.png" alt="Figure 1: The current version of the TAC-10 (image from the MUiQ® platform). " width="683" height="625" srcset="https://measuringu.com/wp-content/uploads/2026/05/1.png 683w, https://measuringu.com/wp-content/uploads/2026/05/1-300x275.png 300w, https://measuringu.com/wp-content/uploads/2026/05/1-600x549.png 600w" sizes="auto, (max-width: 683px) 100vw, 683px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> The current version of the TAC-10 (image from the <a href="https://measuringu.com/muiq/">MUiQ<sup>®</sup> platform</a>).</p>
<p>The TAC-10 score for a person is the number of selected items. It’s a reliable (consistent) and valid (predictive) measure of tech savviness that, for its primary purpose, can be used (1) to classify participants into groups with low, medium, or high levels of tech savviness and (2) as a tech savviness predictor or covariate in advanced statistical analysis.</p>
<h2>Some TAC-10 Response Patterns Are More Plausible than Others</h2>
<p>In addition to being a tool to measure and classify tech savviness, TAC-10 response patterns can also be used to identify potentially problematic respondents based on the plausibility of the pattern.</p>
<p>In May 2023, we collected a large sample of completed TAC-16 checklists as part of a screening survey (<em>n </em>= 4,731) to acquire enough data for <a href="https://measuringu.com/rasch-analysis-of-three-technical-activity-checklists/">Rasch analysis</a> of three versions of the TAC (TAC-9, TAC-10, and TAC-16). In this new analysis, we applied various methods to classify response patterns as plausible, implausible, or indeterminate. Examples of plausible patterns are those consistent with perfect or near-perfect <a href="https://en.wikipedia.org/wiki/Guttman_scale">Guttman scaling</a>. Response patterns that are logically inconsistent are implausible. Patterns that are not clearly plausible or implausible are indeterminate. For these analyses, we coded each TAC-10 response as a binary string of 0s (not selected) and 1s (selected) for activities in the order shown in Figure 1. For example, 1100000000 indicates a user who selected &#8220;installing a new app on your phone&#8221; and &#8220;setting up a new phone,&#8221; but no other activities.</p>
<h3>Responses Consistent with Guttman Scaling Are Plausible</h3>
<p>Guttman scaling, which dates back to the 1940s, is a deterministic predecessor of probabilistic Rasch scaling. The goal of a Guttman scale is to develop a set of distinctive items, from easy to difficult, that form a unidimensional scale. The range from easy to difficult can refer to characteristics like easy to solve to difficult to solve for math problems or easy to agree with to difficult to agree with for attitudinal scales.</p>
<p>For 10 binary (yes/no) items like the TAC-10, there are 2<sup>10</sup> (1,024) possible arrangements of selected (1) or unselected (0) items. Only 11, however, are consistent with a perfect Guttman scale (all 1s toward the left side of the pattern, all 0s to the right): 0000000000, 1000000000, 1100000000, 1110000000, 1111000000, 1111100000, 1111110000, 1111111000, 1111111100, 1111111110, and 1111111111. We categorized these patterns as plausible.</p>
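<p>The counts above are easy to check by brute force. This short Python sketch enumerates all ten-item patterns and keeps those in which no unselected item (0) is ever followed by a selected one (1), which is one simple way to express the perfect Guttman condition. The variable and function names are our own.</p>

```python
from itertools import product

N_ITEMS = 10

# Every possible response pattern as a string of 0s and 1s
all_patterns = ["".join(bits) for bits in product("01", repeat=N_ITEMS)]

# Perfect Guttman patterns: all 1s to the left, all 0s to the right,
# so the substring "01" never occurs
guttman_patterns = [p for p in all_patterns if "01" not in p]


def tac_score(pattern):
    """TAC-10 score: the number of selected activities."""
    return pattern.count("1")


print(len(all_patterns), "possible patterns")
print(len(guttman_patterns), "perfect Guttman patterns")
for pattern in sorted(guttman_patterns, key=tac_score):
    print(pattern, "score =", tac_score(pattern))
```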
<h3>Other Plausible Response Patterns</h3>
<p>In practice, other patterns that are close to Guttman patterns are also likely to be plausible. For example, if someone is comfortable with all activities except HTML, the pattern would be 1111111101. Although it’s unlikely that someone who programs efficiently in C knows nothing about HTML, it’s possible that they would lack sufficient practical or deep familiarity with it to be comfortable selecting it. In most cases, Guttman-like patterns with one or two discontinuities are plausible.</p>
<h3>Implausible Response Patterns</h3>
<p>Patterns that are the inverse of Guttman patterns (1 and 0 swapped) are categorized as implausible (except for 0000000000 and 1111111111). For example, the pattern 0000000001 indicates someone who programs efficiently but isn’t comfortable with anything else in the list—possible but highly unlikely.</p>
<p>Other problematic patterns are those that start with 01 because, for this to be plausible, the respondent would have to be comfortable setting up a new phone but uncomfortable adding an app to that phone.</p>
<p>Patterns that contain just one 1 (other than 1000000000) are implausible and may indicate a respondent who misunderstood the instruction to select all that apply.</p>
<h3>Indeterminate Response Patterns</h3>
<p>Patterns not categorized as plausible or implausible are provisionally defined as indeterminate.</p>
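<p>Taken together, the categories above can be sketched as a small classifier. This is an illustration under stated assumptions, not the exact procedure used in our analysis. The function name, the order in which the rules are checked, and especially the <code>max_gaps</code> rule (treating a pattern as near-Guttman when at most two 0s appear before its last 1) are assumptions about how to operationalize "one or two discontinuities."</p>

```python
def classify_tac10(pattern, max_gaps=2):
    """Classify a TAC-10 pattern as plausible, implausible, or indeterminate."""
    if len(pattern) != 10 or set(pattern) - {"0", "1"}:
        raise ValueError("expected a string of ten 0s and 1s")

    # Perfect Guttman pattern (includes 0000000000 and 1111111111)
    if "01" not in pattern:
        return "plausible"

    # Inverse Guttman pattern: 0s followed only by 1s, e.g. 0000000001
    if "10" not in pattern:
        return "implausible"

    # Comfortable setting up a phone but not installing an app on it
    if pattern.startswith("01"):
        return "implausible"

    # A single selection other than the first (easiest) item
    if pattern.count("1") == 1:
        return "implausible"

    # ASSUMPTION: near-Guttman means few unselected items before the last
    # selected item; the threshold is illustrative, not from the analysis
    last_one = pattern.rfind("1")
    gaps = pattern[:last_one].count("0")
    if max_gaps >= gaps:
        return "plausible"

    return "indeterminate"


for example in ("1111111100", "1111111101", "0000000001",
                "0100000000", "1000000001"):
    print(example, classify_tac10(example))
```

<p>In practice, a flag from a sketch like this is a prompt for closer review of a respondent alongside other quality checks, not an automatic exclusion rule.</p>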
<h2>Plausible TAC-10 Patterns Are Much More Likely in Practice than Implausible Patterns</h2>
<p>We investigated the frequency of occurrence of plausible, implausible, and indeterminate patterns in our large sample of TAC-10 scores. Of the 1,024 possible patterns, only 199 appeared at least once in our dataset of 4,731 cases.</p>
<h3>Guttman Patterns</h3>
<p>Table 1 shows the frequency of Guttman patterns in the large TAC-10 database, accounting for 56.4% of cases.</p>

<table id="tablepress-1048" class="tablepress tablepress-id-1048">
<thead>
<tr class="row-1">
	<th class="column-1">Guttman Patterns</th><th class="column-2">Freq</th><th class="column-3">Percent</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">1111111111</td><td class="column-2">365</td><td class="column-3"> 7.7%</td>
</tr>
<tr class="row-3">
	<td class="column-1">1111111110</td><td class="column-2">633</td><td class="column-3">13.4%</td>
</tr>
<tr class="row-4">
	<td class="column-1">1111111100</td><td class="column-2">764</td><td class="column-3">16.1%</td>
</tr>
<tr class="row-5">
	<td class="column-1">1111111000</td><td class="column-2">474</td><td class="column-3">10.0%</td>
</tr>
<tr class="row-6">
	<td class="column-1">1111110000</td><td class="column-2">268</td><td class="column-3"> 5.7%</td>
</tr>
<tr class="row-7">
	<td class="column-1">1111100000</td><td class="column-2">104</td><td class="column-3"> 2.2%</td>
</tr>
<tr class="row-8">
	<td class="column-1">1111000000</td><td class="column-2"> 27</td><td class="column-3"> 0.6%</td>
</tr>
<tr class="row-9">
	<td class="column-1">1110000000</td><td class="column-2"> 20</td><td class="column-3"> 0.4%</td>
</tr>
<tr class="row-10">
	<td class="column-1">1100000000</td><td class="column-2">  9</td><td class="column-3"> 0.2%</td>
</tr>
<tr class="row-11">
	<td class="column-1">1000000000</td><td class="column-2">  5</td><td class="column-3"> 0.1%</td>
</tr>
<tr class="row-12">
	<td class="column-1">0000000000</td><td class="column-2">  0</td><td class="column-3"> 0.0%</td>
</tr>
<tr class="row-13">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td>
</tr>
<tr class="row-14">
	<td class="column-1">Total</td><td class="column-2">2669</td><td class="column-3">56.4%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1048 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Distribution of Guttman patterns in the large database.</p>
<h3>Other Plausible Patterns</h3>
<p>Table 2 shows other frequently occurring plausible patterns (each accounting for at least 0.5% of cases in the database). The 21 patterns in the table accounted for 30.7% of cases. In combination, the percentage of the 32 Guttman and other high-frequency plausible patterns in the database is 87.1%.</p>

<table id="tablepress-1049" class="tablepress tablepress-id-1049">
<thead>
<tr class="row-1">
	<th class="column-1">Other Plausible Patterns</th><th class="column-2"> Freq</th><th class="column-3">Percent</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">1111110100</td><td class="column-2"> 271</td><td class="column-3"> 5.7%</td>
</tr>
<tr class="row-3">
	<td class="column-1">1111111010</td><td class="column-2"> 179</td><td class="column-3"> 3.8%</td>
</tr>
<tr class="row-4">
	<td class="column-1">1111101000</td><td class="column-2"> 152</td><td class="column-3"> 3.2%</td>
</tr>
<tr class="row-5">
	<td class="column-1">1111110110</td><td class="column-2"> 127</td><td class="column-3"> 2.7%</td>
</tr>
<tr class="row-6">
	<td class="column-1">1111010100</td><td class="column-2"> 104</td><td class="column-3"> 2.2%</td>
</tr>
<tr class="row-7">
	<td class="column-1">1111010000</td><td class="column-2">  99</td><td class="column-3"> 2.1%</td>
</tr>
<tr class="row-8">
	<td class="column-1">1111110010</td><td class="column-2">  72</td><td class="column-3"> 1.5%</td>
</tr>
<tr class="row-9">
	<td class="column-1">1111101100</td><td class="column-2">  66</td><td class="column-3"> 1.4%</td>
</tr>
<tr class="row-10">
	<td class="column-1">1111011100</td><td class="column-2">  52</td><td class="column-3"> 1.1%</td>
</tr>
<tr class="row-11">
	<td class="column-1">1111010110</td><td class="column-2">  44</td><td class="column-3"> 0.9%</td>
</tr>
<tr class="row-12">
	<td class="column-1">1111111101</td><td class="column-2">  32</td><td class="column-3"> 0.7%</td>
</tr>
<tr class="row-13">
	<td class="column-1">1110100000</td><td class="column-2">  30</td><td class="column-3"> 0.6%</td>
</tr>
<tr class="row-14">
	<td class="column-1">1111111011</td><td class="column-2">  30</td><td class="column-3"> 0.6%</td>
</tr>
<tr class="row-15">
	<td class="column-1">1111100100</td><td class="column-2">  28</td><td class="column-3"> 0.6%</td>
</tr>
<tr class="row-16">
	<td class="column-1">1111011110</td><td class="column-2">  26</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-17">
	<td class="column-1">1111101010</td><td class="column-2">  26</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-18">
	<td class="column-1">1111011000</td><td class="column-2">  25</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-19">
	<td class="column-1">1110101000</td><td class="column-2">  23</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-20">
	<td class="column-1">1100100000</td><td class="column-2">  22</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-21">
	<td class="column-1">1101100000</td><td class="column-2">  22</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-22">
	<td class="column-1">1110111000</td><td class="column-2">  22</td><td class="column-3"> 0.5%</td>
</tr>
<tr class="row-23">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td>
</tr>
<tr class="row-24">
	<td class="column-1">Total</td><td class="column-2">1452</td><td class="column-3">30.7%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1049 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2:</strong> Distribution of other patterns in the large database that are plausible and had frequencies of at least 0.5%.</p>
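<p>The totals and percentages in Tables 1 and 2 can be reproduced from the listed frequencies. The sketch below copies the frequency columns from the two tables; the variable and function names are ours.</p>

```python
N_CASES = 4731  # cases in the TAC-10 database

# Frequency columns copied from Table 1 and Table 2
table1 = [365, 633, 764, 474, 268, 104, 27, 20, 9, 5, 0]
table2 = [271, 179, 152, 127, 104, 99, 72, 66, 52, 44, 32,
          30, 30, 28, 26, 26, 25, 23, 22, 22, 22]


def pct(count: int, total: int = N_CASES) -> float:
    """Share of the database as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


guttman_cases = sum(table1)
other_plausible_cases = sum(table2)

print(guttman_cases, pct(guttman_cases))                  # 2669 56.4
print(other_plausible_cases, pct(other_plausible_cases))  # 1452 30.7
print(pct(guttman_cases + other_plausible_cases))         # 87.1
```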
<h3>Implausible Patterns</h3>
<p>The database did not contain any cases matching an inverse Guttman pattern.</p>
<p>There were 17 implausible patterns that started with 01, each having a frequency of 1 or 2 for a total of 21, accounting for just 0.4% of the data.</p>
<p>There were only four cases (0.1% of the data) in which a single activity was chosen past the phone activities (three cases with 0010000000 and one with 0001000000, two additional implausible patterns).</p>
<h3>Indeterminate Patterns</h3>
<p>Because there were 32 plausible and 19 implausible patterns (51) out of a total of 199 patterns, the remaining 148 patterns are indeterminate.</p>
<p>Combined, the indeterminate patterns account for 12.4% of the data, with no individual indeterminate pattern having a frequency greater than 0.4%.</p>
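<p>The accounting behind these figures is simple subtraction, sketched below with the counts reported in this section. The number of indeterminate cases (585) is not stated in the text; it is derived here from the other totals.</p>

```python
N_CASES = 4731
OBSERVED_PATTERNS = 199

# Pattern counts reported above
plausible_patterns = 11 + 21    # Guttman + other frequent plausible patterns
implausible_patterns = 17 + 2   # starting with 01 + lone non-phone selection
indeterminate_patterns = (
    OBSERVED_PATTERNS - plausible_patterns - implausible_patterns
)

# Case counts reported above
plausible_cases = 2669 + 1452
implausible_cases = 21 + 4
indeterminate_cases = N_CASES - plausible_cases - implausible_cases

print(indeterminate_patterns)                         # 148
print(indeterminate_cases)                            # 585
print(round(100 * implausible_cases / N_CASES, 1))    # 0.5
print(round(100 * indeterminate_cases / N_CASES, 1))  # 12.4
```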
<h2>Summary and Discussion</h2>
<p>In addition to its use as a measure of tech savviness, we investigated how well the TAC-10 might be used to identify plausible and implausible response patterns for the purpose of identifying potentially problematic respondents in screening and data cleaning.</p>
<p>Based on our large database of TAC-10 cases (<em>n</em> = 4,731), using two criteria for identifying plausible response patterns (matching Guttman patterns and/or frequently occurring patterns), we found that 56.4% of cases matched Guttman patterns; an additional 21 frequently occurring patterns that slightly deviated from Guttman patterns accounted for 30.7%, for a total of 87.1%. Clearly implausible patterns accounted for only 0.5% of cases, leaving the others indeterminate.</p>
<p>Our key conclusions from these analyses were:</p>
<p><strong>Plausible patterns made up the vast majority (87%) of TAC-10 cases. </strong>This suggests that most respondents were attending to the items rather than carelessly checking boxes, especially because we randomized the order of presentation of the items.</p>
<p><strong>Implausible patterns were rare. </strong>There were no occurrences of inverse Guttman patterns, and only about 0.5% of the responses had a problematic pattern starting with 01 or containing a single 1 (aside from a single 1 for the easiest activity).</p>
<p><strong>TAC-10 responses can be used for screening and data cleaning. </strong>These results (a large percentage of plausible and a low percentage of implausible response patterns) are encouraging for using the TAC-10 to flag potentially problematic (implausible or indeterminate) response patterns. It would serve as one part of a battery of approaches for identifying potential cheaters, along with other strategies such as examining open-ended responses, implausible completion times, distractors in multiple-choice items, attention checks, and straightlining.</p>
<p><strong>The TAC-10 is not going to solve AI fraud. </strong>We don’t think the TAC-10 is necessarily a solution to AI fraud. More sophisticated AI methods can convincingly mimic either a low- or high-skilled human respondent, possibly by training on the articles we’ve published on the TAC-10. However, the TAC-10 remains a valuable screening tool in contexts where respondents come from a known population, such as a customer list, or where other panel-level methods have already confirmed that participants are human.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Does AI Find Real UI Problems or Just Hallucinations?</title>
		<link>https://measuringu.com/does-ai-find-real-ui-problems-or-just-hallucinations/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=does-ai-find-real-ui-problems-or-just-hallucinations</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD •&nbsp;Jeff Sauro, PhD •&nbsp;Will Schiavone, PhD&nbsp;•&nbsp;Lucas Plabst, PhD]]></dc:creator>
		<pubDate>Wed, 27 May 2026 00:41:28 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[Gemini]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<category><![CDATA[Usability Problem]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47650</guid>

					<description><![CDATA[In a previous experiment, AI identified roughly half the usability problems that trained researchers found in a video of a usability test session. That sounds promising. If AI can find usability issues, it can substantially increase the amount of usability testing that research teams can conduct. But in our analysis of that video, AI generated nearly [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47672 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-300x169.png" alt="Feature image showing an AI robot, three documents each labeled &quot;Verified&quot;, &quot;Fake&quot;, or &quot;False alarm&quot;, and a researcher." width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/05/052626-FeatureImage-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>In a <a href="https://measuringu.com/ai-vs-human-usability-problem-analysis-of-a-video/">previous experiment</a>, AI identified roughly half the usability problems that trained researchers found in a video of a usability test session.</p>
<p>That sounds promising. If AI can find usability issues, it can substantially increase the amount of usability testing that research teams can conduct.</p>
<p>But in our analysis of that video, AI generated nearly as many <em>additional</em> problems that humans never flagged. Are these problems hidden gems missed by multiple researchers, or just AI hallucinations?</p>
<p>For this article, we classified all the unique problems the AIs generated into one of three categories:</p>
<ol>
<li>a real problem humans missed</li>
<li>a false alarm (a true observation misread as a usability problem)</li>
<li>a hallucination (something the AI reported that simply never happened)</li>
</ol>
<p>What we found suggests that the new AI problems are mostly false alarms, but there are some notable exceptions.</p>
<h2>Experimental Design: Four Researchers, Two LLMs, and One Video</h2>
<p>For this study, we had four humans (professional UX researchers working at MeasuringU) review a video from a previous usability benchmark study of online dining reservation websites. Each researcher independently created a list of the usability issues they observed in the six-minute video.</p>
<p>We ran the video through two LLMs (ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking) four times each, <a href="https://measuringu.com/ai-usability-problem-analysis-of-a-video/">using the same prompt</a> every time.</p>
<p>So in this study, we held constant the video, the key elements of the prompt, and the LLM versions/settings—variables that we plan to vary in future studies. This time, we varied only the type of analyst: human, ChatGPT, and Gemini.</p>
<h2>Gemini Finds a Jewel; ChatGPT Goes on a Tangent</h2>
<p>Using the human-generated and -verified problem lists as the “gold standard,” Figure 1 shows a summary of what we found.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47542" src="https://measuringu.com/wp-content/uploads/2026/05/Figure2-1024x901.png" alt="Venn diagram of usability problem discovery by humans, ChatGPT, and Gemini." width="700" height="616" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure2-1024x901.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-300x264.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-768x676.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-1536x1351.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-2048x1802.png 2048w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-600x528.png 600w" sizes="auto, (max-width: 700px) 100vw, 700px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1: </strong>Venn diagram of usability problem discovery by humans, ChatGPT, and Gemini.</p>
<p>We know Venn diagrams can generate some bad high school math memories, so here’s a summary for all our sanity:</p>
<ul>
<li>Four human researchers found nine problems (3 + 2 + 3 + 1).</li>
<li>Two AIs combined found 17 problems (6 + 1 + 4 + 2 + 3 + 1).</li>
<li>Only three problems were found by researchers and both AIs (the 3 in the middle of the circles).</li>
<li>ChatGPT matched five of the nine researcher-identified problems (3 + 2).</li>
<li>Gemini matched four (3 + 1) of the researcher-identified problems.</li>
<li>That leaves 11 problems the AIs flagged that no researcher identified (6 + 1 + 4).</li>
<li>Of those 11 problems, six were unique to ChatGPT, four were unique to Gemini, and one was identified by both AIs.</li>
</ul>
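<p>For readers who prefer to check the Venn arithmetic in code, here is a minimal Python sketch. The region counts come from the summary above; the dictionary layout and the helper name <code>total</code> are our own.</p>

```python
# Region counts from the Venn diagram summary: each key is the set of
# sources that reported the problems counted in that region
regions = {
    frozenset({"human"}): 3,
    frozenset({"human", "chatgpt"}): 2,
    frozenset({"human", "chatgpt", "gemini"}): 3,
    frozenset({"human", "gemini"}): 1,
    frozenset({"chatgpt"}): 6,
    frozenset({"chatgpt", "gemini"}): 1,
    frozenset({"gemini"}): 4,
}


def total(include=frozenset(), exclude=frozenset()):
    """Sum regions containing every source in include and none in exclude."""
    return sum(
        n for sources, n in regions.items()
        if sources.issuperset(include) and sources.isdisjoint(exclude)
    )


print(total(include={"human"}))                       # 9
print(total(include={"human", "chatgpt"}))            # 5
print(total(include={"human", "gemini"}))             # 4
print(total(include={"human", "chatgpt", "gemini"}))  # 3
print(total(exclude={"human"}))                       # 11
```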
<p>So, <strong>AIs generated 11 new problems</strong> not identified by any of the human researchers. Table 1 has details of those 11 problems, listed in chronological order using problem number codes from the previous article. Of the 11 problems no human flagged, one was a genuine find, seven were false alarms, and three were hallucinations. Here&#8217;s more detail about each category.</p>

<table id="tablepress-1047" class="tablepress tablepress-id-1047">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">Problem Description</th><th class="column-3">Source</th><th class="column-4">Classification</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>4b</strong></td><td class="column-2">Filters not helpful</td><td class="column-3">ChatGPT</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>5b</strong></td><td class="column-2">Participant used Ctrl-F to search for "sushi" when it wasn't in the 86-cuisine list</td><td class="column-3">Gemini</td><td class="column-4"><strong>Genuine find</strong></td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>6b</strong></td><td class="column-2">Search results for sushi included many non-sushi restaurants</td><td class="column-3">ChatGPT</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>6c</strong></td><td class="column-2">Weak presentation of cuisine information in search results</td><td class="column-3">ChatGPT</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>7b-Gem</strong></td><td class="column-2">Participant chose the highest price tier</td><td class="column-3">Gemini</td><td class="column-4"><strong><font color="red">Hallucination</font></strong></td>
</tr>
<tr class="row-7">
	<td class="column-1"><strong>7b-GPT</strong></td><td class="column-2">Sorting by highest rated surfaced many non-sushi restaurants</td><td class="column-3">ChatGPT</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-8">
	<td class="column-1"><strong>8b</strong></td><td class="column-2">UI pushes browsing without good decision support</td><td class="column-3">ChatGPT</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-9">
	<td class="column-1"><strong>9b</strong></td><td class="column-2">Seating options only presented after selecting reservation time</td><td class="column-3">Gemini</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-10">
	<td class="column-1"><strong>9c</strong></td><td class="column-2">Participant set time to 5:10 instead of 5:00</td><td class="column-3">Gemini</td><td class="column-4"><strong><font color="red">Hallucination</font></strong></td>
</tr>
<tr class="row-11">
	<td class="column-1"><strong>10a</strong></td><td class="column-2">Selected restaurant labeled "seafood" rather than "sushi" by OpenTable</td><td class="column-3">Both</td><td class="column-4">False alarm</td>
</tr>
<tr class="row-12">
	<td class="column-1"><strong>10b</strong></td><td class="column-2">Task not completed—participant never reached the reservation form</td><td class="column-3">ChatGPT</td><td class="column-4"><strong><font color="red">Hallucination</font></strong></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1047 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1: </strong>The 11 AI-generated problems not identified by any human researcher, classified by type.</p>
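<p>The counts by classification and by source can be tallied directly from Table 1. The sketch below transcribes the table's problem codes, sources, and classifications; the variable names are ours.</p>

```python
from collections import Counter

# (problem code, source, classification) transcribed from Table 1
problems = [
    ("4b", "ChatGPT", "False alarm"),
    ("5b", "Gemini", "Genuine find"),
    ("6b", "ChatGPT", "False alarm"),
    ("6c", "ChatGPT", "False alarm"),
    ("7b-Gem", "Gemini", "Hallucination"),
    ("7b-GPT", "ChatGPT", "False alarm"),
    ("8b", "ChatGPT", "False alarm"),
    ("9b", "Gemini", "False alarm"),
    ("9c", "Gemini", "Hallucination"),
    ("10a", "Both", "False alarm"),
    ("10b", "ChatGPT", "Hallucination"),
]

by_class = Counter(cls for _, _, cls in problems)
by_source = Counter(src for _, src, _ in problems)

# Print each classification with its count and share of the 11 problems
for cls, n in by_class.most_common():
    print(f"{cls}: {n} ({n / len(problems):.0%})")
```

<p>Run as written, this prints seven false alarms (64%), three hallucinations (27%), and one genuine find (9%).</p>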
<h3>Gemini’s Genuine Find</h3>
<p>Let’s start with the good news: All four Gemini runs identified that after the participant expanded the cuisine filter to show all 86 cuisines, she used Ctrl-F to search the page for “sushi” (5b-Gem)—an event not reported by any of the human evaluators. It happened quickly, so it’s possible that the search field was not in the visual focus of the humans who were likely examining the list of cuisines (Figure 2). We consider this a true usability problem because (1) this behavior was driven by poor filter design and (2) it was unsuccessful—the word “sushi” was not on the page even though the cuisine filter was fully expanded.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/052626-F2V2.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47654" src="https://measuringu.com/wp-content/uploads/2026/05/052626-F2.png" alt="Frame from video showing Ctrl-F search field with first few letters of “sushi” typed at the top right of the screen and 28 of the 86 cuisine types on the left. " width="825" height="648" srcset="https://measuringu.com/wp-content/uploads/2026/05/052626-F2.png 825w, https://measuringu.com/wp-content/uploads/2026/05/052626-F2-300x236.png 300w, https://measuringu.com/wp-content/uploads/2026/05/052626-F2-768x603.png 768w, https://measuringu.com/wp-content/uploads/2026/05/052626-F2-600x471.png 600w" sizes="auto, (max-width: 825px) 100vw, 825px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Frame from video showing Ctrl-F search field with first few letters of “sushi” typed at the top right of the screen and 28 of the 86 cuisine types on the left.</p>
<h3>Seven False Alarms</h3>
<p>Next, the not-so-good news. When a researcher identifies something that happened, but it’s not really considered a problem, it’s referred to as a false alarm. Sometimes things are literally a feature and not a bug! From our interpretation, AIs generated seven false alarms (not too different from what <a href="https://measuringu.com/false-positives/">you sometimes see</a> with a group of human evaluators).</p>
<h4><strong>The seafood/sushi labeling issue (6b, 6c, 7b-GPT, 8b, 10a)</strong></h4>
<p>Five of the seven false alarms (6b, 6c, 7b-GPT, 8b, 10a) stemmed from ChatGPT taking the search for sushi restaurants too literally. After the participant searched for sushi, many OpenTable results were labeled &#8220;seafood.&#8221; ChatGPT flagged this repeatedly across multiple runs in different ways (e.g., weak cuisine presentation, non-sushi results surfacing, poor decision support), but they all trace back to the same fundamental observation. ChatGPT treated a restaurant as acceptable only if OpenTable labeled it as a sushi restaurant, rejecting restaurants that serve sushi but carry a different OpenTable label.</p>
<p>The restaurant the participant ultimately selected was labeled seafood, which led ChatGPT to declare task failure in three of four runs. The human reviewers took a more pragmatic view: the restaurant served sushi, so the participant successfully completed the task. Gemini flagged the same seafood/sushi labeling issue once (10a) but didn&#8217;t spiral into multiple variations of it.</p>
<h4><strong>Seating options not shown until after time selection (9b-Gem)</strong></h4>
<p>OpenTable withholds seating options until you pick a time. Given the range of possible seating configurations (inside, patio, bar, banquette, communal, high top, private, counter), showing them before a time is selected isn&#8217;t really feasible. And if a seating option doesn&#8217;t work out, the recovery path is low friction. Gemini flagged this as a problem. The human researchers recognized this as a design tradeoff rather than a usability problem.</p>
<h4><strong>Filters not helpful (4b-GPT)</strong></h4>
<p>We categorized this as a false alarm because it was overly vague. It’s true that there were issues with some filters (e.g., cuisine), but that was not true of all filters.</p>
<h3>Three Hallucinations</h3>
<p>In contrast to false alarms, which we consider misinterpretations of events that happened, a hallucination is when a problem is associated with something that just didn’t happen. We saw three of these.</p>
<h4><strong>AI claimed the participant incorrectly selected the highest price tier (7b-Gem)</strong></h4>
<p>From the narrative of the second Gemini run:</p>
<blockquote><p><em>The task required selecting a restaurant that was not the lowest or highest price point.</em></p>
<p><em>Problem: The participant chose Ocean Prime, which is a restaurant (the highest tier on the platform). At 05:13, the participant verbally identified this as &#8220;mid-range.&#8221;</em></p>
<p><em>User Impact: The participant technically failed this part of the task constraints.</em></p></blockquote>
<p><strong><em>This didn’t happen</em></strong>. Ocean Prime had a mid-range price designation.</p>
<h4><strong>AI claimed the participant set the reservation time for 5:10 pm (9c-Gem) </strong></h4>
<p>From the narrative of the second Gemini run:</p>
<blockquote><p><em>The participant selected Ocean Prime at 05:10</em></p></blockquote>
<p><strong><em>This didn’t happen</em></strong>. The participant, in accordance with the task instructions, selected 5:00 pm.</p>
<h4><strong>AI claimed the participant did not reach the reservation form (10b-ChatGPT)</strong></h4>
<p>From the narrative of the second ChatGPT run:</p>
<blockquote><p><em>By the end of the clip, they are still comparing list items and time slots; they do not appear to reach the restaurant detail/reservation form step.</em></p></blockquote>
<p><strong><em>This isn’t accurate</em></strong>. The clip ended with the participant selecting the reservation time, then standard dining room seating, then stopping before entering her personal information.</p>
<p>The good news is that there were only three hallucinations out of 11 AI-generated problems. The bad news is you can&#8217;t know which AI-generated problem descriptions were hallucinated without watching the video and reviewing all the problems yourself.</p>
<h2>Summary and Discussion</h2>
<p>In this article, we focused on qualitative similarities and differences in the usability problems listed by professional human UX researchers and two AIs (Gemini 3 Flash Thinking and ChatGPT-5.4 Thinking) after reviewing a video in which a participant made a restaurant reservation.</p>
<p>Our key findings were:</p>
<p><strong>False alarms and hallucinations dominate.</strong> Of the 11 problems the AIs generated that no human flagged, seven (64%) were false alarms, three (27%) were hallucinations, and one (9%) was a genuine find. That&#8217;s a useful number to keep in mind: roughly nine out of ten AI-only problems in this study required either correction or dismissal.</p>
<p><strong>AI adds value as a junior researcher, not a trusted expert.</strong> AI was able to find one problem (a participant had to use Ctrl-F) that was real and useful and not found by humans. But getting to it required reviewing ten other problems that ranged from technically true but irrelevant to simply fabricated. The ROI depends on how much that review costs you.</p>
<p><strong>Most false alarms came from a single fixation.</strong> Five of the seven traced back to ChatGPT interpreting &#8220;sushi restaurant&#8221; more literally than any human would. At least in this video and our criteria for what constitutes a problem, this is a systematic bias worth knowing about if you&#8217;re using these models for task-based evaluations.</p>
<p><strong>Hallucinations were infrequent but consequential.</strong> Three of the 11 AI-only problems (27%) were hallucinations. Although the count is low, this rate is probably too high for most applications. You can&#8217;t catch those without going back to the video, which means human review isn&#8217;t optional.</p>
<p><strong>Like human reviews, AI usability reviews of videos are prone to the “evaluator effect.” </strong>Multiple runs of an AI usability evaluation of the same video are not perfectly consistent, just as human evaluators differ from one another. Two of the three hallucinations came from the same Gemini run. Running an evaluation multiple times and looking for consistency across runs is a practical filter before any human review.</p>
<p><strong>Bottom line: AI usability reviews of videos require human oversight. </strong>In their current form (what we tested), these AI products can add value to this type of UX research, but more as junior researchers whose actions and conclusions require expert human oversight rather than as trusted experts themselves.</p>
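<p>One way to apply the consistency check mentioned in the findings is to count how many runs reported each problem and set aside those reported only once. The sketch below is a minimal illustration: the function name <code>consistent_problems</code>, the <code>min_runs</code> threshold, and the run contents (labels A through E) are all hypothetical and are not data from this study.</p>

```python
from collections import Counter


def consistent_problems(runs: list[set[str]], min_runs: int = 2) -> dict[str, int]:
    """Return problems reported in at least min_runs runs, with their run counts."""
    counts = Counter(problem for run in runs for problem in run)
    return {p: n for p, n in counts.items() if n >= min_runs}


# Hypothetical problem labels from four runs of one model (illustration only)
runs = [
    {"A", "B"},
    {"A", "C", "D"},
    {"A", "E"},
    {"A", "B"},
]

repeated = consistent_problems(runs, min_runs=2)
one_offs = set().union(*runs) - set(repeated)

print(sorted(repeated.items()))  # [('A', 4), ('B', 2)]
print(sorted(one_offs))          # ['C', 'D', 'E']
```

<p>This kind of filter only helps prioritize review. A problem reported once can still be real, and a problem reported in every run can still be a false alarm, as ChatGPT's repeated seafood/sushi flags show.</p>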
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How Many Years Does It Take to Become a Senior UX Researcher?</title>
		<link>https://measuringu.com/how-many-years-does-it-take-to-become-a-senior-ux-researcher/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-many-years-does-it-take-to-become-a-senior-ux-researcher</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 19 May 2026 22:15:45 +0000</pubDate>
				<category><![CDATA[Methods]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[UX Maturity]]></category>
		<category><![CDATA[experience]]></category>
		<category><![CDATA[Salary Survey]]></category>
		<category><![CDATA[UX Salary Survey]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47607</guid>

					<description><![CDATA[What does it take to become a senior UX researcher? An advanced degree? Particular experience and skills, like the number of moderated studies conducted or a variety of methods employed? While all those play a role, the type of job (in-house small-team, in-house large-team, solo researcher, or agency) can affect what you are exposed to. [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47633 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-300x169.png" alt="Feature image showing an entry-level UX researcher becoming a senior UX researcher over several years" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>What does it take to become a senior UX researcher?</p>
<p>An advanced degree? Particular experience and skills, like the number of moderated studies conducted or a variety of methods employed?</p>
<p>While all those play a role, the type of job (in-house small-team, in-house large-team, solo researcher, or agency) can affect what you are exposed to.</p>
<p>Certainly, most would agree that one to two years of experience seems too little time to demonstrate senior-level performance in UX research. We thought that something around five years of experience was a good benchmark. But is that warranted? What is a good number of years of experience?</p>
<p>There is no official rule book on titles. As is often the case when making decisions about jobs, we can use a few approaches:</p>
<ol>
<li>Principle-based: Set a rule based on a principle that disregards what people do.</li>
<li>Tradition and trends: Look to broader workforce trends, what others report, or guidance online.</li>
<li>Data: See what’s happening in practice if you have access to data.</li>
</ol>
<h2>Principle-Based</h2>
<p>Even though there isn’t an official designation, we can look broadly at how long it takes to master a skill or job like UX researcher. One benchmark used in popular culture (based on some research) and popularized by Malcolm Gladwell is the <a href="https://jamesclear.com/deliberate-practice-strategy">10,000-hour rule</a>: after about 10,000 hours of practice, you master a skill. That is a very rough guideline and <a href="https://www.bbc.com/future/article/20121114-gladwells-10000-hour-rule-myth">definitely has its critics</a>.</p>
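<p>To put 10,000 hours in terms of years on the job, here is a rough conversion. The workload figures (40 hours per week, 50 weeks per year) and the idea that all working hours count as practice are our assumptions for illustration, not part of the rule.</p>

```python
def years_to_reach(hours_target: float, hours_per_week: float,
                   weeks_per_year: float) -> float:
    """Years needed to accumulate hours_target at a steady weekly pace."""
    return hours_target / (hours_per_week * weeks_per_year)


# Assumed full-time pace: every working hour counts as practice
print(years_to_reach(10_000, 40, 50))  # 5.0

# Assumed half of each week spent on the skill itself
print(years_to_reach(10_000, 20, 50))  # 10.0
```

<p>Under the full-time assumption, the rule works out to about five years, and the estimate doubles if only half of each week counts as practice.</p>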
<h2>Tradition and Trends</h2>
<p>Seniority levels can <a href="https://www.indeed.com/career-advice/career-development/seniority-level">differ by industry type</a>. For example:</p>
<ul>
<li>The <a href="https://www.asce.org/career-growth/early-career-engineers/asce-guidelines-for-engineering-grades">American Society of Civil Engineers</a> has grades (from I to VIII) with <a href="https://www.asce.org/-/media/asce-images-and-files/career-and-growth/early-career-engineer/engineering-grades.pdf">detailed descriptions</a> of expected skills and responsibilities. For example, an engineer at Grade IV has at least four years of experience.</li>
<li><a href="https://hrsimple.com/law-firm-hierarchy-roles-and-career-paths/">Associates in a law firm</a> can be junior (1–3 years of experience), mid-level (4–6 years), or senior (7–10 years).</li>
<li>In <a href="https://strategycase.com/big-4-salaries/">large consulting firms</a>, senior associates typically have 2–5 years of experience.</li>
</ul>
<p>For the expected minimum number of years of experience for UX researchers, it makes sense to start with personal experience. In our decades of experience at large companies (IBM, Oracle, GE, Intuit, PeopleSoft), something like five years was a loose criterion. Below that, people would question the designation.</p>
<p>We carry a similar tradition at MeasuringU, and those with five years’ experience are considered senior. But at a tech-enabled agency, UX researchers typically conduct hundreds of moderated sessions and use a wide variety of methods such as unmoderated benchmarking, eye-tracking, in-depth interviews, diary studies, and surveys. A couple of years working here usually exposes a researcher to significantly more UX-related tasks than in a typical in-house role. At the same time, they are much less exposed to the very real job of navigating the politics of competing stakeholders and corporate hierarchies.</p>
<h2>Data: Salary Surveys, LinkedIn Profiles, and Job Posts</h2>
<p>Our preferred method is looking for data to guide decisions. We have three sources. The first is the biennial UXPA Salary Survey, which was last conducted in <a href="https://uxpa.org/salary-surveys/">2024</a>. The second is LinkedIn, which provides access to job titles and a crude way of determining years of experience. The third is requirements from recent job postings.</p>
<h3>UXPA Senior User Researcher Data</h3>
<p>The 2024 Salary Survey had 444 responses. Of those, 64% (276) described themselves as user researchers. Respondents could pick one of five employment levels. Table 1 shows that about half (130) of the user researchers classified themselves as “Senior-level, non-supervisory.”</p>

<table id="tablepress-1043" class="tablepress tablepress-id-1043">
<thead>
<tr class="row-1">
	<th class="column-1">Employment Level</th><th class="column-2">Number</th><th class="column-3"> %</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">Entry</td><td class="column-2"> 18</td><td class="column-3"> 7%</td>
</tr>
<tr class="row-3">
	<td class="column-1">Mid-level, non-supervisory</td><td class="column-2"> 73</td><td class="column-3">26%</td>
</tr>
<tr class="row-4">
	<td class="column-1">Mid-level, supervisory</td><td class="column-2"> 10</td><td class="column-3"> 4%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Senior-level, non-supervisory</td><td class="column-2">130</td><td class="column-3">47%</td>
</tr>
<tr class="row-6">
	<td class="column-1">Senior-level, supervisory</td><td class="column-2"> 45</td><td class="column-3">16%</td>
</tr>
<tr class="row-7">
	<td class="column-1">Total</td><td class="column-2">276</td><td class="column-3"></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1043 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1</strong>: Distribution of user researchers by employment level (2024 UXPA salary survey).</p>
<p>Respondents also selected their years of experience in response to the question “How long have you worked in this field (please round to the nearest year)” using the pre-determined buckets shown in Table 2.</p>

<table id="tablepress-1044" class="tablepress tablepress-id-1044">
<thead>
<tr class="row-1">
	<th class="column-1">Years of Experience</th><th class="column-2"># Senior</th><th class="column-3">% Senior</th><th class="column-4">% With More Experience</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1"> 0–2 yrs</td><td class="column-2"> 1</td><td class="column-3"> 1%</td><td class="column-4">99%</td>
</tr>
<tr class="row-3">
	<td class="column-1"> 3–4 yrs</td><td class="column-2">17</td><td class="column-3">13%</td><td class="column-4">86%</td>
</tr>
<tr class="row-4">
	<td class="column-1"> 5–7 yrs</td><td class="column-2">28</td><td class="column-3">22%</td><td class="column-4">65%</td>
</tr>
<tr class="row-5">
	<td class="column-1"> 8–10 yrs</td><td class="column-2">21</td><td class="column-3">16%</td><td class="column-4">48%</td>
</tr>
<tr class="row-6">
	<td class="column-1">11–15 yrs</td><td class="column-2">21</td><td class="column-3">16%</td><td class="column-4">32%</td>
</tr>
<tr class="row-7">
	<td class="column-1">16–20 yrs</td><td class="column-2">15</td><td class="column-3">12%</td><td class="column-4">21%</td>
</tr>
<tr class="row-8">
	<td class="column-1">21+ yrs</td><td class="column-2">27</td><td class="column-3">21%</td><td class="column-4"> 0%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1044 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2:</strong> Distribution of 130 non-supervisory senior-level user researchers by years of experience (2024 UXPA salary survey).</p>
<p>For example, only one person who reported being a senior user researcher had two years or fewer of experience. That means 99% had more than two years. The second row of the table shows that 17 had between three and four years of experience. Adding that to the one respondent with less experience gives 18 out of the 130 respondents. That means <strong>86% of non-supervisory senior user researchers reported 5 or more years of experience</strong>. Using the center of each experience bucket as a rough estimate of experience, the average number of years across the sample was 12–13 years. Of course, people may inflate their years of experience on an anonymous survey.</p>
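<p>The arithmetic behind Table 2 can be sketched in a few lines of Python. The counts come from the table; the bucket midpoints are rough stand-ins for each respondent's experience, and the value of 23 years for the open-ended 21+ bucket is purely an assumption for illustration (the survey data does not supply one).</p>

```python
# (label, assumed midpoint in years, count of senior respondents) from Table 2
buckets = [
    ("0-2 yrs", 1.0, 1),
    ("3-4 yrs", 3.5, 17),
    ("5-7 yrs", 6.0, 28),
    ("8-10 yrs", 9.0, 21),
    ("11-15 yrs", 13.0, 21),
    ("16-20 yrs", 18.0, 15),
    ("21+ yrs", 23.0, 27),  # ASSUMPTION: an open-ended bucket has no true midpoint
]

total = sum(count for _, _, count in buckets)

# "% With More Experience": share of respondents in the buckets above each row
more_experience = []
running = 0
for label, _, count in buckets:
    running += count
    more_experience.append((label, round(100 * (total - running) / total)))

# Share reporting five or more years (everyone above the first two buckets)
under_five = sum(count for _, mid, count in buckets if mid < 5)
pct_five_plus = 100 * (total - under_five) / total

# Rough mean: weight each bucket midpoint by its count
mean_years = sum(mid * count for _, mid, count in buckets) / total

for label, pct in more_experience:
    print(f"{label:>10}: {pct}% have more experience")
print(f"5+ years: {pct_five_plus:.0f}%  |  rough mean: {mean_years:.1f} years")
```

<p>Changing the assumed midpoint for the 21+ bucket shifts the mean by a year or so in either direction, which is why the estimate is best reported as a range.</p>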
<p>We also looked at UX designers in the UXPA dataset and found a similar pattern. Of the 56 UX designers who self-identified as senior, 87% had at least five years of experience (Table 3).</p>

<table id="tablepress-1045" class="tablepress tablepress-id-1045">
<thead>
<tr class="row-1">
	<th class="column-1">Years of Experience</th><th class="column-2"># Senior</th><th class="column-3">% Senior</th><th class="column-4">% With More Experience</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1"> 0–2 yrs</td><td class="column-2"> 1</td><td class="column-3"> 2%</td><td class="column-4">98%</td>
</tr>
<tr class="row-3">
	<td class="column-1"> 3–4 yrs</td><td class="column-2"> 6</td><td class="column-3">11%</td><td class="column-4">87%</td>
</tr>
<tr class="row-4">
	<td class="column-1"> 5–7 yrs</td><td class="column-2">13</td><td class="column-3">23%</td><td class="column-4">64%</td>
</tr>
<tr class="row-5">
	<td class="column-1"> 8–10 yrs</td><td class="column-2">14</td><td class="column-3">25%</td><td class="column-4">39%</td>
</tr>
<tr class="row-6">
	<td class="column-1">11–15 yrs</td><td class="column-2"> 8</td><td class="column-3">14%</td><td class="column-4">25%</td>
</tr>
<tr class="row-7">
	<td class="column-1">16–20 yrs</td><td class="column-2"> 6</td><td class="column-3">11%</td><td class="column-4">14%</td>
</tr>
<tr class="row-8">
	<td class="column-1">21+ yrs</td><td class="column-2">8</td><td class="column-3">14%</td><td class="column-4">0%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1045 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 3:</strong> Distribution of 56 non-supervisory senior-level UX designers by years of experience (2024 UXPA Salary Survey).</p>
<h3>LinkedIn Profiles</h3>
<p>Another approach is to look at how many years of experience senior UX researchers on LinkedIn have in their job history. While job dates can always be padded a bit, it’s a lot harder to claim unearned experience on a public professional forum. We did an informal examination, searching for “Senior UX Researcher” and hand-counting the years of non-supervisory experience for the first 50 profiles.</p>
<p>Across the 50 profiles, the average was a bit over nine years of experience (Table 4), and the minimum was just shy of five years at 4.75. Only four profiles (8%) had less than five years of experience. In other words, this crude estimate suggests that 92% of senior user researchers have five or more years of experience.</p>

<table id="tablepress-1046" class="tablepress tablepress-id-1046">
<tbody>
<tr class="row-1">
	<td class="column-1"><strong>Mean Years of Experience</strong></td><td class="column-2">9.1</td>
</tr>
<tr class="row-2">
	<td class="column-1"><strong>Min Years</strong></td><td class="column-2">4.75</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong># &lt; 5</strong></td><td class="column-2">4</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>% &lt; 5</strong></td><td class="column-2">8%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Total #</strong></td><td class="column-2">50</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>% ≥ 5</strong></td><td class="column-2">92%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1046 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 4:</strong> Analysis of 50 LinkedIn profiles of senior-level non-supervisory UX researchers.</p>
<h3>Job Posts</h3>
<p>Finally, we did another (very) informal search for senior UX researcher job postings (as of May 3, 2026) that were posted on Indeed. Of the five we found, all explicitly required five or more years of experience.</p>
<h2>Discussion and Summary</h2>
<p>There’s no official rule for what makes a UX researcher senior, but multiple approaches point to a consistent answer: at least five years.</p>
<ul>
<li><strong>Principle-based heuristics are consistent with five.</strong> Guidelines loosely based on research (like the 10,000-hour rule) suggest it takes about <strong>five years of focused experience</strong> to develop expertise. This is a weak rationale, but it&#8217;s a starting point.</li>
<li><strong>Tradition and trends suggest five.</strong> In our experience in the industry, it’s common to use <strong>five years as a minimum threshold</strong>. Other industries fall close to the five-year threshold as well.</li>
<li><strong>Salary survey data supports five.</strong> In the 2024 UXPA Salary Survey, <strong>86% of senior UX researchers reported five or more years of experience</strong>, with an average of around 12–13 years. Of the senior UX designers, an adjacent role in the UX industry, 87% reported five or more years of experience.</li>
<li><strong>Existing profiles and open jobs show five+ years.</strong> Our LinkedIn sample of 50 senior UX researchers showed similar results, with <strong>92% at five or more years of experience</strong> and an average of about nine years. Finally, a selection of five currently open senior UX researcher jobs on Indeed all explicitly require at least five years of experience.</li>
</ul>
<p>If you’re looking to set a threshold for becoming senior, five years seems like a good rule.</p>
<p>Of course, years alone don’t define seniority, but if someone has fewer than five years of experience, the <em>senior</em> title should be the exception, not the rule.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Interpret a Rating Scale Without Historical Data</title>
		<link>https://measuringu.com/how-to-interpret-a-rating-scale-without-historical-data/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-to-interpret-a-rating-scale-without-historical-data</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 12 May 2026 20:19:38 +0000</pubDate>
				<category><![CDATA[Benchmarking]]></category>
		<category><![CDATA[Rating Scale]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Rating Scales]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47556</guid>

					<description><![CDATA[UX researchers use a lot of rating scales. We recommend using standardized rating scales when possible. One of the benefits of some standardized scales, such as the SUS, SUPR-Q®, and UX-Lite®, is that you have a reference database of historical data. But there’s not always a standardized questionnaire for everything you’re hoping to measure, so [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47595 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-300x169.png" alt="Feature image showing a UX researcher interpreting a rating scale without historical data" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>UX researchers use a lot of rating scales. We recommend using standardized rating scales when possible. One of the benefits of some standardized scales, such as the SUS, SUPR-Q<sup>®</sup>, and UX-Lite<sup>®</sup>, is that you have a reference database of historical data.</p>
<p>But there’s not always a standardized questionnaire for everything you’re hoping to measure, so researchers need to create <a href="https://en.wikipedia.org/wiki/Ad_hoc">ad hoc</a> ones.</p>
<p>Data collected with ad hoc rating scales can be difficult to interpret, especially if you don’t have any historical data (e.g., from past product performance or competitors).</p>
<p>If you’re comparing multiple conditions (e.g., ratings on attributes for two or more websites), then you can check for significant differences in rating scale means.</p>
<p>But even clear differences in means don’t answer the question about whether a given mean indicates a poor or good user experience.</p>
<p>In this article, we provide a way to interpret five- and seven-point UX rating scales when you don’t have enough historical data for custom benchmarks. We use the well-known distribution of the System Usability Scale (<a href="https://measuringu.com/10-things-sus/">SUS</a>) as the basis for our recommendation.</p>
<h2>UX Rating Scales Tend to Be Negatively Skewed</h2>
<p>If you’ve never plotted your distributions of rating scale response options, you should. But don’t be surprised when you see a negatively skewed distribution (tail of data points to the left).</p>
<p>Most UX rating scales have this negative skew because (1) most item stems have a positive tone (e.g., “I felt very confident using this website”) and (2) respondents are <a href="https://dl.acm.org/doi/pdf/10.1145/175276.175282">generally more likely to agree</a> (selecting higher responses). This means that the middle value (e.g., a 3 on a five-point scale) isn’t a good measure of the “average.” This skew doesn’t make the responses necessarily bad or not useful. It just means you need to account for that skew when interpreting them.</p>
<p>For example, you can see the skew in distributions of SUS scores, for which 50 is the middle of the scale (Figure 1), but is not the middle of the distribution (68 is the median).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure1-1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47597 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/Figure1-1.png" alt="Figure 1: Distribution of 3,187 individual SUS scores (50 is the middle of the scale, but the median is 68)." width="1200" height="719" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure1-1.png 1200w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-300x180.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-1024x614.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-768x460.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-600x360.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Distribution of 3,187 individual SUS scores (50 is the middle of the scale, but the median is 68).</p>
<h2>Default Benchmarks Based on Historical SUS Distribution</h2>
<p>Taking advantage of the well-known distribution of the SUS, we created a curved grading scale that is <a href="https://www.researchgate.net/publication/324116412_The_System_Usability_Scale_Past_Present_and_Future">widely used in UX research</a> (Table 1). We’ll use this as a basis for interpreting ad hoc scales.</p>

<table id="tablepress-1040" class="tablepress tablepress-id-1040">
<thead>
<tr class="row-1">
	<th class="column-1">SUS Score Range</th><th class="column-2">Grade</th><th class="column-3">Percentile Range</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">84.1–100</td><td class="column-2">A+</td><td class="column-3">96–100</td>
</tr>
<tr class="row-3">
	<td class="column-1">80.8–84.0</td><td class="column-2">A</td><td class="column-3">90–95</td>
</tr>
<tr class="row-4">
	<td class="column-1">78.9–80.7</td><td class="column-2">A−</td><td class="column-3">85–89</td>
</tr>
<tr class="row-5">
	<td class="column-1">77.2–78.8</td><td class="column-2">B+</td><td class="column-3">80–84</td>
</tr>
<tr class="row-6">
	<td class="column-1">74.1–77.1</td><td class="column-2">B</td><td class="column-3">70–79</td>
</tr>
<tr class="row-7">
	<td class="column-1">72.6–74.0</td><td class="column-2">B−</td><td class="column-3">65–69</td>
</tr>
<tr class="row-8">
	<td class="column-1">71.1–72.5</td><td class="column-2">C+</td><td class="column-3">60–64</td>
</tr>
<tr class="row-9">
	<td class="column-1">65.0–71.0</td><td class="column-2">C</td><td class="column-3">41–59</td>
</tr>
<tr class="row-10">
	<td class="column-1">62.7–64.9</td><td class="column-2">C−</td><td class="column-3">35–40</td>
</tr>
<tr class="row-11">
	<td class="column-1">51.7–62.6</td><td class="column-2">D</td><td class="column-3">15–34</td>
</tr>
<tr class="row-12">
	<td class="column-1"> 0.0–51.6</td><td class="column-2">F</td><td class="column-3"> 0–14</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1040 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Curved grading scale for the SUS.</p>
<p>The 50<sup>th</sup> percentile of this scale is a SUS score of 68, a solid C. Another important benchmark commonly used in practice is an aspirational score of 80 (the upper end of an A−, a bit higher than the 85<sup>th</sup> percentile). Scores lower than 51.7 are in the F range (just below the 15<sup>th</sup> percentile).</p>
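<p>As a convenience, the grade bands in Table 1 can be turned into a small lookup. The sketch below is one possible encoding, not an official implementation: the function name is hypothetical, the grades use an ASCII hyphen in place of the minus sign, and a score that falls between two printed bands (e.g., 84.05) is assigned to the lower band as an assumption.</p>

```python
# Lower bound of each grade band from Table 1, ordered from highest to lowest.
SUS_GRADE_BANDS = [
    (84.1, "A+"), (80.8, "A"), (78.9, "A-"),
    (77.2, "B+"), (74.1, "B"), (72.6, "B-"),
    (71.1, "C+"), (65.0, "C"), (62.7, "C-"),
    (51.7, "D"), (0.0, "F"),
]


def sus_grade(score: float) -> str:
    """Return the curved letter grade for a mean SUS score (0-100)."""
    if not 0 <= score <= 100:
        raise ValueError("SUS scores range from 0 to 100")
    # The first band whose lower bound the score reaches is the grade.
    for lower_bound, grade in SUS_GRADE_BANDS:
        if score >= lower_bound:
            return grade
    return "F"  # unreachable for valid scores; kept as a safe default


for example in (68, 80, 51.6):
    print(example, sus_grade(example))
```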
<p>Based on the SUS research, when we consult with clients who need a benchmark for five- or seven-point scales and there is no historical data, we usually recommend setting the benchmark for average at about 70% of the range of the scale, good at 80%, and poor at 50%, similar to the historical benchmarks for the SUS. For example, this is what we did when we created our <a href="https://measuringu.com/grading-scales-for-the-ux-lite/">standard grading scale for the UX-Lite</a>.</p>
<p>Table 2 shows those values for five- and seven-point scales (the midpoint for a five-point scale is 3, and for a seven-point scale is 4).</p>

<table id="tablepress-1042" class="tablepress tablepress-id-1042">
<thead>
<tr class="row-1">
	<th class="column-1">Location on Scale</th><th class="column-2">Interpretation</th><th class="column-3">Five-point</th><th class="column-4">Seven-point</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">80%</td><td class="column-2">Good</td><td class="column-3">4.2</td><td class="column-4">5.8</td>
</tr>
<tr class="row-3">
	<td class="column-1">70%</td><td class="column-2">Average</td><td class="column-3">3.8</td><td class="column-4">5.2</td>
</tr>
<tr class="row-4">
	<td class="column-1">60%</td><td class="column-2">Below Average</td><td class="column-3">3.4</td><td class="column-4">4.6</td>
</tr>
<tr class="row-5">
	<td class="column-1">50%</td><td class="column-2">Poor</td><td class="column-3">3.0</td><td class="column-4">4.0</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1042 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2:</strong> Initial benchmarks at 50%, 60%, 70%, and 80% of the range of five- and seven-point scales.</p>
<p>The formula for computing these values comes from the <a href="https://measuringu.com/converting-scales-to-100-points/">methods for interpolating rating scale scores</a> from a scale that starts at 1 to a 0–100-point scale. Rearranged algebraically, it gives the benchmark on the rating scale from two inputs: the target location in the range (e.g., 80% of the scale, entered as 80) and the maximum possible value of the rating scale (typically 5 or 7 for scales that start at 1):</p>
<p>Benchmark = Target / (100 / (MaxRating − 1)) + 1</p>
<p>For example, to find 70% of the range of a five-point scale, the benchmark would be:</p>
<p>70 / (100 / (5 − 1)) + 1 = 70 / 25 + 1 = 3.8</p>
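<p>The benchmark formula is easy to script. The following is a minimal sketch that assumes, as the formula does, a scale starting at 1; the function name is hypothetical. Looping over the four target locations should reproduce the values in Table 2.</p>

```python
def benchmark(target: float, max_rating: int) -> float:
    """Rating-scale value at `target` percent of the range.

    Assumes the scale starts at 1, so the range is max_rating - 1.
    """
    return target / (100 / (max_rating - 1)) + 1


# Rebuild the five- and seven-point columns of the benchmark table.
for target in (80, 70, 60, 50):
    five = round(benchmark(target, 5), 1)
    seven = round(benchmark(target, 7), 1)
    print(f"{target}%: five-point = {five}, seven-point = {seven}")
```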
<p>An <a href="https://measuringu.com/types-of-100-point-scales/">alternative approach</a> is to convert five- or seven-point ratings to a 0–100-point scale. John Brooke, the developer of the SUS, <a href="https://uxpajournal.org/sus-a-retrospective/">described the value of this approach</a>: “Project managers, product managers, and engineers were more likely to understand a scale that went from 0 to 100 than one that went from 10 to 50, and the important thing was to be able to grab their attention in the short space of time they were likely to spend thinking about usability, without having to go into a detailed explanation.”</p>
<p>The general formula for converting a five- or seven-point scale to 0–100 points is:</p>
<p>Rating100 = (Rating − 1) * 100 / (MaxRating − 1)</p>
<p>For example, a five-point mean rating of 4.2 would become 80:</p>
<p>(4.2 − 1) * (100 / (5 − 1)) = 3.2(25) = 80</p>
<p>A seven-point mean rating of 4.0 would become 50:</p>
<p>(4 − 1) * (100 / (7 − 1)) = 3(16.67) ≈ 50</p>
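<p>The conversion in the other direction can be sketched the same way. This assumes a scale that starts at 1, and the function name is hypothetical. Results are compared with a small tolerance because floating-point arithmetic can leave tiny rounding residue.</p>

```python
import math


def to_100_points(rating: float, max_rating: int) -> float:
    """Rescale a rating from a 1..max_rating scale to 0-100 points."""
    return (rating - 1) * 100 / (max_rating - 1)


# The two worked examples from the text.
five_point = to_100_points(4.2, 5)   # five-point mean of 4.2
seven_point = to_100_points(4.0, 7)  # seven-point mean of 4.0

print(round(five_point, 1), round(seven_point, 1))
print(math.isclose(five_point, 80), math.isclose(seven_point, 50))
```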
<p><strong><em>Caveat</em></strong><em><strong>:</strong> Note that these are initial benchmarks to use when UX researchers lack a more grounded rationale for interpreting mean rating scale scores. After a reasonable amount of data collection with the scale, it’s a good idea to revisit the initial benchmarks to see whether they should be adjusted.</em></p>
<h2>Summary</h2>
<p>When you&#8217;re working with an ad hoc rating scale and have no historical data to lean on, the SUS distribution gives you a principled starting point. Because UX rating scales share a consistent negative skew (driven by positive item wording and respondent agreement bias), benchmarks derived from the SUS translate reasonably well to other five- and seven-point scales. It’s not that there’s something magic about the SUS. It works well because it’s a composite of ten five-point UX rating scales that share the tendency of other UX rating scales to be negatively skewed (more favorable than unfavorable). This means that benchmarks informed by the SUS provide a good initial approximation for other UX rating scales.</p>
<p>The benchmark choices that this pattern supports are:</p>
<ul>
<li>Setting “Poor” at or below the midpoint of the scale (50% of the range) because means of positive-tone UX rating scales are consistently higher than the scale midpoint.</li>
<li>Setting “Good” at 80% of the scale range, which is the <a href="https://www.uslanguageservices.com/guides-resources/understanding-the-u-s-grading-system/">traditional score for a B</a> (above average).</li>
</ul>
<p>Placing other cut points between 50% and 80% leads to these initial benchmarks:</p>
<ul>
<li><strong>Good</strong>: Located at <strong>80%</strong> of the range of the scale</li>
<li><strong>Average</strong>: Located at <strong>70%</strong> of the range of the scale</li>
<li><strong>Below average</strong>: Located at <strong>60%</strong> of the range of the scale</li>
<li><strong>Poor</strong>: Located at <strong>50%</strong> of the range of the scale (the midpoint)</li>
</ul>
<p>It’s important to keep in mind that these are reasonable best guesses without a strong normative database. For UX rating scale items that will be used frequently over time, researchers should plan to build normative databases and use them to tune the benchmarks (like we have <a href="https://measuringu.com/evolution-of-seq/">done with the SEQ<sup>®</sup></a>).</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Can AI Detect Usability Problems Like Researchers?</title>
		<link>https://measuringu.com/ai-vs-human-usability-problem-analysis-of-a-video/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ai-vs-human-usability-problem-analysis-of-a-video</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD • Jeff Sauro, PhD • Will Schiavone, PhD • Lucas Plabst, PhD]]></dc:creator>
		<pubDate>Wed, 06 May 2026 04:25:45 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[Gemini]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47502</guid>

					<description><![CDATA[AI can “watch” videos. It can even generate a list of problems. In some cases, these problem lists seem to be reasonably consistent (reliable). But consistency is not accuracy. Are these real problems or just sophisticated AI slop generated consistently by autocorrect for video? How can we know? One way to find out is to [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Feature050526.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47551 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/Feature050526-300x169.png" alt="Feature image showing a count of problems found by AI versus human researchers" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/Feature050526-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526.png 1280w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>AI can <a href="https://measuringu.com/can-ai-detect-usability-problems">“watch” videos</a>.</p>
<p>It can even generate a list of problems. In some cases, these problem lists seem to be <a href="https://measuringu.com/ai-usability-problem-analysis-of-a-video">reasonably consistent (reliable)</a>.</p>
<p>But consistency is not accuracy. Are these real problems or just sophisticated AI slop generated consistently by autocorrect for video?</p>
<p>How can we know? One way to find out is to compare the AI problem lists to those created by trained UX researchers.</p>
<p><strong>Are the problems an AI finds the same problems a UX researcher would find?</strong></p>
<p>In this article, we move from reliability to <strong>validity</strong> by comparing the problems identified by AI to those found by human UX researchers reviewing the same video.</p>
<h2>Humans vs. AI: Same Video, Same Task</h2>
<p>For this study, four UX researchers at MeasuringU independently reviewed a roughly six-minute usability test video and created lists of observed usability problems. The primary evaluator had over 40 years of experience coding usability problems, while the other three, at the time of the study, each had less than a year of experience. The video they watched was the same one reviewed by two LLMs in our previous assessment of AI reliability (ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking, four runs per LLM to assess reliability). The participant’s task was to use OpenTable.com to book a reservation:</p>
<blockquote><p>“Make a reservation for four people at a sushi restaurant in Denver, CO tomorrow anytime after 5:00pm. Make sure the restaurant you select is not at the lowest or highest price point. Of the restaurants that fit these criteria, look at their overall rating, customer reviews, and photos to select the one that is the most appealing to you. Go as far as you can in the reservation process until you are asked for your personal information or account details. DO NOT fully confirm the reservation. Write down the restaurant name and the time of the reservation. You will be asked about this information after the task.”</p></blockquote>
<p>The directions for the human evaluators matched the prompt given to the LLMs:</p>
<blockquote><p>“During a usability test, the facilitator must keep track of participant behaviors as they navigate through tasks on a website, mobile app, software program, etc. We’d like you to watch a video of a usability test where participants were asked to book a table at a Sushi restaurant. As you&#8217;re watching, please look for problems the participant has while attempting to complete the task. For example, you can document the path users take, describe issues they encounter as well as what on the website might be causing problems.”</p></blockquote>
<h3>Coding and Matching Problems</h3>
<p>The four researchers independently created their problem lists. The senior researcher (Evaluator 1) then reviewed, matched, and consolidated the problems as shown in Table 1.</p>

<table id="tablepress-1034" class="tablepress tablepress-id-1034">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">Human Evaluators Problem List</th><th class="column-3">Eval 1</th><th class="column-4">Eval 2</th><th class="column-5">Eval 3</th><th class="column-6">Eval 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-3">
	<td class="column-1">2</td><td class="column-2">Odd display of email addresses upon click in search field</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-4">
	<td class="column-1">3</td><td class="column-2">Avoided search field and looked in filters to try to change location</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-5">
	<td class="column-1">4</td><td class="column-2">Entering Denver in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-6">
	<td class="column-1">5</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-7">
	<td class="column-1">6</td><td class="column-2">Surprised when typing sushi into search field did not lose current location</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-8">
	<td class="column-1">7</td><td class="column-2">Participant seemed to miss price point filter; sorted on ratings and examined price points in descriptions</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-9">
	<td class="column-1">8</td><td class="column-2">Participant wanted to change sort to lowest rating first but not an option</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-10">
	<td class="column-1">9</td><td class="column-2">Despite having selected 5pm at start of process user needed to reselect it later</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-11">
	<td class="column-1"><i>Total</i></td><td class="column-2"></td><td class="column-3"><center><i>6</i></center></td><td class="column-4"><center><i>5</i></center></td><td class="column-5"><center><i>8</i></center></td><td class="column-6"><center><i>7</i></center></td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Human evaluators&#8217; problem list.</p>
<p>Nine total problems were identified, none of which was classified as a false alarm by Evaluator 1. Four problems (3, 4, 5, and 6) were identified by all four UX researchers. Two problems (1, 8) were identified by three evaluators, one problem (7) by two evaluators, and two problems (2, 9) by one evaluator.</p>
<h3><span lang="EN-US">High Reliability for Humans</span></h3>
<p>With the consolidated problem list, we computed the <a href="https://measuringu.com/assessing-interrater-reliability-in-ux-research/">any-2 agreement</a> across all pairs of evaluators, as shown in Table 2. When evaluators produce different problem lists, any-2 agreement is a better measure of interrater reliability than kappa.</p>

<table id="tablepress-1035" class="tablepress tablepress-id-1035">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Eval 1</th><th class="column-3">Eval 2</th><th class="column-4">Eval 3</th><th class="column-5">Eval 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Eval 1</td><td class="column-2"> x</td><td class="column-3">57%</td><td class="column-4">56%</td><td class="column-5">63%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Eval 2</td><td class="column-2">57%</td><td class="column-3"> x</td><td class="column-4">63%</td><td class="column-5">71%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Eval 3</td><td class="column-2">56%</td><td class="column-3">63%</td><td class="column-4"> x</td><td class="column-5">88%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Eval 4</td><td class="column-2">63%</td><td class="column-3">71%</td><td class="column-4">88%</td><td class="column-5"> x</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2: </strong>Any-2 agreement for the human evaluators.</p>
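<p>The pairwise values in Table 2 follow directly from the problem sets in Table 1. As a minimal sketch (<code>human_lists</code>, <code>any2</code>, and <code>pct</code> are illustrative names, not part of any published tooling), the following Python recomputes each pair and the mean across the six pairs. It assumes percentages are rounded half up, because Python's built-in <code>round()</code> rounds halves to even.</p>

```python
from itertools import combinations

# Problem numbers reported by each evaluator, transcribed from Table 1.
human_lists = {
    "Eval 1": {1, 3, 4, 5, 6, 9},
    "Eval 2": {3, 4, 5, 6, 8},
    "Eval 3": {1, 2, 3, 4, 5, 6, 7, 8},
    "Eval 4": {1, 3, 4, 5, 6, 7, 8},
}


def any2(a, b):
    """Shared problems divided by all distinct problems across both lists."""
    return len(a & b) / len(a | b)


def pct(value):
    """Whole-number percentage, rounding halves up (62.5 becomes 63)."""
    return int(value * 100 + 0.5)


# One any-2 value per unordered pair of evaluators (six pairs for four people).
pairwise = {
    (x, y): any2(human_lists[x], human_lists[y])
    for x, y in combinations(human_lists, 2)
}
mean_any2 = sum(pairwise.values()) / len(pairwise)

for (x, y), value in pairwise.items():
    print(f"{x} vs {y}: {pct(value)}%")
print(f"Mean any-2: {pct(mean_any2)}%")
```

<p>The mean is taken over the unrounded pairwise values, so it can differ slightly from an average of the rounded table cells.</p>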
<p>The average any-2 agreement across all pairs was 66%. Based on our data, the general rule of thumb for interpreting any-2 agreement is that 50% is typical, 25% is low, and 75% is high.</p>
<p>That means the reliability of the human evaluators was <strong>relatively high</strong>, likely because some of the usability problems in the list were quite salient (4/9 identified by all four evaluators, 6/9 identified by at least three evaluators).</p>
<p>In our <a href="https://measuringu.com/ai-usability-problem-analysis-of-a-video">previous study</a> of AI analysis, the reliability of ChatGPT was relatively low (31%) while Gemini was above average (57%).</p>
<h2><span lang="EN-US">Agreement Was Low Between AIs and Humans</span></h2>
<p>We created consolidated problem lists for ChatGPT and Gemini by combining results across four runs and matching them to the human-identified problems. Problems labeled “ChatGPT” or “Gem” are unique to those systems. Problems without labels were also found by humans.</p>
<h3><span lang="EN-US">ChatGPT Validity</span></h3>
<p>Table 3 shows the combined problem list for the four runs of ChatGPT. It included five problems from the human list and seven unique problems. Table 4 shows the human by ChatGPT any-2 agreement.</p>

<table id="tablepress-1036" class="tablepress tablepress-id-1036">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">ChatGPT Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-3">
	<td class="column-1">4</td><td class="column-2">Entering Denver in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-4">
	<td class="column-1">4b-ChatGPT</td><td class="column-2">Filters not helpful</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-5">
	<td class="column-1">5</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-6">
	<td class="column-1">6</td><td class="column-2">Surprised when typing sushi into search field did not lose current location</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">6b-ChatGPT</td><td class="column-2">Search results for sushi included many non-sushi restaurants</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</td>
</tr>
<tr class="row-8">
	<td class="column-1">6c-ChatGPT</td><td class="column-2">Weak presentation of cuisine information in search results</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-9">
	<td class="column-1">7</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">7b-ChatGPT</td><td class="column-2">Sorting by highest rated put many non-sushi restaurants at the top of the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</td>
</tr>
<tr class="row-11">
	<td class="column-1">8b-ChatGPT</td><td class="column-2">UI pushes browsing without good decision support</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-12">
	<td class="column-1">10a-ChatGPT</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-13">
	<td class="column-1">10b-ChatGPT</td><td class="column-2">Task not completed because did not reach reservation form</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-14">
	<td class="column-1"><i>Total</td><td class="column-2"></td><td class="column-3"><i><center>7</td><td class="column-4"><i><center>6</td><td class="column-5"><i><center>6</td><td class="column-6"><i><center>4</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 3: </strong>ChatGPT evaluations problem list (problems tagged with ChatGPT were not reported by humans).</p>

<table id="tablepress-1037" class="tablepress tablepress-id-1037">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">ChatGPT 1</th><th class="column-3">ChatGPT 2</th><th class="column-4">ChatGPT 3</th><th class="column-5">ChatGPT 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Eval 1</td><td class="column-2">18%</td><td class="column-3">33%</td><td class="column-4">33%</td><td class="column-5">0%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Eval 2</td><td class="column-2"> 9%</td><td class="column-3">22%</td><td class="column-4">22%</td><td class="column-5">0%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Eval 3</td><td class="column-2">25%</td><td class="column-3">27%</td><td class="column-4">27%</td><td class="column-5">0%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Eval 4</td><td class="column-2">27%</td><td class="column-3">30%</td><td class="column-4">30%</td><td class="column-5">0%</td>
</tr>
<tr class="row-6">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tbody>
<tfoot>
<tr class="row-7">
	<th class="column-1">Mean:</th><th class="column-2">19%</th><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tfoot>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 4: </strong>Any-2 agreement for human with ChatGPT evaluations.</p>
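<p>The same any-2 calculation extends to a cross-group comparison, where each human evaluator is paired with each ChatGPT run (16 pairs) instead of with the other humans. The sketch below uses problem labels transcribed from Tables 1 and 3. The variable names are illustrative, and the labels are shortened (for example, "6b" for 6b-ChatGPT).</p>

```python
# Problem labels per human evaluator (Table 1) and per ChatGPT run (Table 3).
humans = {
    "Eval 1": {"1", "3", "4", "5", "6", "9"},
    "Eval 2": {"3", "4", "5", "6", "8"},
    "Eval 3": {"1", "2", "3", "4", "5", "6", "7", "8"},
    "Eval 4": {"1", "3", "4", "5", "6", "7", "8"},
}
chatgpt_runs = {
    "ChatGPT 1": {"1", "4", "6b", "6c", "7", "8b", "10a"},
    "ChatGPT 2": {"1", "4", "4b", "6", "8b", "10b"},
    "ChatGPT 3": {"1", "4", "5", "6c", "8b", "10a"},
    "ChatGPT 4": {"6b", "6c", "7b", "10a"},
}


def any2(a, b):
    """Shared problems divided by all distinct problems across both lists."""
    return len(a & b) / len(a | b)


# Every human-by-run pair; pairs within the same group are not included.
cross = {
    (h, r): any2(h_set, r_set)
    for h, h_set in humans.items()
    for r, r_set in chatgpt_runs.items()
}
mean_cross = sum(cross.values()) / len(cross)

for h in humans:
    row = "  ".join(f"{cross[(h, r)]:.0%}" for r in chatgpt_runs)
    print(f"{h}: {row}")
print(f"Mean: {mean_cross:.0%}")
```

<p>Run 4 scores 0% against every evaluator because all four problems it reported were unique to ChatGPT, so its intersection with each human list is empty.</p>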
<h3><span lang="EN-US">Gemini Validity</span></h3>
<p>Table 5 shows the combined problem list for the four runs of Gemini. It included four problems from the human list and five unique problems. Table 6 shows the human by Gemini any-2 agreement.</p>

<table id="tablepress-1038" class="tablepress tablepress-id-1038">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">Gemini Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">4</td><td class="column-2">Entering Denver in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-3">
	<td class="column-1">5</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-4">
	<td class="column-1">5b-Gem</td><td class="column-2">Participant used Ctrl-F to search page for "sushi"—not found</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-5">
	<td class="column-1">7</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-6">
	<td class="column-1">7b-Gem</td><td class="column-2">Participant chose highest price tier</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">8</td><td class="column-2">Participant wanted to change sort to lowest rating first but not an option</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-8">
	<td class="column-1">9b-Gem</td><td class="column-2">Seating options only presented after selecting time</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-9">
	<td class="column-1">9c-Gem</td><td class="column-2">Set time to 5:10</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">10a-Gem</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-11">
	<td class="column-1"><i>Total</td><td class="column-2"></td><td class="column-3"><i><center>5</td><td class="column-4"><i><center>6</td><td class="column-5"><i><center>5</td><td class="column-6"><i><center>4</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 5: </strong>Gemini evaluations problem list (problems tagged with Gem were not reported by humans).</p>

<table id="tablepress-1039" class="tablepress tablepress-id-1039">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Gem 1</th><th class="column-3">Gem 2</th><th class="column-4">Gem 3</th><th class="column-5">Gem 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">Eval 1</td><td class="column-2">29%</td><td class="column-3">29%</td><td class="column-4">25%</td><td class="column-5">29%</td>
</tr>
<tr class="row-3">
	<td class="column-1">Eval 2</td><td class="column-2">33%</td><td class="column-3">33%</td><td class="column-4">50%</td><td class="column-5">33%</td>
</tr>
<tr class="row-4">
	<td class="column-1">Eval 3</td><td class="column-2">38%</td><td class="column-3">22%</td><td class="column-4">50%</td><td class="column-5">38%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Eval 4</td><td class="column-2">43%</td><td class="column-3">25%</td><td class="column-4">57%</td><td class="column-5">43%</td>
</tr>
<tr class="row-6">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tbody>
<tfoot>
<tr class="row-7">
	<th class="column-1">Mean:</th><th class="column-2">36%</th><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tfoot>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 6: </strong>Any-2 agreement for human with Gemini evaluations.</p>
<p>We found the <strong>poorest agreement between human and ChatGPT evaluations</strong> (19%). Agreement between human and Gemini evaluations (36%) was substantially higher but still relatively low.</p>
<p>These agreement rates account for problems AI found that humans didn’t. We treated AI-discovered problems as if they were real (for now), but they could have been false positives (an error humans make, too). What is a “real” problem? That’s been a long-standing research question. For now, we’re relying on the senior researcher to determine the real problems. That human-verified problem list is how we’ll evaluate the AIs.</p>
<h3><span lang="EN-US">Did AI Find the Same Problems as Researchers?</span></h3>
<p>We can use the human-generated and verified problem lists as the “gold-standard” and assess AI’s “hit-rate” as another measure of validity beyond any-2 agreement. The four human evaluators identified nine usability problems. <strong>ChatGPT identified five, and Gemini identified four. </strong>Figure 1 shows the problems identified by human evaluators and how well both AI models identified them. We consolidated the runs, counting a problem if it was found at least once across any of the four runs.</p>
<p>Four of the problems were found by all four researchers, suggesting they were more salient problems. ChatGPT uncovered three of these four, and Gemini uncovered two.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47541" src="https://measuringu.com/wp-content/uploads/2026/05/Figure1-300x83.png" alt="How well AI models found usability problems identified by researchers. " width="1200" height="331" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure1-300x83.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1024x282.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-768x212.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1536x423.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-600x165.png 600w, https://measuringu.com/wp-content/uploads/2026/05/Figure1.png 1971w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1: </strong>How well AI models found usability problems identified by researchers.</p>
<p>Comparing AI to a pooled set of problems from four researchers may not be a fair comparison. We should also consider how well AI does compared to each individual researcher. Figure 1 shows that, for example, ChatGPT identified four of the six problems identified by the senior evaluator, while Gemini uncovered two of the six. Across the four evaluators, ChatGPT identified between <strong>60% and 71% </strong>of each researcher’s usability problems, and Gemini identified between <strong>33% and 60%</strong> (see right side of Figure 1).</p>
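<p>The per-evaluator hit rates can be recomputed from Tables 1, 3, and 5. In this sketch, a hit rate is the share of one evaluator's problems that the AI found in at least one of its four runs. The names <code>hit_rate</code>, <code>ai_found</code>, and <code>pct</code> are illustrative, and rounding halves up is an assumption.</p>

```python
# Problem numbers reported by each evaluator, transcribed from Table 1.
human_lists = {
    "Eval 1": {1, 3, 4, 5, 6, 9},
    "Eval 2": {3, 4, 5, 6, 8},
    "Eval 3": {1, 2, 3, 4, 5, 6, 7, 8},
    "Eval 4": {1, 3, 4, 5, 6, 7, 8},
}
# Human-list problems found in at least one of the four runs (Tables 3 and 5).
ai_found = {
    "ChatGPT": {1, 4, 5, 6, 7},
    "Gemini": {4, 5, 7, 8},
}


def hit_rate(reference, found):
    """Share of the reference problems that also appear in the found set."""
    return len(reference & found) / len(reference)


def pct(value):
    """Whole-number percentage, rounding halves up (62.5 becomes 63)."""
    return int(value * 100 + 0.5)


# Pooled list: every problem reported by at least one human evaluator.
pooled = set().union(*human_lists.values())

for model, found in ai_found.items():
    print(f"{model} vs pooled list: {len(pooled & found)} of {len(pooled)}")
    for name, problems in human_lists.items():
        print(f"  {name}: {pct(hit_rate(problems, found))}%")
```

<p>Unlike any-2 agreement, a hit rate ignores the problems that only the AI reported, so it measures coverage of the human list, not overall overlap.</p>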
<p>Figure 2 is a Venn diagram that shows the overlap in problems found between both AIs and between AIs and humans. AIs generated eleven problems not identified by any of the four researchers, and there were three problems identified by humans only. ChatGPT came up with seven new ones and Gemini five (they agreed on one of the problems).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47542" src="https://measuringu.com/wp-content/uploads/2026/05/Figure2-1024x901.png" alt="Venn diagram of usability problem discovery by humans, ChatGPT, and Gemini. " width="600" height="528" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure2-1024x901.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-300x264.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-768x676.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-1536x1351.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-2048x1802.png 2048w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-600x528.png 600w" sizes="auto, (max-width: 600px) 100vw, 600px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2: </strong>Venn diagram of usability problem discovery by humans, ChatGPT, and Gemini.</p>
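<p>The counts in the Venn diagram can be checked with set operations on the labels from Tables 1, 3, and 5. One assumption is needed: the text says the two AIs agreed on one new problem without naming it, and this sketch treats 10a-ChatGPT and 10a-Gem as that shared problem because their descriptions are identical. It is coded here as the single label "10a". The variable names are illustrative.</p>

```python
# Problems 1 through 9 from the human evaluators (Table 1).
human = {str(n) for n in range(1, 10)}

# Consolidated lists across four runs each (Tables 3 and 5).
# Assumption: "10a" stands for both 10a-ChatGPT and 10a-Gem, whose
# descriptions match; all other AI-only labels are kept distinct.
chatgpt = {
    "1", "4", "5", "6", "7",
    "4b-ChatGPT", "6b-ChatGPT", "6c-ChatGPT", "7b-ChatGPT",
    "8b-ChatGPT", "10a", "10b-ChatGPT",
}
gemini = {
    "4", "5", "7", "8",
    "5b-Gem", "7b-Gem", "9b-Gem", "9c-Gem", "10a",
}

ai_only = (chatgpt | gemini) - human          # reported by no human
human_only = human - (chatgpt | gemini)       # missed by both AIs
shared_ai_only = (chatgpt & gemini) - human   # new problems both AIs reported

print("AI-only problems:", len(ai_only))
print("Human-only problems:", sorted(human_only))
print("AI-only problems shared by both AIs:", sorted(shared_ai_only))
```

<p>Under that assumption, the seven ChatGPT-only and five Gemini-only labels overlap on one problem, which gives the eleven distinct AI-only problems.</p>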
<p>AIs generated more new problems (eleven) than the total list generated by four humans (nine). It’s not clear whether these additional AI-identified problems represent true usability issues that humans missed or are false positives/hallucinations. We’ll dig into the qualitative differences among those problems in an upcoming article. What is clear is that a human will likely need to spend time reviewing each of these additional problems.</p>
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>Building on our previous research into the reliability of AI usability problem discovery, we investigated the validity of AI evaluations by seeing whether AI and human evaluators agree on <em>which</em> problems they find. Using the same video, task, and prompt, four UX researchers and two LLMs (ChatGPT and Gemini, four runs each) independently produced problem lists. Our key findings:</p>
<p><strong>Humans had higher within-group reliability than the LLMs.</strong> Any-2 agreement among human evaluators was 66%, well above the 31% we previously reported for ChatGPT and somewhat above Gemini&#8217;s 57%.</p>
<p><strong>Agreement between humans and AI was low.</strong> The human-ChatGPT any-2 agreement was just 19%, the lowest we observed. Human-Gemini agreement was better at 36%, but still below the typical human baseline of 47%. Low agreement means AI and humans often flag different problems when watching the same video with known usability issues.</p>
<p><strong>AI identifies roughly half the problems humans find.</strong> ChatGPT identified five of the nine human-verified problems, and Gemini identified four. Of the four problems that were identified by all human evaluators, ChatGPT identified three and Gemini identified two. The nine problems were a vetted compilation from all four human evaluators. When we limited the comparison to individual evaluators, ChatGPT matched 60–71% of each researcher&#8217;s list, and Gemini matched 33–60%. The AIs didn’t find all the problems reported by humans, but depending on the evaluator(s), they can find more than half of them.</p>
<p><strong>AI generates more new problems than humans do.</strong> The two AIs together produced eleven problems that no human identified (at least for this one video), which is more than the entire human problem list of nine. ChatGPT contributed seven unique problems and Gemini five, with one shared between them. It&#8217;s not yet clear whether these represent real usability issues that trained researchers missed or are false positives (we’ll explore these possibilities in an upcoming article).</p>
<p><strong>AI-only problems create a new validation burden.</strong> Someone has to determine which AI-generated problems are real, and that means a human reviewing each one. If AI is being used to save time, the volume of unverified AI-generated problems may offset much of those savings. Whether the tradeoff is worth it likely depends on how many of those problems turn out to be real, again something we&#8217;ll examine in a follow-up.</p>
<p>In our next article, we shift from this quantitative comparison to a qualitative examination: using the video as ground truth, we ask whether the AI-only problems reflect events that actually happened or whether the LLMs hallucinated issues that never occurred.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How Reliable Is AI at Finding UI Problems?</title>
		<link>https://measuringu.com/ai-usability-problem-analysis-of-a-video/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ai-usability-problem-analysis-of-a-video</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD • Jeff Sauro, PhD • Will Schiavone, PhD • Lucas Plabst, PhD]]></dc:creator>
		<pubDate>Tue, 28 Apr 2026 22:04:59 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[Gemini]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47402</guid>

					<description><![CDATA[It looks like AI can “watch” videos. And if AI can watch videos, it can likely extract UI problems. That suggests it has the potential to support UX research. So maybe AI can “watch” a video and detect some problems. But if you run the same video through AI multiple times, do you get the [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47475 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-300x169.png" alt="Feature image showing two AI robots, each holding a clipboard with a UI problems list" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>It looks like AI can “<a href="https://measuringu.com/can-ai-detect-usability-problems">watch” videos</a>. And if AI can watch videos, it can likely extract UI problems. That suggests it has the <a href="https://measuringu.com/ai-as-uxr-assistant-user-and-analyst/">potential to support UX research</a>.</p>
<p>So maybe AI can “watch” a video and detect some problems. But if you run the same video through AI multiple times, do you get the same results?</p>
<p>Reliability matters. If AI produces different results each time, it becomes untrustworthy, no matter how convincing its reasoning sounds.</p>
<p>There are a lot of variables that can affect our assessment of reliability, including:</p>
<ul>
<li>AI models (Gemini, ChatGPT, Claude, Grok)</li>
<li>Versions (models are changing monthly)</li>
<li>LLM settings like <em>temperature</em>, which affect the randomness of the output</li>
<li>Prompts: What you ask (and even how many times you ask)</li>
</ul>
<p>There are a lot of variables to consider, but we have to start somewhere. So we did. In this article, we take a first step in assessing the reliability of AI problem detection. We examined how consistent two popular AI chatbots are at identifying usability problems from the same video.</p>
<h2><span lang="EN-US">Study Setup</span></h2>
<p>We had two LLMs, ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking, review the video and list the usability issues they discovered (four runs per LLM to check for within-LLM consistency; default settings only). Both are general-purpose LLMs for which MeasuringU has paid “pro” subscriptions (i.e., not free versions). Video 1 shows a 15-second clip of the full six-minute video.</p>
<div class="ast-oembed-container " style="height: 100%;"><iframe loading="lazy" title="MU_opentable_reservation first 15 sec" src="https://player.vimeo.com/video/1184232223?h=3252bb61a5&amp;dnt=1&amp;app_id=122963" width="1170" height="658" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe></div>
<p class="wp-caption-text" style="text-align: left;"><strong>Video 1: </strong>First 15 seconds of a participant searching for a sushi restaurant on the OpenTable website.</p>
<p>The task (visible at the bottom of the video) was to use OpenTable.com to:</p>
<blockquote><p>“Please think aloud. Make a reservation for four people at a sushi restaurant in Denver, CO tomorrow anytime after 5:00pm. Make sure the restaurant you select is not at the lowest or highest price point. Of the restaurants that fit these criteria, look at their overall rating, customer reviews, and photos to select the one that is the most appealing to you. Go as far as you can in the reservation process until you are asked for your personal information or account details. DO NOT fully confirm the reservation. Write down the restaurant name and the time of the reservation. You will be asked about this information after the task.”</p></blockquote>
<p>We used the following prompt:</p>
<blockquote><p>“During a usability test, the facilitator must keep track of participant behaviors as they navigate through tasks on a website, mobile app, software program, etc. We’d like you to watch a video of a usability test where participants were asked to book a table at a Sushi restaurant. As you&#8217;re watching, please look for problems the participant has while attempting to complete the task. For example, you can document the path users take, describe issues they encounter as well as what on the website might be causing problems. If you understand these instructions, let me know and I&#8217;ll drag the video in for you to review. Are you ready for the video?”</p></blockquote>
<p>The LLM response to this question was always some version of “Yes.”</p>
<p>In this study, we varied only the type of AI: ChatGPT and Gemini. The video, the prompt, and the LLM versions and settings were held constant, but we plan to vary them in future studies.</p>
<h2><span lang="EN-US">Assessing Reliability</span></h2>
<p>If you ask, AI will deliver (something). For each run, we compiled a list of usability problems that the AI model “discovered.”</p>
<p>For example, a problem noticeable in the video (and on the current OpenTable website) is that when entering “Denver” in the search field, the previously selected cuisine (sushi) was removed, making for a clumsy filter and search experience.</p>
<p>To assess the reliability (consistency) of their problem discovery, we computed the <a href="https://measuringu.com/assessing-interrater-reliability-in-ux-research/">any-2 agreement</a> between ChatGPT and Gemini and within each model. We treated the models like evaluators.</p>
<p>Any-2 agreement is a UX context-specific version of the <a href="https://link.springer.com/article/10.1186/s12859-019-3118-5">Jaccard similarity coefficient (<em>J</em>)</a>: the size of the intersection of two sets of binary measurements divided by the size of their union. When there are more than two evaluators, the overall any-2 agreement is the average of the any-2 agreements for each pair of evaluators.</p>
<h3><span lang="EN-US">Computing Any-2 Agreement</span></h3>
<p>Imagine that two evaluators (Y and C) have independently created lists of usability issues, where Y’s list has 14 issues, C’s list has 17, and the two lists have ten issues in common (Figure 1). Their any-2 agreement is the intersection (the ten issues they both discovered) divided by the union of both lists (14 + 17 − 10 = 21), which is 48% (10/21).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47477" src="https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-300x172.png" alt="Venn diagram of problem discovery by two evaluators. " width="436" height="250" srcset="https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-300x172.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-768x440.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-600x344.png 600w, https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1.png 909w" sizes="auto, (max-width: 436px) 100vw, 436px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1: </strong>Venn diagram of problem discovery by two evaluators.</p>
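<p>The arithmetic in this example can be sketched in a few lines of Python. The function name <code>any2_agreement</code> and the placeholder issue labels are hypothetical, chosen only so that the list sizes match the example (14 issues for Y, 17 for C, ten in common).</p>

```python
def any2_agreement(list_a, list_b):
    """Jaccard similarity: shared issues divided by all distinct issues."""
    a, b = set(list_a), set(list_b)
    union = a | b
    if not union:
        raise ValueError("both issue lists are empty")
    return len(a & b) / len(union)


# Placeholder labels: ten shared issues, four found only by Y, seven only by C.
shared = {f"shared-{i}" for i in range(10)}
y_list = shared | {f"y-only-{i}" for i in range(4)}
c_list = shared | {f"c-only-{i}" for i in range(7)}

agreement = any2_agreement(y_list, c_list)
print(f"Y found {len(y_list)}, C found {len(c_list)}")
print(f"Union: {len(y_list | c_list)}, any-2 agreement: {agreement:.0%}")
```

<p>Working with sets means the union is counted directly, so there is no need to subtract the overlap by hand as in 14 + 17 − 10.</p>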
<p>Due to the <a href="https://measuringu.com/evaluator-effect/">well-documented evaluator effect</a>, we do not expect perfect agreement among UX researchers. In a controlled study like this (evaluators watching the same participants do the same tasks), our best estimate of typical any-2 agreement across multiple human evaluators (<a href="https://measuringu.com/examining-the-evaluator-effect-in-unmoderated-usability-testing/">based on 12 evaluations</a>) is 47%. (When studies are not controlled, the expected any-2 agreement is about 27%.)</p>
<p>This gives us a <strong>rough</strong> benchmark for assessing if an any-2 agreement is typical (around 50%), relatively low (around 25%), or relatively high (around 75%).</p>
<h2><span lang="EN-US">Within-Group Results</span></h2>
<p>The first step in our analysis was to compute the mean any-2 agreement for each group of “evaluators” (ChatGPT, Gemini) to estimate the levels of within-group reliability.</p>
<h3><span lang="EN-US">ChatGPT Reliability Was Fair</span></h3>
<p>Table 1 shows the combined problem list for the four runs of ChatGPT. Table 2 shows the any-2 agreements for each pair of runs.</p>

<table id="tablepress-1029" class="tablepress tablepress-id-1029">
<thead>
<tr class="row-1">
	<th class="column-1">GPT #</th><th class="column-2">GPT Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-3">
	<td class="column-1">2</td><td class="column-2">Entering "Denver" in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-4">
	<td class="column-1">3</td><td class="column-2">Filters not helpful</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-5">
	<td class="column-1">4</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-6">
	<td class="column-1">5</td><td class="column-2">Surprised when typing sushi into search field did not lose current location</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">6</td><td class="column-2">Search results for sushi included many non-sushi restaurants</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</td>
</tr>
<tr class="row-8">
	<td class="column-1">7</td><td class="column-2">Weak presentation of cuisine information in search results</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-9">
	<td class="column-1">8</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">9</td><td class="column-2">Sorting by highest rated put many non-sushi restaurants at the top of the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</td>
</tr>
<tr class="row-11">
	<td class="column-1">10</td><td class="column-2">UI pushes browsing without good decision support</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-12">
	<td class="column-1">11</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-13">
	<td class="column-1">12</td><td class="column-2">Task not completed because participant did not reach reservation form</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> ChatGPT evaluations problem list.</p>

<table id="tablepress-1030" class="tablepress tablepress-id-1030">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Run 1</th><th class="column-3">Run 2</th><th class="column-4">Run 3</th><th class="column-5">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Run 1</strong></td><td class="column-2"> x</td><td class="column-3">30%</td><td class="column-4">63%</td><td class="column-5">38%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Run 2</strong></td><td class="column-2">30%</td><td class="column-3"> x</td><td class="column-4">33%</td><td class="column-5"> 0%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Run 3</strong></td><td class="column-2">63%</td><td class="column-3">33%</td><td class="column-4"> x</td><td class="column-5">25%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Run 4</strong></td><td class="column-2">38%</td><td class="column-3"> 0%</td><td class="column-4">25%</td><td class="column-5"> x</td>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2: </strong>Any-2 agreement for the ChatGPT evaluations.</p>
<p>With an <strong>overall any-2 agreement of 31%</strong>, the reliability of the ChatGPT evaluations was <strong>fair</strong>. None of the problems was identified on all four runs (5/12 were identified on three runs). Runs 2 and 4 had no problems in common.</p>
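<p>As a check on the numbers, the pairwise values in Table 2 and the overall figure can be recomputed from the run columns of Table 1. The sketch below transcribes which problem numbers each run flagged; the variable and function names are our own, and the percentages agree with Table 2 after rounding.</p>

```python
from itertools import combinations

# Problem numbers (GPT # column of Table 1) flagged in each ChatGPT run.
chatgpt_runs = {
    "Run 1": {1, 2, 6, 7, 8, 10, 11},
    "Run 2": {1, 2, 3, 5, 10, 12},
    "Run 3": {1, 2, 4, 7, 10, 11},
    "Run 4": {6, 7, 9, 11},
}


def any2_agreement(a, b):
    """Shared problems divided by all distinct problems for one pair of runs."""
    return len(a & b) / len(a | b)


def pairwise_any2(runs):
    """Any-2 agreement for every unordered pair of runs."""
    return {
        (r1, r2): any2_agreement(runs[r1], runs[r2])
        for r1, r2 in combinations(runs, 2)
    }


pairs = pairwise_any2(chatgpt_runs)
overall = sum(pairs.values()) / len(pairs)  # mean over the six unique pairs

for (r1, r2), value in pairs.items():
    print(f"{r1} vs {r2}: {value:.1%}")
print(f"Overall any-2 agreement: {overall:.1%}")
```

<p>Because the agreement matrix is symmetric, only the six unique pairs enter the average, not all twelve off-diagonal cells.</p>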
<h3><span lang="EN-US">Gemini Reliability Was Better</span></h3>
<p>Table 3 shows the combined problem list for the four runs of Gemini. Table 4 shows the any-2 agreements for each pair of runs.</p>

<table id="tablepress-1031" class="tablepress tablepress-id-1031">
<thead>
<tr class="row-1">
	<th class="column-1">Gem #</th><th class="column-2">Gemini Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Entering "Denver" in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-3">
	<td class="column-1">2</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-4">
	<td class="column-1">3</td><td class="column-2">Participant used Ctrl-F to search page for "sushi"—not found</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-5">
	<td class="column-1">4</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-6">
	<td class="column-1">5</td><td class="column-2">Participant chose highest price tier</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">6</td><td class="column-2">Participant wanted to change sort to lowest rating first but not an option</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-8">
	<td class="column-1">7</td><td class="column-2">Seating options only presented after selecting time</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-9">
	<td class="column-1">8</td><td class="column-2">Set time to 5:10</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">9</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 3: </strong>Gemini evaluations problem list.</p>

<table id="tablepress-1032" class="tablepress tablepress-id-1032">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Run 1</th><th class="column-3">Run 2</th><th class="column-4">Run 3</th><th class="column-5">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Run 1</strong></td><td class="column-2"> x</td><td class="column-3">38%</td><td class="column-4">67%</td><td class="column-5">80%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Run 2</strong></td><td class="column-2">38%</td><td class="column-3"> x</td><td class="column-4">38%</td><td class="column-5">43%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Run 3</strong></td><td class="column-2">67%</td><td class="column-3">38%</td><td class="column-4"> x</td><td class="column-5">80%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Run 4</strong></td><td class="column-2">80%</td><td class="column-3">43%</td><td class="column-4">80%</td><td class="column-5"> x</td>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 4: </strong>Any-2 agreement for the Gemini evaluations.</p>
<p>With an <strong>overall any-2 agreement of 57%</strong>, the reliability of the Gemini evaluations was <strong>good</strong> (3/9 problems identified in all four runs, 4/9 identified by at least three runs).</p>
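<p>The counts in parentheses (problems found in all four runs, and in at least three) can be tallied directly from Table 3. The following is a small sketch using the run columns as transcribed from that table; the variable names are illustrative only.</p>

```python
from collections import Counter

# Problem numbers (Gem # column of Table 3) flagged in each Gemini run.
gemini_runs = [
    {1, 2, 3, 4, 7},      # Run 1
    {1, 2, 3, 5, 8, 9},   # Run 2
    {1, 2, 3, 4, 6},      # Run 3
    {1, 2, 3, 4},         # Run 4
]

# Count how many runs detected each problem.
detections = Counter(problem for run in gemini_runs for problem in run)

total_problems = len(detections)
in_all_runs = sorted(p for p, n in detections.items() if n == len(gemini_runs))
in_three_plus = sorted(p for p, n in detections.items() if n >= 3)

print(f"{len(in_all_runs)}/{total_problems} problems found in all four runs: {in_all_runs}")
print(f"{len(in_three_plus)}/{total_problems} problems found in at least three runs: {in_three_plus}")
```

<p>Detection frequency complements any-2 agreement: it shows which specific problems drive the overlap, while any-2 summarizes the overlap as a single number.</p>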
<h2><span lang="EN-US">Between-Group Results</span></h2>
<p>The second step in our analysis was to compute the mean any-2 agreement across LLMs to estimate the between-group reliability, shown in Table 5.</p>

<table id="tablepress-1033" class="tablepress tablepress-id-1033">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Gem 1</th><th class="column-3">Gem 2</th><th class="column-4">Gem 3</th><th class="column-5">Gem 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>GPT 1</strong></td><td class="column-2">40%</td><td class="column-3">40%</td><td class="column-4">33%</td><td class="column-5">40%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>GPT 2</strong></td><td class="column-2">20%</td><td class="column-3">20%</td><td class="column-4">17%</td><td class="column-5">20%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>GPT 3</strong></td><td class="column-2">40%</td><td class="column-3">75%</td><td class="column-4">33%</td><td class="column-5">40%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>GPT 4</strong></td><td class="column-2"> 0%</td><td class="column-3">33%</td><td class="column-4"> 0%</td><td class="column-5"> 0%</td>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 5: </strong>Any-2 agreement between ChatGPT and Gemini evaluations.</p>
<p>With an <strong>overall any-2 agreement of 28%</strong>, the between-AI reliability was <strong>low</strong> (closer to 25% than to 50%).</p>
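<p>Unlike the within-group tables, Table 5 is not symmetric: each of its 16 cells is a distinct ChatGPT-Gemini pairing, so all 16 enter the average. A rough sketch of that averaging follows. It starts from the rounded percentages printed in Table 5 rather than the underlying problem matches, so the result is approximate, and the variable names are our own.</p>

```python
# Rounded any-2 percentages from Table 5 (rows: ChatGPT runs, columns: Gemini runs 1-4).
table5 = {
    "GPT 1": [40, 40, 33, 40],
    "GPT 2": [20, 20, 17, 20],
    "GPT 3": [40, 75, 33, 40],
    "GPT 4": [0, 33, 0, 0],
}

# Every cell is a unique cross-group pair, so none are dropped as duplicates.
cells = [value for row in table5.values() for value in row]
overall = sum(cells) / len(cells)

# Row means show how much each ChatGPT run overlapped with Gemini on average.
row_means = {name: sum(row) / len(row) for name, row in table5.items()}

print(f"Between-group any-2 agreement: {overall:.0f}%")
for name, mean in row_means.items():
    print(f"{name}: mean agreement with Gemini runs = {mean:.0f}%")
```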
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>Along with the rest of the UX researcher community, we have a strong interest in the roles that AI might play in facilitating our work. Watching participants attempt to complete tasks is a fundamental but labor-intensive UX research activity, so any relief AI assistance might offer would be welcome.</p>
<p>As a first step in investigating the capability of ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking to find usability problems in videos, we collected evaluations of a single video (summarized as lists of usability problems), performing four runs with each LLM.</p>
<p>In this article, we evaluated any-2 agreement within each group of evaluations (ChatGPT, Gemini) and between the AIs. Our key findings were:</p>
<p><strong>Gemini had good reliability, and ChatGPT’s was fair. </strong>The average any-2 agreement for ChatGPT was 31%. We expect this level of reliability when comparing different evaluators, <a href="https://www.dialogdesign.dk/cue-studies/">different methods</a>, or different users. It’s certainly lower than you’d want, but still at a level considered acceptable in our industry.</p>
<p>For Gemini, the average any-2 agreement was good at 57%. From the literature and our own research with human evaluators, 57% is above the mean of 47% and on the higher side of acceptability.</p>
<p><strong>Between-group reliability for Gemini and ChatGPT was low. </strong>The any-2 agreement between ChatGPT and Gemini was low at 28%. That’s about 20 points below the 47% average for different people evaluating the same video. This result is not great.</p>
<p><strong>Reliability isn’t accuracy. </strong>Are the problems identified by the LLMs as relevant as those discovered by a human evaluator? This question hasn’t been answered yet (a future analysis will). But accuracy (validity) first requires consistent (reliable) results, and at least for this video and prompt, the Gemini performance was sufficiently reliable.</p>
<p><strong>Humans vs. AI coming soon. </strong>We’re just getting started with our analyses. In an upcoming article, we’ll compare any-2 agreement between these LLMs and a problem list generated by professional human UX researchers. Stay tuned.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Can AI Detect Usability Problems?</title>
		<link>https://measuringu.com/can-ai-detect-usability-problems/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=can-ai-detect-usability-problems</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD •&nbsp;Lucas Plabst, PhD •&nbsp;Jim Lewis, PhD&nbsp;•&nbsp;Will Schiavone, PhD]]></dc:creator>
		<pubDate>Wed, 22 Apr 2026 03:39:35 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[error]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47333</guid>

					<description><![CDATA[You may have become numb to the overhyped headlines about AI. But it’d be wrong to dismiss the impact AI can have on our industry, not only because of job displacement, but also of helping us do our jobs more effectively (hopefully). To separate the hype and hysteria, we at MeasuringU think about AI’s impact [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47352 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-300x169.png" alt="Feature image showing an AI robot observing the user flow to detect usability issues" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>You may have become numb to the overhyped headlines about AI.</p>
<p>But it’d be wrong to dismiss the impact AI can have on our industry, not only through job displacement but also by helping us do our jobs more effectively (hopefully).</p>
<p>To look past the hype and hysteria, we at MeasuringU think about AI’s impact in UX research in <a href="https://measuringu.com/ai-as-uxr-assistant-user-and-analyst/">three ways</a>: AI as Research Assistant, AI as (Synthetic) User, and AI as Researcher.</p>
<p>One of the more valuable activities we do in UX research as a researcher and assistant is to find (and recommend fixes for) usability problems in an interface.</p>
<p>Finding problems typically comes from researchers observing people interacting with a product, either live in a lab (like ours at MeasuringU, Figure 1), remotely using tools like <a href="https://measuringu.com/muiq/">MUiQ<sup>®</sup></a>, or by reviewing recordings of moderated or unmoderated sessions.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042126-F1.jpg"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47336" src="https://measuringu.com/wp-content/uploads/2026/04/042126-F1.jpg" alt="AI-generated image of a robot observing a usability test session." width="602" height="602" srcset="https://measuringu.com/wp-content/uploads/2026/04/042126-F1.jpg 602w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-300x300.jpg 300w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-150x150.jpg 150w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-70x70.jpg 70w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-600x600.jpg 600w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-100x100.jpg 100w" sizes="auto, (max-width: 602px) 100vw, 602px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> AI-generated image of a robot observing a usability test session.</p>
<p>Watching lots of videos can be time-consuming and tedious. Usability problems often aren’t obvious defects in an interface. They can come from perfectly compiled code and a beautiful interface. Due to the <a href="https://measuringu.com/assessing-interrater-reliability-in-ux-research/">well-known evaluator effect</a>, even trained professionals do not detect usability problems with perfect consistency.</p>
<p>Can AI really uncover usability problems like UX researchers?</p>
<p>A few years ago, this sort of question seemed like science fiction. Now it’s not <em>if</em> but <em>how well</em> AI can watch videos like a researcher or research assistant. Can human researchers use it and trust the results?</p>
<h2><span lang="EN-US">How Does AI “Watch” a Video?</span></h2>
<p>Here’s a 15-second clip of a participant starting the task of finding a sushi restaurant on the OpenTable website. It’s part of a six-minute video of an unmoderated usability test conducted on the MUiQ platform.</p>
<div class="ast-oembed-container " style="height: 100%;"><iframe loading="lazy" title="MU_opentable_reservation first 15 sec" src="https://player.vimeo.com/video/1184232223?h=3252bb61a5&amp;dnt=1&amp;app_id=122963" width="1170" height="658" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe></div>
<p class="wp-caption-text" style="text-align: left;"><strong>Video 1: </strong>First 15 seconds of the OpenTable.com video.</p>
<p>We asked Gemini 3.1 Pro to “Please tell me in extreme detail what you see in the video from start to finish going second by second.” Here&#8217;s how it responded:</p>
<ul>
<li><strong>00:00–00:04:</strong> The video begins on the OpenTable.com homepage. The top navigation bar shows options for &#8220;For Business,&#8221; &#8220;Mobile,&#8221; &#8220;Help,&#8221; language selection (&#8220;EN&#8221;), and buttons for &#8220;Sign up&#8221; and &#8220;Sign in.&#8221; The main search area displays default parameters: a date of &#8220;Jul 19, 2022,&#8221; a time of &#8220;7:00 PM,&#8221; and a party size of &#8220;2 people.&#8221; The search bar says &#8220;Location, Restaurant, or Cuisine&#8221; with a red &#8220;Let&#8217;s go&#8221; button. Below this is a message reading, &#8220;It looks like you&#8217;re in Dallas Suburbs. Not correct? Get current location.&#8221; A section titled &#8220;Book for dinner tonight&#8221; shows several restaurants, including Fogo de Chao, Lombardi Cucina Italiana, Seasons 52, and Perry&#8217;s Steakhouse. At the bottom of the screen, a task description box outlines the user&#8217;s objective: booking a sushi restaurant in Denver, CO for four people tomorrow after 5:00 PM. The user begins speaking, stating they are looking for a sushi restaurant.</li>
<li><strong>00:05–00:09:</strong> The user&#8217;s mouse cursor idles on the left side of the screen. The user continues speaking, noting they need a reservation for tomorrow anytime after 5:00 PM. They observe that today is the 19th, so they need to book for the 20th.</li>
<li><strong>00:10–00:11:</strong> The cursor moves to the date selector and clicks on &#8220;Jul 19, 2022.&#8221; A calendar dropdown appears. The user clicks on the number &#8220;20&#8221; to change the date to Jul 20, 2022.</li>
<li><strong>00:12–00:15:</strong> The user moves the cursor to the time selector, currently set at &#8220;7:00 PM,&#8221; and clicks it. A dropdown menu of times in 30-minute increments appears. The user scrolls up and clicks on &#8220;5:00 PM.&#8221;</li>
</ul>
<p>Even though the summary of the onscreen text describing the user task does not include all task requirements (which we did not specifically prompt for), this is a reasonably accurate description of what’s on the screen and the events.</p>
<p>Gemini didn’t actually watch the video the way we do. It doesn’t experience a continuous stream of motion or notice subtle hesitation in real time. Instead, it samples only a few frames per second of video (often just one or two of the 30 frames in each second).</p>
<p>Each frame is then broken into smaller regions and converted into numerical representations that the model can process. If there’s audio, spoken words are transcribed into text and aligned with those frames. By the time the model begins “analyzing,” the video has already been reduced to a combination of image fragments and text.</p>
<p>From there, the model treats those inputs similarly to how it processes language. The visual and textual information is converted into tokens (small chunks of data) and passed through a neural network that looks for patterns (things such as interface elements, changes in screens, or sequences of actions).</p>
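<p>To get a feel for how much of a recording survives this kind of sampling, here is a minimal sketch of even-interval frame sampling. It is an illustration only: how any particular model selects frames is not specified here, and the function <code>sampled_frame_indices</code>, its parameters, and the default rate of one frame per second are assumptions made for the example.</p>

```python
def sampled_frame_indices(duration_s, native_fps=30, samples_per_second=1):
    """Indices of the frames kept when sampling a recording at even intervals.

    Assumes a fixed native frame rate and a whole number of seconds.
    """
    if not 0 < samples_per_second <= native_fps:
        raise ValueError("samples_per_second must be between 1 and native_fps")
    step = native_fps / samples_per_second      # frames skipped between samples
    count = duration_s * samples_per_second     # total frames kept
    return [int(i * step) for i in range(count)]


# A six-minute recording at 30 frames per second, sampled once per second.
duration_s = 6 * 60
kept = sampled_frame_indices(duration_s)
total_frames = duration_s * 30

print(f"Kept {len(kept)} of {total_frames} frames ({len(kept) / total_frames:.1%})")
print("First sampled frame indices:", kept[:5])
```

<p>Anything that starts and ends between two sampled frames, such as a brief hover or a flicker of a menu, never reaches the model under this scheme.</p>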
<h2>Autocorrect for Video Watching</h2>
<p>Because AI is working from snapshots rather than continuous playback, it doesn’t directly see motion. Instead, it infers what likely happened between frames (for example, that a user scrolled, tapped, or navigated to a new page). This makes the process efficient, but it also means short or subtle behaviors can be missed.</p>
<p>Based on the sampled frames and any accompanying text, it generates the most likely description of what happened, much like how it predicts the next word in a sentence. Basically, it’s like autocorrect on steroids for videos.</p>
<p>That’s why the output can sound surprisingly natural and insightful, even when it’s not entirely accurate. It’s less like a researcher watching a session and more like a system generating a plausible narrative from partial information.</p>
<h3>Losing Frames</h3>
<p>As long as there’s been autocorrect, there have been, well, mistakes (often <a href="https://www.huffpost.com/entry/funniest-autocorrect-faiils-2014_n_6391880">hilarious ones</a>). The sampling that makes AI fast also makes it “<a href="https://cs.stanford.edu/people/eroberts/courses/soco/projects/data-compression/lossy/index.htm">lossy</a>.” By looking at only a fraction of the frames, the model can miss brief moments of hesitation, confusion, or micro-interactions that are often critical in usability analysis. What’s efficient for processing might not always be sufficient for insight.</p>
<h3>Probabilistic Output</h3>
<p>But unlike autocorrect, which works the same each time it’s presented with a partial word, AI outputs aren’t always the same. They’re probabilistic rather than deterministic. Even with the same video and the same prompt, the model may generate slightly different descriptions each time. That’s because it’s not retrieving a fixed answer but generating the most likely sequence of words from a range of possibilities. The results can be consistent in general themes, but not identical in wording or even emphasis. And with current systems, there is always the possibility of <a href="https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)">hallucination</a>. For researchers, these concerns mean that AI outputs should be treated less like definitive observations and more like plausible interpretations that still need validation.</p>
<h3>Temperature</h3>
<p>Part of this variability comes from a setting called <em>temperature</em>, which controls how much randomness the model uses when generating responses. Temperature typically ranges from 0 (close to deterministic) to around 2 (much more variable). Most models use a middle setting by default, which balances consistency and variation. Higher temperatures can surface a wider range of interpretations (sometimes useful for exploratory analysis), while lower temperatures produce more consistent outputs—but even then, results aren’t perfectly repeatable.</p>
<p>Figure 2 illustrates this process.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042126-F2.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47337" src="https://measuringu.com/wp-content/uploads/2026/04/042126-F2.png" alt="Visual overview of how an AI “watches” a video. " width="1182" height="788" srcset="https://measuringu.com/wp-content/uploads/2026/04/042126-F2.png 1182w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-300x200.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-1024x683.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-768x512.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-600x400.png 600w" sizes="auto, (max-width: 1182px) 100vw, 1182px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2: </strong>Visual overview of how an AI “watches” a video.</p>
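<p>The effect of the temperature setting described above can be sketched with a standard temperature-scaled softmax. This is a simplified illustration, not the internals of any specific model: the candidate words and their scores are invented, and treating a temperature of 0 as always picking the top-scoring option is an assumption of this sketch.</p>

```python
import math
import random


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumption: temperature 0 means always choose the highest score.
        best = scores.index(max(scores))
        return [1.0 if i == best else 0.0 for i in range(len(scores))]
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidates for the next word in a description, with made-up scores.
candidates = ["scrolled", "clicked", "hesitated"]
scores = [2.0, 1.0, 0.5]

for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sampling from the distribution is what makes repeated runs differ.
rng = random.Random(0)
weights = softmax_with_temperature(scores, 1.0)
print(rng.choices(candidates, weights=weights, k=5))
```

<p>At low temperature nearly all the probability sits on the top candidate, so repeated runs tend to agree. At higher temperature the lower-ranked candidates are chosen more often, which is one source of run-to-run variation.</p>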
<p>This gives you an idea about how AI reviews videos for usability problems. But what does it look like when you ask an AI to perform a usability evaluation of a video?</p>
<h2><span lang="EN-US">Problem List from ChatGPT</span></h2>
<p>We uploaded the full six-minute video of a person attempting to find a sushi restaurant in Denver on the OpenTable website into ChatGPT (model 5.4 Thinking). We selected the video because it has several known usability problems that humans consistently detect. We prompted ChatGPT with:</p>
<blockquote><p>“During a usability test, the facilitator must keep track of participant behaviors as they navigate through tasks on a website, mobile app, software program, etc. We’d like you to watch a video of a usability test where participants were asked to book a table at a Sushi restaurant. As you&#8217;re watching, please look for problems the participant has while attempting to complete the task. For example, you can document the path users take, describe issues they encounter as well as what on the website might be causing problems. If you understand these instructions, let me know and I&#8217;ll drag the video in for you to review. Are you ready for the video?”</p></blockquote>
<p>ChatGPT indicated “yes,” then took only three minutes to process the video (half of its six-minute runtime, because it sampled a fraction of the frames to piece together its visual autocorrect narrative).</p>
<p>From its output, we derived a list of seven usability problems (Table 1).</p>

<table id="tablepress-1028" class="tablepress tablepress-id-1028">
<thead>
<tr class="row-1">
	<th class="column-1">Problem List</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td>
</tr>
<tr class="row-3">
	<td class="column-1">Entering Denver in search field lost previous selection of sushi as cuisine</td>
</tr>
<tr class="row-4">
	<td class="column-1">Search results for sushi included many non-sushi restaurants</td>
</tr>
<tr class="row-5">
	<td class="column-1">Weak presentation of cuisine information in search results</td>
</tr>
<tr class="row-6">
	<td class="column-1">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td>
</tr>
<tr class="row-7">
	<td class="column-1">UI pushes browsing without good decision support</td>
</tr>
<tr class="row-8">
	<td class="column-1">Selected restaurant was categorized as Seafood instead of Sushi, so participant failed the task</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> List of problems “discovered” by ChatGPT review of usability test video.</p>
<h2><span lang="EN-US">Looks Good, However …</span></h2>
<p>On the surface, that looks pretty good. It’s plausible, specific, and aligned with what a researcher might note. But it leaves us with a few questions:</p>
<ul>
<li>How many of these are <em>actual</em> usability problems versus plausible-sounding interpretations (autocorrect) or hallucinations?</li>
<li>How consistent are the results across multiple runs (reliability)?</li>
<li>How closely do these match what human UX researchers would identify (validity)?</li>
</ul>
<p>We’ll explore these important questions in future articles.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>

