<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>MeasuringU</title>
	<atom:link href="https://measuringu.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://measuringu.com</link>
	<description>UX Research and Software</description>
	<lastBuildDate>Tue, 19 May 2026 22:15:45 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://measuringu.com/wp-content/uploads/2020/11/site-icon.png</url>
	<title>MeasuringU</title>
	<link>https://measuringu.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How Many Years Does It Take to Become a Senior UX Researcher?</title>
		<link>https://measuringu.com/how-many-years-does-it-take-to-become-a-senior-ux-researcher/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-many-years-does-it-take-to-become-a-senior-ux-researcher</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 19 May 2026 22:15:45 +0000</pubDate>
				<category><![CDATA[Methods]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[UX Maturity]]></category>
		<category><![CDATA[experience]]></category>
		<category><![CDATA[Salary Survey]]></category>
		<category><![CDATA[UX Salary Survey]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47607</guid>

					<description><![CDATA[What does it take to become a senior UX researcher? An advanced degree? Particular experience and skills, like the number of moderated studies conducted or a variety of methods employed? While all those play a role, the type of job (in-house small-team, in-house large-team, solo researcher, or agency) can affect what you are exposed to. [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1.png"><img fetchpriority="high" decoding="async" class="alignleft wp-image-47633 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-300x169.png" alt="Feature image showing an entry-level UX researcher becoming a senior UX researcher over several years" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/05/051926-FeatureImage1.png 2000w" sizes="(max-width: 300px) 100vw, 300px" /></a>What does it take to become a senior UX researcher?</p>
<p>An advanced degree? Particular experience and skills, like the number of moderated studies conducted or a variety of methods employed?</p>
<p>While all those play a role, the type of job (in-house small-team, in-house large-team, solo researcher, or agency) can affect what you are exposed to.</p>
<p>Certainly, most would agree that one to two years of experience seems too little time to demonstrate senior-level performance in UX research. We thought that something around five years of experience was a good benchmark. But is that warranted? What is a good number of years of experience?</p>
<p>There is no official rule book on titles. As is often the case when making decisions about jobs, we can use a few approaches:</p>
<ol>
<li>Principle-based: Set a rule based on a principle that disregards what people do.</li>
<li>Tradition and trends: Look to broader workforce trends, what others report, or guidance online.</li>
<li>Data: See what’s happening in practice if you have access to data.</li>
</ol>
<h2>Principle-Based</h2>
<p>Even though there isn’t an official designation, we can look broadly at how long it takes to master a skill or job like UX researcher. One guideline from popular culture (based on some research and popularized by Malcolm Gladwell) is the <a href="https://jamesclear.com/deliberate-practice-strategy">10,000-hour rule</a>: after about 10,000 hours of practice, you master a skill. At roughly 2,000 working hours per year, that works out to about five years of full-time work. It’s a very rough guideline and <a href="https://www.bbc.com/future/article/20121114-gladwells-10000-hour-rule-myth">definitely has its critics</a>.</p>
<h2>Tradition and Trends</h2>
<p>Seniority levels can <a href="https://www.indeed.com/career-advice/career-development/seniority-level">differ by industry type</a>. For example:</p>
<ul>
<li>The <a href="https://www.asce.org/career-growth/early-career-engineers/asce-guidelines-for-engineering-grades">American Society of Civil Engineers</a> has grades (from I to VIII) with <a href="https://www.asce.org/-/media/asce-images-and-files/career-and-growth/early-career-engineer/engineering-grades.pdf">detailed descriptions</a> of expected skills and responsibilities. For example, an engineer at Grade IV has at least four years of experience.</li>
<li><a href="https://hrsimple.com/law-firm-hierarchy-roles-and-career-paths/">Associates in a law firm</a> can be junior (1–3 years of experience), mid-level (4–6 years), or senior (7–10 years).</li>
<li>In <a href="https://strategycase.com/big-4-salaries/">large consulting firms</a>, senior associates typically have 2–5 years of experience.</li>
</ul>
<p>For the expected minimum number of years of experience for UX researchers, it makes sense to start with personal experience. In our decades of experience at large companies (IBM, Oracle, GE, Intuit, PeopleSoft), something like five years was a loose criterion. Below that, people would question the designation.</p>
<p>We carry a similar tradition at MeasuringU, and those with five years’ experience are considered senior. But at a tech-enabled agency, UX researchers typically conduct hundreds of moderated sessions and use a wide variety of methods such as unmoderated benchmarking, eye-tracking, in-depth interviews, diary studies, and surveys. A couple of years working here usually exposes a researcher to significantly more UX-related tasks than in a typical in-house role. At the same time, they are much less exposed to the very real job of navigating the politics of competing stakeholders and corporate hierarchies.</p>
<h2>Data: Salary Surveys, LinkedIn Profiles, and Job Posts</h2>
<p>Our preferred method is looking for data to guide decisions. We have three sources. The first is the biennial UXPA Salary Survey, which was last conducted in <a href="https://uxpa.org/salary-surveys/">2024</a>. The second is LinkedIn, which provides access to job titles and a crude way of determining years of experience. The third is requirements from recent job postings.</p>
<h3>UXPA Senior User Researcher Data</h3>
<p>The 2024 Salary Survey had 444 responses. Of those, 64% (276) described themselves as user researchers. Respondents could pick one of five employment levels. Table 1 shows that about half (130) of the user researchers classified themselves as “Senior-level, non-supervisory.”</p>

<table id="tablepress-1043" class="tablepress tablepress-id-1043">
<thead>
<tr class="row-1">
	<th class="column-1">Employment Level</th><th class="column-2">Number</th><th class="column-3"> %</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">Entry</td><td class="column-2"> 18</td><td class="column-3"> 7%</td>
</tr>
<tr class="row-3">
	<td class="column-1">Mid-level, non-supervisory</td><td class="column-2"> 73</td><td class="column-3">26%</td>
</tr>
<tr class="row-4">
	<td class="column-1">Mid-level, supervisory</td><td class="column-2"> 10</td><td class="column-3"> 4%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Senior-level, non-supervisory</td><td class="column-2">130</td><td class="column-3">47%</td>
</tr>
<tr class="row-6">
	<td class="column-1">Senior-level, supervisory</td><td class="column-2"> 45</td><td class="column-3">16%</td>
</tr>
<tr class="row-7">
	<td class="column-1">Total</td><td class="column-2">276</td><td class="column-3"></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1043 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Distribution of user researchers by employment level (2024 UXPA Salary Survey).</p>
<p>Respondents also selected their years of experience in response to the question “How long have you worked in this field (please round to the nearest year)” using the pre-determined buckets shown in Table 2.</p>

<table id="tablepress-1044" class="tablepress tablepress-id-1044">
<thead>
<tr class="row-1">
	<th class="column-1">Years of Experience</th><th class="column-2"># Senior</th><th class="column-3">% Senior</th><th class="column-4">% With More Experience</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1"> 0–2 yrs</td><td class="column-2"> 1</td><td class="column-3"> 1%</td><td class="column-4">99%</td>
</tr>
<tr class="row-3">
	<td class="column-1"> 3–4 yrs</td><td class="column-2">17</td><td class="column-3">13%</td><td class="column-4">86%</td>
</tr>
<tr class="row-4">
	<td class="column-1"> 5–7 yrs</td><td class="column-2">28</td><td class="column-3">22%</td><td class="column-4">65%</td>
</tr>
<tr class="row-5">
	<td class="column-1"> 8–10 yrs</td><td class="column-2">21</td><td class="column-3">16%</td><td class="column-4">48%</td>
</tr>
<tr class="row-6">
	<td class="column-1">11–15 yrs</td><td class="column-2">21</td><td class="column-3">16%</td><td class="column-4">32%</td>
</tr>
<tr class="row-7">
	<td class="column-1">16–20 yrs</td><td class="column-2">15</td><td class="column-3">12%</td><td class="column-4">21%</td>
</tr>
<tr class="row-8">
	<td class="column-1">21+ yrs</td><td class="column-2">27</td><td class="column-3">21%</td><td class="column-4"> 0%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1044 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2:</strong> Distribution of 130 non-supervisory senior-level user researchers by years of experience (2024 UXPA Salary Survey).</p>
<p>For example, only one person who reported being a senior user researcher had two years or fewer of experience, which means 99% had more than two years. The second row of the table shows that 17 had three to four years of experience. Adding those to the one respondent with less experience gives 18 of the 130 respondents with fewer than five years. That means <strong>86% of non-supervisory senior user researchers reported five or more years of experience</strong>. Using the center of each experience range as a rough estimate, the average across the sample was 12–13 years. Of course, people may inflate their years of experience on an anonymous survey.</p>
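<p>The arithmetic behind Table 2 can be sketched in a few lines of Python. The counts come from the table. The bucket midpoints are rough estimates, and the value of 23 years for the open-ended “21+” bucket is purely an assumption for illustration, so treat the resulting mean as approximate.</p>

```python
# Senior, non-supervisory user researchers per experience bucket (Table 2).
buckets = ["0-2", "3-4", "5-7", "8-10", "11-15", "16-20", "21+"]
counts = [1, 17, 28, 21, 21, 15, 27]

# Bucket midpoints in years. The 23 used for the open-ended "21+" bucket
# is an assumption; the survey does not report an upper limit.
midpoints = [1, 3.5, 6, 9, 13, 18, 23]

total = sum(counts)

# Share of all senior respondents that falls in each bucket.
pct_senior = [100 * c / total for c in counts]

# Share of senior respondents with MORE experience than each bucket:
# everyone not yet counted after accumulating through that bucket.
pct_more = []
running = 0
for c in counts:
    running += c
    pct_more.append(100 * (total - running) / total)

# Midpoint-weighted mean years of experience (a rough estimate only).
mean_years = sum(c * m for c, m in zip(counts, midpoints)) / total

for b, c, ps, pm in zip(buckets, counts, pct_senior, pct_more):
    print(f"{b:>6} yrs  n={c:>3}  {ps:4.0f}% senior  {pm:4.0f}% with more")
print(f"Midpoint-weighted mean: {mean_years:.1f} years")
```

<p>Choosing a different midpoint for the “21+” bucket shifts the mean, which is why the article reports it as a range rather than a single value.</p>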
<p>We also looked at UX designers in the UXPA dataset and found a similar pattern. Of the 56 UX designers who self-identified as senior, 87% had at least five years of experience (Table 3).</p>

<table id="tablepress-1045" class="tablepress tablepress-id-1045">
<thead>
<tr class="row-1">
	<th class="column-1">Years of Experience</th><th class="column-2"># Senior</th><th class="column-3">% Senior</th><th class="column-4">% With More Experience</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1"> 0–2 yrs</td><td class="column-2"> 1</td><td class="column-3"> 2%</td><td class="column-4">98%</td>
</tr>
<tr class="row-3">
	<td class="column-1"> 3–4 yrs</td><td class="column-2"> 6</td><td class="column-3">11%</td><td class="column-4">87%</td>
</tr>
<tr class="row-4">
	<td class="column-1"> 5–7 yrs</td><td class="column-2">13</td><td class="column-3">23%</td><td class="column-4">64%</td>
</tr>
<tr class="row-5">
	<td class="column-1"> 8–10 yrs</td><td class="column-2">14</td><td class="column-3">25%</td><td class="column-4">39%</td>
</tr>
<tr class="row-6">
	<td class="column-1">11–15 yrs</td><td class="column-2"> 8</td><td class="column-3">14%</td><td class="column-4">25%</td>
</tr>
<tr class="row-7">
	<td class="column-1">16–20 yrs</td><td class="column-2"> 6</td><td class="column-3">11%</td><td class="column-4">14%</td>
</tr>
<tr class="row-8">
	<td class="column-1">21+ yrs</td><td class="column-2">8</td><td class="column-3">14%</td><td class="column-4">0%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1045 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 3:</strong> Distribution of 56 non-supervisory senior-level UX designers by years of experience (2024 UXPA Salary Survey).</p>
<h3>LinkedIn Profiles</h3>
<p>Another approach is to look at how many years of experience senior UX researchers on LinkedIn have in their job history. While job dates can always be padded a bit, it’s a lot harder to claim unearned experience on a public professional forum. We did an informal examination, searching for “Senior UX Researcher” and hand-counting the years of non-supervisory experience for the first 50 profiles returned.</p>
<p>Across the 50 profiles, the average was a bit over nine years of experience (Table 4). The minimum was just shy of five years at 4.75. Only four profiles (8%) had less than five years of experience. In other words, this crude estimate suggests 92% of senior user researchers have five or more years of experience.</p>

<table id="tablepress-1046" class="tablepress tablepress-id-1046">
<tbody>
<tr class="row-1">
	<td class="column-1"><strong>Mean Years of Experience</strong></td><td class="column-2">9.1</td>
</tr>
<tr class="row-2">
	<td class="column-1"><strong>Min Years</strong></td><td class="column-2">4.75</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong># &lt; 5</strong></td><td class="column-2">4</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>% &lt; 5</strong></td><td class="column-2">8%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Total #</strong></td><td class="column-2">50</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>% ≥ 5</strong></td><td class="column-2">92%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1046 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 4:</strong> Analysis of 50 LinkedIn profiles of senior-level non-supervisory UX researchers.</p>
<h3>Job Posts</h3>
<p>Finally, we did another (very) informal search for senior UX researcher job postings (as of May 3, 2026) that were posted on Indeed. Of the five we found, all explicitly required five or more years of experience.</p>
<h2>Discussion and Summary</h2>
<p>There’s no official rule for what makes a UX researcher senior, but multiple approaches point to a consistent answer: at least five years.</p>
<ul>
<li><strong>Principle-based heuristics are consistent with five.</strong> Guidelines loosely based on research (like the 10,000-hour rule) suggest it takes about <strong>five years of focused experience</strong> to develop expertise. This is a weak rationale, but it&#8217;s a starting point.</li>
<li><strong>Tradition and trends suggest five.</strong> In our experience in the industry, it’s common to use <strong>five years as a minimum threshold</strong>. Other industries fall close to the five-year threshold as well.</li>
<li><strong>Salary survey data supports five.</strong> In the 2024 UXPA Salary Survey, <strong>86% of senior UX researchers reported five or more years of experience</strong>, with an average of around 12–13 years. Of the senior UX designers, an adjacent role in the UX industry, 87% reported five or more years of experience.</li>
<li><strong>Existing profiles and open jobs show five+ years.</strong> Our LinkedIn sample of 50 senior UX researchers showed similar results, with <strong>92% at five or more years of experience</strong> and an average of about nine years. Finally, a selection of five currently open senior UX researcher jobs on Indeed all explicitly require at least five years of experience.</li>
</ul>
<p>If you’re looking to set a threshold for becoming senior, five years seems like a good rule.</p>
<p>Of course, years alone don’t define seniority, but if someone has fewer than five years of experience, the <em>senior</em> title should be the exception, not the rule.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Interpret a Rating Scale Without Historical Data</title>
		<link>https://measuringu.com/how-to-interpret-a-rating-scale-without-historical-data/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-to-interpret-a-rating-scale-without-historical-data</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Tue, 12 May 2026 20:19:38 +0000</pubDate>
				<category><![CDATA[Benchmarking]]></category>
		<category><![CDATA[Rating Scale]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Rating Scales]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47556</guid>

					<description><![CDATA[UX researchers use a lot of rating scales. We recommend using standardized rating scales when possible. One of the benefits of some standardized scales, such as the SUS, SUPR-Q®, and UX-Lite®, is that you have a reference database of historical data. But there’s not always a standardized questionnaire for everything you’re hoping to measure, so [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-scaled.png"><img decoding="async" class="alignleft wp-image-47595 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-300x169.png" alt="Feature image showing a UX researcher interpreting a rating scale without historical data" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/05/051226-FeatureImage-600x338.png 600w" sizes="(max-width: 300px) 100vw, 300px" /></a>UX researchers use a lot of rating scales. We recommend using standardized rating scales when possible. One of the benefits of some standardized scales, such as the SUS, SUPR-Q<sup>®</sup>, and UX-Lite<sup>®</sup>, is that you have a reference database of historical data.</p>
<p>But there’s not always a standardized questionnaire for everything you’re hoping to measure, so researchers need to create <a href="https://en.wikipedia.org/wiki/Ad_hoc">ad hoc</a> ones.</p>
<p>Data collected with ad hoc rating scales can be difficult to interpret, especially if you don’t have any historical data (e.g., from past product performance or competitors).</p>
<p>If you’re comparing multiple conditions (e.g., ratings on attributes for two or more websites), then you can check for significant differences in rating scale means.</p>
<p>But even clear differences in means don’t answer the question about whether a given mean indicates a poor or good user experience.</p>
<p>In this article, we provide a way to interpret five- and seven-point UX rating scales when you don’t have enough historical data for custom benchmarks. We use the well-known distribution of the System Usability Scale (<a href="https://measuringu.com/10-things-sus/">SUS</a>) as the basis for our recommendation.</p>
<h2>UX Rating Scales Tend to Be Negatively Skewed</h2>
<p>If you’ve never plotted your distributions of rating scale response options, you should. But don’t be surprised when you see a negatively skewed distribution (tail of data points to the left).</p>
<p>Most UX rating scales have this negative skew because (1) most item stems have a positive tone (e.g., “I felt very confident using this website”) and (2) respondents are <a href="https://dl.acm.org/doi/pdf/10.1145/175276.175282">generally more likely to agree</a> (selecting higher responses). This means that the middle value (e.g., a 3 on a five-point scale) isn’t a good measure of the “average.” This skew doesn’t make the responses necessarily bad or not useful. It just means you need to account for that skew when interpreting them.</p>
<p>For example, you can see the skew in the distribution of SUS scores (Figure 1): 50 is the middle of the scale, but the middle of the distribution (the median) is 68.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure1-1.png"><img decoding="async" class="alignnone wp-image-47597 size-full" src="https://measuringu.com/wp-content/uploads/2026/05/Figure1-1.png" alt="Figure 1: Distribution of 3,187 individual SUS scores (50 is the middle of the scale, but the median is 68)." width="1200" height="719" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure1-1.png 1200w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-300x180.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-1024x614.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-768x460.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1-600x360.png 600w" sizes="(max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Distribution of 3,187 individual SUS scores (50 is the middle of the scale, but the median is 68).</p>
<h2>Default Benchmarks Based on Historical SUS Distribution</h2>
<p>Taking advantage of the well-known distribution of the SUS, we created a curved grading scale that is <a href="https://www.researchgate.net/publication/324116412_The_System_Usability_Scale_Past_Present_and_Future">widely used in UX research</a> (Table 1). We’ll use this as a basis for interpreting ad hoc scales.</p>

<table id="tablepress-1040" class="tablepress tablepress-id-1040">
<thead>
<tr class="row-1">
	<th class="column-1">SUS Score Range</th><th class="column-2">Grade</th><th class="column-3">Percentile Range</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">84.1–100</td><td class="column-2">A+</td><td class="column-3">96–100</td>
</tr>
<tr class="row-3">
	<td class="column-1">80.8–84.0</td><td class="column-2">A</td><td class="column-3">90–95</td>
</tr>
<tr class="row-4">
	<td class="column-1">78.9–80.7</td><td class="column-2">A−</td><td class="column-3">85–89</td>
</tr>
<tr class="row-5">
	<td class="column-1">77.2–78.8</td><td class="column-2">B+</td><td class="column-3">80–84</td>
</tr>
<tr class="row-6">
	<td class="column-1">74.1–77.1</td><td class="column-2">B</td><td class="column-3">70–79</td>
</tr>
<tr class="row-7">
	<td class="column-1">72.6–74.0</td><td class="column-2">B−</td><td class="column-3">65–69</td>
</tr>
<tr class="row-8">
	<td class="column-1">71.1–72.5</td><td class="column-2">C+</td><td class="column-3">60–64</td>
</tr>
<tr class="row-9">
	<td class="column-1">65.0–71.0</td><td class="column-2">C</td><td class="column-3">41–59</td>
</tr>
<tr class="row-10">
	<td class="column-1">62.7–64.9</td><td class="column-2">C−</td><td class="column-3">35–40</td>
</tr>
<tr class="row-11">
	<td class="column-1">51.7–62.6</td><td class="column-2">D</td><td class="column-3">15–34</td>
</tr>
<tr class="row-12">
	<td class="column-1"> 0.0–51.6</td><td class="column-2">F</td><td class="column-3"> 0–14</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1040 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Curved grading scale for the SUS.</p>
<p>The 50<sup>th</sup> percentile of this scale is a SUS score of 68, a solid C. Another important benchmark commonly used in practice is an aspirational score of 80 (the upper end of an A−, a bit higher than the 85<sup>th</sup> percentile). Scores lower than 51.7 are in the F range (just below the 15<sup>th</sup> percentile).</p>
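<p>As a rough illustration, the curved grading scale in Table 1 can be expressed as a lookup on the lower bound of each score range. The function name <code>sus_grade</code> and the list name <code>SUS_GRADES</code> are hypothetical, and scores that fall in the tiny gaps between published ranges (e.g., 84.05) are assigned to the lower grade, which is an assumption of this sketch.</p>

```python
# (lower bound of SUS score range, grade) pairs taken from Table 1,
# ordered from highest to lowest. Minus grades use ASCII hyphens.
SUS_GRADES = [
    (84.1, "A+"), (80.8, "A"), (78.9, "A-"), (77.2, "B+"),
    (74.1, "B"), (72.6, "B-"), (71.1, "C+"), (65.0, "C"),
    (62.7, "C-"), (51.7, "D"), (0.0, "F"),
]


def sus_grade(score: float) -> str:
    """Return the curved letter grade for a SUS score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError("SUS scores range from 0 to 100")
    # Walk down the thresholds and return the first grade whose
    # lower bound the score meets or exceeds.
    for lower_bound, grade in SUS_GRADES:
        if score >= lower_bound:
            return grade
    return "F"  # Not reached for valid scores; kept as a safe default.


for example in (68, 80, 51.6):
    print(example, sus_grade(example))
```

<p>The three example scores correspond to the benchmarks discussed above: the median (68), the aspirational score (80), and a score just inside the F range (51.6).</p>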
<p>Based on the SUS research, when we consult with clients who need a benchmark for five- or seven-point scales and there is no historical data, we usually recommend setting the benchmark for average at about 70% of the range of the scale, good at 80%, and poor at 50%, similar to the historical benchmarks for the SUS. For example, this is what we did when we created our <a href="https://measuringu.com/grading-scales-for-the-ux-lite/">standard grading scale for the UX-Lite</a>.</p>
<p>Table 2 shows those values for five- and seven-point scales (the midpoint for a five-point scale is 3, and for a seven-point scale is 4).</p>

<table id="tablepress-1042" class="tablepress tablepress-id-1042">
<thead>
<tr class="row-1">
	<th class="column-1">Location on Scale</th><th class="column-2">Interpretation</th><th class="column-3">Five-point</th><th class="column-4">Seven-point</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1">80%</td><td class="column-2">Good</td><td class="column-3">4.2</td><td class="column-4">5.8</td>
</tr>
<tr class="row-3">
	<td class="column-1">70%</td><td class="column-2">Average</td><td class="column-3">3.8</td><td class="column-4">5.2</td>
</tr>
<tr class="row-4">
	<td class="column-1">60%</td><td class="column-2">Below Average</td><td class="column-3">3.4</td><td class="column-4">4.6</td>
</tr>
<tr class="row-5">
	<td class="column-1">50%</td><td class="column-2">Poor</td><td class="column-3">3.0</td><td class="column-4">4.0</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1042 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2:</strong> Initial benchmarks at 50%, 60%, 70%, and 80% of the range of five- and seven-point scales.</p>
<p>The formula for these values comes from the <a href="https://measuringu.com/converting-scales-to-100-points/">methods for interpolating rating scale scores</a> from scales that start at 1 to a 0–100-point scale. Rearranged algebraically, it computes the rating-scale benchmark from the target location (e.g., 80% of the range, entered as 80) and the maximum possible rating (typically 5 or 7 for scales that start at 1):</p>
<p>Benchmark = Target / (100 / (MaxRating − 1)) + 1</p>
<p>For example, to find 70% of the range of a five-point scale, the benchmark would be:</p>
<p>70 / (100 /(5 − 1)) + 1 = 70 / 25 + 1 = 3.8</p>
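<p>The benchmark formula can be written as a small Python function. This is a minimal sketch that assumes the scale starts at 1, as the formula does; the function name <code>benchmark</code> and its parameter names are hypothetical.</p>

```python
def benchmark(target: float, max_rating: int) -> float:
    """Rating-scale value located at `target` percent of the scale range.

    Assumes the scale starts at 1 (e.g., 1-5 or 1-7).
    """
    if max_rating <= 1:
        raise ValueError("max_rating must be greater than 1")
    # Number of 0-100 points covered by one step on the rating scale.
    points_per_step = 100 / (max_rating - 1)
    return target / points_per_step + 1


# Reproduce the benchmark values for five- and seven-point scales.
for target in (80, 70, 60, 50):
    five = round(benchmark(target, 5), 1)
    seven = round(benchmark(target, 7), 1)
    print(f"{target}%: five-point {five}, seven-point {seven}")
```

<p>Rounding to one decimal place only tidies the floating-point output for display.</p>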
<p>An <a href="https://measuringu.com/types-of-100-point-scales/">alternative approach</a> is to convert five- or seven-point ratings to a 0–100-point scale. John Brooke, the developer of the SUS, <a href="https://uxpajournal.org/sus-a-retrospective/">described the value of this approach</a>: “Project managers, product managers, and engineers were more likely to understand a scale that went from 0 to 100 than one that went from 10 to 50, and the important thing was to be able to grab their attention in the short space of time they were likely to spend thinking about usability, without having to go into a detailed explanation.”</p>
<p>The general formula for converting a five- or seven-point scale to 0–100 points is:</p>
<p>Rating100 = (Rating − 1) * 100 / (MaxRating − 1)</p>
<p>For example, a five-point mean rating of 4.2 would become 80:</p>
<p>(4.2 − 1) * (100 / (5 − 1)) = 3.2(25) = 80</p>
<p>A seven-point mean rating of 4.0 would become 50:</p>
<p>(4 − 1) * (100 / (7 − 1)) = 3(16.67) ≈ 50</p>
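<p>The conversion in the other direction can be sketched the same way. Again, this assumes a scale that starts at 1, and the name <code>rating_to_100</code> is hypothetical.</p>

```python
def rating_to_100(rating: float, max_rating: int) -> float:
    """Convert a rating on a 1-to-max_rating scale to a 0-100-point scale."""
    if max_rating <= 1:
        raise ValueError("max_rating must be greater than 1")
    if not 1 <= rating <= max_rating:
        raise ValueError("rating must lie between 1 and max_rating")
    return (rating - 1) * 100 / (max_rating - 1)


# The two worked examples from the text, rounded for display.
print(round(rating_to_100(4.2, 5), 1))  # five-point mean of 4.2
print(round(rating_to_100(4.0, 7), 1))  # seven-point mean of 4.0
```

<p>Because both scales map onto the same 0–100 range, converted means from five- and seven-point items can be compared against the same 50/60/70/80 benchmarks.</p>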
<p><strong><em>Caveat</em></strong><em><strong>:</strong> Note that these are initial benchmarks to use when UX researchers lack a more grounded rationale for interpreting mean rating scale scores. After a reasonable amount of data collection with the scale, it’s a good idea to revisit the initial benchmarks to see whether they should be adjusted.</em></p>
<h2>Summary</h2>
<p>When you&#8217;re working with an ad hoc rating scale and have no historical data to lean on, the SUS distribution gives you a principled starting point. Because UX rating scales share a consistent negative skew (driven by positive item wording and respondent agreement bias), benchmarks derived from the SUS translate reasonably well to other five- and seven-point scales. It’s not that there’s something magic about the SUS. It works well because it’s a composite of ten five-point UX rating scales that share the tendency of other UX rating scales to be negatively skewed (more favorable than unfavorable). This means that benchmarks informed by the SUS provide a good initial approximation for other UX rating scales.</p>
<p>This pattern supports two anchor points for UX rating scales:</p>
<ul>
<li>Setting “Poor” at the midpoint of the scale (50% of the range) because means of positive-tone UX rating scales are consistently higher than the scale midpoint.</li>
<li>Setting “Good” at 80% of the scale range because 80% is the <a href="https://www.uslanguageservices.com/guides-resources/understanding-the-u-s-grading-system/">traditional score for a B</a> (above average).</li>
</ul>
<p>Placing other cut points between 50% and 80% leads to these initial benchmarks:</p>
<ul>
<li><strong>Good</strong>: Located at <strong>80%</strong> of the range of the scale</li>
<li><strong>Average</strong>: Located at <strong>70%</strong> of the range of the scale</li>
<li><strong>Below average</strong>: Located at <strong>60%</strong> of the range of the scale</li>
<li><strong>Poor</strong>: Located at <strong>50%</strong> of the range of the scale (the midpoint)</li>
</ul>
<p>It’s important to keep in mind that these are reasonable best guesses without a strong normative database. For UX rating scale items that will be used frequently over time, researchers should plan to build normative databases and use them to tune the benchmarks (like we have <a href="https://measuringu.com/evolution-of-seq/">done with the SEQ<sup>®</sup></a>).</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Can AI Detect Usability Problems Like Researchers?</title>
		<link>https://measuringu.com/ai-vs-human-usability-problem-analysis-of-a-video/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ai-vs-human-usability-problem-analysis-of-a-video</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD •&nbsp;Jeff Sauro, PhD •&nbsp;Will Schiavone, PhD&nbsp;•&nbsp;Lucas Plabst, PhD]]></dc:creator>
		<pubDate>Wed, 06 May 2026 04:25:45 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[Gemini]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47502</guid>

					<description><![CDATA[AI can “watch” videos. It can even generate a list of problems. In some cases, these problem lists seem to be reasonably consistent (reliable). But consistency is not accuracy. Are these real problems or just sophisticated AI slop generated consistently by autocorrect for video? How can we know? One way to find out is to [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Feature050526.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47551 size-medium" src="https://measuringu.com/wp-content/uploads/2026/05/Feature050526-300x169.png" alt="Feature image showing a count of problems found by AI versus human researchers" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/05/Feature050526-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/05/Feature050526.png 1280w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>AI can <a href="https://measuringu.com/can-ai-detect-usability-problems">“watch” videos</a>.</p>
<p>It can even generate a list of problems. In some cases, these problem lists seem to be <a href="https://measuringu.com/ai-usability-problem-analysis-of-a-video">reasonably consistent (reliable)</a>.</p>
<p>But consistency is not accuracy. Are these real problems or just sophisticated AI slop generated consistently by autocorrect for video?</p>
<p>How can we know? One way to find out is to compare the AI problem lists to those created by trained UX researchers.</p>
<p><strong>Are the problems an AI finds the same problems a UX researcher would find?</strong></p>
<p>In this article, we move from reliability to <strong>validity</strong> by comparing the problems identified by AI to those found by human UX researchers reviewing the same video.</p>
<h2><span lang="EN-US">Humans vs. AI: Same Video, Same Task</span></h2>
<p>For this study, four UX researchers at MeasuringU independently reviewed a roughly six-minute usability test video and created lists of observed usability problems. The primary evaluator had over 40 years of experience coding usability problems, while the other three, at the time of the study, each had less than a year of experience. The video they watched was the same one reviewed by two LLMs in our previous assessment of AI reliability (ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking, four runs per LLM to assess reliability). The participant’s task was to use OpenTable.com to book a reservation:</p>
<blockquote><p>“Make a reservation for four people at a sushi restaurant in Denver, CO tomorrow anytime after 5:00pm. Make sure the restaurant you select is not at the lowest or highest price point. Of the restaurants that fit these criteria, look at their overall rating, customer reviews, and photos to select the one that is the most appealing to you. Go as far as you can in the reservation process until you are asked for your personal information or account details. DO NOT fully confirm the reservation. Write down the restaurant name and the time of the reservation. You will be asked about this information after the task.”</p></blockquote>
<p>The directions for the human evaluators matched the prompt given to the LLMs:</p>
<blockquote><p>“During a usability test, the facilitator must keep track of participant behaviors as they navigate through tasks on a website, mobile app, software program, etc. We’d like you to watch a video of a usability test where participants were asked to book a table at a Sushi restaurant. As you&#8217;re watching, please look for problems the participant has while attempting to complete the task. For example, you can document the path users take, describe issues they encounter as well as what on the website might be causing problems.”</p></blockquote>
<h3><span lang="EN-US">Coding and Matching Problems</span></h3>
<p>The four researchers independently created their problem lists. The senior researcher (Evaluator 1) then reviewed, matched, and consolidated the problems as shown in Table 1.</p>

<table id="tablepress-1034" class="tablepress tablepress-id-1034">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">Human Evaluators Problem List</th><th class="column-3">Eval 1</th><th class="column-4">Eval 2</th><th class="column-5">Eval 3</th><th class="column-6">Eval 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-3">
	<td class="column-1">2</td><td class="column-2">Odd display of email addresses upon click in search field</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-4">
	<td class="column-1">3</td><td class="column-2">Avoided search field and looked in filters to try to change location</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-5">
	<td class="column-1">4</td><td class="column-2">Entering Denver in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-6">
	<td class="column-1">5</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-7">
	<td class="column-1">6</td><td class="column-2">Surprised when typing sushi into search field did not lose current location</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-8">
	<td class="column-1">7</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-9">
	<td class="column-1">8</td><td class="column-2">Participant wanted to change sort to lowest rating first but not an option</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-10">
	<td class="column-1">9</td><td class="column-2">Despite having selected 5pm at start of process user needed to reselect it later</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-11">
	<td class="column-1"><i>Total</td><td class="column-2"></td><td class="column-3"><i><center>6</td><td class="column-4"><i><center>5</td><td class="column-5"><i><center>8</td><td class="column-6"><i><center>7</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1034 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Human evaluators&#8217; problem list.</p>
<p>Nine total problems were identified, none of which was classified as a false alarm by Evaluator 1. Four problems (3, 4, 5, and 6) were identified by all four UX researchers. Two problems (1, 8) were identified by three evaluators, one problem (7) by two evaluators, and two problems (2, 9) by one evaluator.</p>
<h3><span lang="EN-US">High Reliability for Humans</span></h3>
<p>With the consolidated problem list, we computed the <a href="https://measuringu.com/assessing-interrater-reliability-in-ux-research/">any-2 agreement</a> across all pairs of evaluators as shown in Table 2. For each pair, any-2 agreement is the number of problems both evaluators reported divided by the number of problems either one reported. It is better suited than kappa to assessing interrater reliability when evaluators produce different problem lists.</p>

<table id="tablepress-1035" class="tablepress tablepress-id-1035">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Eval 1</th><th class="column-3">Eval 2</th><th class="column-4">Eval 3</th><th class="column-5">Eval 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Eval 1</td><td class="column-2"> x</td><td class="column-3">57%</td><td class="column-4">56%</td><td class="column-5">63%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Eval 2</td><td class="column-2">57%</td><td class="column-3"> x</td><td class="column-4">63%</td><td class="column-5">71%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Eval 3</td><td class="column-2">56%</td><td class="column-3">63%</td><td class="column-4"> x</td><td class="column-5">88%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Eval 4</td><td class="column-2">63%</td><td class="column-3">71%</td><td class="column-4">88%</td><td class="column-5"> x</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1035 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2: </strong>Any-2 agreement for the human evaluators.</p>
<p>The average any-2 agreement across all pairs was 66%. Based on our data, the general rule of thumb for interpreting any-2 agreement is that 50% is typical, 25% is low, and 75% is high.</p>
<p>That means the reliability of the human evaluators was <strong>relatively high</strong>, likely because some of the usability problems in the list were quite salient (4/9 identified by all four evaluators, 6/9 identified by at least three evaluators).</p>
<p>In our <a href="https://measuringu.com/ai-usability-problem-analysis-of-a-video">previous study</a> of AI analysis, the reliability of ChatGPT was relatively low (31%) while Gemini was above average (57%).</p>
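<p>As a rough check, the human any-2 figures can be recomputed from the problem assignments in Table 1. The sketch below assumes any-2 agreement for a pair is the count of shared problems divided by the count of unique problems across both lists; the names <em>human_lists</em> and <em>any2</em> are ours for illustration, not part of any published tool.</p>

```python
from itertools import combinations

# Problem numbers each human evaluator reported, transcribed from Table 1.
human_lists = {
    "Eval 1": {1, 3, 4, 5, 6, 9},
    "Eval 2": {3, 4, 5, 6, 8},
    "Eval 3": {1, 2, 3, 4, 5, 6, 7, 8},
    "Eval 4": {1, 3, 4, 5, 6, 7, 8},
}


def any2(a, b):
    """Shared problems divided by all unique problems across the two lists."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


# One any-2 value per pair of evaluators (six pairs for four evaluators).
pairwise = {
    (x, y): any2(human_lists[x], human_lists[y])
    for x, y in combinations(human_lists, 2)
}
mean_any2 = sum(pairwise.values()) / len(pairwise)

for (x, y), value in pairwise.items():
    print(f"{x} vs {y}: {value:.1%}")
print(f"Mean any-2 agreement: {mean_any2:.1%}")
```

<p>Under that assumption, the six pairwise values round to the percentages in Table 2, and their mean rounds to 66%.</p>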
<h2><span lang="EN-US">Agreement Was Low Between AIs and Humans</span></h2>
<p>We created consolidated problem lists for ChatGPT and Gemini by combining results across four runs and matching them to the human-identified problems. Problems labeled “ChatGPT” or “Gem” are unique to those systems. Problems without labels were also found by humans.</p>
<h3><span lang="EN-US">ChatGPT Validity</span></h3>
<p>Table 3 shows the combined problem list for the four runs of ChatGPT. It included five problems from the human list and seven unique problems. Table 4 shows the human by ChatGPT any-2 agreement.</p>

<table id="tablepress-1036" class="tablepress tablepress-id-1036">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">ChatGPT Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-3">
	<td class="column-1">4</td><td class="column-2">Entering Denver in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-4">
	<td class="column-1">4b-ChatGPT</td><td class="column-2">Filters not helpful</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-5">
	<td class="column-1">5</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-6">
	<td class="column-1">6</td><td class="column-2">Surprised when typing sushi into search field did not lose current location</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">6b-ChatGPT</td><td class="column-2">Search results for sushi included many non-sushi restaurants</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</td>
</tr>
<tr class="row-8">
	<td class="column-1">6c-ChatGPT</td><td class="column-2">Weak presentation of cuisine information in search results</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-9">
	<td class="column-1">7</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">7b-ChatGPT</td><td class="column-2">Sorting by highest rated put many non-sushi restaurants at the top of the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</td>
</tr>
<tr class="row-11">
	<td class="column-1">8b-ChatGPT</td><td class="column-2">UI pushes browsing without good decision support</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-12">
	<td class="column-1">10a-ChatGPT</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-13">
	<td class="column-1">10b-ChatGPT</td><td class="column-2">Task not completed because did not reach reservation form</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-14">
	<td class="column-1"><i>Total</td><td class="column-2"></td><td class="column-3"><i><center>7</td><td class="column-4"><i><center>6</td><td class="column-5"><i><center>6</td><td class="column-6"><i><center>4</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1036 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 3: </strong>ChatGPT evaluations problem list (problems tagged with ChatGPT were not reported by humans).</p>

<table id="tablepress-1037" class="tablepress tablepress-id-1037">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">ChatGPT 1</th><th class="column-3">ChatGPT 2</th><th class="column-4">ChatGPT 3</th><th class="column-5">ChatGPT 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Eval 1</td><td class="column-2">18%</td><td class="column-3">33%</td><td class="column-4">33%</td><td class="column-5">0%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Eval 2</td><td class="column-2"> 9%</td><td class="column-3">22%</td><td class="column-4">22%</td><td class="column-5">0%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Eval 3</td><td class="column-2">25%</td><td class="column-3">27%</td><td class="column-4">27%</td><td class="column-5">0%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Eval 4</td><td class="column-2">27%</td><td class="column-3">30%</td><td class="column-4">30%</td><td class="column-5">0%</td>
</tr>
<tr class="row-6">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tbody>
<tfoot>
<tr class="row-7">
	<th class="column-1">Mean:</th><th class="column-2">19%</th><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tfoot>
</table>
<!-- #tablepress-1037 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 4: </strong>Any-2 agreement for human with ChatGPT evaluations.</p>
<h3><span lang="EN-US">Gemini Validity</span></h3>
<p>Table 5 shows the combined problem list for the four runs of Gemini. It included four problems from the human list and five unique problems. Table 6 shows the human by Gemini any-2 agreement.</p>

<table id="tablepress-1038" class="tablepress tablepress-id-1038">
<thead>
<tr class="row-1">
	<th class="column-1">Prob #</th><th class="column-2">Gemini Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">4</td><td class="column-2">Entering Denver in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-3">
	<td class="column-1">5</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-4">
	<td class="column-1">5b-Gem</td><td class="column-2">Participant used Ctrl-F to search page for "sushi"—not found</td><td class="column-3"><center>1</td><td class="column-4"><center>1</td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-5">
	<td class="column-1">7</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"><center>1</td>
</tr>
<tr class="row-6">
	<td class="column-1">7b-Gem</td><td class="column-2">Participant chose highest price tier</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">8</td><td class="column-2">Participant wanted to change sort to lowest rating first but not an option</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</td><td class="column-6"></td>
</tr>
<tr class="row-8">
	<td class="column-1">9b-Gem</td><td class="column-2">Seating options only presented after selecting time</td><td class="column-3"><center>1</td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-9">
	<td class="column-1">9c-Gem</td><td class="column-2">Set time to 5:10</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">10a-Gem</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"></td><td class="column-4"><center>1</td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-11">
	<td class="column-1"><i>Total</td><td class="column-2"></td><td class="column-3"><i><center>5</td><td class="column-4"><i><center>6</td><td class="column-5"><i><center>5</td><td class="column-6"><i><center>4</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1038 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 5: </strong>Gemini evaluations problem list (problems tagged with Gem were not reported by humans).</p>

<table id="tablepress-1039" class="tablepress tablepress-id-1039">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Gem 1</th><th class="column-3">Gem 2</th><th class="column-4">Gem 3</th><th class="column-5">Gem 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">Eval 1</td><td class="column-2">29%</td><td class="column-3">29%</td><td class="column-4">25%</td><td class="column-5">29%</td>
</tr>
<tr class="row-3">
	<td class="column-1">Eval 2</td><td class="column-2">33%</td><td class="column-3">33%</td><td class="column-4">50%</td><td class="column-5">33%</td>
</tr>
<tr class="row-4">
	<td class="column-1">Eval 3</td><td class="column-2">38%</td><td class="column-3">22%</td><td class="column-4">50%</td><td class="column-5">38%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Eval 4</td><td class="column-2">43%</td><td class="column-3">25%</td><td class="column-4">57%</td><td class="column-5">43%</td>
</tr>
<tr class="row-6">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tbody>
<tfoot>
<tr class="row-7">
	<th class="column-1">Mean:</th><th class="column-2">36%</th><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td>
</tr>
</tfoot>
</table>
<!-- #tablepress-1039 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 6: </strong>Any-2 agreement for human with Gemini evaluations.</p>
<p>We found the <strong>poorest agreement between human and ChatGPT evaluations</strong> (19%). Agreement between human and Gemini evaluations (36%) was substantially higher but still relatively low.</p>
<p>These agreement rates account for problems AI found that humans didn’t. We treated AI-discovered problems as if they were real (for now), but they could have been false positives (an error humans make, too). What is a “real” problem? That’s been a long-standing research question. For now, we’re relying on the senior researcher to determine the real problems. That human-verified problem list is how we’ll evaluate the AIs.</p>
<h3><span lang="EN-US">Did AI Find the Same Problems as Researchers?</span></h3>
<p>We can use the human-generated and verified problem lists as the “gold standard” and assess AI’s “hit rate” as another measure of validity beyond any-2 agreement. The four human evaluators identified nine usability problems. <strong>ChatGPT identified five, and Gemini identified four.</strong> Figure 1 shows the problems identified by human evaluators and how well both AI models identified them. We consolidated the runs, counting a problem if it was found at least once across any of the four runs.</p>
<p>Four of the problems were found by all four researchers, suggesting they were more salient problems. ChatGPT uncovered three of these four, and Gemini uncovered two.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47541" src="https://measuringu.com/wp-content/uploads/2026/05/Figure1-300x83.png" alt="How well AI models found usability problems identified by researchers. " width="1200" height="331" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure1-300x83.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1024x282.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-768x212.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-1536x423.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/Figure1-600x165.png 600w, https://measuringu.com/wp-content/uploads/2026/05/Figure1.png 1971w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1: </strong>How well AI models found usability problems identified by researchers.</p>
<p>Comparing AI to a pooled set of problems from four researchers may not be a fair comparison. We should also consider how well AI does compared to each individual researcher. Figure 1 shows that, for example, ChatGPT identified four of the six problems identified by the senior evaluator. Gemini uncovered two of the six problems. Across the four evaluators, ChatGPT identified between <strong>60% and 71% </strong>of each evaluator’s usability problems, and Gemini identified between <strong>33% and 60%</strong> (see right side of Figure 1).</p>
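<p>The hit rates above can be sketched as simple set arithmetic. This is an illustrative calculation only, assuming a hit rate is the share of a reference list (pooled or per evaluator) that the AI reported in at least one run; the problem numbers are transcribed from Tables 1, 3, and 5, and the names <em>ai_pooled</em> and <em>hit_rate</em> are hypothetical.</p>

```python
# Problem numbers transcribed from Table 1 (one set per human evaluator).
human_lists = {
    "Eval 1": {1, 3, 4, 5, 6, 9},
    "Eval 2": {3, 4, 5, 6, 8},
    "Eval 3": {1, 2, 3, 4, 5, 6, 7, 8},
    "Eval 4": {1, 3, 4, 5, 6, 7, 8},
}

# Human-matched problems each model reported in at least one of its four
# runs (the unlabeled rows of Tables 3 and 5).
ai_pooled = {
    "ChatGPT": {1, 4, 5, 6, 7},
    "Gemini": {4, 5, 7, 8},
}


def hit_rate(ai_found, reference):
    """Share of the reference problems that the AI also reported."""
    return len(ai_found.intersection(reference)) / len(reference)


# Pooled human list: every problem reported by at least one evaluator.
all_human = set().union(*human_lists.values())

results = {}
for model, found in ai_pooled.items():
    per_evaluator = {
        name: hit_rate(found, problems) for name, problems in human_lists.items()
    }
    results[model] = {
        "pooled": hit_rate(found, all_human),
        "per_evaluator": per_evaluator,
    }
    low, high = min(per_evaluator.values()), max(per_evaluator.values())
    print(f"{model}: pooled {results[model]['pooled']:.0%}, "
          f"per evaluator {low:.0%} to {high:.0%}")
```

<p>Keeping the pooled and per-evaluator rates side by side makes it easy to see how much the choice of reference list changes the apparent performance of each model.</p>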
<p>Figure 2 is a Venn diagram that shows the overlap in problems found between both AIs and between AIs and humans. AIs generated eleven problems not identified by any of the four researchers, and there were three problems identified by humans only. ChatGPT came up with seven new ones and Gemini five (they agreed on one of the problems).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/05/Figure2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47542" src="https://measuringu.com/wp-content/uploads/2026/05/Figure2-1024x901.png" alt="Venn diagram of usability problem discovery by humans, ChatGPT, and Gemini. " width="600" height="528" srcset="https://measuringu.com/wp-content/uploads/2026/05/Figure2-1024x901.png 1024w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-300x264.png 300w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-768x676.png 768w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-1536x1351.png 1536w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-2048x1802.png 2048w, https://measuringu.com/wp-content/uploads/2026/05/Figure2-600x528.png 600w" sizes="auto, (max-width: 600px) 100vw, 600px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2: </strong>Venn diagram of usability problem discovery by humans, ChatGPT, and Gemini.</p>
<p>AIs generated more new problems (eleven) than the total list generated by four humans (nine). It’s not clear whether these additional AI-identified problems represent true usability issues that humans missed or are false positives/hallucinations. We’ll dig into the qualitative differences among those problems in an upcoming article. What is clear is that each of these additional problems will likely take a human’s time to review.</p>
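<p>The overlap counts in the Venn diagram can be approximated with set operations on the problem labels from Tables 1, 3, and 5. In this sketch, we assume the “10a” rows in the ChatGPT and Gemini tables are the same problem (their descriptions match) and that the two “7b” rows are different problems (their descriptions differ); the variable names are illustrative.</p>

```python
# Problem labels from Table 1 (humans), Table 3 (ChatGPT), and Table 5 (Gemini).
# Unsuffixed labels are problems that humans also reported.
human = {"1", "2", "3", "4", "5", "6", "7", "8", "9"}

chatgpt = {
    "1", "4", "5", "6", "7",
    "4b-ChatGPT", "6b-ChatGPT", "6c-ChatGPT", "7b-ChatGPT",
    "8b-ChatGPT", "10b-ChatGPT",
    "10a",  # assumed to be the same problem as Gemini's 10a row
}

gemini = {
    "4", "5", "7", "8",
    "5b-Gem", "7b-Gem", "9b-Gem", "9c-Gem",
    "10a",  # assumed to be the same problem as ChatGPT's 10a row
}

any_ai = chatgpt.union(gemini)

ai_only = any_ai.difference(human)        # reported by an AI, by no human
human_only = human.difference(any_ai)     # reported by humans, by no AI
shared_ai_only = chatgpt.intersection(gemini).difference(human)

print("AI-only problems:", len(ai_only))
print("Human-only problems:", sorted(human_only))
print("AI-only problems found by both models:", sorted(shared_ai_only))
print("New from ChatGPT:", len(chatgpt.difference(human)))
print("New from Gemini:", len(gemini.difference(human)))
```

<p>With those assumptions, the counts line up with the text: eleven AI-only problems, three human-only problems, and one AI-only problem shared by both models.</p>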
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>Building on our previous research into the reliability of AI usability problem discovery, we investigated the validity of AI evaluations by seeing whether AI and human evaluators agree on <em>which</em> problems they find. Using the same video, task, and prompt, four UX researchers and two LLMs (ChatGPT and Gemini, four runs each) independently produced problem lists. Our key findings:</p>
<p><strong>Humans had higher within-group reliability than the LLMs.</strong> Any-2 agreement among human evaluators was 66%, well above the 31% we previously reported for ChatGPT and somewhat above Gemini&#8217;s 57%.</p>
<p><strong>Agreement between humans and AI was low.</strong> The human-ChatGPT any-2 agreement was just 19%, the lowest we observed. Human-Gemini agreement was better at 36%, but still below the typical human baseline of 47%. Low agreement means AI and humans often flag different problems when watching the same video with known usability issues.</p>
<p><strong>AI identifies roughly half the problems humans find.</strong> ChatGPT identified five of the nine human-verified problems, and Gemini identified four. Of the four problems that were identified by all human evaluators, ChatGPT identified three and Gemini two. The nine problems were a vetted compilation from all four human evaluators. When we limited the comparison to individual evaluators, ChatGPT matched 60–71% of each researcher&#8217;s list, and Gemini matched 33–60%. The AIs didn’t find all the problems reported by humans, but depending on the evaluator(s), they can find more than half of them.</p>
<p><strong>AI generates more new problems than humans do.</strong> The two AIs together produced eleven problems that no human identified (at least for this one video), which is more than the entire human problem list of nine. ChatGPT contributed seven unique problems and Gemini five, with one shared between them. It&#8217;s not yet clear whether these represent real usability issues that trained researchers missed or are false positives (we’ll explore these possibilities in an upcoming article).</p>
<p><strong>AI-only problems create a new validation burden.</strong> Someone has to determine which AI-generated problems are real, and that means a human reviewing each one. If AI is being used to save time, the volume of unverified AI-generated problems may offset much of those savings. Whether the tradeoff is worth it likely depends on how many of those problems turn out to be real, again something we&#8217;ll examine in a follow-up.</p>
<p>In our next article, we shift from this quantitative comparison to a qualitative examination: using the video as ground truth, we ask whether the AI-only problems reflect events that actually happened or whether the LLMs hallucinated issues that never occurred.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How Reliable Is AI at Finding UI Problems?</title>
		<link>https://measuringu.com/ai-usability-problem-analysis-of-a-video/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ai-usability-problem-analysis-of-a-video</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD •&nbsp;Jeff Sauro, PhD •&nbsp;Will Schiavone, PhD&nbsp;•&nbsp;Lucas Plabst, PhD]]></dc:creator>
		<pubDate>Tue, 28 Apr 2026 22:04:59 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[Gemini]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47402</guid>

					<description><![CDATA[It looks like AI can “watch” videos. And if AI can watch videos, it can likely extract UI problems. That suggests it has the potential to support UX research. So maybe AI can “watch” a video and detect some problems. But if you run the same video through AI multiple times, do you get the [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47475 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-300x169.png" alt="Feature image showing two AI robots, each holding a clipboard with a UI problems list" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/04/042826-FeatureImage1-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>It looks like AI can <a href="https://measuringu.com/can-ai-detect-usability-problems">“watch” videos</a>. And if AI can watch videos, it can likely extract UI problems. That suggests it has the <a href="https://measuringu.com/ai-as-uxr-assistant-user-and-analyst/">potential to support UX research</a>.</p>
<p>So maybe AI can “watch” a video and detect some problems. But if you run the same video through AI multiple times, do you get the same results?</p>
<p>Reliability matters. If AI produces different results each time, it becomes untrustworthy, no matter how convincing its reasoning sounds.</p>
<p>Many variables can affect our assessment of reliability, including:</p>
<ul>
<li>AI models (Gemini, ChatGPT, Claude, Grok)</li>
<li>Versions (models are changing monthly)</li>
<li>LLM settings like <em>temperature</em>, which affect the randomness of the output</li>
<li>Prompts: What you ask (and even how many times you ask)</li>
</ul>
<p>That’s a lot to consider, but we have to start somewhere. So we did. In this article, we take a first step in assessing the reliability of AI problem detection. We examined how consistent two popular AI chatbots are at identifying usability problems from the same video.</p>
<h2><span lang="EN-US">Study Setup</span></h2>
<p>We had two LLMs, ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking, review the video and list the usability issues they discovered (four runs per LLM to check for within-LLM consistency; default settings only). Both are general-purpose LLMs for which MeasuringU has paid “pro” subscriptions (i.e., not free versions). Video 1 shows a 15-second clip of the full six-minute video.</p>
<div class="ast-oembed-container " style="height: 100%;"><iframe loading="lazy" title="MU_opentable_reservation first 15 sec" src="https://player.vimeo.com/video/1184232223?h=3252bb61a5&amp;dnt=1&amp;app_id=122963" width="1170" height="658" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe></div>
<p class="wp-caption-text" style="text-align: left;"><strong>Video 1: </strong>First 15 seconds of a participant searching for a sushi restaurant on the OpenTable website.</p>
<p>The task (visible at the bottom of the video) was to use OpenTable.com to:</p>
<blockquote><p>“Please think aloud. Make a reservation for four people at a sushi restaurant in Denver, CO tomorrow anytime after 5:00pm. Make sure the restaurant you select is not at the lowest or highest price point. Of the restaurants that fit these criteria, look at their overall rating, customer reviews, and photos to select the one that is the most appealing to you. Go as far as you can in the reservation process until you are asked for your personal information or account details. DO NOT fully confirm the reservation. Write down the restaurant name and the time of the reservation. You will be asked about this information after the task.”</p></blockquote>
<p>We used the following prompt:</p>
<blockquote><p>“During a usability test, the facilitator must keep track of participant behaviors as they navigate through tasks on a website, mobile app, software program, etc. We’d like you to watch a video of a usability test where participants were asked to book a table at a Sushi restaurant. As you&#8217;re watching, please look for problems the participant has while attempting to complete the task. For example, you can document the path users take, describe issues they encounter as well as what on the website might be causing problems. If you understand these instructions, let me know and I&#8217;ll drag the video in for you to review. Are you ready for the video?”</p></blockquote>
<p>The LLM response to this question was always some version of “Yes.”</p>
<p>In this study, we varied only the type of AI: ChatGPT and Gemini. The video, the prompt, and the LLM versions and settings were constant, but we plan to vary those variables in future studies.</p>
<h2><span lang="EN-US">Assessing Reliability</span></h2>
<p>If you ask, AI will deliver (something). For each run, we compiled a list of usability problems that the AI model “discovered.”</p>
<p>For example, a problem noticeable in the video (and on the current OpenTable website) is that when entering “Denver” in the search field, the previously selected cuisine (sushi) was removed, making for a clumsy filter and search experience.</p>
<p>To assess the reliability (consistency) of their problem discovery, we computed the <a href="https://measuringu.com/assessing-interrater-reliability-in-ux-research/">any-2 agreement</a> between ChatGPT and Gemini and within each model. We treated the models like evaluators.</p>
<p>Any-2 agreement is a UX context-specific version of the <a href="https://link.springer.com/article/10.1186/s12859-019-3118-5">Jaccard similarity coefficient (<em>J</em>)</a>: the size of the intersection of two problem sets divided by the size of their union. When there are more than two evaluators, the overall any-2 agreement is the average of the any-2 agreements for each pair of evaluators.</p>
<h3><span lang="EN-US">Computing Any-2 Agreement</span></h3>
<p>Imagine that two evaluators (Y and C) have independently created lists of usability issues: Y’s list has 14 issues, C’s list has 17, and the two lists have ten issues in common (Figure 1). Their any-2 agreement is the intersection (the ten issues they both discovered) divided by the union of both lists (14 + 17 − 10 = 21), which is 48% (10/21).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47477" src="https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-300x172.png" alt="Venn diagram of problem discovery by two evaluators. " width="436" height="250" srcset="https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-300x172.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-768x440.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1-600x344.png 600w, https://measuringu.com/wp-content/uploads/2026/04/042826-Figure1.png 909w" sizes="auto, (max-width: 436px) 100vw, 436px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1: </strong>Venn diagram of problem discovery by two evaluators.</p>
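<p>The calculation above can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the study; the function name <em>any2_agreement</em> and the numeric problem IDs are hypothetical, chosen so the counts match the Y and C example (14 issues, 17 issues, 10 shared).</p>

```python
def any2_agreement(list_a, list_b):
    """Any-2 agreement: shared problems divided by all distinct problems."""
    a, b = set(list_a), set(list_b)
    union = a.union(b)
    if not union:
        return 0.0  # assumption: two empty lists are scored as zero agreement
    return len(a.intersection(b)) / len(union)


# Hypothetical problem IDs that reproduce the counts in the example.
y_problems = set(range(1, 15))   # IDs 1 to 14: Y found 14 problems
c_problems = set(range(5, 22))   # IDs 5 to 21: C found 17 problems, 10 shared with Y

agreement = any2_agreement(y_problems, c_problems)
print(f"{agreement:.0%}")  # 48%
```

<p>Representing each problem list as a set keeps the intersection and union explicit, which mirrors the Venn diagram in Figure 1.</p>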
<p>Due to the <a href="https://measuringu.com/evaluator-effect/">well-documented evaluator effect</a>, we do not expect perfect agreement among UX researchers. In a controlled study like this (evaluators watching the same participants do the same tasks), our best estimate of typical any-2 agreement across multiple human evaluators (<a href="https://measuringu.com/examining-the-evaluator-effect-in-unmoderated-usability-testing/">based on 12 evaluations</a>) is 47%. (When studies are not controlled, the expected any-2 agreement is about 27%.)</p>
<p>This gives us a <strong>rough</strong> benchmark for assessing whether an any-2 agreement is typical (around 50%), relatively low (around 25%), or relatively high (around 75%).</p>
<h2><span lang="EN-US">Within-Group Results</span></h2>
<p>The first step in our analysis was to compute the mean any-2 agreement for each group of “evaluators” (ChatGPT, Gemini) to estimate the levels of within-group reliability.</p>
<h3><span lang="EN-US">ChatGPT Reliability Was Fair</span></h3>
<p>Table 1 shows the combined problem list for the four runs of ChatGPT. Table 2 shows the any-2 agreements for each pair of runs.</p>

<table id="tablepress-1029" class="tablepress tablepress-id-1029">
<thead>
<tr class="row-1">
	<th class="column-1">GPT #</th><th class="column-2">GPT Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-3">
	<td class="column-1">2</td><td class="column-2">Entering "Denver" in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-4">
	<td class="column-1">3</td><td class="column-2">Filters not helpful</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-5">
	<td class="column-1">4</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-6">
	<td class="column-1">5</td><td class="column-2">Surprised when typing sushi into search field did not lose current location</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">6</td><td class="column-2">Search results for sushi included many non-sushi restaurants</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-8">
	<td class="column-1">7</td><td class="column-2">Weak presentation of cuisine information in search results</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-9">
	<td class="column-1">8</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">9</td><td class="column-2">Sorting by highest rated put many non-sushi restaurants at the top of the list</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-11">
	<td class="column-1">10</td><td class="column-2">UI pushes browsing without good decision support</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-12">
	<td class="column-1">11</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-13">
	<td class="column-1">12</td><td class="column-2">Task not completed because participant did not reach reservation form</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1029 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> ChatGPT evaluations problem list.</p>

<table id="tablepress-1030" class="tablepress tablepress-id-1030">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Run 1</th><th class="column-3">Run 2</th><th class="column-4">Run 3</th><th class="column-5">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Run 1</strong></td><td class="column-2"> x</td><td class="column-3">30%</td><td class="column-4">63%</td><td class="column-5">38%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Run 2</strong></td><td class="column-2">30%</td><td class="column-3"> x</td><td class="column-4">33%</td><td class="column-5"> 0%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Run 3</strong></td><td class="column-2">63%</td><td class="column-3">33%</td><td class="column-4"> x</td><td class="column-5">25%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Run 4</strong></td><td class="column-2">38%</td><td class="column-3"> 0%</td><td class="column-4">25%</td><td class="column-5"> x</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1030 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2: </strong>Any-2 agreement for the ChatGPT evaluations.</p>
<p>With an <strong>overall any-2 agreement of 31%</strong>, the reliability of the ChatGPT evaluations was <strong>fair</strong>. None of the problems was identified on all four runs (5/12 were identified on three runs). Runs 2 and 4 had no problems in common.</p>
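<p>As a check on the arithmetic, the pairwise and overall values can be recomputed from the run columns of Table 1. The sketch below is illustrative rather than the analysis script used for the article; the dictionary <em>chatgpt_runs</em> simply transcribes which GPT # rows are marked in each run, and the helper name is hypothetical. Values are printed to one decimal place, whereas Table 2 rounds to whole percentages.</p>

```python
from itertools import combinations

# GPT # values from Table 1 that are marked in each ChatGPT run.
chatgpt_runs = {
    "Run 1": {1, 2, 6, 7, 8, 10, 11},
    "Run 2": {1, 2, 3, 5, 10, 12},
    "Run 3": {1, 2, 4, 7, 10, 11},
    "Run 4": {6, 7, 9, 11},
}


def any2_agreement(a, b):
    """Shared problems divided by all distinct problems for one pair of runs."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


# One agreement value for each of the six possible pairs of runs.
pairwise = {
    (r1, r2): any2_agreement(chatgpt_runs[r1], chatgpt_runs[r2])
    for r1, r2 in combinations(chatgpt_runs, 2)
}

# The overall any-2 agreement is the mean of the pairwise values.
overall = sum(pairwise.values()) / len(pairwise)

for (r1, r2), value in pairwise.items():
    print(f"{r1} vs {r2}: {value:.1%}")
print(f"Overall any-2 agreement: {overall:.1%}")
```

<p>Swapping in the run columns from another problem list works the same way, because the overall value is just the mean of the six pairwise agreements.</p>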
<h3><span lang="EN-US">Gemini Reliability Was Better</span></h3>
<p>Table 3 shows the combined problem list for the four runs of Gemini. Table 4 shows the any-2 agreements for each pair of runs.</p>

<table id="tablepress-1031" class="tablepress tablepress-id-1031">
<thead>
<tr class="row-1">
	<th class="column-1">Gem #</th><th class="column-2">Gemini Problem List</th><th class="column-3">Run 1</th><th class="column-4">Run 2</th><th class="column-5">Run 3</th><th class="column-6">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">1</td><td class="column-2">Entering "Denver" in search field lost previous selection of sushi as cuisine</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-3">
	<td class="column-1">2</td><td class="column-2">Scanning through 86 cuisines is effortful, then top that off by sushi not being in the list</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-4">
	<td class="column-1">3</td><td class="column-2">Participant used Ctrl-F to search page for "sushi"—not found</td><td class="column-3"><center>1</center></td><td class="column-4"><center>1</center></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-5">
	<td class="column-1">4</td><td class="column-2">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"><center>1</center></td>
</tr>
<tr class="row-6">
	<td class="column-1">5</td><td class="column-2">Participant chose highest price tier</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-7">
	<td class="column-1">6</td><td class="column-2">Participant wanted to change sort to lowest rating first but not an option</td><td class="column-3"></td><td class="column-4"></td><td class="column-5"><center>1</center></td><td class="column-6"></td>
</tr>
<tr class="row-8">
	<td class="column-1">7</td><td class="column-2">Seating options only presented after selecting time</td><td class="column-3"><center>1</center></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-9">
	<td class="column-1">8</td><td class="column-2">Set time to 5:10</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
<tr class="row-10">
	<td class="column-1">9</td><td class="column-2">Selected result labeled seafood instead of sushi</td><td class="column-3"></td><td class="column-4"><center>1</center></td><td class="column-5"></td><td class="column-6"></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1031 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 3: </strong>Gemini evaluations problem list.</p>

<table id="tablepress-1032" class="tablepress tablepress-id-1032">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Run 1</th><th class="column-3">Run 2</th><th class="column-4">Run 3</th><th class="column-5">Run 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>Run 1</strong></td><td class="column-2"> x</td><td class="column-3">38%</td><td class="column-4">67%</td><td class="column-5">80%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Run 2</strong></td><td class="column-2">38%</td><td class="column-3"> x</td><td class="column-4">38%</td><td class="column-5">43%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Run 3</strong></td><td class="column-2">67%</td><td class="column-3">38%</td><td class="column-4"> x</td><td class="column-5">80%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Run 4</strong></td><td class="column-2">80%</td><td class="column-3">43%</td><td class="column-4">80%</td><td class="column-5"> x</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1032 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 4: </strong>Any-2 agreement for the Gemini evaluations.</p>
<p>With an <strong>overall any-2 agreement of 57%</strong>, the reliability of the Gemini evaluations was <strong>good</strong> (3/9 problems identified in all four runs, 4/9 identified by at least three runs).</p>
<h2><span lang="EN-US">Between-Group Results</span></h2>
<p>The second step in our analysis was to compute the mean any-2 agreement across LLMs to estimate the between-group reliability, shown in Table 5.</p>

<table id="tablepress-1033" class="tablepress tablepress-id-1033">
<thead>
<tr class="row-1">
	<th class="column-1">Any-2</th><th class="column-2">Gem 1</th><th class="column-3">Gem 2</th><th class="column-4">Gem 3</th><th class="column-5">Gem 4</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1"><strong>GPT 1</strong></td><td class="column-2">40%</td><td class="column-3">40%</td><td class="column-4">33%</td><td class="column-5">40%</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>GPT 2</strong></td><td class="column-2">20%</td><td class="column-3">20%</td><td class="column-4">17%</td><td class="column-5">20%</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>GPT 3</strong></td><td class="column-2">40%</td><td class="column-3">75%</td><td class="column-4">33%</td><td class="column-5">40%</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>GPT 4</strong></td><td class="column-2"> 0%</td><td class="column-3">33%</td><td class="column-4"> 0%</td><td class="column-5"> 0%</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1033 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 5: </strong>Any-2 agreement between ChatGPT and Gemini evaluations.</p>
<p>With an <strong>overall any-2 agreement of 28%</strong>, the between-AI reliability was <strong>low</strong> (closer to 25% than to 50%).</p>
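<p>The overall between-group value is the mean of the 16 cross-model cells in Table 5. A minimal sketch of that arithmetic, using the rounded percentages exactly as published (the variable names are hypothetical):</p>

```python
# Any-2 agreements (percent) copied from Table 5.
# Rows are ChatGPT runs 1 to 4; columns are Gemini runs 1 to 4.
table5 = [
    [40, 40, 33, 40],
    [20, 20, 17, 20],
    [40, 75, 33, 40],
    [0, 33, 0, 0],
]

# Flatten the 4 x 4 grid and average all 16 cross-model pairs.
cells = [value for row in table5 for value in row]
between_group = sum(cells) / len(cells)
print(f"{between_group:.0f}%")  # 28%
```

<p>Unlike the within-group tables, every cell here is a distinct ChatGPT–Gemini pair, so all 16 values enter the average.</p>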
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>Along with the rest of the UX researcher community, we have a strong interest in the roles that AI might play in facilitating our work. Watching participants attempt to complete tasks is a fundamental but labor-intensive UX research activity, so any relief AI assistance might offer would be welcome.</p>
<p>As a first step in investigating how well ChatGPT-5.4 Thinking and Gemini 3 Flash Thinking can find usability problems in videos, we collected evaluations of a single video (summarized as lists of usability problems), performing four runs with each LLM.</p>
<p>In this article, we evaluated any-2 agreement within each group of evaluations (ChatGPT, Gemini) and between the AIs. Our key findings were:</p>
<p><strong>Gemini had good reliability, and ChatGPT’s was fair. </strong>The average any-2 agreement for ChatGPT was 31%. We expect this level of reliability when comparing different evaluators, <a href="https://www.dialogdesign.dk/cue-studies/">different methods</a>, or different users. It’s certainly lower than you’d want, but still at a level considered acceptable in our industry.</p>
<p>For Gemini, the average any-2 agreement was good at 57%. From the literature and our own research with human evaluators, 57% is above the mean of 47% and on the higher side of acceptability.</p>
<p><strong>Between-group reliability for Gemini and ChatGPT was low. </strong>The any-2 agreement between ChatGPT and Gemini was low at 28%. That’s about 20 points below the average of 47% for different human evaluators examining the same video. This result is not great.</p>
<p><strong>Reliability isn’t accuracy. </strong>Are the problems identified by the LLMs as relevant as those discovered by a human evaluator? This question hasn’t been answered yet (a future analysis will). But to have accuracy (validity), we need to establish consistent (reliable) results, and at least for this video and prompt, the Gemini performance was sufficiently reliable.</p>
<p><strong>Humans vs. AI coming soon. </strong>We’re just getting started with our analyses. In an upcoming article, we’ll compare any-2 agreement between these LLMs and a problem list generated by professional human UX researchers. Stay tuned.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Can AI Detect Usability Problems?</title>
		<link>https://measuringu.com/can-ai-detect-usability-problems/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=can-ai-detect-usability-problems</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD •&nbsp;Lucas Plabst, PhD •&nbsp;Jim Lewis, PhD&nbsp;•&nbsp;Will Schiavone, PhD]]></dc:creator>
		<pubDate>Wed, 22 Apr 2026 03:39:35 +0000</pubDate>
				<category><![CDATA[Usability]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[ChatGPT]]></category>
		<category><![CDATA[error]]></category>
		<category><![CDATA[Problem Discovery]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47333</guid>

					<description><![CDATA[You may have become numb to the overhyped headlines about AI. But it’d be wrong to dismiss the impact AI can have on our industry, not only because of job displacement, but also because it can help us do our jobs more effectively (hopefully). To separate the hype from the hysteria, we at MeasuringU think about AI’s impact [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-scaled.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47352 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-300x169.png" alt="Feature image showing an AI robot observing the user flow to detect usability issues" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-2048x1152.png 2048w, https://measuringu.com/wp-content/uploads/2026/04/042126-FeatureImage1-600x338.png 600w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>You may have become numb to the overhyped headlines about AI.</p>
<p>But it’d be wrong to dismiss the impact AI can have on our industry, not only because of job displacement, but also because it can help us do our jobs more effectively (hopefully).</p>
<p>To separate the hype from the hysteria, we at MeasuringU think about AI’s impact in UX research in <a href="https://measuringu.com/ai-as-uxr-assistant-user-and-analyst/">three ways</a>: AI as Research Assistant, AI as (Synthetic) User, and AI as Researcher.</p>
<p>One of the more valuable activities we do in UX research as a researcher and assistant is to find (and recommend fixes for) usability problems in an interface.</p>
<p>Finding problems typically comes from researchers observing people interacting with a product, either live in a lab (like ours at MeasuringU, Figure 1), remotely using tools like <a href="https://measuringu.com/muiq/">MUiQ<sup>®</sup></a>, or by reviewing recordings of moderated or unmoderated sessions.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042126-F1.jpg"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47336" src="https://measuringu.com/wp-content/uploads/2026/04/042126-F1.jpg" alt="AI-generated image of a robot observing a usability test session." width="602" height="602" srcset="https://measuringu.com/wp-content/uploads/2026/04/042126-F1.jpg 602w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-300x300.jpg 300w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-150x150.jpg 150w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-70x70.jpg 70w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-600x600.jpg 600w, https://measuringu.com/wp-content/uploads/2026/04/042126-F1-100x100.jpg 100w" sizes="auto, (max-width: 602px) 100vw, 602px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> AI-generated image of a robot observing a usability test session.</p>
<p>Watching lots of videos can be time-consuming and tedious. Usability problems often aren’t obvious defects in an interface. They can come from perfectly compiled code and a beautiful interface. Due to the <a href="https://measuringu.com/assessing-interrater-reliability-in-ux-research/">well-known evaluator effect</a>, even trained professionals do not detect usability problems with perfect consistency.</p>
<p>Can AI really uncover usability problems like UX researchers?</p>
<p>A few years ago, this sort of question seemed like science fiction. Now it’s not <em>if</em> but <em>how well</em> AI can watch videos like a researcher or research assistant. Can human researchers use it and trust the results?</p>
<h2><span lang="EN-US">How Does AI “Watch” a Video?</span></h2>
<p>Here’s a 15-second clip of a participant starting the task of finding a sushi restaurant on the OpenTable website. It’s part of a six-minute video of an unmoderated usability test conducted on the MUiQ platform.</p>
<div class="ast-oembed-container " style="height: 100%;"><iframe loading="lazy" title="MU_opentable_reservation first 15 sec" src="https://player.vimeo.com/video/1184232223?h=3252bb61a5&amp;dnt=1&amp;app_id=122963" width="1170" height="658" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin"></iframe></div>
<p class="wp-caption-text" style="text-align: left;"><strong>Video 1: </strong>First 15 seconds of the OpenTable.com video.</p>
<p>We asked Gemini 3.1 Pro to “Please tell me in extreme detail what you see in the video from start to finish going second by second.” Here&#8217;s how it responded:</p>
<ul>
<li><strong>00:00–00:04:</strong> The video begins on the OpenTable.com homepage. The top navigation bar shows options for &#8220;For Business,&#8221; &#8220;Mobile,&#8221; &#8220;Help,&#8221; language selection (&#8220;EN&#8221;), and buttons for &#8220;Sign up&#8221; and &#8220;Sign in.&#8221; The main search area displays default parameters: a date of &#8220;Jul 19, 2022,&#8221; a time of &#8220;7:00 PM,&#8221; and a party size of &#8220;2 people.&#8221; The search bar says &#8220;Location, Restaurant, or Cuisine&#8221; with a red &#8220;Let&#8217;s go&#8221; button. Below this is a message reading, &#8220;It looks like you&#8217;re in Dallas Suburbs. Not correct? Get current location.&#8221; A section titled &#8220;Book for dinner tonight&#8221; shows several restaurants, including Fogo de Chao, Lombardi Cucina Italiana, Seasons 52, and Perry&#8217;s Steakhouse. At the bottom of the screen, a task description box outlines the user&#8217;s objective: booking a sushi restaurant in Denver, CO for four people tomorrow after 5:00 PM. The user begins speaking, stating they are looking for a sushi restaurant.</li>
<li><strong>00:05–00:09:</strong> The user&#8217;s mouse cursor idles on the left side of the screen. The user continues speaking, noting they need a reservation for tomorrow anytime after 5:00 PM. They observe that today is the 19th, so they need to book for the 20th.</li>
<li><strong>00:10–00:11:</strong> The cursor moves to the date selector and clicks on &#8220;Jul 19, 2022.&#8221; A calendar dropdown appears. The user clicks on the number &#8220;20&#8221; to change the date to Jul 20, 2022.</li>
<li><strong>00:12–00:15:</strong> The user moves the cursor to the time selector, currently set at &#8220;7:00 PM,&#8221; and clicks it. A dropdown menu of times in 30-minute increments appears. The user scrolls up and clicks on &#8220;5:00 PM.&#8221;</li>
</ul>
<p>Even though the summary of the onscreen text describing the user task does not include all task requirements (which we did not specifically prompt for), this is a reasonably accurate description of what’s on the screen and the events.</p>
<p>Gemini didn’t actually watch the video the way we do. It doesn’t experience a continuous stream of motion or notice subtle hesitation in real time. Instead, it samples only a few frames per second of video (often just one or two of the 30 frames recorded each second).</p>
<p>Each frame is then broken into smaller regions and converted into numerical representations that the model can process. If there’s audio, spoken words are transcribed into text and aligned with those frames. By the time the model begins “analyzing,” the video has already been reduced to a combination of image fragments and text.</p>
<p>From there, the model treats those inputs similarly to how it processes language. The visual and textual information is converted into tokens (small chunks of data) and passed through a neural network that looks for patterns (things such as interface elements, changes in screens, or sequences of actions).</p>
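<p>To get a feel for how much of a video is discarded by this kind of sampling, here is a rough sketch. It assumes a 30 frames-per-second recording sampled at one frame per second, in line with the figures mentioned above; actual sampling rates vary by model and are not documented here, and the function name is hypothetical.</p>

```python
def sampled_frame_indices(duration_s, native_fps=30, sampled_fps=1):
    """Return the indices of the frames kept when sampling evenly over time."""
    step = native_fps // sampled_fps        # keep every step-th frame
    total_frames = duration_s * native_fps
    return list(range(0, total_frames, step))


duration_s = 6 * 60                  # the six-minute session video
total = duration_s * 30              # frames in the recording at 30 fps
kept = sampled_frame_indices(duration_s)

print(len(kept), "of", total, "frames kept")   # 360 of 10800 frames kept
print(f"{len(kept) / total:.1%} of frames")    # 3.3% of frames
```

<p>Under these assumptions, anything that happens entirely between two sampled frames (a brief hover, a flicker of a menu) never reaches the model.</p>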
<h2>Autocorrect for Video Watching</h2>
<p>Because AI is working from snapshots rather than continuous playback, it doesn’t directly see motion. Instead, it infers what likely happened between frames (for example, that a user scrolled, tapped, or navigated to a new page). This makes the process efficient, but it also means short or subtle behaviors can be missed.</p>
<p>Based on the sampled frames and any accompanying text, it generates the most likely description of what happened, much like how it predicts the next word in a sentence. Basically, it’s like autocorrect on steroids for videos.</p>
<p>That’s why the output can sound surprisingly natural and insightful, even when it’s not entirely accurate. It’s less like a researcher watching a session and more like a system generating a plausible narrative from partial information.</p>
<h3>Losing Frames</h3>
<p>As long as there’s been autocorrect, there have been, well, mistakes (often <a href="https://www.huffpost.com/entry/funniest-autocorrect-faiils-2014_n_6391880">hilarious ones</a>). The sampling that makes AI fast also makes it “<a href="https://cs.stanford.edu/people/eroberts/courses/soco/projects/data-compression/lossy/index.htm">lossy</a>.” By looking at only a fraction of the frames, the model can miss brief moments of hesitation, confusion, or micro-interactions that are often critical in usability analysis. What’s efficient for processing might not always be sufficient for insight.</p>
<h3>Probabilistic Output</h3>
<p>But unlike autocorrect, which works the same each time it’s presented with a partial word, AI outputs aren’t always the same. They’re probabilistic rather than deterministic. Even with the same video and the same prompt, the model may generate slightly different descriptions each time. That’s because it’s not retrieving a fixed answer but generating the most likely sequence of words from a range of possibilities. The results can be consistent in general themes, but not identical in wording or even emphasis. And with current systems, there is always the possibility of <a href="https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)">hallucination</a>. For researchers, these concerns mean that AI outputs should be treated less like definitive observations and more like plausible interpretations that still need validation.</p>
<h3>Temperature</h3>
<p>Part of this variability comes from a setting called <em>temperature</em>, which controls how much randomness the model uses when generating responses. Temperature typically ranges from 0 (close to deterministic) to around 2 (much more variable). Most models use a middle setting by default, which balances consistency and variation. Higher temperatures can surface a wider range of interpretations (sometimes useful for exploratory analysis), while lower temperatures produce more consistent outputs—but even then, results aren’t perfectly repeatable.</p>
<p>Figure 2 illustrates this process.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/042126-F2.png"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47337" src="https://measuringu.com/wp-content/uploads/2026/04/042126-F2.png" alt="Visual overview of how an AI “watches” a video. " width="1182" height="788" srcset="https://measuringu.com/wp-content/uploads/2026/04/042126-F2.png 1182w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-300x200.png 300w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-1024x683.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-768x512.png 768w, https://measuringu.com/wp-content/uploads/2026/04/042126-F2-600x400.png 600w" sizes="auto, (max-width: 1182px) 100vw, 1182px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2: </strong>Visual overview of how an AI “watches” a video.</p>
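<p>The effect of temperature can be illustrated with a toy sampler. This is a simplified sketch of temperature-scaled sampling in general, not a description of how any particular chatbot is implemented; the candidate descriptions and their scores are invented for illustration, and a temperature of 0 is treated here as always picking the top-scoring option (real systems can still vary at low temperatures).</p>

```python
import math
import random


def sample_with_temperature(scores, temperature, rng):
    """Pick one option from a dict of raw scores, scaled by temperature."""
    if temperature == 0:
        # Simplification: treat 0 as greedy selection of the highest score.
        return max(scores, key=scores.get)
    scaled = {k: v / temperature for k, v in scores.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - top) for k, v in scaled.items()}
    options = list(weights)
    return rng.choices(options, weights=[weights[k] for k in options], k=1)[0]


# Hypothetical scores for three candidate descriptions of the same moment.
scores = {"user scrolled": 2.0, "user hesitated": 1.5, "user clicked filter": 0.5}

rng = random.Random(42)  # fixed seed so the demonstration is repeatable
for temperature in (0, 0.7, 2.0):
    picks = [sample_with_temperature(scores, temperature, rng) for _ in range(1000)]
    print(temperature, {k: picks.count(k) for k in scores})
```

<p>Dividing the scores by a larger temperature flattens the differences between options, so lower-scoring descriptions are chosen more often; a smaller temperature concentrates the choices on the top option.</p>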
<p>This gives you an idea about how AI reviews videos for usability problems. But what does it look like when you ask an AI to perform a usability evaluation of a video?</p>
<h2><span lang="EN-US">Problem List from ChatGPT</span></h2>
<p>We uploaded the full six-minute video of a person attempting to find a sushi restaurant in Denver on the OpenTable website into ChatGPT (model 5.4 Thinking). We selected the video because it has several known usability problems that humans consistently detect. We prompted ChatGPT with:</p>
<blockquote><p>“During a usability test, the facilitator must keep track of participant behaviors as they navigate through tasks on a website, mobile app, software program, etc. We’d like you to watch a video of a usability test where participants were asked to book a table at a Sushi restaurant. As you&#8217;re watching, please look for problems the participant has while attempting to complete the task. For example, you can document the path users take, describe issues they encounter as well as what on the website might be causing problems. If you understand these instructions, let me know and I&#8217;ll drag the video in for you to review. Are you ready for the video?”</p></blockquote>
<p>ChatGPT indicated “yes,” then took only three minutes to process the video (half the video’s six-minute runtime, because it sampled only a fraction of the frames to piece together its visual autocorrect narrative).</p>
<p>From its output, we derived a list of seven usability problems (Table 1).</p>

<table id="tablepress-1028" class="tablepress tablepress-id-1028">
<thead>
<tr class="row-1">
	<th class="column-1">Problem List</th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">Complex search field with placeholder text and unexpected behaviors significantly delayed user who selected sushi from search bar dropdown but for Dallas (default) instead of Denver</td>
</tr>
<tr class="row-3">
	<td class="column-1">Entering Denver in search field lost previous selection of sushi as cuisine</td>
</tr>
<tr class="row-4">
	<td class="column-1">Search results for sushi included many non-sushi restaurants</td>
</tr>
<tr class="row-5">
	<td class="column-1">Weak presentation of cuisine information in search results</td>
</tr>
<tr class="row-6">
	<td class="column-1">Participant seemed to miss price point filter—sorted on ratings and examined price points in descriptions</td>
</tr>
<tr class="row-7">
	<td class="column-1">UI pushes browsing without good decision support</td>
</tr>
<tr class="row-8">
	<td class="column-1">Selected restaurant was categorized as Seafood instead of Sushi, so participant failed the task</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1028 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> List of problems “discovered” by ChatGPT review of usability test video.</p>
<h2><span lang="EN-US">Looks Good, However …</span></h2>
<p>On the surface, that looks pretty good. It’s plausible, specific, and aligned with what a researcher might note. But it leaves us with a few questions:</p>
<ul>
<li>How many of these are <em>actual</em> usability problems versus plausible-sounding interpretations (autocorrect) or hallucinations?</li>
<li>How consistent are the results across multiple runs (reliability)?</li>
<li>How closely do these match what human UX researchers would identify (validity)?</li>
</ul>
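<p>One way the consistency question could be quantified is to compare the problem lists from repeated runs and compute their average overlap. The Python sketch below uses the Jaccard index (problems found in both runs divided by problems found in either run). The problem IDs and the three runs are hypothetical, and Jaccard overlap is only one possible agreement metric, not necessarily the one a full analysis would use.</p>

```python
from itertools import combinations


def jaccard(a, b):
    """Problems found in both runs divided by problems found in either run."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # two empty lists agree trivially
    return len(a & b) / len(union)


def mean_pairwise_agreement(runs):
    """Average the Jaccard overlap across every pair of runs."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical problem IDs from three runs of the same prompt and video.
runs = [
    {"P1", "P2", "P3", "P4"},
    {"P1", "P2", "P5"},
    {"P1", "P3", "P4", "P6"},
]

agreement = mean_pairwise_agreement(runs)
print(round(agreement, 2))  # 0.39
```

<p>In this made-up example, only one problem appears in all three runs, and the average overlap between any two runs is about 39%.</p>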
<p>We’ll explore these important questions in future articles.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>A Review of Experiments with Synthetic Users</title>
		<link>https://measuringu.com/review-of-experiments-with-synthetic-users/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=review-of-experiments-with-synthetic-users</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Wed, 15 Apr 2026 05:22:48 +0000</pubDate>
				<category><![CDATA[UX]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47300</guid>

					<description><![CDATA[One of the hardest parts of conducting user and market research is recruiting participants. It takes time, costs money, and on top of that, there are no-shows and fraudsters. Now imagine being able to conduct UX research without the hassle of recruiting the “U.” Enter the idea of AI-generated synthetic users that offer the promise [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47329 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-300x169.png" alt="Feature image showing two researchers examining an AI synthetic user robot" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/04/041426-FeatureImage.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>One of the hardest parts of conducting user and market research is recruiting participants. It takes time, costs money, and on top of that, there are <a href="https://measuringu.com/typical-no-show-rate-for-moderated-studies/">no-shows</a> and fraudsters.</p>
<p>Now imagine being able to conduct UX research without the hassle of recruiting the “U.” Enter the idea of AI-generated synthetic users that offer the promise of participant input being:</p>
<ul>
<li>Simpler (no need to deal with humans)</li>
<li>Less costly (no panel/respondent fees)</li>
<li>Faster (finish in hours or days instead of weeks or months)</li>
<li>Scalable (get data from thousands of synthetic users instead of a relative handful of participants)</li>
<li>Broader in reach (access to user groups that are hard to find or very expensive to recruit)</li>
<li>More secure (no need for nondisclosure agreements and no risk of human disclosure)</li>
</ul>
<p>At least, that’s the dream of research with synthetic users.</p>
<p>Others view synthetic users as more of a nightmare. They are concerned that research with synthetic users can lead to:</p>
<ul>
<li>Plausible-looking data that’s just wrong</li>
<li>Shallow qualitative responses because synthetic users have no real lived experience</li>
<li>Reinforced biases driven by large language models (LLM)</li>
<li>Artificially low variability (quantitative or qualitative)</li>
</ul>
<p>We’ve seen these conflicting attitudes about synthetic users play out in online posts and conversations over the past few years, most recently with the promotion of proprietary models of synthetic users by companies like Qualtrics and Aaru, followed by criticism of that promotion by influential UX researchers.</p>
<h2>Pro-Synthetic Voices</h2>
<p>Qualtrics is the dominant (and <a href="https://www.pymnts.com/acquisitions/2026/wall-street-snubs-qualtrics-debt-over-ai-disruption-risks/">debt-loaded</a>) survey platform that’s made a big bet on synthetic users. Their synthetic dataset was trained on millions of survey responses, and they reported that it can <a href="https://www.greenbook.org/insights/data-science/testing-synthetic-data-against-academic-benchmarks-a-replication-study">realistically mimic human survey patterns</a>. The variability and correlations mirror human response patterns better than data from general LLMs, at least for the attitudinal survey questions they used.</p>
<p>Aaru is another synthetic user simulation platform that has gotten attention. The global consulting firm EY used Aaru’s proprietary multi-agent simulation to replicate a 3,600-person global wealth survey. They <a href="https://www.ey.com/en_us/insights/wealth-asset-management/how-ai-simulation-accelerates-growth-in-wealth-and-asset-management">reported strong agreement across multiple statistical metrics</a> (high correlation, modest error), suggesting that synthetic data approximated real survey results at scale (done in one day versus six months!).</p>
<h2>Anti-Synthetic Voices</h2>
<p>First from the anti-synthetic camp is Chris Chapman, a longtime quantitative UX researcher (Amazon, Google, Microsoft) and co-chair of the Quant UX Conference. His most recent presentation clearly elaborates that <a href="https://quantuxblog.com/synthetic-survey-data-its-not-data">synthetic users are not users</a>. His blunt conclusion is that <strong>synthetic data has no place in survey research</strong>.</p>
<p>Another voice is John Mecke, a SaaS and product strategy writer who argues that <a href="https://developmentcorporate.com/saas/synthetic-responses-market-research-2025/">synthetic users face five core limitations</a>: no lived experience, misleading “too-accurate” results, cultural bias, weak statistical reliability, and narrower real-world usefulness than claimed.</p>
<p>And there’s also Constantine Papas, a UX research strategist and writer of <em>The Voice of User</em>. Papas argues that <a href="https://www.thevoiceofuser.com/aaa-billion-dollar-ai-startup-is-selling-you-a-survey-the-wall-street-journal-wrote-a-love-letter-about-it/">synthetic research is being oversold</a>, largely through cherry-picked favorable results from financially interested parties. When describing the EY study that used Aaru’s algorithms, he argues that the correlations arise largely because the LLMs were already trained on this data. That’s hardly predicting.</p>
<p>Finally, a recent preprint of a comprehensive literature review of 182 papers also casts strong doubt on the ability of synthetic users to do more than mimic already collected data. We recommend reading the <a href="https://www.researchgate.net/publication/401777396_Synthetic_Participants_Generated_by_Large_Language_Models_A_Systematic_Literature_Review">preprint</a> (not yet peer reviewed) and a <a href="https://www.thevoiceofuser.com/the-largest-review-of-synthetic-participants-ever-conducted-found-exactly-what-youd-expect-synthetic-users-dont-work/">discussion of the research</a> by Papas.</p>
<p>As interesting as these online conversations are, they have not been formally reviewed for scientific quality. In this article, we briefly review 12 <strong>peer-reviewed research papers</strong> on the use of synthetic users in UX and UX-adjacent research. For full details on experimental designs and results (e.g., experimental comparisons, models, prompting, settings, metrics), see the links to the papers and articles in the appendix.</p>
<h2>Our Inclusion Criteria for Papers on Synthetic Users</h2>
<p>We searched the literature for peer-reviewed research that had been published no earlier than 2023 and used LLMs no earlier than GPT-3.5. This turned up 12 papers that can be broadly categorized as attempts to replicate:</p>
<ul>
<li>Psychological experiments (five papers)</li>
<li>Survey results (three papers)</li>
<li>Social research (three papers)</li>
<li>UX research (one paper)</li>
</ul>
<p>We’ll now review the evidence in each of these four categories.</p>
<h3>Psychological Experiments: Sometimes Human-Like, Often Inconsistent</h3>
<p>The idea that digital data can replicate people predates LLMs. From the mid-1990s through the 2000s, a popular research program in social psychology was the &#8220;<a href="https://en.wikipedia.org/wiki/Computers_are_social_actors">computers are social actors</a>&#8221; paradigm, recreating classical psychological experiments in which one of the human participants was replaced by a computer to investigate how this affected human behavior.</p>
<p>Several researchers have adapted this approach to one in which there are <em>no</em> human participants, exploring the extent to which LLMs mimic humans in psychological experiments.</p>
<p>If synthetic users act like humans in experiments, maybe we can use them instead of humans in some studies. But why would anyone think this would be possible? Well, because modern LLMs are trained on huge amounts of human-generated content, <a href="https://www.sciencedirect.com/science/article/pii/S2590198226000825">the models may include latent social information</a>. Depending on the quality of this latent information, with appropriate prompts, they might produce human-like outputs.</p>
<p>The results from these experiments were <strong>mixed</strong>, with the following key findings from the five papers:</p>
<ul>
<li>Using GPT-3.5, Dillion et al. (<a href="https://static1.squarespace.com/static/671011231a30d401349ce94c/t/67c7999a4348db3e090ea128/1741134235231/can-AI-language-models-replace-human-participants.pdf">2023</a>) found significant correlation (<em>r</em> = .95) between synthetic and human moral judgments (encouraging), but there were many points with large differences between human and synthetic mean ratings (discouraging).</li>
<li>Goli and Singh (<a href="https://pubsonline.informs.org/doi/10.1287/mksc.2023.0306">2024</a>) used GPT-3.5 and GPT-4 in a replication of experiments in which synthetic users were presented with a choice between a certain number of tokens in a month versus waiting for a larger number of tokens later. GPT-3.5 ignored differences in reward amounts (discouraging), while GPT-4 had some sensitivity to the differences (encouraging), but its discount rates were larger than those observed with humans (discouraging).</li>
<li>Attempting to replicate 14 classic social science studies using GPT-3.5, Park et al. (<a href="https://link.springer.com/article/10.3758/s13428-023-02307-x">2024</a>) reported that six had unanalyzable data (too little variability), five failed replication, and three were successfully replicated. So, 21% of the attempts were successful (encouraging), but 79% were unsuccessful (discouraging).</li>
<li>Using GPT-4, de Winter et al. (<a href="https://www.sciencedirect.com/science/article/pii/S0191886924001892">2024</a>) created 2000 text-based personas that completed a short form of the <a href="https://www.sciencedirect.com/topics/social-sciences/big-five">Big Five Inventory</a>. The synthetic data matched the expected factor structure and had high correlation with human data (encouraging) but significant deviation from the humans’ item means (discouraging).</li>
<li>Almeida et al. (<a href="https://arxiv.org/pdf/2308.01264">2024</a>) replicated eight psychology studies of legal and moral reasoning using Gemini Pro (1.0), Claude 2.1, GPT-4, and Llama 2 Chat 70b. They found differing levels of alignment with human responses, with the closest match for GPT-4. “Nonetheless, even when LLM-generated responses are highly correlated to human responses [encouraging], there are still systematic differences, with a tendency for models to exaggerate effects that are present among humans, in part by reducing variance” (discouraging).</li>
</ul>
<h3>Surveys: Match on Averages, Fail on Details</h3>
<p>Even if synthetic users are inconsistent in how they react to classical psychology experiments, they might be able to match human response patterns in surveys. However:</p>
<ul>
<li>Bisbee et al. (<a href="https://www.cambridge.org/core/journals/political-analysis/article/synthetic-replacements-for-human-survey-data-the-perils-of-large-language-models/B92267DC26195C7F36E63EA04A47D2FE">2024</a>) used GPT-3.5 Turbo (with some replication by GPT-4 and Falcon-40B-Instruct) to reproduce the 2016–2020 American National Election Studies (ANES). They encountered numerous statistical issues with synthetic respondents somewhat matching high-level means (encouraging) but having inaccurate subgroup means, small standard deviations, inaccurate regression coefficients, and failure to meet even basic requirements for replication (discouraging).</li>
<li>Using GPT-4, Shrestha et al. (<a href="https://journals.sagepub.com/doi/10.1177/23794607241311793">2024</a>) compared synthetic and human responses to 43 policy questions on topics like climate, spending, and labor in the U.S., Saudi Arabia, and the UAE. The means for the 43 questions indicated that the responses of human and synthetic participants were reasonably aligned (encouraging) but not precisely the same, with about 70% significantly different (discouraging).</li>
<li>Tjuatja et al. (<a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00685/124261/Do-LLMs-Exhibit-Human-like-Response-Biases-A-Case">2024</a>) used variants of Llama, ChatGPT-3.5 Turbo, and Turbo Instruct to investigate whether synthetic responses to different item formats matched expected human response behavior biases. “Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior” (discouraging).</li>
</ul>
<h3>Social Research: Trends Match Humans, but the Details Don’t</h3>
<p>The goal of studies in social research is similar to psychological experimentation, though with more focus on interpersonal behaviors and attitudes.</p>
<ul>
<li>In experiments with GPT-4 and Llama3, Yu et al. (<a href="https://www.sciencedirect.com/science/article/pii/S2949882125001173">2025</a>) compared synthetic user and human responses to standardized psychological questionnaires measuring empathy. The expected factor structure of the questionnaires was produced by GPT-4 (encouraging), but the magnitudes of the synthetic scores did not match those of humans (discouraging). Responses from Llama3 synthetic users did not match the expected factor structure (discouraging).</li>
<li>Wang et al. (<a href="https://arxiv.org/pdf/2402.01908">2025</a>) showed that the LLMs they investigated (Llama-2-Chat 7B, Wizard Vicuna Uncensored 7B, GPT-3.5 Turbo, GPT-4) may not be able to distinguish between text written about groups of people by others and text written by members of those groups. This inherent bias makes them unsuitable for creating synthetic users to replace actual users in social research (discouraging).</li>
<li>Rafikova and Voronin (<a href="https://ideas.repec.org/a/spr/jcsosc/v9y2026i1d10.1007_s42001-025-00452-1.html">2026</a>) used GPT-4 to investigate synthetic responses to complex social issues (e.g., immigration, gender stereotypes). Synthetic users matched the direction and magnitude of human attitudinal trends (encouraging) but had weak correspondence with deeper models of attitudinal variance (discouraging).</li>
</ul>
<h3>UX Interviews: Convincing at First, Limited on Follow-Up</h3>
<p>We didn’t turn up studies directly related to quantitative UX research (although that is informed by psychological, survey, and social research). We did, however, find one related to researcher experiences interviewing humans and synthetic users.</p>
<ul>
<li>Kapania et al. (<a href="https://dl.acm.org/doi/full/10.1145/3706598.3713220">2025</a>) had 19 UX researchers use GPT-4-Turbo to recreate one of their recent projects that had originally been conducted with human participants. &#8220;Initially skeptical, researchers were surprised to see similar narratives emerge in the LLM-generated data when using the interview probe. However, over several conversational turns, they went on to identify fundamental limitations, such as how LLMs foreclose participants’ consent and agency, produce responses lacking in palpability and contextual depth, and risk delegitimizing qualitative research methods&#8221; (discouraging).</li>
</ul>
<h2>Summary and Discussion</h2>
<p>We reviewed 12 papers describing recent research comparing synthetic users and humans in four contexts of interest to UX researchers. In our summaries, we tagged 9 findings as encouraging and 14 as discouraging. So, the results aren’t universally bad, but they definitely aren’t great. We summarized those in Table 1.</p>

<table id="tablepress-1027" class="tablepress tablepress-id-1027">
<thead>
<tr class="row-1">
	<th class="column-1">Theme</th><th class="column-2">Encouraging Findings</th><th class="column-3">Discouraging Findings</th>
</tr>
</thead>
<tbody>
<tr class="row-2">
	<td class="column-1"><strong>Matched means/percents</strong></td><td class="column-2"><strong>4</strong> (B, P, R, S)</td><td class="column-3"><strong>7</strong> (A, D, G, P, S, W, Y)</td>
</tr>
<tr class="row-3">
	<td class="column-1"><strong>Correlated</strong></td><td class="column-2"><strong>4</strong> (A, D, G, W)</td><td class="column-3"><strong>1</strong> (R)</td>
</tr>
<tr class="row-4">
	<td class="column-1"><strong>Matched expected variance</strong></td><td class="column-2"><strong>0</strong></td><td class="column-3"><strong>3</strong> (A, B, P)</td>
</tr>
<tr class="row-5">
	<td class="column-1"><strong>Matched factor structure</strong></td><td class="column-2"><strong>2</strong> (W, Y)</td><td class="column-3"><strong>1</strong> (Y)</td>
</tr>
<tr class="row-6">
	<td class="column-1"><strong>Matched expected replication</strong></td><td class="column-2"><strong>1</strong> (P)</td><td class="column-3"><strong>2</strong> (B, P)</td>
</tr>
<tr class="row-7">
	<td class="column-1"><strong>Good qualitative depth</strong></td><td class="column-2"><strong>0</strong></td><td class="column-3"><strong>3</strong> (K, T, Wa)</td>
</tr>
<tr class="row-8">
	<td class="column-1"><strong>Unbiased/representative</strong></td><td class="column-2"><strong>0</strong></td><td class="column-3"><strong>1</strong> (Wa)</td>
</tr>
<tr class="row-9">
	<td class="column-1"><strong>Matched regression weights</strong></td><td class="column-2"><strong>0</strong></td><td class="column-3"><strong>1</strong> (B)</td>
</tr>
</tbody>
</table>
<!-- #tablepress-1027 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Summary of the number of encouraging and discouraging findings. The letters in parentheses indicate the sources for the findings. Letters are the first letter of the last name of the lead author; W = de Winter, Wa = Wang. Some studies produced both encouraging and discouraging findings in the same themes (e.g., Yu found both matching and mismatching factor structures), and some findings matched multiple themes.</p>
<h3>Correlation Does Not Mean Equivalence</h3>
<p>Some results were promising, but most found discrepancies between synthetic and human results.</p>
<p>For example, Dillion et al. (2023) found significant alignment between synthetic and human moral judgments, but Almeida et al. (2024) reported that even when synthetic and human moral judgments correlated, there were systematic differences with synthetic data exaggerating effects seen with humans.</p>
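<p>To see how a high correlation can coexist with mismatched means, consider the minimal Python sketch below. The ratings are hypothetical (they are not data from any of the cited papers): the synthetic values simply exaggerate the human pattern around the scale midpoint, which is one simplified way to picture the kind of systematic difference Almeida et al. describe.</p>

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical mean ratings for five items on a 1-7 scale (illustrative only).
human = [2.0, 3.0, 4.0, 5.0, 6.0]

# Hypothetical synthetic ratings that exaggerate the human pattern
# by stretching each value 1.5 times farther from the midpoint (4).
synthetic = [4 + 1.5 * (h - 4) for h in human]

r = pearson_r(human, synthetic)
mean_abs_diff = statistics.fmean([abs(h - s) for h, s in zip(human, synthetic)])

print(round(r, 3), round(mean_abs_diff, 2))  # 1.0 0.6
```

<p>Here the correlation is perfect because the synthetic ratings rise and fall exactly with the human ratings, yet the two sets of ratings differ by an average of 0.6 points per item. Correlation captures whether the pattern is the same, not whether the values are.</p>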
<h3>Superficial Agreement, Deeper Errors</h3>
<p>Issues with synthetic data included reduced variance, misalignment of means/percentages, distorted correlations, inaccurate regression coefficients, and shallow experiential narratives.</p>
<p>Different studies reported different issues. Sometimes high-level means matched but deeper correlational metrics were distorted (e.g., Bisbee et al., 2024); at other times, correlations were high, but there were significant differences among means (e.g., de Winter et al., 2024).</p>
<h3>Rapid Model Changes Make Findings Quickly Outdated</h3>
<p>Research on synthetic users is complicated by variation in contexts, models, prompting, and settings.</p>
<p>Controlled experimentation relies on being able to control the experimental environment. Different researchers use different models with different prompting and settings. Even the papers published in 2026 used older models than are currently available because research necessarily precedes peer-reviewed publication. Next year’s models will be different from this year’s.</p>
<h3>Proprietary Models May Work but Lack Validation</h3>
<p>Further complicating the research landscape is the emergence of proprietary models incorporating extensive amounts of survey data. Proprietary models from companies like Qualtrics and Aaru might perform better than general LLM chatbots in the production of synthetic samples that match human attitudes and performance. It’s just too early to tell. To date, we have not seen any peer-reviewed publications using these platforms.</p>
<h3>Directional When Answers Are Unknown</h3>
<p>The encouraging results regarding occasionally high correlation of human and synthetic data suggest that the results from synthetic users can provide directional signals, but synthetic estimates are often imprecise and inconsistent from study to study. The promise of synthetic users is alluring, but until there is strong evidence of consistently good matching with human data, it seems premature to rely on research with synthetic users for critical decision-making.</p>
<h3>Potentially Useful When Answers Are Known and Stable</h3>
<p>We’re not done yet with this topic and are planning our own analysis. But right now, it seems the most promising use of synthetic users is deriving insights from already collected data. Why ask a survey question if the answer is already known and stable? Most attitudes aren’t stable and are highly dependent on the audience. But if you have surveyed the same type of population repeatedly and have relatively stable results (possibly like the EY study), then you may know the answer. In that case, an LLM is just an easier way to query your database. Just don’t think it’s predicting something that’s not already known.</p>
<h2>Appendix: Links to Papers and Articles</h2>
<p><a href="https://arxiv.org/pdf/2308.01264">Almeida, G. F. C. F., Nunes, J. L., Engelmann, N., Wiegmann, A., &amp; de Araújo, M. (2024). Exploring the psychology of LLMs’ moral and legal reasoning. <em>Artificial Intelligence</em>, <em>333</em>.</a></p>
<p><a href="https://www.ey.com/en_us/insights/wealth-asset-management/how-ai-simulation-accelerates-growth-in-wealth-and-asset-management">Babcic, S., Hamaloglu, U., &amp; Munshi, S. (2025, Oct 25). <em>How AI simulation accelerates growth in wealth and asset management</em>. EY.</a></p>
<p><a href="https://www.cambridge.org/core/journals/political-analysis/article/synthetic-replacements-for-human-survey-data-the-perils-of-large-language-models/B92267DC26195C7F36E63EA04A47D2FE">Bisbee, J., Clinton, J. D., Dorff, C., Kenkel, B., &amp; Larson, J. M. (2024). Synthetic replacements for human survey data? The perils of large language models. <em>Political Analysis</em>, <em>32</em>(4), 401–416.</a></p>
<p><a href="https://quantuxblog.com/synthetic-survey-data-its-not-data">Chapman, C. (2025, Jun 18). <em>Synthetic survey data? It’s not data</em>. Quantitative UX Research Blog. </a></p>
<p><a href="https://www.sciencedirect.com/science/article/pii/S0191886924001892">de Winter, J. C. F., Driessen, T., &amp; Dodou, D. (2024). The use of ChatGPT for personality research: Administering questionnaires using generated personas. <em>Personality and Individual Differences</em>, <em>228</em>, #112729.</a></p>
<p><a href="https://static1.squarespace.com/static/671011231a30d401349ce94c/t/67c7999a4348db3e090ea128/1741134235231/can-AI-language-models-replace-human-participants.pdf">Dillion, D., Tandon, N., Gu, Y., &amp; Gray, K. (2023). Can AI language models replace human participants? <em>Trends in Cognitive Sciences</em>, <em>27</em>(7), 597–600. </a></p>
<p><a href="https://pubsonline.informs.org/doi/10.1287/mksc.2023.0306">Goli, A., &amp; Singh, A. (2024). Can large language models capture human preferences? <em>Marketing Science</em>, <em>43</em>(4), 709–722. [Abstract only]. </a></p>
<p><a href="https://dl.acm.org/doi/10.1145/3706598.3713220">Kapania, S., Agnew, W., Eslami, M., Heidari, H., &amp; Fox, S. E. (2025). Simulacrum of stories: Examining large language models as qualitative research participants. <em>Proceedings of CHI ’25</em>, #489, 1–17.</a></p>
<p><a href="https://www.researchgate.net/publication/401777396_Synthetic_Participants_Generated_by_Large_Language_Models_A_Systematic_Literature_Review">Kuric, E., Demcak, P., &amp; Krajcovic, M. (2026). Synthetic participants generated by large language models: A systematic literature review. [Preprint—Not yet peer reviewed]. </a></p>
<p><a href="https://www.greenbook.org/insights/data-science/testing-synthetic-data-against-academic-benchmarks-a-replication-study">McLean, D. (2026, Feb 2). <em>Testing synthetic data against academic benchmarks: A replication study</em>. Greenbook. </a></p>
<p><a href="https://developmentcorporate.com/saas/synthetic-responses-market-research-2025/">Mecke, J. (2025, Oct 31). Synthetic responses in market research: Promise vs. reality in 2025. Development Corporate.</a></p>
<p><a href="https://www.quirks.com/articles/exploring-the-challenges-and-potential-of-synthetic-data-and-survey-participants">Millman, S. (2025, Feb 25). <em>Exploring the challenges and potential of synthetic data and survey participants</em>. Quirk&#8217;s Media.</a></p>
<p><a href="https://www.thevoiceofuser.com/aaa-billion-dollar-ai-startup-is-selling-you-a-survey-the-wall-street-journal-wrote-a-love-letter-about-it/">Papas, C. (2026, Mar 15). <em>Question: Is Aaru actually proving that synthetic research can predict human behavior and replace real user research?</em> The Voice of User.</a></p>
<p><a href="https://link.springer.com/article/10.3758/s13428-023-02307-x">Park, P. S., Schoenegger, P., &amp; Zhu, C. (2024). Diminished diversity-of-thought in a standard large language model. <em>Behavior Research Methods</em>, <em>56</em>(6), 5754–5770.</a></p>
<p><a href="https://ideas.repec.org/a/spr/jcsosc/v9y2026i1d10.1007_s42001-025-00452-1.html">Rafikova, A., &amp; Voronin, A. (2026). ChatGPT as a research proxy: Simulating human attitudes in social science research. <em>Journal of Computational Social Science</em>, <em>9</em>(17). [Abstract only].</a></p>
<p><a href="https://journals.sagepub.com/doi/10.1177/23794607241311793">Shrestha, P., Krpan, D., Koaik, F., Schnider, R., Sayess, D., &amp; Binbaz, M. S. (2024). Beyond WEIRD: Can synthetic survey participants substitute for humans in global policy research? <em>Behavioral Science &amp; Policy</em>, <em>10</em>(2), 26–45.</a></p>
<p><a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00685/124261/Do-LLMs-Exhibit-Human-like-Response-Biases-A-Case">Tjuatja, L., Chen, V., Wu, T., Talwalkar, A., &amp; Neubig, G. (2024). Do LLMs exhibit human-like response biases? A case study in survey design. <em>Transactions of the Association for Computational Linguistics</em>, <em>12</em>, 1011–1026.</a></p>
<p><a href="https://www.sciencedirect.com/science/article/pii/S2590198226000825">Wallius, E., &amp; Lehtonen, E. (2026). Beyond human proxies: The roles and usefulness of large language models in user research for mobility service development. <em>Transportation Research Interdisciplinary Perspectives</em>, <em>36</em>, #101917.</a></p>
<p><a href="https://arxiv.org/pdf/2402.01908">Wang, A., Morgenstern, J., &amp; Dickerson, J. P. (2025). Large language models that replace human participants can harmfully misportray and flatten identity groups. <em>Nature Machine Intelligence</em>, <em>7</em>, 400–411.</a></p>
<p><a href="https://www.sciencedirect.com/science/article/pii/S2949882125001173">Yu, T., Pan, S., Fan, C., Luo, S., Jin, Y., &amp; Zhao, B. (2025). Can large language models exhibit cognitive and affective empathy as humans? <em>Computers in Human Behavior: Artificial Humans</em>, <em>6</em>, #100233.</a></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Credible vs. Confidence Intervals: Different Meanings but Similar Decisions</title>
		<link>https://measuringu.com/credible-vs-confidence-intervals/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=credible-vs-confidence-intervals</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD&nbsp;•&nbsp;Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 08 Apr 2026 06:49:35 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[Bayesian theorem]]></category>
		<category><![CDATA[confidence interval]]></category>
		<category><![CDATA[Confidence Intervals]]></category>
		<category><![CDATA[credible interval]]></category>
		<category><![CDATA[credible intervals]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47234</guid>

					<description><![CDATA[We’ve written a lot about confidence intervals for the last two decades. We especially encourage them for small sample studies. Some of you even bought into our recommendation and use them yourselves (a decision we continue to support). But maybe you’ve heard about Bayesian credible intervals and wonder if you should be using them instead. [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47288 size-medium" src="https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-300x169.png" alt="Feature image shows two researchers, each examining a measuring tool, with a specific interval highlighted." width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/04/040726-FeatureImage-1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>We’ve <a href="https://measuringu.com/article/estimating-completion-rates-from-small-samples-using-binomial-confidence-intervals-comparisons-and-recommendations/">written a lot about confidence intervals</a> for the last two decades.</p>
<p>We especially encourage them for small sample studies.</p>
<p>Some of you even bought into our recommendation and use them yourselves (a decision we continue to support).</p>
<p>But maybe you’ve heard about <a href="https://en.wikipedia.org/wiki/Credible_interval">Bayesian credible intervals</a> and wonder if you should be using them instead.</p>
<p>In this article, we return to an <a href="https://measuringu.com/bayes-law-in-ux-research-the-power-and-perils-of-priors">example used in our previous articles</a> on Bayesian methods applied to UX research and compare analyses of that example with confidence and credible intervals.</p>
<h2><span lang="EN-US">Confidence Interval Analysis</span></h2>
<p>In our recurring example, 18 of 20 participants successfully completed a checkout task (a 90% completion rate). But if we were to test hundreds, thousands, or (somehow) all potential users, would the completion rate be exactly 90%? Almost surely not. But instead of trying to nail down an exact single number, a likely range is usually sufficient for decision making and surprisingly easy to compute and accurate even for small sample sizes.</p>
<p>For this type of data (binary), the likely range can be computed using an adjusted-Wald confidence interval with 95% confidence. That interval is 68.7% to 98.4%.</p>
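<p>The adjusted-Wald computation can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual form of the adjustment (add z²/2 successes and z² trials before applying the Wald formula); the function name <code>adjusted_wald</code> is hypothetical and is not taken from the online calculator.</p>

```python
import math
from statistics import NormalDist

def adjusted_wald(successes, n, confidence=0.95):
    """Adjusted-Wald interval for a binomial proportion (illustrative sketch)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    n_adj = n + z ** 2                         # adds about 4 trials in total
    p_adj = (successes + z ** 2 / 2) / n_adj   # adds about 2 successes
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the valid range for a proportion
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

low, high = adjusted_wald(18, 20)
print(f"{low:.1%} to {high:.1%}")  # prints "68.7% to 98.4%"
```

<p>Calling the function with other counts (for example, <code>adjusted_wald(9, 10)</code>) shows how quickly the interval widens as the sample shrinks.</p>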
<p>We’ve made it easy to compute binomial confidence intervals with <a href="https://measuringu.com/calculators/wald/">our online calculator</a>. But how do you interpret or explain what it means? How about:</p>
<ul>
<li>There’s a 95% probability the true completion rate is between 68.7% and 98.4%.</li>
<li>There’s a 95% chance the true completion rate falls within 68.7% and 98.4%.</li>
<li>95% of future tests will have completion rates between 68.7% and 98.4%.</li>
</ul>
<p>Strictly speaking, all three of those statements are wrong. A stats professor or Bayesian enthusiast will be happy to point out that error.</p>
<p>The more technically correct way to describe the interval is:</p>
<ul>
<li>If we ran many tests, each with 20 users from the same population and computed confidence intervals each time, on average, 95 out of 100 confidence intervals will contain the unknown population completion rate.</li>
</ul>
<p>Strictly speaking, we are 95% confident <em>in the method </em>of generating confidence intervals and not in any given interval. The confidence interval we generated from the sample data either does or does not contain the population completion rate.</p>
<p>We don’t know if our sample of 20 is one of those five whose confidence interval doesn&#8217;t contain the completion rate. So, it’s best to avoid using “probability” or “chance” when describing a confidence interval and remember that we’re 95% confident in the process of generating confidence intervals rather than a given interval.</p>
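<p>The "confidence in the method" idea can be illustrated with a small simulation. The sketch below assumes a hypothetical population whose true completion rate is 90%, draws many samples of 20, and counts how often the adjusted-Wald interval contains the true rate. The true rate, number of simulated studies, and seed are illustrative assumptions.</p>

```python
import math
import random
from statistics import NormalDist

def adjusted_wald(successes, n, confidence=0.95):
    """Adjusted-Wald interval for a binomial proportion (illustrative sketch)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

def simulated_coverage(true_rate, n=20, studies=10_000, seed=1):
    """Share of simulated studies whose interval contains the true rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        # Each participant succeeds with probability true_rate
        successes = sum(true_rate > rng.random() for _ in range(n))
        low, high = adjusted_wald(successes, n)
        if high >= true_rate >= low:
            hits += 1
    return hits / studies

coverage = simulated_coverage(true_rate=0.90)
print(f"Share of intervals containing the true rate: {coverage:.1%}")
```

<p>The share should land close to 95%, though not exactly on it, because only 21 outcomes are possible with 20 participants. Any single interval in the simulation either contains 0.90 or it does not, which is the distinction the technically correct statement draws.</p>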
<p>So, we have just one study, and we computed only one interval. What does that mean? What are we “allowed” to say other than that cumbersome statement? We have a couple of recommendations suitable for practical decision making:</p>
<ul>
<li><strong>Likely range</strong>: “68.7% to 98.4% is the most likely range for the unknown completion rate from all users.”</li>
<li><strong>Plausible range</strong> (from <a href="https://www.amazon.com/Confidence-Intervals-Quantitative-Applications-Sciences/dp/076192499X">Smithson, 2002</a>): “Given this data, values inside the confidence interval are plausible while those outside are implausible. The observed completion rate of 90% is plausible but rates lower than 68.7% or higher than 98.4% are implausible.”</li>
</ul>
<p>This is where the precision of numbers meets the imprecision of language. Although confidence, probability, likely, and plausible all sound about the same, they have more precise usage when it comes to statistics and probability.</p>
<p>This rigidity in language makes confidence intervals less than ideal for communicating results to stakeholders, who are unlikely to have a sophisticated understanding of them (although <a href="https://link.springer.com/article/10.3758/s13423-013-0572-3">even professors sometimes struggle with the concept</a>).</p>
<h2><span lang="EN-US">Credible Interval Analysis</span></h2>
<p>One proposed alternative is the Bayesian credible interval.</p>
<p>Credible intervals are designed to allow for the interpretation people naturally want to use. A 95% credible interval can be interpreted as having a 95% probability of containing the true value.</p>
<p>Like with confidence intervals, there are different computations used to generate credible intervals on binary data. And like with confidence intervals, there are debates about which method is optimal. We won’t get into that debate here. Instead, we’ve provided in Table 1 three Bayesian credible intervals for our example that differ in <a href="https://measuringu.com/bayes-law-in-ux-research-the-power-and-perils-of-priors">their priors</a> (all of which are <a href="https://nvlpubs.nist.gov/nistpubs/TechnicalNotes/NIST.TN.2119.pdf">commonly used in practice</a>).</p>

<table id="tablepress-1026" class="tablepress tablepress-id-1026">
<thead>
<tr class="row-1">
	<th class="column-1"><strong>Method</strong></th><th class="column-2"><strong>Prior/Setup</strong></th><th class="column-3"><strong>95% Interval</strong></th>
</tr>
</thead>
<tbody class="row-striping">
<tr class="row-2">
	<td class="column-1">Adjusted-Wald</td><td class="column-2">Add ~2 successes &amp; ~2 failures</td><td class="column-3"><strong>68.7% to 98.4%</strong></td>
</tr>
<tr class="row-3">
	<td class="column-1">Bayesian credible interval</td><td class="column-2">Beta(1,1)—Uniform prior</td><td class="column-3"><strong>69.6% to 97.0%</strong></td>
</tr>
<tr class="row-4">
	<td class="column-1">Bayesian credible interval</td><td class="column-2">Beta(0.5, 0.5)—Jeffreys prior</td><td class="column-3"><strong>71.6% to 97.9%</strong></td>
</tr>
<tr class="row-5">
	<td class="column-1">Bayesian credible interval</td><td class="column-2">Beta(2, 2)—Agresti prior</td><td class="column-3"><strong>66.4% to 95.0%</strong></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1026 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Four 95% interval estimates, one confidence and three credible.</p>
<p>For example, using a uniform prior with 18 successes and 2 failures, the 95% Bayesian credible interval is 69.6% to 97.0%.</p>
<p>We can say there’s a 95% probability that the true and unknown completion rate is between 69.6% and 97.0%.</p>
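<p>The credible intervals in Table 1 can be approximated without special software. The sketch below assumes equal-tailed intervals (2.5th to 97.5th percentile of the posterior) and uses the standard result that a Beta(a, b) prior combined with 18 successes and 2 failures gives a Beta(a + 18, b + 2) posterior. The helper functions are hypothetical, and the numerical integration is a simple stand-in for a statistics library.</p>

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for x strictly between 0 and 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_cdf(x, a, b, steps=1000):
    """Cumulative probability up to x by Simpson's rule (assumes a > 1)."""
    h = x / steps
    total = beta_pdf(x, a, b)  # the density at 0 is 0 when a > 1
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3

def beta_quantile(q, a, b):
    """Invert the cumulative distribution by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if q > beta_cdf(mid, a, b):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

successes, failures = 18, 2
priors = {"Uniform": (1, 1), "Jeffreys": (0.5, 0.5), "Agresti": (2, 2)}
intervals = {}
for name, (a0, b0) in priors.items():
    a, b = a0 + successes, b0 + failures  # conjugate update of the prior
    intervals[name] = (beta_quantile(0.025, a, b), beta_quantile(0.975, a, b))
    low, high = intervals[name]
    print(f"{name}: {low:.1%} to {high:.1%}")
```

<p>Highest-density intervals, another common choice, would give somewhat different endpoints for a skewed posterior like this one.</p>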
<p>Stats professors are happy with that statement. Bayesian purists are happy with that statement. And your stakeholders probably understand that statement too!</p>
<p>So, should we all start using credible intervals and abandon confidence intervals? Not necessarily.</p>
<p>Credible intervals require more complex calculations and usually don’t have the simple closed-form solution of the adjusted-Wald interval. In practice, however, this difference is negligible because modern software handles the computation (e.g., we used the binom.bayes function in the R package binom).</p>
<p>But did you notice anything about the values in Table 1? The intervals are all very similar, as shown in the graph in Figure 1.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-scaled.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47293" src="https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-300x175.png" alt="Graph of the four intervals " width="1200" height="698" srcset="https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-300x175.png 300w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-1024x596.png 1024w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-768x447.png 768w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-1536x894.png 1536w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-2048x1192.png 2048w, https://measuringu.com/wp-content/uploads/2026/04/040726-Figure1-1-600x349.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> Graph of the four intervals (Green: adjusted-Wald, Blue: Bayesian Uniform, Orange: Bayesian Jeffreys, Black: Bayesian Agresti); dashed green line shows limits of adjusted-Wald interval across the three Bayesian intervals.</p>
<p>The intervals differ only slightly. The width of the adjusted-Wald interval is 29.7 percentage points. The Uniform and Jeffreys intervals lie within the adjusted-Wald interval (with respective widths of 27.4 and 26.3 points), while the Agresti interval has about the same width as the adjusted-Wald (28.6 points), with its upper and lower endpoints shifted down relative to the adjusted-Wald interval by 3.4 and 2.3 points, respectively.</p>
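<p>The width and containment comparisons can be checked directly from the Table 1 endpoints. This short sketch only redoes the arithmetic; the variable names are illustrative.</p>

```python
# Interval endpoints (in percent) copied from Table 1
intervals = {
    "Adjusted-Wald": (68.7, 98.4),
    "Uniform": (69.6, 97.0),
    "Jeffreys": (71.6, 97.9),
    "Agresti": (66.4, 95.0),
}

wald_low, wald_high = intervals["Adjusted-Wald"]
summary = {}
for name, (low, high) in intervals.items():
    width = round(high - low, 1)
    # True when the interval lies entirely inside the adjusted-Wald interval
    inside = low >= wald_low and wald_high >= high
    summary[name] = (width, inside)
    print(f"{name}: width {width} points, inside adjusted-Wald: {inside}")
```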
<p>If the output is roughly the same, does it really matter? The numbers don’t know where they come from.</p>
<p>This is similar to the debate about ordinal versus interval data. As Lord (1953) noted, even <a href="https://is.muni.cz/el/fss/jaro2010/PSY454/um/Frederick_Lord_On_the_statistical_treatment_of_football_numbers.pdf">nominal values like football jersey numbers can be averaged</a>. The math works, but proper interpretation is critical.</p>
<p>Confidence intervals and credible intervals can yield nearly identical results, especially for this type of data. In many cases, <strong>they will lead to the same practical decision</strong>, even though the interpretation differs.</p>
<p>So, what should you do?</p>
<p>The results here suggest that, at least for this type of data, traditional confidence intervals and Bayesian credible intervals can produce very similar ranges. The main difference is not in the numbers, but in how we interpret and communicate them.</p>
<p>That’s one reason we continue to recommend confidence intervals. They are well understood, widely taught, and, when used appropriately, provide accurate estimates of the range of plausible values.</p>
<p>At the same time, we understand the appeal of credible intervals. The interpretation is more natural and often aligns better with how stakeholders think about uncertainty.</p>
<p>In practice, either approach can be effective. What matters most is understanding what the interval represents and communicating it clearly. Decisions are made by inspecting the endpoints of the intervals. If you’d make the same decision for both endpoints, then you have enough information to make the decision. Otherwise, you need more data. In this example, it seems unlikely that the slight variation in endpoint values would affect real-world decision making.</p>
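<p>The endpoint rule can be written as a small function. The sketch below applies it to all four Table 1 intervals against two benchmarks. The benchmark values (65% and 78%) and the function name <code>decide</code> are hypothetical and chosen only for illustration.</p>

```python
# Interval endpoints (in percent) copied from Table 1
intervals = {
    "Adjusted-Wald": (68.7, 98.4),
    "Uniform": (69.6, 97.0),
    "Jeffreys": (71.6, 97.9),
    "Agresti": (66.4, 95.0),
}

def decide(interval, benchmark):
    """Return a decision only when both endpoints agree about the benchmark."""
    low, high = interval
    if low >= benchmark:
        return "meets benchmark"
    if benchmark > high:
        return "misses benchmark"
    return "need more data"  # the endpoints disagree

all_decisions = {}
for benchmark in (65.0, 78.0):  # hypothetical benchmarks
    all_decisions[benchmark] = {
        name: decide(interval, benchmark) for name, interval in intervals.items()
    }
    print(benchmark, all_decisions[benchmark])
```

<p>For both hypothetical benchmarks, all four intervals lead to the same decision. A benchmark that falls between the lower endpoints (for example, 68%) is the kind of case where the choice of interval could matter.</p>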
<p>Notably, in this example, the confidence interval encompassed two of the Bayesian intervals, so not only did it have 95% confidence from a frequentist perspective, but it also had at least 95% credibility from a Bayesian perspective.</p>
<p>We’ll continue to explore where these approaches differ more meaningfully in future articles, including whether these similarities extend beyond this example to different proportions and to other statistics such as means.</p>
<h2>Key Takeaways</h2>
<p>In this latest article on Bayesian methods, we covered:</p>
<ul>
<li>Confidence intervals are harder to explain than most people think.</li>
<li>Credible intervals match how people want to interpret uncertainty.</li>
<li>In this example, both methods produce very similar ranges.</li>
<li>The difference is less about the numbers and more about what we can say about them.</li>
<li>Use either approach thoughtfully, but focus on clear communication.</li>
</ul>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Bayes’ Law in UX Research: The Power and Perils of Priors</title>
		<link>https://measuringu.com/bayes-law-in-ux-research-the-power-and-perils-of-priors/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=bayes-law-in-ux-research-the-power-and-perils-of-priors</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD and Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 01 Apr 2026 03:35:38 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[Bayesian theorem]]></category>
		<category><![CDATA[Statistics]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47171</guid>

					<description><![CDATA[&#8220;That confirms what I expected.&#8221; The same data, two different conclusions. A 90% completion rate from 20 participants on a usability test of a checkout flow. Is that completion rate better than the historical average of 78%? One researcher says yes, definitely. Another says no, it’s in line with the historical average. Both are using [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47227 size-medium" src="https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-300x169.png" alt="Feature image showing two balance scales with urns on each side" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/03/033126-FeatureImage.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>&#8220;That confirms what I expected.&#8221;</p>
<p>The same data, two different conclusions.</p>
<p>A 90% completion rate from 20 participants on a usability test of a checkout flow. Is that completion rate better than the <a href="https://measuringu.com/task-completion/">historical average of 78%</a>?</p>
<p>One researcher says yes, definitely. Another says no, it’s in line with the historical average.</p>
<p>Both are using the same <a href="https://measuringu.com/intro-to-bayesian-thinking-in-ux-research/">Bayesian method</a>. How can the same data produce opposite conclusions?</p>
<p>The answer lies in <em>priors</em>, the assumptions you bring to the analysis before the data impact the decision.</p>
<p>In our previous article, we assumed equal priors when <a href="https://measuringu.com/bayes-law-in-ux-research-from-urns-to-users">analyzing completion rate data</a> to simplify the analysis. But what happens when those priors change?</p>
<p>In this article, we explore the consequences of manipulating those prior probabilities in different ways.</p>
<h2><span lang="EN-US">The Effect of Priors on the Outcome</span></h2>
<p>In Bayesian analysis, we assign numerical probabilities to prior beliefs about competing hypotheses. Priors reflect how plausible we think each explanation is before seeing the current data.</p>
<p>If a prior belief is well supported, we give it more weight. If it’s less credible, we give it less weight. When we don’t have strong prior information, we can assign roughly equal weights, allowing the observed data to play a larger role in the conclusion.</p>
<p>In our example, 18 of 20 participants successfully completed a task (a 90% completion rate). We want to understand how different prior beliefs affect our interpretation of this result when compared to a historical completion rate of 78%.</p>
<p>To do this, we compare two hypotheses: that the true completion rate is 78% (historical) or 90% (based on the observed data), under different prior assumptions. We could also test other values (e.g., 85% or 95%), but we use 90% as a convenient reference based on the sample, recognizing that this is a simplifying modeling choice.</p>
<p>So, which is more plausible: a 78% or 90% completion rate?</p>
<p>We examine five scenarios that vary the strength and direction of the prior belief:</p>
<ol>
<li>Neutral prior (no preference)</li>
<li>Weak prior favoring a 78% completion rate</li>
<li>Weak prior favoring a 90% completion rate</li>
<li>Strong prior favoring a 78% completion rate</li>
<li>Strong prior favoring a 90% completion rate</li>
</ol>
<p>So how do we quantify the strength of our prior beliefs? What values should we use to represent neutral, weak, and strong preferences for one hypothesis over another?</p>
<p>A neutral prior is straightforward: a 50/50 split reflecting no preference. But once we move beyond that, the choice of “weak” or “strong” priors becomes less clear.</p>
<p>If we move slightly off a neutral stance, values like 60/40 seem reasonable. But whether we use 60/40, 70/30, or 80/20 is somewhat arbitrary. We use 0.6 and 0.8 to represent weak and strong prior preferences, respectively.</p>
<p>To avoid confusion between completion rates (e.g., 90%) and prior probabilities (e.g., 0.8), we use decimal values for the priors.</p>
<p>When we apply these values to the Bayesian formula (see the appendix), we obtain the results shown in Table 1.</p>
<p>Each row represents a different prior scenario. The second and third columns show the prior beliefs assigned to each hypothesis. The next two columns show how those beliefs are updated after observing 18 of 20 participants complete the task. The final two columns show which hypothesis is more likely and the odds of the 90% hypothesis relative to the 78% hypothesis.</p>
<p>For example, with neutral priors, the 90% completion rate is 2.7 times more likely than the 78% completion rate. In contrast, with a strong prior favoring 78%, the 78% completion rate becomes more likely than the 90% completion rate.</p>

<table id="tablepress-1025" class="tablepress tablepress-id-1025">
<thead>
<tr class="row-1">
	<th rowspan="2" class="column-1"><center>Prior Scenario</center></th><th colspan="2" class="column-2">Prior Belief in</th><th colspan="2" class="column-4">Updated Belief in</th><th rowspan="2" class="column-6"><center>Which Is<br>More Likely?</center></th><th rowspan="2" class="column-7"><center>Odds<br>(90% vs. 78%)</center></th>
</tr>
<tr class="row-2">
	<th class="column-2">90%</th><th class="column-3">78%</th><th class="column-4">90%</th><th class="column-5">78%</th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-3">
	<td class="column-1">Neutral prior (no preference)</td><td class="column-2">0.5</td><td class="column-3">0.5</td><td class="column-4">0.732</td><td class="column-5">0.268</td><td class="column-6"><center>90%</center></td><td class="column-7"><center>2.7×</center></td>
</tr>
<tr class="row-4">
	<td class="column-1">Weak prior favoring 78%</td><td class="column-2">0.4</td><td class="column-3">0.6</td><td class="column-4">0.645</td><td class="column-5">0.355</td><td class="column-6"><center>90%</center></td><td class="column-7"><center>1.8×</center></td>
</tr>
<tr class="row-5">
	<td class="column-1">Weak prior favoring 90%</td><td class="column-2">0.6</td><td class="column-3">0.4</td><td class="column-4">0.804</td><td class="column-5">0.196</td><td class="column-6"><center>90%</center></td><td class="column-7"><center>4.1×</center></td>
</tr>
<tr class="row-6">
	<td class="column-1">Strong prior favoring 78%</td><td class="column-2">0.2</td><td class="column-3">0.8</td><td class="column-4">0.405</td><td class="column-5">0.595</td><td class="column-6"><center>78%</center></td><td class="column-7"><center><strong>0.68×<br>(≈1.5× for 78%)</strong></center></td>
</tr>
<tr class="row-7">
	<td class="column-1">Strong prior favoring 90%</td><td class="column-2">0.8</td><td class="column-3">0.2</td><td class="column-4">0.916</td><td class="column-5">0.084</td><td class="column-6"><center>90%</center></td><td class="column-7"><center><strong>10.9×</strong></center></td>
</tr>
</tbody>
</table>
<!-- #tablepress-1025 from cache -->
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Effect of different priors on updated beliefs.</p>
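<p>The Table 1 values can be recomputed with a short Python sketch of the two-hypothesis update described in the appendix. The variable names are illustrative. The sketch uses unrounded likelihoods, so results can differ from the table in the third decimal place, because the appendix works with likelihoods rounded to 0.0015 and 0.00055.</p>

```python
# Likelihood of 18 successes and 2 failures under each hypothesis
# (the binomial coefficient is omitted because it cancels)
like_90 = 0.90 ** 18 * 0.10 ** 2  # about 0.0015
like_78 = 0.78 ** 18 * 0.22 ** 2  # about 0.00055

# Prior probability assigned to the 90% hypothesis in each scenario
scenarios = {
    "Neutral": 0.5,
    "Weak favoring 78%": 0.4,
    "Weak favoring 90%": 0.6,
    "Strong favoring 78%": 0.2,
    "Strong favoring 90%": 0.8,
}

results = {}
for name, prior_90 in scenarios.items():
    prior_78 = 1 - prior_90
    evidence = like_90 * prior_90 + like_78 * prior_78
    post_90 = like_90 * prior_90 / evidence  # updated belief in 90%
    odds = post_90 / (1 - post_90)           # odds of 90% vs. 78%
    results[name] = (post_90, odds)
    print(f"{name}: P(90%|D) = {post_90:.3f}, odds = {odds:.2f}")
```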
<h2><span lang="EN-US">How Our Conclusions Change Based on Priors</span></h2>
<p>In four of the five scenarios, the 90% completion rate is more likely. In one case, it’s more than ten times as likely as the 78% completion rate. Only when we strongly favor the historical data does the conclusion shift, making the 78% completion rate more likely despite the observed results.</p>
<p>Changing only the prior belief can shift the conclusion from favoring 78% to strongly favoring 90%. No new data were added. In this example, changing the prior assumption had a larger effect on the conclusion than a modest increase in sample size would. This raises a natural question: how much additional data would be needed to overcome a strong prior?</p>
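<p>One rough way to explore that question is to hold the observed completion rate at 90%, keep the strong prior favoring 78% (0.2 vs. 0.8), and increase the sample size until the posterior odds favor 90%. The sketch below is a what-if under those assumptions, not a result from a study; the step size of 10 participants and the function name are illustrative choices.</p>

```python
def posterior_odds(successes, failures, prior_90):
    """Posterior odds of the 90% hypothesis relative to the 78% hypothesis."""
    like_90 = 0.90 ** successes * 0.10 ** failures
    like_78 = 0.78 ** successes * 0.22 ** failures
    return (like_90 * prior_90) / (like_78 * (1 - prior_90))

n = 10
while True:
    successes = n * 9 // 10  # assume the observed rate stays at 90%
    odds = posterior_odds(successes, n - successes, prior_90=0.2)
    print(f"n = {n}: odds (90% vs. 78%) = {odds:.2f}")
    if odds > 1:
        break
    n += 10
```

<p>Under these assumptions, the loop stops at 30 participants (27 successes), where the odds first tip slightly in favor of 90%. A stronger prior or a smaller gap between the hypotheses would require more data.</p>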
<p>This highlights an important property of Bayesian analysis. The conclusions are influenced not only by the observed data, but also by the strength and direction of the prior beliefs. When priors are strong, they can reinforce or counteract the data. When priors are weak or neutral, the data play a larger role.</p>
<p>Who decides what the historical data are and how relevant they are? And how strongly should you weight the priors? There isn’t a Bayesian rule book we can reference. Instead, it comes down to making well-informed judgments. But is that judgment always clear, and does it lead to better conclusions?</p>
<p>Understanding how priors affect the decision (under one scenario) is the easy part. Teasing out the pros and cons of this approach with more Bayesian methods and real-world scenarios is the harder part, and the subject of some upcoming articles.</p>
<p>This illustrates both the potential power and the potential risk of Bayesian analysis. It can incorporate prior knowledge in a principled way, but when priors are uncertain, subjective, or weakly supported, the results may reflect assumptions as much as evidence.</p>
<h2><span lang="EN-US">Summary and Discussion</span></h2>
<p>In a <a href="https://measuringu.com/bayes-law-in-ux-research-from-urns-to-users">previous article</a>, we extended a classical problem in Bayesian comparison of the likelihoods of two hypotheses to a UX research context using an approach that required only simple algebra.</p>
<p>In this article, we showed how variation in prior belief can affect the posterior probabilities of competing UX hypotheses, potentially having a larger impact than small changes in the observed data. For this example, varying the priors had a large effect on the posterior probabilities of the hypotheses (from <strong>0.405 to 0.916</strong> for the 90% hypothesis). This may, in part, reflect the relatively small difference between the competing hypotheses (78% vs. 90%, just a 12-point difference).</p>
<h3><span lang="EN-US">What Should Researchers Do About Priors?</span></h3>
<p>In practice, researchers should:</p>
<ul>
<li>Be explicit about the priors they use and how they were chosen.</li>
<li>Test multiple plausible priors to understand how sensitive the conclusions are to variation in priors (e.g., <a href="https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2020.608045/full">prior sensitivity analysis</a>).</li>
<li>Be cautious when priors are uncertain or weakly supported.</li>
<li>Consider collecting more data when conclusions depend heavily on prior assumptions.</li>
</ul>
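<p>A simple prior sensitivity check for this example is to sweep the prior probability for the 90% hypothesis and find the point where the conclusion flips. The sketch below assumes the same two-hypothesis setup as the appendix. The tipping point follows from setting the posterior odds to 1, which gives a prior of 1 / (1 + likelihood ratio). Variable names are illustrative.</p>

```python
like_90 = 0.90 ** 18 * 0.10 ** 2
like_78 = 0.78 ** 18 * 0.22 ** 2
likelihood_ratio = like_90 / like_78  # evidence for 90% over 78%

# Prior on the 90% hypothesis at which the posterior odds equal 1
tipping_prior_90 = 1 / (1 + likelihood_ratio)
print(f"Conclusion flips when the prior for 90% drops below {tipping_prior_90:.3f}")

# Sweep the prior from 0.1 to 0.9 and record the favored hypothesis
favored = {}
for step in range(1, 10):
    prior_90 = step / 10
    odds = likelihood_ratio * prior_90 / (1 - prior_90)
    favored[prior_90] = "90%" if odds > 1 else "78%"
    print(f"prior {prior_90:.1f}: odds {odds:.2f}, favors {favored[prior_90]}")
```

<p>If the range of defensible priors sits well away from the tipping point, the conclusion is robust to the choice. If it straddles the tipping point, collecting more data is the safer course.</p>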
<p>Understanding how priors influence results is an important step in using Bayesian methods effectively. It does not mean avoiding Bayesian analysis, but it does mean using it thoughtfully and transparently.</p>
<h2><span lang="EN-US">Appendix</span></h2>
<p>For this example, we assumed 20 participants attempted an online checkout task with 18 successes and 2 failures (90% success). With that result, we want to understand whether it’s more likely that the true successful completion rate is 78% (historical) or our observed 90% (better than historical).</p>
<p>To get the updated beliefs and odds displayed in Table 1, we used the following Bayesian formula.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula.png"><img loading="lazy" decoding="async" class="size-full wp-image-47183 aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula.png" alt="Bayesian formula comparing 78% and 90% completion rates" width="390" height="55" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula.png 390w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-300x42.png 300w" sizes="auto, (max-width: 390px) 100vw, 390px" /></a></p>
<p>where:</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) is the probability of getting this sample (the data, D) if the true completion rate is 90%.</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) is the probability of getting this sample if the true completion rate is 78%.</p>
<p style="padding-left: 25px;"><em>P</em>(90%) is our expected (prior) probability that the true completion rate is 90%.</p>
<p style="padding-left: 25px;"><em>P</em>(78%) is our expected (prior) probability that the true completion rate is 78%.</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) is the conditional probability of the completion rate being 90% given the sample.</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) is 1 – <em>P</em>(90%|<em>D</em>).</p>
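<p>For readers who cannot view the formula image, the standard two-hypothesis form of Bayes’ rule, written with the symbols defined above, is given below. This is an assumed text rendering; the image may arrange the terms differently.</p>

```latex
P(90\% \mid D) =
  \frac{P(D \mid 90\%)\, P(90\%)}
       {P(D \mid 90\%)\, P(90\%) + P(D \mid 78\%)\, P(78\%)}
```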
<p>Using the binomial probability formula, we can compute the probabilities of getting this sample for each of the hypothesized true completion rates:</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) is (0.9)<sup>18</sup> × (0.1)<sup>2</sup> = 0.0015.</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) is (0.78)<sup>18</sup> × (0.22)<sup>2</sup> = 0.00055.</p>
<p>Next, we apply this formula to the five sets of priors (Neutral, Weak Favoring 78%, Weak Favoring 90%, Strong Favoring 78%, Strong Favoring 90%).</p>
<p><strong><em>Technical note</em></strong><em><strong>:</strong> We used binomial probabilities throughout this article because they allow us to illustrate the mechanics of the Bayesian analyses with simple algebra. The downside of this simplification is that we had to assign specific prior probabilities rather than using the current practice of using </em><a href="https://bookdown.org/pbaumgartner/bayesian-fun/05-beta-distribution.html"><em>beta distributions for priors</em></a><em>, but this does not affect the logic of the discussion. Also, we excluded the factorial component of the binomial probability formula because it was constant across the computations.</em></p>
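<p>The point about the factorial component can be verified numerically. The sketch below compares the ratio of full binomial probabilities with the ratio of the simplified terms used in this appendix; the function names are illustrative.</p>

```python
import math

def binomial_probability(rate, successes=18, n=20):
    """Full binomial probability, including the counting term."""
    ways = math.comb(n, successes)  # 190 ways to get 18 successes in 20 tries
    return ways * rate ** successes * (1 - rate) ** (n - successes)

def simplified_term(rate, successes=18, n=20):
    """The same probability without the counting term."""
    return rate ** successes * (1 - rate) ** (n - successes)

ratio_full = binomial_probability(0.90) / binomial_probability(0.78)
ratio_simplified = simplified_term(0.90) / simplified_term(0.78)
print(f"Full ratio: {ratio_full:.4f}, simplified ratio: {ratio_simplified:.4f}")
```

<p>Because the counting term is identical for both hypotheses, it cancels in the ratio and in the normalized posterior, so the two ratios agree.</p>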
<h3><span lang="EN-US">Neutral Prior</span></h3>
<p>If we decide there is no basis for weighting the priors unequally, the values for the formula are:</p>
<p style="padding-left: 25px;"><em>P</em>(90%) = 0.5</p>
<p style="padding-left: 25px;"><em>P</em>(78%) = 0.5</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) = 0.0015</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) = 0.00055</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-neutral-priors.png"><img loading="lazy" decoding="async" class="size-full wp-image-47184 aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-neutral-priors.png" alt="Bayesian formula with neutral priors" width="324" height="211" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-neutral-priors.png 324w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-neutral-priors-300x195.png 300w" sizes="auto, (max-width: 324px) 100vw, 324px" /></a></p>
<p>So:</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) = 0.732</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) = 0.268</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) / <em>P</em>(78%|<em>D</em>) = 2.73</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) / <em>P</em>(90%|<em>D</em>) = 0.37</p>
<p>Conclusion: There is a substantial likelihood that the historical hypothesis (78%) might be true (0.268 isn’t anywhere near 0), but the alternative hypothesis (90%) is <strong>2.7 times more likely</strong>.</p>
<h3><span lang="EN-US">Weak Prior Favoring 78%</span></h3>
<p>If we decide to give a little more weight to the historical hypothesis (78%) and a little less to the alternative hypothesis (90%), we get:</p>
<p style="padding-left: 25px;"><em>P</em>(90%) = 0.4</p>
<p style="padding-left: 25px;"><em>P</em>(78%) = 0.6</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) = 0.0015</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) = 0.00055</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-weak-priors.png"><img loading="lazy" decoding="async" class="wp-image-47185 size-full aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-weak-priors.png" alt="Bayesian formula with weak prior favoring 78%" width="307" height="213" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-weak-priors.png 307w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-weak-priors-300x208.png 300w" sizes="auto, (max-width: 307px) 100vw, 307px" /></a></p>
<p>So:</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) = 0.645</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) = 0.355</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) / <em>P</em>(78%|<em>D</em>) = 1.82</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) / <em>P</em>(90%|<em>D</em>) = 0.55</p>
<p>Conclusion: There is a substantial likelihood that the historical hypothesis (78%) might be true (0.355 isn’t anywhere near 0), but the alternative hypothesis (90%) is <strong>1.8 times more likely</strong>.</p>
<h3><span lang="EN-US">Weak Prior Favoring 90%</span></h3>
<p>If we decide to give a little more weight to the alternative hypothesis (90%) and a little less to the historical hypothesis (78%), we get:</p>
<p style="padding-left: 25px;"><em>P</em>(90%) = 0.6</p>
<p style="padding-left: 25px;"><em>P</em>(78%) = 0.4</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) = 0.0015</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) = 0.00055</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-favoring-90.png"><img loading="lazy" decoding="async" class="wp-image-47186 size-full aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-favoring-90.png" alt="Bayesian formulas with weak prior favoring 90%" width="320" height="207" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-favoring-90.png 320w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-favoring-90-300x194.png 300w" sizes="auto, (max-width: 320px) 100vw, 320px" /></a></p>
<p>So:</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) = 0.804</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) = 0.196</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) / <em>P</em>(78%|<em>D</em>) = 4.09</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) / <em>P</em>(90%|<em>D</em>) = 0.24</p>
<p>Conclusion: There is a decent likelihood that the historical hypothesis (78%) might be true (0.196 isn’t that close to 0), but the alternative hypothesis (90%) is <strong>4.1 times more likely</strong>.</p>
<h3><span lang="EN-US">Strong Prior Favoring 78%</span></h3>
<p>If we decide to give a lot more weight to the historical hypothesis (78%) and a lot less to the alternative hypothesis (90%), we get:</p>
<p style="padding-left: 25px;"><em>P</em>(90%) = 0.2</p>
<p style="padding-left: 25px;"><em>P</em>(78%) = 0.8</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) = 0.0015</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) = 0.00055</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-78.png"><img loading="lazy" decoding="async" class="size-full wp-image-47188 aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-78.png" alt="Bayesian formula with strong priors favoring 78%" width="307" height="213" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-78.png 307w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-78-300x208.png 300w" sizes="auto, (max-width: 307px) 100vw, 307px" /></a></p>
<p>So:</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) = 0.405</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) = 0.595</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) / <em>P</em>(78%|<em>D</em>) = 0.68</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) / <em>P</em>(90%|<em>D</em>) = 1.47</p>
<p>Conclusion: The historical hypothesis (78%) is <strong>about 1.5 times more likely</strong> than the alternative hypothesis (90%), but not by much (neither probability is far from 50%).</p>
<h3><span lang="EN-US">Strong Prior Favoring 90%</span></h3>
<p>If we decide to give a lot more weight to the alternative hypothesis (90%) and a lot less to the historical hypothesis (78%), we get:</p>
<p style="padding-left: 25px;"><em>P</em>(90%) = 0.8</p>
<p style="padding-left: 25px;"><em>P</em>(78%) = 0.2</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|90%) = 0.0015</p>
<p style="padding-left: 25px;"><em>P</em>(<em>D</em>|78%) = 0.00055</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-90.png"><img loading="lazy" decoding="async" class="size-full wp-image-47189 aligncenter" src="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-90.png" alt="Bayesian formula with strong priors favoring 90%" width="318" height="222" srcset="https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-90.png 318w, https://measuringu.com/wp-content/uploads/2026/03/03312026-Bayesian-formula-with-strong-priors-favoring-90-300x209.png 300w" sizes="auto, (max-width: 318px) 100vw, 318px" /></a></p>
<p>So:</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) = 0.916</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) = 0.084</p>
<p style="padding-left: 25px;"><em>P</em>(90%|<em>D</em>) / <em>P</em>(78%|<em>D</em>) = 10.91</p>
<p style="padding-left: 25px;"><em>P</em>(78%|<em>D</em>) / <em>P</em>(90%|<em>D</em>) = 0.09</p>
<p>Conclusion: There is relatively little likelihood that the historical hypothesis (78%) might be true (0.084 is getting close to 0), and the alternative hypothesis (90%) is <strong>10.9 times more likely</strong>.</p>
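<p>The prior scenarios above all apply the same calculation, so they can be reproduced with a few lines of code. The following is a minimal Python sketch (not the authors' own script); the function name <em>posterior</em> is hypothetical, and the two likelihoods are the values reported above.</p>

```python
# Posterior probabilities for two competing hypotheses (90% vs. 78%),
# using the likelihoods reported in the article.
LIKELIHOOD_90 = 0.0015   # P(D|90%)
LIKELIHOOD_78 = 0.00055  # P(D|78%)


def posterior(prior_90, like_90=LIKELIHOOD_90, like_78=LIKELIHOOD_78):
    """Return (P(90%|D), P(78%|D)) for a given prior P(90%).

    Assumes the two hypotheses are the only candidates, so P(78%) = 1 - P(90%).
    """
    prior_78 = 1.0 - prior_90
    evidence = prior_90 * like_90 + prior_78 * like_78  # P(D)
    post_90 = prior_90 * like_90 / evidence
    return post_90, 1.0 - post_90


if __name__ == "__main__":
    # Strong/weak priors favoring 78%, equal priors, weak/strong priors favoring 90%.
    for prior_90 in (0.2, 0.4, 0.5, 0.6, 0.8):
        post_90, post_78 = posterior(prior_90)
        print(f"P(90%)={prior_90:.1f}  P(90%|D)={post_90:.3f}  "
              f"P(78%|D)={post_78:.3f}  ratio={post_90 / post_78:.2f}")
```

<p>With equal priors, the ratio of the posteriors reduces to the likelihood ratio (0.0015 / 0.00055, about 2.73). Every other prior multiplies that ratio by the prior odds, which is why the conclusion shifts as the prior changes even though the data stay the same.</p>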
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>How to Use Banner Tables to Present Survey Results</title>
		<link>https://measuringu.com/how-to-use-banner-tables-to-present-survey-results/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=how-to-use-banner-tables-to-present-survey-results</link>
		
		<dc:creator><![CDATA[Jim Lewis, PhD and Jeff Sauro, PhD]]></dc:creator>
		<pubDate>Wed, 25 Mar 2026 00:13:04 +0000</pubDate>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Survey]]></category>
		<category><![CDATA[UX]]></category>
		<category><![CDATA[banner table]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[table]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47120</guid>

					<description><![CDATA[Surveys are a common way to measure attitudes, behaviors, and intentions related to products and services. But large surveys can include dozens of questions and multiple demographic segments, which can mean hundreds of potential comparisons. How do you present all those results in a way stakeholders can quickly scan? You can use a slide deck [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47161 size-medium" src="https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-300x169.png" alt="Feature image showing a researcher using banner table to present survey results" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-FeatureImage-1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>Surveys are a common way to measure attitudes, behaviors, and intentions related to products and services.</p>
<p>But large surveys can include dozens of questions and multiple demographic segments, which can mean hundreds of potential comparisons. How do you present all those results in a way stakeholders can quickly scan?</p>
<p>You can use a slide deck with charts for every question and segment, but that can easily lead to dozens of slides.</p>
<p>Another option is a <em>banner table</em>. While it sounds like something you might see at a trade show, a banner table is a compact way to display many cross-tabulated survey results in a single view.</p>
<p>Banner tables are widely used in market research, but they are less commonly seen in UX research. In a <a href="https://measuringu.com/what-are-ux-deliverables/">previous article</a>, we listed 18 UX research deliverables classified as interim, final, and artifacts; the banner table was one of the least familiar.</p>
<p>When used appropriately, banner tables provide an efficient way to summarize survey results across multiple segments.</p>
<p>Table 1 shows an example banner table that breaks out brand attitude and reluctance to share political content by gender within two social media platforms.</p>

<table id="tablepress-1024" class="tablepress tablepress-id-1024">
<thead>
<tr class="row-1">
	<td class="column-1"></td><td class="column-2"></td><th colspan="2" class="column-3"><center><strong>Facebook</strong></center></th><th colspan="2" class="column-5"><center><strong>TikTok</strong></center></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1"><strong>Metric</strong></td><td class="column-2"><strong>Total</strong></td><td class="column-3"><strong>Female</strong></td><td class="column-4"><strong>Male</strong></td><td class="column-5"><strong>Female</strong></td><td class="column-6"><strong>Male</strong></td>
</tr>
<tr class="row-3">
	<td class="column-1">Brand attitude (Top-2 Box %)</td><td class="column-2">30%</td><td class="column-3">28%</td><td class="column-4"> 7%</td><td class="column-5">50%</td><td class="column-6">35%</td>
</tr>
<tr class="row-4">
	<td class="column-1">Reluctance to share political content (Bottom-2 Box %)</td><td class="column-2">70%</td><td class="column-3">58%</td><td class="column-4">73%</td><td class="column-5">72%</td><td class="column-6">93%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Sample size (<i>n</i>)</td><td class="column-2">123</td><td class="column-3">49</td><td class="column-4">25</td><td class="column-5">29</td><td class="column-6">20</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 1:</strong> Example of a banner table.</p>
<p>In this article, we provide more detail about the why and how of banner tables, plus we display an example created with an R script.</p>
<p>Before diving into how banner tables work, it helps to understand where they came from and why they became a standard deliverable in market (and UX) research.</p>
<h2><span lang="EN-US">Banner Tables: Common in Market Research, Less Known in UX Research</span></h2>
<p>For large-scale surveys with multiple segments, a banner table works well when you need to present results cross-tabulated by key segments (e.g., demographics, personas, behaviors) so that group differences are easy to scan (Figure 1).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47163" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-300x20.png" alt="High-level view of a banner table." width="1200" height="81" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-300x20.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-1024x69.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-768x52.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-1536x104.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1-600x41.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-1.png 2005w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> High-level view of a banner table.</p>
<p>Later in this article, we’ll zoom in on the different parts of this table and dig into its details. What you can see from the high-level view in Figure 1 is a set of metrics in the first column followed by a Totals column. The empty green columns separate crosstabs of the metrics with, in order, six social media platforms, three gender designations, six age groups, and six income levels. When presented as a spreadsheet, it’s common to freeze the top row and the first one or two columns to support easily browsing the contents.</p>
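<p>To make the layout concrete, here is a minimal Python sketch of how one row of a banner table can be assembled from respondent-level records: a Total cell followed by one cell per level of each segment variable. The function names, field names, and the six records are hypothetical and for illustration only; the article's actual table was produced with an R script.</p>

```python
from collections import defaultdict


def _pct(flags):
    """Percentage of 1s in a list of 0/1 flags, rounded to one decimal."""
    return round(100.0 * sum(flags) / len(flags), 1)


def banner_row(records, metric, banner_vars):
    """Build one row: Total plus one cell per level of each banner variable."""
    row = {"Total": _pct([r[metric] for r in records])}
    for var in banner_vars:
        groups = defaultdict(list)
        for r in records:
            groups[r[var]].append(r[metric])
        for level in sorted(groups):
            row[f"{var}: {level}"] = _pct(groups[level])
    return row


# Hypothetical respondents; brand_t2b is 1 when the rating is in the top-two box.
records = [
    {"Platform": "Facebook", "Gender": "Female", "brand_t2b": 1},
    {"Platform": "Facebook", "Gender": "Male", "brand_t2b": 0},
    {"Platform": "TikTok", "Gender": "Female", "brand_t2b": 1},
    {"Platform": "TikTok", "Gender": "Male", "brand_t2b": 1},
    {"Platform": "TikTok", "Gender": "Female", "brand_t2b": 0},
    {"Platform": "Facebook", "Gender": "Female", "brand_t2b": 0},
]

if __name__ == "__main__":
    for label, value in banner_row(records, "brand_t2b", ["Platform", "Gender"]).items():
        print(f"{label}: {value}%")
```

<p>Each additional metric adds another row with the same columns, which is how a single wide table ends up holding many crosstabs.</p>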
<h3><span lang="EN-US">Banner Tables Can Be Traced Back to U.S. Census Practices in 1949</span></h3>
<p>There’s no clear historical record of when the first banner table was published, but it likely coincided with the emergence of large-scale surveys in the mid-20<sup>th</sup> century. In banner tables, the rows are sometimes called the <em>stub</em> and the columns the <em>banner</em>, and older names for banner tables include stub-and-banner tables or stub-and-boxhead (as in the 1949 U.S. Census Bureau publication, <a href="https://www2.census.gov/library/publications/1949/general/tabular-presentation.pdf"><em>Bureau of the Census Manual of Tabular Presentation</em></a>). Regardless of the nomenclature, the key to its success is compressing a large number of crosstabs into one wide table.</p>
<h3><span lang="EN-US">Banner Tables Are Widely Used in Market Research</span></h3>
<p>Often considered a core piece of survey reporting for market research projects, a “banner run” shows every key question broken out by key segments (e.g., demographics, usage, brand, region). This is a common practice because the sample sizes in market research are often large enough to support a large number of data splits, the format is standardized and repeatable, and it serves the needs of stakeholders who want the same results sliced in different ways.</p>
<h3><span lang="EN-US">Banner Tables Are Less Common in UX Research but Have Their Place</span></h3>
<p>It’s possible for a UX researcher to spend decades in the field and never be asked to produce a banner table (we know this from personal experience). Nonetheless, banner tables can play a role when the research methodology is a large-scale survey (especially when focused on segmentation analysis). Even then, in UX research, banner tables will usually be more of a <a href="https://measuringu.com/what-are-ux-deliverables/">supporting deliverable</a> than the key deliverable they often are in market research.</p>
<h3><span lang="EN-US">Banner Tables Provide a Quick Way to Compare Weighted and Unweighted Results</span></h3>
<p>In our previous article on <a href="https://measuringu.com/rake-weighting-how-to-weight-survey-data-with-multiple-variables/">rake weighting</a>, we demonstrated the use of the <a href="https://cran.r-project.org/web/packages/anesrake/index.html">anesrake R package</a> to weight data on multiple demographic variables. The practice of demographic weighting is more common in market research and political polling because they have clearer and more accessible reference populations than is typical in UX research.</p>
<p>If the decision has been made to weight data, a banner table provides a convenient way to check on the effect of weighting on research outcomes.</p>
<p>In practice, market research banner tables usually treat weighted results as the “official” estimates but commonly include unweighted bases and percentages for quality control and transparency. UX research tends to follow that practice when weights are known to be based on a strong reference population; otherwise, unweighted results may take precedence over weighted results when reviewing the banner tables.</p>
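<p>As a rough illustration of what sits behind each pair of weighted and unweighted cells, the Python sketch below computes both bases and both percentages for a single segment. The function name <em>banner_cell</em>, the flags, and the case weights are hypothetical; real weights would come from a procedure such as rake weighting.</p>

```python
def banner_cell(flags, weights):
    """Summarize one banner-table cell.

    flags   -- 1 if the respondent falls in the box of interest, else 0
    weights -- case weight for each respondent (same order as flags)
    """
    n_unweighted = len(flags)
    n_weighted = sum(weights)
    hits_weighted = sum(f * w for f, w in zip(flags, weights))
    return {
        "n_unweighted": n_unweighted,
        "n_weighted": n_weighted,
        "pct_unweighted": 100.0 * sum(flags) / n_unweighted,
        "pct_weighted": 100.0 * hits_weighted / n_weighted,
    }


if __name__ == "__main__":
    # Hypothetical segment of five respondents.
    flags = [1, 0, 0, 1, 0]
    weights = [0.5, 1.2, 1.0, 2.0, 0.3]
    cell = banner_cell(flags, weights)
    print(f"Unweighted: {cell['pct_unweighted']:.1f}% (n = {cell['n_unweighted']})")
    print(f"Weighted:   {cell['pct_weighted']:.1f}% (n = {cell['n_weighted']:.1f})")
```

<p>Reporting both bases side by side makes it easy to spot segments where a few heavily weighted respondents drive the weighted estimate.</p>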
<h2><span lang="EN-US">Banner Table Example</span></h2>
<p>For this example, we return to the data we used in our article on rake weighting so we can produce banner tables with both unweighted and weighted outcomes (for R scripting details, see the appendix).</p>
<h3><span lang="EN-US">Social Media: Weighting Brand Attitude and Reluctance to Engage in Political Discourse</span></h3>
<p>The data for this example came from our <a href="https://measuringu.com/the-ux-of-social-media-in-2024/">2024 SUPR-Q survey</a> of social media platforms. We recruited 324 participants in August 2024 to reflect on their most recent experience with one of six social media platforms: Facebook, Instagram, LinkedIn, Snapchat, TikTok, and X. We were interested in a wide range of UX topics (e.g., overall quality of experience, levels of trust, impact on mood and self-esteem). For the rake weighting article, our examples focused on measuring brand attitude and reluctance to engage in political discourse on the platforms.</p>
<p>We used demographic distributions of the adult U.S. population (18 years of age and older) as the reference population for <a href="https://news.gallup.com/poll/656708/lgbtq-identification-rises.aspx">gender</a>, <a href="https://www2.census.gov/library/publications/decennial/2020/census-briefs/c2020br-06.pdf">age</a>, and <a href="https://www.census.gov/library/publications/2025/demo/p60-286.html">income</a> because these distributions are commonly used for that purpose in many research contexts. Note that <strong>we do not recommend this as good practice for UX research</strong>, as the entire U.S. population is rarely the target audience for a specific product or service, and demographic variables often have little effect on UX metrics. It did, however, work well in our example as a quick check of the value (or lack of value) of employing this kind of demographic weighting in future SUPR-Q retrospective benchmarks.</p>
<p>Figure 2 shows the first ten rows of the source data with the respondent number, the platform, gender, age group, income range, brand attitude (seven-point scale and top-two box), rating of likelihood to share political content (five-point scale and bottom-two-box score), and the case weight determined by the previous rake weighting exercise. The item stems were, respectively, “Overall, how would you rate your attitude toward &lt;platform&gt;?” and “How likely are you to share political content on &lt;platform&gt;?”</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47154" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2.png" alt="Portion of source data with weights from previous rake weighting exercise." width="1200" height="287" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2.png 1200w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2-300x72.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2-1024x245.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2-768x184.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-2-600x144.png 600w" sizes="auto, (max-width: 1200px) 100vw, 1200px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Portion of source data with weights from previous rake weighting exercise.</p>
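<p>The top-two box and bottom-two box columns in Figure 2 are simple recodes of the raw ratings. A minimal Python sketch of that recoding follows; it assumes both scales start at 1 (an assumption, since the article does not state the scale anchors), and the function names and ratings are hypothetical.</p>

```python
def top_two_box(rating, scale_max):
    """Return 1 if the rating is one of the two highest scale points, else 0."""
    return 1 if rating >= scale_max - 1 else 0


def bottom_two_box(rating, scale_min=1):
    """Return 1 if the rating is one of the two lowest scale points, else 0."""
    return 1 if rating <= scale_min + 1 else 0


def box_percentage(flags):
    """Percentage of respondents flagged 1."""
    return 100.0 * sum(flags) / len(flags)


if __name__ == "__main__":
    # Hypothetical ratings from eight respondents.
    brand_attitude = [7, 6, 4, 5, 2, 6, 3, 1]     # seven-point scale
    political_sharing = [1, 2, 1, 4, 2, 1, 5, 3]  # five-point scale

    t2b = [top_two_box(r, scale_max=7) for r in brand_attitude]
    b2b = [bottom_two_box(r) for r in political_sharing]
    print(f"Brand attitude top-two box: {box_percentage(t2b):.1f}%")
    print(f"Political sharing bottom-two box: {box_percentage(b2b):.1f}%")
```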
<h3><span lang="EN-US">Banner Table Results</span></h3>
<p>To produce the banner table, we used three R packages:</p>
<ul>
<li><a href="https://www.r-bloggers.com/2024/11/creating-professional-excel-reports-with-r-a-comprehensive-guide-to-openxlsx-package/">openxlsx</a>: Get data from an Excel file and produce formatted results</li>
<li><a href="https://cengel.github.io/R-data-wrangling/dplyr.html">dplyr</a>: Manipulation of data frames</li>
<li><a href="https://tidyr.tidyverse.org/reference/tidyr-package.html">tidyr</a>: Simplified creation of tidy data formats</li>
</ul>
<p>The complete R script for creating this banner table is in the appendix.</p>
<p>Figures 3 through 6 show each of the crosstabs in the table for the overall effects of Platform, Gender, Age Group, and Income. Because the brand attitude metric in the table is a top-two box score, larger percentages reflect a more favorable attitude. In contrast, because the item measuring likelihood to engage in political discourse is a bottom-two box score (the top boxes were too sparse to provide a meaningful signal), larger percentages indicate greater reluctance to engage.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47155" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-300x49.png" alt="Effect of platform (TikTok had highest brand satisfaction; LinkedIn highest reluctance to engage in political discourse)." width="1170" height="191" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-300x49.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-1024x167.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-768x125.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-1536x251.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3-600x98.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-3.png 1776w" sizes="auto, (max-width: 1170px) 100vw, 1170px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 3:</strong> Effect of platform (TikTok had the highest brand attitude; LinkedIn had the highest reluctance to engage in political discourse).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47156" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4-300x70.png" alt="Effect of gender (female users had substantially higher brand attitudes than male users; nonbinary users least likely to engage in political discourse). " width="813" height="191" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4-300x70.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4-1024x241.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4-768x180.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4-600x141.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-4.png 1230w" sizes="auto, (max-width: 813px) 100vw, 813px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 4:</strong> Effect of gender (female users had substantially higher brand attitudes than male users; nonbinary users were the least likely to engage in political discourse).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47157" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-300x52.png" alt="Effect of age (50-59 had higher brand attitude; 18-24 least likely to engage in political discourse). " width="1092" height="191" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-300x52.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-1024x179.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-768x134.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-1536x269.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5-600x105.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-5.png 1647w" sizes="auto, (max-width: 1092px) 100vw, 1092px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 5:</strong> Effect of age (50–59 had the highest brand attitude; 18–24 were the least likely to engage in political discourse).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6.png"><img loading="lazy" decoding="async" class="alignnone wp-image-47158" src="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-300x48.png" alt="Effect of income ($25k-$49k had highest brand attitude; $200k+ were least likely to engage in political discourse). " width="1201" height="191" srcset="https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-300x48.png 300w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-1024x163.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-768x122.png 768w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-1536x244.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6-600x95.png 600w, https://measuringu.com/wp-content/uploads/2026/03/032426-Figure-6.png 1811w" sizes="auto, (max-width: 1201px) 100vw, 1201px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 6:</strong> Effect of income ($25k–$49k had the highest brand attitude; $200k+ were the least likely to engage in political discourse).</p>
<h3><span lang="EN-US">Try It!</span></h3>
<p>Table 2 is a <a href="https://tablepress.org/">TablePress</a> version of the banner table with the top row and the left two columns frozen.</p>
<p>To use the table, click on any row below the header, then use the slider or arrow keys to scroll horizontally.</p>
<p>To switch between the brand attitude and political discourse rows, toggle the 1 and 2 below the right corner of the table. To resume horizontal scrolling, click on any row below the header.</p>

<table id="tablepress-1022" class="tablepress tablepress-id-1022">
<thead>
<tr class="row-1">
	<th class="column-1"><strong>Metric</strong></th><th class="column-2"><strong>Total</strong></th><th class="column-3"><strong> </strong></th><th class="column-4"><strong>Platform: Facebook</strong></th><th class="column-5"><strong>Platform: Instagram</strong></th><th class="column-6"><strong>Platform: LinkedIn</strong></th><th class="column-7"><strong>Platform: Snapchat</strong></th><th class="column-8"><strong>Platform: TikTok</strong></th><th class="column-9"><strong>Platform: X</strong></th><th class="column-10"><strong> </strong></th><th class="column-11"><strong>Gender: Female</strong></th><th class="column-12"><strong>Gender: Male</strong></th><th class="column-13"><strong>Gender: Nonbinary</strong></th><th class="column-14"><strong> </strong></th><th class="column-15"><strong>Age: 18-24</strong></th><th class="column-16"><strong>Age: 25-29</strong></th><th class="column-17"><strong>Age: 30-39</strong></th><th class="column-18"><strong>Age: 40-49</strong></th><th class="column-19"><strong>Age: 50-59</strong></th><th class="column-20"><strong>Age: 60-69</strong></th><th class="column-21"><strong> </strong></th><th class="column-22"><strong>Income: $0-$24k</strong></th><th class="column-23"><strong>Income: $25k-$49k</strong></th><th class="column-24"><strong>Income: $50k-$99k</strong></th><th class="column-25"><strong>Income: $100k-$149k</strong></th><th class="column-26"><strong>Income: $150k-$199k</strong></th><th class="column-27"><strong>Income: $200k+</strong></th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1">Brand attitude (Top-2 Box): Unweighted n</td><td class="column-2">324</td><td class="column-3"></td><td class="column-4">53</td><td class="column-5">56</td><td class="column-6">52</td><td class="column-7">50</td><td class="column-8">57</td><td class="column-9">56</td><td class="column-10"></td><td class="column-11">198</td><td class="column-12">114</td><td class="column-13">12</td><td class="column-14"></td><td class="column-15">68</td><td class="column-16">64</td><td class="column-17">102</td><td class="column-18">51</td><td class="column-19">31</td><td class="column-20">8</td><td class="column-21"></td><td class="column-22">50</td><td class="column-23">70</td><td class="column-24">107</td><td class="column-25">56</td><td class="column-26">25</td><td class="column-27">16</td>
</tr>
<tr class="row-3">
	<td class="column-1">Brand attitude (Top-2 Box): Weighted n</td><td class="column-2">324.0</td><td class="column-3"></td><td class="column-4">74.1</td><td class="column-5">52.6</td><td class="column-6">49.4</td><td class="column-7">39.4</td><td class="column-8">50.4</td><td class="column-9">58.0</td><td class="column-10"></td><td class="column-11">163.6</td><td class="column-12">157.1</td><td class="column-13">3.2</td><td class="column-14"></td><td class="column-15">51.7</td><td class="column-16">34.3</td><td class="column-17">69.1</td><td class="column-18">62.6</td><td class="column-19">66.3</td><td class="column-20">40.0</td><td class="column-21"></td><td class="column-22">56.8</td><td class="column-23">70.8</td><td class="column-24">91.2</td><td class="column-25">50.6</td><td class="column-26">22.3</td><td class="column-27">32.3</td>
</tr>
<tr class="row-4">
	<td class="column-1">Brand attitude (Top-2 Box): % (unweighted)</td><td class="column-2">27.2%</td><td class="column-3"></td><td class="column-4">24.5%</td><td class="column-5">25.0%</td><td class="column-6">23.1%</td><td class="column-7">26.0%</td><td class="column-8">50.9%</td><td class="column-9">12.5%</td><td class="column-10"></td><td class="column-11">35.4%</td><td class="column-12">15.8%</td><td class="column-13">0.0%</td><td class="column-14"></td><td class="column-15">22.1%</td><td class="column-16">31.2%</td><td class="column-17">22.5%</td><td class="column-18">33.3%</td><td class="column-19">35.5%</td><td class="column-20">25.0%</td><td class="column-21"></td><td class="column-22">20.0%</td><td class="column-23">30.0%</td><td class="column-24">29.9%</td><td class="column-25">28.6%</td><td class="column-26">24.0%</td><td class="column-27">18.8%</td>
</tr>
<tr class="row-5">
	<td class="column-1">Brand attitude (Top-2 Box): % (weighted)</td><td class="column-2">24.6%</td><td class="column-3"></td><td class="column-4">21.0%</td><td class="column-5">27.0%</td><td class="column-6">31.4%</td><td class="column-7">21.3%</td><td class="column-8">43.1%</td><td class="column-9">7.3%</td><td class="column-10"></td><td class="column-11">32.4%</td><td class="column-12">17.0%</td><td class="column-13">0.0%</td><td class="column-14"></td><td class="column-15">16.2%</td><td class="column-16">27.8%</td><td class="column-17">18.8%</td><td class="column-18">26.4%</td><td class="column-19">33.6%</td><td class="column-20">25.0%</td><td class="column-21"></td><td class="column-22">21.9%</td><td class="column-23">32.4%</td><td class="column-24">26.0%</td><td class="column-25">28.0%</td><td class="column-26">17.0%</td><td class="column-27">8.0%</td>
</tr>
<tr class="row-6">
	<td class="column-1"></td><td class="column-2"></td><td class="column-3"></td><td class="column-4"></td><td class="column-5"></td><td class="column-6"></td><td class="column-7"></td><td class="column-8"></td><td class="column-9"></td><td class="column-10"></td><td class="column-11"></td><td class="column-12"></td><td class="column-13"></td><td class="column-14"></td><td class="column-15"></td><td class="column-16"></td><td class="column-17"></td><td class="column-18"></td><td class="column-19"></td><td class="column-20"></td><td class="column-21"></td><td class="column-22"></td><td class="column-23"></td><td class="column-24"></td><td class="column-25"></td><td class="column-26"></td><td class="column-27"></td>
</tr>
<tr class="row-7">
	<td class="column-1">Political discourse reluctance (Bottom-2 Box): Unweighted n</td><td class="column-2">324</td><td class="column-3"></td><td class="column-4">53</td><td class="column-5">56</td><td class="column-6">52</td><td class="column-7">50</td><td class="column-8">57</td><td class="column-9">56</td><td class="column-10"></td><td class="column-11">198</td><td class="column-12">114</td><td class="column-13">12</td><td class="column-14"></td><td class="column-15">68</td><td class="column-16">64</td><td class="column-17">102</td><td class="column-18">51</td><td class="column-19">31</td><td class="column-20">8</td><td class="column-21"></td><td class="column-22">50</td><td class="column-23">70</td><td class="column-24">107</td><td class="column-25">56</td><td class="column-26">25</td><td class="column-27">16</td>
</tr>
<tr class="row-8">
	<td class="column-1">Political discourse reluctance (Bottom-2 Box): Weighted n</td><td class="column-2">324.0</td><td class="column-3"></td><td class="column-4">74.1</td><td class="column-5">52.6</td><td class="column-6">49.4</td><td class="column-7">39.4</td><td class="column-8">50.4</td><td class="column-9">58.0</td><td class="column-10"></td><td class="column-11">163.6</td><td class="column-12">157.1</td><td class="column-13">3.2</td><td class="column-14"></td><td class="column-15">51.7</td><td class="column-16">34.3</td><td class="column-17">69.1</td><td class="column-18">62.6</td><td class="column-19">66.3</td><td class="column-20">40.0</td><td class="column-21"></td><td class="column-22">56.8</td><td class="column-23">70.8</td><td class="column-24">91.2</td><td class="column-25">50.6</td><td class="column-26">22.3</td><td class="column-27">32.3</td>
</tr>
<tr class="row-9">
	<td class="column-1">Political discourse reluctance (Bottom-2 Box): % (unweighted)</td><td class="column-2">77.5%</td><td class="column-3"></td><td class="column-4">71.7%</td><td class="column-5">73.2%</td><td class="column-6">94.2%</td><td class="column-7">84.0%</td><td class="column-8">75.4%</td><td class="column-9">67.9%</td><td class="column-10"></td><td class="column-11">75.8%</td><td class="column-12">78.9%</td><td class="column-13">91.7%</td><td class="column-14"></td><td class="column-15">80.9%</td><td class="column-16">78.1%</td><td class="column-17">78.4%</td><td class="column-18">74.5%</td><td class="column-19">74.2%</td><td class="column-20">62.5%</td><td class="column-21"></td><td class="column-22">76.0%</td><td class="column-23">77.1%</td><td class="column-24">78.5%</td><td class="column-25">78.6%</td><td class="column-26">68.0%</td><td class="column-27">87.5%</td>
</tr>
<tr class="row-10">
	<td class="column-1">Political discourse reluctance (Bottom-2 Box): % (weighted)</td><td class="column-2">76.0%</td><td class="column-3"></td><td class="column-4">63.2%</td><td class="column-5">70.8%</td><td class="column-6">92.8%</td><td class="column-7">85.1%</td><td class="column-8">80.2%</td><td class="column-9">73.0%</td><td class="column-10"></td><td class="column-11">71.6%</td><td class="column-12">80.3%</td><td class="column-13">90.7%</td><td class="column-14"></td><td class="column-15">80.7%</td><td class="column-16">76.8%</td><td class="column-17">80.5%</td><td class="column-18">79.1%</td><td class="column-19">72.4%</td><td class="column-20">62.5%</td><td class="column-21"></td><td class="column-22">63.8%</td><td class="column-23">79.2%</td><td class="column-24">74.2%</td><td class="column-25">83.0%</td><td class="column-26">73.5%</td><td class="column-27">86.1%</td>
</tr>
</tbody>
</table>
<p class="wp-caption-text" style="text-align: left;"><strong>Table 2: </strong>Working version of the banner table.</p>
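<p>The percentage rows in Table 2 come in unweighted and weighted versions. The sketch below is a minimal Python illustration (not the R script linked in the Appendix) of how a bottom-2-box percentage could be computed both ways. The ratings, the weights, the function name, and the assumption that the two lowest scale points (1 and 2) define the box are hypothetical and are not taken from the study data.</p>

```python
def box_percent(ratings, weights=None, box=(1, 2)):
    """Return the percentage of (optionally weighted) respondents whose rating is in `box`."""
    if weights is None:
        # Unweighted case: every respondent counts once.
        weights = [1.0] * len(ratings)
    if len(ratings) != len(weights):
        raise ValueError("ratings and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("total weight must be positive")
    # Sum the weights of respondents whose rating falls in the box.
    in_box = sum(w for r, w in zip(ratings, weights) if r in box)
    return 100.0 * in_box / total


# Hypothetical ratings and hypothetical post-stratification weights.
ratings = [1, 2, 5, 1, 4, 2, 3, 1]
weights = [2.0, 0.5, 1.0, 2.0, 0.5, 0.5, 1.0, 0.5]

unweighted = box_percent(ratings)
weighted = box_percent(ratings, weights)
print(f"Unweighted: {unweighted:.1f}%  Weighted: {weighted:.1f}%")
```

<p>With equal weights the two percentages match; they diverge only when respondents inside the box carry more or less weight than the rest of the sample.</p>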
<h2><span lang="EN-US">Summary</span></h2>
<p>In this article, we went through the why and how of banner tables, ending with an example created with an R script from data collected in a retrospective benchmark study of attitudes toward social media platforms. We discussed that banner tables:</p>
<p><strong>Compress many crosstabs into one viewable table.</strong><br />
The compression of many crosstabs into a single banner table allows stakeholders to quickly scan results without having to flip between multiple slides.</p>
<p><strong>Are created with common analysis tools like R and AI.</strong><br />
Numerous software tools can create banner tables; in our example, we used R packages to generate the table. You can also easily have AI create these for you.</p>
<p><strong>Are ideal for large surveys with segmentation.</strong><br />
Use banner tables to summarize survey results (especially with large sample sizes) when comparing metrics across multiple segments.</p>
<p><strong>Are common in market research but useful in UX.</strong><br />
Banner tables are widely used in market research and, while less frequently requested in UX research, can be the right deliverable when you want to convey cross-tab metrics compactly.</p>
<h2><span lang="EN-US">Appendix</span></h2>
<p>Use the link below to download a PDF of the R script. It’s specific to this example but could certainly be edited for use with other data. The script is formatted so you can select all, copy, modify, and then paste the code into R or RStudio.</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/rscriptForBannerTables.pdf">Click here for the R script</a></p>
<p><strong>AI note:</strong> We used ChatGPT 5.2 to create and iterate the R script until it worked as desired (which took about six hours, including debugging some weird roundoff errors). For the final table, we did a little additional formatting by hand (e.g., making the empty columns smaller with light green fill, freezing the top row and the left two columns for easier browsing of the crosstab sections).</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Assistant, Analyst, and User: How We’re Examining AI in UX</title>
		<link>https://measuringu.com/ai-as-uxr-assistant-user-and-analyst/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=ai-as-uxr-assistant-user-and-analyst</link>
		
		<dc:creator><![CDATA[Jeff Sauro, PhD&nbsp;•&nbsp;Jim Lewis, PhD]]></dc:creator>
		<pubDate>Wed, 18 Mar 2026 00:48:07 +0000</pubDate>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[artificial intelligence]]></category>
		<guid isPermaLink="false">https://measuringu.com/?p=47094</guid>

					<description><![CDATA[It seems like AI is almost everywhere. For many people, it is. From the moment we wake up, AI increasingly shapes our daily experiences. Music playlists are generated automatically. Our computers prompt us to use AI assistants. Internet searches are now often preceded by AI-generated summaries. Call a doctor’s office after hours, and an AI [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><a href="https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1.png"><img loading="lazy" decoding="async" class="alignleft wp-image-47108 size-medium" src="https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-300x169.png" alt="Header image showing a person communicating with 3 AI robots" width="300" height="169" srcset="https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-300x169.png 300w, https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-1024x576.png 1024w, https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-768x432.png 768w, https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-1536x864.png 1536w, https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1-600x338.png 600w, https://measuringu.com/wp-content/uploads/2026/03/031726-FeatureImage-1.png 2000w" sizes="auto, (max-width: 300px) 100vw, 300px" /></a>It seems like AI is almost everywhere. For many people, it is.</p>
<p>From the moment we wake up, AI increasingly shapes our daily experiences. Music playlists are generated automatically. Our computers prompt us to use AI assistants. Internet searches are now often preceded by AI-generated summaries.</p>
<p>Call a doctor’s office after hours, and an AI voice assistant may help schedule your appointment. Chat with customer support, and you’ll likely interact with a chatbot before reaching a human. Write an email, and AI offers suggestions. Start a meeting, and AI software generates notes and summaries. Need an image to make a point? Use AI to generate one from a textual description (e.g., Figure 1).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03172026.jpg"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47095" src="https://measuringu.com/wp-content/uploads/2026/03/03172026.jpg" alt="Image showing the ubiquity of AI." width="881" height="588" srcset="https://measuringu.com/wp-content/uploads/2026/03/03172026.jpg 881w, https://measuringu.com/wp-content/uploads/2026/03/03172026-300x200.jpg 300w, https://measuringu.com/wp-content/uploads/2026/03/03172026-768x513.jpg 768w, https://measuringu.com/wp-content/uploads/2026/03/03172026-600x400.jpg 600w" sizes="auto, (max-width: 881px) 100vw, 881px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 1:</strong> The ubiquity of AI.</p>
<p>And of course, AI’s influence affects what we do in UX research.</p>
<p>But is AI helping? Is it making us more efficient, more accurate? Or is it actually just making us <a href="https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies-it">work more intensely</a>?</p>
<p>Of course, there are voices who overhype its efficacy in UX Research and Design. There are also voices who dismiss it as a fad. Increasingly, the latter is becoming a less tenable position.</p>
<p>We’re more pragmatic at MeasuringU and have an aversion to extreme attitudes. The <a href="https://en.wikipedia.org/wiki/Golden_mean_(philosophy)">Aristotelian golden mean</a> between extremes is part of our company DNA.</p>
<p>We lean into empiricism and judge the efficacy of claims using data. We also critically evaluate the quality of the evidence. An anecdote about improved productivity from a software company is not the same as a controlled study.</p>
<p>As is often the case with fast-changing technology, there’s a dearth of high-quality studies that allow us to separate the hype from the hypothesis testing. We’re actively conducting studies and literature reviews to quantify the extent to which different applications of AI to UX research are useful.</p>
<p>A good way to assess the evidence and group our research is to think about AI’s impact in UX research in three categories: AI as Research Assistant, AI as (Synthetic) User, and AI as Researcher.</p>
<h2><span lang="EN-US">AI as Research Assistant</span></h2>
<p>Let’s start with something less controversial and rather commonplace. That is, researchers using AI tools to assist (usually expedite) research.</p>
<p>Many UX research teams use AI for the following tasks, and the AI assistants appear to be well-received by researchers to either increase research speed or improve research quality. Questions remain about measurable quality criteria, failure modes, and the role of the human in the loop.</p>
<ul>
<li>Coding comments into categories</li>
<li>Cleaning data</li>
<li>Translation and localization</li>
<li>Analyzing interviews to find themes</li>
<li>Developing insights from transcripts</li>
<li>Building and modifying participant screeners</li>
<li>Writing and editing survey questions</li>
<li>Detecting bias and other quality issues in questions</li>
<li>Identifying categories from card sort results</li>
<li>Developing and editing task scenarios</li>
<li>Developing and editing test plans</li>
</ul>
<p>There’s more to do, but we’ve already made some progress investigating the role of AI as a research assistant in comment classification.</p>
<h3><span lang="EN-US">AI and Human Classification of Comments</span></h3>
<p>One of the first analyses we conducted on using AI to code comments was promising. We used three runs of <a href="https://measuringu.com/classification-agreement-between-ux-researchers-and-chatgpt/">ChatGPT-4 to classify comments</a> in UX research and compared its results (in 2023!) to three human coders. We found only slightly lower interrater reliabilities between human coders and ChatGPT than between human coders alone, with three caveats:</p>
<ul>
<li>Human coders were more likely to assign single comments to their own themes.</li>
<li>Different prompts had different levels of effectiveness (prompt specificity matters).</li>
<li>AI outputs from the same prompt were broadly similar but still varied substantially from run to run, making it necessary to run AI analyses multiple times.</li>
</ul>
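<p>Interrater reliability is often summarized with a chance-corrected agreement statistic such as Cohen’s kappa. The following is a minimal Python sketch of that general technique for two coders; it is not a reproduction of the analysis in the linked study, and the theme labels, the data, and the function name are hypothetical.</p>

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who each assigned one category per comment."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of comments")
    n = len(coder_a)
    # Observed agreement: share of comments given the same category by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement from each coder's marginal category frequencies.
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used the same single category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical theme assignments for ten comments.
human = ["price", "ease", "ease", "bugs", "price", "ease", "bugs", "ease", "price", "bugs"]
ai_run = ["price", "ease", "bugs", "bugs", "price", "ease", "bugs", "ease", "ease", "bugs"]

kappa = cohens_kappa(human, ai_run)
print(f"kappa = {kappa:.2f}")  # kappa = 0.70
```

<p>Because AI output can vary between runs of the same prompt, one option is to compute kappa separately for each run against the same human coder and report the range rather than a single value.</p>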
<p>We plan to investigate how well newer AI products perform this task.</p>
<h2><span lang="EN-US">AI as Synthetic User: Synthetic Attitudes vs. Synthetic Actions</span></h2>
<p>Now we move into a category that gets <a href="https://quantuxblog.com/synthetic-survey-data-its-not-data">a lot of people fired up</a>, and for good reason. Any time you take the user out of UX, it becomes objectionable as a matter of principle. But again, we try to be open-minded. After all, inspection methods like <a href="https://measuringu.com/effective-he/">heuristic evaluation</a>, <a href="https://measuringu.com/ux-metrics-pure/">PURE</a>, and <a href="https://measuringu.com/inspection-methods/">guideline reviews</a> are part of the UX research toolbox even though users aren’t directly involved.</p>
<p>We see an important distinction between synthetic user attitudes and synthetic user behaviors, both of which have yet to be fully explored.</p>
<ul>
<li>Synthetic survey respondents (attitudes and reported behaviors): AI-generated responses to rating scales that measure things like satisfaction, intention, and usability, and to questions about behaviors like product ownership and usage</li>
<li>Synthetic users of task-based studies (behaviors): AI-generated responses to task-based scenarios used in usability testing</li>
<li>Synthetic users of information architecture tasks (tree tests, card sorts)</li>
</ul>
<p>We have not conducted studies that use data from individually crafted synthetic users, but we have experimented with comparing AI predictions of user behaviors and attitudes for card sorting and tree testing, with mixed success.</p>
<h3><span lang="EN-US">AI and Human Analysis of Card Sorting Results</span></h3>
<p>AI’s ability <a href="https://measuringu.com/comparing-chatgpt-to-card-sorting-results/">to sort items into groups</a>, as in a card sort, was actually reasonably good. When we used ChatGPT-4 to group and name items and compared the result with the groups synthesized by human researchers from a standard open card sort, we found strong similarity in the number and names of categories. Items matched most of the time, the interrater reliability between the two methods was moderate to substantial, and there weren’t any obviously bad ChatGPT placements.</p>
<h3><span lang="EN-US">AI and Human Tree Testing Results</span></h3>
<p>Our tree testing results were more mixed. Based on data collected with <a href="https://measuringu.com/chatgpt4-tree-test/">multiple iterations of ChatGPT-4 and 33 participants</a> finding the location of target items in a tree structure based on the IRS website, and using the SEQ to assess perceived task difficulty, we found that ChatGPT performed too well and <strong>was not suitable</strong> for estimating how well humans will find items in a tree test. However, ChatGPT predicted people’s ease ratings of the search tasks <strong>with reasonable accuracy</strong>.</p>
<h2><span lang="EN-US">AI as Researcher</span></h2>
<p>These are more advanced tasks where AI might be able to take a more central role in analysis, but it isn’t clear how AI output compares to human output regarding the amount of time saved in the process (if any) and accuracy. Two ways in which AI might replace researchers are as analysts and moderators.</p>
<h3><span lang="EN-US">AI as Analyst</span></h3>
<p>A lot of human data analysis is repetitive, making it attractive for replacement with AI (Figure 2).</p>
<p><a href="https://measuringu.com/wp-content/uploads/2026/03/03172026-F2.jpg"><img loading="lazy" decoding="async" class="alignnone size-full wp-image-47096" src="https://measuringu.com/wp-content/uploads/2026/03/03172026-F2.jpg" alt="Cartoon showing a robot applying for a mindless and repetitive job." width="1043" height="561" srcset="https://measuringu.com/wp-content/uploads/2026/03/03172026-F2.jpg 1043w, https://measuringu.com/wp-content/uploads/2026/03/03172026-F2-300x161.jpg 300w, https://measuringu.com/wp-content/uploads/2026/03/03172026-F2-1024x551.jpg 1024w, https://measuringu.com/wp-content/uploads/2026/03/03172026-F2-768x413.jpg 768w, https://measuringu.com/wp-content/uploads/2026/03/03172026-F2-600x323.jpg 600w" sizes="auto, (max-width: 1043px) 100vw, 1043px" /></a></p>
<p class="wp-caption-text" style="text-align: left;"><strong>Figure 2:</strong> Robot applying for a job.</p>
<p>Other human data analysis is less repetitive and more dependent on contextual knowledge and human judgment (e.g., identification of usability problems). Some of the opportunities we envision for AI as an analyst (but which need development and validation) are:</p>
<ul>
<li>Validating screenshots to determine task success</li>
<li>Identifying usability problems from image analysis</li>
<li>Identifying usability problems from videos</li>
<li>Heuristic evaluation from analysis of videos, images, and websites</li>
<li>Advanced inspection analyses (<a href="https://measuringu.com/pure/">PURE</a>, <a href="https://measuringu.com/predicted-times/">KLM/GOMS</a>)</li>
<li>Analyzing datasets</li>
</ul>
<h3><span lang="EN-US">AI as Moderator</span></h3>
<p>Research moderation seems like a quintessentially human activity. However, advances in AI avatars, LLM dialog management, and synthetic speech production have led to the development of AI agents that could be applied to a <a href="https://www.nngroup.com/articles/ai-interviewers/">variety of moderation tasks</a>. Research in this area should focus on understanding when it works, when it fails, and how to validate quality.</p>
<ul>
<li>Simple interviews</li>
<li>Complex interviews</li>
<li>Moderated usability tests</li>
</ul>
<h2><span lang="EN-US">AI Adoption and Attitudes</span></h2>
<p>We have conducted studies of attitudes toward AI usage by the general public and by UX researchers, and we plan to conduct follow-up studies.</p>
<p>We’ve already published research on attitudes of UX researchers regarding the use of AI in UX (in association with UXPA) and attitudes of a general population of users toward three AI chat products.</p>
<p>To put AI’s potential roles as an assistant, analyst, or synthetic user in UX research in context, it’s useful to understand how widely AI tools are already being used and how people perceive them. Some recent studies provide insight into both adoption and user experience with AI-based systems.</p>
<h3><span lang="EN-US">How Much Is AI Used in UX? </span></h3>
<p><em>More than you might think.</em> While our <a href="https://measuringu.com/how-much-is-ai-used-in-ux/">industry data</a> from 2024 is due for a refresh, we found that about half of UX professionals had used AI (but 20% were not impressed). More companies supported using AI than discouraged it (by about 6 to 1). Most respondents expected to use AI more in 2025, but expectations over the next five years were mixed.</p>
<h3><span lang="EN-US">Retrospective Benchmark of ChatGPT, Claude, and Gemini</span></h3>
<p>In January and February 2025, we conducted a retrospective study on <a href="https://measuringu.com/ai-based-chat-software-ux-2025/">three AI-based chat products</a> (ChatGPT, Claude, Gemini) with 153 U.S.-based panel participants. This study included metrics from our standard UX &amp; NPS survey as part of our larger consumer software data collection effort. All products had high and similar Net Promoter Scores. Reported issues included accuracy, generic content, and limited free versions.</p>
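<p>For readers less familiar with the metric, the Net Promoter Score is conventionally computed from a 0 to 10 likelihood-to-recommend item as the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 through 6). The Python sketch below illustrates that standard calculation; the ratings and the function name are hypothetical and are not data from the benchmark.</p>

```python
def net_promoter_score(ratings):
    """Return NPS (percent promoters minus percent detractors) for 0-10 ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in range(0, 11) for r in ratings):
        raise ValueError("ratings must be integers from 0 to 10")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 through 6
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical likelihood-to-recommend ratings from ten respondents.
ltr = [10, 9, 9, 8, 7, 10, 6, 3, 9, 10]
nps = net_promoter_score(ltr)
print(f"NPS = {nps:.0f}")  # NPS = 40
```

<p>Passives (ratings of 7 or 8) count toward the sample size but toward neither group, which is why the score can range from −100 to 100.</p>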
<h2><span lang="EN-US">Summary</span></h2>
<p>It can be easy to be seduced into extreme views about emerging technologies. They can be cast as the best thing ever or a complete waste of time. Our recommendation is a more pragmatic, empirical approach. Rather than relying on anecdotes or hype, we encourage evaluating the role of AI in UX research with data.</p>
<p>One useful way to think about AI in UX research is to group its applications into three roles:</p>
<ul>
<li><strong>AI as Research Assistant.</strong> Tools that improve the quality and quantity of the work that UX researchers already do, such as coding comments, summarizing interviews, and generating study materials.</li>
</ul>
<ul>
<li><strong>AI as Synthetic User.</strong> Systems that simulate user attitudes or behaviors. An important distinction is between synthetic attitudes and synthetic actions. Our early work suggests some promise in modeling behavior, but much less evidence for synthetic attitudes.</li>
</ul>
<ul>
<li><strong>AI as Research Analyst.</strong> Applications where AI plays a more central role in analysis—identifying usability issues from images or videos, evaluating task success, or even assisting with research moderation.</li>
</ul>
<p>There is still much to learn. In the coming year, we plan to continue studying these areas and revisit both the usage of AI tools and attitudes toward them. Our goal is not to promote or dismiss AI, but to understand, through evidence, where it genuinely improves UX research.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>

