While a lack of recommending seems like it would correlate with actively recommending against, we suspect there may be a difference but are unsure how much.
In our previous article, we looked to the published literature and found little research that assessed how a low likelihood-to-recommend (the Net Promoter Score item) measures active discouragement.
The studies we found looked at positive word of mouth (PWOM) and negative word of mouth (NWOM). Their key findings were:
Despite the bipolar scale’s superior prediction of NWOM, one of the desirable features of the NPS is its public benchmarks. Good benchmarks are not developed overnight, and in most cases, the process takes years (and even then, there is no guarantee of broad adoption).
Rather than replacing the NPS item in UX and CX research, we were curious about the statistical relationship between the standard likelihood-to-recommend item and a separate item designed to measure the likelihood of recommending against—in other words, the likelihood of discouraging friends and colleagues from engaging with a brand or product.
We conduct periodic SUPR-Q® surveys to take the temperature of the user experience of websites and mobile apps for key companies in various sectors. In August 2024, we collected data from 324 participants on their experience with one of the social media platforms they had used in the past year (Facebook, Instagram, LinkedIn, Snapchat, TikTok, or X).
As part of that survey, respondents indicated their likelihood to recommend that platform on the web and/or its mobile app (depending on their past experience) with a standard item format (Figure 1) and their likelihood to discourage others from using the platform in general with a custom item (Figure 2).
We used several methods to gain insight into the relationship between ratings of likelihood-to-recommend and likelihood-to-discourage.
Table 1 shows the correlations among the measurements of likelihood-to-recommend the websites in the social media survey (LTRWeb), likelihood-to-recommend the social media mobile apps (LTRApp), and likelihood-to-discourage others from using the social media platforms (LTDiscourage).
Metric 1 | Metric 2 | Correlation | R² | n
---|---|---|---|---
LTRWeb | LTRApp | 0.70 | 49% | 162 |
LTRWeb | LTDiscourage | −0.52 | 27% | 188 |
LTRApp | LTDiscourage | −0.57 | 32% | 298 |
The coefficient of determination (R²) is the square of the correlation. It can be interpreted in different ways: for example, as the percentage of variance shared between two variables, or as the extent to which variation in one variable accounts for variation in the other.
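For example, the correlation and its square can be computed directly. The ratings below are hypothetical, not our survey data:

```python
import numpy as np

# Hypothetical paired ratings (both on 0-10 scales); the survey data are not public.
ltr_app = np.array([9, 10, 7, 3, 8, 2, 10, 5, 6, 9])      # likelihood to recommend
lt_discourage = np.array([1, 0, 4, 8, 2, 9, 0, 6, 5, 2])  # likelihood to discourage

r = np.corrcoef(ltr_app, lt_discourage)[0, 1]  # Pearson correlation
r_squared = r ** 2                             # coefficient of determination

print(f"r = {r:.2f}, R^2 = {r_squared:.0%}")
```

A negative correlation with a large R² means high recommenders tend to be low discouragers, with R² quantifying how much of the variation in one rating the other accounts for.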
All three correlations were statistically significant (p < .0001). LTRWeb accounted for just over a quarter of the variation in LTDiscourage; LTRApp accounted for just under a third. Although these are very different metrics from the percentage of NWOM, the magnitudes are similar to the 25% of NWOM against currently used products reported in previous research.
Guidelines differ on how high a correlation must be to indicate that two variables are measuring the same thing. Some suggest the appropriate value is ±0.90 (81% shared variance), and others recommend ±0.80 (64% shared variance). To support the claim that two variables aren’t just strongly related but essentially measuring the same thing, the stringent benchmark of ±0.90 or even ±0.95 (90% shared variance) seems reasonable. For example, the ten-item System Usability Scale (SUS) and its single ease of use item correlate at .95, with other research demonstrating that they measure the same underlying construct of perceived ease of use. Although significant, the correlations between LTR and LTDiscourage in Table 1 are far from indicating measurement of the same construct.
Scatterplots are visual representations of correlations. Figures 3 and 4 show the scatterplots between LTR and LTDiscourage for websites and mobile apps. The bounding boxes show the NPS designations of Detractors (LTR ratings from 0 to 6), Passives (LTR ratings from 7 to 8), and Promoters (LTR ratings from 9 to 10).
Figure 3: LTRWeb and LTDiscourage scatterplot (jittered).
The magnitudes of the correlations depicted in Figures 3 and 4 are similar, as are the distributions of points in the scatterplots. The general trend, consistent with negative correlations, is that as likelihood-to-recommend increases, likelihood-to-discourage decreases. But as previously discussed, the correlations are not perfect (or even very high), so the two items are not measuring exactly the same thing.
For Figures 5 and 6, we assigned the two lowest LTDiscourage ratings to one category of extreme intensity (Extremely Low Likelihood-to-Discourage), the two highest ratings to another category of extreme intensity (Extremely High Likelihood-to-Discourage), and the intermediate ratings to a category of moderate intensity (Moderate Likelihood-to-Discourage). We then crossed those categories with the NPS categories of Detractors, Passives, and Promoters.
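This categorization can be sketched as follows. The ratings are hypothetical; the thresholds mirror the standard NPS designations and the intensity buckets described above:

```python
import pandas as pd

# Hypothetical ratings (both on 0-10 scales); the survey data are not public.
df = pd.DataFrame({
    "ltr":        [10, 9, 8, 7, 6, 3, 0, 10, 2, 5],  # likelihood to recommend
    "discourage": [0, 1, 3, 5, 8, 9, 10, 2, 10, 6],  # likelihood to discourage
})

def nps_category(rating):
    # Standard NPS designations: 9-10 Promoter, 7-8 Passive, 0-6 Detractor.
    if rating >= 9:
        return "Promoter"
    if rating >= 7:
        return "Passive"
    return "Detractor"

def discourage_intensity(rating):
    # Two lowest ratings, two highest ratings, everything else moderate.
    if rating <= 1:
        return "Extremely Low"
    if rating >= 9:
        return "Extremely High"
    return "Moderate"

df["nps_cat"] = df["ltr"].apply(nps_category)
df["intensity"] = df["discourage"].apply(discourage_intensity)
print(pd.crosstab(df["nps_cat"], df["intensity"]))
```

The resulting crosstab is the kind of NPS-category-by-intensity breakdown summarized in Figures 5 and 6.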
Again, the results were very similar for web and mobile app ratings. Promoters were much more likely than Detractors (or even Passives) to have an extremely low likelihood-to-discourage. On the other side of the scale, Detractors were responsible for the vast majority (82%) of the extremely high likelihood-to-discourage ratings (and accounted for 64% of the more moderate ratings). This result is consistent with our previous finding that, when given the opportunity, some respondents classified as Detractors did not make negative comments about the brand they rated, but 90% of the negative comments captured in the study were made by Detractors.
An analysis of the likelihood of 324 social media users to recommend and their likelihood to discourage the use of social media platforms found:
Likelihood-to-recommend measures likelihood-to-discourage, but not perfectly. Likelihood-to-recommend accounts for about a quarter to a third of the variation in likelihood-to-discourage. That is significant, but it leaves about two-thirds to three-fourths of variation in likelihood-to-discourage unaccounted for. This suggests that NOT recommending is not a perfect or even a strong substitute for measuring intent to recommend against or discourage others from a brand.
Detractors account for 80%+ of discouragers. Not all Detractors discourage. But almost all those who are extremely likely to discourage are Detractors (very unlikely to recommend). This is similar to our analysis of negative comments where not all Detractors make negative comments, but 90% of negative comments come from Detractors.
NPS Promoters are more likely than Detractors to have low ratings of likelihood-to-discourage. Across ratings for web and mobile app usage, most very low discouragement ratings are from Promoters (42–47%) with the remainder split between Passives (27–29%) and Detractors (27–29%).
NPS Detractors are much more likely than Promoters to have high ratings of likelihood-to-discourage. Across ratings for web and mobile app usage, the vast majority of high discouragement ratings are from Detractors (82%).
Discouragement might not be exactly the same as recommending against. We used a discouragement scale to measure this behavioral intention because it includes more active wording than “recommending against,” and we found it easier to ask. But there could be a difference (likely small) between responses to measures using “discouragement” wording versus the “recommending against” wording used in some previous research.
Bottom line: If researchers can get ratings of only one behavioral intention in contexts where recommendation is a plausible user behavior, it should be likelihood-to-recommend. For a clearer picture of the full range of behavioral intention, there appears to be value in also collecting ratings of likelihood-to-discourage.
Clutter can lead to a poor user experience. Poor experiences repel users.
So how does one measure clutter?
Earlier, we did a deep dive into the literature to see how clutter has been defined and measured. We found that the everyday concept of clutter has two components: a disorganized collection of relevant objects and the presence of irrelevant ones.
But most measures of clutter were objective, based on attributes such as grouping and layout complexity. The only questionnaire we found was developed for airplane cockpit displays, which didn’t seem relevant to website clutter.
So, we began to build our own questionnaire for measuring website clutter.
In this article, we briefly review the exploratory research we conducted and then analyze new data to validate what we found using a statistical technique called confirmatory factor analysis.
In the first iteration of that exploratory research, we started with a preliminary clutter questionnaire that measured two aspects of clutter based on the literature: content clutter (e.g., irrelevant ads and videos) and design clutter (e.g., too much text, illogical layout).
The first iteration of the Perceived Website Clutter Questionnaire (PWCQ) included one item for overall clutter, six for content clutter, and ten for design clutter (see Figure 1 for the entire questionnaire used in our surveys).
The format for overall clutter was an 11-point agreement item (“Overall, I thought the website was too cluttered,” 0: Strongly disagree, 10: Strongly agree). The format for content and design clutter used five-point agreement items (1: Strongly disagree, 5: Strongly agree). The short labels and item wording for the content and design clutter items were:
After applying the exploratory techniques of parallel analysis, factor analysis, item analysis, and item retention, the revised version of the PWCQ had two items for content clutter (Content_ALot, Content_Space) and three for design clutter (Design_UnpleasantLayout, Design_TooMuchText, and Design_VisualNoise). Using regression analysis, these five items accounted for 45% of the variation in the one-item measure of overall clutter (highly significant) with excellent scale reliabilities (ranging from .88 to .91 overall and for the two subscales).
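Scale reliabilities like those reported above are typically computed as Cronbach’s coefficient alpha (we assume that method here). A minimal sketch, with made-up 1–5 ratings for the three design-clutter items:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_respondents x k_items) array of ratings."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of individual item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical 1-5 agreement ratings for three design-clutter items.
design = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [1, 2, 1], [3, 3, 4], [4, 4, 5]]
print(f"alpha = {cronbach_alpha(design):.2f}")
```

When items move together across respondents (as in this toy data), alpha approaches 1; values near .9, like those we observed, indicate excellent internal consistency.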
When developing a standardized questionnaire, however, exploratory research is just the first step. To have confidence in the questionnaire, it’s important to follow exploratory research with confirmatory research.
We used three approaches to validate the clutter questionnaire: confirmatory factor analyses, sensitivity analyses, and range analyses. The data for these analyses came from eight retrospective SUPR-Q® consumer surveys conducted between April 2022 and January 2023. Each survey targeted a specific sector, and, in total, we collected 2,761 responses to questions about the UX of 57 websites. The sample had roughly equal representation of gender and age (split at 35 years old). Table 1 shows the participant gender and age for each survey, with sector names linking to articles with more information about each survey (including the websites selected for the sectors). Participants were members of an online consumer panel, all from the United States.
Sector | n | Date | Websites | Female (%) | Male (%) | Under 35 (%) | 35 or older (%)
---|---|---|---|---|---|---|---
Real Estate | 269 | Apr-2022 | 5 | 48 | 51 | 48 | 52 |
Travel Aggregator | 452 | Apr-2022 | 9 | 48 | 51 | 48 | 52 |
Business Info | 183 | Jul-2022 | 3 | 46 | 53 | 42 | 58 |
Domestic Air | 350 | May-2022 | 7 | 48 | 49 | 58 | 42 |
International Air | 200 | May-2022 | 5 | 53 | 46 | 61 | 39 |
Ticketing | 234 | Jun-2022 | 5 | 45 | 52 | 40 | 60 |
Clothing | 550 | Dec-2022 | 13 | 52 | 45 | 48 | 52 |
Wireless | 523 | Jan-2023 | 10 | 47 | 50 | 40 | 60 |
Overall | 2,761 | – | 57 | 49 | 49 | 48 | 52 |
Some survey content differed according to the nature of the sector being investigated, but all surveys included the SUPR-Q, basic demographic items, and the first iteration of the perceived clutter questionnaire. For each survey, we conducted screeners to identify respondents who had used one or more of the target websites within the past year, then invited those respondents to rate one website with which they had prior experience. On average, respondents completed the surveys in 10–15 minutes (there was no time limit).
To support independent exploratory and confirmatory analysis, we split the sample into two datasets by assigning every other respondent to an exploratory (n = 1,381) or confirmatory (n = 1,380) sample by sector and website in the order in which respondents completed the surveys. These sample sizes ensured that we far exceeded the recommended minimum sample sizes for exploratory and confirmatory factor analysis.
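A sketch of this alternating split, with hypothetical rows standing in for respondents in completion order:

```python
import pandas as pd

# Hypothetical responses; each row is one completed survey, in completion order.
responses = pd.DataFrame({
    "sector":  ["Clothing"] * 4 + ["Wireless"] * 4,
    "website": ["A", "A", "B", "B", "C", "C", "C", "C"],
})

# Alternate respondents into exploratory/confirmatory within each sector x website
# group, preserving the order in which they completed the survey.
responses["split"] = (
    responses.groupby(["sector", "website"]).cumcount() % 2
).map({0: "exploratory", 1: "confirmatory"})

print(responses)
```

Alternating within sector and website keeps the two halves balanced on both variables, so differences between exploratory and confirmatory results can’t be attributed to sampling different sectors or sites.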
Figure 2 shows the item loadings for a confirmatory factor analysis (CFA) assuming no structure in the items (i.e., a one-factor model, left panel) and the same items in a two-factor model (Content and Design, right panel).
There are many ways to assess the quality of CFA. Following the recommendations of Jackson et al. (2009), we focused on Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), and Bayesian Information Criterion (BIC). There are guidelines for good levels of model fit for CFI (> 0.90) and RMSEA (< 0.08), but not for BIC, which is used for the relative comparison of models (smaller is better).
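Common formulations of CFI and RMSEA can be computed from the fitted model’s chi-square and the baseline (independence) model’s chi-square. This is a sketch of those standard formulas with made-up chi-square values, not a reproduction of our analyses:

```python
import math

def cfi(chisq_m, df_m, chisq_b, df_b):
    """Comparative Fit Index from model (m) and baseline (b) chi-square fits."""
    d_m = max(chisq_m - df_m, 0)       # model noncentrality
    d_b = max(chisq_b - df_b, d_m)     # baseline noncentrality (at least d_m)
    return 1.0 if d_b == 0 else 1 - d_m / d_b

def rmsea(chisq_m, df_m, n):
    """Root Mean Square Error of Approximation (one common formulation)."""
    return math.sqrt(max(chisq_m - df_m, 0) / (df_m * (n - 1)))

# Hypothetical chi-square values, not the actual fits reported in this article.
print(f"CFI   = {cfi(chisq_m=30.0, df_m=4, chisq_b=900.0, df_b=10):.3f}")
print(f"RMSEA = {rmsea(chisq_m=30.0, df_m=4, n=1380):.3f}")
```

Both indices shrink the model chi-square by its degrees of freedom, so a model that fits no better than chance is not rewarded for complexity; CFI rises toward 1 and RMSEA falls toward 0 as fit improves.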
For the one-factor model, the CFI was 0.74, RMSEA was 0.20, and BIC was 6,166. For the two-factor model, the CFI was 0.92, RMSEA was 0.11, and BIC was 2,144. Thus, accounting for the Content/Design two-factor structure led to better fit statistics including an acceptable level of CFI, but RMSEA was greater than 0.08.
Figure 3 shows the CFA for the five items retained during the exploratory analyses. The fit statistics for this model were excellent, with a CFI of 0.997, RMSEA of 0.047, and BIC of 96. This CFA model confirmed the construct validity of the two-factor structure identified in the exploratory analyses. For the final version of the PWCQ, see Figure 4.
Using the full dataset (n = 2,761), we conducted ANOVAs to check the sensitivity of the three clutter metrics (the significance of the main effect of website). All three metrics were statistically significant.
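A sensitivity check of this kind can be sketched as a one-way ANOVA with website as the factor. The ratings below are simulated, not our survey data:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Simulated Overall Clutter ratings (0-10) for three hypothetical websites
# with different true clutter levels.
site_a = rng.normal(3.0, 1.5, 50).clip(0, 10)
site_b = rng.normal(4.5, 1.5, 50).clip(0, 10)
site_c = rng.normal(2.0, 1.5, 50).clip(0, 10)

f_stat, p_value = f_oneway(site_a, site_b, site_c)
print(f"F = {f_stat:.1f}, p = {p_value:.2g}")
```

A significant main effect of website indicates the metric can distinguish among websites, which is the sensitivity property a standardized questionnaire needs.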
We assessed the range of these metrics across websites (after rescaling to a common 0–100-point scale for ease of comparison) to get a sense of the extent to which the dataset included websites with different levels of clutter. The distributions are shown in Figure 5 and summarized in Table 2.
Design Clutter scores tended to run lower than Content Clutter scores, with a ten-point difference in medians (50th percentiles). For Content Clutter and Design Clutter, the range of scores was slightly more than half of the possible range of the metric. The range for Overall Clutter was a little more restricted, covering about 40% of the possible range of the metric. The 5th–95th percentiles for the metrics were from 20 to 51 for Content Clutter, 12 to 41 for Design Clutter, and 20 to 45 for Overall Clutter. None of the websites had a mean score on these metrics higher than 65.
Clutter Metric | Min | 5th | 10th | 25th | 50th | 75th | 90th | 95th | Max | Range |
---|---|---|---|---|---|---|---|---|---|---|
Content | 11 | 20 | 21 | 26 | 33 | 37 | 44 | 51 | 62 | 51 |
Design | 9 | 12 | 16 | 20 | 23 | 29 | 36 | 41 | 65 | 56 |
Overall | 11 | 20 | 23 | 29 | 32 | 38 | 44 | 45 | 50 | 39 |
Table 2: Summary of distributions for Content Clutter, Design Clutter, and Overall Clutter after conversion to a 0–100-point scale.
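We assume the rescaling to a common 0–100-point scale is a linear interpolation from each metric’s native scale (1–5 for the content and design items, 0–10 for overall clutter). A minimal sketch:

```python
def rescale(value, low, high):
    """Linearly rescale a rating from its native [low, high] scale to 0-100."""
    return (value - low) / (high - low) * 100

# Content/design items used 1-5 agreement; overall clutter used 0-10.
print(rescale(3.0, 1, 5))   # midpoint of a 1-5 item -> 50.0
print(rescale(6.5, 0, 10))  # 0-10 overall item -> 65.0
```

Putting all three metrics on the same 0–100 footing is what makes the percentile comparisons in Table 2 meaningful.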
For the eight surveys we conducted, our focus was on gathering information about top websites in their sectors, so we did not deliberately include websites with unusually high levels of clutter. It is possible that including very cluttered websites would have led to different analytical solutions. That said, our exploratory and confirmatory analyses are appropriate for the types of websites we typically study in our consumer surveys, and because we saw no evidence of ceiling or floor effects with these clutter metrics, they may also work well when assessing very cluttered websites.
Confirmatory analysis of over 1,000 ratings of the perceived clutter of 57 websites found:
Confirmatory factor analysis of the five-item version of the PWCQ indicated excellent fit. CFA of the five-item version of the clutter questionnaire had excellent fit statistics (CFI = 0.997, RMSEA = 0.047, BIC = 96), better than a similar two-factor CFA of the 16-item version (CFI = 0.92, RMSEA = 0.11, BIC = 2,144).
Clutter questionnaire scores varied across websites but with possible range restrictions. The sensitivity analyses of Content Clutter, Design Clutter, and Overall Clutter showed significant variation in the means of these metrics by website. However, after rescaling to 0–100-point scales, no website had a clutter score greater than 65; observed scores covered about half the possible range for Content Clutter and Design Clutter and about 40% of the range for Overall Clutter.
Bottom line: We expect UX researchers and practitioners to be able to use this version of the clutter questionnaire when the research context is similar to the websites we studied in our consumer surveys. We don’t anticipate serious barriers to using the clutter questionnaire in other similar contexts including task-based studies, mobile apps, and very cluttered web/mobile UIs, but because that research has not yet been conducted, UX researchers and practitioners should exercise due caution.
For more details about this research, see the paper we published in the International Journal of Human-Computer Interaction (Lewis & Sauro, 2024).
Is dissatisfied the opposite of satisfied?
Is discourage the opposite of recommend?
And is not recommending the same as recommending against?
When computing the Net Promoter Score (NPS), people who rate the 0–10-point likelihood-to-recommend item high (a 9 or 10) are categorized as promoters, and those who give low ratings (6 or lower) are described as detractors.
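As a sketch, the NPS computation from these designations (using hypothetical ratings):

```python
def net_promoter_score(ratings):
    """NPS = % promoters (9-10) minus % detractors (0-6), from 0-10 ratings."""
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return 100 * (promoters - detractors) / len(ratings)

# Hypothetical ratings: 5 promoters, 2 passives, 3 detractors.
print(net_promoter_score([10, 9, 9, 9, 8, 7, 6, 5, 10, 0]))  # -> 20.0
```

Note that passives (7–8) drop out of the numerator entirely; only the extremes of the scale move the score.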
The term detractor suggests (intentionally) that these customers play a role in spreading negative word of mouth by discouraging other people from making purchases. Hearing bad things (negative word of mouth) may have as large an impact on company growth as positive word of mouth, if not larger.
But is being less likely to recommend or even saying negative things about a brand or product the same as actively discouraging others from purchasing or using?
We are generally a little skeptical of bold claims and like to see the data or look to replicate. In earlier research, we examined the claim that detractors account for 80% of negative word of mouth. We found most (90%) of the negative comments in our study were, indeed, associated with detractors.
However, as noted in that analysis, it’s important to differentiate between the majority of detractors saying negative things and the majority of negative comments coming from detractors. Not all detractors say bad things. In fact, in our analysis (and in others), some positive comments come from detractors and some negative comments come from promoters. This doesn’t negate the value of knowing where more negative comments are likely to come from.
But making a negative comment isn’t necessarily the same as actively discouraging others from engaging with a brand (regardless of where the person was on the likelihood to recommend scale, high or low). For example, here’s a negative comment about a bad experience on United Airlines from our earlier analysis of detractors:
“I took a flight back from Boston to San Francisco 2 weeks ago on United. It was so terrible. My seat was tiny and the flight attendants were rude. It also took forever to board and deboard.”
While hardly an endorsement of the airline, does this explicitly discourage others from using United?
Comments that suggest actual discouragement (for example, “I would tell my friends to stay away!”) are less common in our data.
In this article, we review data in the relevant literature to understand whether the Net Promoter Score adequately measures discouragement.
We identified five relevant papers, four from the same group of authors led by Robert East. All the papers investigated the frequency of word-of-mouth (WOM) behaviors that are usually defined as statements that advise others to make purchases (positive word-of-mouth, PWOM) or to avoid purchases (negative word-of-mouth, NWOM) from a company or brand.
In The Relative Incidence of Positive and Negative Word of Mouth (2007), East, Hammond, and Wright analyzed 15 categories (e.g., cars, ISPs, and dentists) from a few thousand participants (on average, 153 respondents per brand in the UK between 2001 and 2003). Participants were asked how many times they either recommended or advised against using a brand.
They reported that 78% of PWOM was directed at the participant’s main/current brand, whereas 77% of NWOM was directed toward a former brand. Thus, as a rough approximation, three-fourths of PWOM and one-fourth of NWOM are about the main brand.
After the publication of the 2007 paper, East presented that research at the 2008 ANZMAC conference (Measurement Deficiencies in the Net Promoter Score), augmented with some additional data. With the additional data, their estimates of PWOM and NWOM for the current brand changed slightly to 72% and 25%—still the same approximately three-fourths to one-fourth ratio. He also provided some early results from a study that was published later in 2008, discussed next.
In Measuring the Impact of Positive and Negative Word of Mouth on Brand Purchase Probability (East, Hammond, & Lomax, 2008), the authors conducted 11 new in-person surveys in the UK between 2005 and 2007 with 2,544 respondents answering questions about one or two categories of purchases (e.g., cell phones, credit cards, restaurants). Respondents were asked whether they had received positive or negative advice for any of the brands listed, whether the advice was positive or negative, and whether it affected their brand choice. Their survey used the 11-point Juster Scale to assess purchase intention before and after the WOM.
Technical note: Like the standard item for likelihood to recommend, the Juster scale has eleven response options from 0 to 10. Unlike the typical unlabeled response options of likelihood to recommend (except for its endpoints), the Juster scale is fully labeled with probability statements (e.g., 0: No chance, almost no chance [1 in 100]; 1: Very slight possibility [1 in 10]; … 9: Almost sure [9 in 10]; 10: Certain, practically certain [99 in 100]). Because likelihood to recommend and Juster scale formats differ in ways that tend to be more cosmetic than substantial (e.g., full versus endpoint labeling of response options), we suspect the format differences likely have little effect on respondent behavior (but we have not yet tested this).
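As a sketch of how Juster-style responses translate into probabilities, and how a mean shift in purchase probability could be computed from them. The probability mapping approximates the published scale labels, and the before/after ratings are hypothetical:

```python
# Approximate probabilities for the Juster scale's fully labeled options
# (0: "no chance, 1 in 100" ... 10: "certain, 99 in 100"); intermediate
# labels are assumed to step by tenths.
JUSTER_PROB = {0: 0.01, 10: 0.99, **{k: k / 10 for k in range(1, 10)}}

def mean_shift(before, after):
    """Mean shift in stated purchase probability after receiving WOM."""
    shifts = [JUSTER_PROB[post] - JUSTER_PROB[pre]
              for pre, post in zip(before, after)]
    return sum(shifts) / len(shifts)

# Hypothetical before/after ratings for respondents who received PWOM.
print(round(mean_shift(before=[3, 5, 6, 2], after=[5, 7, 8, 4]), 2))  # -> 0.2
```

Expressing ratings as probabilities is what lets shifts like the reported +0.20 (PWOM) and −0.11 (NWOM) be read directly as changes in purchase likelihood.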
Overall, 64% claimed that PWOM and 48% claimed that NWOM affected their decisions. When impact was measured as the shift in purchase probability, PWOM produced a mean shift of 0.20 and NWOM produced a shift of −0.11. This indicates that PWOM is more influential than NWOM, but NWOM is itself influential and should be measured. Because most NWOM against a brand is produced by consumers who are not current users of a brand, the authors concluded that a major weakness of the typical measurement of NPS is that companies only ask it of their current customers.
In a follow-up paper published three years later (The NPS and the ACSI: A Critique and Alternative Metric), East, Romaniuk, and Lomax (2011) argued that neither the NPS nor the American Customer Satisfaction Index (ACSI) adequately measure NWOM because ex-customers and never-customers aren’t sampled in their methodologies. They further criticized the NPS based on reasoning that the intention to recommend is likely to have less influence on future purchase behaviors than the memory of having received a recommendation in favor of or against a brand.
The authors delivered surveys to homes in the UK between 2007 and 2008 and received 2,254 usable responses. Respondents reported whether they had given or received positive and negative advice across several categories including grocery stores, banks, luxury brands, and cell phones. Respondents completed Juster scales to measure behavioral likelihoods and a version of the three-item ACSI to measure satisfaction.
The authors totaled the negative advice reported given and then determined what percent came from detractors on their NPS-like Juster scale and ACSI-like scale (see Table 1).
Table 1: Percentage of total NWOM coming from detractors (by the NPS-like Juster item and the ACSI-like item) for the Main Supermarket, Main Coffee Shop, and Skin Care Products categories.
Similar to the authors’ earlier published findings, NPS detractors accounted for a minority of the total NWOM across used and unused brands (31%). The authors also showed the NPS correlated highly with the ACSI (which we also found in our earlier analysis of CSAT and NPS), with the ACSI accounting for a similar amount of NWOM (28%). This suggests the inability to fully account for NWOM has less to do with the measure used and more to do with who is measured (the sampling strategy).
In an unpublished manuscript (Measuring Customer Satisfaction and Loyalty: Improving the ‘Net-Promoter’ Score) by a different set of researchers (Schneider, Berent, Thomas, & Krosnick, 2008), the authors conducted two studies in which they manipulated rating scale labels for likelihood-to-recommend items. Although it was not the focus of their research, they did measure the association between the Net Promoter Score (standard likelihood-to-recommend item) and stated past positive and negative recommendations.
As part of their research, Schneider et al. asked 4,883 respondents questions about eight brands (automotive manufacturers and airlines), also asking whether they were familiar with the brands and whether they were customers. This research was highly exploratory, including over 150 regression analyses with varying outcomes, making it difficult to construct a comprehensive narrative that satisfactorily accounts for all the results.
Despite this, we were especially intrigued by the comparison of the standard unipolar likelihood-to-recommend scale with a 7-point bipolar version designed to allow respondents to indicate the extent to which they recommended for or recommended against purchasing from a brand (see Figure 1).
The key regression results for these two items appear in Table 2. Before including variables in regression analyses, Schneider et al. standardized all values to a 0–1 scale, where 0 was the lowest possible value for a rating scale and 1 was the highest, permitting interpretation of the regression weights (the cells in Table 2) as measures of the strength of each regression model.
Table 2: Standardized regression weights for the unipolar and bipolar likelihood-to-recommend items predicting PWOM and NWOM (for all respondents and for customers only).
The table shows the results for regressions modeling the predictive strength of each item (unipolar and bipolar) for PWOM (number of positive recommendations and number of people who made at least one positive recommendation for all respondents and for respondents who were brand customers) and NWOM (number of negative recommendations and number of people who made at least one negative recommendation for all respondents and for respondents who were brand customers). The standard unipolar scale was as good as or better than the bipolar scale when modeling PWOM, especially for the Customers Only condition. In contrast, the bipolar scale was as good as or better than the unipolar scale when modeling NWOM, especially for the All Respondents condition.
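The 0–1 standardization described above can be sketched as follows. The ratings and NWOM counts are hypothetical, and we use a simple least-squares fit rather than the authors’ exact models:

```python
import numpy as np

def to_unit_interval(x, scale_min, scale_max):
    """Rescale ratings so the scale minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    return (x - scale_min) / (scale_max - scale_min)

# Hypothetical data: 0-10 likelihood-to-recommend predicting counts of
# negative recommendations (capped at 3 for this toy example).
ltr = to_unit_interval([0, 2, 4, 5, 7, 9, 10], 0, 10)
nwom = to_unit_interval([3, 3, 2, 2, 1, 0, 0], 0, 3)

slope, intercept = np.polyfit(ltr, nwom, 1)  # regression weight on the 0-1 scale
print(f"weight = {slope:.2f}")
```

Because both variables live on a 0–1 scale, the fitted weight is comparable across models regardless of each item’s original response range, which is what makes the cells of Table 2 comparable.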
The findings of Schneider et al. most relevant to this review are that the standard unipolar item was the better predictor of PWOM, while the bipolar item was the better predictor of NWOM.
Our review of one unpublished and four published papers about the NPS’s ability to measure recommending against a brand (i.e., NWOM, discouragement) found:
NWOM is worth measuring. PWOM appears to be more prevalent and influential than NWOM, but NWOM is itself influential and should be measured.
Surveying only existing customers is problematic for assessing NWOM. Asking only existing customers will almost surely understate the percentage of people likely to discourage or recommend against. Proper measurement of the state of a brand requires surveying customers and noncustomers.
NPS is not necessarily the best measure of recommending against. Consistent with being based on a unipolar measure of likelihood to recommend that is a strong predictor of PWOM, NPS appears to properly measure encouragement/recommendation for. It also appears to significantly measure NWOM, but it may not be the best predictor.
A bipolar scale may better predict NWOM but at a loss of benchmarks. Good benchmarks are not developed overnight, and in most cases the process takes years (and even then, there is no guarantee of broad adoption). Despite the bipolar scale’s superior prediction of NWOM, one of the desirable features of the NPS is its published benchmarks.
Future research: How much might researchers gain by asking a discouragement question in addition to the standard likelihood to recommend? We’ll explore this in an upcoming study.
East, R., Hammond, K., & Wright, M. (2007). The relative incidence of positive and negative word of mouth. International Journal of Research in Marketing, 24, 175–184.
East, R. (2008). Measurement deficiencies in the Net Promoter Score. Sydney, Australia: ANZMAC.
East, R., Hammond, K., & Lomax, W. (2008). Measuring the impact of positive and negative word of mouth on brand purchase probability. International Journal of Research in Marketing, 25, 215–224.
Schneider, D., Berent, M., Thomas, R., & Krosnick, J. (2008). Measuring customer satisfaction and loyalty: Improving the ‘Net-Promoter’ score. Unpublished manuscript.
East, R., Romaniuk, J., & Lomax, W. (2011). The NPS and the ACSI: A critique and an alternative metric. International Journal of Market Research, 53(3), 327–346.
In a previous article, we described our search for a measure of perceived clutter in academic literature and web posts, but our thirst went unquenched.
We found that the everyday conception of clutter includes two components that suggest different decluttering strategies: the extent to which needed objects (e.g., tools in a toolbox) are disorganized and/or the presence of unnecessary objects (e.g., a candy wrapper in a toolbox). The first situation requires reorganizing the needed objects, while the second requires discarding unnecessary objects.
The literature in UI design has mostly focused on objectively measuring information displayed on screens (e.g., local density, grouping, feature congestion). We found a published questionnaire for subjective clutter in advanced cockpit displays, but we did not find any standardized questionnaires developed for the measurement of perceived clutter on websites.
So, we decided to develop our own.
The development process for a standardized questionnaire has two major research activities: exploratory and confirmatory. In this article, we focus on the exploratory research.
Consistent with the literature we reviewed, we hypothesized that at least two factors might contribute to the perceived clutter of websites: content clutter and design clutter.
We expected content clutter to be driven by the presence of irrelevant ads and videos that occupy a considerable percentage of display space and have negative emotional consequences (e.g., they’re annoying). Considering the components of the everyday conception of clutter, these would be the candy wrappers in the toolbox—items that website users would prefer to discard, perhaps by using ad blockers.
Our conception of design clutter is that it is driven by issues with the presentation of potentially relevant content that make it difficult to consume (e.g., insufficient white space, too much text, illogical layout). Analogous to the everyday definition of clutter, this content is similar to a hammer in the toolbox—it should be retained but needs reorganization.
The first iteration of the perceived website clutter questionnaire (PWCQ) included one item for overall clutter, six for content clutter, and ten for design clutter (see Figure 1 for the entire questionnaire used in our surveys). The format for overall clutter was an 11-point agreement item (“Overall, I thought the website was too cluttered,” 0: Strongly disagree, 10: Strongly agree). The format for content and design clutter was five-point agreement items (1: Strongly disagree, 5: Strongly agree). The short labels and item wording for the content and design clutter items were:
The data for these analyses came from eight retrospective SUPR-Q^{®} consumer surveys conducted between April 2022 and January 2023. Each survey targeted a specific sector, and in total, we collected 2,761 responses to questions about the UX of 57 websites. The sample had roughly equal representation of gender and age (split at 35 years old). Table 1 shows the participant gender and age for each survey, with sector names linking to articles with more information about each survey (including the websites selected for the sectors). Participants were members of an online consumer panel, all from the United States.
Sector | n | Date | Websites | Female (%) | Male (%) | Under 35 (%) | 35 or older (%) |
---|---|---|---|---|---|---|---|
Real Estate | 269 | Apr-2022 | 5 | 48 | 51 | 48 | 52 |
Travel Aggregator | 452 | Apr-2022 | 9 | 48 | 51 | 48 | 52 |
Business Info | 183 | Jul-2022 | 3 | 46 | 53 | 42 | 58 |
Domestic Air | 350 | May-2022 | 7 | 48 | 49 | 58 | 42 |
International Air | 200 | May-2022 | 5 | 53 | 46 | 61 | 39 |
Ticketing | 234 | Jun-2022 | 5 | 45 | 52 | 40 | 60 |
Clothing | 550 | Dec-2022 | 13 | 52 | 45 | 48 | 52 |
Wireless | 523 | Jan-2023 | 10 | 47 | 50 | 40 | 60 |
Overall | 2,761 | – | 57 | 49 | 49 | 48 | 52 |
The eight surveys shown in Table 1 were retrospective studies of the UX of websites in their respective sectors. Some survey content differed according to the nature of the sector being investigated, but all surveys included the SUPR-Q, basic demographic items, and the first iteration of the perceived clutter questionnaire. For each survey, we conducted screeners to identify respondents who had used one or more of the target websites within the past year, then invited those respondents to rate one website with which they had prior experience. On average, respondents completed the surveys in 10–15 minutes (there was no time limit).
To support independent exploratory and confirmatory analysis, we split the sample into two datasets by assigning every other respondent to an exploratory (n = 1,381) or confirmatory (n = 1,380) sample by sector and website in the order in which respondents completed the surveys. These sample sizes ensured that we far exceeded the recommended minimum sample sizes for exploratory factor analysis and multiple regression (and for future confirmatory factor analysis and structural equation modeling), even after splitting the sample.
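The every-other-respondent assignment can be sketched in a few lines. This is a simplified illustration (a plain list of respondents); in the actual analysis, the alternation was done within each sector and website in the order of survey completion.

```python
# Simplified sketch of the alternating split into exploratory and
# confirmatory samples. In the real analysis, alternation occurred
# within each sector and website in completion order.
def split_alternating(respondents):
    """Assign every other respondent to the exploratory or confirmatory set."""
    exploratory, confirmatory = [], []
    for i, respondent in enumerate(respondents):
        (exploratory if i % 2 == 0 else confirmatory).append(respondent)
    return exploratory, confirmatory

exploratory, confirmatory = split_alternating(list(range(1, 11)))
```

Splitting this way (rather than randomly) keeps the two halves balanced on any time-ordered drift in who completes the survey.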
A parallel analysis of the clutter items indicated retention of two factors. Table 2 shows the alignment of items (identified with item code) with factors from maximum likelihood factor analysis and Promax rotation (KMO = 0.95). Content and design items aligned as expected with Content and Design factors. The reliabilities (coefficient alpha) were acceptably high (Content and Design factors were both 0.95; their combined reliability was 0.96).
Item | Content | Design |
---|---|---|
Content_ALot | .855 | .011 |
Content_TooMany | .883 | −.034 |
Content_Space | .881 | .035 |
Content_Distracting | .897 | .016 |
Content_Irrelevant | .774 | .033 |
Content_Annoying | .892 | −.015 |
Design_HardToRead | −.114 | .832 |
Design_SmallFont | −.084 | .778 |
Design_DistractingColors | .024 | .765 |
Design_UnpleasantLayout | .039 | .829 |
Design_WhiteSpace | .086 | .723 |
Design_TooMuchText | .063 | .776 |
Design_NotLogical | .061 | .803 |
Design_Disorganized | .035 | .844 |
Design_VisualNoise | .219 | .664 |
Design_HardToStart | .000 | .795 |
Item loadings were especially high for content items due to high item correlations, which is good for scale reliability but indicates an opportunity to improve scale efficiency by removing some items. The situation was similar but not quite as extreme for the design items.
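Coefficient alpha, the reliability statistic reported above, can be computed directly from the item-level ratings. A minimal sketch with made-up ratings (not our survey data):

```python
from statistics import variance

def cronbach_alpha(rows):
    """Coefficient alpha for a list of per-respondent rating tuples.

    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)
    """
    k = len(rows[0])                                   # number of items
    cols = list(zip(*rows))                            # per-item columns
    item_var_sum = sum(variance(col) for col in cols)  # sum of item variances
    total_var = variance([sum(row) for row in rows])   # variance of summed scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Perfectly consistent items (every respondent rates both items identically)
# yield the maximum alpha of 1.0:
alpha = cronbach_alpha([(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)])
```

Highly correlated items drive alpha toward 1, which is exactly why very high alphas (like the .95s here) signal redundancy as well as reliability.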
A common strategy for deleting items is to identify those with lower factor loadings. For example, for the Content factor, the lowest item loading was for Content_Irrelevant (.774), and for the Design factor, the lowest item loading was for Design_VisualNoise (.664). However, because we collected a measure of overall perceived clutter (Overall Clutter), we were able to use an alternative strategy of backward elimination regression to select the subset of content and design items that best accounted for variation in Overall Clutter.
Backward regression (key driver analysis) of the six content items retained three: Content_ALot, Content_Space, and Content_Distracting, accounting for 35.5% of variation (adjusted-R^{2}) in Overall Clutter. Backward regression of the ten design items plus deletion of items with negative beta weights retained three: Design_UnpleasantLayout, Design_TooMuchText, and Design_VisualNoise, accounting for 39% of variation (adjusted-R^{2}) in Overall Clutter.
Backward regression of these six items revealed some evidence of variance inflation, and in this combination, Content_Distracting no longer made a significant contribution to the model. After removing Content_Distracting, the remaining five items accounted for almost half of the variation in Overall Clutter (adjusted-R^{2} = 45%), and all variance inflation factors (VIF) were less than 4. The reliabilities (coefficient alpha) for the revised Content and Design factors were, respectively, 0.91 and 0.88; their combined reliability was 0.90.
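The VIF screen (all values below 4) can be reproduced with ordinary least squares: each predictor is regressed on the remaining predictors, and VIF_j = 1 / (1 − R²_j). A sketch with a toy predictor matrix, not our survey data:

```python
import numpy as np

def vifs(X):
    """Variance inflation factors for the columns of predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on the other columns (with an intercept).
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = ((y - A @ beta) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1 - ss_res / ss_tot
        out.append(1 / (1 - r2))
    return out

# Two uncorrelated predictors give VIFs of 1 (no variance inflation):
example = vifs([[1, 1], [2, -1], [3, -1], [4, 1]])
```

VIFs near 1 mean the predictors carry independent information; values above roughly 4 or 5 indicate the redundancy that motivated dropping Content_Distracting.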
For the exploratory research, the method of consulting the literature and expert brainstorming to arrive at the initial item set established content validity for the clutter questionnaire (Nunnally, 1978). The expected alignment of items with factors in the factor analysis is evidence of construct validity. Evidence of concurrent validity of the clutter factors comes from their significant correlations with the single-item measure of overall clutter (content clutter: r(1,379) = 0.60, p < 0.0001; design clutter: r(1,379) = 0.61, p < 0.0001).
Based on the exploratory analyses, Figure 2 shows the revised version of the PWCQ, with the overall clutter item, two items for the assessment of content clutter, and three items for the assessment of design clutter.
Exploratory analysis of 1,381 ratings of the perceived clutter of 57 websites found:
The proposed questionnaire items aligned with the expected factors of content and design clutter. A parallel analysis indicated the retention of two factors. Exploratory factor analysis showed that the content clutter items formed one factor, and the design clutter items formed the other.
Scale reliability was very high for the overall and factor scores. The reliability for each factor was .95 with a combined reliability of .96. Reliabilities this high indicate an opportunity to increase scale efficiency by reducing the number of items.
We used multiple regression to increase the efficiency of the questionnaire while keeping its reliability high. The revised questionnaire retained the overall item, two items for content clutter, and three items for design clutter. Reliability coefficients dropped a bit from the original questionnaire but remained high (.91 for content clutter, .88 for design clutter, and .90 combined).
The revised questionnaire had high concurrent validity. Concurrent validity was evident from the highly significant correlations between the factor scores and the single overall clutter item scores.
Bottom line: This exploratory development of a standardized clutter questionnaire for websites produced an efficient two-factor instrument with excellent psychometric properties (high reliability and validity).
The mean is one type of average. Would it be more appropriate to estimate and compare SUS medians instead of means?
To investigate this question, we analyzed and compared SUS means and medians collected from over 18,000 individuals who used the SUS to rate the perceived usability of over 200 products and services. But before we get to those results, it helps to understand the difference between the median and mean and why you might choose one or the other.
One of the first and easiest things to do with a data set is to find the mean. The mean is a measure of central tendency, one way of summarizing the middle of a dataset. To calculate the mean, add up the data points and divide by the total number in the group (the sample size, n). With the mean, every data point contributes to the estimate. The mean is the preferred method for data that are at least roughly symmetrical (i.e., the mean is about midway between the lowest and highest values). Of the many types of symmetrical distributions, the best known is the normal distribution.
When the data aren’t symmetrical, the mean can be sufficiently influenced by a few extreme data points to become a poor measure of the middle value. In these cases, the median, the center point of a distribution, is a better estimate of the most typical value. For example, this often happens with distributions of time data. For samples with an odd number of data points, the median is the central value; for samples with an even number, it’s the average of the two central values. With the median, only the center or center two data points directly contribute to the estimate while the other data points establish the center.
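The pull of extreme values on the mean is easy to see with a small, hypothetical set of task times:

```python
from statistics import mean, median

# Hypothetical task times (seconds) with one extreme observation:
times = [20, 22, 25, 27, 30, 31, 35, 200]

print(mean(times))    # 48.75 -- pulled well above the typical value by the outlier
print(median(times))  # 28.5 -- the average of the two central values (27 and 30)
```

Seven of the eight observations are under 36 seconds, yet the outlier drags the mean to nearly 49; the median stays in the middle of the typical values.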
The median is a problematic measure of central tendency for individual rating scales. Because respondents select one number from a rating scale, the dataset is composed of integers. For a five-point scale, the median can take only the following values no matter how large the sample is: 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, and 5.0. (And it can only take the intermediate values when n is even.)
The mean, on the other hand, can take any value between 1 and 5, and as the sample size increases, it becomes more and more continuous. Because the mean can acquire a larger number of values, it can reflect significant differences between two samples more reliably than the median difference.
When scales are open-ended (have at least one endpoint at infinity, like time data), extreme values can affect the mean but will not affect the median. Rating scales, however, are not open-ended, so the median does not have a compelling advantage over the mean when analyzing individual rating scales.
Things get more complicated when working with metrics that are composites of many individual rating scales, like the SUS. The SUS is made up of ten five-point items with the final score (the mean of the ten items) interpolated to range from 0–100, so it can take 41 values in 2.5-point increments (0, 2.5, 5.0, 7.5, … 97.5, 100). The median can take these values when n is odd. When n is even, 40 intermediate values (such as 1.25 and 3.75) are also possible, for a total of 81 potential median values separated by just 1.25 points. As for individual rating scales, the mean becomes more continuous as the sample size increases, but with so many possible median values, the difference in mean–median sensitivity is reduced for the SUS.
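These counts can be verified by enumeration, under the assumption that any pair of possible SUS scores could be the two central values in an even-n sample:

```python
# Individual SUS scores are multiples of 2.5 from 0 to 100.
scores = [i * 2.5 for i in range(41)]

# With an odd n, the median is one of these scores; with an even n, it is
# the average of two of them, which always lands on a multiple of 1.25.
even_n_medians = {(a + b) / 2 for a in scores for b in scores}

print(len(scores))          # possible odd-n medians
print(len(even_n_medians))  # possible medians when n is even
```

(The multiples of 2.5 and 1.25 are exactly representable as floats, so the set comparison here is safe.)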
Historically, the typical practice has been to compute SUS means rather than medians, even though the distribution of the SUS is known to be asymmetrical. Thus, the methods developed over decades to interpret SUS scores are based on means, not medians.
We would never recommend relying on the median of an individual rating scale as a measure of central tendency, but we were curious about the typical difference between the means and medians of datasets collected with the SUS. Fortunately, we have a LOT of SUS data to answer this question.
We compiled a large set of SUS data with 18,853 individual SUS scores, assessing 210 products and services, studied in 2010 through 2022 (primarily business and consumer software products, websites, and mobile apps). Figure 1 shows the expected asymmetric distribution of the data.
The distribution is left-skewed (with a number of atypically low SUS scores on the left). Spikes are at values of 50, 75, and 100, but not at 0 or 25. The spike at 50 was somewhat expected because one way to get a score of 50 is to rate each SUS item with the same response option. There is always a concern that this straightlining might indicate a respondent who isn’t carefully considering each item. This concern is partially (but not totally) alleviated by the differences in the number of 75s and 100s compared to 0s and 25s. For example, the only pattern that produces a score of 100 is alternating selections of 5 for odd-numbered items and 1 for even-numbered items, and the only way to get 0 is to reverse that pattern. If most respondents were selecting patterns at random, then the number of 100s and 0s would be similar—but they aren’t. Despite this, researchers should be suspicious enough of scores of 50 to investigate other aspects of those respondents’ behaviors to see whether the data should be retained or excluded from analysis. For the following analyses, we retained all the data but focused on analysis at the product rather than the individual level.
Figure 2 shows the scatterplot of the SUS means and medians from all 210 products in the dataset. As expected, the means and medians had a strong linear relationship (r(208) = .97, p < .0001)—an almost perfect correlation.
Figure 3 shows the average difference (with 95% confidence intervals) between means and medians for all data (n = 210) and the data split between studies with relatively low sample sizes (n < 30, 61 products with n ranging from 5 to 26) and those with larger sample sizes (n ≥ 30, 149 products with n ranging from 30 to 1,969).
Inspection of the confidence intervals in Figure 3 shows that, on average, SUS medians were about 2 points higher than SUS means, and there was no significant difference due to sample size. The standard deviation of the difference was larger when n < 30 (4.3 versus 2.5), reflected in the slightly larger range of the confidence interval.
For all products and the sample size splits, the lower limit of the confidence intervals was higher than 0, indicating that the median–mean difference was statistically significant (i.e., a difference of 0 is not plausible). Using the data from all products as the most precise estimate of the difference (smallest confidence interval), the confidence interval ranged from 1.7 to 2.5. This means that a median–mean difference of 2.0 is plausible, but differences lower than 1.7 or higher than 2.5 are not.
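The confidence intervals in Figure 3 are the usual mean ± margin-of-error computation. A sketch using the normal approximation (reasonable at these sample sizes) and illustrative summary numbers, not the actual product-level data:

```python
from statistics import NormalDist

def mean_ci(m, sd, n, conf=0.95):
    """Normal-approximation confidence interval around a sample mean."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # ~1.96 for 95% confidence
    margin = z * sd / n ** 0.5
    return m - margin, m + margin

# Illustrative values: a mean difference near 2.1, sd near 2.9, 210 products.
low, high = mean_ci(2.1, 2.9, 210)
```

Because the lower limit of such an interval stays above 0, a true difference of 0 can be ruled out at the 95% confidence level.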
Perhaps there’s a difference when sample sizes are smaller? To explore the median–mean difference with smaller sample sizes, we focused on the 149 products with at least 30 SUS scores. For each of these products, we assigned a random number to each respondent and then sorted respondents by random number. We created two new datasets, with one containing the first ten randomly selected participants and the other containing the next 20 randomly selected participants.
Figure 4 shows the scatterplots for each new dataset. Consistent with the full dataset shown in Figure 2, the means and medians for both datasets have a strong linear relationship (n = 10: r(147) = .94, p < .0001; n = 20: r(147) = .95, p < .0001). The spread of points in the two graphs is consistent with their standard deviations (n = 10: 4.2; n = 20: 3.6), with slightly more spread for the smaller sample size.
Figure 5 shows the average difference (with 95% confidence intervals) between means and medians for the two datasets.
The results in Figure 5 are consistent with those in Figure 3. SUS medians were about 2 points higher than the means with no significant difference due to sample size. The standard deviation of the difference when n = 10 (4.0) was slightly higher than when n = 20 (3.5).
Consistent with the analysis of all the data, the lower limit of the confidence intervals was higher than 0, indicating a statistically significant median–mean difference. When n = 10, the 95% confidence interval around the difference ranged from 1.3 to 2.6; when n = 20, it ranged from 1.6 to 2.7, both just a bit wider than the best (full-data) estimate of 1.7 to 2.5.
We compared the means and medians of SUS scores from 18,853 individuals who used the SUS to rate the perceived usability of 210 products and services. Our key findings and conclusions are:
We recommend using the mean rather than the median. As described in more detail below, the medians of SUS distributions are slightly but consistently higher than the means, so researchers who use the various tools developed over the past few decades to interpret the SUS would slightly but consistently overestimate the quality of the user experience if they reported the median rather than the mean.
There was a statistically significant but small two-point difference between the means and medians. When a distribution is left-skewed, you expect the median to be larger than the mean because the extremely low scores exert some pull on the location of the mean. For these data, the median was typically about 2 points higher than the mean (between 1.7 and 2.5 with 95% confidence).
The difference between means and medians was consistent across different sample sizes. We estimated the difference using all data for all products, splitting the products into two groups with one containing the products that were assessed with n < 30 and the other with n ≥ 30, and randomly selecting individual participants from products with n ≥ 30 to assess sample sizes of 10 and 20. All analyses had average median–mean differences of about 2 with similar ranges of 95% confidence intervals.
The observed median–mean difference was consistent with the gaps between the possible medians of the SUS. An individual SUS score can take 41 values in 2.5-point increments from 0 to 100. When a sample size is odd, the median is restricted to these values. When a sample size is even, the median can, in some cases, also take 40 values midway between the 2.5-point increments. Of the 210 products in our dataset, 123 (59%) had an even sample size and 87 (41%) had an odd sample size. It is interesting that the observed difference of about 2 points is between the smaller increment of 1.25 points and the larger increment of 2.5 points. So:
This range (1.8 to 2.5) is remarkably close to our best estimate of the plausible range of the median–mean difference (1.7 to 2.5). It could be a coincidence, but it’s still intriguing.
The difference between SUS means and medians is small, but using the median could sometimes be problematic when using existing methods of interpreting the SUS. Given the typical distribution of the SUS, the median will almost always be greater than the mean—by our estimates, usually about 2 points higher. In most cases, this is close enough that the interpretation of both measures of central tendency will be the same. There might be differences, however, at interpretive boundaries (e.g., between B and A on a curved grading scale, or between Excellent and Best Imaginable on an adjective scale like those shown in Figure 6).
In practice, the answer is based on both statistics AND logistics.
A statistical formula will tell you an optimal number to select. But the real-world logistical constraints of budgets, recruiting challenges, and time will often dictate the maximum number of participants you can test with.
In our earlier article, we described the sample size formula for problem discovery studies and how two parameters (likelihood of a problem and problem occurrence) impact the sample size.
In our experience, these logistical constraints lead research teams to set aside a specific number of days and a specific budget to run studies. The problem discovery formula may suggest testing 18 participants, but if you have only two days to collect data, your maximum sample size may be only ten. What do you get with ten? And then what if two participants don’t show up, and you have to toss out the data from another because of prototype issues, leaving you with 7?
A practical approach to handling sample size discussions is to flip the common question of “What sample size do I need?” to “What will I be able to detect given a specific sample size?”
In this article, we present a table (a kind of size chart for sample sizes for discovery studies) and walk through how to use it and the associated graphs to see what you can expect to get with different sample sizes for problem discovery studies.
To help UX researchers plan for a variety of discovery percentages and problem probabilities in formative usability studies, we created Table 1. There is a row in the table for each sample size from 1 to 25 and columns for different possible problem probabilities from 1% to 75%. The values in the table cells are discovery rates (the likelihood of observing the problem at least once) for each combination of sample size (n) and problem probability (p) computed using the formula 1 – (1 – p)^{n}. For easier lookups, all values are shown as percentages. Because 100% discovery is, strictly speaking, not possible, the discovery rates of 100% in the table mean that the expected percentage of discovery is at least 99.5% (and for high problem probabilities and large sample sizes, more like 99.99999%).
n | 1% | 5% | 10% | 15% | 25% | 30% | 50% | 75% |
---|---|---|---|---|---|---|---|---|
1 | 1% | 5% | 10% | 15% | 25% | 30% | 50% | 75% |
2 | 2% | 10% | 19% | 28% | 44% | 51% | 75% | 94% |
3 | 3% | 14% | 27% | 39% | 58% | 66% | 88% | 98% |
4 | 4% | 19% | 34% | 48% | 68% | 76% | 94% | 100% |
5 | 5% | 23% | 41% | 56% | 76% | 83% | 97% | 100% |
6 | 6% | 26% | 47% | 62% | 82% | 88% | 98% | 100% |
7 | 7% | 30% | 52% | 68% | 87% | 92% | 99% | 100% |
8 | 8% | 34% | 57% | 73% | 90% | 94% | 100% | 100% |
9 | 9% | 37% | 61% | 77% | 92% | 96% | 100% | 100% |
10 | 10% | 40% | 65% | 80% | 94% | 97% | 100% | 100% |
11 | 10% | 43% | 69% | 83% | 96% | 98% | 100% | 100% |
12 | 11% | 46% | 72% | 86% | 97% | 99% | 100% | 100% |
13 | 12% | 49% | 75% | 88% | 98% | 99% | 100% | 100% |
14 | 13% | 51% | 77% | 90% | 98% | 99% | 100% | 100% |
15 | 14% | 54% | 79% | 91% | 99% | 100% | 100% | 100% |
16 | 15% | 56% | 81% | 93% | 99% | 100% | 100% | 100% |
17 | 16% | 58% | 83% | 94% | 99% | 100% | 100% | 100% |
18 | 17% | 60% | 85% | 95% | 99% | 100% | 100% | 100% |
19 | 17% | 62% | 86% | 95% | 100% | 100% | 100% | 100% |
20 | 18% | 64% | 88% | 96% | 100% | 100% | 100% | 100% |
21 | 19% | 66% | 89% | 97% | 100% | 100% | 100% | 100% |
22 | 20% | 68% | 90% | 97% | 100% | 100% | 100% | 100% |
23 | 21% | 69% | 91% | 98% | 100% | 100% | 100% | 100% |
24 | 21% | 71% | 92% | 98% | 100% | 100% | 100% | 100% |
25 | 22% | 72% | 93% | 98% | 100% | 100% | 100% | 100% |
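Every cell in Table 1 comes from the formula 1 − (1 − p)^n. For instance, the n = 10 row can be regenerated in a few lines:

```python
def discovery_rate(p, n):
    """Chance of observing, at least once, a problem with occurrence
    probability p in a sample of n participants."""
    return 1 - (1 - p) ** n

# Reproduce the n = 10 row of Table 1, rounded to whole percentages:
problem_probs = [0.01, 0.05, 0.10, 0.15, 0.25, 0.30, 0.50, 0.75]
row_10 = [round(100 * discovery_rate(p, 10)) for p in problem_probs]
print(row_10)  # → [10, 40, 65, 80, 94, 97, 100, 100]
```

The "100%" cells are rounding: for p = 50% and n = 10 the exact rate is 1 − 0.5^10 ≈ 99.9%, which is why the table's 100% entries mean "at least 99.5%."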
There are two ways to use Table 1:
Returning to the question posed at the beginning of the article, what do you expect to get with a sample size of ten or a sample size of seven?
We start in the “n” column and go down till we find 10. That’s the given sample size. We then go across the columns, inspecting the expected discovery rates from the very low problem probability of 1% to the very high probability of 75%. The first discovery rate in the “10” row is 10%. This is low and means we’ll have only a 10% chance of seeing problems that affect 1% of the population at a sample size of 10 (in other words, with ten participants we can expect to discover about 10% of problems that have a 1% probability of occurrence). It’s not a 0% chance (yes, we’re saying there’s a chance), but those are low numbers to count on. Moving to the right, we get to 97% discovery for problems that affect 30% or more of the user population. That means at a sample size of ten, we’ll have a great chance (97% or higher) of seeing the relatively common problems (something that affects about a third or more of all users).
Repeating this now with seven participants shows that our chance of detecting problems (if they exist) that affect 30% of the population drops to 92%. Still, that’s a good chance of seeing these and any more likely problems. Where we had a good chance of seeing problems with a 15% probability with ten participants (80% likelihood of discovery), with only seven participants, the discovery likelihood drops to 68%.
Another interesting pattern the table shows is that for rare problems affecting only 1% of the population, the sample size and chance of detecting these 1%ers track closely. At ten users, you’ll have about a 10% chance of detecting them, a 1% chance with one participant, and about a 22% chance with 25 participants. Not shown in the table, at 100 participants you’d expect to discover about 63% of them. It’s hard to uncover uncommon problems with small sample sizes.
To get started, you need to make two decisions:
For example, suppose you’ve decided to focus on discovering problems that will happen to at least 15% of the population of interest (the problem detection probability), and the desired likelihood of discovery is at least 80%.
The smallest sample size that meets those criteria is n = 10. From Table 1, what you can expect with ten participants is:
This means that with ten participants, you can be reasonably confident that the study, within the limits of its tasks and population of participants (which establish what problems are available for discovery), is almost certain (> 90% likely) to reveal problems for which the problem detection frequency is at least 25%. As planned, the likelihood of discovery of problems with a detection probability of 15% is 80%.
For problems with a detection probability of less than 15%, the rate of discovery will be lower but will not be 0 when n = 10. For example, the expectation is that you will find about 65% of problems for which the detection probability is 10%, and about 40% of the problems available for discovery whose detection probability is 5%. You would even expect to detect 10% of the problems with a detection probability of just 1%. That’s not a bad haul for a small-sample qualitative study.
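The planning question can also be inverted algebraically: solving 1 − (1 − p)^n ≥ d for n gives n = ⌈ln(1 − d) / ln(1 − p)⌉, which finds the smallest sample size directly instead of scanning the table. A sketch:

```python
from math import ceil, log

def min_sample_size(p, goal):
    """Smallest n whose discovery rate 1 - (1 - p)^n reaches the goal."""
    return ceil(log(1 - goal) / log(1 - p))

print(min_sample_size(0.15, 0.80))  # → 10, matching the worked example above
print(min_sample_size(0.30, 0.95))  # → 9 for 95% discovery of 30% problems
```

This is the same formula that produces Table 1, just solved for n rather than for the discovery rate.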
Another way to use the table is to scan from left to right to see whether the patterns of discovery rates for any of the sample sizes are acceptable, but scanning is easier when the data in the table are presented in graphs.
Figures 1 and 2 show different ways to depict the information in Table 1.
Figure 1 shows the trajectory of problem discovery for different problem probabilities for each sample size from 1 to 25. The trajectories are almost linear for the lowest probabilities (1% and 5%) and dramatically nonlinear for the highest probabilities (50% and 75%). The speed with which high-probability problems approach discovery rates over 95% is why we didn’t include any probabilities higher than 75% in the table. A problem with a probability of occurrence over 80% has at least a 96% chance of being discovered with just two participants.
The box in Figure 1 shows the expected discovery rates for each of the different problem probabilities when n = 10, matching the extended example presented in Table 1.
Figure 2 shows the trajectory of problem discovery for different sample sizes (5, 10, 15, 20, and 25) for various problem probabilities (1%, 5%, 10%, 15%, 25%, 30%, 50%, and 75%).
The trajectories are similar to those shown in Figure 1, becoming less linear and smoother as the sample size increases from 5 to 25. The distance between the lines illustrates the diminishing returns in discovery rates associated with increasing sample sizes. For example, increasing the sample size from 5 to 10 shows significant benefits in the middle of the range of problem probabilities, but the benefit achieved from increasing the sample size from 20 to 25 is much smaller.
One sample size doesn’t fit all research needs for problem discovery studies like formative usability studies. Fortunately, tabular and graphic aids can help UX researchers determine and justify sample sizes for these types of studies.
What do you get with a specific sample size in problem discovery studies? For each possible sample size, you are likely to observe (at least once) some of the problems that will happen to only a small percentage of the population of interest, more of the problems that affect a moderate percentage, and most of the problems that affect most of the population. The likelihood of discovery increases as the sample size increases but with diminishing returns.
What drives sample size decisions for problem discovery studies? The appropriate sample size for problem discovery studies depends on two factors—the smallest problem probability you wish to detect and the desired discovery rate. In other words, how rare of an event do you need to be able to detect at least once, and what percentage of those events do you need to discover in the study?
What decision aids are available to guide sample size decisions for problem discovery studies? You can use the table and graphs presented in this article to understand what you can expect to get with different sample sizes for problem discovery studies. This can be useful for initial sample size planning and understanding the consequences of events that lead to the reduction of the initially planned sample size.
Technical note: Some early approaches to sample size decisions for formative usability studies relied on the average observed value of p across a group of discovered problems. This approach does not compute the variability of the mean of p. Also, estimates of mean p from samples consistently overestimate the actual likely value of p. There are some complex mathematical approaches to deal with these issues, but the method we describe in this article avoids the issues because it does not require estimating an average value of p.
Good intentions? Because someone influential said to use it online?
A measure is valid if it can be demonstrated that it measures what it is intended to measure, has the expected alignment of items with factors, and has the expected statistical relationships with other metrics. Its usage also depends on its practicality.
So how do you demonstrate validity? It takes data and disclosure.
At MeasuringU, we originally benchmarked websites using the SUS. Enough data were publicly available that we could generate percentile rankings from raw SUS scores that made the perceived usability data more interpretable.
But we knew that the quality of the website user experience was more than just usability.
We started to develop what’s come to be known as the Standardized User Experience Percentile Rank Questionnaire (SUPR-Q^{®}) in 2011 and published our findings in 2015.
The SUPR-Q is a short (eight-item) questionnaire that measures perceptions of Usability, Trust, Appearance, and Loyalty for websites. The combined score provides an overall measure of the quality of the website user experience.
We wanted to maintain the percentile ranking we had built from the SUS data, so the SUPR-Q also provides relative rankings expressed as percentiles. A SUPR-Q percentile score of 50 is average (roughly half the websites evaluated in the past with the SUPR-Q have received better scores and half received worse). The normative database contains responses from more than 10,000 participants and 150 websites (updated on an ongoing basis, about once per quarter). Its compactness and normed database made it practical, but we needed to show it also had strong psychometric properties.
During its development, the final version of the SUPR-Q was informed by psychometric analysis of over 4,000 responses across 100 website experiences. Iterative item selection led to an efficient questionnaire with two items per construct with validity established using exploratory factor analysis and acceptable levels of reliability (coefficient α > .70) for the overall and most subscales (Overall: α = .86, Usability: α = .88, Trust: α = .85, Appearance: α = .78, Loyalty: α = .64). In a study of 40 websites (n = 2,513), the global SUPR-Q and its subscales discriminated well between the poorest and highest quality websites, providing evidence of its sensitivity.
In this article, we report the results of a confirmatory factor analysis (CFA) to validate the SUPR-Q questionnaire and a multiple regression analysis of the basic SUPR-Q measurement model (how well the Usability, Trust, and Appearance metrics account for variation in the Loyalty metric).
Shown in Figure 1, the SUPR-Q measures four website UX factors with eight questions: Usability (easy to use, easy to navigate), Trust (trustworthy, credible), Appearance (attractive, clean, and simple), and Loyalty (likelihood to revisit, likelihood to recommend). The item scores for each subscale are the averages of the two items (after dividing the 0–10-point LTR rating by 2). The overall scale is the average of the four subscales.
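The scoring rules in the preceding paragraph can be sketched in a few lines; the item names and ratings below are illustrative, not from our data:

```python
# Compute SUPR-Q subscale and overall scores from the eight item ratings.
# Ratings are illustrative. All items use 1-5 scales except likelihood to
# recommend (LTR), which uses the 0-10 NPS scale and is divided by 2.
ratings = {
    "easy_to_use": 4, "easy_to_navigate": 5,   # Usability
    "trustworthy": 4, "credible": 4,           # Trust
    "attractive": 3, "clean_and_simple": 4,    # Appearance
    "likely_to_return": 4, "ltr_0_to_10": 9,   # Loyalty
}

usability = (ratings["easy_to_use"] + ratings["easy_to_navigate"]) / 2
trust = (ratings["trustworthy"] + ratings["credible"]) / 2
appearance = (ratings["attractive"] + ratings["clean_and_simple"]) / 2
loyalty = (ratings["likely_to_return"] + ratings["ltr_0_to_10"] / 2) / 2
overall = (usability + trust + appearance + loyalty) / 4

print(usability, trust, appearance, loyalty, overall)  # 4.5 4.0 3.5 4.25 4.0625
```

The overall score is then converted to a percentile rank against the normative database.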
How did we know the eight items we selected measured our intended constructs? We used a statistical technique called exploratory factor analysis (EFA). This approach shows how well the data we collect (what we can observe) measure what we can’t see but want to measure (e.g., usability, loyalty). As a measure is used and more data are collected, it’s good practice to show that the original factors still provide good measures of the constructs.
Now that we have used the SUPR-Q for over a decade, we decided to conduct a confirmatory factor analysis (CFA). As indicated in their names, researchers use EFA in the early stages of research to explore different plausible factor structures (e.g., items to retain, number of factors), then use CFA on an independent set of data to assess the model fit of the most promising factor structure found during EFA.
There are many ways to assess the quality of fit of a CFA model. We focused on the combination of Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), and Bayesian Information Criterion (BIC). There are guidelines for good levels of model fit for CFI (> 0.90) and RMSEA (< 0.08), but not for BIC, which is used to compare models (smaller is better).
For this analysis, we used SUPR-Q data from eight retrospective consumer surveys conducted between April 2022 and January 2023. Each survey targeted a specific sector, and, in total, we collected 2,761 responses to questions about the UX of 57 websites. The sample had roughly equal representation of gender and age (split at 35 years old). Table 1 shows the participant gender and age for each survey, with sector names linking to articles with more information about each survey (including the websites selected for the sectors).
Sector | n | Date | Websites | Female (%) | Male (%) | Under 35 (%) | 35 or older (%) |
---|---|---|---|---|---|---|---|
Real Estate | 269 | Apr 2022 | 5 | 48 | 51 | 48 | 52 |
Travel Aggregator | 452 | May 2022 | 9 | 48 | 51 | 48 | 52 |
Business Info | 183 | Jul 2022 | 3 | 46 | 53 | 42 | 58 |
Domestic Air | 350 | May 2022 | 7 | 48 | 49 | 58 | 42 |
International Air | 200 | May 2022 | 5 | 53 | 46 | 61 | 39 |
Ticketing | 234 | Jun 2022 | 5 | 45 | 52 | 40 | 60 |
Clothing | 550 | Dec 2022 | 13 | 52 | 45 | 48 | 52 |
Wireless | 523 | Jan 2023 | 10 | 47 | 50 | 40 | 60 |
Overall | 2,761 | – | 57 | 49 | 49 | 48 | 52 |
The eight surveys shown in Table 1 were retrospective studies of the UX of websites in their respective sectors. Some survey content differed according to the nature of the sector being investigated, but all surveys included the SUPR-Q and basic demographic items. For each survey, we conducted screeners to identify respondents who had used one or more of the target websites within the past year, then invited those respondents to rate one website with which they had prior experience. On average, respondents completed the surveys in 10–15 minutes (there was no time limit).
Figure 2 shows the results of the CFA. The loadings (link weights) for each item with respective factors were very strong (from .74 to .89) and statistically significant (p < .0001). The model had excellent fit statistics (CFI: .993, RMSEA: .05, BIC: 284.6). The reliability of the overall and all subscales exceeded .70 (Overall: α = .90, Usability: α = .88, Trust: α = .87, Appearance: α = .80, Loyalty: α = .73).
When we developed the SUPR-Q model, we knew from both the published literature and our own data that the four factors (Usability, Trust, Appearance, and Loyalty) were correlated. Correlation, of course, does not mean causation, and it can be difficult to disentangle causal direction without controlled experimental manipulation. However, some work has shown that attitudes toward usability affect attitudes toward appearance. We had reason to believe that UX quality and its components affect intent to use and likelihood to recommend (Loyalty).
In addition to its usefulness as a single measure of the UX of websites, the components of the SUPR-Q can be used in a framework in which Usability, Trust, and Appearance predict (are antecedents of) Loyalty. The model shown in Figure 3 is based on the data set described in Table 1. Values on the links from Usability, Trust, and Appearance to Loyalty are multiple regression beta weights (all statistically significant with p < .0001, beta weights ranging from .22 to .32), with the three predictors accounting for almost half (46%) of the variation in Loyalty—a highly significant model.
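With a correlation matrix in hand, standardized beta weights like those in Figure 3 are the solution to a small linear system: the betas satisfy Rxx · beta = rxy, and R² is the dot product of the betas with the predictor-criterion correlations. A sketch using a hypothetical correlation matrix (illustrative values only, not our actual data):

```python
# Standardized regression weights from a correlation matrix, solved with
# Cramer's rule (no external dependencies needed for a 3x3 system).
def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)

def solve3(m, v):
    d = det3(m)
    out = []
    for j in range(3):
        mj = [row[:] for row in m]
        for i in range(3):
            mj[i][j] = v[i]
        out.append(det3(mj) / d)
    return out

# Hypothetical intercorrelations of Usability, Trust, Appearance (Rxx)
# and their correlations with Loyalty (rxy) -- illustrative values only.
Rxx = [[1.00, 0.55, 0.60],
       [0.55, 1.00, 0.50],
       [0.60, 0.50, 1.00]]
rxy = [0.55, 0.58, 0.54]

betas = solve3(Rxx, rxy)                          # standardized beta weights
r_squared = sum(b * r for b, r in zip(betas, rxy))  # variance accounted for
print([round(b, 2) for b in betas], round(r_squared, 2))
```

With these made-up correlations, the three predictors account for roughly 45% of the variance in the criterion, close in spirit to the 46% reported for the actual model.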
In the future, we plan to use this basic framework to model additional consequences, such as brand attitude and additional antecedents such as perceived clutter and usefulness.
The SUPR-Q has over a decade of data and usage. Our psychometric analyses (CFA and regression model) of the basic SUPR-Q model using data from retrospective studies of eight sectors (n = 2,761 across 57 websites) found:
The SUPR-Q exhibits strong evidence of validity. The results of the CFA showed that the SUPR-Q items had strong fit. All links between items and their respective factors were statistically significant and strong (ranging from .74 to .89, all p < .0001). The fit statistics were excellent (CFI: .993, RMSEA: .05, BIC: 284.6). These findings strongly support the construct validity of the SUPR-Q.
The SUPR-Q exhibits acceptable to good reliability. For these analyses, the SUPR-Q scale reliabilities, assessed with coefficient alpha, all exceeded the commonly used criterion of .70 (Overall: α = .90, Usability: α = .88, Trust: α = .87, Appearance: α = .80, Loyalty: α = .73). These estimates of reliability were very close to those reported in the original SUPR-Q publication, but this time the estimate for Loyalty, originally .64, was .73. Some analysts have suggested that the Spearman-Brown method provides better reliability estimates than coefficient alpha when scales have just two items, but there were no meaningful differences in the reliability estimates for these data.
SUPR-Q components predict loyalty. The antecedents in the basic SUPR-Q measurement model account for almost half of the variation in Loyalty. All three antecedents are significant key drivers of Loyalty, with beta weights in roughly the same range (.22 for Usability, .26 for Appearance, and .32 for Trust), together accounting for 46% of the variation in Loyalty.
Bottom line: The basic SUPR-Q measurement model is psychometrically strong, making it an excellent starting point for investigating how its components relate to additional constructs such as brand attitude, usefulness, and perceived clutter.
Social media reflects and affects where we work, who we vote for, what we purchase, and what we do in our free time.
Since we last examined the social media space in 2018, social media has deviated from its original photo-sharing roots to incorporate short-form video, AI, gaming, dating, and shopping, which raises a question—is social media still social?
We explore this question and the user experience of social media apps in our latest benchmarking survey.
Our survey comes on the heels of what may become known as the “Social Media Olympics”: viewership of the international competition was up 77%, largely attributed to the loosened social media regulations that allowed athletes to share their experiences of the iconic games.
Social media remains a juggernaut in politics as well, being used heavily by candidates on all sides as things heat up before the 2024 presidential election. Democratic nominee Kamala Harris is leaning heavily into social media trends like “brat summer,” while the Republican nominee, former President Donald Trump, spoke with Elon Musk as millions of viewers tuned into the live feed on X, formerly Twitter.
Social media’s influence isn’t going unnoticed by governments, who are wary of its reach and influence. The European Union sent a warning letter to Elon Musk before his interview with Donald Trump, reminding him that X is subject to the EU’s Digital Services Act (DSA), the bloc’s relatively new law regulating illegal content and disinformation on large social media platforms. In the U.S., details concerning the U.S. government’s role in coercing platforms to censor information have emerged. There is bipartisan support for banning TikTok in the U.S. over concerns the Chinese government may use it to spy on U.S. citizens. Brazil recently banned X, fining its citizens for attempting to access it.
Artificial intelligence is unsurprisingly making its mark in the social media sphere, as fears of deepfakes and misattributed content begin to materialize. Some of the first large-scale deepfake scams have surfaced in recent months, with Prince William seemingly endorsing a crypto platform, and Elon Musk promoting a radical investment opportunity.
Social media titans like Meta are bolstering their response to this type of content through recent policy changes on digitally altered media, requiring new “AI info” labels.
It’s no surprise, then, that the landscape of social media has evolved quickly since our 2018 benchmark, encouraging us to update our benchmark data. We used MUiQ to benchmark the desktop website and app experience of six popular platforms: Facebook, Instagram, LinkedIn, Snapchat, TikTok, and X (formerly Twitter).
We recruited 324 participants in August 2024 to reflect on their most recent experience with one of the platforms they had used in the past year. While the sample size isn’t large relative to the billions (with a b!) of people who use social media, it’s large enough to detect patterns between platforms and to compare to our historical sample.
Participants in the studies answered the eight-item SUPR-Q (including the Net Promoter Score) and questions about their prior experience. In particular, we were interested in exploring the quality of the user experience, users’ reasoning behind social media use, the proliferation of AI, levels of trust, and social media’s impact on mood and self-esteem. (Full details are available in the downloadable report.)
Are social media apps still social? We asked respondents to select which activities they engage in while using the platform. A selection of activities is shown in Table 1. The full activity list is available in the downloadable report.
Platform | Stay informed about other people in your life | Keep in touch with friends | Watch short-form videos |
---|---|---|---|
FB | 66% | 79% | 38% |
IG | 55% | 43% | 57% |
LinkedIn | 37% | 19% | 4% |
Snapchat | 52% | 78% | 36% |
TikTok | 32% | 18% | 91% |
X | 55% | 43% | 57% |
Average | 49% | 46% | 48% |
While participants still report using social media to stay informed about people in their lives (49%) or keep up with friends (46%), platforms seem to be moving more toward impersonal content consumption. As shown in Table 1, across all platforms a substantial 48% of respondents reported watching short-form videos. This aligns with findings that users are primarily using social media for entertainment (60%) and distraction (44%).
There is, however, variation between apps. Facebook and Snapchat are still firmly planted in the “social” aspect of social media, with 79% of Facebook users and 78% of Snapchatters stating they use it for keeping in touch with friends.
Alternatively, Instagram and TikTok lean more towards content, with 91% of TikTok users and 57% of Instagram users reportedly using the app to watch short-form videos (Figure 1). This is an especially large jump for Instagram, which introduced Reels only four years ago in August 2020. In open-ended responses, users expressed dissatisfaction with the lack of posts from friends or non-influencers.
X is by far the platform used most for staying informed about the news (68%), with other platforms like Facebook and Snapchat landing significantly lower at 17% and 10% respectively. While X is more popular for news consumption, respondents mentioned particular dissatisfaction with misinformation, political content, and general negative social interactions. While there was a time when social media was used to promote political discourse, users reported that all social media sites are now seen as a suboptimal medium for expressing political opinions (only 9% reported using it for this purpose). Meta, which once promoted this type of content, has deliberately deprioritized political content.
Similar efforts may be driving low numbers across the board, with only 5% of Instagram users and 2% of LinkedIn users expressing political opinions.
TikTok users rated the platform high across many of the metrics we collected, including usability, preferred content, mental health impact, and brand attitude. Conversely, X scored low in all these categories.
Content is king, and given the high overall ratings, it’s not too surprising that TikTok’s content was rated as both more original and more relevant than other platforms. However, X’s content was rated almost as high in originality as TikTok’s, suggesting that X continues to fill a need.
The generally favorable attitudes toward TikTok likely make the case for banning TikTok in the U.S. harder.
Participants who used apps to access their social media completed a simplified version of the Standardized User Experience Percentile Rank Questionnaire for Mobile (SUPR-Qm) and the Net Promoter Score for mobile.
SUPR-Qm scores for the six social media apps hovered around average (53 on a 100-point scale). TikTok was the only app to significantly surpass the others, with the highest SUPR-Qm score (63), while X fell below average for social media apps with the lowest score (47).
Mirroring the SUPR-Qm results, TikTok led the pack with an NPS of 23%, while X fell significantly behind at −51%.
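As a reminder, the NPS is the percentage of promoters (ratings of 9–10 on the 0–10 likelihood-to-recommend item) minus the percentage of detractors (ratings of 0–6). A minimal sketch with made-up ratings:

```python
# Net Promoter Score: % promoters (9-10) minus % detractors (0-6)
# on the 0-10 likelihood-to-recommend item. Ratings below are made up.
def nps(ratings):
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)

ratings = [10, 9, 8, 7, 7, 6, 5, 3, 2, 0]  # hypothetical respondents
print(nps(ratings))  # 2 promoters - 5 detractors out of 10 -> -30.0
```

A platform with many passives (7–8) and a long tail of detractors, like most of the apps here, ends up deeply negative even when usage is high.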
The average NPS for all apps was a paltry −21%, with TikTok having the only positive Net Promoter Score. Why do apps that are so ubiquitously used have such low Net Promoter Scores?
An examination of the open-ended responses suggests that their ubiquity might be why users do not feel compelled to recommend:
“Most phones have it already.” — Facebook user
“All my friends already use it.” — Snapchat user
There were also concerns about the negative societal effects of social media:
“I don’t think anyone should start using Instagram. It’s best to spend as little time on social media as possible.” — Instagram user
“I don’t like using it if I don’t have to so I wouldn’t recommend it.” — X user
“I mostly dislike LinkedIn for what it stands for and not how it functions.” —LinkedIn user
Most people in our sample reported accessing social media via mobile app, though a smaller contingent still uses the desktop, especially for LinkedIn. Of the respondents, 92% reported using the mobile app, whereas only 58% used the desktop website. This lower usage on desktop may explain some of the low SUPR-Q scores, which averaged in the 4th percentile. Websites generally had lower usability ratings, with SUS scores hovering around the historical average of 68.
LinkedIn participants reported struggling with the app’s navigation and usability. Many reported difficulties accomplishing basic tasks, including searching for jobs, sending messages, and updating profile information. TikTok participants also struggled with video lagging, bugs, and technical issues.
In the age of artificial intelligence, the ability to morph and distort physical, audible, and written information means such content is skyrocketing in popularity (Figure 2). It’s no surprise then that these drastic changes are hurtling into our news feeds as quickly as they’re developing in real time.
More than half of respondents reported seeing AI-generated content (58%) and misleading generated content (57%), and about a third reported seeing deepfakes (32%).
The highest percentage of misleading generated content was on Facebook (70%) and Instagram (66%). AI-generated content was reported as most prevalent on TikTok (72%) and X (79%).
LinkedIn and Snapchat had the lowest reports of GenAI content, with 52% and 42% respectively reporting no experience seeing GenAI content, deepfakes, or bots. This is perhaps due to the professional nature of LinkedIn and the close-friend, ephemeral nature of Snapchat.
While most users are encountering AI-generated content on social media, the majority of respondents (76%) reported not using AI to generate content themselves. The largest percentage of people using AI to generate content was on Snapchat (20%), likely due to its wide array of AI-powered offerings such as AI snaps, AI captions, and AI profile backgrounds.
Users have long struggled with the push-and-pull of a trust/distrust relationship with social media. From ongoing legislative battles to docudramas like The Social Dilemma, users have questioned how much personal information to divulge to these platforms.
Table 2 shows that while respondents don’t necessarily think the government should intervene, they believe most platforms besides LinkedIn should regulate themselves more than they currently do. Interestingly, TikTok had the lowest percentage of users agreeing that the government should regulate the platform.
Respondents also emphasized across the board that user accounts should be banned if they incite violence.
Platform | Govt Should Regulate | Self-Regulate | Regulated More | Ban Violent Accounts |
---|---|---|---|---|
FB | 25% | 49% | 45% | 83% |
IG | 20% | 46% | 38% | 75% |
TikTok | 14% | 51% | 39% | 89% |
X | 23% | 63% | 54% | 84% |
Snapchat | 18% | 50% | 42% | 92% |
LinkedIn | 15% | 19% | 13% | 88% |
Average | 19% | 47% | 39% | 85% |
Table 3 shows the percentage of respondents who agreed with various statements about the platforms. The fewest users reported trusting the content on Facebook (11%) and X (13%). Trust scores have been low since we began benchmarking social media platforms in 2012. Facebook has the highest proportion of users who perceive a significant risk of getting scammed (66%), perhaps due to Facebook Marketplace, the classified ad section of the platform. Likewise, few users trust Facebook (0%), LinkedIn (8%), or X (9%) with their credit card number. A surprisingly large number (40%) of users believe that Snapchat is being used to facilitate illegal activities. (Full details are available in the downloadable report.)
Platform | Giving Credit Card | Content is Trustworthy | Scam Risk | Illegal Activities |
---|---|---|---|---|
FB | 0% | 11% | 66% | 34% |
IG | 14% | 18% | 41% | 27% |
TikTok | 16% | 32% | 47% | 12% |
X | 9% | 13% | 48% | 32% |
Snapchat | 12% | 16% | 46% | 40% |
LinkedIn | 8% | 48% | 35% | 6% |
Average | 10% | 23% | 47% | 25% |
With concerns about trust come questions about parental control over social media, a topic hotly debated nationwide. Respondents generally felt that teenagers are the youngest demographic that should have access to social media, with or without parental consent. Most reported that the appropriate age to access social media without parental consent is 16 (34%) or 18 (27%), suggesting little support for younger children using social media unsupervised.
Social media is often blamed for poor mental health, but our analysis suggests a more nuanced picture of how people believe social media affects their mood and self-esteem. Figure 3 shows that respondents were, on balance, more likely to say that social media has a positive rather than a negative effect on their mood. TikTok users reported positive effects nearly four times as often as negative ones. On the other hand, X participants were almost three times as likely to report that using the platform had a negative rather than a positive effect on mood (34% vs. 13%). Is the bad mood on X due to Musk’s takeover? Our 2018 analysis suggests it may be more than just new-owner headaches. In 2018, Twitter also had the worst negative mood rating (24%), and by 2024 the percentage of users saying it worsened their mood had risen to 34%, a movement in the wrong direction. It’s likely that the type of content, such as political discourse and the platform’s dialog-based format, has a large impact on how users feel its usage affects their mood.
While social media companies could improve their platforms for users in a few distinct ways, it seems that much of the negative experience is driven by the users themselves. Social media acts as both a mirror and a megaphone, reflecting and amplifying the world around us. So, perhaps the platforms are not inherently bad, but rather a reflection of who and where we are today.
Our analysis of the user experience of six social media platforms found:
For more details, see the downloadable report.
How many times have you heard that question?
How many different answers have you heard?
After you sift through the non-helpful ones, probably the most common answer you’ve heard is five. You might have also heard that these “magic 5” users can uncover 85% of a product’s usability issues. Is that true? Are five enough, too few, or too many?
How can you know? Can you really know?
Or are we just resigned to hearing the most dogmatic voices on social media? What are the alternatives?
Perhaps we should average the advice of others or make our lives easier by sidestepping the question altogether.
We’ve seen both approaches taken. But is there a better way to find sample sizes?
And is there a single sample size that is right for all usability studies?
You probably know the answer: One sample size does not fit all studies. Not much of a surprise there. But there is a way to get to a sample size that doesn’t involve democracy or demagoguery.
The first step in finding a sample size is to define the study type. For the purposes of sample size estimation, there are three types of usability studies: Problem Discovery, Estimation, and Comparison (Table 1).
# | Type | Purpose | Example | Formative or Summative |
---|---|---|---|---|
1 | Problem Discovery | Finding Problems and/or Insights | What are the usability problems for the check-out flow? | Formative |
2 | Estimation | Estimating a Value/Parameter | What is the SUS score for all users of a product? | Summative |
3 | Comparison | Making a Comparison | Is there a difference in SUS scores or is the score above average? | Summative |
In contrast to the focus on measurements taken during summative user research (study types 2 and 3), the goal of problem discovery usability studies (type 1) is to discover and enumerate the problems that users have when performing tasks with a product. It’s considered a formative type of evaluation.
So, what’s the sample size for each study type? 5, 50, 100?
While defining the study type helps narrow the proper approach to sample size estimation, it still doesn’t warrant recommending one number. Because there’s math involved, it’s understandable that people seek a simple single number. We’ve been trained to find a single answer to simple math problems: 2+2 always equals 4. The square root of 9 is always 3. The answer is determined because there aren’t any variables—life is great!
As soon as you introduce variables, however, things get more complicated. The hypotenuse of a right triangle is always equal to the square root of the sum of the squares of the other two sides (a² + b² = c²), but the actual length of the hypotenuse depends on the lengths of those sides.
The methods for finding sample sizes for summative studies are typically taught in university statistics classes. Those methods include several variables whose values can differ from study to study, including alpha and beta decision criteria (which control the long-run probability of Type I and Type II errors), the standard deviation of the metric, and the smallest difference that you need to detect to make the necessary decisions (i.e., the critical difference). Changing any of these variables will change the sample size needed to meet the requirements.
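As an illustration of how those variables interact, here is the standard normal-approximation formula for the sample size needed to compare two independent group means. This is a textbook formula, not one specific to this article, and the SUS standard deviation of 17.7 used in the example is an assumption:

```python
from statistics import NormalDist
from math import ceil

def n_per_group(sd, critical_diff, alpha=0.05, beta=0.20):
    """Approximate n per group for a two-sided, two-sample comparison of
    means using the normal approximation (a standard textbook formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Type I criterion
    z_beta = NormalDist().inv_cdf(1 - beta)        # Type II criterion (power)
    return ceil(2 * ((z_alpha + z_beta) * sd / critical_diff) ** 2)

# E.g., to detect a 10-point SUS difference, assuming sd = 17.7:
print(n_per_group(sd=17.7, critical_diff=10))  # 50 per group
```

Tightening alpha, increasing power, assuming a larger standard deviation, or needing to detect a smaller difference each drives the required sample size up, which is why no single number fits all summative studies.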
Problem discovery sample sizes use a less familiar approach. We’ve discussed in previous articles the mathematics commonly used to derive sample sizes for formative problem discovery usability studies and how well that math matches reality.
So, what is the formula for finding sample sizes for problem discovery studies?
While you don’t need to fully understand the derivation of the formula to use it, it helps to know what its elements represent. It has only two elements: n and p.
P (at least once) = 1 − (1 − p)^{n}
Here, p is how likely a problem (or event) is to occur in the tested population, and n is the sample size. The formula computes the probability of seeing the problem at least once in a formative usability study with n participants.
Technical note: We manipulated the binomial probability formula to get to 1 − (1 − p)^{n}, but there are other ways to arrive at this formula, including the Poisson probability formula and capture-recapture models.
The formula above computes the probability of detecting a problem given a sample size and its frequency in the population. It can be rearranged using algebra to solve for the sample size.
Because n is an exponent in the formula, it’s necessary to use logarithms to solve for the sample size instead of the probability of discovering the event of interest at least once. The resulting formula is:

n = ln(1 − P(at least once)) / ln(1 − p)
Don’t worry too much about the formula other than to note that it shows that the sample size for a discovery study is driven by the discovery goal (P(at least once)) and how likely an event is to happen during the discovery (p).
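Both directions of this relationship translate directly into a few lines of code (the function names are ours; rounding n up guarantees the discovery goal is met):

```python
from math import ceil, log

def p_at_least_once(p, n):
    """Probability of seeing an event of likelihood p at least once in n tries."""
    return 1 - (1 - p) ** n

def sample_size(goal, p):
    """Smallest n whose probability of at least one sighting meets the goal."""
    return ceil(log(1 - goal) / log(1 - p))

print(round(p_at_least_once(0.31, 5), 3))  # ~0.844 with 5 users and p = 31%
print(sample_size(0.85, 0.31))             # 6 (5.11 rounds up to meet the goal)
```

Note that strictly rounding up gives six participants rather than the traditional five; 5.11 is usually rounded down in informal discussions of the “magic number 5.”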
As mentioned above, in the best-known rule of thumb for usability study sample sizes, the “magic number 5,” the claim is that five participants are enough for the discovery of 85% of usability problems (strictly speaking, 85% of the problems that are available for discovery given the constraints of the study regarding the sampled population and tasks).
Nothing is inherently right or wrong with a discovery goal of 85%. It deviates from the more expected convention of 95% or 90% used in confidence intervals, but like a confidence level, the discovery goal can take any value from 1% to 99%. So, where did 85% originally come from?
Several early investigations into using these formulas to predict problem discovery rates as a function of sample size (e.g., Virzi, 1990; Nielsen & Landauer, 1993) reported finding that four or five participants discover 80–85% of the problems in large-sample usability studies. Over time, these findings became the simplified “magic number 5” rule.
An early test of the simple goal of 85% discovery was an economic ROI simulation published in 1994 (by Jim) that estimated the costs associated with running additional participants, fixing problems, and failing to discover problems in formative usability studies. Although all the independent variables influenced the sample size at the maximum ROI, the variable with the broadest influence was the average likelihood of problem discovery (p), which also had the strongest influence on the percentage of problems discovered at the maximum ROI. The results indicated that, when the target value of p is small (e.g., 10%), practitioners should plan to discover about 86% of the problems available for discovery in the study. When p is greater (e.g., 25–50%), the appropriate goal is about 98% discovery.
Things get trickier when estimating how often events of interest occur during a study. A common estimate of that likelihood is 31%. But where did that come from?
In the research Jakob Nielsen and Thomas Landauer published in 1993, which was the basis of their recommendation for running formative usability studies with five participants, the value they computed for the likelihood of problem occurrence was .31.
This was the average of the problem discovery rates reported in 11 usability studies they had conducted or had acquired from other researchers at the time (including one from Jim Lewis—see Figure 1 for the correspondence between Nielsen and Lewis in 1991). When they used their version of 1 − (1 − p)^{n} and graphed the expected percentage of discovery for sample sizes from 1 to 15 and p = 31%, their estimated discovery rate was 85% when n = 5.
If you plug .85 and .31 into the sample size formula, you get:
n = ln(1 − .85)/ln(1 − .31) = (−1.897)/(−0.371) = 5.11
So, math supports running five participants in a discovery study if (1) the discovery goal is 85% and (2) the probability of the occurrence of an event of interest is 31%. (You can also use our online calculator, which will do the math for you.)
But as mentioned above, one size does not fit all. What if, in your research context, you need to discover more or fewer than 85% of the events of interest, and what if their probability of occurrence is less or greater than 31%?
In those cases, you need a size chart, analogous to shopping for a men’s dress shirt to fit a given neck size and sleeve length (desired discovery rate and problem likelihood). We’ll publish that size chart in a future article.
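Until then, the formula makes it easy to generate a rough chart of your own; the grid of discovery goals and p values below is just an example:

```python
from math import ceil, log

def sample_size(goal, p):
    """Smallest n with probability >= goal of seeing an event of likelihood p."""
    return ceil(log(1 - goal) / log(1 - p))

goals = [0.85, 0.90, 0.95]        # discovery goals
probs = [0.05, 0.10, 0.31, 0.50]  # event likelihoods (p)

print("p \\ goal  " + "  ".join(f"{g:.0%}" for g in goals))
for p in probs:
    row = "  ".join(f"{sample_size(g, p):3d}" for g in goals)
    print(f"{p:>7.0%}   {row}")
```

For example, detecting a rare event (p = 5%) at least once with 95% confidence takes 59 participants, while the familiar 85%/31% combination needs only about five or six.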
How many participants do you need for a usability study?
It depends first on the study type. There are three study types—discovery, estimation, and comparison. In contrast to estimation and comparison studies, sample size estimation for discovery studies uses a different mathematical approach.
It still depends within study types. Don’t rely on averaging together recommendations or looking for a single number that will always work even when focusing within a study type such as discovery.
What about the “magic number 5”? The controversial claim based on the research of Nielsen and Landauer that “five is enough” turns out to sometimes be true, but only for a limited range of research contexts.
What about any other magic number? Because the appropriate sample size for discovery studies depends on two factors, no one magic number will be appropriate for all research contexts. In fact, there is no magic number for sample sizes for any type of usability study, formative or summative.
Use the formula for problem discovery. The problem discovery formula can be used to find the sample size based on expected problem occurrences (p) and the likelihood of seeing a problem at least once. You can also use the online calculator.
Parameters have defaults but should be changed when necessary to fit the research needs. The typical value for the discovery goal is 85%, but it can be increased or decreased depending on the context. The default of 31% for the probability of problem occurrence came from an average across datasets from the 1990s. It’s not a bad place to start, but it shouldn’t be the only value considered; values of 10%, 20%, or even 5% may make sense depending on how important it is to discover uncommon problems.
If there isn’t a magic number, should we give up on sample size estimation for formative usability studies? Giving up on magic numbers doesn’t mean you have to give up on sample size estimation for formative usability studies (or any other type of discovery study). You just need to be able to make decisions about (1) how rare of an event you need to be able to detect at least once and (2) what percentage of those events you need to discover in the study.
Bottom line: It would be nice if this process were simpler, but unfortunately, one sample size does not fit all research requirements. Fortunately, there is a mathematical model that can guide UX professionals to make reasoned decisions about sample size requirements for formative usability studies.
Clutter can make a space feel stressful and make it hard to find things.
But it’s not just your mother talking about clutter. We often use the same language to describe digital spaces like websites.
In our UX research practice, we have frequently encountered users and designers criticizing website interfaces for being cluttered and stakeholders who worry about the experiential and business consequences of a cluttered website.
But what exactly does it mean for a website to appear cluttered? Is the Wayfair home page (Figure 1) cluttered?
How about the JetBlue home page (Figure 2)?
It’s one thing to casually describe something with a word like clutter; it’s another thing to measure it. In this article, we describe our search for a way to quantify the perception of clutter on websites.
So, what do we mean when we say something is cluttered? Dictionary definitions of clutter tend to equate it with messiness or untidiness. As a transitive verb, the Merriam-Webster definition is “to fill or cover with scattered or disordered things that impede movement or reduce effectiveness” and, as a noun, “a crowded or confused mass or collection.” The Oxford Dictionary’s verb and noun definitions are, respectively, “to crowd (a place or space) with a disorderly assemblage of things” and “a crowded and confused assemblage.”
These definitions do not address two potential components of clutter. One component is the extent to which the disorganized objects are needed but should be better arranged (e.g., tools in a toolbox). The other is the extent to which some objects are unnecessary and should be discarded (e.g., old candy bar wrappers in a toolbox). This distinction is sometimes brought out in definitions of declutter (e.g., “to declutter is to tidy up a mess, especially by getting rid of objects … clean and organize a space”).
Even in this everyday sense, these two components of clutter suggest different decluttering strategies: reorganizing needed objects so they are easier to find and use, and discarding unnecessary objects.
There is a long history of defining and measuring clutter in user interface design, especially for mission-critical applications (e.g., aircraft cockpit displays), drawn from research in disciplines like human factors engineering and perceptual psychology. In most cases, this research focused on objective rather than subjective measurements of clutter.
Tom Tullis (1984) published an early review and analysis on how to objectively measure clutter in the monochrome (green screen) alphanumeric displays used in the 1970s and early 1980s. He identified four basic format characteristics: overall density, local density, grouping, and layout complexity.
He explored different ways to objectively measure these characteristics that, along with the reviewed literature, supported several key design recommendations, many of which remain surprisingly relevant for modern website design.
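As an illustrative sketch of the simplest of these measures, overall density can be computed as the percentage of non-blank character cells. This toy function is ours and omits the weighting schemes Tullis examined for measures like local density:

```python
def overall_density(screen):
    """Overall density: the percentage of character cells that are
    non-blank, for a screen given as a list of equal-length rows."""
    total = sum(len(row) for row in screen)
    filled = sum(1 for row in screen for ch in row if ch != " ")
    return 100.0 * filled / total

# A toy 4-row by 18-column alphanumeric display.
screen = [
    "NAME:  JONES      ",
    "DEPT:  SHIPPING   ",
    "                  ",
    "STATUS: ACTIVE    ",
]
print(overall_density(screen))  # 50.0
```

Local density would instead score each filled cell by how many of its neighbors are also filled, so two screens with the same overall density can differ sharply in local density.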
Using an objective approach to the measurement of clutter based on perceptual psychology, Rosenholtz et al. (2007) evaluated three metrics: feature congestion, subband entropy, and edge density.
They found that these three measures correlated with different empirical measures of search performance (e.g., searching for objects in cluttered maps or on cluttered desks). They also reported that color variability (the number of colors and how different they are) affected visual clutter, suggesting design recommendations such as limiting the number and variability of colors.
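Of these metrics, edge density is the easiest to approximate in a few lines. The sketch below is our simplified stand-in, thresholding gradient magnitude rather than running the full edge detector used in the published work:

```python
import numpy as np

def edge_density(gray, threshold=0.1):
    """Proportion of pixels whose local gradient magnitude exceeds a
    threshold, for a 2-D grayscale image with values in [0, 1]."""
    gy, gx = np.gradient(gray.astype(float))
    return float(np.mean(np.hypot(gx, gy) > threshold))

flat = np.zeros((32, 32))                                  # a blank screen
stripes = np.tile((np.arange(32) // 4 % 2).astype(float),  # bold vertical stripes
                  (32, 1))
print(edge_density(flat))     # 0.0
print(edge_density(stripes))  # 0.4375
```

The intuition carries over to websites: a page dense with borders, text, and images produces many high-gradient pixels and hence a high edge density.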
In the context of advanced cockpit displays, Kaber et al. (2008) developed a subjective clutter questionnaire. Their participants were four expert test pilots with experience using advanced heads-up displays (HUDs). They rated the clutter of images of a flight approach scenario depicting multiple display conditions. The initial version of the clutter questionnaire contained 14 semantic differential items gleaned from a literature review of display clutter (e.g., sparse/dense, monochromatic/colorful, empty/crowded, ungrouped/grouped). After each trial in the experiment, participants provided a single rating of overall clutter (20-point scale from “low clutter” to “high clutter”) and rated each of the 14 semantic differentials regarding their utility for describing clutter (20-point scale from “low” to “high”). Exploratory factor analysis indicated that the 14 items aligned with four components.
In a review of definitions and measurement of display clutter, Moacdieh and Sarter (2015) wrote, “Despite the widespread agreement on the harmful nature of ‘clutter,’ researchers have yet to reach consensus on a definition and a reliable way of manipulating and measuring the phenomenon.”
Their primary goal was to investigate the literature for definitions of clutter and for metrics of its effects on visual search performance. Common definitions involve display density (the number of entities on a screen), display layout (the arrangement, nature, and color of entities), target background/distractor similarity, task irrelevance (both essential and nonessential entities are displayed), and performance/attentional costs. Approaches to measurement include image processing, performance evaluation, eye tracking, and subjective evaluation (perceived clutter).
In the Moacdieh and Sarter (2015) review, most researchers who measured perceived clutter did so with a single rating of overall clutter. A notable exception was the questionnaire developed by Kaber et al. (2008).
Despite the clear value of the Kaber questionnaire in its intended context (professional pilots familiar with aircraft displays and associated technical terminology such as redundant/orthogonal), it does not seem well suited to assessing the perceived clutter of websites.
We now turn to the more familiar domain of website design. The term “clutter” seems to be part of the website design vernacular, evident in online articles that discuss how to declutter websites.
Even though typical user goals and behaviors with websites (e.g., browsing for information, making online purchases) differ from those of pilots using displays to land aircraft, many of the recommendations in these articles are consistent with the design guidance implied by clutter research in other domains.
We conducted a search of the peer-reviewed literature specifically targeting standardized questionnaires for the assessment of perceived website clutter, but there were no relevant results. We did, however, find relevant research in the fields of marketing and advertising regarding the extent to which online ads contribute to the perception of clutter on websites. This is a continuation of lines of research originally conducted on magazines and television (Speck & Elliott, 1997) in which a primary objective metric is the proportion of advertisements in the total space of a medium (Kim & Sundar, 2010).
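That objective metric, the share of a page’s space given over to ads, is simple to sketch. The function below is our illustration and assumes axis-aligned, non-overlapping ad regions measured in pixels:

```python
def ad_proportion(page_w, page_h, ad_boxes):
    """Fraction of page area covered by ads, where ad_boxes is a list
    of (width, height) pairs assumed not to overlap."""
    ad_area = sum(w * h for w, h in ad_boxes)
    return ad_area / (page_w * page_h)

# A 1280 x 2000 px page with a 728x90 leaderboard and a 300x600
# half-page ad: ads cover just under 10% of the page.
print(round(ad_proportion(1280, 2000, [(728, 90), (300, 600)]), 3))  # 0.096
```

Real pages complicate this picture (scrolling changes the effective page area, and ads can overlap content), but the ratio gives a first objective anchor for perceived ad clutter.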
Using a standardized questionnaire developed for assessing consumer reaction to online ads (specifically, the constructs of perceived intrusiveness, irritation, informativeness, and entertainment value), Edwards et al. (2002) reported that ads perceived as intrusive elicited irritation and ad avoidance. Interruptive ads that occur during an online shopping task have been found to increase primary task time, with early interruptions being more disruptive than later interruptions (Xia & Sudharshan, 2002).
Forced presentation of ads irritates users, especially when ads are not skippable; when ad clutter is high, however, skippability doesn’t reduce irritation (Senarathna & Wijetunga, 2023). Experimental manipulation of ad location and relevance found that both factors affect the perception of ad clutter (Kim & Sundar, 2010).
Brinson et al. (2018) investigated why consumers install ad blockers, noting that “To discourage the use of ad blockers, publishers and ad industry leaders have been experimenting with a variety of methods to improve users’ experiences—from decluttering websites to developing less intrusive ad formats.” They found concerns about information privacy influence attitudes toward personalized advertising when messages are hyper-targeted based on too many layers of personal data—ads often described as “creepy.”
Based on this research, web design guidelines relevant to advertisements include limiting the number and intrusiveness of ads, making ads skippable, and avoiding hyper-targeted personalization that users perceive as creepy.
In short, website designers face numerous challenges regarding the management of perceived clutter. An effective ad strategy is critical for many websites, and failing to strike an appropriate balance between corporate and user needs can lead to negative impressions of the website and its parent enterprise. Website designers must also deal with more traditional design elements associated with perceived clutter, such as density, white space, logical grouping, layout complexity, and color.
This literature review serves as a starting point in our search for a measurement of perceived clutter. We will cover our next steps in future articles.
In this literature review of the construct of clutter, we found:
The everyday conception of clutter includes two components. The perception of clutter can be driven by a disorganized collection of needed objects and/or the presence of unnecessary objects. These components suggest different decluttering strategies—reorganizing needed objects and discarding unnecessary objects.
Research on the measurement of clutter in UI design has mostly focused on objective measurement. Early research on alphanumeric displays explored metrics such as overall density, local density, grouping, and layout complexity. Later research evaluated metrics based on perceptual psychology like feature congestion, subband entropy, and edge density.
No standardized questionnaires are currently available for the measurement of perceived clutter on websites. There is a published questionnaire for subjective clutter in advanced cockpit displays, but its technical items and factors do not seem to be appropriate for assessing consumer websites. A more promising line of research comes from the fields of marketing and advertising regarding consumer reaction (positive and negative) to online advertisements.
Bottom line: This literature review covered past work in the measurement of clutter, both objective and subjective, in the research domains of the presentation of information on displays and the influence of advertisements on user experiences. This review is a first step in the search for a clutter metric for websites.
For more details about this research, see the paper we published in the International Journal of Human-Computer Interaction (Lewis & Sauro, 2024).