Etsy Engineering | Code as Craft

Shaping Product Understanding with Contrastive Reinforcement Learning

Pat Geitner — Tue, 26 May 2026 14:13:10 -0400

Etsy’s marketplace is defined by the creativity and craftsmanship of our sellers and the hundreds of millions of highly diverse products they offer. You can find silversmiths who cold-forge recycled sterling silver, weavers who dye raw fleece with indigo and black walnut, and ceramicists who throw stoneware on a kick wheel.

These details define each product and often determine whether it matches a buyer’s taste, style, and interests. Sometimes buyers know exactly what they want, searching for “hand-thrown ceramic mug” or “vegetable-tanned leather wallet.” Other times, especially in recommendations, that intent is implicit. In those cases, we need to surface products that align with a buyer’s preferences even when those details are not explicitly stated.

To do that well, our search and recommendation systems need to understand what sets each listing apart beyond just the product category or listing titles, capturing the details that signal what an object is, how it was made, and whom it might appeal to. The quality of these representations directly limits how precisely our models can learn relationships between buyer taste and product details, and in turn surface relevant recommendations and search results. If the representation misses what makes a product special, the model will too.

The Gap Between Raw Data and Rich Product Understanding

Sellers provide rich product information through listing titles, images, descriptions, tags, variations, and attributes, but the most distinguishing details are often buried in long, noisy, and inconsistently expressed data. For example, details about how a product is made might be buried in a long description alongside shipping information, return policies, and sizing charts. A product’s visual style might be clear in its images, but not all of our systems can process images. At scale, ingesting and reasoning over all this raw data for production models is difficult due to tight latency constraints.

Parallel work at Etsy addresses some of these challenges by extracting parts of this raw signal into a structured format where we define attributes and infer them across all of the products in our inventory. While structured data works well for analysis and hand-crafting ML features, the diversity of Etsy’s one-of-a-kind inventory makes defining a fixed schema to capture all product intricacies difficult at scale.

Unstructured data can give us greater flexibility to capture more of this diverse information in a way that downstream ML models can learn from. One way to represent listings with less structural constraint is with free-form natural language summaries. Language models can leverage their world knowledge to distill unique product details into concise, expressive summaries that are easy to plug into downstream models.

However, a key challenge with unstructured output is defining what a “good” solution looks like, as what's important varies widely across products and contexts. Even within the same category, the details that matter most can differ substantially between listings. For one piece of artwork, buyers might respond most strongly to the composition or color palette, while for another, it may be the brushwork or framing style that determines which buyers it is most relevant to. Which details matter most is often reflected in how buyers engage with the listing given their search intent or prior interactions.

Because these distinctions depend heavily on the specific product, expressing what information is important through a prompt or a schema would require knowing upfront what details our complex search and recommendation systems actually need. Instead, we can build richer product representations by learning which information matters directly from contextual buyer engagement signal.

Unlocking Deeper Product Understanding with Contrastive Signal

Reinforcement learning (RL) provides a flexible way to shape product representations around the listing details most aligned with buyer engagement. Prior work, including a recent paper from Walmart, has shown that downstream models can be used directly as reward signal to teach an LLM how to represent and summarize the information that is important for downstream machine learning tasks.

Inspired by this work, we aimed to identify and surface the details that distinguish a listing from other similar options a buyer might consider. For example, we want our models to understand why a buyer who searched for “Sculptural Stoneware Pottery” chose to interact with one listing (engaged), but not another (non-engaged).

This perspective aligns naturally with ideas from contrastive learning. In traditional representation learning, contrastive methods train models to produce embeddings by pulling similar examples closer together and pushing dissimilar ones apart, implicitly learning which features matter most for distinguishing between them. We apply the same logic here, but instead of training a model to learn a latent embedding space, we use contrastive signal as a reward to fine-tune a language model to generate natural language summaries that emphasize distinguishing product details. These internal summaries are used to help our search and recommendation models surface products that better match buyer taste and intent.

To do this, we fine-tune Qwen3-VL-8B, an open-source LLM, using search interaction data structured as triplets: a query, an engaged listing, and a non-engaged listing. Engaged listings are those a buyer clicked, purchased, or favorited for a given query, while non-engaged listings are a mix of listings that appeared in the same search results but were not interacted with, along with randomly sampled listings. Because this data is grounded in real buyer behavior, it provides a natural signal for which details influenced a buyer’s decision in a given context.

For each triplet, the model generates several candidate summaries for both listings. These summaries are passed through a frozen neural search retrieval model, a two-tower dense retriever used for candidate generation in search, which scores their similarity to the query. The reward signal is defined as the margin between the query–engaged similarity and the query–non-engaged similarity. Summaries that increase this separation receive higher reward.
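As a minimal sketch of the margin reward described above: the snippet below assumes cosine similarity and uses toy two-dimensional vectors in place of the frozen retriever's embeddings. The post only says the retriever "scores their similarity to the query", so the similarity function, the function names, and the vectors are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margin_reward(query_emb, engaged_emb, non_engaged_emb):
    """Reward = sim(query, engaged summary) - sim(query, non-engaged summary)."""
    return cosine(query_emb, engaged_emb) - cosine(query_emb, non_engaged_emb)

# Toy embeddings (assumed): the engaged summary points the same way as the
# query, while the non-engaged summary is orthogonal to it.
query = [1.0, 0.0]
engaged = [1.0, 0.0]
non_engaged = [0.0, 1.0]

reward = margin_reward(query, engaged, non_engaged)
```

Swapping the two summaries flips the sign of the reward, which is what makes the signal contrastive rather than a plain similarity score.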

We then use Group Relative Policy Optimization (GRPO) to update the summary generation model based on the relative quality of its candidate generations, as illustrated in the figure below. Because the reward depends on the margin between paired engaged and non-engaged summaries, rather than on summaries independently, we generate and evaluate these candidates jointly within the same batch during RL training.
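The "relative quality" idea in GRPO can be sketched as normalizing each candidate's reward against the other candidates in its group. The snippet below is an illustration only: the reward values are made up, the use of the population standard deviation is an assumption (implementations differ), and the policy-gradient update itself is not shown.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of candidate generations."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical margin rewards for four candidate summary pairs from one triplet.
rewards = [0.10, 0.30, 0.20, 0.40]
advantages = group_relative_advantages(rewards)
```

Candidates with above-average margin get a positive advantage and are reinforced; below-average candidates get a negative one. If every candidate earns the same reward, all advantages are zero and the group contributes no learning signal.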

Throughout training, this contrastive objective nudges the model to produce summaries that push engaged listings closer to the query in the search model’s embedding space while pushing non-engaged listings further away. The margin structure also discourages reward hacking by construction: a generic, keyword-stuffed summary cannot simultaneously raise the engaged listing’s similarity and lower the non-engaged listing’s, so it is not rewarded for exploiting the retriever’s embedding space without grounding in the listing. The figure below illustrates how these embedding space dynamics unfold for candidate summaries generated from a given training triplet over the course of RL fine-tuning.

Over enough steps, the model learns to surface the nuanced details that distinguish listings which belong to the same broad product type but differ in the specifics that make one more relevant to a buyer than another.

Critically, since the model is trained to distinguish between alternatives rather than distill raw listing data, it can draw on sources our search systems currently don’t directly ingest, such as images. To understand how this reward shapes the information the model surfaces in practice, we can trace how the generated summary for a single listing evolves over the course of training.

This listing's distinctive product information is spread across its image, nested attributes, and a long description that also contains care instructions and shop information. Without reinforcement learning, the base model produces a summary that largely repeats the listing title and vaguely describes some additional features. Over the course of training, the model learns to extract specific product details from the raw data while ignoring irrelevant noise. By the end, the fine-tuned model produces a summary that goes well beyond the title and surfaces the details that actually set the listing apart. More example product summaries generated by the RL-tuned model are highlighted in the table below.

Measuring Downstream Impact

Looking at examples of generated summaries and how they progress through training can help us assess if the model is surfacing the right details, but we need concrete measures of quality to guide model tuning and understand if the generated summaries are moving the needle in terms of downstream system performance.

Human Evaluation

To assess the quality and accuracy of the generated summaries, we partnered with Etsy’s inventory specialist, who reviewed them alongside the raw listing data. Beyond providing high-quality human review, she brings specialized knowledge of what makes products in our marketplace distinctive – a key component for aligning LLM outputs to what actually matters. For each summary, she assessed whether it was grounded in the listing information, whether it surfaced concrete, distinguishing details not present in the title, and whether these product details were the most important additional information to surface.

Across all evaluated summaries, the model was consistently pulling out specific, accurate product details beyond the title. The reward signal was also working as intended, keeping generations grounded in seller-provided information and pushing the model to prioritize the details most relevant for search and recommendation models.

Quantitative Offline Evaluation

Beyond qualitative evaluation, we also want to measure whether these generated summaries are surfacing the right information to help downstream ML better understand the intricacies of the products on Etsy. To test this, we measured the impact of generated summaries on our Semantic Relevance model, which classifies query-listing pairs as relevant, partially relevant, or irrelevant and is used in search filtering and evaluating our search systems. This provides a good evaluation task because the omission of seemingly small product details can completely shift whether a product is semantically relevant to a specific buyer query. We generated summaries for the listings in a human-labeled evaluation set and compared model performance using description n-grams versus the RL fine-tuned product summaries as input. The table below provides a breakdown of the results.

We found that including the generated summaries as input in the semantic relevance model improved macro F1 score by 8.7% relative to using description n-grams, a simpler method that captures keywords in listing descriptions. Beyond the improvement for this specific evaluation task, this gives us evidence that despite being optimized explicitly for our neural search retrieval model, the summaries surface information that leads to gains for an entirely different model with a different objective. This generalization suggests the summaries are meaningfully capturing important product details overall and not just overfitting to our reward signal.

Why This Matters

This work was built as a way to highlight the creativity and craft of our sellers and the extremely diverse products they create. Rather than trying to rigidly define what makes each product distinct, we learn it directly from how our systems observe buyer behavior. The result is a way to surface what makes each individual product special in a form that is expressive, consistent, and easy for downstream systems to learn from.

In the near term, we are planning online experiments integrating these summaries into production ML systems. Beyond that, we are excited about the potential feedback loop this framework opens up: better summaries give downstream models richer signal to learn from, and better models produce stronger reward signal to further fine-tune expressive summaries.

Acknowledgements

This work is part of an ongoing, collaborative effort to bring more of the creativity and craft behind every listing into how our systems represent and surface products. We would like to thank our collaborators across the Inventory ML and Search ML teams.

Engineering and Product Leadership: Brian Schmidt, Eve Ahearn, Argie Angeleas
Inventory Expertise: Taylor Hunt
Search ML Collaborators: David Blincoe, Oriane Cavrois, Yuqing Zhang, Maria Castanos

Making Ads Count: Using MMoE and Auxiliary Tasks to Better Connect Buyers & Sellers

Amanda Steigman — Fri, 20 Mar 2026 14:31:56 -0400

When buyers search on Etsy, they need to quickly and easily find the perfect item. At the same time, sellers need to be confident their unique products are being seen by the right customers. Our Ads Search ranking model, which is built on a multitask learning foundation, is the critical link in this connection. Recently, we identified an opportunity to drive more meaningful buyer engagement by enhancing our model’s ability to predict purchase intent. We achieved this via a dual-pronged improvement: introducing Multigate Mixture of Experts (MMoE) to our model architecture and leveraging add-to-cart as an auxiliary signal. By providing our downstream systems with more accurate predictions, we improved matching in our marketplace, surfacing more relevant listings for buyers while helping sellers reach customers who are genuinely interested in their products.

Background

When a buyer searches for an item on Etsy, we want them to find exactly what they’re looking for from our inventory containing tens of millions of listings. In order to help them do this, we surface high-quality listings that are relevant to a user’s search query by ranking a small subset of items from a much larger group. This includes advertisements purchased by sellers, which enable them to promote their listings across Etsy placements, including search. While these results are sponsored, the items go through their own ranking process to surface the listings most likely to meet a buyer’s needs. The final result on the search page utilizes our auto-bidding system, which helps decide which listings get shown and the cost-per-click.

After a user views an ad (known as an “impression”), clicking on the ad is often the first engagement in their purchase journey. However, each subsequent step – from click to cart addition to purchase – represents a progressively smaller subset of users. The increasing data sparsity that exists further along in the purchase journey can make it difficult for our model to pick up on a strong signal to learn from.

When ranking ads, our machine learning models optimize for click-through rate (CTR) and post-click conversion rate (PCCVR). Clicks and purchases are the primary behaviors we use to predict and drive user engagement, but other actions in the buyer’s purchase journey, such as adding an item to a cart, are important and often predictive of a purchase.

Figure 1. The Ads Search user journey.

Some post-impression actions, such as favoriting an item, are not directly related to a buyer's purchase journey but can provide valuable signals to enhance our model's predictive capability. A click can be a strong indicator of a future purchase, but it can also be noisy – meaning it doesn’t always reliably predict purchase intent. For example, a user may click on an ad purely out of curiosity with no intention to buy. These are just a few reasons why user behavior is complex, and we are constantly trying to improve our prediction models to better capture these patterns and recommend the most relevant ads.

Multitask Model Architecture

The Ads Search ranking model is a multitask learning framework containing four major components: feature representation, explicit feature interaction, implicit feature interaction, and task prediction. Figure 2 is a depiction of our model architecture prior to the enhancements this post will describe. We start with raw numerical, categorical, and high-cardinality ID features for query, user, and listing entities, which are converted through the feature representation layer - including text embeddings and sequence encodings - to generate dense feature representations. These are concatenated and fed to a Deep and Cross Network (DCN) that learns explicit feature interactions. The explicitly crossed features then pass through a shallow feed-forward network for the model to learn additional implicit feature interactions. Finally, the latent feature representations are fed into task-specific towers to output CTR and PCCVR predictions.

Figure 2. The initial multitask architecture used for the Ads Search ranking model which has since been upgraded with Multigate Mixture of Experts (MMoE).
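To make "explicit feature interactions" concrete, here is a sketch of a single cross layer in the form used by the original DCN formulation, x_{l+1} = x_0 * (x_l · w) + b + x_l. The post does not say which DCN variant the ranking model uses, so the formula choice, the weights, and the two-dimensional input are all assumptions for illustration.

```python
def cross_layer(x0, xl, w, b):
    """One DCN-style cross layer: x0 * (xl . w) + b + xl, element-wise."""
    scalar = sum(a * c for a, c in zip(xl, w))  # xl . w is a single number
    return [x0_i * scalar + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]

# Toy input (assumed): the first layer crosses the input with itself.
x0 = [1.0, 2.0]
w = [0.5, 0.5]
b = [0.0, 0.0]

x1 = cross_layer(x0, x0, w, b)
```

Each additional cross layer multiplies by the original input again, so stacking layers raises the degree of the feature interactions the network can represent, while the `+ xl` term keeps lower-order terms available.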

Since the CTR and PCCVR predictions are used in downstream ads ranking and auto-bidding systems, we need the predictions to be well-calibrated. After the underlying model is trained, we individually calibrate the CTR and PCCVR towers to probability distributions using Platt scaling layers. As user behaviors vary significantly across ad placements, the model learns distinct parameters for different placements.
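Platt scaling fits a two-parameter logistic function on top of a model's raw score. A minimal sketch of applying it with placement-specific parameters might look like the following; the placement names and the (a, b) values are hypothetical, and fitting those parameters on held-out data is not shown.

```python
import math

# Hypothetical per-placement Platt parameters (slope a, intercept b).
PLATT_PARAMS = {
    "search": (1.2, -0.3),
    "other_placement": (0.8, 0.1),
}

def calibrate(raw_logit, placement):
    """Map an uncalibrated logit to a probability via sigmoid(a * logit + b)."""
    a, b = PLATT_PARAMS[placement]
    return 1.0 / (1.0 + math.exp(-(a * raw_logit + b)))

p_search = calibrate(0.0, "search")
p_other = calibrate(0.0, "other_placement")
```

Because the mapping is monotonic whenever a > 0, calibration changes the predicted probabilities without changing the order of listings within a placement.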

Optimizing for Purchase Intent

The multitask architecture we use has several advantages: it helps the model learn shared patterns across tasks, reduces overfitting by allowing the model to learn more generalizable features, and decreases training and serving infrastructure costs by consolidating two separate models into a single model. We originally deployed this multitask ranking model online in July 2023 and had not made major changes to its architecture since then. In the second half of 2025, the team identified an opportunity to better optimize for meaningful buyer engagement beyond click and purchase signals alone.

Our goal was to not only surface listings that resonate with buyers and drive conversions but also encourage them to return to Etsy – creating a positive feedback loop that benefits both buyers and sellers. We hypothesized that optimizing our models for engagement actions that signaled both purchase intent and buyer satisfaction would surface more relevant ads to buyers. Engagement actions include behavior that goes beyond a simple click, such as an add-to-cart or a favorited listing. To more effectively prioritize listings that led to this meaningful engagement for our buyers, we experimented with:

A model architecture that would better predict purchase intent and
Adding additional signals to boost high-quality listings that buyers are more likely to purchase

These two enhancements, in the form of Multigate Mixture of Experts (MMoE) and add-to-cart as an auxiliary task, worked well together in our model to drive a sizable product improvement in Q4 2025.

Enabling Task-Specific Learning with Multigate Mixture of Experts (MMoE)

While the introduction of our initial multitask model was a large success overall, it also had a limitation: since the model learns and shares the same feature representations across tasks, it is not always able to learn task-specific nuances, and this is more pronounced the less related the tasks are. When one task sees improved performance, other tasks can see performance degradation. This behavior is known as the “seesaw phenomenon,” and we encountered this when we first brought the multitask model online. One solution to this limitation is to add a Multigate Mixture of Experts (MMoE) layer. In this architecture, the model still employs a shared bottom architecture where the feature representation and interaction layers remain unchanged. However, the MMoE layer introduces two key additional components in place of the shared feed-forward network: experts and gates.

Experts are parallel subnetworks that, unlike the shared representations before them in the network, are able to specialize and learn different patterns of the data. Experts are not specifically assigned tasks but rather this learning happens organically during training - some experts learn more about click-specific behavior, others learn more about purchase-specific behavior, and others learn patterns that are important for both of these actions. Each task has one softmax gating network which controls how that task combines expert outputs. This allows the CTR and PCCVR tasks to use and activate different subsets of experts differently, and this weighted expert information is then sent to our task-specific towers.

Figure 3. A comparison of our shared bottom multitask architecture for the Ads Search ranking model (left) and our MMoE architecture (right).
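The expert-and-gate mechanics can be sketched in a few lines. This is a toy forward pass under stated assumptions: each expert is a single linear layer with ReLU, each gate is a linear layer followed by softmax, and all weights are hand-picked for readability. None of the shapes or values come from the production model.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(z):
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def mmoe_forward(x, expert_weights, gate_weights):
    """Return per-task mixed expert outputs and the gate weights used."""
    expert_outs = [relu(matvec(W, x)) for W in expert_weights]
    task_inputs, task_gates = {}, {}
    for task, G in gate_weights.items():
        g = softmax(matvec(G, x))  # one weight per expert, sums to 1
        dims = len(expert_outs[0])
        task_inputs[task] = [
            sum(g[k] * expert_outs[k][d] for k in range(len(expert_outs)))
            for d in range(dims)
        ]
        task_gates[task] = g
    return task_inputs, task_gates

x = [1.0, 2.0]
experts = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]  # two experts
gates = {
    "ctr": [[1.0, 0.0], [0.0, 0.0]],    # gate logits favor expert 0
    "pccvr": [[0.0, 0.0], [0.0, 1.0]],  # gate logits favor expert 1
}
task_inputs, task_gates = mmoe_forward(x, experts, gates)
```

Both tasks read from the same pool of experts, but each gate produces its own mixture, and that per-task mixture is what would be passed on to the task-specific tower.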

Tuning the Experts

The main hyperparameters to tune in the MMoE layer are the number of experts, the size of the experts, and the expert type. The number of experts used depends on several factors. One factor is a tradeoff between having too few experts – which can underfit and fail to capture distinct patterns needed for each task – and too many – which can overfit to training data and fail to generalize well. Another factor to consider is that adding experts increases model capacity which in turn increases latency and infrastructure costs. Our initial configuration included only multilayer perceptron (MLP)-based experts (i.e., feed-forward neural networks), but we experimented offline with heterogeneous experts and saw an offline lift in purchase and click metrics through introducing a mixture of DCN- and MLP-based experts.

There are other pitfalls when employing an MMoE architecture which oftentimes require additional hyperparameter tuning to resolve. Specifically, common issues in an MMoE structure are expert utilization (the ability of each task to use multiple experts) and expert specialization (the ability of each expert to learn differently). If experts are not well-utilized, the model has wasted capacity and fails to leverage the full representational power of the architecture. If experts are not specialized, the model effectively reduces to a shared bottom architecture with redundant experts. On the other hand, if experts specialize too strictly and are not shared between tasks, the model loses some of the benefits of multitask transfer learning.

We ran into these issues when training our new model. To build a successful model with MMoE, we needed each task to utilize more experts and utilize some of the same experts as the other tasks so that they could benefit from both specialized and shared learning. We experimented offline with two regularization techniques to try and solve this issue: expert dropout and temperature scaling.

In expert dropout, we randomly disable some experts during training to force the model to learn more diverse representations. Expert dropout differs from typical “dropout” in neural networks (where we randomly zero out a fraction of the units in a given layer during training), as we fully remove the utilization of a number of experts during the forward pass. Using expert dropout, utilization did improve a bit: each gate was selecting a primary and a secondary expert for each task. Still, we did not see any sharing of experts between the tasks.
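One way expert dropout could be implemented is to mask whole experts out of a gate's weights during training and renormalize what remains. The sketch below assumes exactly that; the renormalization step, the "keep at least one expert" rule, and the drop rate are illustrative choices, not details from the post.

```python
import random

def expert_dropout(gate_weights, drop_rate, rng, training=True):
    """Zero out whole experts at training time, then renormalize the gate."""
    if not training:
        return list(gate_weights)  # no experts are dropped at inference
    keep = [rng.random() >= drop_rate for _ in gate_weights]
    if not any(keep):
        keep[rng.randrange(len(keep))] = True  # always keep one expert
    masked = [w if k else 0.0 for w, k in zip(gate_weights, keep)]
    total = sum(masked)
    if total == 0.0:  # kept experts all had zero weight; fall back to uniform
        kept = sum(keep)
        return [1.0 / kept if k else 0.0 for k in keep]
    return [w / total for w in masked]

rng = random.Random(0)
gates = [0.7, 0.2, 0.1]
train_gates = expert_dropout(gates, drop_rate=0.5, rng=rng)
eval_gates = expert_dropout(gates, drop_rate=0.5, rng=rng, training=False)
```

When a gate's preferred expert is masked, the task has to route through the remaining experts for that step, which is the pressure that encourages broader utilization.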

We then tried temperature scaling, which modifies the raw logits of the expert gates by dividing them by a temperature (T) to control the smoothness of the resulting probability distribution. By applying this before the softmax function (which converts logits to probabilities) in the gates with a T > 1, we softened the distribution, making it more likely to select multiple experts. Expert dropout is random and only applied at training while temperature scaling is deterministic and applied at both training and inference. Temperature scaling achieved better utilization and specialization than expert dropout, leading us to deploy this approach.
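The effect of the temperature is easy to see numerically. The sketch below divides a gate's logits by T before the softmax; the logit values and the choice of T = 4 are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / T; T > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

gate_logits = [4.0, 1.0, 0.0]  # hypothetical logits for three experts
sharp = softmax_with_temperature(gate_logits, temperature=1.0)
soft = softmax_with_temperature(gate_logits, temperature=4.0)
```

With T = 1 nearly all of the weight lands on the first expert. With T = 4 the ranking of experts is unchanged, but the second and third experts receive a much larger share, so the task draws on more of them.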

Auxiliary Tasks

Our multitask model already leveraged user click and purchase engagements to train CTR and PCCVR towers. However, we also have access to other rich user interactions, namely add-to-cart and favorites, that reflect the meaningful buyer engagement described above. Purchases are quite sparse compared to clicks, and one of the major benefits of our original multitask model was that this sparse purchase action could benefit from the additional signal provided by the more common click action. Our goal with adding auxiliary tasks was to help the model learn more generalizable representations of user engagement, again leaning on actions that are more plentiful than purchase. We hypothesized that add-to-cart and favorite actions were indications of high purchase intent that would help the model better learn the purchase task without hurting the click task.

Since we do not use add-to-cart and favorite predictions for downstream use cases like ranking or bidding, we did not need to calibrate these predictions or serve them online. This makes them relatively straightforward to add to our existing uncalibrated model architecture. In the shared bottom version of our model, we simply add one tower for each additional task. In the MMoE version of our model, we add one gate and one tower for each additional task. We experimented with both versions offline and found that MMoE in combination with auxiliary tasks performed better than the shared bottom model with auxiliary tasks. It makes sense that MMoE would outperform the shared bottom when we added more tasks due to the “seesaw phenomenon” described earlier.

Through experimentation, we learned that while add-to-cart as an auxiliary task boosted purchase metrics by bridging the gap between purchases and clicks in terms of relatedness, favorites actually had a negative impact (known as negative transfer) on the model. With further analysis, we found that favorites can actually be quite noisy and not indicative of high purchase intent. As a result, the version of the model that was ultimately ramped up in production included only add-to-cart as an auxiliary task.

Figure 4. A simplified version of the MMoE piece of our model architecture with add-to-cart (ATC) as an auxiliary task.

Results and Impact

Offline, our new model showed promising improvements in Purchase and Click Area Under the Precision-Recall Curves (PR AUCs) and Purchase Area Under the Receiver Operating Characteristic Curve (ROC AUC) metrics. Together, these metrics measure how well our model predicts buyer behavior – PR AUC evaluates its ability to rank relevant listings at the top of the search results, and Purchase ROC AUC evaluates its ability to distinguish between listings buyers will and will not purchase. We saw average increases of 3.5% and 1% to Purchase and Click PR AUCs, respectively, and a 0.5% increase to Purchase ROC AUC, meaningful lifts for an industry-level ranking system.
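For readers less familiar with these metrics, the sketch below computes both on a toy example. It uses the pairwise-ranking definition of ROC AUC and average precision as the estimate of PR AUC, which is a common choice but an assumption here; the labels and scores are invented and unrelated to the reported results.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision@k taken at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

labels = [1, 0, 1, 0, 0]            # 1 = purchased, 0 = not purchased (toy)
scores = [0.9, 0.8, 0.7, 0.3, 0.1]  # hypothetical model scores

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

Average precision weights mistakes near the top of the ranking more heavily, which is why it is often preferred when positives such as purchases are rare.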

When we deployed the model online, we saw three meaningful improvements across the marketplace. First, the model drove purchases, improving buyer experience by more accurately predicting which listings from our inventory would resonate with them. Second, the ads marketplace became more efficient due to an improvement in purchase calibration metrics. More accurate PCCVR predictions served as better inputs to our auto-bidding system, which helped sellers reach buyers who are genuinely interested in their listings. Finally, the MMoE architecture is more flexible than the shared bottom architecture, so we were able to keep the overall model size flat by pruning other parts of the model when adding in MMoE. At serving time, inference became less costly, likely due to differences in the distribution of compute across model components.

What’s Next

The MMoE architecture provides the flexibility to add a variety of tasks to our ranking model: it reduces the risk of negative transfer by encouraging some experts to learn task-specific patterns and others to learn shared representations. After seeing success with the add-to-cart task in our new modeling framework, we plan on experimenting with several additional auxiliary tasks, such as dwell time, to further improve our model’s ability to connect buyers with listings they’ll love.

Ads give sellers an additional opportunity to stand out to buyers seeking their unique creations on Etsy. With each improvement to our ranking model, we continue to strengthen the marketplace connection between buyers and sellers – facilitating matches that help our sellers’ businesses grow and buyers discover products that feel made for them.

Migrating Etsy’s database sharding to Vitess

Ella Yarmo-Gray — Thu, 19 Mar 2026 17:03:54 -0400

Etsy has maintained a sharded MySQL architecture since around 2010. This database cluster contains most of Etsy’s online data and is made up of ~1,000 tables distributed across ~1,000 shards. Over the last 16 years, it has grown significantly: combined, these tables have over 425 TB of data and receive roughly 1.7 million requests per second.

Etsy engineers access our MySQL data through a proprietary object-relational mapping (ORM). The ORM has a corresponding model for each MySQL table. When a table is sharded, its rows are partitioned among different databases known as shards. Each shard has identical table schemas and contains a distinct subset of all rows. Sharded tables require a unique id field to be designated on the model as the “shardifier id” which determines the shard for each record.

Shardifier ids are designed to co-locate related data in order to minimize the number of databases we need to rely on for a given query. For example, all records related to a single shop or user are stored on the same shard. The majority of models use shop_id or user_id as the shardifier id, but there are over 30 different options in use. Prior to Vitess, the ORM stored mappings of records to shards in a single, unsharded “index” database. When a record was created, the shard mapping for its id was determined randomly by the ORM and then stored in a table on the index database so it could be retrieved again later. When a query was sent to a sharded table using the ORM, the ORM first queried the index database to retrieve the shard mapping, and then targeted the desired query to the right shard.
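The pre-Vitess routing flow described above can be sketched as a random assignment that is persisted and then looked up on every query. In this illustration a Python dict stands in for the unsharded index database, and the class and method names are hypothetical rather than taken from Etsy's ORM.

```python
import random

class LegacyShardRouter:
    """Toy model of random shard assignment plus an index lookup."""

    def __init__(self, num_shards, rng=None):
        self.num_shards = num_shards
        self.index = {}  # stand-in for the single "index" database
        self.rng = rng or random.Random()

    def assign(self, shardifier_id):
        """On record creation: pick a shard at random and store the mapping."""
        if shardifier_id not in self.index:
            self.index[shardifier_id] = self.rng.randrange(self.num_shards)
        return self.index[shardifier_id]

    def shard_for(self, shardifier_id):
        """On every query: consult the index before targeting a shard."""
        return self.index[shardifier_id]

router = LegacyShardRouter(num_shards=1000, rng=random.Random(42))
assigned = router.assign(123)        # e.g. a shop_id
looked_up = router.shard_for(123)
```

Because the mapping is random, it cannot be recomputed from the id alone: every query depends on the index being available, which is the single point of failure discussed below.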

This shard architecture brought two major benefits: scalability and resilience against host outages. The impact of a database outage was greatly reduced compared to an unsharded database – we only lost 1/1000 of traffic to the distributed database if a shard went down. However, it also had some drawbacks:

The process for horizontally scaling was manual, complex, and took months to complete.
We were dependent on a single point of failure – the index database. If the index database went down, it could cause a full site outage since the ORM would no longer be able to route queries to any shard.
Product developers were required to understand and manage sharding when accessing data, which could create friction since it was often confusing and toilsome.

As Etsy and the index database grew, we saw an increase in incidents in which the index database was unavailable, and addressing these vulnerabilities became a high priority. Vitess, open source software that provides an abstraction layer for scaling, deploying, and managing large MySQL clusters, offered a solution to these problems.

Introducing Vitess

In 2018, we added Vitess to our database architecture. Prior to this change, the ORM issued queries directly to the MySQL database for a specific shard. Now, each query was sent to Vitess, which would then query MySQL. At first, the ORM told Vitess which shard to query so we could validate that the new components worked in our environment before adding shard logic into Vitess.

With this new infrastructure in place, we were ready to start exploring vindexes, which define sharding strategies within Vitess. Vindexes provide a way to calculate a shard given a column value akin to how we mapped shardifier ids to shard numbers in our index database. To familiarize ourselves with them, we scaled several unsharded payments tables by creating a new sharded database cluster that could use Vitess vindexes out-of-the-box. The success of that project paved the way for us to migrate our existing, in-house shard infrastructure to Vitess vindexes.

Choosing a vindex

Vitess provides a number of vindexes. We were most interested in those that calculated the shard using an algorithm like the hash vindex, which would remove the reliance on an additional data store for shard lookups. Since the ORM’s shard mappings are random and not algorithmic, using one of these out-of-the-box would require re-sharding all of our data – a process that would be manual and likely take years. Instead, we chose to write custom vindexes that ported our existing shard logic into Vitess so we could test how vindexes worked in our environment without the complexity and risk of moving data.

The first step of creating our custom vindexes was to modify the shard assignment algorithm in the ORM to match the Vitess hash vindex algorithm. After that change, no new shard mappings required an index database lookup, so we could store all of the existing mappings in a read-only database. We chose to use SQLite, which provided low latency reads and a small enough footprint to copy the database directly onto each Vitess server – avoiding the latency and dependency added by a call to an external database. Then we built a custom SQLite lookup vindex, which looks up shard information from a SQLite database.

Next, we built a custom hybrid vindex, which allows a table to apply one of two vindexes based on a threshold value – in this case, the SQLite vindex or the hash vindex. We set the threshold to the first id that was created after we changed the shard assignment algorithm. Any id that is above this value would use the hash vindex to look up its shard mapping, and any id that is below it would use the SQLite vindex.
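
A minimal Python sketch of the hybrid routing idea, under several assumptions: real vindexes are implemented inside Vitess, the SHA-256 based `hash_shard` is only a stand-in for the Vitess hash vindex algorithm, and the threshold value, table name `shard_map`, and function names are hypothetical.

```python
import hashlib
import sqlite3

NUM_SHARDS = 1000
THRESHOLD_ID = 5_000  # hypothetical: first id created after the algorithm change


def hash_shard(record_id):
    # Stand-in hash; the Vitess hash vindex uses its own algorithm.
    digest = hashlib.sha256(str(record_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def build_lookup_db(legacy_mappings):
    # Read-only SQLite copy of the legacy, randomly assigned mappings.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE shard_map (id INTEGER PRIMARY KEY, shard INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO shard_map VALUES (?, ?)", legacy_mappings.items())
    conn.commit()
    return conn


def sqlite_shard(conn, record_id):
    row = conn.execute(
        "SELECT shard FROM shard_map WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    return row[0]


def hybrid_shard(conn, record_id):
    # Ids from the threshold onward are computed; older ids need the lookup.
    if record_id >= THRESHOLD_ID:
        return hash_shard(record_id)
    return sqlite_shard(conn, record_id)


conn = build_lookup_db({17: 803, 4_999: 12})
```

The lookup table never grows after the cutover, which is what makes a read-only copy on each server workable in this sketch.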

Introducing vindexes into our environment

We were now able to introduce Vitess vindexes into our existing architecture without moving any data. Before doing so, we needed to ensure that all queries currently in use with legacy ORM sharding would still succeed and return the same results with Vitess managed sharding. Some incompatibility was expected: Vitess requires the shardifier id to be present in the WHERE clause of a SQL query in order to route it to the correct shard, but this was not previously required by the ORM. In fact, since Etsy developers were required to provide this information separate from the SQL query in our legacy architecture, many queries did not include it in the WHERE clause. More detail around how we found and solved for query incompatibilities can be found in Scaling Etsy Payments with Vitess, Part 3.

The ORM contained over a decade of queries with many different ways of accessing MySQL, which made auditing each and every query time consuming. Building enough context to test changes to a model was also challenging since each table varied greatly in the design and purpose of its data. Given this, we decided to take an incremental approach and introduce vindexes into our codebase one table at a time. This would allow us to test on a smaller scale, ensuring each type of data access pattern would succeed with Vitess by monitoring how vindexes performed with a specific set of queries. We could slowly and safely surface incompatibilities while we made progress on the migration, allowing room for error without risking large-scale outages.

Since we would need to repeat the migration for hundreds of tables, we prioritized creating a clear and repeatable process. Etsy’s experimentation framework enabled us to gradually ramp up the change by incrementally increasing the percentage of traffic that was using Vitess vindexes for shard routing for each table. This enabled us to assess impact by comparing the performance of a query when it was using vindexes vs ORM shard routing. If we encountered any issues along the way, it was easy to ramp down to 0. Minimizing the risk of each model ramp by starting small and having a quick way to roll back helped mitigate the risk of not testing all queries upfront.
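
One common way to implement a per-table percentage ramp like this is deterministic bucketing. The sketch below assumes that approach; it is not Etsy's experimentation framework, and the function names and the use of a request id as the bucketing key are hypothetical.

```python
import hashlib


def uses_vindex_routing(table, request_id, ramp_percent):
    """Deterministically bucket a request into [0, 100) and compare to the ramp."""
    key = f"{table}:{request_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    return bucket < ramp_percent


def route_query(table, request_id, ramp_percent):
    # Ramping down to 0 sends every request back to the legacy path.
    if uses_vindex_routing(table, request_id, ramp_percent):
        return "vitess_vindex"
    return "legacy_orm"
```

Because the bucket depends only on the table and request id, a request that is included at 10% stays included at 50%, which keeps comparisons between ramp stages stable.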

Working through challenges: database transactions

Early on we ran into a challenge with this approach when handling database transactions. With our in-house shard routing, the ORM treated the shards as many separate databases and queried specific shards directly. With Vitess-managed sharding, the ORM queried the shards as if it were a single database, and they were only regarded as separate databases behind-the-scenes by Vitess. This meant that the two approaches created different connections to the database, which caused some issues for database transactions, since atomicity is only guaranteed per connection. To avoid data integrity issues that could come from breaking transaction guarantees, we required tables written to in a single database transaction to be ramped onto vindexes at the same time so they would use the same connection.

This was simple enough in theory, but ended up being pretty consequential. A small number of tables that represent the most complex and critical data models, like receipts, listings, and transactions, make up a large portion of database traffic to the sharded cluster. As we were preparing one of them to use vindexes, we realized that just 27 models (3% of tables) accounted for ⅓ of our database traffic – and they were all connected via database transactions. Despite our best efforts to minimize risk by ramping table-by-table, some of the riskiest changes were tightly coupled anyway. We collaborated across the company to ramp these 27 models in unison.
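
The coupling described above is a connected-components problem: any two tables written in the same transaction must ramp together, and that relationship chains. A small sketch, assuming the set of tables touched by each transaction is known (the function name and the `shop_settings` table are hypothetical):

```python
from collections import defaultdict


def ramp_groups(transactions):
    """Group tables that must ramp together.

    `transactions` is an iterable of sets of table names written in one
    database transaction. Tables linked through any chain of shared
    transactions end up in the same group.
    """
    parent = {}

    def find(table):
        parent.setdefault(table, table)
        while parent[table] != table:
            parent[table] = parent[parent[table]]  # path halving
            table = parent[table]
        return table

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for tables in transactions:
        ordered = sorted(tables)
        for table in ordered:
            find(table)  # register every table, even singletons
        for other in ordered[1:]:
            union(ordered[0], other)

    groups = defaultdict(set)
    for table in list(parent):
        groups[find(table)].add(table)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


groups = ramp_groups([
    {"receipts", "transactions"},
    {"transactions", "listings"},
    {"shop_settings"},
])
# groups == [["listings", "receipts", "transactions"], ["shop_settings"]]
```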

Reaping the benefits: cross-shard queries

One major advantage of using Vitess vindexes is the ability to query across shards. By default, any query that does not contain the shardifier id will "scatter" – Vitess will send it to all shards in parallel, sort the results, and return a single result set. This is a powerful tool – one model’s query time was reduced from ~2 seconds to ~20 ms by using cross-shard queries! However, at Etsy’s scale, unintentionally sending an expensive query to all 1,000 shards at once could quickly become an issue. To protect against these scenarios, we disabled scatter queries by default in our environment and provided a way to specify that a query is allowed to scatter via the ORM.

Utilizing scatter queries brought notable improvements to bulk primary key lookups in the 27 table ramp mentioned above. The ORM’s previous implementation took an array of primary keys, batched them by shard, issued a query to each shard, and returned the combined results. Since Vitess enables us to query multiple shards in a single query, we can skip the batching by shard and include all primary keys in a single query. This greatly reduced the number of queries issued for bulk lookups for some models.

Bulk lookup queries for one model during its ramp onto vindexes. Purple lines denote 1%, 10%, 50%, and 100% of traffic to the model using vindexes.
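
The difference in query counts between the two bulk-lookup strategies can be illustrated with a small sketch. The function names and the modulo-based `shard_for` mapping are hypothetical; each returned tuple represents one query issued by the application.

```python
from collections import defaultdict


def legacy_bulk_lookup_queries(primary_keys, shard_for):
    """Legacy ORM style: batch keys by shard, one query per shard touched."""
    by_shard = defaultdict(list)
    for key in primary_keys:
        by_shard[shard_for(key)].append(key)
    return [(shard, sorted(keys)) for shard, keys in sorted(by_shard.items())]


def vindex_bulk_lookup_queries(primary_keys):
    """Vitess style: one query with every key; Vitess fans it out to shards."""
    return [(None, sorted(primary_keys))]


keys = [3, 1004, 2001, 7, 1003]
legacy = legacy_bulk_lookup_queries(keys, shard_for=lambda k: k % 1000)
single = vindex_bulk_lookup_queries(keys)
# The legacy path issues four queries here (shards 1, 3, 4, 7); Vitess gets one.
```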

Conclusion

Five years, approximately 2,500 pull requests and 6,000 queries later, we have successfully migrated Etsy’s shard management to Vitess vindexes! Despite the work we put in to streamline the migration process, it was still a challenge to replace the database infrastructure for a codebase of Etsy’s scale and age. As infrastructure engineers, my team usually had minimal context around the code we were changing and what might break if things went wrong. However, with coordination and careful testing across Etsy engineering, we were able to meet our goals:

Scaling operations are no longer manual and can be performed in a matter of days rather than months.
We removed the index database as a single point of failure for our sharded cluster.
Shard infrastructure has been obscured from the developer’s view which makes data modeling and query writing much simpler.
Database performance stayed largely the same as it was prior to the migration, rendering the change invisible to our end users.
We were able to progress incrementally, roll back quickly, and integrate Vitess into our shard architecture without performing a massive data migration.

We are excited to use all the new Vitess features that migrating to vindexes has unlocked – for example, re-sharding our data, rebalancing data across the shards, and using Vitess’s MoveTables operations to shard unsharded tables. Swapping out such critical and complex infrastructure with almost no downtime or impact on our users was an extremely challenging task, but it was also extremely gratifying.

Acknowledgements

This project was a shared accomplishment of the Data Access Platform team: Jessica Chen, Samantha Drago-Kramer, Hermes Garcia, Sam Kenny, David Leibovic, Kyle Leiby, Benjamin Mariscal, Juan Ortega, Adam Saponara, Wendy Sung, and Stephanie Wu. Thank you to everyone who contributed to the vindex project across engineering!

How Etsy Uses LLMs to Improve Search Relevance

Yuqing Zhang — Fri, 16 Jan 2026 15:02:28 -0500

Ever searched for something specific, only to be met with results that are close, but not quite? On Etsy’s Search Relevance team, that frustration is exactly what we are tackling. Our goal is simple yet ambitious: to help buyers find exactly what they’re looking for, and to help sellers reach the people seeking their special products.

Search plays a central role in that mission. Historically, Etsy’s search models have relied heavily on engagement signals – such as clicks, add-to-carts, and purchases – as proxies for relevance. These signals are objective, but they can also be biased: popular listings get more clicks, even when they’re not the best match for a specific query.

To address this, we introduce semantic relevance as a complementary perspective to engagement, capturing how well a listing aligns with a buyer’s intent as expressed in their query. We developed a Semantic Relevance Evaluation and Enhancement Framework, powered by large language models (LLMs). It provides a comprehensive approach to measure and improve relevance through three key components:

High quality data: we first establish human-curated “golden” labels of relevance categories (we’ll come back to this) for precise evaluation of the relevance prediction models, complemented by data from a human-aligned LLM that scales training across millions of query-listing pairs
Semantic relevance models: we use a family of ML models with different trade-offs in accuracy, latency, and cost, tuned for both offline evaluation and real-time search
Model-driven applications: we integrate relevance signals directly into Etsy’s search systems, enabling both large-scale offline evaluation and real-time enhancement in production

Together, this framework brings a more intent-aware search experience that better serves both buyers and sellers across our marketplace.

Figure 1. Overview of the Semantic Relevance Evaluation and Enhancement Framework

Capturing Shades of Relevance

Let’s return to the idea of relevance categories. Based on user research, we define three categories for semantic relevance of query-listing pairs:

Relevant: listing matches all parts of the query, accounting for meaning and proper nouns
Partially relevant: listing matches part of the query or is thematically related but not a full match
Irrelevant: listing has no meaningful connection to the query; its presence in top results would make the search feel broken

Figure 2. Examples for the three relevance categories. Text highlighted in green shows how the product aligns with the search query, whereas red highlights indicate mismatches.*

In an ideal world, we’d rely on human judgments for all query-listing pairs. But large-scale human annotation is time-consuming and expensive, rendering it infeasible. Instead, language models unlock the ability to generate these judgments at scale, transforming our ability to make every search on Etsy produce more relevant results.

Data: Anchored by Humans, Scaled by LLMs

With recent advances in LLMs, a promising approach to evaluate search relevance is to use LLM-as-a-judge: directly using LLMs to judge the relevance of our search system without looping in humans. However, this approach faces two main challenges:

Domain shift: off-the-shelf LLMs may not capture the unique preferences and vocabulary of Etsy users
Performance-cost tradeoff: larger LLMs offer stronger reasoning but are expensive for large-scale inference, while smaller LLMs are faster and cheaper, but less accurate

To address these challenges, we start with human-curated golden labels, which we use to evaluate and align a powerful LLM, and then train our relevance judge on a full dataset scaled up by that LLM. In other words, humans define what good looks like, and LLMs help us scale it. LLMs do not replace human judgment; instead, they align with and amplify it.

We maintain a detailed, evolving relevance labeling guideline, continuously refined through user research and annotation feedback. What relevance means in our marketplace shifts over time and social context. For example, people searching for “face masks” pre-2020 were primarily looking for masks for costumes or fashion, which is a completely different intent from protective masks post-2020. These guidelines ensure our definitions of relevance accurately reflect Etsy users’ intent and capture cultural trends over time.

Query-listing pairs are sampled from search logs using a mix of approaches, including both random, stratified sampling for broad coverage, and targeted sampling for challenging cases. Each query-listing pair is labeled by two Etsy admins, with an ongoing review process to both break ties and adjust labeling guidelines accordingly. For quality control, we continuously track metrics such as the row-level disagreement rate, which measures how often annotators disagree with each other on the same query-listing pair.
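
As a rough illustration of the disagreement metric, here is one way it could be computed, assuming labels are stored per query-listing pair. The data layout and function name are hypothetical.

```python
def row_level_disagreement_rate(annotations):
    """Share of query-listing pairs whose annotators did not all agree.

    `annotations` maps a (query, listing_id) pair to the list of labels it
    received; pairs with fewer than two labels are skipped.
    """
    multi = [labels for labels in annotations.values() if len(labels) >= 2]
    if not multi:
        return 0.0
    disagreements = sum(1 for labels in multi if len(set(labels)) > 1)
    return disagreements / len(multi)


example = {
    ("ceramic mug", 1): ["relevant", "relevant"],
    ("ceramic mug", 2): ["relevant", "partially relevant"],
    ("leather wallet", 3): ["irrelevant", "irrelevant"],
    ("leather wallet", 4): ["partially relevant", "partially relevant"],
}
rate = row_level_disagreement_rate(example)  # one of four pairs disagrees
```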

To scale beyond manual annotation, we introduced a few-shot, chain-of-thought (CoT) prompting strategy using the o3 model, implemented in LangGraph. The prompt instruction is inspired by the annotation guidelines described above, and includes comprehensive query and listing features, like title, images, text description, attributes, variations, and extracted entities (read more about listing extracted entities in another one of our posts). We also applied self-consistency sampling to improve reliability.
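
Self-consistency sampling generally means asking the model for several independent judgments and keeping the most common one. A minimal sketch of the aggregation step, assuming the sampled labels have already been collected (the LLM call itself is not shown, and the function name is hypothetical):

```python
from collections import Counter


def self_consistent_label(sampled_labels):
    """Return (majority_label, agreement) over several sampled judgments."""
    if not sampled_labels:
        raise ValueError("need at least one sampled label")
    counts = Counter(sampled_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(sampled_labels)


label, agreement = self_consistent_label(
    ["relevant", "relevant", "partially relevant", "relevant", "relevant"]
)
```

The agreement fraction is a convenient by-product: low agreement flags pairs where the model's judgment is unstable.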

This model, known as the LLM annotator (as seen in Figure 1), is first validated against the human-labeled golden data to ensure its judgement aligns with humans. Once validated, we use it to generate large-scale training data to develop the production models. The LLM annotator thus serves as the foundation for our teacher-student modeling pipeline, bridging the gap between expensive manual labeling and scalable automated annotation.

Models: Balancing Accuracy, Latency and Cost

Our modeling pipeline uses a three-tier cascaded distillation design, where each model balances accuracy and efficiency for a specific purpose. The stack includes:

The LLM annotator: our most accurate and cost-intensive model, aligned closely with human-labeled golden data
The teacher model: a fine-tuned smaller LLM (Qwen 3 VL 4B) that delivers high-throughput annotation at scale
The student model: a lightweight, BERT-based two-tower model optimized for real-time inference

The LLM annotator aligns best with the golden labels, but is too costly for recurrent, large-scale inference. To reduce cost while maintaining quality, we performed supervised fine-tuning (SFT) with a smaller LLM, Qwen 3 VL 4B, using the training data generated by the LLM annotator. This teacher model preserves human alignment while enabling us to label millions of query-listing pairs daily, which is ideal for recurring evaluation and monitoring.

The teacher, however, is too slow to surface relevant search results quickly, which is critical for helping our sellers reach potential buyers. As such, we further distilled the teacher into a student model with a two-tower architecture. The distillation process aligns the student’s output with that of the teacher, so that the student judges relevance labels nearly as accurately as the teacher, while being lightweight and fast. The resulting model lets us deliver search results almost as fast as before, adding less than 10 ms of latency.
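
To make the distillation idea concrete, here is a small sketch of one common formulation: the student is trained to minimize cross-entropy against the teacher's class probabilities. Whether Etsy uses soft or hard teacher labels, and how the two towers are combined, is not stated here; the element-wise product and all function names below are assumptions.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def two_tower_logits(query_vec, listing_vec, class_weights, class_biases):
    # Each tower produces its vector independently; combining them with an
    # element-wise product before a linear layer is an assumed design.
    interaction = [q * l for q, l in zip(query_vec, listing_vec)]
    return [
        sum(w * x for w, x in zip(weights, interaction)) + bias
        for weights, bias in zip(class_weights, class_biases)
    ]


def distillation_loss(teacher_probs, student_logits):
    # Cross-entropy of the student against the teacher's soft labels.
    student_probs = softmax(student_logits)
    return -sum(
        t * math.log(s) for t, s in zip(teacher_probs, student_probs) if t > 0
    )


teacher = [0.80, 0.15, 0.05]  # relevant, partially relevant, irrelevant
close_student = [math.log(p) for p in teacher]
far_student = [0.0, 0.0, 3.0]
```

In two-tower designs generally, the listing vectors can be computed ahead of time, leaving only the query tower and a cheap combination step at request time.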

All three models – the LLM annotator, teacher, and student – are evaluated against the same golden dataset to ensure traceable performance and consistent alignment with human judgment. Figure 3 shows their accuracy measured using multi-class Macro F1, and individual class F1 scores.

Figure 3. Performance of semantic relevance models against human golden labels
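
For readers unfamiliar with the metric, macro F1 is the unweighted average of the per-class F1 scores, so each relevance category counts equally regardless of how common it is. A self-contained sketch with made-up labels:

```python
def per_class_f1(y_true, y_pred, labels):
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


LABELS = ["relevant", "partial", "irrelevant"]
golden = ["relevant", "relevant", "partial", "partial", "irrelevant", "irrelevant"]
predicted = ["relevant", "partial", "partial", "partial", "irrelevant", "relevant"]
scores = per_class_f1(golden, predicted, LABELS)
overall = macro_f1(golden, predicted, LABELS)
```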

Applications: From Evaluation to Action

With these models in place, we can both measure and enhance search relevance across Etsy.

Search relevance evaluation

We use the teacher model to measure how well our search system surfaces relevant listings. Each day, we sample search requests and perform offline inference using the teacher model, then aggregate the predicted relevance labels into summary metrics. These metrics are reviewed regularly by our team, and if we observe unexpected trends like a sudden decline of relevance, we work to quickly diagnose and address the problem.

Similarly, we monitor relevance metrics in A/B tests. The computed relevance metrics are discussed when we decide whether to roll out a new change to our search system, to ensure those changes affect semantic relevance of search results in a neutral to positive way. We sample sufficient amounts of requests separately from control and treatment variants, to ensure statistical power. Using vLLM for high-throughput inference, we process millions of query-listing pairs daily at a very low cost, maintaining both statistical power and operational efficiency.
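
As an illustration of comparing a relevance metric between variants, the sketch below applies a pooled two-proportion z-test to the share of listings labeled relevant. The choice of test is an assumption made for the example, and the counts are synthetic.

```python
import math


def relevant_rate(labels):
    return sum(1 for label in labels if label == "relevant") / len(labels)


def two_proportion_z(control_labels, treatment_labels):
    """z statistic for the difference in relevant rate (treatment - control)."""
    n_c, n_t = len(control_labels), len(treatment_labels)
    p_c, p_t = relevant_rate(control_labels), relevant_rate(treatment_labels)
    pooled = (p_c * n_c + p_t * n_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    return (p_t - p_c) / se if se else 0.0


def two_sided_p_value(z):
    # For a standard normal, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Synthetic predicted labels: 1,000 sampled pairs per variant.
control = ["relevant"] * 580 + ["irrelevant"] * 420
treatment = ["relevant"] * 620 + ["irrelevant"] * 380
z = two_proportion_z(control, treatment)
p = two_sided_p_value(z)
```

With these synthetic counts the difference is not significant at the 5% level, which shows why the number of sampled requests per variant matters.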

Improving search in production

The lightweight student model is embedded directly into Etsy’s real-time search stack. It improves relevance through several integration points:

Filtering: removes retrieved listings predicted as irrelevant before downstream ranking
Feature enrichment: contributes model-predicted relevance scores as features for the downstream ranking model
Loss weighting: adjusts training weights of the ranking model based on predicted relevance
Relevance boosting: promotes listings deemed highly relevant using heuristic rules among the final returned search results
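
Two of these integration points, filtering and relevance boosting, can be sketched as below. The thresholds, the additive boost, and the `Candidate` structure are all hypothetical, and for brevity the sketch applies both steps to the same candidate objects even though filtering happens before ranking.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    listing_id: int
    ranker_score: float  # score from the downstream ranking model
    relevance: dict      # student-model probabilities per category


def filter_irrelevant(candidates, max_irrelevant_prob=0.5):
    # Filtering: drop listings the student model predicts as irrelevant.
    return [c for c in candidates if c.relevance["irrelevant"] <= max_irrelevant_prob]


def boost_relevant(candidates, min_relevant_prob=0.8, boost=0.1):
    # Relevance boosting: a simple additive heuristic on the final ordering.
    def boosted(c):
        bonus = boost if c.relevance["relevant"] >= min_relevant_prob else 0.0
        return c.ranker_score + bonus

    return sorted(candidates, key=boosted, reverse=True)


candidates = [
    Candidate(1, 0.70, {"relevant": 0.90, "partial": 0.08, "irrelevant": 0.02}),
    Candidate(2, 0.75, {"relevant": 0.30, "partial": 0.60, "irrelevant": 0.10}),
    Candidate(3, 0.90, {"relevant": 0.05, "partial": 0.15, "irrelevant": 0.80}),
]
final = boost_relevant(filter_irrelevant(candidates))
final_ids = [c.listing_id for c in final]
```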

How Semantic Relevance is Changing Etsy Search

The Semantic Relevance Evaluation and Enhancement Framework is fully deployed in Etsy’s search stack, and continues to evolve. We’ve observed a measurable uplift in semantic relevance: the percentage of fully relevant listings (as defined by the relevance categories described earlier) has increased from 58% to 62% between August and October 2025.

Figure 4. Improvement of semantic relevance metrics over time

This improvement reflects Etsy’s growing ability to align search results with buyer intent. For instance, in searches like “fall decor,” the enhanced search engine now focuses on surfacing seasonal decor items while deprioritizing loosely related listings, such as clothing, that appeared before the relevance enhancements.

Figure 5. Before and after comparison when searching for “fall decor” *

Beyond these immediate gains, semantic relevance has shifted how we evaluate and improve search at Etsy toward a more user-centered approach. By grounding our evaluation in semantic intent in addition to behavioral signals, we move closer to our goal of connecting buyers with the relevant products, not just the most popular ones. On the seller side, while search results are influenced by multiple factors and outcomes may vary, improving semantic relevance can also help surface items from small or new sellers who may not yet have the visibility of more established shops.

What’s Next

In ongoing and future efforts, we hope to explore the following directions:

Better understanding of relevance-engagement dynamics. In online experiments, we often observe engagement metrics decline even as semantic relevance improves (a pattern also noted by other e-commerce platforms). We suspect this results from applying uniform relevance treatments despite contextual variation. Next, we plan to explore adaptive strategies that tailor adjustments by query type.
Refining partial relevance. Inspired by Amazon’s ESCI framework, we’re exploring finer-grained labels, for example, introducing new subcategories of complements and substitutes. This could potentially improve evaluation precision and power new user search experiences.
Reducing annotation effort through LLM facilitation. When LLM judgments are self-consistent, they align better with human labels. This may indicate easier query-listing pairs. We are exploring using LLMs for these easy cases, focusing human effort on more complex cases.
Simplifying the multi-stage model stack. Our current three-tier distillation pipeline provides flexibility but adds operational complexity. We plan to simplify this setup by exploring better performance-efficiency tradeoffs and potentially merging model tiers.
Improving relevance in retrieval. So far, post-retrieval filtering is the first stage where our semantic relevance model applies. We see strong potential to enhance both inference and measurement further upstream in the retrieval layer.

Conclusion

Key takeaways:

LLMs can meaningfully evaluate search relevance when grounded in human judgment. Aligning LLM assessments with human-labeled data ensures we measure, and continually improve, the search experience that is so essential to connecting buyers and sellers on Etsy.
Semantic relevance redefines how Etsy optimizes search. By complementing engagement metrics with semantic relevance, we address real customer pain points and deliver more satisfying search experiences.
Teacher-student distillation offers a flexible and efficient way to apply relevance modeling across diverse performance, latency and cost requirements.

Ultimately, improving semantic relevance strengthens the human connections that define Etsy. By understanding what shoppers truly mean, we can help them find the right items. And by emphasizing relevant listings over popular ones, we can help create fairer opportunities on the search relevance factor of search visibility for our sellers – 89% of whom are businesses of one.

Acknowledgments

This work is brought to you in a collaborative effort by the Search Relevance Team, enabled by ML Enablement, and the Merchandising teams. Thanks to the following contributors:

Data: Susan Liu, Jugal Gala, David Blincoe, Yuqing Zhang, Taylor Hunt, Liz Mikolaj
Models: David Blincoe, Oriane Cavrois, Orson Adams, Yuqing Zhang
Application: Grant Sherrick, Kaushik Bekal, Haoming Chen, Patrick Callier, Davis Kim, Marcus Daly
Product leadership: Julia Zhou, Willy Huang, Argie Angeleas
Engineering leadership: Yinlin Fu, Congzhe Su, Xiaoting Zhao
ML Enablement partners: Ari Carter, Stan Schwertly, Shreya Agarwal, K Ogilvie, Marvin Wang, etc.
Other cross-team partners: Will Beckman, Karl Yokono, Audrey Chen, Heather Campbell, David Le, Khadeeja Din, etc.
Early contributors: Ethan Benjamin, Cung Tran, Maggie Matsui, Jack Gammack, Yogeeta Chatoredussy, Austin Clapp, Benjamin Russell, Khaled Jabr

Special thanks to Oriane Cavrois & David Blincoe for helping this piece come to life.

* Images are provided for illustrative purposes. Item availability on Etsy may vary.

Reducing experiment duration with predicted control variates

Kelly McManus — Tue, 25 Nov 2025 11:28:26 -0500

In 2021, we published a blog post titled “Increasing experimentation accuracy and speed by using control variates,” describing how we reduce the variance of metrics using CUPED in our experimentation platform. This is a follow-up on how CUPED has evolved at Etsy since then. Spoiler – It’s changed a lot, decreasing our average experiment duration by 3 days!

Etsy’s mission is to Keep Commerce Human. To achieve this, we need to understand the impact each change to our platform has on our buyers' and sellers' experience. Whether that involves changing the color of the “Buy Now” button on the Etsy app or updating elements of how our algorithms rank search results, we leverage large-scale online experimentation to iterate on and improve the things we build.

However, running an experiment can be a long process. From design and setup to running the experiment and analyzing results, the entire experimentation process can take weeks to months. Experiments must run long enough to collect sufficient data for the results to be statistically significant – ensuring we can confidently attribute observed changes to the treatment, rather than random chance. On the other hand, being able to learn from an experiment quickly is a crucial step in the product development lifecycle, enabling faster improvements to Etsy. Fortunately, there are tools to reduce experiment runtime. CUPED is one of them! Variance reduction techniques like CUPED can help reduce the time to run an experiment, shortening the overall experimentation lifecycle and time to learning, as visualized below.

A recap of CUPED

CUPED is a variance reduction technique that estimates experiment outcomes with greater speed and accuracy compared to a direct comparison between control and treatment groups. In 2021, Etsy implemented CUPED (Controlled-Experiment Using Pre-Experiment Data) for key metrics like Conversion Rate (the percentage of visitors that make a purchase).

CUPED leverages historical visitor data collected before the experiment begins – for example, the number of purchases in the week prior to the experiment – to explain some natural variation in the outcome metric. The pre-experiment factors are used as covariates in a linear regression model to remove some of the “noise” that is not attributable to the treatment. By accounting for this variation, CUPED reduces the variance of the treatment effect estimator, increasing statistical power and improving sensitivity without introducing bias.

The CUPED correction can be conceptualized as:
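
In the standard single-covariate form from the CUPED literature, the adjustment can be written as follows. This is a simplified reconstruction under stated assumptions, not necessarily the exact expression used at Etsy: $Y$ is the in-experiment metric, $X$ is one pre-experiment covariate with mean $\bar{X}$, and $\rho$ is the correlation between $X$ and $Y$. With several covariates, $\theta$ becomes a vector of regression coefficients.

```latex
Y_{\text{CUPED}} = Y - \theta\,(X - \bar{X}),
\qquad
\theta = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}

\operatorname{Var}(Y_{\text{CUPED}}) = \operatorname{Var}(Y)\,(1 - \rho^{2})
```

Because $X$ is measured before the experiment starts, it is independent of treatment assignment, so subtracting the centered term leaves the expected treatment effect unchanged.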

The CUPED-adjusted metric will have a smaller variance than the original metric, as visualized below, providing more precise estimates of a mean or treatment effect.

Sample size, power, and variance are all related. Holding everything else unchanged, the smaller the variance of a metric, the smaller the sample size required to reach a desired power. Since we can reduce the variance of our metric by applying CUPED, we can achieve the same amount of power with a smaller sample size. In practice, a smaller sample size corresponds to a shorter experiment duration.
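
The relationship can be made concrete with the textbook sample size formula for a two-sided, two-sample z-test with equal group sizes. That formula is an assumption chosen for illustration rather than Etsy's own power calculation, and the variance and effect size below are made up.

```python
import math
from statistics import NormalDist


def required_sample_size_per_group(variance, min_detectable_effect,
                                   alpha=0.05, power=0.8):
    """Per-group n for a two-sided, two-sample z-test with equal group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * variance * (z_alpha + z_beta) ** 2 / min_detectable_effect ** 2
    return math.ceil(n)


n_raw = required_sample_size_per_group(variance=1.0, min_detectable_effect=0.02)
# A 7% variance reduction scales the required sample size by the same 7%.
n_adjusted = required_sample_size_per_group(variance=0.93, min_detectable_effect=0.02)
```

Since the required sample size is proportional to the variance, any percentage cut in variance translates into the same percentage cut in the sample needed for a given power.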

Etsy’s initial implementation of CUPED yielded an average variance reduction of 7% across all experiments, with some experiments achieving up to 30% variance reduction. Experiments that used CUPED-adjusted metrics in decision-making yielded a decision about 1 day earlier, on average. However, we’re always iterating to improve our buyers’ and sellers’ experience on Etsy, and we knew we could do even better. Enter: CUPAC.

Leveling up further with CUPAC

During our research and implementation of CUPED in 2020, scientists at DoorDash published a blog post describing a novel statistical method, building on CUPED, called Control Using Predictions as Covariate, or “CUPAC.”

When performing CUPAC, the pre-experiment data is first input into a non-linear machine learning model that captures more complex relationships than a linear model. The non-linear model is trained to predict the outcome metric of interest – for example, if an experiment is measuring the observed Conversion Rate, the model would predict Conversion Rate. The prediction more effectively captures the impact of pre-experiment behaviors on our experimental outcomes than the raw pre-experiment data because it captures complex relationships in the data that linear regression alone cannot. The prediction is then used as an “ML-based covariate” in a linear regression to perform the CUPED correction:
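
Written out in the same single-covariate notation as the standard CUPED adjustment, and as a reconstruction rather than a quoted formula, the CUPAC version replaces the raw covariate with the model prediction. Here $f$ is the trained non-linear model, $X_{\text{pre}}$ the vector of pre-experiment features, and $\hat{Y}$ the ML-based covariate with mean $\bar{\hat{Y}}$.

```latex
\hat{Y} = f(X_{\text{pre}})

Y_{\text{CUPAC}} = Y - \theta\,(\hat{Y} - \bar{\hat{Y}}),
\qquad
\theta = \frac{\operatorname{Cov}(\hat{Y}, Y)}{\operatorname{Var}(\hat{Y})}
```

The more strongly $\hat{Y}$ correlates with the observed metric $Y$, the larger the variance reduction, which is why the prediction quality of $f$ matters.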

The CUPAC-adjusted outcome has an even smaller variance than the CUPED-adjusted outcome, as visualized below.

Empirically, our CUPAC-adjusted metrics showed even lower variance than CUPED. Our initial prototype demonstrated that CUPAC produced an adjusted metric with an additional 10% smaller variance when compared to our original CUPED estimator. Despite the added complexity, these results justified incorporating CUPAC into our experimentation pipeline. We hypothesized it would cut average experiment duration by an additional day, enabling teams to run more experiments and ship changes to Etsy faster.

Training and implementation

The first step was to train the CUPAC models to predict the ML-based covariate. We identified over 100 pre-experiment features, increasing from 3 features in CUPED, to capture more behavior prior to the experiment. Using these features, we iteratively trained and tuned the models in Vertex AI. Hyperparameters were optimized on a validation dataset to maximize the median correlation between the model’s predictions and the observed in-experiment metrics across experiments.

Initially we trained XGBoost, a popular gradient boosted tree model, but then found LightGBM, a similar non-linear, tree-based model, was better suited to predict the covariate. When testing the models at scale with billions of predictions, LightGBM demonstrated both rapid training and prediction times, along with strong validation results.

Once the models were trained, our next challenge was to implement them at scale. Our experimentation pipeline runs batch jobs for hundreds of experiments each day. From our original implementation, we had an Airflow DAG (directed acyclic graph) to orchestrate the CUPED variance reduction pipeline, as visualized below:

We evolved this pipeline to support CUPAC by adding a batch prediction step to produce the ML-based covariate.

In the above CUPAC pipeline, we perform the following steps:

Calculate pre-experiment features and in-experiment data using BigQuery SQL jobs.
Predict ML-based covariates with our trained LightGBM models via parallel Dataflow jobs using the pre-experiment features.
Perform variance reduction with a Spark job that fits a linear regression model between the ML-based covariates and in-experiment data, creating the CUPAC-adjusted metrics.
Apply statistical t-tests using the CUPAC-adjusted metric to calculate the treatment effect, p-value, and power of the experiment.
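
Steps 3 and 4 above can be sketched in plain Python on synthetic data. This illustrates the technique rather than Etsy's Spark job: the covariate here is simulated instead of coming from a LightGBM model, Welch's t statistic is an assumed choice of test, and all names are hypothetical.

```python
import math
import random
from statistics import fmean, variance


def cuped_adjust(outcomes, covariates):
    """Return outcomes adjusted by a single covariate (CUPED or CUPAC style)."""
    mean_x, mean_y = fmean(covariates), fmean(outcomes)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariates, outcomes))
    var_x = sum((x - mean_x) ** 2 for x in covariates)
    theta = cov_xy / var_x if var_x else 0.0
    return [y - theta * (x - mean_x) for x, y in zip(covariates, outcomes)]


def welch_t(control, treatment):
    """Welch's t statistic for treatment minus control."""
    se = math.sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    return (fmean(treatment) - fmean(control)) / se


# Synthetic data: the covariate stands in for the model's prediction.
rng = random.Random(7)
covariate = [rng.gauss(0, 1) for _ in range(4000)]
is_treatment = [i % 2 == 1 for i in range(4000)]
outcome = [
    2.0 * x + rng.gauss(0, 1) + (0.1 if t else 0.0)
    for x, t in zip(covariate, is_treatment)
]

adjusted = cuped_adjust(outcome, covariate)
raw_t = welch_t(
    [y for y, t in zip(outcome, is_treatment) if not t],
    [y for y, t in zip(outcome, is_treatment) if t],
)
adjusted_t = welch_t(
    [y for y, t in zip(adjusted, is_treatment) if not t],
    [y for y, t in zip(adjusted, is_treatment) if t],
)
```

The adjustment leaves the overall mean unchanged while removing the share of variance the covariate explains, so the same treatment effect is measured against less noise.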

Impact: Shortening average experiment duration by 3 days

We measured success through variance reduction. Variance reduction is the percent change between the:

Variance of the metric without CUPAC, and
Variance of the CUPAC-adjusted metric.
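
Expressed as a formula, with the sign convention (an assumption here) chosen so that a reduction is reported as a positive percentage:

```latex
\text{Variance reduction} =
\frac{\operatorname{Var}(Y) - \operatorname{Var}(Y_{\text{CUPAC}})}{\operatorname{Var}(Y)} \times 100\%
```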

The original CUPED implementation showed 7% variance reduction, reducing overall experiment duration by almost 1 day, on average. After implementing CUPAC, we observed an average variance reduction of 27%, nearly 4x that of CUPED and more than our early research estimates.

The additional variance reduction shortens our average experiment duration by almost 3 days. This means a 10-day experiment could conclude in only 7 days due to the ability to reach power on a smaller sample size with CUPAC. These marginal time savings allow many teams to run 10 or more additional experiments each year. That translates to more opportunities to test and faster insights into how we can deliver the best experience for our community of millions of sellers and buyers.
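
The 10-day example follows from a back-of-the-envelope calculation. It assumes that required sample size is proportional to variance and that traffic arrives at a roughly constant daily rate, so duration $T$ scales with variance; both are simplifications.

```latex
T_{\text{CUPAC}} \approx T \times (1 - 0.27) = 10 \times 0.73 = 7.3 \text{ days}
```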

Notably, there was a substantial spread in variance reduction among different metrics and experiments, ranging from 2% to 77%. In the chart below, each blue bar displays the percent variance reduction for a sampled metric on an experiment.

The large range is expected because variance reduction can be influenced by several factors, such as metric definition, data accessibility, experimental design, and market characteristics. These factors affect how predictive the pre-experiment data is of the outcome metric, which in turn determines the degree of variance reduction. For example, two common experimentation metrics are Mean Visits and Purchase Rate. In the e-commerce setting, an individual's visit behavior will almost always be more stable over time than their purchasing behavior. This implies that pre-experiment data is more correlated with in-experiment data for a visit-related metric than for a purchase-related metric. Therefore, CUPAC is more effective at reducing variance in a metric like Mean Visits than in a metric like Purchase Rate.

What’s next?

Aligned with Etsy’s culture of experimentation, we’ll continue to evolve our pipeline to be nimble and flexible based on the needs of the teams that use it.

One challenge we face is that teams make decisions on their experiment results using metrics curated to specific parts of the Etsy experience (search, recommendations, seller features, and so on). However, our CUPAC models take significant time to train and maintain for each metric, which limits the number of CUPAC-adjusted metrics we can develop. While we continue to grow CUPAC use, we also encourage teams to continue to use CUPED, which is more scalable and has lower maintenance costs. To support this, we plan to extend CUPED to more metrics by automatically collecting pre-experiment data based on the metric definition. In tandem with our work on CUPAC, this CUPED expansion will enable teams across Etsy to benefit from variance reduction across all their team-specific metrics, not just a select few.

Despite the success of CUPED and CUPAC thus far, there remains a need to explore additional variance reduction techniques for the current metrics that leverage CUPAC. In 2024, we released research findings exploring a novel approach: Variance reduction combining pre-experiment and in-experiment data. As we look to generalize our variance reduction architecture, we expect that incorporating such techniques will continue to strengthen our experimentation platform and enable product teams to iterate more quickly.

Lastly, it is important to recognize that applying variance reduction in practice can be a never-ending race to squeeze the most noise out of these estimators. In our experience, the craft lies in finding the sweet spot between variance reduction, implementation cost, and the impact on experimentation velocity. That intersection is context-dependent and what makes experimentation code as craft.

We hope our experience inspires you to try out variance reduction techniques and determine which one is best suited to your needs!

Acknowledgements

Thank you to Alexander Tank and Stephane Shao for their work on initial research and implementation of CUPAC. Thanks to Pablo Crespo for his research into extending our CUPAC models with more predictive features. And, thanks to Julie Beckley, Kevin Gaan, and Mary Hu for supporting and prioritizing this project.

References

A. Deng, Y. Xu, R. Kohavi, T. Walker (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data.

J. Li (2020). Improving Experimental Power through Control Using Predictions as Covariate (CUPAC).

Improving performance by prefetching product pages from Etsy Search

David Weinzimmer — Wed, 29 Oct 2025 13:14:09 -0400

Rarely are there opportunities for big, bold, game-changing improvements in web performance. The Speculation Rules API (SRA) is a recent browser development that offers just such an opportunity. This post details a joint effort between the search and the web performance teams at Etsy to implement SRA on Etsy search pages and drastically improve the performance of product listing pages, with some metrics improving by 20-24% and some dropping all the way to 0ms.

Prefetching Options

There are two main ways to predictively fetch resources for the next page:

<link rel="prefetch"> allows us to instruct the browser to download a resource we believe we'll need soon. The resource can be anything, like a static asset or an HTML page.
Speculation Rules API (SRA) is a recently updated browser API which allows for a JSON definition to dictate what page A wants to do with page B. The "do"-ing can be either prefetching (just download the HTML) or prerendering (load the page, including its static assets, and render it completely). The prerendering happens in a new browser process and page B is ready to be swapped with the current page A instantaneously when the user navigates to B.

While implementing full prerendering is likely to yield more impressive performance improvements, it is a bigger and riskier investment, mostly related to the side effects of executing the JavaScript on the target page B. Starting with prefetching is a good first step into exploring SRA.

The benefits of using SRA over link prefetch will become evident later in the article, but the topline highlights include: a simpler API to define what is prefetched (via a CSS selector rather than do-it-yourself bespoke JavaScript), convenient utilities to define when, as well as where (memory and HTTP cache), prefetching happens, and a built-in upgrade path to full-on prerendering.

Implementing Speculation Rules

The Search team at Etsy recently ran an A/B experiment to use the Speculation Rules API to prefetch the listing page when hovering over organic listing cards on the desktop search page on Chromium browsers. To do this, we added a new <script type="speculationrules"> element to the search page.

This instructs the browser to download the HTML for a listing page when:

the user hovers over a link to a listing page for 200 milliseconds (defined by the “moderate” eagerness property), and
the link has a data-sr-prefetch attribute

The data attribute allows us to more precisely opt in pages that are eligible for prefetching.
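
A speculation rules definition matching the two conditions above might look like the following sketch. It is a reconstruction under assumptions, not the exact rules Etsy shipped: the selector `a[data-sr-prefetch]` and the helper name `renderSpeculationRulesTag` are hypothetical. The `prefetch`, `where`, `selector_matches`, and `eagerness` keys are part of the Speculation Rules syntax.

```javascript
// Hypothetical rules: prefetch links that opt in via data-sr-prefetch.
const speculationRules = {
  prefetch: [
    {
      // Only links carrying the data attribute are eligible.
      where: { selector_matches: "a[data-sr-prefetch]" },
      // "moderate" waits for a short hover (about 200ms) before prefetching.
      eagerness: "moderate",
    },
  ],
};

// Serialize the rules into the script element the browser reads.
function renderSpeculationRulesTag(rules) {
  // Escape "<" so the JSON can never close the script element early.
  const json = JSON.stringify(rules).replace(/</g, "\\u003c");
  return `<script type="speculationrules">${json}</script>`;
}

console.log(renderSpeculationRulesTag(speculationRules));
```

Browsers that do not support speculation rules ignore the unknown script type, so the tag can be emitted without feature detection.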

Lessons learned

In terms of changes to the page’s code, the implementation of SRA was straightforward. As such, we spent most of our time testing that everything was working as expected and that our systems and analytics were not inadvertently affected. And we found some surprises along the way, related to all the little details modern web pages use (such as cookies, redirects, new tabs).

Allow us to share a few lessons in prefetching...

Two ways to prefetch

As mentioned earlier, prefetching can happen one of two ways: <link rel="prefetch"> or speculative prefetch (the one using SRA). So what sets them apart?

They work mostly the same, except that the speculative prefetch caches the page in both memory and the HTTP cache. The <link rel="prefetch"> only uses the HTTP cache and merely downloads the specified resource.

The memory cache makes the SRA way of prefetching more advantageous than the link prefetch.

Two speculative prefetches only

We also discovered that the number of prefetched pages kept in memory is restricted to two. When you prefetch a third page, the first one is evicted from the memory cache. The HTTP cache still works as usual. So again, SRA prefetching is preferable to the link prefetch due to the difference in caching we just described.

It’s helpful while debugging to be aware of the eviction of the prefetched page from memory. But rest assured, the downloaded page is still cached locally.

Eagerness

While <link rel="prefetch"> advises the browser to load a resource as soon as it sees the <link> element in the DOM, the speculative prefetch is more nuanced, offering eager, immediate, conservative, and moderate loading. We selected moderate eagerness, which prefetches after the user has hovered over a link for 200ms.

Exploring our options we found that the immediate eagerness would trigger a significantly larger number of prefetches (since it executes immediately and prefetches all eligible pages), and we wanted to avoid creating new server requests for listing cards with a low likelihood of being clicked. However, the immediate eagerness setting could be worth considering if the cost of additional requests is very low. The conservative eagerness executes on pointer or touch down, providing a very small head start over normal browser behavior and therefore greatly reducing the potential benefits of prefetching. Conservative eagerness may only be suitable for a use case in which it is necessary to avoid unused prefetches altogether.

Note that eager and immediate were synonyms in the initial SRA implementation, but that is changing. Keep an eye on the official docs for updates.

Speculations and new browser tabs

Initially, SRA launched without the ability to prerender pages that open in new tabs, as Etsy listings do. This option was added later, but only for links whose target attribute is _blank, not for named targets such as the ones that Etsy uses.

Fortunately, the target restriction doesn't apply to prefetching, so for SRA prefetching (unlike SRA prerendering) there's no problem, regardless of whether or how you specify a target at all. For developers who may be considering moving from prefetching to prerendering, this distinction is something to bear in mind.

5-minute rule

Because of the complex nature of listing pages, Etsy's HTML pages are non-cacheable. However, the speculative prefetch keeps the prefetched pages cached in memory for five minutes. This was a helpful finding, as there would be no point in using speculative prefetches at all if they expired immediately. After five minutes, the normal caching rules apply, set via HTTP headers such as Cache-Control (max-age) or Expires.

Given that only two pages are currently kept in memory cache and all others expire because they are non-cacheable, the benefits are greatly reduced when, for example, a person hovers over 3 links and eventually clicks the first one which leads to a page that's already expired from the prefetch memory (and HTTP!) cache.

To work around the two-page restriction, one strategy we devised is to make our pages cacheable for five minutes when we detect a prefetch request. Such requests are identifiable because the browser sends the Sec-Purpose: prefetch HTTP header when prefetching. This helps preserve downloaded pages that would’ve otherwise expired from both the memory and HTTP caches.

Video links and shadow DOM

Often, listings on Etsy include product videos, which start to play on the search results page when a user hovers their mouse over them. In these cases, prefetching doesn't work: the mouse hover is effectively "swallowed," disappearing into the shadow DOM of the browser's video player. One workaround is to overlay a div on top of the video for 200ms to let the hover register in the DOM. Then, after the 200ms has elapsed, remove the extra div to let the browser video controls (e.g., on right click) work as usual. You can find a demonstration of this technique here.

Cookies

If a page sets cookies, prefetching it will set those cookies as well (as demonstrated here). This is something to be aware of, as the prefetch may end up being unused. This may confuse your application (and/or analytics) into thinking a page has been visited when in reality it has not.

Again, you can use the Sec-Purpose: prefetch HTTP header to detect prefetch requests and avoid setting the cookie as part of the prefetching process.
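
The two server-side uses of Sec-Purpose described so far (five-minute caching for prefetches, and skipping cookies) can be sketched as a small policy function. Everything here is an assumption for illustration: the function names, the lowercase header map (as Node.js exposes request headers), and the specific Cache-Control values are not taken from Etsy's implementation.

```javascript
// Sec-Purpose carries a "prefetch" token, optionally followed by
// parameters such as ";prerender".
function parseSecPurpose(requestHeaders) {
  const raw = requestHeaders["sec-purpose"];
  if (!raw) return { prefetch: false, prerender: false };
  const parts = raw.split(";").map((part) => part.trim().toLowerCase());
  return {
    prefetch: parts[0] === "prefetch",
    prerender: parts.includes("prerender"),
  };
}

// Decide how to respond: prefetches become cacheable for five minutes and
// set no cookies; normal navigations keep the non-cacheable default.
function responsePolicy(requestHeaders) {
  const purpose = parseSecPurpose(requestHeaders);
  if (purpose.prefetch) {
    return { cacheControl: "private, max-age=300", setCookies: false };
  }
  // Assumed default for a non-cacheable page.
  return { cacheControl: "no-store", setCookies: true };
}

console.log(responsePolicy({ "sec-purpose": "prefetch" }));
console.log(responsePolicy({}));
```

If cookies are skipped during a prefetch, they still need to be set once the user actually navigates to the page, for example from client-side code or on a subsequent request.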

Redirects

If the link to the page being prefetched goes through a redirect, the actual page after the redirect is still prefetched. Let’s say you have a sequence that looks like this:

Link on Page A -> redirect -> Page B

Here the browser follows the redirect during prefetching and still caches Page B. When the user then clicks the link on page A leading to Page B, the browser follows the usual process of going through the redirect. Normal HTTP cache rules still apply, meaning that if the redirect is cached, it won’t need to be requested again.

So, even though redirects are a bad performance practice, if you need to do them, they do not affect prefetching as long as you set appropriate caching headers.

Mutating hrefs

Sometimes the href attributes of elements get modified by JavaScript on mouse hover. This does not play well with prefetching. Imagine you have:

<a href="link.html">Follow me</a>

… which changes on hover to:

<a href="link.html?source=footer">Follow me</a>

When the user hovers over the link, the browser starts working on prefetching link.html but realizes that the link to that page is no longer in the DOM and abandons the process. So the page is not prefetched even if ?source=footer doesn’t change the target page in any way other than reporting analytics. The browser has no way of knowing this and considers the two as separate pages.

Additionally, the failed attempt at prefetching link.html counts in the “two speculations only” rule and evicts the older speculative load from the memory cache. For best results, avoid modifying links on hover.

Analytics and Event Logging

This is the elephant in the room. Many sites on the web today were built in a world where prefetching did not exist. So there is one big assumption: that a page load is always initiated by the user and the load can be counted as such – either server-side during page construction or client-side by JavaScript after the page is loaded (or, as it often happens, a combination of the two). With prefetching, this assumption is no longer true. A page constructed on the server-side and downloaded by the browser does not necessarily mean the page has been seen (and therefore its JavaScript has been executed). This can result in a number of miscalculations when it comes to analytics.

Luckily, the Sec-Purpose HTTP header and JavaScript APIs (document.prerendering and the prerenderingchange event) allow us to tell prerender requests from user-generated ones, and to detect when a prerendered page is "activated" (when the user actually sees the page). For prefetches, the deliveryType property of a Performance Resource Timing entry, which takes the value navigational-prefetch, can be used for the purposes of analytics.
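
A client-side sketch of these checks is shown below. The function names are hypothetical, and `doc` and `perf` are passed in as parameters (in a browser they would be `document` and `performance`) so the logic can be exercised outside a page. The APIs used are `document.prerendering`, the `prerenderingchange` event, and the `deliveryType` property of the navigation timing entry.

```javascript
// Describe how the current page load was delivered.
function describeNavigation(doc, perf) {
  const [nav] = perf.getEntriesByType("navigation");
  return {
    // True while the page is being prerendered and not yet shown.
    prerendering: doc.prerendering === true,
    // True when the document was served from a speculative prefetch.
    servedFromPrefetch:
      Boolean(nav) && nav.deliveryType === "navigational-prefetch",
  };
}

// Run callback once the page is actually visible to the user: immediately
// for normal loads, or on activation for prerendered pages.
function whenActivated(doc, callback) {
  if (doc.prerendering) {
    doc.addEventListener("prerenderingchange", callback, { once: true });
  } else {
    callback();
  }
}
```

In a browser this would be called as `describeNavigation(document, performance)` and `whenActivated(document, sendPageView)`, where `sendPageView` stands in for whatever analytics call the page makes.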

We (and our analytics partners) found this to be the hardest part: ironing out the required analytics updates so that numbers remain true after implementing speculation rules. In our particular use case, we intentionally pursued a strategy of prefetching the destination page instead of prerendering it, meaning that no assets would be loaded and JavaScript would not execute on our prefetches. This gave us a relatively simple way to handle the accuracy of our analytics.

A foundational piece of our analytics is event logging. For example, in the controller of the listing page we log a view_listing event that contains key information such as the listing ID, user ID, etc. This informs not only our site analytics, but also our search training pipeline, recently viewed listing data for users, and more. We ended up creating a system to cache the payload of all events within a request to avoid firing those events during prefetches. We were then able to move that event logging to the destination page’s JavaScript bundle, deferring them until after page “activation” and mitigating the impact of prefetching on our analytics.
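
The deferral idea can be sketched as follows. This is a simplified illustration under assumptions, not Etsy's event system: the class and function names are hypothetical, and `send` stands in for whatever function actually delivers an event. During a prefetch, events are buffered instead of fired, the buffer is serialized into the response, and client-side code replays it after activation.

```javascript
// Collects events for one request; fires them immediately for normal
// requests and buffers them for prefetches.
class RequestEventLog {
  constructor(isPrefetch, send) {
    this.isPrefetch = isPrefetch;
    this.send = send;
    this.buffered = [];
  }

  log(name, payload) {
    if (this.isPrefetch) {
      this.buffered.push({ name, payload });
    } else {
      this.send(name, payload);
    }
  }

  // Serialized events to embed in the page for later replay.
  deferredPayload() {
    return JSON.stringify(this.buffered);
  }
}

// Client side: replay the buffered events once the page is activated.
function flushDeferred(serializedEvents, send) {
  const events = JSON.parse(serializedEvents);
  for (const event of events) {
    send(event.name, event.payload);
  }
  return events.length;
}

// Example: a prefetch request logs view_listing without firing it.
const sent = [];
const record = (name, payload) => sent.push({ name, payload });
const eventLog = new RequestEventLog(true, record);
eventLog.log("view_listing", { listingId: 123 });
flushDeferred(eventLog.deferredPayload(), record);
```

With this shape, an unused prefetch never replays its buffer, so no view_listing event is recorded for a page the user did not see.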

Results

We were thrilled with the performance results of the prefetching experiment. We saw a 20-24% improvement in many performance metrics we care about: TTFB, DOMContentLoaded, FCP, LCP.

The 75th percentile time to first byte (TTFB) on the listing page improved by 23.6%
We saw similar improvements throughout the request: First Contentful Paint -20.7%, Largest Contentful Paint -21.1%, DOMContentLoaded -20.4%, and Page Load -10.6%

In the cumulative distribution function below, we see the control of our experiment (no speculation rules prefetches) in blue, and the treatment (speculation rules prefetches) in orange, with the treatment dramatically faster than the control at every percentile. Remarkably, about 40% of eligible browsers saw their TTFB drop nearly to zero:

We saw small but detectable improvements in some business metrics, which is promising given that listing page views come from many sources, only some of which are search results. As we implement more prefetching in more places, we hypothesize that the numbers will further improve.

People approaching an SRA implementation may worry about unused prefetches and wasted resources. In our experiment we saw a ratio of about 14:1 between prefetches requested and pages subsequently activated (i.e., about 1 in 14 prefetch requests was navigated to by the user).

We’re encouraged by these results, and are looking forward to new opportunities to improve performance across additional surfaces.

Opportunities to iterate and expand

One clear opportunity is to try implementing prefetching on other pages beyond Search. Shoppers end up on product listing pages from various other referral surfaces: shop pages, our SEO-optimized landing pages, home page, etc. Prefetching could improve performance on these surfaces, leading to a better experience for Etsy buyers.

Another opportunity is to consider upgrading our prefetching to prerendering in the future. This would be a significant change to client-side JavaScript code operating during prefetches. However (and it's hard to contain the excitement about this!) Chrome is working on prerender-until-script update, which means prerendering stops at the first