<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" 
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    >
    <channel>
        <title>Etsy Engineering | Code as Craft</title>
        <link>http://www.etsy.com/codeascraft/rss</link>
        <description>The engineers who make Etsy make our living with a craft we love: software. This is where we'll write about our craft and our collective experience building and running the world's most vibrant handmade marketplace.</description>
        <pubDate>Sat, 11 Apr 2026 05:18:10 -0400</pubDate>
        <language>en-US</language>
        <atom:link href="http://www.etsy.com/codeascraft/rss" rel="self" type="application/rss+xml" />
        <image title="Etsy Engineering | Code as Craft" link="http://www.etsy.com/codeascraft/rss" url="https://etsy.com/images/blogs/code_as_craft/cac_logo_lavender.png" />
            <item>
                <title><![CDATA[Making Ads Count: Using MMoE and Auxiliary Tasks to Better Connect Buyers & Sellers]]></title>
                <link>https://www.etsy.com/codeascraft/making-ads-count-using-mmoe-and-auxiliary-tasks-to-better-connect-buyers--sellers?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Amanda Steigman]]></dc:creator>
                <pubDate>Fri, 20 Mar 2026 14:31:56 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/making-ads-count-using-mmoe-and-auxiliary-tasks-to-better-connect-buyers--sellers?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>When buyers search on Etsy, they need to quickly and easily find the perfect item. At the same time, sellers need to be confident their unique products are being seen by the right customers. Our Ads Search ranking model, which is built on a multitask learning foundation, is the critical link in this connection. Recently, we identified an opportunity to drive more meaningful buyer engagement by enhancing our model’s ability to predict purchase intent. We achieved this via a dual-pronged improvement: introducing <strong>Multigate Mixture of Experts (MMoE) to our model architecture</strong> and leveraging <strong>add-to-cart as an auxiliary signal</strong>. By providing our downstream systems with more accurate predictions, we improved matching in our marketplace, surfacing more relevant listings for buyers while helping sellers reach customers who are genuinely interested in their products.</p>
<h1>Background</h1>
<p>When a buyer searches for an item on Etsy, we want them to find exactly what they’re looking for from our inventory containing tens of millions of listings. To help them do this, we surface high-quality listings that are relevant to a user’s search query by ranking a small subset of items from a much larger group. This includes advertisements purchased by sellers that enable them to promote their listings across Etsy placements, including search. While these results are sponsored, the items go through their own ranking process to surface the listings most likely to meet a buyer’s needs. The final result on the search page utilizes our auto-bidding system, which helps decide which listings get shown and the cost-per-click.</p>
<p>After a user views an ad (known as an “impression”), clicking on the ad is often the first engagement in their purchase journey. However, each subsequent step – from click to cart addition to purchase – represents a progressively smaller subset of users. The increasing data sparsity that exists further along in the purchase journey can make it difficult for our model to pick up on a strong signal to learn from. </p>
<p>When ranking ads, our machine learning models optimize for click-through rate (CTR) and post-click conversion rate (PCCVR). Clicks and purchases are the primary behaviors we use to predict and drive user engagement, but other actions in the buyer’s purchase journey, such as adding an item to a cart, are important and often predictive of a purchase. </p>
<div align="center"><img src="https://i.etsystatic.com/inv/bc3170/7823894384/inv_fullxfull.7823894384_k23glddx.jpg?version=0" width="200" alt="https://i.etsystatic.com/inv/bc3170/7823894384/inv_fullxfull.7823894384_k23glddx.jpg?version=0" /><p><em>Figure 1. The Ads Search user journey.</em></p></div>
<p><br /></p>
<p>Some post-impression actions, such as favoriting an item, are not directly related to a buyer's purchase journey but can provide valuable signals to enhance our model's predictive capability. A click can be a strong indicator of a future purchase, but it can also be noisy – meaning it doesn’t always reliably predict purchase intent. For example, a user may click on an ad purely out of curiosity with no intention to buy. These are just a few reasons why user behavior is complex, and we are constantly trying to improve our prediction models to better capture these patterns and recommend the most relevant ads.</p>
<h1>Multitask Model Architecture</h1>
<p>The Ads Search ranking model is a multitask learning framework containing four major components: feature representation, explicit feature interaction, implicit feature interaction, and task prediction. Figure 2 is a depiction of our model architecture prior to the enhancements this post will describe. We start with raw numerical, categorical, and high-cardinality ID features for query, user, and listing entities, which are converted through the feature representation layer - including text embeddings and sequence encodings - to generate dense feature representations. These are concatenated and fed to a Deep and Cross Network (DCN) that learns explicit feature interactions. The explicitly crossed features then pass through a shallow feed-forward network for the model to learn additional implicit feature interactions. Finally, the latent feature representations are fed into task-specific towers to output CTR and PCCVR predictions. </p>
<div align="center"><img src="https://i.etsystatic.com/inv/f3c814/7823892552/inv_fullxfull.7823892552_7gh8fhz0.jpg?version=0" width="200" alt="https://i.etsystatic.com/inv/f3c814/7823892552/inv_fullxfull.7823892552_7gh8fhz0.jpg?version=0" /><p><em>Figure 2. The initial multitask architecture used for the Ads Search ranking model, which has since been upgraded with Multigate Mixture of Experts (MMoE).</em></p></div>
<p><br /></p>
<p>Since the CTR and PCCVR predictions are used in downstream ads ranking and auto-bidding systems, we need the predictions to be well-calibrated. After the underlying model is trained, we individually calibrate the CTR and PCCVR towers to probability distributions using <a href="https://changyaochen.github.io/platt-scaling/">Platt scaling</a> layers. As user behaviors vary significantly across ad placements, the model learns distinct parameters for different placements.</p>
<h2>Optimizing for Purchase Intent</h2>
<p>The multitask architecture we use has several advantages: it helps the model learn shared patterns across tasks, reduces overfitting by allowing the model to learn more generalizable features, and decreases training and serving infrastructure costs by consolidating two separate models into one. We originally deployed this multitask ranking model online in July 2023 and had not made major changes to its architecture since then. In the second half of 2025, the team identified an opportunity to better optimize for meaningful buyer engagement beyond click and purchase signals alone.</p>
<p>Our goal was to not only surface listings that resonate with buyers and drive conversions but also encourage them to return to Etsy – creating a positive feedback loop that benefits both buyers and sellers. We hypothesized that optimizing our models for engagement actions that signaled both purchase intent and buyer satisfaction would surface more relevant ads to buyers. Engagement actions include behavior that goes beyond a simple click, such as an add-to-cart or a favorited listing. To more effectively prioritize listings that led to this meaningful engagement for our buyers, we experimented with:</p>
<ol>
<li>A model architecture that would better predict purchase intent and </li>
<li>Additional signals to boost high-quality listings that buyers are more likely to purchase</li>
</ol>
<p>These two enhancements, in the form of Multigate Mixture of Experts (MMoE) and add-to-cart as an auxiliary task, worked well together in our model to drive a sizable product improvement in Q4 2025.</p>
<h1>Enabling Task-Specific Learning with Multigate Mixture of Experts (MMoE)</h1>
<p>While the introduction of our initial multitask model was a large success overall, it also had a limitation: since the model learns and shares the same feature representations across tasks, it is not always able to learn task-specific nuances, and this is more pronounced the less related the tasks are. When one task sees improved performance, other tasks can see performance degradation. This behavior is known as the “seesaw phenomenon,” and we encountered this when we first brought the multitask model online. One solution to this limitation is to add a <a href="https://dl.acm.org/doi/pdf/10.1145/3219819.3220007">Multigate Mixture of Experts</a> (MMoE) layer. In this architecture, the model still employs a shared bottom architecture where the feature representation and interaction layers remain unchanged. However, the MMoE layer introduces two key additional components in place of the shared feed-forward network: experts and gates. </p>
<p>Experts are parallel subnetworks that, unlike the shared representations before them in the network, are able to specialize and learn different patterns in the data. Experts are not explicitly assigned to tasks; rather, specialization emerges during training – some experts learn more about click-specific behavior, others learn more about purchase-specific behavior, and others learn patterns that are important for both actions. Each task has its own softmax gating network, which controls how that task combines expert outputs. This allows the CTR and PCCVR tasks to weight the experts differently, and the resulting weighted combination of expert outputs is then sent to our task-specific towers.</p>
<div align="center"><img src="https://i.etsystatic.com/inv/43e271/7823954316/inv_fullxfull.7823954316_gh3aar4q.jpg?version=0" width="600" alt="https://i.etsystatic.com/inv/43e271/7823954316/inv_fullxfull.7823954316_gh3aar4q.jpg?version=0" /><p><em>Figure 3. A comparison of our shared bottom multitask architecture for the Ads Search ranking model (left) and our MMoE architecture (right).</em></p></div>
<h2>Tuning the Experts</h2>
<p>The main hyperparameters to tune in the MMoE layer are the number of experts, the size of the experts, and the expert type. The number of experts used depends on several factors. One factor is a tradeoff between having too few experts – which can underfit and fail to capture distinct patterns needed for each task – and too many – which can overfit to training data and fail to generalize well. Another factor to consider is that adding experts increases model capacity, which in turn increases latency and infrastructure costs. Our initial configuration included only multilayer perceptron (MLP)-based experts (i.e., feed-forward neural networks), but we experimented offline with <a href="https://medium.com/pinterest-engineering/multi-gate-mixture-of-experts-mmoe-model-architecture-and-knowledge-distillation-in-ads-08ec7f4aa857">heterogeneous experts</a> and saw an offline lift in purchase and click metrics from introducing a mixture of DCN- and MLP-based experts. </p>
<p>There are other pitfalls when employing an MMoE architecture which oftentimes require additional hyperparameter tuning to resolve. Specifically, common issues in an MMoE structure are expert utilization (the ability of each task to use multiple experts) and expert specialization (the ability of each expert to learn differently). If experts are not well-utilized, the model has wasted capacity and fails to leverage the full representational power of the architecture. If experts are not specialized, the model effectively reduces to a shared bottom architecture with redundant experts. On the other hand, if experts specialize too strictly and are not shared between tasks, the model loses some of the benefits of multitask transfer learning.</p>
<p>We ran into these issues when training our new model. To build a successful model with MMoE, we needed each task to utilize more experts and utilize some of the same experts as the other tasks so that they could benefit from both specialized and shared learning. We experimented offline with two regularization techniques to try and solve this issue: expert dropout and temperature scaling.</p>
<p>In expert dropout, we randomly disable some experts during training to force the model to learn more diverse representations. Expert dropout differs from typical “dropout” in neural networks (where we randomly remove a percent of connections in a given layer during training), as we fully remove the utilization of a number of experts during the forward pass. Using expert dropout, utilization did improve a bit: each gate was selecting a primary and a secondary expert for each task. Still, we did not see any sharing of experts between the tasks. </p>
<p>We then tried temperature scaling, which modifies the raw logits of the expert gates by dividing them by a temperature (T) to control the smoothness of the resulting probability distribution. By applying this before the softmax function (which converts logits to probabilities) in the gates with a T &gt; 1, we softened the distribution, making it more likely to select multiple experts. Expert dropout is random and only applied at training while temperature scaling is deterministic and applied at both training and inference. Temperature scaling achieved better utilization and specialization than expert dropout, leading us to deploy this approach.</p>
<h1>Auxiliary Tasks</h1>
<p>Our multitask model already leveraged user click and purchase engagements to train CTR and PCCVR towers. However, we also have access to other rich user interactions, namely add-to-cart and favorites, that reflect the meaningful buyer engagement described above. Purchases are quite sparse compared to clicks, and one of the major benefits of our original multitask model was that the sparse purchase action could benefit from the additional signal provided by the more common click action. Our goal with adding auxiliary tasks was to help the model learn more generalizable representations of user engagement, again leaning on actions that are more plentiful than purchase. We hypothesized that add-to-cart and favorite actions were indications of high purchase intent that would help the model better learn the purchase task without hurting the click task.</p>
<p>Since we do not use add-to-cart and favorite predictions for downstream use cases like ranking or bidding, we did not need to calibrate these predictions or serve them online. This makes them relatively straightforward to add to our existing uncalibrated model architecture. In the shared bottom version of our model, we simply add one tower for each additional task. In the MMoE version of our model, we add one gate and one tower for each additional task. We experimented with both versions offline and found that MMoE in combination with auxiliary tasks performed better than the shared bottom model with auxiliary tasks. It makes sense that MMoE would outperform the shared bottom model as we added more tasks, given the “seesaw phenomenon” described earlier.</p>
<p>Through experimentation, we learned that while add-to-cart as an auxiliary task boosted purchase metrics by bridging the gap between purchases and clicks in terms of relatedness, favorites actually had a negative impact (known as negative transfer) on the model. With further analysis, we found that favorites can be quite noisy and are not indicative of high purchase intent. As a result, the version of the model that was ultimately ramped up in production included only add-to-cart as an auxiliary task.</p>
<div align="center"><img src="https://i.etsystatic.com/inv/5f0b3d/7823681610/inv_fullxfull.7823681610_qbngbrck.jpg?version=0" width="300" alt="https://i.etsystatic.com/inv/5f0b3d/7823681610/inv_fullxfull.7823681610_qbngbrck.jpg?version=0" /><p><em>Figure 4. A simplified version of the MMoE piece of our model architecture with add-to-cart (ATC) as an auxiliary task.</em></p></div>
<h1>Results and Impact</h1>
<p>Offline, our new model showed promising improvements in Purchase and Click Area Under the Precision-Recall Curve (PR AUC) and Purchase Area Under the Receiver Operating Characteristic Curve (ROC AUC) metrics. Together, these metrics measure how well our model predicts buyer behavior – PR AUC evaluates its ability to rank relevant listings at the top of the search results, and Purchase ROC AUC evaluates its ability to distinguish between listings buyers will and will not purchase. We saw average <strong>increases of 3.5% and 1% to Purchase and Click PR AUCs</strong>, respectively, and a <strong>0.5% increase to Purchase ROC AUC</strong>, meaningful lifts for an industry-level ranking system.</p>
<p>When we deployed the model online, we saw three meaningful improvements across the marketplace. First, the model drove purchases, improving buyer experience by more accurately predicting which listings from our inventory would resonate with them. Second, the ads marketplace became more efficient due to an improvement in purchase calibration metrics. More accurate PCCVR predictions served as better inputs to our auto-bidding system, which helped sellers reach buyers who are genuinely interested in their listings. Finally, the MMoE architecture is more flexible than the shared bottom architecture, so we were able to keep the overall model size flat by pruning other parts of the model when adding in MMoE. At serving time, inference became less costly, likely due to differences in the distribution of compute across model components.</p>
<h2>What’s Next</h2>
<p>The MMoE architecture provides the flexibility to add a variety of tasks to our ranking model: it reduces the risk of negative transfer by encouraging some experts to learn task-specific patterns and others to learn shared representations. After seeing success with the add-to-cart task in our new modeling framework, we plan on experimenting with several additional auxiliary tasks, such as dwell time, to further improve our model’s ability to connect buyers with listings they’ll love. </p>
<p>Ads give sellers an additional opportunity to stand out to buyers seeking their unique creations on Etsy. With each improvement to our ranking model, we continue to strengthen the marketplace connection between buyers and sellers – facilitating matches that help our sellers’ businesses grow and buyers discover products that feel made for them.</p>]]></description>
                <content:encoded><![CDATA[<p>When buyers search on Etsy, they need to quickly and easily find the perfect item. At the same time, sellers need to be confident their unique products are being seen by the right customers. Our Ads Search ranking model, which is built on a multitask learning foundation, is the critical link in this connection. Recently, we identified an opportunity to drive more meaningful buyer engagement by enhancing our model’s ability to predict purchase intent. We achieved this via a dual-pronged improvement: introducing <strong>Multigate Mixture of Experts (MMoE) to our model architecture</strong> and leveraging <strong>add-to-cart as an auxiliary signal</strong>. By providing our downstream systems with more accurate predictions, we improved matching in our marketplace, surfacing more relevant listings for buyers while helping sellers reach customers who are genuinely interested in their products.</p>
<h1>Background</h1>
<p>When a buyer searches for an item on Etsy, we want them to find exactly what they’re looking for from our inventory containing tens of millions of listings. To help them do this, we surface high-quality listings that are relevant to a user’s search query by ranking a small subset of items from a much larger group. This includes advertisements purchased by sellers that enable them to promote their listings across Etsy placements, including search. While these results are sponsored, the items go through their own ranking process to surface the listings most likely to meet a buyer’s needs. The final result on the search page utilizes our auto-bidding system, which helps decide which listings get shown and the cost-per-click.</p>
<p>After a user views an ad (known as an “impression”), clicking on the ad is often the first engagement in their purchase journey. However, each subsequent step – from click to cart addition to purchase – represents a progressively smaller subset of users. The increasing data sparsity that exists further along in the purchase journey can make it difficult for our model to pick up on a strong signal to learn from. </p>
<p>When ranking ads, our machine learning models optimize for click-through rate (CTR) and post-click conversion rate (PCCVR). Clicks and purchases are the primary behaviors we use to predict and drive user engagement, but other actions in the buyer’s purchase journey, such as adding an item to a cart, are important and often predictive of a purchase. </p>
<div align="center"><img src="https://i.etsystatic.com/inv/bc3170/7823894384/inv_fullxfull.7823894384_k23glddx.jpg?version=0" width="200" alt="https://i.etsystatic.com/inv/bc3170/7823894384/inv_fullxfull.7823894384_k23glddx.jpg?version=0" /><p><em>Figure 1. The Ads Search user journey.</em></p></div>
<p><br /></p>
<p>Some post-impression actions, such as favoriting an item, are not directly related to a buyer's purchase journey but can provide valuable signals to enhance our model's predictive capability. A click can be a strong indicator of a future purchase, but it can also be noisy – meaning it doesn’t always reliably predict purchase intent. For example, a user may click on an ad purely out of curiosity with no intention to buy. These are just a few reasons why user behavior is complex, and we are constantly trying to improve our prediction models to better capture these patterns and recommend the most relevant ads.</p>
<h1>Multitask Model Architecture</h1>
<p>The Ads Search ranking model is a multitask learning framework containing four major components: feature representation, explicit feature interaction, implicit feature interaction, and task prediction. Figure 2 is a depiction of our model architecture prior to the enhancements this post will describe. We start with raw numerical, categorical, and high-cardinality ID features for query, user, and listing entities, which are converted through the feature representation layer - including text embeddings and sequence encodings - to generate dense feature representations. These are concatenated and fed to a Deep and Cross Network (DCN) that learns explicit feature interactions. The explicitly crossed features then pass through a shallow feed-forward network for the model to learn additional implicit feature interactions. Finally, the latent feature representations are fed into task-specific towers to output CTR and PCCVR predictions. </p>
<div align="center"><img src="https://i.etsystatic.com/inv/f3c814/7823892552/inv_fullxfull.7823892552_7gh8fhz0.jpg?version=0" width="200" alt="https://i.etsystatic.com/inv/f3c814/7823892552/inv_fullxfull.7823892552_7gh8fhz0.jpg?version=0" /><p><em>Figure 2. The initial multitask architecture used for the Ads Search ranking model, which has since been upgraded with Multigate Mixture of Experts (MMoE).</em></p></div>
<p><br /></p>
<p>Since the CTR and PCCVR predictions are used in downstream ads ranking and auto-bidding systems, we need the predictions to be well-calibrated. After the underlying model is trained, we individually calibrate the CTR and PCCVR towers to probability distributions using <a href="https://changyaochen.github.io/platt-scaling/">Platt scaling</a> layers. As user behaviors vary significantly across ad placements, the model learns distinct parameters for different placements.</p>
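<p>As a rough illustration of the calibration step, the sketch below fits a Platt scaling layer, <code>p = sigmoid(a * score + b)</code>, separately for each ad placement. It is a minimal sketch under stated assumptions: the placement names, the raw scores, the labels, and the plain gradient-descent fit are all hypothetical and are not taken from the production system.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(score, params):
    """Map a raw tower score to a calibrated probability."""
    a, b = params
    return sigmoid(a * score + b)

# Hypothetical raw tower scores and binary labels, grouped by ad placement.
data = {
    "search": ([2.0, 1.5, -0.5, -1.0, 0.3, 2.2], [1, 0, 0, 0, 1, 0]),
    "market": ([0.5, -0.2, 1.0, -1.5], [0, 1, 1, 0]),
}

# One (a, b) pair per placement, mirroring "distinct parameters for different placements".
platt_params = {name: fit_platt(s, y) for name, (s, y) in data.items()}
```

<p>Because the fit includes an intercept, the average calibrated prediction for each placement ends up close to that placement's observed positive rate, which is the property downstream bidding relies on.</p>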
<h2>Optimizing for Purchase Intent</h2>
<p>The multitask architecture we use has several advantages: it helps the model learn shared patterns across tasks, reduces overfitting by allowing the model to learn more generalizable features, and decreases training and serving infrastructure costs by consolidating two separate models into one. We originally deployed this multitask ranking model online in July 2023 and had not made major changes to its architecture since then. In the second half of 2025, the team identified an opportunity to better optimize for meaningful buyer engagement beyond click and purchase signals alone.</p>
<p>Our goal was to not only surface listings that resonate with buyers and drive conversions but also encourage them to return to Etsy – creating a positive feedback loop that benefits both buyers and sellers. We hypothesized that optimizing our models for engagement actions that signaled both purchase intent and buyer satisfaction would surface more relevant ads to buyers. Engagement actions include behavior that goes beyond a simple click, such as an add-to-cart or a favorited listing. To more effectively prioritize listings that led to this meaningful engagement for our buyers, we experimented with:</p>
<ol>
<li>A model architecture that would better predict purchase intent and </li>
<li>Additional signals to boost high-quality listings that buyers are more likely to purchase</li>
</ol>
<p>These two enhancements, in the form of Multigate Mixture of Experts (MMoE) and add-to-cart as an auxiliary task, worked well together in our model to drive a sizable product improvement in Q4 2025.</p>
<h1>Enabling Task-Specific Learning with Multigate Mixture of Experts (MMoE)</h1>
<p>While the introduction of our initial multitask model was a large success overall, it also had a limitation: since the model learns and shares the same feature representations across tasks, it is not always able to learn task-specific nuances, and this is more pronounced the less related the tasks are. When one task sees improved performance, other tasks can see performance degradation. This behavior is known as the “seesaw phenomenon,” and we encountered this when we first brought the multitask model online. One solution to this limitation is to add a <a href="https://dl.acm.org/doi/pdf/10.1145/3219819.3220007">Multigate Mixture of Experts</a> (MMoE) layer. In this architecture, the model still employs a shared bottom architecture where the feature representation and interaction layers remain unchanged. However, the MMoE layer introduces two key additional components in place of the shared feed-forward network: experts and gates. </p>
<p>Experts are parallel subnetworks that, unlike the shared representations before them in the network, are able to specialize and learn different patterns in the data. Experts are not explicitly assigned to tasks; rather, specialization emerges during training – some experts learn more about click-specific behavior, others learn more about purchase-specific behavior, and others learn patterns that are important for both actions. Each task has its own softmax gating network, which controls how that task combines expert outputs. This allows the CTR and PCCVR tasks to weight the experts differently, and the resulting weighted combination of expert outputs is then sent to our task-specific towers.</p>
<div align="center"><img src="https://i.etsystatic.com/inv/43e271/7823954316/inv_fullxfull.7823954316_gh3aar4q.jpg?version=0" width="600" alt="https://i.etsystatic.com/inv/43e271/7823954316/inv_fullxfull.7823954316_gh3aar4q.jpg?version=0" /><p><em>Figure 3. A comparison of our shared bottom multitask architecture for the Ads Search ranking model (left) and our MMoE architecture (right).</em></p></div>
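<p>The expert-and-gate mechanism described above can be sketched as a small forward pass. This is an illustrative sketch only: the layer sizes, the random weights, the single-layer ReLU experts, and the task names <code>ctr</code> and <code>pccvr</code> are assumptions made for the example, not details of the production model.</p>

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(vec):
    return [max(0.0, v) for v in vec]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def mmoe_forward(x, experts, gates):
    """Return one mixed representation per task.

    experts: list of weight matrices, each a single-layer ReLU expert.
    gates:   dict of task name to gate weights (num_experts x input_dim).
    """
    expert_outputs = [relu(linear(w, x)) for w in experts]
    dim = len(expert_outputs[0])
    mixed = {}
    for task, gate_w in gates.items():
        weights = softmax(linear(gate_w, x))  # one weight per expert, summing to 1
        mixed[task] = [
            sum(g * out[i] for g, out in zip(weights, expert_outputs))
            for i in range(dim)
        ]
    return mixed

rng = random.Random(0)
input_dim, expert_dim, num_experts = 8, 4, 3  # illustrative sizes only
experts = [random_matrix(expert_dim, input_dim, rng) for _ in range(num_experts)]
gates = {task: random_matrix(num_experts, input_dim, rng) for task in ("ctr", "pccvr")}
x = [rng.uniform(-1.0, 1.0) for _ in range(input_dim)]

# Each task receives its own weighted blend of the same expert outputs.
task_inputs = mmoe_forward(x, experts, gates)
```

<p>All tasks read from the same pool of experts, but each gate produces its own mixing weights, so the representation handed to the CTR tower can differ from the one handed to the PCCVR tower.</p>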
<h2>Tuning the Experts</h2>
<p>The main hyperparameters to tune in the MMoE layer are the number of experts, the size of the experts, and the expert type. The number of experts used depends on several factors. One factor is a tradeoff between having too few experts – which can underfit and fail to capture distinct patterns needed for each task – and too many – which can overfit to training data and fail to generalize well. Another factor to consider is that adding experts increases model capacity, which in turn increases latency and infrastructure costs. Our initial configuration included only multilayer perceptron (MLP)-based experts (i.e., feed-forward neural networks), but we experimented offline with <a href="https://medium.com/pinterest-engineering/multi-gate-mixture-of-experts-mmoe-model-architecture-and-knowledge-distillation-in-ads-08ec7f4aa857">heterogeneous experts</a> and saw an offline lift in purchase and click metrics from introducing a mixture of DCN- and MLP-based experts. </p>
<p>There are other pitfalls when employing an MMoE architecture which oftentimes require additional hyperparameter tuning to resolve. Specifically, common issues in an MMoE structure are expert utilization (the ability of each task to use multiple experts) and expert specialization (the ability of each expert to learn differently). If experts are not well-utilized, the model has wasted capacity and fails to leverage the full representational power of the architecture. If experts are not specialized, the model effectively reduces to a shared bottom architecture with redundant experts. On the other hand, if experts specialize too strictly and are not shared between tasks, the model loses some of the benefits of multitask transfer learning.</p>
<p>We ran into these issues when training our new model. To build a successful model with MMoE, we needed each task to utilize more experts and utilize some of the same experts as the other tasks so that they could benefit from both specialized and shared learning. We experimented offline with two regularization techniques to try and solve this issue: expert dropout and temperature scaling.</p>
<p>In expert dropout, we randomly disable some experts during training to force the model to learn more diverse representations. Expert dropout differs from typical “dropout” in neural networks (where we randomly remove a percent of connections in a given layer during training), as we fully remove the utilization of a number of experts during the forward pass. Using expert dropout, utilization did improve a bit: each gate was selecting a primary and a secondary expert for each task. Still, we did not see any sharing of experts between the tasks. </p>
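<p>One way to picture expert dropout is as a mask applied to the gate logits. The sketch below is a hypothetical illustration: the post does not say how dropped experts are handled, so masking their logits to negative infinity (which renormalizes the gate weights over the surviving experts) is an assumption, as are the function name and the example logits.</p>

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_with_expert_dropout(logits, num_dropped, rng, training=True):
    """Softmax over experts, with whole experts disabled during training.

    Assumption: a dropped expert gets weight 0 and the remaining weights
    are renormalized; at inference time every expert stays available.
    """
    if not training or num_dropped == 0:
        return softmax(logits)
    # Always keep at least one expert so the softmax stays well defined.
    num_dropped = min(num_dropped, len(logits) - 1)
    dropped = set(rng.sample(range(len(logits)), num_dropped))
    masked = [float("-inf") if i in dropped else v for i, v in enumerate(logits)]
    return softmax(masked)

rng = random.Random(0)
gate_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits for four experts
train_weights = gate_with_expert_dropout(gate_logits, 2, rng, training=True)
eval_weights = gate_with_expert_dropout(gate_logits, 2, rng, training=False)
```

<p>During training, two of the four experts receive exactly zero weight on this pass, so the task has to rely on whichever experts remain; at inference, all four contribute.</p>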
<p>We then tried temperature scaling, which modifies the raw logits of the expert gates by dividing them by a temperature (T) to control the smoothness of the resulting probability distribution. By applying this before the softmax function (which converts logits to probabilities) in the gates with a T &gt; 1, we softened the distribution, making it more likely to select multiple experts. Expert dropout is random and only applied at training while temperature scaling is deterministic and applied at both training and inference. Temperature scaling achieved better utilization and specialization than expert dropout, leading us to deploy this approach.</p>
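<p>The effect of the temperature is easy to see on a toy example. In the sketch below, the gate logits and the temperature value of 4 are made up for illustration; only the operation itself (dividing logits by T before the softmax) comes from the description above.</p>

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weights(logits, temperature=1.0):
    """Divide gate logits by the temperature, then apply softmax."""
    return softmax([v / temperature for v in logits])

gate_logits = [4.0, 1.0, 0.0]  # hypothetical logits for three experts
sharp = gate_weights(gate_logits, temperature=1.0)
soft = gate_weights(gate_logits, temperature=4.0)
```

<p>With T = 1 the top expert receives roughly 0.94 of the weight; with T = 4 its share falls to roughly 0.54 and the other two experts pick up the rest, which is the softening that encourages a task to draw on multiple experts.</p>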
<h1>Auxiliary Tasks</h1>
<p>Our multitask model already leveraged user click and purchase engagements to train CTR and PCCVR towers. However, we also have access to other rich user interactions, namely add-to-cart and favorites, that reflect the meaningful buyer engagement described above. Purchases are quite sparse compared to clicks, and one of the major benefits of our original multitask model was that the sparse purchase action could benefit from the additional signal provided by the more common click action. Our goal with adding auxiliary tasks was to help the model learn more generalizable representations of user engagement, again leaning on actions that are more plentiful than purchase. We hypothesized that add-to-cart and favorite actions were indications of high purchase intent that would help the model better learn the purchase task without hurting the click task.</p>
<p>Since we do not use add-to-cart and favorite predictions for downstream use cases like ranking or bidding, we did not need to calibrate these predictions or serve them online. This makes them relatively straightforward to add to our existing uncalibrated model architecture. In the shared bottom version of our model, we simply add one tower for each additional task. In the MMoE version of our model, we add one gate and one tower for each additional task. We experimented with both versions offline and found that MMoE in combination with auxiliary tasks performed better than the shared bottom model with auxiliary tasks.  It makes sense that MMoE would outperform the shared bottom when we added more tasks due to the “seesaw phenomenon” described earlier.</p>
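<p>The "one gate and one tower per task" structure can be sketched in a few lines. This is a deliberately tiny illustration under stated assumptions: each expert is a single linear unit, each tower is a scalar weight and bias followed by a sigmoid, and all weights and task names are hypothetical. The real model's layers are not described at this level of detail in the post.</p>

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def mmoe_forward(x, expert_weights, gate_weights, tower_weights):
    """All tasks share the experts; each task has its own gate and tower."""
    # Each expert is one linear unit here, so its output is a scalar.
    expert_outputs = [dot(w, x) for w in expert_weights]
    predictions = {}
    for task, gate in gate_weights.items():
        mix = softmax([dot(g, x) for g in gate])  # one gate logit per expert
        mixed = dot(mix, expert_outputs)
        w, b = tower_weights[task]  # toy tower: scalar weight and bias
        predictions[task] = 1.0 / (1.0 + math.exp(-(w * mixed + b)))
    return predictions

x = [1.0, 2.0]
experts = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
gates = {
    "click": [[0.2, 0.1], [0.0, 0.3], [-0.1, 0.0]],
    "purchase": [[-0.3, 0.2], [0.4, 0.1], [0.0, 0.0]],
}
towers = {"click": (1.5, 0.0), "purchase": (2.0, -1.0)}
base = mmoe_forward(x, experts, gates, towers)

# Adding an auxiliary task means adding one gate and one tower; experts are unchanged.
gates["atc"] = [[0.1, 0.1], [0.2, -0.1], [0.0, 0.2]]
towers["atc"] = (1.0, -0.5)
with_atc = mmoe_forward(x, experts, gates, towers)
```

<p>In this forward-only sketch the existing predictions are unaffected by the new task. During training, the auxiliary task's gradients would also update the shared experts, which is where positive or negative transfer comes from.</p>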
<p>Through experimentation, we learned that while add-to-cart as an auxiliary task boosted purchase metrics by bridging the gap between purchases and clicks in terms of relatedness, favorites actually had a negative impact (known as negative transfer) on the model. With further analysis, we found that favorites can actually be quite noisy and not indicative of high purchase intent. As a result, the version of the model that was ultimately ramped up in production included only add-to-cart as an auxiliary task.</p>
<div align="center"><img src="https://i.etsystatic.com/inv/5f0b3d/7823681610/inv_fullxfull.7823681610_qbngbrck.jpg?version=0" width="300" alt="Simplified MMoE architecture with add-to-cart (ATC) as an auxiliary task" /><p><em>Figure 4. A simplified version of the MMoE piece of our model architecture with add-to-cart (ATC) as an auxiliary task.</em></p></div>
<h1>Results and Impact</h1>
<p>Offline, our new model showed promising improvements in Purchase and Click Area Under the Precision-Recall Curves (PR AUCs) and Purchase Area Under the Receiver Operating Characteristic Curve (ROC AUC) metrics.  Together, these metrics measure how well our model predicts buyer behavior  – PR AUC evaluates its ability to rank relevant listings at the top of the search results, and Purchase ROC AUC evaluates its ability to distinguish between listings buyers will and will not purchase. We saw average <strong>increases of 3.5% and 1% to Purchase and Click PR AUCs</strong>, respectively, and a <strong>0.5% increase to Purchase ROC AUC</strong>, meaningful lifts for an industry-level ranking system.</p>
<p>When we deployed the model online, we saw three meaningful improvements across the marketplace. First, the model drove purchases, improving buyer experience by more accurately predicting which listings from our inventory would resonate with them. Second, the ads marketplace became more efficient due to an improvement in purchase calibration metrics. More accurate PCCVR predictions served as better inputs to our auto-bidding system, which helped sellers reach buyers who are genuinely interested in their listings. Finally, the MMoE architecture is more flexible than the shared bottom architecture, so we were able to keep the overall model size flat by pruning other parts of the model when adding in MMoE. At serving time, inference became less costly, likely due to differences in the distribution of compute across model components.</p>
<h2>What’s Next</h2>
<p>The MMoE architecture gives us the flexibility to add a variety of tasks to our ranking model: it reduces the risk of negative transfer by encouraging some experts to learn task-specific patterns and others to learn shared representations. After seeing success with the add-to-cart task in our new modeling framework, we plan on experimenting with several additional auxiliary tasks, such as dwell time, to further improve our model’s ability to connect buyers with listings they’ll love. </p>
<p>Ads give sellers an additional opportunity to stand out to buyers seeking their unique creations on Etsy. With each improvement to our ranking model, we continue to strengthen the marketplace connection between buyers and sellers  – facilitating matches that help our sellers’ businesses grow and buyers discover products that feel made for them.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Migrating Etsy’s database sharding to Vitess]]></title>
                <link>https://www.etsy.com/codeascraft/migrating-etsyas-database-sharding-to-vitess?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Ella Yarmo-Gray]]></dc:creator>
                <pubDate>Thu, 19 Mar 2026 17:03:54 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/migrating-etsyas-database-sharding-to-vitess?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>Etsy has maintained a sharded MySQL architecture since around 2010. This database cluster contains most of Etsy’s online data and is made up of ~1,000 tables distributed across ~1,000 shards. Over the last 16 years, it has grown significantly: combined, these tables have over 425 TB of data and receive roughly 1.7 million requests per second. </p>
<p>Etsy engineers access our MySQL data through a proprietary object-relational mapping (ORM). The ORM has a corresponding model for each MySQL table. When a table is sharded, its rows are partitioned among different databases known as shards. Each shard has identical table schemas and contains a distinct subset of all rows. Sharded tables require a unique id field to be designated on the model as the “shardifier id” which determines the shard for each record. </p>
<p>Shardifier ids are designed to co-locate related data in order to minimize the number of databases we need to rely on for a given query. For example, all records related to a single shop or user are stored on the same shard. The majority of models use shop_id or user_id as the shardifier id, but there are over 30 different options in use. Prior to Vitess, the ORM stored mappings of records to shards in a single, unsharded “index” database. When a record was created, the shard mapping for its id was determined randomly by the ORM and then stored in a table on the index database so it could be retrieved again later. When a query was sent to a sharded table using the ORM, the ORM first queried the index database to retrieve the shard mapping, and then targeted the desired query to the right shard.</p>
<p>This shard architecture brought two major benefits: scalability and resilience against host outages. The impact of a database outage was greatly reduced compared to an unsharded database – we only lost 1/1000 of traffic to the distributed database if a shard went down. However, it also had some drawbacks:</p>
<ul>
<li>The process for horizontally scaling was manual, complex, and took months to complete. </li>
<li>We were dependent on a single point of failure – the index database. If the index database went down, it could cause a full site outage since the ORM would no longer be able to route queries to any shard.</li>
<li>Product developers were required to understand and manage sharding when accessing data, which could create friction since it was often confusing and toilsome.</li>
</ul>
<p>As Etsy and the index database grew, we saw an increase in incidents in which the index database was unavailable, and addressing these vulnerabilities became a high priority. <a href="https://vitess.io/docs/24.0/overview/whatisvitess/">Vitess</a>, open source software that provides an abstraction layer for scaling, deploying, and managing large MySQL clusters, offered a solution to these problems.</p>
<h2>Introducing Vitess</h2>
<p>In 2018, we added Vitess to our database architecture. Prior to this change, the ORM issued queries directly to the MySQL database for a specific shard. Now, each query was sent to Vitess, which would then query MySQL. At first, the ORM told Vitess which shard to query so we could validate that the new components worked in our environment before adding shard logic into Vitess. </p>
<p>With this new infrastructure in place, we were ready to start exploring <a href="https://vitess.io/docs/24.0/reference/features/vindexes/">vindexes</a>, which define sharding strategies within Vitess. Vindexes provide a way to calculate a shard given a column value akin to how we mapped shardifier ids to shard numbers in our index database. To familiarize ourselves with them, we <a href="https://www.etsy.com/codeascraft/search/Scaling%2520Etsy%2520Payments?ref=codeascraft">scaled several unsharded payments tables</a> by creating a new sharded database cluster that could use Vitess vindexes out-of-the-box. The success of that project paved the way for us to migrate our existing, in-house shard infrastructure to Vitess vindexes.</p>
<h2>Choosing a vindex</h2>
<p>Vitess provides a number of vindexes. We were most interested in those that calculated the shard using an algorithm like the <a href="https://vitess.io/docs/23.0/reference/features/vindexes/#predefined-vindexes">hash vindex</a>, which would remove the reliance on an additional data store for shard lookups. Since the ORM’s shard mappings are random and not algorithmic, using one of these out-of-the-box would require re-sharding all of our data – a process that would be manual and likely take years. Instead, we chose to write custom vindexes that ported our existing shard logic into Vitess so we could test how vindexes worked in our environment without the complexity and risk of moving data.</p>
<p>The first step of creating our custom vindexes was to modify the shard assignment algorithm in the ORM to match the Vitess hash vindex algorithm. After that change, no new shard mappings required an index database lookup, so we could store all of the existing mappings in a read-only database. We chose to use SQLite, which provided low latency reads and a small enough footprint to copy the database directly onto each Vitess server – avoiding the latency and dependency added by a call to an external database. Then we built a custom SQLite lookup vindex, which looks up shard information from a SQLite database.</p>
<p>Next, we built a custom hybrid vindex, which allows a table to apply one of two vindexes based on a threshold value – in this case, the SQLite vindex or the hash vindex. We set the threshold to the first id that was created after we changed the shard assignment algorithm. Any id that is above this value would use the hash vindex to look up its shard mapping, and any id that is below it would use the SQLite vindex.</p>
<h2>Introducing vindexes into our environment</h2>
<p>We were now able to introduce Vitess vindexes into our existing architecture without moving any data. Before doing so, we needed to ensure that all queries currently in use with legacy ORM sharding would still succeed and return the same results with Vitess managed sharding. Some incompatibility was expected: Vitess requires the shardifier id to be present in the WHERE clause of a SQL query in order to route it to the correct shard, but this was not previously required by the ORM. In fact, since Etsy developers were required to provide this information separate from the SQL query in our legacy architecture, many queries did not include it in the WHERE clause. More detail around how we found and solved for query incompatibilities can be found in <a href="https://www.etsy.com/codeascraft/scaling-etsy-payments-with-vitess-part-3--reducing-cutover-risk">Scaling Etsy Payments with Vitess, Part 3</a>.</p>
<p>The ORM contained over a decade of queries with many different ways of accessing MySQL, which made auditing each and every query time consuming. Building enough context to test changes to a model was also challenging since each table varied greatly in the design and purpose of its data. Given this, we decided to take an incremental approach and introduce vindexes into our codebase one table at a time. This would allow us to test on a smaller scale, ensuring each type of data access pattern would succeed with Vitess by monitoring how vindexes performed with a specific set of queries. We could slowly and safely surface incompatibilities while we made progress on the migration, allowing room for error without risking large-scale outages.</p>
<p>Since we would need to repeat the migration for hundreds of tables, we prioritized creating a clear and repeatable process. <a href="https://www.etsy.com/codeascraft/understanding-the-collective-impact-of-experiments">Etsy’s experimentation framework</a> enabled us to gradually ramp up the change by incrementally increasing the percentage of traffic that was using Vitess vindexes for shard routing for each table. This enabled us to assess impact by comparing the performance of a query when it was using vindexes vs ORM shard routing. If we encountered any issues along the way, it was easy to ramp down to 0. Minimizing the risk of each model ramp by starting small and having a quick way to roll back helped mitigate the risk of not testing all queries upfront.</p>
<h2>Working through challenges: database transactions</h2>
<p>Early on we ran into a challenge with this approach when handling database transactions. With our in-house shard routing, the ORM treated the shards as many separate databases and queried specific shards directly. With Vitess-managed sharding, the ORM queried the shards as if they were a single database, and they were only regarded as separate databases behind the scenes by Vitess. This meant that the two approaches created different connections to the database, which caused some issues for database transactions, since atomicity is only guaranteed per connection. To avoid data integrity issues that could come from breaking transaction guarantees, we required tables written to in a single database transaction to be ramped onto vindexes at the same time so they would use the same connection.</p>
<p>This was simple enough in theory, but ended up being pretty consequential. A small number of tables that represent the most complex and critical data models, like receipts, listings, and transactions, make up a large portion of database traffic to the sharded cluster. As we were preparing one of them to use vindexes, we realized that just 27 models (3% of tables) accounted for ⅓ of our database traffic – and they were all connected via database transactions. Despite our best efforts to minimize risk by ramping table-by-table, some of the riskiest changes were tightly coupled anyway. We collaborated across the company to ramp these 27 models in unison.
<img alt="Percentage of traffic using vindexes" src="https://i.etsystatic.com/inv/3f0fee/7870795065/inv_fullxfull.7870795065_hr9alodj.jpg?version=0" title="Percentage of traffic using vindexes" /></p>
<h2>Reaping the benefits: cross-shard queries</h2>
<p>One major advantage of using Vitess vindexes is the ability to query across shards. By default, any query that does not contain the shardifier id will "scatter" – Vitess will send it to all shards in parallel, sort the results, and return a single result set. This is a powerful tool – one model’s query time was reduced from ~2 seconds to ~20 ms by using cross-shard queries! However, at Etsy’s scale, unintentionally sending an expensive query to all 1,000 shards at once could quickly become an issue. To protect against these scenarios, we disabled scatter queries by default in our environment and provided a way to specify that a query is allowed to scatter via the ORM. </p>
<p>Utilizing scatter queries brought notable improvements to bulk primary key lookups in the 27 table ramp mentioned above. The ORM’s previous implementation took an array of primary keys, batched them by shard, issued a query to each shard, and returned the combined results. Since Vitess enables us to query multiple shards in a single query, we can skip the batching by shard and include all primary keys in a single query. This greatly reduced the number of queries issued for bulk lookups for some models.
<img alt="Number of bulk lookup queries for one model" src="https://i.etsystatic.com/inv/4da339/7822693132/inv_fullxfull.7822693132_7yjbrc8p.jpg?version=0" title="Number of bulk lookup queries for one model" /><em>Bulk lookup queries for one model during its ramp onto vindexes. Purple lines denote 1%, 10%, 50%, and 100% of traffic to the model using vindexes.</em></p>
<h2>Conclusion</h2>
<p>Five years, approximately 2,500 pull requests and 6,000 queries later, we have successfully migrated Etsy’s shard management to Vitess vindexes! Despite the work we put in to streamline the migration process, it was still a challenge to replace the database infrastructure for a codebase of Etsy’s scale and age. As infrastructure engineers, my team usually had minimal context around the code we were changing and what might break if things went wrong. However, with coordination and careful testing across Etsy engineering, we were able to meet our goals:</p>
<ul>
<li>Scaling operations are no longer manual and can be performed in a matter of days rather than months.</li>
<li>We removed the index database as a single point of failure for our sharded cluster.</li>
<li>Shard infrastructure has been hidden from the developer’s view, which makes data modeling and query writing much simpler.</li>
<li>Database performance stayed largely the same as it was prior to the migration, rendering the change invisible to our end users.</li>
<li>We were able to progress incrementally, roll back quickly, and integrate Vitess into our shard architecture without performing a massive data migration.</li>
</ul>
<p>We are excited to use all the new Vitess features that migrating to vindexes has unlocked – for example, re-sharding our data, rebalancing data across the shards, and using Vitess’s MoveTables operations to shard unsharded tables. Swapping out such critical and complex infrastructure with almost no downtime or impact on our users was an extremely challenging task, but it was also extremely gratifying.</p>
<h2>Acknowledgements</h2>
<p>This project was a shared accomplishment of the Data Access Platform team: Jessica Chen, Samantha Drago-Kramer, Hermes Garcia, Sam Kenny, David Leibovic, Kyle Leiby, Benjamin Mariscal, Juan Ortega, Adam Saponara, Wendy Sung, and Stephanie Wu. Thank you to everyone who contributed to the vindex project across engineering!</p>]]></description>
                <content:encoded><![CDATA[<p>Etsy has maintained a sharded MySQL architecture since around 2010. This database cluster contains most of Etsy’s online data and is made up of ~1,000 tables distributed across ~1,000 shards. Over the last 16 years, it has grown significantly: combined, these tables have over 425 TB of data and receive roughly 1.7 million requests per second. </p>
<p>Etsy engineers access our MySQL data through a proprietary object-relational mapping (ORM). The ORM has a corresponding model for each MySQL table. When a table is sharded, its rows are partitioned among different databases known as shards. Each shard has identical table schemas and contains a distinct subset of all rows. Sharded tables require a unique id field to be designated on the model as the “shardifier id” which determines the shard for each record. </p>
<p>Shardifier ids are designed to co-locate related data in order to minimize the number of databases we need to rely on for a given query. For example, all records related to a single shop or user are stored on the same shard. The majority of models use shop_id or user_id as the shardifier id, but there are over 30 different options in use. Prior to Vitess, the ORM stored mappings of records to shards in a single, unsharded “index” database. When a record was created, the shard mapping for its id was determined randomly by the ORM and then stored in a table on the index database so it could be retrieved again later. When a query was sent to a sharded table using the ORM, the ORM first queried the index database to retrieve the shard mapping, and then targeted the desired query to the right shard.</p>
<p>This shard architecture brought two major benefits: scalability and resilience against host outages. The impact of a database outage was greatly reduced compared to an unsharded database – we only lost 1/1000 of traffic to the distributed database if a shard went down. However, it also had some drawbacks:</p>
<ul>
<li>The process for horizontally scaling was manual, complex, and took months to complete. </li>
<li>We were dependent on a single point of failure – the index database. If the index database went down, it could cause a full site outage since the ORM would no longer be able to route queries to any shard.</li>
<li>Product developers were required to understand and manage sharding when accessing data, which could create friction since it was often confusing and toilsome.</li>
</ul>
<p>As Etsy and the index database grew, we saw an increase in incidents in which the index database was unavailable, and addressing these vulnerabilities became a high priority. <a href="https://vitess.io/docs/24.0/overview/whatisvitess/">Vitess</a>, open source software that provides an abstraction layer for scaling, deploying, and managing large MySQL clusters, offered a solution to these problems.</p>
<h2>Introducing Vitess</h2>
<p>In 2018, we added Vitess to our database architecture. Prior to this change, the ORM issued queries directly to the MySQL database for a specific shard. Now, each query was sent to Vitess, which would then query MySQL. At first, the ORM told Vitess which shard to query so we could validate that the new components worked in our environment before adding shard logic into Vitess. </p>
<p>With this new infrastructure in place, we were ready to start exploring <a href="https://vitess.io/docs/24.0/reference/features/vindexes/">vindexes</a>, which define sharding strategies within Vitess. Vindexes provide a way to calculate a shard given a column value akin to how we mapped shardifier ids to shard numbers in our index database. To familiarize ourselves with them, we <a href="https://www.etsy.com/codeascraft/search/Scaling%2520Etsy%2520Payments?ref=codeascraft">scaled several unsharded payments tables</a> by creating a new sharded database cluster that could use Vitess vindexes out-of-the-box. The success of that project paved the way for us to migrate our existing, in-house shard infrastructure to Vitess vindexes.</p>
<h2>Choosing a vindex</h2>
<p>Vitess provides a number of vindexes. We were most interested in those that calculated the shard using an algorithm like the <a href="https://vitess.io/docs/23.0/reference/features/vindexes/#predefined-vindexes">hash vindex</a>, which would remove the reliance on an additional data store for shard lookups. Since the ORM’s shard mappings are random and not algorithmic, using one of these out-of-the-box would require re-sharding all of our data – a process that would be manual and likely take years. Instead, we chose to write custom vindexes that ported our existing shard logic into Vitess so we could test how vindexes worked in our environment without the complexity and risk of moving data.</p>
<p>The first step of creating our custom vindexes was to modify the shard assignment algorithm in the ORM to match the Vitess hash vindex algorithm. After that change, no new shard mappings required an index database lookup, so we could store all of the existing mappings in a read-only database. We chose to use SQLite, which provided low latency reads and a small enough footprint to copy the database directly onto each Vitess server – avoiding the latency and dependency added by a call to an external database. Then we built a custom SQLite lookup vindex, which looks up shard information from a SQLite database.</p>
<p>Next, we built a custom hybrid vindex, which allows a table to apply one of two vindexes based on a threshold value – in this case, the SQLite vindex or the hash vindex. We set the threshold to the first id that was created after we changed the shard assignment algorithm. Any id that is above this value would use the hash vindex to look up its shard mapping, and any id that is below it would use the SQLite vindex.</p>
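<p>The routing decision made by a hybrid vindex can be illustrated with a short sketch. Everything named here is an assumption for the example: the threshold value, the <code>shard_map</code> table, and in particular <code>hash_shard</code>, which is a stand-in hash and not the Vitess hash vindex algorithm. Real vindexes also map column values to keyspace ids rather than directly to shard numbers, and custom vindexes are written in Go as part of Vitess; the sketch below only mirrors the branching logic.</p>

```python
import hashlib
import sqlite3

NUM_SHARDS = 1000     # illustrative; the post cites roughly 1,000 shards
THRESHOLD_ID = 5000   # hypothetical first id created after the algorithm change

def hash_shard(record_id):
    """Stand-in for an algorithmic vindex. NOT the real Vitess hash function."""
    digest = hashlib.sha256(str(record_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def build_lookup(mappings):
    """Local SQLite table holding the legacy, randomly assigned shard mappings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE shard_map (id INTEGER PRIMARY KEY, shard INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO shard_map VALUES (?, ?)", mappings)
    conn.commit()
    return conn

def hybrid_shard(conn, record_id):
    """Ids at or above the threshold are hashed; older ids use the stored mapping."""
    if record_id >= THRESHOLD_ID:
        return hash_shard(record_id)
    row = conn.execute(
        "SELECT shard FROM shard_map WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"no legacy mapping for id {record_id}")
    return row[0]

conn = build_lookup([(42, 7), (4999, 913)])
legacy = hybrid_shard(conn, 42)     # served from the SQLite lookup
modern = hybrid_shard(conn, 5000)   # computed, no lookup needed
```

<p>The sketch treats the threshold id itself as hashed, on the assumption that it was the first id assigned under the new algorithm.</p>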
<h2>Introducing vindexes into our environment</h2>
<p>We were now able to introduce Vitess vindexes into our existing architecture without moving any data. Before doing so, we needed to ensure that all queries currently in use with legacy ORM sharding would still succeed and return the same results with Vitess managed sharding. Some incompatibility was expected: Vitess requires the shardifier id to be present in the WHERE clause of a SQL query in order to route it to the correct shard, but this was not previously required by the ORM. In fact, since Etsy developers were required to provide this information separate from the SQL query in our legacy architecture, many queries did not include it in the WHERE clause. More detail around how we found and solved for query incompatibilities can be found in <a href="https://www.etsy.com/codeascraft/scaling-etsy-payments-with-vitess-part-3--reducing-cutover-risk">Scaling Etsy Payments with Vitess, Part 3</a>.</p>
<p>The ORM contained over a decade of queries with many different ways of accessing MySQL, which made auditing each and every query time consuming. Building enough context to test changes to a model was also challenging since each table varied greatly in the design and purpose of its data. Given this, we decided to take an incremental approach and introduce vindexes into our codebase one table at a time. This would allow us to test on a smaller scale, ensuring each type of data access pattern would succeed with Vitess by monitoring how vindexes performed with a specific set of queries. We could slowly and safely surface incompatibilities while we made progress on the migration, allowing room for error without risking large-scale outages.</p>
<p>Since we would need to repeat the migration for hundreds of tables, we prioritized creating a clear and repeatable process. <a href="https://www.etsy.com/codeascraft/understanding-the-collective-impact-of-experiments">Etsy’s experimentation framework</a> enabled us to gradually ramp up the change by incrementally increasing the percentage of traffic that was using Vitess vindexes for shard routing for each table. This enabled us to assess impact by comparing the performance of a query when it was using vindexes vs ORM shard routing. If we encountered any issues along the way, it was easy to ramp down to 0. Minimizing the risk of each model ramp by starting small and having a quick way to roll back helped mitigate the risk of not testing all queries upfront.</p>
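<p>A per-table percentage ramp of this kind is often implemented with deterministic bucketing. The sketch below is one common way to do it and is not a description of Etsy's experimentation framework; the function name, the use of a request id, and the hashing scheme are all assumptions.</p>

```python
import hashlib

def uses_vindex_routing(table, request_id, ramp_percent):
    """Place each (table, request) pair in a bucket from 0 to 99 and compare to the ramp.

    Hash-based bucketing is an illustrative choice: the same request always lands
    in the same bucket, so raising the ramp only ever adds traffic to the new path.
    """
    key = f"{table}:{request_id}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % 100
    return bucket < ramp_percent

# Hypothetical ramp schedule for one table: 1%, 10%, 50%, then 100%.
share_at_10 = sum(uses_vindex_routing("receipts", i, 10) for i in range(10_000))
```

<p>Setting the ramp to 0 sends every request back through the legacy path, which is what makes rolling back quick.</p>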
<h2>Working through challenges: database transactions</h2>
<p>Early on we ran into a challenge with this approach when handling database transactions. With our in-house shard routing, the ORM treated the shards as many separate databases and queried specific shards directly. With Vitess-managed sharding, the ORM queried the shards as if they were a single database, and they were only regarded as separate databases behind the scenes by Vitess. This meant that the two approaches created different connections to the database, which caused some issues for database transactions, since atomicity is only guaranteed per connection. To avoid data integrity issues that could come from breaking transaction guarantees, we required tables written to in a single database transaction to be ramped onto vindexes at the same time so they would use the same connection.</p>
<p>This was simple enough in theory, but ended up being pretty consequential. A small number of tables that represent the most complex and critical data models, like receipts, listings, and transactions, make up a large portion of database traffic to the sharded cluster. As we were preparing one of them to use vindexes, we realized that just 27 models (3% of tables) accounted for ⅓ of our database traffic – and they were all connected via database transactions. Despite our best efforts to minimize risk by ramping table-by-table, some of the riskiest changes were tightly coupled anyway. We collaborated across the company to ramp these 27 models in unison.
<img alt="Percentage of traffic using vindexes" src="https://i.etsystatic.com/inv/3f0fee/7870795065/inv_fullxfull.7870795065_hr9alodj.jpg?version=0" title="Percentage of traffic using vindexes" /></p>
<h2>Reaping the benefits: cross-shard queries</h2>
<p>One major advantage of using Vitess vindexes is the ability to query across shards. By default, any query that does not contain the shardifier id will "scatter" – Vitess will send it to all shards in parallel, sort the results, and return a single result set. This is a powerful tool – one model’s query time was reduced from ~2 seconds to ~20 ms by using cross-shard queries! However, at Etsy’s scale, unintentionally sending an expensive query to all 1,000 shards at once could quickly become an issue. To protect against these scenarios, we disabled scatter queries by default in our environment and provided a way to specify that a query is allowed to scatter via the ORM. </p>
<p>Utilizing scatter queries brought notable improvements to bulk primary key lookups in the 27 table ramp mentioned above. The ORM’s previous implementation took an array of primary keys, batched them by shard, issued a query to each shard, and returned the combined results. Since Vitess enables us to query multiple shards in a single query, we can skip the batching by shard and include all primary keys in a single query. This greatly reduced the number of queries issued for bulk lookups for some models.
<img alt="Number of bulk lookup queries for one model" src="https://i.etsystatic.com/inv/4da339/7822693132/inv_fullxfull.7822693132_7yjbrc8p.jpg?version=0" title="Number of bulk lookup queries for one model" /><em>Bulk lookup queries for one model during its ramp onto vindexes. Purple lines denote 1%, 10%, 50%, and 100% of traffic to the model using vindexes.</em></p>
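<p>The difference between the two bulk lookup strategies can be sketched as follows. The table and column names (<code>receipts</code>, <code>receipt_id</code>), the function names, and the modulo shard function are hypothetical; the sketch only builds the queries that each approach would issue and does not talk to a database.</p>

```python
from collections import defaultdict

def legacy_bulk_lookup(ids, shard_for):
    """Legacy approach: group ids by shard and build one query per shard."""
    by_shard = defaultdict(list)
    for record_id in ids:
        by_shard[shard_for(record_id)].append(record_id)
    queries = []
    for shard, batch in sorted(by_shard.items()):
        placeholders = ", ".join("?" for _ in batch)
        sql = f"SELECT * FROM receipts WHERE receipt_id IN ({placeholders})"
        queries.append((shard, sql, tuple(batch)))
    return queries

def single_bulk_lookup(ids):
    """Vitess-routed approach: one query carrying every id, with no shard target."""
    placeholders = ", ".join("?" for _ in ids)
    sql = f"SELECT * FROM receipts WHERE receipt_id IN ({placeholders})"
    return [(None, sql, tuple(ids))]

ids = [11, 12, 23, 35, 41]
legacy = legacy_bulk_lookup(ids, shard_for=lambda i: i % 4)  # toy shard function
single = single_bulk_lookup(ids)
```

<p>In this toy example the five ids fall on three shards, so the legacy path builds three queries where the Vitess-routed path builds one.</p>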
<h2>Conclusion</h2>
<p>Five years, approximately 2,500 pull requests and 6,000 queries later, we have successfully migrated Etsy’s shard management to Vitess vindexes! Despite the work we put in to streamline the migration process, it was still a challenge to replace the database infrastructure for a codebase of Etsy’s scale and age. As infrastructure engineers, my team usually had minimal context around the code we were changing and what might break if things went wrong. However, with coordination and careful testing across Etsy engineering, we were able to meet our goals:</p>
<ul>
<li>Scaling operations are no longer manual and can be performed in a matter of days rather than months.</li>
<li>We removed the index database as a single point of failure for our sharded cluster.</li>
<li>Shard infrastructure has been hidden from the developer’s view, which makes data modeling and query writing much simpler.</li>
<li>Database performance stayed largely the same as it was prior to the migration, rendering the change invisible to our end users.</li>
<li>We were able to progress incrementally, roll back quickly, and integrate Vitess into our shard architecture without performing a massive data migration.</li>
</ul>
<p>We are excited to use all the new Vitess features that migrating to vindexes has unlocked – for example, re-sharding our data, rebalancing data across the shards, and using Vitess’s MoveTables operations to shard unsharded tables. Swapping out such critical and complex infrastructure with almost no downtime or impact on our users was an extremely challenging task, but it was also extremely gratifying.</p>
<h2>Acknowledgements</h2>
<p>This project was a shared accomplishment of the Data Access Platform team: Jessica Chen, Samantha Drago-Kramer, Hermes Garcia, Sam Kenny, David Leibovic, Kyle Leiby, Benjamin Mariscal, Juan Ortega, Adam Saponara, Wendy Sung, and Stephanie Wu. Thank you to everyone who contributed to the vindex project across engineering!</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[How Etsy Uses LLMs to Improve Search Relevance]]></title>
                <link>https://www.etsy.com/codeascraft/how-etsy-uses-llms-to-improve-search-relevance?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Yuqing Zhang]]></dc:creator>
                <pubDate>Fri, 16 Jan 2026 15:02:28 -0500</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/how-etsy-uses-llms-to-improve-search-relevance?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>Ever searched for something specific, only to be met with results that are <em>close, but not quite</em>? On Etsy’s Search Relevance team, that frustration is exactly what we are tackling. Our goal is simple yet ambitious: to help buyers find exactly what they’re looking for, and to help sellers reach the people seeking their special products.</p>
<p>Search plays a central role in that mission. Historically, Etsy’s search models have relied heavily on engagement signals – such as clicks, add-to-carts, and purchases – as proxies for relevance. These signals are objective, but they can also be biased: popular listings get more clicks, even when they’re not the best match for a specific query.</p>
<p>To address this, we introduce semantic relevance as a complementary perspective to engagement, capturing how well a listing aligns with a buyer’s intent as expressed in their query. We developed a <strong>Semantic Relevance Evaluation and Enhancement Framework</strong>, powered by large language models (LLMs). It provides a comprehensive approach to measure and improve relevance through three key components:</p>
<ol>
<li><strong>High quality data</strong>: we first establish human-curated “golden” labels of relevance categories (we’ll come back to this) for precise evaluation of the relevance prediction models, complemented by data from a human-aligned LLM that scales training across millions of query-listing pairs</li>
<li><strong>Semantic relevance models</strong>: we use a family of ML models with different trade-offs in accuracy, latency, and cost, tuned for both offline evaluation and real-time search</li>
<li><strong>Model-driven applications</strong>: we integrate relevance signals directly into Etsy’s search systems, enabling both large-scale offline evaluation and real-time enhancement in production</li>
</ol>
<p>Together, this framework brings a more intent-aware search experience that better serves both buyers and sellers across our marketplace.</p>
<figure>
<img alt="Overview of the Semantic Relevance Evaluation and Enhancement Framework" src="https://i.etsystatic.com/inv/469ec5/7606034144/inv_fullxfull.7606034144_n8s6uql6.jpg?version=0" title="Overview of the Semantic Relevance Evaluation and Enhancement Framework" />
<figcaption>Figure 1. Overview of the Semantic Relevance Evaluation and Enhancement Framework</figcaption>
</figure>
<h1>Capturing Shades of Relevance</h1>
<p>Let’s return to the idea of <em>relevance categories</em>. Based on user research, we define three categories for semantic relevance of query-listing pairs:</p>
<ul>
<li><em>Relevant</em>: listing matches all parts of the query, accounting for meaning and proper nouns</li>
<li><em>Partially relevant</em>: listing matches part of the query or is thematically related but not a full match</li>
<li><em>Irrelevant</em>: listing has no meaningful connection to the query; its presence in top results would make the search feel broken</li>
</ul>
<figure>
<img alt="Examples for the three relevance categories" src="https://i.etsystatic.com/inv/90c66d/7606038370/inv_fullxfull.7606038370_abrfgw7f.jpg?version=0" title="Examples for the three relevance categories" />
<figcaption>Figure 2. Examples for the three relevance categories. Text highlighted in green shows how the product aligns with the search query, whereas red highlights indicate mismatches.*</figcaption>
</figure>
<p>In an ideal world, we’d rely on human judgments for all query-listing pairs. But large-scale human annotation is time-consuming and expensive, rendering it infeasible. Instead, language models unlock the ability to generate these judgments at scale, transforming our ability to make every search on Etsy produce more relevant results.</p>
<h1>Data: Anchored by Humans, Scaled by LLMs</h1>
<p>With recent advances in LLMs, a promising approach to evaluate search relevance is to use <a href="https://en.wikipedia.org/wiki/LLM-as-a-Judge">LLM-as-a-judge</a>: directly using LLMs to judge the relevance of our search system without looping in humans. However, this approach faces two main challenges: </p>
<ol>
<li><em>Domain shift</em>: off-the-shelf LLMs may not capture the unique preferences and vocabulary of Etsy users</li>
<li><em>Performance-cost tradeoff</em>: larger LLMs offer stronger reasoning but are expensive for large-scale inference, while smaller LLMs are faster and cheaper, but less accurate</li>
</ol>
<p>To address these challenges, we start with human-curated golden labels, which we use to evaluate a powerful LLM and align it with human judgment. We then use that LLM to scale up a full dataset for training our relevance judge. In other words, <strong>humans define what good looks like, and LLMs help us scale it</strong>. LLMs do not replace human judgment; they align with and amplify it.</p>
<p>We maintain a detailed, evolving relevance labeling guideline, continuously refined through user research and annotation feedback. What relevance means in our marketplace shifts over time and social context. For example, people searching for “face masks” pre-2020 were primarily looking for masks for costumes or fashion, which is a completely different intent from protective masks post-2020. These guidelines ensure our definitions of relevance accurately reflect Etsy users’ intent and capture cultural trends over time. </p>
<p>Query-listing pairs are sampled from search logs using a mix of approaches, combining random, stratified sampling for broad coverage with targeted sampling for challenging cases. Each query-listing pair is labeled by two Etsy admins, with an ongoing review process to break ties and adjust the labeling guidelines accordingly. For quality control, we continuously track metrics such as the row-level disagreement rate, which measures how often annotators disagree with each other on the same query-listing pair.</p>
<p>To scale beyond manual annotation, we introduced a few-shot, chain-of-thought (CoT) prompting strategy using the <a href="https://openai.com/index/introducing-o3-and-o4-mini/">o3 model</a>, implemented in LangGraph. The prompt instruction is inspired by the annotation guidelines described above, and includes comprehensive query and listing features, like title, images, text description, attributes, variations, and extracted entities (read more about listing extracted entities in <a href="https://www.etsy.com/codeascraft/understanding-etsyas-vast-inventory-with-llms">another one of our posts</a>). We also applied <a href="https://www.promptingguide.ai/techniques/consistency">self-consistency sampling</a> to improve reliability.</p>
<p>This model, known as the LLM annotator (as seen in Figure 1), is first validated against the human-labeled golden data to ensure its judgment aligns with that of human annotators. Once validated, we use it to generate large-scale training data to develop the production models. The LLM annotator thus serves as the foundation for our teacher-student modeling pipeline, bridging the gap between expensive manual labeling and scalable automated annotation.</p>
<h1>Models: Balancing Accuracy, Latency and Cost</h1>
<p>Our modeling pipeline uses a <strong>three-tier cascaded distillation design</strong>, where each model balances accuracy and efficiency for a specific purpose. The stack includes:</p>
<ul>
<li><strong>The LLM annotator</strong>: our most accurate and cost-intensive model, aligned closely with human-labeled golden data</li>
<li><strong>The teacher model</strong>: a fine-tuned smaller LLM (Qwen 3 VL 4B) that delivers high-throughput annotation at scale</li>
<li><strong>The student model</strong>: a lightweight, BERT-based two-tower model optimized for real-time inference</li>
</ul>
<p>The <strong>LLM annotator</strong> aligns best with the golden labels, but is too costly for recurrent, large-scale inference. To reduce cost while maintaining quality, we performed supervised fine-tuning (SFT) with a smaller LLM, Qwen 3 VL 4B, using the training data generated by the LLM annotator. This <strong>teacher model</strong> preserves human alignment while enabling us to label millions of query-listing pairs daily, which is ideal for recurring evaluation and monitoring.</p>
<p>The teacher, however, is too slow to surface relevant search results quickly, which is critical for helping our sellers reach potential buyers. As such, we further distilled the teacher into a <strong>student model</strong> with a two-tower architecture. The distillation process aligns the student’s output with that of the teacher, so that the student predicts relevance labels nearly as accurately as the teacher, while being lightweight and fast. The resulting model lets us deliver search results almost as fast as before, adding less than 10 ms of latency.</p>
<p>All three models – the LLM annotator, teacher, and student – are evaluated against the same golden dataset to ensure traceable performance and consistent alignment with human judgment. Figure 3 shows their accuracy measured using <a href="https://towardsdatascience.com/micro-macro-weighted-averages-of-f1-score-clearly-explained-b603420b292f/">multi-class Macro F1</a>, and individual class F1 scores.</p>
<figure>
<img alt="Performance of semantic relevance models against human golden labels" src="https://i.etsystatic.com/inv/7494b9/7653990743/inv_fullxfull.7653990743_q8oe1r83.jpg?version=0" title="Performance of semantic relevance models against human golden labels" />
<figcaption>Figure 3. Performance of semantic relevance models against human golden labels</figcaption>
</figure>
<h1>Applications: From Evaluation to Action</h1>
<p>With these models in place, we can both measure and enhance search relevance across Etsy.</p>
<h2>Search relevance evaluation</h2>
<p>We use the <strong>teacher model</strong> to measure how well our search system surfaces relevant listings. Each day, we sample search requests and perform offline inference using the teacher model, then aggregate the predicted relevance labels into summary metrics. These metrics are reviewed regularly by our team, and if we observe unexpected trends like a sudden decline of relevance, we work to quickly diagnose and address the problem. </p>
<p>Similarly, we monitor relevance metrics in A/B tests. The computed relevance metrics are discussed when we decide whether to roll out a new change to our search system, to ensure those changes affect semantic relevance of search results in a neutral to positive way. We sample a sufficient number of requests separately from the control and treatment variants to ensure statistical power. Using <a href="https://docs.vllm.ai/en/stable/">vLLM</a> for high-throughput inference, we process millions of query-listing pairs daily at a very low cost, maintaining both statistical power and operational efficiency.</p>
<h2>Improving search in production</h2>
<p>The lightweight <strong>student model</strong> is embedded directly into Etsy’s real-time search stack. It improves relevance through several integration points:</p>
<ol>
<li><em>Filtering</em>: removes retrieved listings predicted as irrelevant before downstream ranking</li>
<li><em>Feature enrichment</em>: contributes model-predicted relevance scores as features for the downstream ranking model</li>
<li><em>Loss weighting</em>: adjusts training weights of the ranking model based on predicted relevance</li>
<li><em>Relevance boosting</em>: promotes listings deemed highly relevant using heuristic rules among the final returned search results </li>
</ol>
<h1>How Semantic Relevance is Changing Etsy Search</h1>
<p>The Semantic Relevance Evaluation and Enhancement Framework is fully deployed in Etsy’s search stack, and continues to evolve. We’ve observed a measurable uplift in semantic relevance: the percentage of fully relevant listings (as defined by the relevance categories described earlier) has increased from 58% to 62% between August and October 2025.</p>
<figure>
<img alt="Improvement of semantic relevance metrics over time" src="https://i.etsystatic.com/inv/cebae8/7653994975/inv_fullxfull.7653994975_9yv33rfk.jpg?version=0" title="Improvement of semantic relevance metrics over time" />
<figcaption>Figure 4. Improvement of semantic relevance metrics over time</figcaption>
</figure>
<p>This improvement reflects Etsy’s growing ability to align search results with buyer intent. For instance, in searches like “fall decor,” the enhanced search engine now focuses on surfacing seasonal decor items, while deprioritizing loosely related listings like clothing, which used to appear before the relevance enhancements.</p>
<figure>
<img alt="Before and after comparison when searching for “fall decor”" src="https://i.etsystatic.com/inv/259d22/7654001251/inv_fullxfull.7654001251_t76ovmxr.jpg?version=0" title="Before and after comparison when searching for “fall decor”" />
<figcaption>Figure 5. Before and after comparison when searching for “fall decor” *</figcaption>
</figure>
<p>Beyond these immediate gains, semantic relevance has shifted how we evaluate and improve search at Etsy toward a more user-centered approach. By grounding our evaluation in semantic intent in addition to behavioral signals, we move closer to our goal of connecting buyers with the relevant products, not just the most popular ones. On the seller side, improving semantic relevance can also help surface items from small or new sellers who may not yet have the visibility of more established shops, although search results are influenced by multiple factors and outcomes may vary.</p>
<h1>What’s Next</h1>
<p>In ongoing and future efforts, we hope to explore the following directions:</p>
<ul>
<li><em>Better understanding of relevance-engagement dynamics.</em> In online experiments, we often observe engagement metrics decline even as semantic relevance improves (a pattern also <a href="https://arxiv.org/pdf/2409.17460v1">noted by other e-commerce platforms</a>). We suspect this results from applying uniform relevance treatments despite contextual variation. Next, we plan to explore adaptive strategies that tailor adjustments by query type.</li>
<li><em>Refining partial relevance.</em> Inspired by Amazon’s <a href="https://github.com/amazon-science/esci-data">ESCI</a> framework, we’re exploring finer-grained labels, for example, introducing new subcategories of complements and substitutes. This could potentially improve evaluation precision and power new user search experiences.</li>
<li><em>Reducing annotation effort through LLM facilitation.</em> When LLM judgments are self-consistent, they align better with human labels. This may indicate easier query-listing pairs. We are exploring using LLMs for these easy cases, focusing human effort on more complex cases.</li>
<li><em>Simplifying the multi-stage model stack</em>. Our current three-tier distillation pipeline provides flexibility but adds operational complexity. We plan to simplify this setup by exploring better performance-efficiency tradeoffs and potentially merging model tiers.</li>
<li><em>Improving relevance in retrieval.</em> So far, post-retrieval filtering is the first stage where our semantic relevance model applies. We see strong potential to enhance both inference and measurement further upstream in the retrieval layer.</li>
</ul>
<h1>Conclusion</h1>
<p>Key takeaways:</p>
<ul>
<li><strong>LLMs can meaningfully evaluate search relevance when grounded in human judgment.</strong> Aligning LLM assessments with human-labeled data ensures we measure, and continually improve, the search experience that is so essential to connecting buyers and sellers on Etsy.</li>
<li><strong>Semantic relevance redefines how Etsy optimizes search.</strong> By complementing engagement metrics with semantic relevance, we address real customer pain points and deliver more satisfying search experiences.</li>
<li><strong>Teacher-student distillation</strong> offers a flexible and efficient way to apply relevance modeling across diverse performance, latency and cost requirements.</li>
</ul>
<p>Ultimately, improving semantic relevance strengthens the human connections that define Etsy. By understanding what shoppers truly mean, we can help them find the right items. And by emphasizing relevant listings over popular ones, we can help create fairer opportunities for our sellers in the relevance component of search visibility – <a href="https://storage.googleapis.com/etsy-extfiles-prod/2025_Q1_GLOBAL_Seller%20Census%202024.pdf?ref=news">89% of whom</a> are businesses of one.</p>
<h1>Acknowledgments</h1>
<p>This work is a collaborative effort by the Search Relevance team, enabled by the ML Enablement and Merchandising teams. Thanks to the following contributors:</p>
<ul>
<li>Data: Susan Liu, Jugal Gala, David Blincoe, Yuqing Zhang, Taylor Hunt, Liz Mikolaj</li>
<li>Models: David Blincoe, Oriane Cavrois, Orson Adams, Yuqing Zhang</li>
<li>Application: Grant Sherrick, Kaushik Bekal, Haoming Chen, Patrick Callier, Davis Kim, Marcus Daly</li>
<li>Product leadership: Julia Zhou, Willy Huang, Argie Angeleas</li>
<li>Engineering leadership: Yinlin Fu, Congzhe Su, Xiaoting Zhao</li>
<li>ML Enablement partners: Ari Carter, Stan Schwertly, Shreya Agarwal, K Ogilvie, Marvin Wang, etc.</li>
<li>Other cross-team partners: Will Beckman, Karl Yokono, Audrey Chen, Heather Campbell, David Le, Khadeeja Din, etc.</li>
<li>Early contributors: Ethan Benjamin, Cung Tran, Maggie Matsui, Jack Gammack, Yogeeta Chatoredussy, Austin Clapp, Benjamin Russell, Khaled Jabr</li>
</ul>
<p>Special thanks to Oriane Cavrois &amp; David Blincoe for helping this piece come to life.</p>
<p><em>* Images are provided for illustrative purposes. Item availability on Etsy may vary.</em></p>]]></description>
                <content:encoded><![CDATA[<p>Ever searched for something specific, only to be met with results that are <em>close, but not quite</em>? On Etsy’s Search Relevance team, that frustration is exactly what we are tackling. Our goal is simple yet ambitious: to help buyers find exactly what they’re looking for, and to help sellers reach the people seeking their special products.</p>
<p>Search plays a central role in that mission. Historically, Etsy’s search models have relied heavily on engagement signals – such as clicks, add-to-carts, and purchases – as proxies for relevance. These signals are objective, but they can also be biased: popular listings get more clicks, even when they’re not the best match for a specific query.</p>
<p>To address this, we introduce semantic relevance as a complementary perspective to engagement, capturing how well a listing aligns with a buyer’s intent as expressed in their query. We developed a <strong>Semantic Relevance Evaluation and Enhancement Framework</strong>, powered by large language models (LLMs). It provides a comprehensive approach to measure and improve relevance through three key components:</p>
<ol>
<li><strong>High quality data</strong>: we first establish human-curated “golden” labels of relevance categories (we’ll come back to this) for precise evaluation of the relevance prediction models, complemented by data from a human-aligned LLM that scales training across millions of query-listing pairs</li>
<li><strong>Semantic relevance models</strong>: we use a family of ML models with different trade-offs in accuracy, latency, and cost, tuned for both offline evaluation and real-time search</li>
<li><strong>Model-driven applications</strong>: we integrate relevance signals directly into Etsy’s search systems, enabling both large-scale offline evaluation and real-time enhancement in production</li>
</ol>
<p>Together, this framework brings a more intent-aware search experience that better serves both buyers and sellers across our marketplace.</p>
<figure>
<img alt="Overview of the Semantic Relevance Evaluation and Enhancement Framework" src="https://i.etsystatic.com/inv/469ec5/7606034144/inv_fullxfull.7606034144_n8s6uql6.jpg?version=0" title="Overview of the Semantic Relevance Evaluation and Enhancement Framework" />
<figcaption>Figure 1. Overview of the Semantic Relevance Evaluation and Enhancement Framework</figcaption>
</figure>
<h1>Capturing Shades of Relevance</h1>
<p>Let’s return to the idea of <em>relevance categories</em>. Based on user research, we define three categories for semantic relevance of query-listing pairs:</p>
<ul>
<li><em>Relevant</em>: listing matches all parts of the query, accounting for meaning and proper nouns</li>
<li><em>Partially relevant</em>: listing matches part of the query or is thematically related but not a full match</li>
<li><em>Irrelevant</em>: listing has no meaningful connection to the query; its presence in top results would make the search feel broken</li>
</ul>
<figure>
<img alt="Examples for the three relevance categories" src="https://i.etsystatic.com/inv/90c66d/7606038370/inv_fullxfull.7606038370_abrfgw7f.jpg?version=0" title="Examples for the three relevance categories" />
<figcaption>Figure 2. Examples for the three relevance categories. Text highlighted in green shows how the product aligns with the search query, whereas red highlights indicate mismatches.*</figcaption>
</figure>
<p>In an ideal world, we’d rely on human judgments for all query-listing pairs. But large-scale human annotation is time-consuming and expensive, rendering it infeasible. Instead, language models unlock the ability to generate these judgments at scale, transforming our ability to make every search on Etsy produce more relevant results.</p>
<h1>Data: Anchored by Humans, Scaled by LLMs</h1>
<p>With recent advances in LLMs, a promising approach to evaluate search relevance is to use <a href="https://en.wikipedia.org/wiki/LLM-as-a-Judge">LLM-as-a-judge</a>: directly using LLMs to judge the relevance of our search system without looping in humans. However, this approach faces two main challenges: </p>
<ol>
<li><em>Domain shift</em>: off-the-shelf LLMs may not capture the unique preferences and vocabulary of Etsy users</li>
<li><em>Performance-cost tradeoff</em>: larger LLMs offer stronger reasoning but are expensive for large-scale inference, while smaller LLMs are faster and cheaper, but less accurate</li>
</ol>
<p>To address these challenges, we start with human-curated golden labels, which we use to evaluate a powerful LLM and align it with human judgment. We then use that LLM to scale up a full dataset for training our relevance judge. In other words, <strong>humans define what good looks like, and LLMs help us scale it</strong>. LLMs do not replace human judgment; they align with and amplify it.</p>
<p>We maintain a detailed, evolving relevance labeling guideline, continuously refined through user research and annotation feedback. What relevance means in our marketplace shifts over time and social context. For example, people searching for “face masks” pre-2020 were primarily looking for masks for costumes or fashion, which is a completely different intent from protective masks post-2020. These guidelines ensure our definitions of relevance accurately reflect Etsy users’ intent and capture cultural trends over time. </p>
<p>Query-listing pairs are sampled from search logs using a mix of approaches, combining random, stratified sampling for broad coverage with targeted sampling for challenging cases. Each query-listing pair is labeled by two Etsy admins, with an ongoing review process to break ties and adjust the labeling guidelines accordingly. For quality control, we continuously track metrics such as the row-level disagreement rate, which measures how often annotators disagree with each other on the same query-listing pair.</p>
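<p>As a rough illustration of the quality-control metric described above, here is a minimal sketch of a row-level disagreement rate. The function name, label strings, and sample rows are hypothetical and are not taken from Etsy's tooling.</p>

```python
from typing import Sequence


def row_level_disagreement_rate(rows: Sequence[Sequence[str]]) -> float:
    """Fraction of query-listing pairs whose annotators did not all agree."""
    if not rows:
        raise ValueError("need at least one labeled query-listing pair")
    # A row counts as a disagreement when it contains more than one distinct label.
    disagreements = sum(1 for labels in rows if len(set(labels)) > 1)
    return disagreements / len(rows)


# Hypothetical labels from two annotators per query-listing pair.
example_rows = [
    ("relevant", "relevant"),
    ("partial", "irrelevant"),
    ("irrelevant", "irrelevant"),
    ("relevant", "partial"),
]
rate = row_level_disagreement_rate(example_rows)
```

<p>Tracking this value over time, or per query segment, is one way to spot where a labeling guideline may need refinement.</p>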
<p>To scale beyond manual annotation, we introduced a few-shot, chain-of-thought (CoT) prompting strategy using the <a href="https://openai.com/index/introducing-o3-and-o4-mini/">o3 model</a>, implemented in LangGraph. The prompt instruction is inspired by the annotation guidelines described above, and includes comprehensive query and listing features, like title, images, text description, attributes, variations, and extracted entities (read more about listing extracted entities in <a href="https://www.etsy.com/codeascraft/understanding-etsyas-vast-inventory-with-llms">another one of our posts</a>). We also applied <a href="https://www.promptingguide.ai/techniques/consistency">self-consistency sampling</a> to improve reliability.</p>
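<p>Self-consistency sampling generally means drawing several independent judgments for the same input and keeping the most common answer. The sketch below shows only that aggregation step. It assumes the sampled labels have already been collected from the model, and the function name and label strings are illustrative rather than part of the actual prompt pipeline.</p>

```python
from collections import Counter
from typing import Sequence, Tuple


def self_consistent_label(samples: Sequence[str]) -> Tuple[str, float]:
    """Return the majority label and the share of samples that agree with it."""
    if not samples:
        raise ValueError("need at least one sampled judgment")
    counts = Counter(samples)
    # most_common keeps first-seen order among ties, so ties resolve to the
    # label that was sampled first; a real pipeline may prefer another rule.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(samples)


# Hypothetical labels from five sampled chain-of-thought runs on one pair.
sampled = ["relevant", "partial", "relevant", "relevant", "irrelevant"]
label, agreement = self_consistent_label(sampled)
```

<p>The agreement share can double as a rough confidence signal: low agreement suggests a harder query-listing pair.</p>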
<p>This model, known as the LLM annotator (as seen in Figure 1), is first validated against the human-labeled golden data to ensure its judgment aligns with that of human annotators. Once validated, we use it to generate large-scale training data to develop the production models. The LLM annotator thus serves as the foundation for our teacher-student modeling pipeline, bridging the gap between expensive manual labeling and scalable automated annotation.</p>
<h1>Models: Balancing Accuracy, Latency and Cost</h1>
<p>Our modeling pipeline uses a <strong>three-tier cascaded distillation design</strong>, where each model balances accuracy and efficiency for a specific purpose. The stack includes:</p>
<ul>
<li><strong>The LLM annotator</strong>: our most accurate and cost-intensive model, aligned closely with human-labeled golden data</li>
<li><strong>The teacher model</strong>: a fine-tuned smaller LLM (Qwen 3 VL 4B) that delivers high-throughput annotation at scale</li>
<li><strong>The student model</strong>: a lightweight, BERT-based two-tower model optimized for real-time inference</li>
</ul>
<p>The <strong>LLM annotator</strong> aligns best with the golden labels, but is too costly for recurrent, large-scale inference. To reduce cost while maintaining quality, we performed supervised fine-tuning (SFT) with a smaller LLM, Qwen 3 VL 4B, using the training data generated by the LLM annotator. This <strong>teacher model</strong> preserves human alignment while enabling us to label millions of query-listing pairs daily, which is ideal for recurring evaluation and monitoring.</p>
<p>The teacher, however, is too slow to surface relevant search results quickly, which is critical for helping our sellers reach potential buyers. As such, we further distilled the teacher into a <strong>student model</strong> with a two-tower architecture. The distillation process aligns the student’s output with that of the teacher, so that the student predicts relevance labels nearly as accurately as the teacher, while being lightweight and fast. The resulting model lets us deliver search results almost as fast as before, adding less than 10 ms of latency.</p>
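<p>The post does not specify the distillation objective. One common choice, shown here purely as an assumption, is to minimize the cross-entropy between the teacher's soft label distribution and the student's predicted distribution. The function names and the example numbers are hypothetical.</p>

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a small list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: Sequence[float],
                      teacher_probs: Sequence[float]) -> float:
    """Cross-entropy of the student's distribution against teacher soft labels."""
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0)


# Hypothetical teacher distribution over (relevant, partial, irrelevant).
teacher = [0.80, 0.15, 0.05]
loss_aligned = distillation_loss([2.0, 0.3, -0.8], teacher)
loss_misaligned = distillation_loss([-0.8, 0.3, 2.0], teacher)
```

<p>A student whose logits favor the same class as the teacher gets the lower loss, which is the behavior the training process would push toward.</p>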
<p>All three models – the LLM annotator, teacher, and student – are evaluated against the same golden dataset to ensure traceable performance and consistent alignment with human judgment. Figure 3 shows their accuracy measured using <a href="https://towardsdatascience.com/micro-macro-weighted-averages-of-f1-score-clearly-explained-b603420b292f/">multi-class Macro F1</a>, and individual class F1 scores.</p>
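<p>For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so each relevance category counts equally regardless of how often it occurs. The sketch below computes both from scratch. The label strings and the toy gold and predicted labels are made up for illustration and are not Etsy's evaluation data.</p>

```python
from typing import Dict, Sequence

LABELS = ("relevant", "partial", "irrelevant")


def per_class_f1(gold: Sequence[str], pred: Sequence[str],
                 labels: Sequence[str] = LABELS) -> Dict[str, float]:
    """F1 per class, using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted examples scores 0 by convention here.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred, labels)
    return sum(scores.values()) / len(labels)


# Toy golden labels and model predictions for six query-listing pairs.
gold = ["relevant", "relevant", "partial", "partial", "irrelevant", "irrelevant"]
pred = ["relevant", "partial", "partial", "partial", "irrelevant", "relevant"]
class_scores = per_class_f1(gold, pred)
overall = macro_f1(gold, pred)
```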
<figure>
<img alt="Performance of semantic relevance models against human golden labels" src="https://i.etsystatic.com/inv/7494b9/7653990743/inv_fullxfull.7653990743_q8oe1r83.jpg?version=0" title="Performance of semantic relevance models against human golden labels" />
<figcaption>Figure 3. Performance of semantic relevance models against human golden labels</figcaption>
</figure>
<h1>Applications: From Evaluation to Action</h1>
<p>With these models in place, we can both measure and enhance search relevance across Etsy.</p>
<h2>Search relevance evaluation</h2>
<p>We use the <strong>teacher model</strong> to measure how well our search system surfaces relevant listings. Each day, we sample search requests and perform offline inference using the teacher model, then aggregate the predicted relevance labels into summary metrics. These metrics are reviewed regularly by our team, and if we observe unexpected trends like a sudden decline of relevance, we work to quickly diagnose and address the problem. </p>
<p>Similarly, we monitor relevance metrics in A/B tests. The computed relevance metrics are discussed when we decide whether to roll out a new change to our search system, to ensure those changes affect semantic relevance of search results in a neutral to positive way. We sample a sufficient number of requests separately from the control and treatment variants to ensure statistical power. Using <a href="https://docs.vllm.ai/en/stable/">vLLM</a> for high-throughput inference, we process millions of query-listing pairs daily at a very low cost, maintaining both statistical power and operational efficiency.</p>
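<p>The post does not say which statistical test is applied to these A/B relevance metrics. As one plausible illustration, the sketch below compares the share of listings labeled relevant in control and treatment with a pooled two-proportion z-test. The function name and all counts are hypothetical.</p>

```python
import math


def two_proportion_z(relevant_a: int, total_a: int,
                     relevant_b: int, total_b: int) -> float:
    """Pooled z statistic for the difference in relevant-rate (B minus A)."""
    if total_a == 0 or total_b == 0:
        raise ValueError("both variants need at least one sampled pair")
    p_a = relevant_a / total_a
    p_b = relevant_b / total_b
    pooled = (relevant_a + relevant_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("no variance: every pair has the same label")
    return (p_b - p_a) / std_err


# Hypothetical counts of pairs labeled "relevant" in each variant.
z = two_proportion_z(relevant_a=590, total_a=1000,
                     relevant_b=640, total_b=1000)
# 1.96 is the usual two-sided 5% critical value for a z statistic.
significant = abs(z) > 1.96
```

<p>Because the standard error shrinks as the number of sampled pairs grows, larger samples make smaller relevance differences detectable, which is why sample size matters for statistical power.</p>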
<h2>Improving search in production</h2>
<p>The lightweight <strong>student model</strong> is embedded directly into Etsy’s real-time search stack. It improves relevance through several integration points:</p>
<ol>
<li><em>Filtering</em>: removes retrieved listings predicted as irrelevant before downstream ranking</li>
<li><em>Feature enrichment</em>: contributes model-predicted relevance scores as features for the downstream ranking model</li>
<li><em>Loss weighting</em>: adjusts training weights of the ranking model based on predicted relevance</li>
<li><em>Relevance boosting</em>: promotes listings deemed highly relevant using heuristic rules among the final returned search results </li>
</ol>
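<p>To make the serving-time integration points more concrete, here is a simplified sketch of filtering, feature enrichment, and relevance boosting applied to a list of retrieved candidates. Everything in it is an assumption made for illustration: the <code>Candidate</code> fields, the thresholds, and the boosting rule are hypothetical and do not describe Etsy's production logic. Loss weighting is omitted because it applies when training the ranker rather than at serving time.</p>

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    listing_id: int
    p_relevant: float    # student-predicted probability of "relevant"
    p_irrelevant: float  # student-predicted probability of "irrelevant"
    rank_score: float    # score from the downstream ranking model


def filter_irrelevant(candidates: List[Candidate],
                      max_p_irrelevant: float = 0.5) -> List[Candidate]:
    """Filtering: drop candidates the student considers likely irrelevant."""
    return [c for c in candidates if max_p_irrelevant >= c.p_irrelevant]


def relevance_features(c: Candidate) -> Dict[str, float]:
    """Feature enrichment: expose student scores as ranker input features."""
    return {"p_relevant": c.p_relevant, "p_irrelevant": c.p_irrelevant}


def boost_relevant(ranked: List[Candidate],
                   min_p_relevant: float = 0.9) -> List[Candidate]:
    """Relevance boosting: move highly relevant items ahead, keeping order otherwise."""
    # sorted() is stable and False sorts before True, so highly relevant
    # candidates come first while each group keeps its ranker order.
    return sorted(ranked, key=lambda c: min_p_relevant > c.p_relevant)


retrieved = [
    Candidate(listing_id=1, p_relevant=0.95, p_irrelevant=0.01, rank_score=0.4),
    Candidate(listing_id=2, p_relevant=0.10, p_irrelevant=0.85, rank_score=0.9),
    Candidate(listing_id=3, p_relevant=0.60, p_irrelevant=0.10, rank_score=0.7),
]
kept = filter_irrelevant(retrieved)
ranked = sorted(kept, key=lambda c: c.rank_score, reverse=True)
final_ids = [c.listing_id for c in boost_relevant(ranked)]
```

<p>In this toy run, listing 2 is filtered out despite its high ranker score, and listing 1 is boosted ahead of listing 3 because the student rates it as highly relevant.</p>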
<h1>How Semantic Relevance is Changing Etsy Search</h1>
<p>The Semantic Relevance Evaluation and Enhancement Framework is fully deployed in Etsy’s search stack, and continues to evolve. We’ve observed a measurable uplift in semantic relevance: the percentage of fully relevant listings (as defined by the relevance categories described earlier) has increased from 58% to 62% between August and October 2025.</p>
<figure>
<img alt="Improvement of semantic relevance metrics over time" src="https://i.etsystatic.com/inv/cebae8/7653994975/inv_fullxfull.7653994975_9yv33rfk.jpg?version=0" title="Improvement of semantic relevance metrics over time" />
<figcaption>Figure 4. Improvement of semantic relevance metrics over time</figcaption>
</figure>
<p>This improvement reflects Etsy’s growing ability to align search results with buyer intent. For instance, in searches like “fall decor,” the enhanced search engine now focuses on surfacing seasonal decor items, while deprioritizing loosely related listings like clothing, which used to appear before the relevance enhancements.</p>
<figure>
<img alt="Before and after comparison when searching for “fall decor”" src="https://i.etsystatic.com/inv/259d22/7654001251/inv_fullxfull.7654001251_t76ovmxr.jpg?version=0" title="Before and after comparison when searching for “fall decor”" />
<figcaption>Figure 5. Before and after comparison when searching for “fall decor” *</figcaption>
</figure>
<p>Beyond these immediate gains, semantic relevance has shifted how we evaluate and improve search at Etsy toward a more user-centered approach. By grounding our evaluation in semantic intent in addition to behavioral signals, we move closer to our goal of connecting buyers with the relevant products, not just the most popular ones. On the seller side, improving semantic relevance can also help surface items from small or new sellers who may not yet have the visibility of more established shops, although search results are influenced by multiple factors and outcomes may vary.</p>
<h1>What’s Next</h1>
<p>In ongoing and future efforts, we hope to explore the following directions:</p>
<ul>
<li><em>Better understanding of relevance-engagement dynamics.</em> In online experiments, we often observe engagement metrics decline even as semantic relevance improves (a pattern also <a href="https://arxiv.org/pdf/2409.17460v1">noted by other e-commerce platforms</a>). We suspect this results from applying uniform relevance treatments despite contextual variation. Next, we plan to explore adaptive strategies that tailor adjustments by query type.</li>
<li><em>Refining partial relevance.</em> Inspired by Amazon’s <a href="https://github.com/amazon-science/esci-data">ESCI</a> framework, we’re exploring finer-grained labels, for example, introducing new subcategories of complements and substitutes. This could potentially improve evaluation precision and power new user search experiences.</li>
<li><em>Reducing annotation effort through LLM facilitation.</em> When LLM judgments are self-consistent, they align better with human labels. This may indicate easier query-listing pairs. We are exploring using LLMs for these easy cases, focusing human effort on more complex cases.</li>
<li><em>Simplifying the multi-stage model stack.</em> Our current three-tier distillation pipeline provides flexibility but adds operational complexity. We plan to simplify this setup by exploring better performance-efficiency tradeoffs and potentially merging model tiers.</li>
<li><em>Improving relevance in retrieval.</em> So far, post-retrieval filtering is the first stage where our semantic relevance model applies. We see strong potential to enhance both inference and measurement further upstream in the retrieval layer.</li>
</ul>
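<p>As a rough illustration of the LLM-facilitation idea in the list above, the sketch below routes a query-listing pair to the LLM or to human annotators based on how consistently repeated LLM runs agree. The function name, the label strings, and the agreement threshold are all hypothetical; this is a minimal sketch of self-consistency routing in general, not a description of Etsy's annotation pipeline.</p>

```python
from collections import Counter

def route_pair(llm_labels, min_agreement=1.0):
    """Decide who labels a query-listing pair from repeated LLM judgments.

    llm_labels: labels from several independent LLM runs on the same pair.
    min_agreement: hypothetical threshold on the share of runs that must agree.
    """
    top_label, top_count = Counter(llm_labels).most_common(1)[0]
    agreement = top_count / len(llm_labels)
    if agreement >= min_agreement:
        return ("llm", top_label)   # self-consistent: accept the LLM label
    return ("human", None)          # inconsistent: send to human annotators

# Hypothetical label strings, used only for illustration.
easy = route_pair(["relevant", "relevant", "relevant"])
hard = route_pair(["relevant", "partial", "irrelevant"])
print(easy, hard)
```

<p>Lowering the threshold would send more pairs to the LLM at the cost of accepting less consistent judgments, so any real value would need to be validated against human labels.</p>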
<h1>Conclusion</h1>
<p>Key takeaways:</p>
<ul>
<li><strong>LLMs can meaningfully evaluate search relevance when grounded in human judgment.</strong> Aligning LLM assessments with human-labeled data ensures we measure, and continually improve, the search experience that is so essential to connecting buyers and sellers on Etsy.</li>
<li><strong>Semantic relevance redefines how Etsy optimizes search.</strong> By complementing engagement metrics with semantic relevance, we address real customer pain points and deliver more satisfying search experiences.</li>
<li><strong>Teacher-student distillation</strong> offers a flexible and efficient way to apply relevance modeling across diverse performance, latency and cost requirements.</li>
</ul>
<p>Ultimately, improving semantic relevance strengthens the human connections that define Etsy. By understanding what shoppers truly mean, we can help them find the right items. And by emphasizing relevant listings over merely popular ones, we can help make the relevance component of search visibility fairer for our sellers – <a href="https://storage.googleapis.com/etsy-extfiles-prod/2025_Q1_GLOBAL_Seller%20Census%202024.pdf?ref=news">89% of whom</a> are businesses of one.</p>
<h1>Acknowledgments</h1>
<p>This work is brought to you through a collaborative effort by the Search Relevance Team, enabled by ML Enablement, and the Merchandising teams. Thanks to the following contributors:</p>
<ul>
<li>Data: Susan Liu, Jugal Gala, David Blincoe, Yuqing Zhang, Taylor Hunt, Liz Mikolaj</li>
<li>Models: David Blincoe, Oriane Cavrois, Orson Adams, Yuqing Zhang</li>
<li>Application: Grant Sherrick, Kaushik Bekal, Haoming Chen, Patrick Callier, Davis Kim, Marcus Daly</li>
<li>Product leadership: Julia Zhou, Willy Huang, Argie Angeleas</li>
<li>Engineering leadership: Yinlin Fu, Congzhe Su, Xiaoting Zhao</li>
<li>ML Enablement partners: Ari Carter, Stan Schwertly, Shreya Agarwal, K Ogilvie, Marvin Wang, etc.</li>
<li>Other cross-team partners: Will Beckman, Karl Yokono, Audrey Chen, Heather Campbell, David Le, Khadeeja Din, etc.</li>
<li>Early contributors: Ethan Benjamin, Cung Tran, Maggie Matsui, Jack Gammack, Yogeeta Chatoredussy, Austin Clapp, Benjamin Russell, Khaled Jabr</li>
</ul>
<p>Special thanks to Oriane Cavrois &amp; David Blincoe for helping this piece come to life.</p>
<p><em>* Images are provided for illustrative purposes. Item availability on Etsy may vary.</em></p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Reducing experiment duration with predicted control variates]]></title>
                <link>https://www.etsy.com/codeascraft/reducing-experiment-duration-with-predicted-control-variates?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Kelly McManus]]></dc:creator>
                <pubDate>Tue, 25 Nov 2025 11:28:26 -0500</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/reducing-experiment-duration-with-predicted-control-variates?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p><em>In 2021, we published a blog post titled “<a href="https://www.etsy.com/ca/codeascraft/increasing-experimentation-accuracy-and-speed-by-using-control-variates">Increasing experimentation accuracy and speed by using control variates</a>,” describing how we reduce the variance of metrics using CUPED in our experimentation platform. This is a follow-up on how CUPED has evolved at Etsy since then. Spoiler – It’s changed a lot, decreasing our average experiment duration by 3 days!</em></p>
<p>Etsy’s mission is to Keep Commerce Human. To achieve this, we need to understand the impact each change to our platform has on our buyers' and sellers' experience. Whether that involves changing the color of the “Buy Now” button on the Etsy app or updating elements of how our algorithms rank search results, we leverage large-scale online experimentation to iterate on and improve the things we build.</p>
<p>However, running an experiment can be a long process. From design and setup to running the experiment and analyzing results, the entire experimentation process can take weeks to months. Experiments must run long enough to collect sufficient data for the results to be statistically significant – ensuring we can confidently attribute observed changes to the treatment, rather than random chance. On the other hand, being able to learn from an experiment quickly is a crucial step in the product development lifecycle, enabling faster improvements to Etsy. Fortunately, there are tools to reduce experiment runtime. <a href="https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf">CUPED</a> is one of them! Variance reduction techniques like CUPED can help reduce the time to run an experiment, shortening the overall experimentation lifecycle and time to learning, as visualized below.</p>
<figure>
<img alt="Experimentation Lifecycle" src="https://i.etsystatic.com/inv/508bdd/7485804299/inv_fullxfull.7485804299_5dhtrbg7.jpg?version=0" title="Experimentation Lifecycle" />
</figure>
<h2>A recap of CUPED</h2>
<p>CUPED is a variance reduction technique that estimates experiment outcomes with greater speed and accuracy compared to a direct comparison between control and treatment groups. In 2021, Etsy <a href="https://www.etsy.com/ca/codeascraft/increasing-experimentation-accuracy-and-speed-by-using-control-variates">implemented CUPED</a> (Controlled-Experiment Using Pre-Experiment Data) for key metrics like Conversion Rate (the percentage of visitors that make a purchase). </p>
<p>CUPED leverages historical visitor data collected before the experiment begins – for example, the number of purchases in the week prior to the experiment – to explain some natural variation in the outcome metric. The pre-experiment factors are used as covariates in a linear regression model to remove some of the “noise” that is not attributable to the treatment. By accounting for this variation, CUPED reduces the variance of the treatment effect estimator, increasing statistical power and improving sensitivity without introducing bias.</p>
<p>The CUPED correction can be conceptualized as:</p>
<figure>
<img alt="CUPED formula" src="https://i.etsystatic.com/inv/c3af38/7484132339/inv_fullxfull.7484132339_n7wzrht2.jpg?version=0" title="CUPED formula" />
</figure>
<p>The CUPED-adjusted metric will have a smaller variance than the original metric, as visualized below, providing more precise estimates of a mean or treatment effect.</p>
<figure>
<img alt="CUPED variance chart" src="https://i.etsystatic.com/inv/c804c6/7436156294/inv_fullxfull.7436156294_lknh7nir.jpg?version=0" title="CUPED variance chart" />
</figure>
<p>Sample size, power, and variance are all related. Holding everything else unchanged, the smaller the variance of a metric, the smaller the sample size required to reach a desired power. Since we can reduce the variance of our metric by applying CUPED, we can achieve the same amount of power with a smaller sample size. In practice, a smaller sample size corresponds to a shorter experiment duration. </p>
<p>Etsy’s initial <a href="https://www.etsy.com/ca/codeascraft/increasing-experimentation-accuracy-and-speed-by-using-control-variates">implementation of CUPED</a> yielded an average variance reduction of 7% across all experiments, with some experiments achieving up to 30% variance reduction. Experiments that used CUPED-adjusted metrics in decision-making yielded a decision about 1 day earlier, on average. However, we’re always iterating to improve our buyers’ and sellers’ experience on Etsy, and we knew we could do even better. Enter: CUPAC.</p>
<h2>Leveling up further with CUPAC</h2>
<p>During our research and implementation of CUPED in 2020, scientists at DoorDash published a <a href="https://careersatdoordash.com/blog/improving-experimental-power-through-control-using-predictions-as-covariate-cupac/">blog post</a> describing a novel statistical method, building on CUPED, called Control Using Predictions as Covariate, or “CUPAC.” </p>
<p>When performing CUPAC, the pre-experiment data is first input into a non-linear machine learning model that captures more complex relationships than a linear model. The non-linear model is trained to predict the outcome metric of interest – for example, if an experiment is measuring the observed Conversion Rate, the model would predict Conversion Rate. The prediction more effectively captures the impact of pre-experiment behaviors on our experimental outcomes than the raw pre-experiment data because it captures complex relationships in the data that linear regression alone cannot. The prediction is then used as an “ML-based covariate” in a linear regression to perform the CUPED correction: </p>
<figure>
<img alt="CUPAC formula" src="https://i.etsystatic.com/inv/635c10/7484133503/inv_fullxfull.7484133503_juhuwigr.jpg?version=0" title="CUPAC formula" />
</figure>
<p>The CUPAC-adjusted outcome has an even smaller variance than the CUPED-adjusted outcome, as visualized below.</p>
<figure>
<img alt="CUPAC variance chart" src="https://i.etsystatic.com/inv/652fa0/7484077983/inv_fullxfull.7484077983_7p7csls3.jpg?version=0" title="CUPAC variance chart" />
</figure>
<p>Empirically, our CUPAC-adjusted metrics showed even lower variance than CUPED. Our initial prototype demonstrated that CUPAC produced an adjusted metric with an additional 10% smaller variance when compared to our original CUPED estimator. Despite the added complexity, these results justified incorporating CUPAC into our experimentation pipeline. We hypothesized it would cut average experiment duration by an additional day, enabling teams to run more experiments and ship changes to Etsy faster.</p>
<h2>Training and implementation</h2>
<p>The first step was to train the CUPAC models to predict the ML-based covariate. We identified over 100 pre-experiment features, up from 3 features in CUPED, to capture more behavior prior to the experiment. Using these features, we iteratively trained and tuned the models in Vertex AI. Hyperparameters were optimized on a validation dataset to maximize the median correlation between the model’s predictions and the observed in-experiment metrics across experiments.</p>
<p>Initially we trained <a href="https://xgboost.ai/">XGBoost</a>, a popular <a href="https://developers.google.com/machine-learning/decision-forests/intro-to-gbdt">gradient boosted tree</a> model, but then found <a href="https://lightgbm.readthedocs.io/en/stable/">LightGBM</a>, a similar non-linear, tree-based model, was better suited to predict the covariate. When testing the models at scale with billions of predictions, LightGBM demonstrated both rapid training and prediction times, along with strong validation results.</p>
<p>Once the models were trained, our next challenge was to implement them at scale. Our experimentation pipeline runs batch jobs for hundreds of experiments each day. From our original implementation, we had an Airflow DAG (directed acyclic graph) to orchestrate the CUPED variance reduction pipeline, as visualized below:</p>
<figure>
<img alt="CUPED pipeline" src="https://i.etsystatic.com/inv/baa64c/7436088400/inv_fullxfull.7436088400_c1h413ot.jpg?version=0" title="CUPED pipeline" />
</figure>
<p>We evolved this pipeline to support CUPAC by adding a batch prediction step to produce the ML-based covariate.</p>
<figure>
<img alt="CUPAC pipeline" src="https://i.etsystatic.com/inv/755784/7437734230/inv_fullxfull.7437734230_dc480y5e.jpg?version=0" title="CUPAC pipeline" />
</figure>
<p>In the above CUPAC pipeline, we perform the following steps: </p>
<ul>
<li>Calculate pre-experiment features and in-experiment data using BigQuery SQL jobs.</li>
<li>Predict ML-based covariates with our trained LightGBM models via parallel Dataflow jobs using the pre-experiment features.</li>
<li>Perform variance reduction with a Spark job that fits a linear regression model between the ML-based covariates and in-experiment data, creating the CUPAC-adjusted metrics.</li>
<li>Apply statistical t-tests using the CUPAC-adjusted metric to calculate the treatment effect, p-value, and power of the experiment.</li>
</ul>
<h2>Impact: Shortening average experiment duration by 3 days</h2>
<p>We measured success through variance reduction. Variance reduction is the percent change between the:</p>
<ul>
<li>Variance of the metric <em>without</em> CUPAC, and</li>
<li>Variance of the CUPAC-adjusted metric. </li>
</ul>
<p>The original CUPED implementation showed 7% variance reduction, reducing overall experiment duration by almost 1 day, on average. After implementing CUPAC, we observed an average of 27% variance reduction, nearly <strong>4x as much variance reduction</strong>, when compared to CUPED, exceeding our early research estimates.</p>
<figure>
<img alt="Impact chart" src="https://i.etsystatic.com/inv/63d8c5/7437853008/inv_fullxfull.7437853008_6cpvb4kq.jpg?version=0" title="Impact chart" />
</figure>
<p>The additional variance reduction shortens our average experiment duration by almost 3 days. <strong>This means a 10-day experiment could conclude in only 7 days due to the ability to reach power on a smaller sample size with CUPAC</strong>. These marginal time savings allow many teams to run <strong>10 or more additional experiments each year</strong>. That translates to more opportunities to test and faster insights into how we can deliver the best experience for our community of millions of sellers and buyers. </p>
<p>Notably, there was a substantial spread in variance reduction among different metrics and experiments, ranging from 2% to 77%. In the chart below, each blue bar displays the percent variance reduction for a sampled metric on an experiment. </p>
<figure>
<img alt="Percent chart" src="https://i.etsystatic.com/inv/9bc046/7485759507/inv_fullxfull.7485759507_eog6ydtb.jpg?version=0" title="Percent chart" />
</figure>
<p>The large range is expected because variance reduction can be influenced by several factors, such as metric definition, data accessibility, experimental design, and market characteristics. These factors affect how predictive the pre-experiment data is of the outcome metric, which in turn determines the degree of variance reduction. For example, two common experimentation metrics are Mean Visits and Purchase Rate. In the e-commerce setting, an individual's visit behavior will almost always be more stable over time than their purchasing behavior. This implies that pre-experiment data is more correlated with in-experiment data for a visit-related metric than for a purchase-related metric. Therefore, CUPAC is more effective at reducing variance in a metric like Mean Visits than in a metric like Purchase Rate.</p>
<h2>What’s next?</h2>
<p>Aligned with Etsy’s culture of experimentation, we’ll continue to evolve our pipeline to be nimble and flexible based on the needs of the teams that use it.</p>
<p>One challenge we face is that teams use metrics curated to specific parts of the Etsy experience – like search, recommendations, seller features, etc. – to make decisions on their experiment results. However, our CUPAC models take significant time to train and maintain for each metric, which limits the number of CUPAC-adjusted metrics we can develop. While we continue to grow CUPAC use, we also encourage teams to continue to use CUPED, which is more scalable and has lower maintenance costs. To support this, we plan to extend CUPED to more metrics by automatically collecting pre-experiment data based on each metric’s definition and using it to reduce noise. In tandem with our work on CUPAC, this CUPED expansion will enable teams across Etsy to benefit from variance reduction across all their team-specific metrics, not just a select few.</p>
<p>Despite the success of CUPED and CUPAC thus far, there remains a need to explore additional variance reduction techniques for the current metrics that leverage CUPAC. In 2024, we released research findings exploring a novel approach: <a href="https://arxiv.org/abs/2410.09027">Variance reduction combining pre-experiment and in-experiment data</a>. As we look to generalize our variance reduction architecture, we expect that incorporating such techniques will continue to strengthen our experimentation platform and enable product teams to iterate more quickly.</p>
<p>Lastly, it is important to recognize that applying variance reduction in practice can be a never-ending race to squeeze the most noise out of these estimators. In our experience, the craft lies in finding the sweet spot between variance reduction, implementation cost, and the impact on experimentation velocity. That intersection is context-dependent and what makes experimentation code as craft.</p>
<p>We hope our experience inspires you to try out variance reduction techniques and determine which one is best suited to your needs!</p>
<h2>Acknowledgements</h2>
<p>Thank you to Alexander Tank and Stephane Shao for their work on initial research and implementation of CUPAC. Thanks to Pablo Crespo for his research into extending our CUPAC models with more predictive features. And, thanks to Julie Beckley, Kevin Gaan, and Mary Hu for supporting and prioritizing this project. </p>
<h2>References</h2>
<p>A. Deng, Y. Xu, R. Kohavi, T. Walker (2013). <a href="https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf">Improving the sensitivity of online controlled experiments by utilizing pre-experiment data</a>.</p>
<p>J. Li (2020). <a href="https://careersatdoordash.com/blog/improving-experimental-power-through-control-using-predictions-as-covariate-cupac/">Improving Experimental Power through Control Using Predictions as Covariate (CUPAC)</a>.</p>]]></description>
                <content:encoded><![CDATA[<p><em>In 2021, we published a blog post titled “<a href="https://www.etsy.com/ca/codeascraft/increasing-experimentation-accuracy-and-speed-by-using-control-variates">Increasing experimentation accuracy and speed by using control variates</a>,” describing how we reduce the variance of metrics using CUPED in our experimentation platform. This is a follow-up on how CUPED has evolved at Etsy since then. Spoiler – It’s changed a lot, decreasing our average experiment duration by 3 days!</em></p>
<p>Etsy’s mission is to Keep Commerce Human. To achieve this, we need to understand the impact each change to our platform has on our buyers' and sellers' experience. Whether that involves changing the color of the “Buy Now” button on the Etsy app or updating elements of how our algorithms rank search results, we leverage large-scale online experimentation to iterate on and improve the things we build.</p>
<p>However, running an experiment can be a long process. From design and setup to running the experiment and analyzing results, the entire experimentation process can take weeks to months. Experiments must run long enough to collect sufficient data for the results to be statistically significant – ensuring we can confidently attribute observed changes to the treatment, rather than random chance. On the other hand, being able to learn from an experiment quickly is a crucial step in the product development lifecycle, enabling faster improvements to Etsy. Fortunately, there are tools to reduce experiment runtime. <a href="https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf">CUPED</a> is one of them! Variance reduction techniques like CUPED can help reduce the time to run an experiment, shortening the overall experimentation lifecycle and time to learning, as visualized below.</p>
<figure>
<img alt="Experimentation Lifecycle" src="https://i.etsystatic.com/inv/508bdd/7485804299/inv_fullxfull.7485804299_5dhtrbg7.jpg?version=0" title="Experimentation Lifecycle" />
</figure>
<h2>A recap of CUPED</h2>
<p>CUPED is a variance reduction technique that estimates experiment outcomes with greater speed and accuracy compared to a direct comparison between control and treatment groups. In 2021, Etsy <a href="https://www.etsy.com/ca/codeascraft/increasing-experimentation-accuracy-and-speed-by-using-control-variates">implemented CUPED</a> (Controlled-Experiment Using Pre-Experiment Data) for key metrics like Conversion Rate (the percentage of visitors that make a purchase). </p>
<p>CUPED leverages historical visitor data collected before the experiment begins – for example, the number of purchases in the week prior to the experiment – to explain some natural variation in the outcome metric. The pre-experiment factors are used as covariates in a linear regression model to remove some of the “noise” that is not attributable to the treatment. By accounting for this variation, CUPED reduces the variance of the treatment effect estimator, increasing statistical power and improving sensitivity without introducing bias.</p>
<p>The CUPED correction can be conceptualized as:</p>
<figure>
<img alt="CUPED formula" src="https://i.etsystatic.com/inv/c3af38/7484132339/inv_fullxfull.7484132339_n7wzrht2.jpg?version=0" title="CUPED formula" />
</figure>
<p>The CUPED-adjusted metric will have a smaller variance than the original metric, as visualized below, providing more precise estimates of a mean or treatment effect.</p>
<figure>
<img alt="CUPED variance chart" src="https://i.etsystatic.com/inv/c804c6/7436156294/inv_fullxfull.7436156294_lknh7nir.jpg?version=0" title="CUPED variance chart" />
</figure>
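<p>To make the correction concrete, here is a minimal sketch of the standard single-covariate CUPED adjustment from Deng et al. (2013), where the adjusted outcome is the original outcome minus theta times the centered covariate, and theta is the regression slope Cov(Y, X) / Var(X). The simulated data, the function name, and the use of only one covariate are assumptions for illustration; our production implementation uses several covariates and is not shown here.</p>

```python
import random
from statistics import fmean, variance

def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes y - theta * (x - mean(x))."""
    x_bar, y_bar = fmean(x), fmean(y)
    # theta = Cov(Y, X) / Var(X), the slope of a regression of Y on X.
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

rng = random.Random(0)
# Simulated visitors: x is a pre-experiment covariate, y the in-experiment outcome.
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

y_adj = cuped_adjust(y, x)
reduction = 1 - variance(y_adj) / variance(y)
print(f"mean before {fmean(y):.3f}, after {fmean(y_adj):.3f}")
print(f"variance reduction: {reduction:.1%}")
```

<p>Because the covariate is centered before it is subtracted, the mean of the adjusted outcome equals the mean of the original outcome; only the variance changes.</p>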
<p>Sample size, power, and variance are all related. Holding everything else unchanged, the smaller the variance of a metric, the smaller the sample size required to reach a desired power. Since we can reduce the variance of our metric by applying CUPED, we can achieve the same amount of power with a smaller sample size. In practice, a smaller sample size corresponds to a shorter experiment duration. </p>
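<p>The relationship can be sketched with the textbook sample-size approximation for a two-sided, two-sample z-test, in which the required sample size per group is proportional to the metric's variance. The function name, the default significance level and power, and the example effect size below are illustrative assumptions, not values from our platform.</p>

```python
from statistics import NormalDist

def required_n_per_group(variance, mde, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample z-test.

    variance: variance of the metric; mde: smallest absolute effect to detect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 * variance / mde ** 2

n_raw = required_n_per_group(variance=1.0, mde=0.02)
n_adj = required_n_per_group(variance=0.93, mde=0.02)  # 7% lower variance
print(round(n_raw), round(n_adj), round(n_adj / n_raw, 2))
```

<p>With everything else held fixed, a 7% lower variance needs 7% fewer visitors per group, which is where the time savings come from.</p>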
<p>Etsy’s initial <a href="https://www.etsy.com/ca/codeascraft/increasing-experimentation-accuracy-and-speed-by-using-control-variates">implementation of CUPED</a> yielded an average variance reduction of 7% across all experiments, with some experiments achieving up to 30% variance reduction. Experiments that used CUPED-adjusted metrics in decision-making yielded a decision about 1 day earlier, on average. However, we’re always iterating to improve our buyers’ and sellers’ experience on Etsy, and we knew we could do even better. Enter: CUPAC.</p>
<h2>Leveling up further with CUPAC</h2>
<p>During our research and implementation of CUPED in 2020, scientists at DoorDash published a <a href="https://careersatdoordash.com/blog/improving-experimental-power-through-control-using-predictions-as-covariate-cupac/">blog post</a> describing a novel statistical method, building on CUPED, called Control Using Predictions as Covariate, or “CUPAC.” </p>
<p>When performing CUPAC, the pre-experiment data is first input into a non-linear machine learning model that captures more complex relationships than a linear model. The non-linear model is trained to predict the outcome metric of interest – for example, if an experiment is measuring the observed Conversion Rate, the model would predict Conversion Rate. The prediction more effectively captures the impact of pre-experiment behaviors on our experimental outcomes than the raw pre-experiment data because it captures complex relationships in the data that linear regression alone cannot. The prediction is then used as an “ML-based covariate” in a linear regression to perform the CUPED correction: </p>
<figure>
<img alt="CUPAC formula" src="https://i.etsystatic.com/inv/635c10/7484133503/inv_fullxfull.7484133503_juhuwigr.jpg?version=0" title="CUPAC formula" />
</figure>
<p>The CUPAC-adjusted outcome has an even smaller variance than the CUPED-adjusted outcome, as visualized below.</p>
<figure>
<img alt="CUPAC variance chart" src="https://i.etsystatic.com/inv/652fa0/7484077983/inv_fullxfull.7484077983_7p7csls3.jpg?version=0" title="CUPAC variance chart" />
</figure>
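<p>The toy simulation below illustrates why a prediction can be a better covariate than a raw feature. The outcome depends on a pre-experiment feature non-linearly, so a linear correction on the raw feature removes almost no variance, while a correction on a model's prediction removes much more. The binned-mean predictor is a deliberately simple stand-in for a real non-linear model such as a gradient boosted tree, and all names and data here are hypothetical.</p>

```python
import random
from statistics import fmean, variance

def adjust(y, covariate):
    """CUPED-style correction of y using a single covariate."""
    c_bar, y_bar = fmean(covariate), fmean(y)
    cov = sum((c - c_bar) * (v - y_bar) for c, v in zip(covariate, y)) / (len(y) - 1)
    theta = cov / variance(covariate)
    return [v - theta * (c - c_bar) for c, v in zip(covariate, y)]

def fit_binned_model(x_train, y_train, n_bins=20, lo=-3.0, hi=3.0):
    """Toy non-linear model: predict the mean training outcome in each feature bin."""
    width = (hi - lo) / n_bins
    def bin_of(x):
        return min(n_bins - 1, max(0, int((x - lo) / width)))
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for xi, yi in zip(x_train, y_train):
        sums[bin_of(xi)] += yi
        counts[bin_of(xi)] += 1
    overall = fmean(y_train)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bin_of(x)]

rng = random.Random(1)
def simulate(n):
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi ** 2 + rng.gauss(0, 0.5) for xi in x]  # outcome is non-linear in x
    return x, y

x_hist, y_hist = simulate(4000)  # historical data used only to train the model
x_exp, y_exp = simulate(4000)    # data from the experiment being analyzed

predict = fit_binned_model(x_hist, y_hist)
y_cuped = adjust(y_exp, x_exp)                          # raw feature as covariate
y_cupac = adjust(y_exp, [predict(xi) for xi in x_exp])  # prediction as covariate

for name, values in [("raw", y_exp), ("CUPED", y_cuped), ("CUPAC", y_cupac)]:
    print(f"{name:6s} variance: {variance(values):.3f}")
```

<p>The model is trained on separate historical data, so its predictions depend only on pre-experiment information and cannot be influenced by the treatment.</p>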
<p>Empirically, our CUPAC-adjusted metrics showed even lower variance than CUPED. Our initial prototype demonstrated that CUPAC produced an adjusted metric with an additional 10% smaller variance when compared to our original CUPED estimator. Despite the added complexity, these results justified incorporating CUPAC into our experimentation pipeline. We hypothesized it would cut average experiment duration by an additional day, enabling teams to run more experiments and ship changes to Etsy faster.</p>
<h2>Training and implementation</h2>
<p>The first step was to train the CUPAC models to predict the ML-based covariate. We identified over 100 pre-experiment features, up from 3 features in CUPED, to capture more behavior prior to the experiment. Using these features, we iteratively trained and tuned the models in Vertex AI. Hyperparameters were optimized on a validation dataset to maximize the median correlation between the model’s predictions and the observed in-experiment metrics across experiments.</p>
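<p>As a sketch of that selection criterion, the snippet below scores each candidate configuration by the median, across experiments, of the Pearson correlation between its predictions and the observed values, then keeps the highest-scoring one. The configuration names and the tiny arrays are made up for illustration, and the choice of Pearson correlation is an assumption; this does not reproduce our Vertex AI tuning setup.</p>

```python
from statistics import fmean, median

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = fmean(a), fmean(b)
    cov = sum((p - a_bar) * (q - b_bar) for p, q in zip(a, b))
    norm = (sum((p - a_bar) ** 2 for p in a) * sum((q - b_bar) ** 2 for q in b)) ** 0.5
    return cov / norm

def median_correlation(per_experiment):
    """per_experiment: list of (predictions, observed) pairs, one per experiment."""
    return median(pearson(pred, obs) for pred, obs in per_experiment)

# Made-up validation data for two hypothetical hyperparameter settings.
observed = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0], [0.0, 1.0, 0.0, 1.0]]
candidates = {
    "setting_a": [[1.1, 1.9, 3.2, 3.8], [1.0, 2.0, 3.0, 4.0], [0.2, 0.9, 0.1, 0.8]],
    "setting_b": [[4.0, 3.0, 2.0, 1.0], [2.0, 1.0, 4.0, 3.0], [1.0, 0.0, 1.0, 0.0]],
}
scores = {name: median_correlation(list(zip(preds, observed)))
          for name, preds in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

<p>Using the median rather than the mean keeps a single unusual experiment from dominating the choice of hyperparameters.</p>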
<p>Initially we trained <a href="https://xgboost.ai/">XGBoost</a>, a popular <a href="https://developers.google.com/machine-learning/decision-forests/intro-to-gbdt">gradient boosted tree</a> model, but then found <a href="https://lightgbm.readthedocs.io/en/stable/">LightGBM</a>, a similar non-linear, tree-based model, was better suited to predict the covariate. When testing the models at scale with billions of predictions, LightGBM demonstrated both rapid training and prediction times, along with strong validation results.</p>
<p>Once the models were trained, our next challenge was to implement them at scale. Our experimentation pipeline runs batch jobs for hundreds of experiments each day. From our original implementation, we had an Airflow DAG (directed acyclic graph) to orchestrate the CUPED variance reduction pipeline, as visualized below:</p>
<figure>
<img alt="CUPED pipeline" src="https://i.etsystatic.com/inv/baa64c/7436088400/inv_fullxfull.7436088400_c1h413ot.jpg?version=0" title="CUPED pipeline" />
</figure>
<p>We evolved this pipeline to support CUPAC by adding a batch prediction step to produce the ML-based covariate.</p>
<figure>
<img alt="CUPAC pipeline" src="https://i.etsystatic.com/inv/755784/7437734230/inv_fullxfull.7437734230_dc480y5e.jpg?version=0" title="CUPAC pipeline" />
</figure>
<p>In the above CUPAC pipeline, we perform the following steps: </p>
<ul>
<li>Calculate pre-experiment features and in-experiment data using BigQuery SQL jobs.</li>
<li>Predict ML-based covariates with our trained LightGBM models via parallel Dataflow jobs using the pre-experiment features.</li>
<li>Perform variance reduction with a Spark job that fits a linear regression model between the ML-based covariates and in-experiment data, creating the CUPAC-adjusted metrics.</li>
<li>Apply statistical t-tests using the CUPAC-adjusted metric to calculate the treatment effect, p-value, and power of the experiment.</li>
</ul>
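<p>The last two steps above can be sketched on a single machine as follows: fit one regression slope between the covariate and the in-experiment metric, subtract the explained part, and test the difference in adjusted means. Everything here is simulated and the names are hypothetical; the p-value uses a normal approximation to the t-distribution, which is reasonable only for large samples, and nothing below reflects the actual Spark job.</p>

```python
import random
from statistics import NormalDist, fmean, variance

def two_sample_test(control, treatment):
    """Difference in means with a two-sided p-value (normal approximation)."""
    effect = fmean(treatment) - fmean(control)
    se = (variance(control) / len(control) + variance(treatment) / len(treatment)) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(effect / se)))
    return effect, se, p_value

rng = random.Random(7)
n = 5000
# cov_* stand in for the ML-based covariate; y_* for the in-experiment metric.
cov_c = [rng.gauss(0, 1) for _ in range(n)]
cov_t = [rng.gauss(0, 1) for _ in range(n)]
y_c = [c + rng.gauss(0, 1) for c in cov_c]
y_t = [c + rng.gauss(0, 1) + 0.05 for c in cov_t]  # simulated true effect of 0.05

# One regression slope fitted on both groups pooled, then applied to each.
cov_all, y_all = cov_c + cov_t, y_c + y_t
c_bar, y_bar = fmean(cov_all), fmean(y_all)
theta = (sum((c - c_bar) * (v - y_bar) for c, v in zip(cov_all, y_all))
         / sum((c - c_bar) ** 2 for c in cov_all))
adj_c = [v - theta * (c - c_bar) for c, v in zip(cov_c, y_c)]
adj_t = [v - theta * (c - c_bar) for c, v in zip(cov_t, y_t)]

for label, (a, b) in {"raw": (y_c, y_t), "adjusted": (adj_c, adj_t)}.items():
    effect, se, p = two_sample_test(a, b)
    print(f"{label:9s} effect={effect:.3f} se={se:.4f} p={p:.4f}")
```

<p>The adjusted comparison estimates the same treatment effect with a smaller standard error, which is what lets an experiment reach a decision on less data.</p>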
<h2>Impact: Shortening average experiment duration by 3 days</h2>
<p>We measured success through variance reduction. Variance reduction is the percent change between the:</p>
<ul>
<li>Variance of the metric <em>without</em> CUPAC, and</li>
<li>Variance of the CUPAC-adjusted metric. </li>
</ul>
<p>The original CUPED implementation showed 7% variance reduction, reducing overall experiment duration by almost 1 day, on average. After implementing CUPAC, we observed an average of 27% variance reduction, nearly <strong>4x as much variance reduction</strong>, when compared to CUPED, exceeding our early research estimates.</p>
<figure>
<img alt="Impact chart" src="https://i.etsystatic.com/inv/63d8c5/7437853008/inv_fullxfull.7437853008_6cpvb4kq.jpg?version=0" title="Impact chart" />
</figure>
<p>The additional variance reduction shortens our average experiment duration by almost 3 days. <strong>This means a 10-day experiment could conclude in only 7 days due to the ability to reach power on a smaller sample size with CUPAC</strong>. These marginal time savings allow many teams to run <strong>10 or more additional experiments each year</strong>. That translates to more opportunities to test and faster insights into how we can deliver the best experience for our community of millions of sellers and buyers. </p>
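<p>The arithmetic behind these figures can be checked with a short calculation. It assumes that required sample size scales in proportion to variance and that traffic arrives at a constant daily rate, so duration scales the same way; the function name is hypothetical and real experiments will deviate from this simplification.</p>

```python
def days_needed(baseline_days, variance_reduction):
    """Scale duration by the share of variance that remains after adjustment."""
    return baseline_days * (1 - variance_reduction)

cuped_days = days_needed(10, 0.07)
cupac_days = days_needed(10, 0.27)
print(f"CUPED: {cuped_days:.1f} days, CUPAC: {cupac_days:.1f} days")
print(f"CUPAC vs CUPED reduction ratio: {0.27 / 0.07:.1f}x")
```

<p>Under these assumptions a 10-day experiment needs about 9.3 days with a 7% reduction and about 7.3 days with a 27% reduction, and 27% is roughly 3.9 times 7%.</p>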
<p>Notably, there was a substantial spread in variance reduction among different metrics and experiments, ranging from 2% to 77%. In the chart below, each blue bar displays the percent variance reduction for a sampled metric on an experiment. </p>
<figure>
<img alt="Percent chart" src="https://i.etsystatic.com/inv/9bc046/7485759507/inv_fullxfull.7485759507_eog6ydtb.jpg?version=0" title="Percent chart" />
</figure>
<p>The large range is expected because variance reduction can be influenced by several factors, such as metric definition, data accessibility, experimental design, and market characteristics. These factors affect how predictive the pre-experiment data is of the outcome metric, which in turn determines the degree of variance reduction. For example, two common experimentation metrics are Mean Visits and Purchase Rate. In the e-commerce setting, an individual's visit behavior will almost always be more stable over time than their purchasing behavior. This implies that pre-experiment data is more correlated with in-experiment data for a visit-related metric than for a purchase-related metric. Therefore, CUPAC is more effective at reducing variance in a metric like Mean Visits than in a metric like Purchase Rate.</p>
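<p>For a correction with a single covariate, the link between correlation and variance reduction is exact: the adjusted variance is (1 - rho squared) times the original variance, where rho is the correlation between the covariate and the outcome (Deng et al., 2013). The sketch below inverts that relationship to show the correlation implied by a given reduction. Applying it to the percentages quoted in this post is only an illustration, since averages across experiments do not map to a single correlation.</p>

```python
import math

def implied_correlation(variance_reduction):
    """Correlation that yields the given reduction when reduction = rho ** 2."""
    return math.sqrt(variance_reduction)

for reduction in (0.02, 0.07, 0.27, 0.77):
    rho = implied_correlation(reduction)
    print(f"{reduction:.0%} reduction implies a correlation of about {rho:.2f}")
```

<p>Because the reduction grows with the square of the correlation, a covariate has to be quite predictive before the gains become large, which is consistent with stable metrics benefiting the most.</p>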
<h2>What’s next?</h2>
<p>Aligned with Etsy’s culture of experimentation, we’ll continue to evolve our pipeline to be nimble and flexible based on the needs of the teams that use it.</p>
<p>One challenge we face is that teams use metrics curated to specific parts of the Etsy experience – like search, recommendations, seller features, etc. – to make decisions on their experiment results. However, our CUPAC models take significant time to train and maintain for each metric, which limits the number of CUPAC-adjusted metrics we can develop. While we continue to grow CUPAC use, we also encourage teams to continue to use CUPED, which is more scalable and has lower maintenance costs. To support this, we plan to extend CUPED to more metrics by automatically collecting pre-experiment data based on each metric’s definition and using it to reduce noise. In tandem with our work on CUPAC, this CUPED expansion will enable teams across Etsy to benefit from variance reduction across all their team-specific metrics, not just a select few.</p>
<p>Despite the success of CUPED and CUPAC thus far, there remains a need to explore additional variance reduction techniques for the current metrics that leverage CUPAC. In 2024, we released research findings exploring a novel approach: <a href="https://arxiv.org/abs/2410.09027">Variance reduction combining pre-experiment and in-experiment data</a>. As we look to generalize our variance reduction architecture, we expect that incorporating such techniques will continue to strengthen our experimentation platform and enable product teams to iterate more quickly.</p>
<p>Lastly, it is important to recognize that applying variance reduction in practice can be a never-ending race to squeeze the most noise out of these estimators. In our experience, the craft lies in finding the sweet spot between variance reduction, implementation cost, and the impact on experimentation velocity. That intersection is context-dependent and what makes experimentation code as craft.</p>
<p>We hope our experience inspires you to try out variance reduction techniques and determine which one is best suited to your needs!</p>
<h2>Acknowledgements</h2>
<p>Thank you to Alexander Tank and Stephane Shao for their work on initial research and implementation of CUPAC. Thanks to Pablo Crespo for his research into extending our CUPAC models with more predictive features. And, thanks to Julie Beckley, Kevin Gaan, and Mary Hu for supporting and prioritizing this project. </p>
<h2>References</h2>
<p>A. Deng, Y. Xu, R. Kohavi, T. Walker (2013). <a href="https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf">Improving the sensitivity of online controlled experiments by utilizing pre-experiment data</a>.</p>
<p>J. Li (2020). <a href="https://careersatdoordash.com/blog/improving-experimental-power-through-control-using-predictions-as-covariate-cupac/">Improving Experimental Power through Control Using Predictions as Covariate (CUPAC)</a>.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Improving performance by prefetching product pages from Etsy Search]]></title>
                <link>https://www.etsy.com/codeascraft/search-prefetching-performance?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[David Weinzimmer]]></dc:creator>
                <pubDate>Wed, 29 Oct 2025 13:14:09 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/search-prefetching-performance?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>Rarely are there opportunities for big, bold, game-changing improvements in web performance. The Speculation Rules API (SRA) is a recent browser development that offers just such an opportunity. This post details a joint effort between the search and the web performance teams at Etsy to implement SRA on Etsy search pages and drastically improve the performance of product listing pages with some metrics seeing 20-24% improvements and some dropping all the way to 0ms values.</p>
<h2>Prefetching Options</h2>
<p>There are two main ways to predictively fetch resources for the next page:</p>
<ul>
<li><em><code>&lt;link rel="prefetch"&gt;</code></em> allows us to instruct the browser to download a resource we believe we'll need soon. The resource can be anything, like a static asset or an HTML page.  </li>
<li><em>Speculation Rules API</em> (SRA) is a <a href="https://developer.chrome.com/blog/speculation-rules-improvements">recently updated</a> browser API which allows for a JSON definition to dictate what page A wants to do with page B. The "do"-ing can be either <strong>prefetching</strong> (just download the HTML) or <strong>prerendering</strong> (load the page, including its static assets, and <em>render</em> it completely). The prerendering happens in a new browser process and page B is ready to be swapped with the current page A instantaneously when the user navigates to B.</li>
</ul>
<p>While implementing full prerendering is likely to yield more impressive performance improvements, it is a bigger and riskier investment, mostly related to the side effects of executing the JavaScript on the target page B. Starting with prefetching is a good first step into exploring SRA.</p>
<p>The benefits of using SRA over link prefetch will become evident further in the article, but the topline highlights include: a simpler API to define <em>what</em> is prefetched (via a CSS selector rather than do-it-yourself bespoke JavaScript), convenient utilities to define <em>when</em>, as well as <em>where</em> (memory and HTTP cache) prefetching happens, and a built-in upgrade path to full-on prerendering.</p>
<h2>Implementing Speculation Rules</h2>
<p>The Search team at Etsy recently ran an A/B experiment to use the Speculation Rules API to prefetch the listing page when hovering over organic listing cards on the desktop search page on Chromium browsers. To do this, we added a new <code>&lt;script type="speculationrules"&gt;</code> tag to the search page with JSON that defines how we want the prefetching to work, like so:</p>
<pre><code class="language-html">&lt;script type="speculationrules"&gt;
{
  "prefetch": [{
    "where": {
        "and": [
          {
            "href_matches": "/{*/}?listing/*"
          },
          {
            "selector_matches": "[data-sr-prefetch='1']"
          }
        ]
    },
    "eagerness": "moderate"
  }]
}
&lt;/script&gt;</code></pre>
<p>This instructs the browser to download the HTML for a <em>listing</em> page when:</p>
<ul>
<li>the user hovers over a link to a listing page for 200 milliseconds (defined by the “moderate” <code>eagerness</code> property), and   </li>
<li>the link has a <code>data-sr-prefetch</code> attribute</li>
</ul>
<p>The <code>data</code> attribute allows us to more precisely opt in pages that are eligible for prefetching. </p>
<h2>Lessons learned</h2>
<p>In terms of changes to the page’s code, the implementation of SRA was straightforward. As such, we spent most of our time testing that everything was working as expected and that our systems and analytics were not inadvertently affected. And we found some surprises along the way, related to all the little details modern web pages use (such as cookies, redirects, new tabs).</p>
<p>Allow us to share a few lessons in prefetching...</p>
<h3>Two ways to prefetch</h3>
<p>As mentioned earlier, prefetching can happen one of two ways: <code>&lt;link rel="prefetch"&gt;</code> or speculative prefetch (the one using SRA). So what sets them apart?</p>
<p>They do work mostly the same, except that the speculative prefetch caches the page in both memory and the HTTP cache. The <code>&lt;link&gt;</code> only uses the HTTP cache and merely downloads the specified resource.</p>
<p>This makes the SRA way of prefetching more advantageous than the <code>&lt;link&gt;</code> prefetch because of the memory cache.</p>
<h3>Two speculative prefetches only</h3>
<p>We also discovered the number of prefetched pages that are kept in memory is restricted to two. When you prefetch a third page, the first one is evicted from the memory cache. The HTTP cache still works as usual. So again, the SRA prefetching is preferable to the <code>&lt;link&gt;</code> prefetch due to the difference in caching we just described.</p>
<p>It’s helpful while debugging to be aware of the eviction of the prefetched page from memory. But rest assured, the downloaded page is still cached locally.</p>
<h3>Eagerness</h3>
<p>While <code>&lt;link rel="prefetch"&gt;</code> advises the browser to load a resource as soon as it sees the <code>&lt;link&gt;</code> in the DOM, the speculative prefetch is more nuanced, offering <em>eager</em>, <em>immediate</em>, <em>conservative</em> and <em>moderate</em> loading. We selected <em>moderate</em> eagerness, which prefetches after the user has hovered over a link for 200ms.</p>
<p>Exploring our options we found that the <em>immediate</em> eagerness would trigger a significantly larger number of prefetches (since it executes immediately and prefetches <em>all</em> eligible pages), and we wanted to avoid creating new server requests for listing cards with a low likelihood of being clicked. However, the <em>immediate</em> eagerness setting could be worth considering if the cost of additional requests is very low. The <em>conservative</em> eagerness executes on pointer or touch down, providing a very small head start over normal browser behavior and therefore greatly reducing the potential benefits of prefetching. Conservative eagerness may only be suitable for a use case in which it is necessary to avoid unused prefetches altogether.</p>
<p>Note that <em>eager</em> and <em>immediate</em> were synonyms in the initial SRA implementation, but that is changing. Keep an eye on <a href="https://developer.chrome.com/docs/web-platform/prerender-pages#eagerness">the official docs</a> for updates.</p>
<h3>Speculations and new browser tabs</h3>
<p>Initially, SRA launched without the ability to prerender pages that open in new tabs, as Etsy listings do. This option was added later, but only for links whose <code>target</code> attribute is <code>_blank</code>, not for named targets such as the ones that Etsy uses, for example <code>&lt;a href="/listing/123" target="etsy.123456"&gt;</code>.</p>
<p>Fortunately, the target restriction doesn't apply to prefetching, so for SRA prefetching (unlike SRA prerendering) there's no problem, regardless of whether or how you specify a target at all. For developers who may be considering moving from prefetching to prerendering, this distinction is something to bear in mind. </p>
<h3>5-minute rule</h3>
<p>Because of the complex nature of listing pages, Etsy's HTML pages are non-cacheable. However, the speculative prefetch keeps the prefetched pages cached in memory for five minutes. This was a helpful learning, as there would be no point in using speculative prefetches at all if they expired immediately. After five minutes, the normal caching rules apply, set via HTTP headers such as <code>Cache-Control: max-age</code> or <code>Expires</code>.</p>
<p>Given that only two pages are currently kept in memory cache and all others expire because they are non-cacheable, the benefits are greatly reduced when, for example, a person hovers over 3 links and eventually clicks the first one which leads to a page that's already expired from the prefetch memory (and HTTP!) cache.</p>
<p>To aid with the two-page restriction, one strategy we devised is to make our pages cacheable for five minutes when we detect a prefetch request. Such requests are identifiable because the browser sends the <code>Sec-Purpose: prefetch</code> HTTP header when prefetching. This helps preserve downloaded pages that would’ve otherwise expired from both memory and HTTP cache.</p>
<h3>Video links and shadow DOM</h3>
<p>Often, listings on Etsy include product videos, which start to play on the search results page when a user hovers their mouse over them. In these cases, prefetching doesn't work: the mouse hover is effectively "swallowed," disappearing into the <a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_shadow_DOM">shadow DOM</a> of the browser's video player. One workaround is to overlay a <code>div</code> on top of the video for 200ms to let the hover register in the DOM. Then, after the 200ms has elapsed, remove the extra <code>div</code> to let the browser video controls (e.g., on right click) work as usual. You can find a demonstration of this technique <a href="https://www.phpied.com/files/sra/imagehover3.html">here</a>.</p>
<h3>Cookies</h3>
<p>If a page sets cookies, prefetching it will set those cookies as well (as demonstrated <a href="https://www.phpied.com/files/sra/test.html">here</a>). This is something to be aware of, as the prefetch may end up being unused. This may confuse your application (and/or analytics) into thinking a page has been visited when in reality it was not.</p>
<p>Again, you can use the <code>Sec-Purpose: prefetch</code> HTTP header to detect prefetch requests and avoid setting the cookie as part of the prefetching process.</p>
<h3>Redirects</h3>
<p>If the link to the page being prefetched goes through a redirect, the actual page after the redirect is still being prefetched. Let’s say you have a sequence that looks like this:</p>
<p>Link on Page A -&gt; redirect -&gt; Page B </p>
<p>Here the browser follows the redirect during prefetching and still caches Page B. When the user then clicks the link on page A leading to Page B, the browser follows the usual process of going through the redirect. Normal HTTP cache rules still apply, meaning that if the redirect is cached, it won’t need to be requested again.</p>
<p>So, even though redirects are a <a href="https://developers.google.com/speed/docs/insights/AvoidRedirects">bad performance practice</a>, if you need to do them, they do not affect prefetching as long as you set appropriate caching headers.</p>
<h3>Mutating hrefs</h3>
<p>Sometimes the <code>href</code> attributes of <code>&lt;a&gt;</code> elements get modified by JavaScript on mouse hover. This does not play well with prefetching. Imagine you have:</p>
<pre><code class="language-html">&lt;a href="link.html"&gt;Follow me&lt;/a&gt;</code></pre>
<p>… which changes on hover to:</p>
<pre><code class="language-html">&lt;a href="link.html?source=footer"&gt;Follow me&lt;/a&gt;</code></pre>
<p>When the user hovers over the link, the browser starts working on prefetching <code>link.html</code> but realizes that the link to that page is no longer in the DOM and abandons the process. So the page is not prefetched even if <code>?source=footer</code> doesn’t change the target page in any way other than reporting analytics. The browser has no way of knowing this and considers the two as separate pages.</p>
<p>Additionally, the failed attempt at prefetching <code>link.html</code> counts in the “two speculations only” rule and evicts the older speculative load from the memory cache. For best results, avoid modifying links on hover.</p>
<h3>Analytics and Event Logging</h3>
<p>This is the elephant in the room. Many sites on the web today were built in a world where prefetching did not exist. So there is one big assumption: that a page load is always initiated by the user and the load can be counted as such – either server-side during page construction or client-side by JavaScript after the page is loaded (or, as it often happens, a combination of the two). With prefetching, this assumption is no longer true. A page constructed on the server-side and downloaded by the browser does not necessarily mean the page has been seen (and therefore its JavaScript has been executed). This can result in a number of miscalculations when it comes to analytics.</p>
<p>Luckily, browser APIs such as the <code>Sec-Purpose</code> HTTP header and JavaScript APIs (<code>document.prerendering</code> and the <code>prerenderingchange</code> event) allow us to tell prerender requests from user-generated ones, and to detect when a prerendered page is "activated" (when the user actually sees it). For prefetches, the <code>navigational-prefetch</code> value of Performance Resource Timing’s <code>deliveryType</code> property can be used for the purposes of analytics.</p>
<p>We (and our analytics partners) found this to be the hardest part: ironing out the required analytics updates so that numbers remain true after implementing speculation rules. In our particular use case, we intentionally pursued a strategy of prefetching the destination page instead of prerendering it, meaning that no assets would be loaded and JavaScript would not execute on our prefetches. This gave us a relatively simple way to handle the accuracy of our analytics.</p>
<p>A foundational piece of our analytics is event logging. For example, in the controller of the listing page we log a <code>view_listing</code> event that contains key information such as the listing ID, user ID, etc. This informs not only our site analytics, but also our search training pipeline, recently viewed listing data for users, and more. We ended up creating a system to cache the payload of all events within a request to avoid firing those events during prefetches. We were then able to move that event logging to the destination page’s JavaScript bundle, deferring them until after page “activation” and mitigating the impact of prefetching on our analytics.</p>
<h2>Results</h2>
<p>We were thrilled with the performance results of the prefetching experiment. We saw a 20-24% improvement in many performance metrics we care about: <a href="https://web.dev/articles/ttfb">TTFB</a>, <a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/DOMContentLoaded_event">DOMContentLoaded</a>, <a href="https://web.dev/articles/fcp">FCP</a>, <a href="https://web.dev/articles/lcp">LCP</a>. </p>
<ul>
<li>The 75th percentile time to first byte (TTFB) on the listing page improved by 23.6%   </li>
<li>We saw similar improvements throughout the request: First Contentful Paint -20.7%, Largest Contentful Paint -21.1%, DOMContentLoaded -20.4%, and Page Load -10.6%</li>
</ul>
<p>In the cumulative distribution function below, we see the control of our experiment (no speculation rules prefetches) in blue, and the treatment (speculation rules prefetches) in orange, with the treatment dramatically faster than the control at every percentile. Remarkably, about 40% of eligible browsers saw their TTFB drop nearly to zero:</p>
<figure>
<img alt="Time to first byte in the control and the experiment groups" src="https://i.etsystatic.com/inv/f69a53/7378031331/inv_fullxfull.7378031331_hfrpx3ex.jpg?version=0" title="Time to first byte in the control and the experiment groups" />
</figure>
<p>We saw small but detectable improvements in some business metrics, which is promising given that listing page views come from many sources, only some of which are search results. As we implement more prefetching in more places, we hypothesize that the numbers will further improve.</p>
<p>When people approach SRA implementation, they may worry about unused prefetches and resource waste. In our experiment we saw a ratio of about 14:1 for the number of prefetches requested to subsequently activated pages (i.e., about 1 in 14 prefetch requests was navigated to by the user).</p>
<p>We’re encouraged by these results, and are looking forward to new opportunities to improve performance across additional surfaces.</p>
<h2>Opportunities to iterate and expand</h2>
<p>One clear opportunity is to try implementing prefetching on other pages beyond Search. Shoppers end up on product listing pages from various other referral surfaces: shop pages, our SEO-optimized landing pages, home page, etc. Prefetching could improve performance on these surfaces, leading to a better experience for Etsy buyers. </p>
<p>Another opportunity is to consider upgrading our <em>prefetching</em> to <em>prerendering</em> in the future. This would be a significant change to client-side JavaScript code operating during prefetches. However (and it's hard to contain the excitement about this!) Chrome is working on a <a href="https://issues.chromium.org/issues/428500219"><code>prerender-until-script</code></a> update, which means prerendering stops at the first <code>&lt;script&gt;</code>. Even if you have <code>&lt;script&gt;</code> high up in the <code>&lt;head&gt;</code> of your page and prerendering halts early, the browser will still download page resources (scripts, styles, images, fonts) and have them ready. </p>
<p>For our use case, enabling <code>prerender-until-script</code> would mean that frontend performance metrics downstream of TTFB, such as First/Largest Contentful Paint, would likely see even larger improvements, and users would be able to interact with the listing page even earlier. This would further reduce friction for users when browsing on Etsy, letting them spend less time watching web pages load and more time engaging directly with our sellers’ amazing inventory of items. </p>
<h2>Acknowledgements</h2>
<p>Implementing SRA was truly a cross-team effort, not only by the search front-end and web performance teams but also people from infrastructure, analytics, ranking, and recommendations. Special thanks to Paul Calvano from our Web Performance team, Diana Sanchez Urban from Search Experience Web, and Eileen Toomer from the Visits team! This project also benefited from input from members of Recommendations and Listing Page teams, as well as members of our internal Architecture Advisory Group.</p>]]></description>
                <content:encoded><![CDATA[<p>Rarely are there opportunities for big, bold, game-changing improvements in web performance. The Speculation Rules API (SRA) is a recent browser development that offers just such an opportunity. This post details a joint effort between the search and the web performance teams at Etsy to implement SRA on Etsy search pages and drastically improve the performance of product listing pages with some metrics seeing 20-24% improvements and some dropping all the way to 0ms values.</p>
<h2>Prefetching Options</h2>
<p>There are two main ways to predictively fetch resources for the next page:</p>
<ul>
<li><em><code>&lt;link rel="prefetch"&gt;</code></em> allows us to instruct the browser to download a resource we believe we'll need soon. The resource can be anything, like a static asset or an HTML page.  </li>
<li><em>Speculation Rules API</em> (SRA) is a <a href="https://developer.chrome.com/blog/speculation-rules-improvements">recently updated</a> browser API which allows for a JSON definition to dictate what page A wants to do with page B. The "do"-ing can be either <strong>prefetching</strong> (just download the HTML) or <strong>prerendering</strong> (load the page, including its static assets, and <em>render</em> it completely). The prerendering happens in a new browser process and page B is ready to be swapped with the current page A instantaneously when the user navigates to B.</li>
</ul>
<p>While implementing full prerendering is likely to yield more impressive performance improvements, it is a bigger and riskier investment, mostly related to the side effects of executing the JavaScript on the target page B. Starting with prefetching is a good first step into exploring SRA.</p>
<p>The benefits of using SRA over link prefetch will become evident further in the article, but the topline highlights include: a simpler API to define <em>what</em> is prefetched (via a CSS selector rather than do-it-yourself bespoke JavaScript), convenient utilities to define <em>when</em>, as well as <em>where</em> (memory and HTTP cache) prefetching happens, and a built-in upgrade path to full-on prerendering.</p>
<h2>Implementing Speculation Rules</h2>
<p>The Search team at Etsy recently ran an A/B experiment to use the Speculation Rules API to prefetch the listing page when hovering over organic listing cards on the desktop search page on Chromium browsers. To do this, we added a new <code>&lt;script type="speculationrules"&gt;</code> tag to the search page with JSON that defines how we want the prefetching to work, like so:</p>
<pre><code class="language-html">&lt;script type="speculationrules"&gt;
{
  "prefetch": [{
    "where": {
        "and": [
          {
            "href_matches": "/{*/}?listing/*"
          },
          {
            "selector_matches": "[data-sr-prefetch='1']"
          }
        ]
    },
    "eagerness": "moderate"
  }]
}
&lt;/script&gt;</code></pre>
<p>This instructs the browser to download the HTML for a <em>listing</em> page when:</p>
<ul>
<li>the user hovers over a link to a listing page for 200 milliseconds (defined by the “moderate” <code>eagerness</code> property), and   </li>
<li>the link has a <code>data-sr-prefetch</code> attribute</li>
</ul>
<p>The <code>data</code> attribute allows us to more precisely opt in pages that are eligible for prefetching. </p>
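<p>For pages that assemble their rules dynamically, the same JSON can be generated in code. The following is a minimal sketch, not the implementation used on Etsy search pages; <code>buildPrefetchRules</code> and its parameter names are hypothetical.</p>
<pre><code class="language-typescript">// Sketch only: buildPrefetchRules and its parameters are hypothetical names.
type Eagerness = "immediate" | "eager" | "moderate" | "conservative";

function buildPrefetchRules(hrefPattern: string, selector: string, eagerness: Eagerness): string {
  const rules = {
    prefetch: [
      {
        // Both conditions must hold for a link to be eligible.
        where: {
          and: [{ href_matches: hrefPattern }, { selector_matches: selector }],
        },
        eagerness: eagerness,
      },
    ],
  };
  return JSON.stringify(rules, null, 2);
}

// Reproduces the rule set shown above.
const searchPageRules = buildPrefetchRules("/{*/}?listing/*", "[data-sr-prefetch='1']", "moderate");</code></pre>
<p>The resulting string can be rendered server-side inside the <code>speculationrules</code> script tag, or assigned to the <code>textContent</code> of a script element whose <code>type</code> is set to <code>speculationrules</code> before it is appended to the document.</p>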
<h2>Lessons learned</h2>
<p>In terms of changes to the page’s code, the implementation of SRA was straightforward. As such, we spent most of our time testing that everything was working as expected and that our systems and analytics were not inadvertently affected. And we found some surprises along the way, related to all the little details modern web pages use (such as cookies, redirects, new tabs).</p>
<p>Allow us to share a few lessons in prefetching...</p>
<h3>Two ways to prefetch</h3>
<p>As mentioned earlier, prefetching can happen one of two ways: <code>&lt;link rel="prefetch"&gt;</code> or speculative prefetch (the one using SRA). So what sets them apart?</p>
<p>They do work mostly the same, except that the speculative prefetch caches the page in both memory and the HTTP cache. The <code>&lt;link&gt;</code> only uses the HTTP cache and merely downloads the specified resource.</p>
<p>This makes the SRA way of prefetching more advantageous than the <code>&lt;link&gt;</code> prefetch because of the memory cache.</p>
<h3>Two speculative prefetches only</h3>
<p>We also discovered the number of prefetched pages that are kept in memory is restricted to two. When you prefetch a third page, the first one is evicted from the memory cache. The HTTP cache still works as usual. So again, the SRA prefetching is preferable to the <code>&lt;link&gt;</code> prefetch due to the difference in caching we just described.</p>
<p>It’s helpful while debugging to be aware of the eviction of the prefetched page from memory. But rest assured, the downloaded page is still cached locally.</p>
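<p>To make the eviction order concrete while debugging, here is a toy model of the behavior described above. It is only an illustration of the first-in, first-out rule we observed, not the browser's actual implementation; <code>PrefetchMemoryModel</code> is a hypothetical name.</p>
<pre><code class="language-typescript">// Toy model of the observed behavior, not the browser's real data structure.
class PrefetchMemoryModel {
  private pages: string[] = [];
  private capacity: number;

  constructor(capacity: number = 2) {
    this.capacity = capacity;
  }

  // Record a speculative prefetch; return the URL evicted from memory, if any.
  prefetch(url: string): string | undefined {
    if (this.pages.indexOf(url) !== -1) {
      return undefined; // already held in memory
    }
    let evicted: string | undefined = undefined;
    if (this.pages.length === this.capacity) {
      evicted = this.pages.shift(); // the oldest prefetch goes first
    }
    this.pages.push(url);
    return evicted;
  }

  inMemory(url: string): boolean {
    return this.pages.indexOf(url) !== -1;
  }
}

const memoryModel = new PrefetchMemoryModel();
memoryModel.prefetch("/listing/1");
memoryModel.prefetch("/listing/2");
// The third prefetch pushes "/listing/1" out of the memory cache.
const evictedUrl = memoryModel.prefetch("/listing/3");</code></pre>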
<h3>Eagerness</h3>
<p>While <code>&lt;link rel="prefetch"&gt;</code> advises the browser to load a resource as soon as it sees the <code>&lt;link&gt;</code> in the DOM, the speculative prefetch is more nuanced, offering <em>eager</em>, <em>immediate</em>, <em>conservative</em> and <em>moderate</em> loading. We selected <em>moderate</em> eagerness, which prefetches after the user has hovered over a link for 200ms.</p>
<p>Exploring our options we found that the <em>immediate</em> eagerness would trigger a significantly larger number of prefetches (since it executes immediately and prefetches <em>all</em> eligible pages), and we wanted to avoid creating new server requests for listing cards with a low likelihood of being clicked. However, the <em>immediate</em> eagerness setting could be worth considering if the cost of additional requests is very low. The <em>conservative</em> eagerness executes on pointer or touch down, providing a very small head start over normal browser behavior and therefore greatly reducing the potential benefits of prefetching. Conservative eagerness may only be suitable for a use case in which it is necessary to avoid unused prefetches altogether.</p>
<p>Note that <em>eager</em> and <em>immediate</em> were synonyms in the initial SRA implementation, but that is changing. Keep an eye on <a href="https://developer.chrome.com/docs/web-platform/prerender-pages#eagerness">the official docs</a> for updates.</p>
<h3>Speculations and new browser tabs</h3>
<p>Initially, SRA launched without the ability to prerender pages that open in new tabs, as Etsy listings do. This option was added later, but only for links whose <code>target</code> attribute is <code>_blank</code>, not for named targets such as the ones that Etsy uses, for example <code>&lt;a href="/listing/123" target="etsy.123456"&gt;</code>.</p>
<p>Fortunately, the target restriction doesn't apply to prefetching, so for SRA prefetching (unlike SRA prerendering) there's no problem, regardless of whether or how you specify a target at all. For developers who may be considering moving from prefetching to prerendering, this distinction is something to bear in mind. </p>
<h3>5-minute rule</h3>
<p>Because of the complex nature of listing pages, Etsy's HTML pages are non-cacheable. However, the speculative prefetch keeps the prefetched pages cached in memory for five minutes. This was a helpful learning, as there would be no point in using speculative prefetches at all if they expired immediately. After five minutes, the normal caching rules apply, set via HTTP headers such as <code>Cache-Control: max-age</code> or <code>Expires</code>.</p>
<p>Given that only two pages are currently kept in memory cache and all others expire because they are non-cacheable, the benefits are greatly reduced when, for example, a person hovers over 3 links and eventually clicks the first one which leads to a page that's already expired from the prefetch memory (and HTTP!) cache.</p>
<p>To aid with the two-page restriction, one strategy we devised is to make our pages cacheable for five minutes when we detect a prefetch request. Such requests are identifiable because the browser sends the <code>Sec-Purpose: prefetch</code> HTTP header when prefetching. This helps preserve downloaded pages that would’ve otherwise expired from both memory and HTTP cache.</p>
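<p>A server-side version of this strategy might look roughly like the sketch below. Only the <code>Sec-Purpose</code> header comes from the approach described above; the helper names and the exact <code>Cache-Control</code> values are assumptions chosen for illustration.</p>
<pre><code class="language-typescript">// Hypothetical helpers; the Cache-Control values are illustrative assumptions.
type RequestHeaders = { [name: string]: string | undefined };

// True when the browser marked the request as speculative.
function isPrefetchRequest(headers: RequestHeaders): boolean {
  const purpose = headers["sec-purpose"] || "";
  // The header value starts with the "prefetch" token.
  return purpose.trim().toLowerCase().indexOf("prefetch") === 0;
}

// Choose a Cache-Control value for a listing page response.
function cacheControlFor(headers: RequestHeaders): string {
  if (isPrefetchRequest(headers)) {
    // 300 seconds mirrors the five-minute window discussed above.
    return "private, max-age=300";
  }
  // Ordinary navigations keep the page non-cacheable.
  return "no-store";
}</code></pre>
<p>The sketch assumes header names have already been lower-cased, as many server frameworks do before handing them to application code.</p>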
<h3>Video links and shadow DOM</h3>
<p>Often, listings on Etsy include product videos, which start to play on the search results page when a user hovers their mouse over them. In these cases, prefetching doesn't work: the mouse hover is effectively "swallowed," disappearing into the <a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_shadow_DOM">shadow DOM</a> of the browser's video player. One workaround is to overlay a <code>div</code> on top of the video for 200ms to let the hover register in the DOM. Then, after the 200ms has elapsed, remove the extra <code>div</code> to let the browser video controls (e.g., on right click) work as usual. You can find a demonstration of this technique <a href="https://www.phpied.com/files/sra/imagehover3.html">here</a>.</p>
<h3>Cookies</h3>
<p>If a page sets cookies, prefetching it will set those cookies as well (as demonstrated <a href="https://www.phpied.com/files/sra/test.html">here</a>). This is something to be aware of, as the prefetch may end up being unused. This may confuse your application (and/or analytics) into thinking a page has been visited when in reality it was not.</p>
<p>Again, you can use the <code>Sec-Purpose: prefetch</code> HTTP header to detect prefetch requests and avoid setting the cookie as part of the prefetching process.</p>
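<p>One way to express that check is to filter the cookies a response is about to set. This is a sketch under assumptions: the cookie names and the <code>cookiesToSet</code> helper are hypothetical, and a real application would decide for itself which cookies imply that a page was viewed.</p>
<pre><code class="language-typescript">// Hypothetical cookie names; only the Sec-Purpose check comes from the article.
type IncomingHeaders = { [name: string]: string | undefined };

const VISIT_COOKIES = ["last_viewed_listing", "visit_marker"];

// Return the subset of cookie names that is safe to set for this request.
function cookiesToSet(headers: IncomingHeaders, cookieNames: string[]): string[] {
  const purpose = (headers["sec-purpose"] || "").trim().toLowerCase();
  if (purpose.indexOf("prefetch") !== 0) {
    return cookieNames; // ordinary navigation: set everything
  }
  // Speculative request: hold back cookies that imply the page was viewed.
  return cookieNames.filter(function (cookieName) {
    return VISIT_COOKIES.indexOf(cookieName) === -1;
  });
}</code></pre>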
<h3>Redirects</h3>
<p>If the link to the page being prefetched goes through a redirect, the actual page after the redirect is still being prefetched. Let’s say you have a sequence that looks like this:</p>
<p>Link on Page A -&gt; redirect -&gt; Page B </p>
<p>Here the browser follows the redirect during prefetching and still caches Page B. When the user then clicks the link on page A leading to Page B, the browser follows the usual process of going through the redirect. Normal HTTP cache rules still apply, meaning that if the redirect is cached, it won’t need to be requested again.</p>
<p>So, even though redirects are a <a href="https://developers.google.com/speed/docs/insights/AvoidRedirects">bad performance practice</a>, if you need to do them, they do not affect prefetching as long as you set appropriate caching headers.</p>
<h3>Mutating hrefs</h3>
<p>Sometimes the <code>href</code> attributes of <code>&lt;a&gt;</code> elements get modified by JavaScript on mouse hover. This does not play well with prefetching. Imagine you have:</p>
<pre><code class="language-html">&lt;a href="link.html"&gt;Follow me&lt;/a&gt;</code></pre>
<p>… which changes on hover to:</p>
<pre><code class="language-html">&lt;a href="link.html?source=footer"&gt;Follow me&lt;/a&gt;</code></pre>
<p>When the user hovers over the link, the browser starts working on prefetching <code>link.html</code> but realizes that the link to that page is no longer in the DOM and abandons the process. So the page is not prefetched even if <code>?source=footer</code> doesn’t change the target page in any way other than reporting analytics. The browser has no way of knowing this and considers the two as separate pages.</p>
<p>Additionally, the failed attempt at prefetching <code>link.html</code> counts toward the “two speculations only” rule and evicts the older speculative load from the memory cache. For best results, avoid modifying links on hover.</p>
<h3>Analytics and Event Logging</h3>
<p>This is the elephant in the room. Many sites on the web today were built in a world where prefetching did not exist. So there is one big assumption: that a page load is always initiated by the user and the load can be counted as such – either server-side during page construction or client-side by JavaScript after the page is loaded (or, as it often happens, a combination of the two). With prefetching, this assumption is no longer true. A page constructed on the server-side and downloaded by the browser does not necessarily mean the page has been seen (and therefore its JavaScript has been executed). This can result in a number of miscalculations when it comes to analytics.</p>
<p>Luckily, browser APIs such as the <code>Sec-Purpose</code> HTTP header and JavaScript APIs (<code>document.prerendering</code> and the <code>prerenderingchange</code> event) allow us to tell prerender requests from user-generated ones, and to detect when a prerendered page is "activated" (when the user actually sees it). For prefetches, the <code>deliveryType</code> property of Performance Resource Timing, which reports <code>navigational-prefetch</code>, can be used for the purposes of analytics.</p>
<p>We (and our analytics partners) found this to be the hardest part: ironing out the required analytics updates so that numbers remain true after implementing speculation rules. In our particular use case, we intentionally pursued a strategy of prefetching the destination page instead of prerendering it, meaning that no assets would be loaded and JavaScript would not execute on our prefetches. This gave us a relatively simple way to handle the accuracy of our analytics.</p>
<p>A foundational piece of our analytics is event logging. For example, in the controller of the listing page we log a <code>view_listing</code> event that contains key information such as the listing ID, user ID, etc. This informs not only our site analytics, but also our search training pipeline, recently viewed listing data for users, and more. We ended up creating a system to cache the payload of all events within a request to avoid firing those events during prefetches. We were then able to move that event logging to the destination page’s JavaScript bundle, deferring them until after page “activation” and mitigating the impact of prefetching on our analytics.</p>
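<p>To make the idea concrete, here is a small Python sketch of buffering events per request. It is illustrative only: the <code>EventBuffer</code> class, its method names, and the JSON hand-off are assumptions for the example rather than a description of our actual system. Events raised during a prefetch are held back and serialized so that client-side code could fire them after activation, while events on normal requests are recorded immediately.</p>
<pre><code class="language-python">import json


class EventBuffer:
    """Hypothetical per-request buffer that decides when events get logged."""

    def __init__(self, is_prefetch):
        self.is_prefetch = is_prefetch
        self.logged = []    # events recorded server-side right away
        self.deferred = []  # events handed to the page to fire on activation

    def log(self, name, payload):
        event = {"name": name, "payload": payload}
        if self.is_prefetch:
            # The user may never see this page, so do not count the event yet.
            self.deferred.append(event)
        else:
            self.logged.append(event)

    def deferred_json(self):
        """Serialize deferred events so they can be embedded in the response."""
        return json.dumps(self.deferred)</code></pre>
<p>With this shape, a prefetched listing page would carry its <code>view_listing</code> payload in the response instead of logging it, and the event would only be counted if the page is actually shown.</p>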
<h2>Results</h2>
<p>We were thrilled with the performance results of the prefetching experiment. We saw a 20-24% improvement in many performance metrics we care about: <a href="https://web.dev/articles/ttfb">TTFB</a>, <a href="https://developer.mozilla.org/en-US/docs/Web/API/Document/DOMContentLoaded_event">DOMContentLoaded</a>, <a href="https://web.dev/articles/fcp">FCP</a>, <a href="https://web.dev/articles/lcp">LCP</a>. </p>
<ul>
<li>The 75th percentile time to first byte (TTFB) on the listing page improved by 23.6%   </li>
<li>We saw similar improvements throughout the request: First Contentful Paint -20.7%, Largest Contentful Paint -21.1%, DOMContentLoaded -20.4%, and Page Load -10.6%</li>
</ul>
<p>In the cumulative distribution function below, we see the control of our experiment (no speculation rules prefetches) in blue, and the treatment (speculation rules prefetches) in orange, with the treatment dramatically faster than the control at every percentile. Remarkably, about 40% of eligible browsers saw their TTFB drop nearly to zero:</p>
<figure>
<img alt="Time to first byte in the control and the experiment groups" src="https://i.etsystatic.com/inv/f69a53/7378031331/inv_fullxfull.7378031331_hfrpx3ex.jpg?version=0" title="Time to first byte in the control and the experiment groups" />
</figure>
<p>We saw small but detectable improvements in some business metrics, which is promising given that listing page views come from many sources, only some of which are search results. As we implement more prefetching in more places, we hypothesize that the numbers will further improve.</p>
<p>People approaching an SRA implementation may worry about unused prefetches and wasted resources. In our experiment we saw a ratio of about 14:1 for the number of prefetches requested to subsequently activated pages (i.e., about 1 in 14 prefetch requests was navigated to by the user).</p>
<p>We’re encouraged by these results, and are looking forward to new opportunities to improve performance across additional surfaces.</p>
<h2>Opportunities to iterate and expand</h2>
<p>One clear opportunity is to try implementing prefetching on other pages beyond Search. Shoppers end up on product listing pages from various other referral surfaces: shop pages, our SEO-optimized landing pages, home page, etc. Prefetching could improve performance on these surfaces, leading to a better experience for Etsy buyers. </p>
<p>Another opportunity is to consider upgrading our <em>prefetching</em> to <em>prerendering</em> in the future. This would be a significant change to client-side JavaScript code operating during prefetches. However (and it's hard to contain the excitement about this!) Chrome is working on a <a href="https://issues.chromium.org/issues/428500219"><code>prerender-until-script</code></a> update, in which prerendering stops at the first <code>&lt;script&gt;</code>. Even if you have a <code>&lt;script&gt;</code> high up in the <code>&lt;head&gt;</code> of your page and prerendering halts early, the browser will still download page resources (scripts, styles, images, fonts) and have them ready. </p>
<p>For our use case, enabling <code>prerender-until-script</code> would mean that frontend performance metrics downstream of TTFB, such as First/Largest Contentful Paint, would likely see even larger improvements, and users would be able to interact with the listing page even earlier. This would further reduce friction for users when browsing on Etsy, letting them spend less time watching web pages load and more time engaging directly with our sellers’ amazing inventory of items. </p>
<h2>Acknowledgements</h2>
<p>Implementing SRA was truly a cross-team effort, not only by the search front-end and web performance teams but also people from infrastructure, analytics, ranking, and recommendations. Special thanks to Paul Calvano from our Web Performance team, Diana Sanchez Urban from Search Experience Web, and Eileen Toomer from the Visits team! This project also benefited from input from members of Recommendations and Listing Page teams, as well as members of our internal Architecture Advisory Group.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Understanding Etsy’s Vast Inventory with LLMs]]></title>
                <link>https://www.etsy.com/codeascraft/understanding-etsyas-vast-inventory-with-llms?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Vipul Setty]]></dc:creator>
                <pubDate>Mon, 13 Oct 2025 13:58:39 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/understanding-etsyas-vast-inventory-with-llms?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>For more than 20 years, Etsy has been the destination for human creativity online. Our marketplace is home to more than 100 million special items made, handpicked and designed by more than 5 million sellers. These items and the real people behind them are what set us apart. But while the huge variety of Etsy’s inventory is one of our greatest strengths, it also creates fundamental engineering challenges specific to our marketplace. </p>
<h2>The challenge: Etsy’s unique inventory</h2>
<p>With millions of creative items across thousands of categories – many of which are unique – it’s difficult to accurately capture all possible product attributes, which range from standard attributes like “color” and “material”, to niche attributes like “bead hole size” and “slime additives.” The range of possible attributes and their values is so broad that it’s a challenge even to enumerate them, let alone label listings with specific attribute data. Unlike other online retailers (that may also have enormous inventories), because products on Etsy are listed by third party sellers and often handmade or customized, we do not have global SKUs (stock keeping units), or mappings from SKUs to product attributes. </p>
<p>The listing below is an example of a unique item on Etsy, which has no SKU number or easy access to product attribute information. At first glance, the item looks like a t-shirt, but it is actually a porcelain sculpture. For niche items like this, seller-provided details become especially critical.
<img alt="" src="https://i.etsystatic.com/inv/0b035a/7277596792/inv_fullxfull.7277596792_joyacfg4.jpg?version=0" /></p>
<p>We collect both structured and unstructured data from sellers, and they serve different roles in our marketplace.</p>
<ul>
<li>Unstructured data comes in the form of free-text descriptions, creative titles, and listing photos. While this content is full of useful product information, it’s harder for machines to interpret consistently and quickly at scale. </li>
<li>Structured data - in the form of product attributes like size and color - is easy for our systems to parse. It powers the buyer experience through tools such as search filtering options (offered through selectors in UI) and product-to-product comparison for characteristics of interest (material, price, etc).
<p align="center">
<img src="https://i.etsystatic.com/inv/bef7ea/7277605786/inv_fullxfull.7277605786_nhzz8slo.jpg?version=0" alt="Filters can be seen on the left side after a search query" />
<br />
<em>Filters can be seen on the left side after a search query</em>
</p></li>
</ul>
<p>While Etsy does ask sellers to provide structured data on their listings’ attributes, most fields are not required. This reduces friction in the listing process and gives sellers the flexibility to represent their often unique items accurately.</p>
<p align="center">
  <img src="https://i.etsystatic.com/inv/c449f0/7277644162/inv_fullxfull.7277644162_3604c274.jpg?version=0" alt="Sellers can fill in these attributes, or leave them blank and continue through the listing process" />
  <br />
  <em>Sellers can fill in these attributes, or leave them blank
and continue through the listing process</em>
</p>
<p>As a result, most sellers provide mostly or only <strong>unstructured data</strong> in the form of listing titles, descriptions, and photos. Frequently, key information like product dimensions is buried in the listing description or only available in listing photos.</p>
<table align="center">
  <tr>
    <td align="center" style="border:none;">
      <img src="https://i.etsystatic.com/inv/fc97dd/7325596475/inv_fullxfull.7325596475_8i7lc9xj.jpg?version=0" alt="Example listing with dimensions in description" width="300" /><br />
      <em>Example listing with dimensions in description</em>
    </td>
    <td align="center" style="border:none;">
      <img src="https://i.etsystatic.com/inv/72d894/7277649576/inv_fullxfull.7277649576_lf4urdtj.jpg?version=0" alt="Example of dimensions in a photo" width="300" /><br />
      <em> <br />Example of dimensions in a photo</em>
    </td>
  </tr>
</table>
<p>While our powerful search and discovery algorithms can process unstructured data such as that in descriptions and listing photos, passing in long context and images directly to search poses latency concerns. For these algorithms, every millisecond counts as they work to deliver relevant results to buyers as quickly as possible. Spending time filtering through unstructured data for every query is just not feasible. </p>
<p>These constraints led us to a clear conclusion: to fully unlock the potential of all inventory listed on Etsy’s site, <strong>unstructured product information needs to be distilled into structured data to power both ML models and buyer experiences.</strong></p>
<h2>LLMs present a new opportunity</h2>
<p>Before the availability of scalable LLMs, we explored various ML-based solutions to this challenge. Supervised product attribute extraction models had limited efficacy; even if we could enumerate all possible product attributes and values, many of them would be so sparse that traditional classification models would struggle to capture the long tail. Sequence tagging approaches also had difficulty scaling to multiple attributes. Transformer-based question-answering models (e.g. <a href="https://dl.acm.org/doi/10.1145/3394486.3403047">AVEQA</a>, <a href="https://dl.acm.org/doi/10.1145/3394486.3403047">MAVEQA</a>) allowed for generalization to unseen attribute values, but still required large amounts of application-specific training data. </p>
<p>This is where the availability of foundational LLMs presented a transformational opportunity for Etsy. These models have a vast amount of general knowledge from pre-training, can process large context windows quickly and affordably, and can follow instructions given a small number of examples.
<img alt="" src="https://i.etsystatic.com/inv/9ea749/7277719848/inv_fullxfull.7277719848_15kkvq4f.jpg?version=0" /></p>
<p>With a feasible and performant solution, our next focus was to build a scalable pipeline that could extract attributes across millions of listings while maintaining confidence in the LLM output. This required robust evaluation frameworks/processes that measured quality through various metrics.</p>
<h2>Transforming &amp; evaluating unstructured data at scale</h2>
<h3>Evaluation</h3>
<p>When working with LLMs, one of the biggest challenges is evaluating model performance. We needed to ensure that, at scale across our 100M+ listings, the LLMs were consistently and reliably producing accurate, actionable results. To do this, we initially worked with a third-party labeling vendor to collect a large sample of human-annotated data containing attribute annotations for listings across multiple categories. We evaluated performance by comparing LLM inferences to this human-annotated dataset and calculated metrics like precision, recall, and Jaccard index. We used these ground truth metrics as a benchmark for model improvements via prompt and context engineering.</p>
<p>Unfortunately, there were several significant drawbacks to relying on human-labeled data. In many cases, we found that human annotators made mistakes, especially when annotating thousands of listings (after all, no one’s perfect). In the example below, a human annotator marked the light fixture as ½ inch width, while the LLM correctly extracted 5.5 inches.
<img alt="" src="https://i.etsystatic.com/inv/840b67/7277724542/inv_fullxfull.7277724542_3ut2b4kf.jpg?version=0" /></p>
<p>Furthermore, human labeling is more time-consuming and expensive. To start scaling attribute inference across thousands of categories, we needed to come up with an automated process for labeling that did not rely <em>exclusively</em> on human annotation. </p>
<p>Instead, we’ve started using high-performance, state-of-the-art LLMs to generate ground truth labels (often called “silver labels”). Human-in-the-loop is still an essential part of this process: Etsy domain experts review silver labels and iterate on the prompt to ensure high quality results. Once we’re confident in our silver label generation, we produce a larger dataset for evaluating a more scalable LLM. The diagram below shows the updated process for model development.
<img alt="" src="https://i.etsystatic.com/inv/b61baa/7277725818/inv_fullxfull.7277725818_adnczvla.jpg?version=0" /></p>
<h3>Inference</h3>
<p>The core of our LLM pipeline is context engineering. We’ve worked with partners in product, merchandising, and taxonomy to ensure that the LLM has the right context for attribute extraction, including:</p>
<ul>
<li>Seller-provided listing data, including listing titles, descriptions, and images</li>
<li>Few-shot examples hand-selected by domain experts</li>
<li>Business logic from Etsy’s product taxonomy</li>
<li>Category-specific extraction rules </li>
</ul>
<p>Each listing is represented as a JSON string of context information. This context is injected into a series of prompts to extract product attributes in parallel. LLM requests are routed through LiteLLM to different regions, ensuring higher parallelization and removing the dependency on a single cloud location. Finally, LLM responses are parsed into Pydantic dataclasses, which provide both basic type validation and custom validation based on business logic. 
<img alt="" src="https://i.etsystatic.com/inv/4fb8e5/7330400471/inv_fullxfull.7330400471_b7t8otbo.jpg?version=0" /></p>
<p>After this process of inference completes, a post-processing job formats the validated, structured outputs. The data is then exported to filestores, database tables, and our search platform for consumption by partner teams.</p>
<h3>Monitoring</h3>
<p>Beyond the challenges of evaluating the LLM output, inference itself may fail for many reasons: code bugs, permissions issues, transient errors, quota exceeded errors, safety filters, and more. Rather than failing the pipeline for any individual error, errors are logged, and error metrics are surfaced via our observability platform. Our team is alerted if the number of failed inferences exceeds a certain threshold. To support debugging, we log a <a href="https://www.honeycomb.io/blog/understanding-logs-vs-traces">sample of traces</a> to Honeycomb.</p>
<p>Even if the error rate is low, it’s possible that model performance has degraded. To track changes in model performance, we added performance evaluation to our pipeline. First, we run LLM inference on a sample of a ground-truth dataset and calculate performance metrics like precision and recall. These metrics are compared to baseline scores from the full ground-truth dataset. If any metrics deviate significantly, the pipeline is terminated. This process allows us to confirm that third-party LLMs are working as expected before we run production-scale inference.</p>
<p>The combination of tracing, logging, metric tracking, model performance evaluation, and alerting provides a complete understanding of both pipeline health and model performance metrics, enabling us to consistently transform data to power key shopping experiences with confidence at scale.</p>
<h2>Looking Forward</h2>
<p>Where we’ve applied LLM-generated product attribute data to buyer and seller-facing experiences, we’ve seen promising results. In target categories, we’ve increased the number of listings with complete attribute coverage from 31% to 91%. And earlier this year, we added LLM-inferred attributes to search filters, leading to more engagement from buyers:</p>
<ul>
<li>Engagement with relevant Search filters increased</li>
<li>Overall post-click conversion rate increased</li>
</ul>
<p>Most recently, this work came together in a new feature: using LLM-inferred color attributes to display color swatches for each listing on the search results page. The swatches give our buyers additional at-a-glance information to help them find exactly what they want, faster.
<img alt="" src="https://i.etsystatic.com/inv/2dc796/7325687291/inv_fullxfull.7325687291_hnekw4s1.jpg?version=0" /></p>
<h3>What's Next</h3>
<p>Our goal is to unlock the full potential of Etsy’s inventory. Product attribute extraction is just one of many ways we’re using LLMs to achieve this in our efforts to improve the shopping and selling experience on Etsy. Transforming unstructured information is enabling us to make it easier than ever for our buyers to discover exactly what they’re looking for – and easier for sellers to list and get their unique creations discovered by the shoppers seeking their special item.</p>]]></description>
                <content:encoded><![CDATA[<p>For more than 20 years, Etsy has been the destination for human creativity online. Our marketplace is home to more than 100 million special items made, handpicked and designed by more than 5 million sellers. These items and the real people behind them are what set us apart. But while the huge variety of Etsy’s inventory is one of our greatest strengths, it also creates fundamental engineering challenges specific to our marketplace. </p>
<h2>The challenge: Etsy’s unique inventory</h2>
<p>With millions of creative items across thousands of categories – many of which are unique – it’s difficult to accurately capture all possible product attributes, which range from standard attributes like “color” and “material”, to niche attributes like “bead hole size” and “slime additives.” The range of possible attributes and their values is so broad that it’s a challenge even to enumerate them, let alone label listings with specific attribute data. Unlike other online retailers (that may also have enormous inventories), because products on Etsy are listed by third party sellers and often handmade or customized, we do not have global SKUs (stock keeping units), or mappings from SKUs to product attributes. </p>
<p>The listing below is an example of a unique item on Etsy, which has no SKU number or easy access to product attribute information. At first glance, the item looks like a t-shirt, but it is actually a porcelain sculpture. For niche items like this, seller-provided details become especially critical.
<img alt="" src="https://i.etsystatic.com/inv/0b035a/7277596792/inv_fullxfull.7277596792_joyacfg4.jpg?version=0" /></p>
<p>We collect both structured and unstructured data from sellers, and they serve different roles in our marketplace.</p>
<ul>
<li>Unstructured data comes in the form of free-text descriptions, creative titles, and listing photos. While this content is full of useful product information, it’s harder for machines to interpret consistently and quickly at scale. </li>
<li>Structured data - in the form of product attributes like size and color - is easy for our systems to parse. It powers the buyer experience through tools such as search filtering options (offered through selectors in UI) and product-to-product comparison for characteristics of interest (material, price, etc).
<p align="center">
<img src="https://i.etsystatic.com/inv/bef7ea/7277605786/inv_fullxfull.7277605786_nhzz8slo.jpg?version=0" alt="Filters can be seen on the left side after a search query" />
<br />
<em>Filters can be seen on the left side after a search query</em>
</p></li>
</ul>
<p>While Etsy does ask sellers to provide structured data on their listings’ attributes, most fields are not required. This reduces friction in the listing process and gives sellers the flexibility to represent their often unique items accurately.</p>
<p align="center">
  <img src="https://i.etsystatic.com/inv/c449f0/7277644162/inv_fullxfull.7277644162_3604c274.jpg?version=0" alt="Sellers can fill in these attributes, or leave them blank and continue through the listing process" />
  <br />
  <em>Sellers can fill in these attributes, or leave them blank
and continue through the listing process</em>
</p>
<p>As a result, most sellers provide mostly or only <strong>unstructured data</strong> in the form of listing titles, descriptions, and photos. Frequently, key information like product dimensions is buried in the listing description or only available in listing photos.</p>
<table align="center">
  <tr>
    <td align="center" style="border:none;">
      <img src="https://i.etsystatic.com/inv/fc97dd/7325596475/inv_fullxfull.7325596475_8i7lc9xj.jpg?version=0" alt="Example listing with dimensions in description" width="300" /><br />
      <em>Example listing with dimensions in description</em>
    </td>
    <td align="center" style="border:none;">
      <img src="https://i.etsystatic.com/inv/72d894/7277649576/inv_fullxfull.7277649576_lf4urdtj.jpg?version=0" alt="Example of dimensions in a photo" width="300" /><br />
      <em> <br />Example of dimensions in a photo</em>
    </td>
  </tr>
</table>
<p>While our powerful search and discovery algorithms can process unstructured data such as that in descriptions and listing photos, passing in long context and images directly to search poses latency concerns. For these algorithms, every millisecond counts as they work to deliver relevant results to buyers as quickly as possible. Spending time filtering through unstructured data for every query is just not feasible. </p>
<p>These constraints led us to a clear conclusion: to fully unlock the potential of all inventory listed on Etsy’s site, <strong>unstructured product information needs to be distilled into structured data to power both ML models and buyer experiences.</strong></p>
<h2>LLMs present a new opportunity</h2>
<p>Before the availability of scalable LLMs, we explored various ML-based solutions to this challenge. Supervised product attribute extraction models had limited efficacy; even if we could enumerate all possible product attributes and values, many of them would be so sparse that traditional classification models would struggle to capture the long tail. Sequence tagging approaches also had difficulty scaling to multiple attributes. Transformer-based question-answering models (e.g. <a href="https://dl.acm.org/doi/10.1145/3394486.3403047">AVEQA</a>, <a href="https://dl.acm.org/doi/10.1145/3394486.3403047">MAVEQA</a>) allowed for generalization to unseen attribute values, but still required large amounts of application-specific training data. </p>
<p>This is where the availability of foundational LLMs presented a transformational opportunity for Etsy. These models have a vast amount of general knowledge from pre-training, can process large context windows quickly and affordably, and can follow instructions given a small number of examples.
<img alt="" src="https://i.etsystatic.com/inv/9ea749/7277719848/inv_fullxfull.7277719848_15kkvq4f.jpg?version=0" /></p>
<p>With a feasible and performant solution, our next focus was to build a scalable pipeline that could extract attributes across millions of listings while maintaining confidence in the LLM output. This required robust evaluation frameworks/processes that measured quality through various metrics.</p>
<h2>Transforming &amp; evaluating unstructured data at scale</h2>
<h3>Evaluation</h3>
<p>When working with LLMs, one of the biggest challenges is evaluating model performance. We needed to ensure that, at scale across our 100M+ listings, the LLMs were consistently and reliably producing accurate, actionable results. To do this, we initially worked with a third-party labeling vendor to collect a large sample of human-annotated data containing attribute annotations for listings across multiple categories. We evaluated performance by comparing LLM inferences to this human-annotated dataset and calculated metrics like precision, recall, and Jaccard index. We used these ground truth metrics as a benchmark for model improvements via prompt and context engineering.</p>
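<p>For readers less familiar with these metrics, the sketch below shows one way to compute them for a single listing in Python. It assumes (for illustration only) that both the LLM output and the reference labels are represented as sets of <code>(attribute, value)</code> pairs; the function name and the sample values are hypothetical.</p>
<pre><code class="language-python">def attribute_metrics(predicted, expected):
    """Precision, recall, and Jaccard index for two sets of (attribute, value) pairs."""
    predicted, expected = set(predicted), set(expected)
    overlap = len(predicted.intersection(expected))
    union = len(predicted.union(expected))
    return {
        # Share of extracted pairs that match the reference labels.
        "precision": overlap / len(predicted) if predicted else 0.0,
        # Share of reference pairs that the model recovered.
        "recall": overlap / len(expected) if expected else 0.0,
        # Intersection over union; two empty sets count as a perfect match.
        "jaccard": overlap / union if union else 1.0,
    }


# Illustrative values only: the two label sets disagree on one attribute.
llm_output = {("color", "blue"), ("material", "porcelain"), ("width_in", "5.5")}
reference = {("color", "blue"), ("material", "porcelain"), ("width_in", "0.5")}
scores = attribute_metrics(llm_output, reference)</code></pre>
<p>Here two of the three pairs agree, so precision and recall are both 2/3, and the Jaccard index is 2/4 because the union contains four distinct pairs.</p>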
<p>Unfortunately, there were several significant drawbacks to relying on human-labeled data. In many cases, we found that human annotators made mistakes, especially when annotating thousands of listings (after all, no one’s perfect). In the example below, a human annotator marked the light fixture as ½ inch width, while the LLM correctly extracted 5.5 inches.
<img alt="" src="https://i.etsystatic.com/inv/840b67/7277724542/inv_fullxfull.7277724542_3ut2b4kf.jpg?version=0" /></p>
<p>Furthermore, human labeling is more time-consuming and expensive. To start scaling attribute inference across thousands of categories, we needed to come up with an automated process for labeling that did not rely <em>exclusively</em> on human annotation. </p>
<p>Instead, we’ve started using high-performance, state-of-the-art LLMs to generate ground truth labels (often called “silver labels”). Human-in-the-loop is still an essential part of this process: Etsy domain experts review silver labels and iterate on the prompt to ensure high quality results. Once we’re confident in our silver label generation, we produce a larger dataset for evaluating a more scalable LLM. The diagram below shows the updated process for model development.
<img alt="" src="https://i.etsystatic.com/inv/b61baa/7277725818/inv_fullxfull.7277725818_adnczvla.jpg?version=0" /></p>
<h3>Inference</h3>
<p>The core of our LLM pipeline is context engineering. We’ve worked with partners in product, merchandising, and taxonomy to ensure that the LLM has the right context for attribute extraction, including:</p>
<ul>
<li>Seller-provided listing data, including listing titles, descriptions, and images</li>
<li>Few-shot examples hand-selected by domain experts</li>
<li>Business logic from Etsy’s product taxonomy</li>
<li>Category-specific extraction rules </li>
</ul>
<p>Each listing is represented as a JSON string of context information. This context is injected into a series of prompts to extract product attributes in parallel. LLM requests are routed through LiteLLM to different regions, ensuring higher parallelization and removing the dependency on a single cloud location. Finally, LLM responses are parsed into Pydantic dataclasses, which provide both basic type validation and custom validation based on business logic. 
<img alt="" src="https://i.etsystatic.com/inv/4fb8e5/7330400471/inv_fullxfull.7330400471_b7t8otbo.jpg?version=0" /></p>
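<p>The validation step can be pictured with the following Python sketch. To keep it dependency-free, it uses a standard-library <code>dataclass</code> with hand-written checks as a stand-in for Pydantic, which would derive the type checks from the annotations. The <code>ColorAttribute</code> schema, the list of allowed colors, and the <code>parse_response</code> helper are all hypothetical.</p>
<pre><code class="language-python">import json
from dataclasses import dataclass

# Hypothetical allowed values; in practice these would come from the taxonomy.
ALLOWED_COLORS = {"red", "blue", "green", "white", "black"}


@dataclass
class ColorAttribute:
    listing_id: int
    color: str

    def __post_init__(self):
        # Basic type validation (bool is rejected even though it subclasses int).
        if not isinstance(self.listing_id, int) or isinstance(self.listing_id, bool):
            raise ValueError("listing_id must be an integer")
        if not isinstance(self.color, str):
            raise ValueError("color must be a string")
        # Custom validation standing in for business logic.
        self.color = self.color.strip().lower()
        if self.color not in ALLOWED_COLORS:
            raise ValueError(f"unsupported color: {self.color!r}")


def parse_response(raw):
    """Parse one LLM response; return None instead of raising on bad output."""
    try:
        return ColorAttribute(**json.loads(raw))
    except (ValueError, TypeError):
        # Covers malformed JSON, missing or unexpected fields, and failed checks.
        return None</code></pre>
<p>Returning <code>None</code> for invalid output, rather than raising, is one way to let a batch job count and skip bad responses without stopping the whole run.</p>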
<p>After this process of inference completes, a post-processing job formats the validated, structured outputs. The data is then exported to filestores, database tables, and our search platform for consumption by partner teams.</p>
<h3>Monitoring</h3>
<p>Beyond the challenges of evaluating the LLM output, inference itself may fail for many reasons: code bugs, permissions issues, transient errors, quota exceeded errors, safety filters, and more. Rather than failing the pipeline for any individual error, errors are logged, and error metrics are surfaced via our observability platform. Our team is alerted if the number of failed inferences exceeds a certain threshold. To support debugging, we log a <a href="https://www.honeycomb.io/blog/understanding-logs-vs-traces">sample of traces</a> to Honeycomb.</p>
<p>Even if the error rate is low, it’s possible that model performance has degraded. To track changes in model performance, we added performance evaluation to our pipeline. First, we run LLM inference on a sample of a ground-truth dataset and calculate performance metrics like precision and recall. These metrics are compared to baseline scores from the full ground-truth dataset. If any metrics deviate significantly, the pipeline is terminated. This process allows us to confirm that third-party LLMs are working as expected before we run production-scale inference.</p>
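<p>A gate of this kind might look like the Python sketch below. The function names, the metric dictionaries, and the <code>0.05</code> tolerance are assumptions chosen for illustration; the actual threshold for a “significant” deviation is not specified here.</p>
<pre><code class="language-python">import math


def deviating_metrics(sample, baseline, tolerance=0.05):
    """Names of metrics whose sample score differs from baseline by more than tolerance."""
    return sorted(
        name
        for name, expected in baseline.items()
        # With rel_tol=0.0, isclose is a pure absolute-difference check.
        if not math.isclose(sample[name], expected, rel_tol=0.0, abs_tol=tolerance)
    )


def check_model_health(sample, baseline, tolerance=0.05):
    """Raise before production-scale inference if any metric has drifted."""
    drifted = deviating_metrics(sample, baseline, tolerance)
    if drifted:
        raise RuntimeError("metrics deviate from baseline: " + ", ".join(drifted))</code></pre>
<p>Calling <code>check_model_health</code> as the first pipeline step means a drifted third-party model stops the run before any production-scale requests are sent.</p>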
<p>The combination of tracing, logging, metric tracking, model performance evaluation, and alerting provides a complete understanding of both pipeline health and model performance metrics, enabling us to consistently transform data to power key shopping experiences with confidence at scale.</p>
<h2>Looking Forward</h2>
<p>Where we’ve applied LLM-generated product attribute data to buyer and seller-facing experiences, we’ve seen promising results. In target categories, we’ve increased the number of listings with complete attribute coverage from 31% to 91%. And earlier this year, we added LLM-inferred attributes to search filters, leading to more engagement from buyers:</p>
<ul>
<li>Engagement with relevant Search filters increased</li>
<li>Overall post-click conversion rate increased</li>
</ul>
<p>Most recently, this work came together in a new feature: using LLM-inferred color attributes to display color swatches for each listing on the search results page. The swatches give our buyers additional at-a-glance information to help them find exactly what they want, faster.
<img alt="" src="https://i.etsystatic.com/inv/2dc796/7325687291/inv_fullxfull.7325687291_hnekw4s1.jpg?version=0" /></p>
<h3>What's Next</h3>
<p>Our goal is to unlock the full potential of Etsy’s inventory. Product attribute extraction is just one of many ways we’re using LLMs to achieve this in our efforts to improve the shopping and selling experience on Etsy. Transforming unstructured information is enabling us to make it easier than ever for our buyers to discover exactly what they’re looking for – and easier for sellers to list and get their unique creations discovered by the shoppers seeking their special item.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Unlocking Faster Insights with Experimenter-Defined Segmentations]]></title>
                <link>https://www.etsy.com/codeascraft/unlocking-faster-insights-with-experimenter-defined-segmentations?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Athena Chang]]></dc:creator>
                <pubDate>Wed, 8 Oct 2025 11:32:02 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/unlocking-faster-insights-with-experimenter-defined-segmentations?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>Imagine you have a fabulous idea to drive more sales on Etsy by giving out <em>free ice cream</em> with every purchase. How would you know if it will actually work?</p>
<p>One way to test this out is to run an <strong>experiment</strong>! An experiment is a learning tool that can help us see if a new change really affects user behavior and key metrics we care about. At Etsy, we run nearly 2000 experiments a year for testing new ideas to improve the user experience for our buyers and sellers.</p>
<p>For our ice cream idea, we could set up our experiment as an A/B test on the listing page, where one group of users (the “control” group) sees the usual “Buy it now” button, while the other (the "treatment" group) sees a shiny new version that says, “Buy it now and get a FREE ICE CREAM!”.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/538a43/7265363452/inv_fullxfull.7265363452_ihecj4qm.jpg?version=0" />
<figcaption>Comparison between the “Buy it now” button in the control and treatment groups in our example experiment. Disclaimer: Example for illustrative purposes only.</figcaption>
</figure>
<p>We would then track how many people click the “Buy” button in each group to see if the new button actually increases sales.</p>
<p>But what if you want to dig deeper? What if you want to know if certain groups of users react differently to the free ice cream offer?</p>
<p>For example:</p>
<ul>
<li>Do <em>ice cream lovers</em> click the button more often than those who don’t like ice cream? 🍦</li>
<li>What about <em>people who’ve had ice cream in the past week</em> versus those who haven’t? 😋</li>
<li>Or <em>people browsing high-priced items</em> versus low-priced ones? Does the price of the listing affect buyer behavior? For example, would a free ice cream entice someone to buy a $5,000 chair?</li>
</ul>
<p>This is where <strong>segmentations</strong> come into play! Segmentation allows you to slice and dice your experiment data to analyze the impact on different groups of users. It’s a powerful tool to uncover meaningful patterns that might remain hidden in the overall results. For instance, our ice cream experiment might show neutral engagement overall, but analyzing by segmentation could reveal that ice cream lovers are much more likely to buy with the new button — while non-lovers engage less, canceling each other out in the overall results.</p>
<p>Since experimentation is all about learning, using segmentations helps us maximize new insights from our experiments. However, for a very long time at Etsy, experimenters <em>didn’t</em> have the best tools to make the most of segmentations. In fact, there was no way for experimenters to define their own segmentations within our centralized experimentation platform.</p>
<p>In this post, we’ll take you through how we transformed segmentations by <strong>empowering experimenters to easily define their own segmentations within our experimentation platform</strong> — resulting in 3x the number of segmentations and unlocking faster, deeper insights for teams across Etsy.</p>
<h2>The challenge with using segmentations at Etsy</h2>
<p>Etsy’s in-house experimentation platform, <strong>Catapult</strong>, supports a fixed set of segmentations out of the box. Experimenters can view their results automatically broken down by 13 supported segmentations on their experiment results page, with results being refreshed daily.</p>
<p>However, these 13 segmentations were <em>hard-coded</em> in Catapult’s back-end, and there was no way for experimenters to analyze their results by any other segmentations within Catapult.</p>
<p><em>Okay, but can’t we just add new segmentations to Catapult?</em></p>
<p>Not so easy. Adding new segmentations required an engineer to make a code change to the data pipeline each time, which was a slow, painful process. This was simply not scalable when different teams wanted different segmentations. To make matters worse, all segmentations were processed together in a <em>single batch job</em>, which could only start once the upstream data for <em>all</em> segmentations was ready. This meant that if the upstream data for even just one segmentation was delayed, the entire batch job was held up, causing delays across the board. Adding more segmentations only made this problem worse, leading to longer processing times and higher risk of delays as each new segmentation was onboarded.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/07ac79/7313307073/inv_fullxfull.7313307073_qm55egu7.jpg?version=0" />
<figcaption>The Catapult data pipeline processed 13 hard-coded segmentations all within one batch job.</figcaption>
</figure>
<p>As a result, teams ended up relying on analysts to run ad-hoc SQL queries outside Catapult to explore unsupported segmentations. This process was slow, hard to reproduce, and difficult to share with product managers and stakeholders. Experimenters often felt frustrated – there was so much to learn, but no easy way to get the insights they needed to make critical business decisions quickly.</p>
<p>As the team at Etsy responsible for Catapult and internal experimentation support, we wanted to change this by <strong>empowering experimenters to define their own segmentations</strong> — and to reuse those segmentations across experiments. Our vision was to create a <strong>Segmentation Library</strong>: a collection of experimenter-defined segmentations that could be easily shared and automatically applied in the data pipeline, no code changes required!</p>
<h2>Empowering experimenters to define their own segmentations</h2>
<p>To make it possible for experimenters to define their own segmentations, we needed a way to represent segmentations as <strong>configuration objects</strong> that are both human-readable and machine-processable.</p>
<p>We drew inspiration from the ad-hoc SQL scripts our analysts had been writing to explore custom segmentations — because SQL is widely understood and expressive. Typically, those scripts all follow these steps:</p>
<ul>
<li>Query various source tables for segment data</li>
<li>Join the segment data with <em>experimental unit</em> data (i.e. users that were part of an experiment)</li>
<li>Define how to assign each experimental unit to a segment value</li>
</ul>
<p>We decided to model each segmentation as:</p>
<ul>
<li>A <strong>SQL query</strong> defining the segment data source, and how to assign segment values to each experimental unit</li>
<li>A set of <strong>join columns</strong> to connect segment data to experimental units using some shared IDs (e.g. user IDs)</li>
</ul>
<p>Here is an example of a segmentation definition from our ice cream experiment:</p>
<p><strong>Segmentation name</strong>: <code>"ice_cream_lovers"</code></p>
<p><strong>SQL</strong>:</p>
<pre><code>SELECT
  user_id,
  CASE
    WHEN rating &gt;= 9 THEN 'Ice cream lover'
    WHEN rating &lt;= 2 THEN 'Ice cream hater'
    ELSE 'Indifferent'
  END AS segment_value
FROM `ice_cream_user_ratings`</code></pre>
<p><strong>Join columns</strong>: <code>["user_id"]</code></p>
<p>The nice thing about defining segmentations this way is that, unlike the old hard-coded segmentations that were buried deep in the data pipeline, these SQL definitions are easier for analysts, product managers, and engineers to define and understand. This makes it easier for teams to adopt and reuse existing segmentations that were previously defined by other teams for their own experiments. Additionally, the existing segmentations serve as inspiration for new ones! Anyone looking to create a new segmentation can view the existing ones in the Segmentation Library for inspiration before defining their own in the Segmentation Creation tool.</p>
<p>The SQL queries and join keys can also be easily plugged into templated scripts in Catapult’s data pipeline to compute segmented experiment results. It’s a win-win for both experimenters and the data pipeline!</p>
<h2>Validating segmentation definitions</h2>
<p>With experimenter-defined SQL queries comes a big question: <em>How do we make sure they actually work?</em></p>
<p>As the owners of Catapult’s data pipeline, we want to make sure these experimenter-defined queries won’t cause any damage — like accidentally deleting important tables or causing failures in our pipeline. On the flip side, experimenters creating these segmentations want to know that their SQL will run smoothly in our pipeline, and that the results match what they expected.</p>
<p>To tackle both concerns, every new segmentation goes through a rigorous validation process before it can be deployed to production:</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/df27a6/7313144195/inv_fullxfull.7313144195_ejcyvtnf.jpg?version=0" />
<figcaption>The segmentation validation and deployment workflows are triggered to push new segmentations to production.</figcaption>
</figure>
<p>First, when an experimenter defines a new segmentation in the Segmentation Creation tool and clicks “Validate”, it triggers a <strong>validation workflow</strong> through the workflow orchestration API. This workflow receives the experimenter's segmentation SQL query and metadata such as join keys. Then, each task in the validation process checks different parts of the segmentation query to make sure the SQL follows our rules — like having the correct output schema and using only approved join keys. If anything fails, an error message is sent back to the experimenter through the Segmentation Creation tool. This workflow-orchestration–based validation ensures that segmentations are accurate, safe, and ready to run before deployment. Once the SQL validation finishes successfully, we generate a sample result in <a href="https://cloud.google.com/bigquery">BigQuery</a> for the experimenter to review.</p>
<p>Once the experimenter has reviewed the sample results and is happy with them, they can click “Deploy”, which triggers an automated <strong>deployment workflow</strong> to push the new segmentation to production. This workflow adds the validated SQL to the segmentation definition repository and automatically creates a pull request. After a final validation of the segmentation SQL query, the pull request is automatically merged, and the new segmentation is immediately deployed to production. The system also manages all dependencies, ensuring that all upstream data is ready before running the segmentation in production.</p>
<p>Once a segmentation is in production, it is ready to be used in experiments!</p>
<h2>Building a more flexible pipeline with dynamic task generation</h2>
<p>After new segmentations were defined, the next challenge was to ensure that our data pipeline could automatically pick them up and process them — <em>without needing any code changes</em>.</p>
<p>The problem was that Catapult’s data pipeline was set up as a workflow with a fixed set of tasks for a predefined set of segmentations. This meant adding a new segmentation required updating the pipeline code to add new tasks.</p>
<p>To solve this, we leveraged <strong>dynamic task generation</strong> in our workflow to create tasks <em>at runtime</em> based on dynamic inputs — like the list of active segmentations — without knowing in advance how many tasks are needed.</p>
<p>This was a perfect fit for us, since it enabled us to process a changing set of segmentations each day as independent, parallel work streams, all running the same underlying logic. This approach also helped us scale up the number of supported segmentations without duplicating code.</p>
<p>Here is how we utilized dynamic task generation in our data pipeline:</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/ffeecb/7265361036/inv_fullxfull.7265361036_s6sx41ze.jpg?version=0" />
</figure>
<ol>
<li>First, the workflow makes an API call to get the current list of active segmentations from ongoing experiments. </li>
<li>Then, using dynamic task generation, it spins up a set of identical task groups, one for each segmentation, all running independently and in parallel.</li>
<li>Within each task group:
<ol>
<li>It first uses sensors to wait for the segmentation’s upstream data to be ready.</li>
<li>Once the upstream data is ready, it kicks off subsequent SQL scripts that use the segmentation’s SQL query and join columns to connect segment data to experimental units, and aggregate them for each experiment. </li>
<li>Finally, it outputs the segmented experiment results to the Experiment Results table that is then used to display the results in the Catapult UI.</li>
</ol></li>
</ol>
<p>Here is a side-by-side comparison of our data pipeline before and after using dynamic task generation:</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/3f4cd3/7313270611/inv_fullxfull.7313270611_emjtnrct.jpg?version=0" />
</figure>
<p>Overall, this new design offers several benefits:</p>
<ol>
<li><strong>Automated processing</strong>: New segmentations are automatically detected and processed in the data pipeline. No code changes required!</li>
<li><strong>Improved failure isolation</strong>: Each segmentation runs independently, so if one fails, it doesn’t affect the others, making the pipeline more reliable.</li>
<li><strong>Simplified downstream dependencies</strong>: Each segmentation has a unique name, allowing downstream jobs to wait for the specific tasks they need. This means they can start as soon as their required segmentations are ready, without having to wait around for the entire batch to finish.</li>
</ol>
<h2>The Impact: 3X more segmentations, faster experiment insights</h2>
<p>Within the first three months of launching the new Segmentation Creation tool, experimenters created <em>34 new segmentations</em>. The number of segmentations our experimentation system supports <strong>more than tripled</strong>, from 13 hard-coded segmentations to a total of 47. We knew our experimenters would love the ability to define their own segmentations, but we didn’t expect them to love it <em>this</em> much!</p>
<p>Since the launch, teams across Etsy are gaining insights faster than before. In fact, 78% of experimenters say they are <strong>saving 2-5 hours a month</strong> by analyzing their experiments with custom segmentations in Catapult. Having a Segmentation Library as a single source of truth means that teams can easily reuse existing segmentations instead of reinventing the wheel every time, making it way easier to share insights across different business areas. Plus, with faster access to experiment insights, teams are able to make decisions and iterate on their products more quickly than ever!</p>
<h2>Conclusion: Building a better Etsy through flexible experimentation</h2>
<p>In summary, by enabling experimenters to define their own segmentations, and by building data pipelines smart enough to adapt to those experimenter-defined ideas, we’ve given teams across Etsy the power to shape their own experiment analysis and get actionable insights quickly.</p>
<p>Why does this matter? Because every bit of flexibility we add to experimentation leads to a better experience for Etsy buyers and sellers. <strong>When teams can iterate quickly and confidently, they build products that genuinely serve our community — helping Etsy thrive as a platform where creativity and commerce meet.</strong></p>
<p>Some key lessons we learned along the way:</p>
<ul>
<li>Moving from hard-coded logic to a <strong>configuration-driven pipeline</strong> has truly opened up tons of new possibilities and expanded what our experimentation platform can do. We’re already using this approach to make other parts of experimentation more flexible, using Segmentations as an example to follow.</li>
<li>Having a group of experimenters as early <strong>“pilot testers”</strong> was a key part of our success. They worked closely with us from the start, testing out early versions of the tool and giving us honest feedback, which helped us build the Segmentation Creation tool to be as user-friendly as possible. Our pilot testers have also become champions who helped spread the word about this feature to other teams, and the segmentations they created have served as examples for others to follow. Thank you to our pilot testers!</li>
<li>Although Segmentations has been adopted by many teams across Etsy since its release, we know there are still teams out there who haven’t discovered this tool yet! We are committed to finding new ways to <strong>share the knowledge</strong> and help teams discover how this tool can unlock the insights they need. </li>
</ul>
<p>We hope that our learnings will inspire you to turn experiment analysis from a fixed process into a flexible, self-service platform that can keep up with whatever questions your team asks next!</p>]]></description>
                <content:encoded><![CDATA[<p>Imagine you have a fabulous idea to drive more sales on Etsy by giving out <em>free ice cream</em> with every purchase. How would you know if it will actually work?</p>
<p>One way to test this out is to run an <strong>experiment</strong>! An experiment is a learning tool that can help us see if a new change really affects user behavior and key metrics we care about. At Etsy, we run nearly 2000 experiments a year for testing new ideas to improve the user experience for our buyers and sellers.</p>
<p>For our ice cream idea, we could set up our experiment as an A/B test on the listing page, where one group of users (the “control” group) sees the usual “Buy it now” button, while the other (the "treatment" group) sees a shiny new version that says, “Buy it now and get a FREE ICE CREAM!”.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/538a43/7265363452/inv_fullxfull.7265363452_ihecj4qm.jpg?version=0" />
<figcaption>Comparison between the “Buy it now” button in the control and treatment groups in our example experiment. Disclaimer: Example for illustrative purposes only.</figcaption>
</figure>
<p>We would then track how many people click the “Buy” button in each group to see if the new button actually increases sales.</p>
<p>But what if you want to dig deeper? What if you want to know if certain groups of users react differently to the free ice cream offer?</p>
<p>For example:</p>
<ul>
<li>Do <em>ice cream lovers</em> click the button more often than those who don’t like ice cream? 🍦</li>
<li>What about <em>people who’ve had ice cream in the past week</em> versus those who haven’t? 😋</li>
<li>Or <em>people browsing high-priced items</em> versus low-priced ones? Does the price of the listing affect buyer behavior? For example, would a free ice cream entice someone to buy a $5,000 chair?</li>
</ul>
<p>This is where <strong>segmentations</strong> come into play! Segmentation allows you to slice and dice your experiment data to analyze the impact on different groups of users. It’s a powerful tool to uncover meaningful patterns that might remain hidden in the overall results. For instance, our ice cream experiment might show neutral engagement overall, but analyzing by segmentation could reveal that ice cream lovers are much more likely to buy with the new button — while non-lovers engage less, canceling each other out in the overall results.</p>
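<p>To make the “canceling out” effect concrete, here is a small Python sketch with entirely made-up numbers. The user and purchase counts and the “Not a lover” label are hypothetical and exist only to show how a flat overall result can hide opposite movements inside segments.</p>

```python
from collections import defaultdict

# Hypothetical rows: (segment_value, variant, users, purchases)
ROWS = [
    ("Ice cream lover", "control", 1000, 100),
    ("Ice cream lover", "treatment", 1000, 150),
    ("Not a lover", "control", 1000, 100),
    ("Not a lover", "treatment", 1000, 50),
]


def conversion_by(rows, key):
    """Purchase rate per group, where `key` picks the grouping for each row."""
    users = defaultdict(int)
    purchases = defaultdict(int)
    for segment, variant, n_users, n_purchases in rows:
        group = key(segment, variant)
        users[group] += n_users
        purchases[group] += n_purchases
    return {group: purchases[group] / users[group] for group in users}


# Overall view: both variants convert at 10%, so the experiment looks neutral.
overall = conversion_by(ROWS, lambda segment, variant: variant)

# Segmented view: lovers go from 10% to 15%, everyone else from 10% to 5%.
segmented = conversion_by(ROWS, lambda segment, variant: (segment, variant))
```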
<p>Since experimentation is all about learning, using segmentations helps us maximize new insights from our experiments. However, for a very long time at Etsy, experimenters <em>didn’t</em> have the best tools to make the most of segmentations. In fact, there was no way for experimenters to define their own segmentations within our centralized experimentation platform.</p>
<p>In this post, we’ll take you through how we transformed segmentations by <strong>empowering experimenters to easily define their own segmentations within our experimentation platform</strong> — resulting in 3x the number of segmentations and unlocking faster, deeper insights for teams across Etsy.</p>
<h2>The challenge with using segmentations at Etsy</h2>
<p>Etsy’s in-house experimentation platform, <strong>Catapult</strong>, supports a fixed set of segmentations out of the box. Experimenters can view their results automatically broken down by 13 supported segmentations on their experiment results page, with results being refreshed daily.</p>
<p>However, these 13 segmentations were <em>hard-coded</em> in Catapult’s back-end, and there was no way for experimenters to analyze their results by any other segmentations within Catapult.</p>
<p><em>Okay, but can’t we just add new segmentations to Catapult?</em></p>
<p>Not so easy. Adding new segmentations required an engineer to make a code change to the data pipeline each time, which was a slow, painful process. This was simply not scalable when different teams wanted different segmentations. To make matters worse, all segmentations were processed together in a <em>single batch job</em>, which could only start once the upstream data for <em>all</em> segmentations was ready. This meant that if the upstream data for even just one segmentation was delayed, the entire batch job was held up, causing delays across the board. Adding more segmentations only made this problem worse, leading to longer processing times and higher risk of delays as each new segmentation was onboarded.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/07ac79/7313307073/inv_fullxfull.7313307073_qm55egu7.jpg?version=0" />
<figcaption>The Catapult data pipeline processed 13 hard-coded segmentations all within one batch job.</figcaption>
</figure>
<p>As a result, teams ended up relying on analysts to run ad-hoc SQL queries outside Catapult to explore unsupported segmentations. This process was slow, hard to reproduce, and difficult to share with product managers and stakeholders. Experimenters often felt frustrated – there was so much to learn, but no easy way to get the insights they needed to make critical business decisions quickly.</p>
<p>As the team at Etsy responsible for Catapult and internal experimentation support, we wanted to change this by <strong>empowering experimenters to define their own segmentations</strong> — and to reuse those segmentations across experiments. Our vision was to create a <strong>Segmentation Library</strong>: a collection of experimenter-defined segmentations that could be easily shared and automatically applied in the data pipeline, no code changes required!</p>
<h2>Empowering experimenters to define their own segmentations</h2>
<p>To make it possible for experimenters to define their own segmentations, we needed a way to represent segmentations as <strong>configuration objects</strong> that are both human-readable and machine-processable.</p>
<p>We drew inspiration from the ad-hoc SQL scripts our analysts had been writing to explore custom segmentations — because SQL is widely understood and expressive. Typically, those scripts all follow these steps:</p>
<ul>
<li>Query various source tables for segment data</li>
<li>Join the segment data with <em>experimental unit</em> data (i.e. users that were part of an experiment)</li>
<li>Define how to assign each experimental unit to a segment value</li>
</ul>
<p>We decided to model each segmentation as:</p>
<ul>
<li>A <strong>SQL query</strong> defining the segment data source, and how to assign segment values to each experimental unit</li>
<li>A set of <strong>join columns</strong> to connect segment data to experimental units using some shared IDs (e.g. user IDs)</li>
</ul>
<p>Here is an example of a segmentation definition from our ice cream experiment:</p>
<p><strong>Segmentation name</strong>: <code>"ice_cream_lovers"</code></p>
<p><strong>SQL</strong>:</p>
<pre><code>SELECT
  user_id,
  CASE
    WHEN rating &gt;= 9 THEN 'Ice cream lover'
    WHEN rating &lt;= 2 THEN 'Ice cream hater'
    ELSE 'Indifferent'
  END AS segment_value
FROM `ice_cream_user_ratings`</code></pre>
<p><strong>Join columns</strong>: <code>["user_id"]</code></p>
<p>The nice thing about defining segmentations this way is that, unlike the old hard-coded segmentations that were buried deep in the data pipeline, these SQL definitions are easier for analysts, product managers, and engineers to define and understand. This makes it easier for teams to adopt and reuse existing segmentations that were previously defined by other teams for their own experiments. Additionally, the existing segmentations serve as inspiration for new ones! Anyone looking to create a new segmentation can view the existing ones in the Segmentation Library for inspiration before defining their own in the Segmentation Creation tool.</p>
<p>The SQL queries and join keys can also be easily plugged into templated scripts in Catapult’s data pipeline to compute segmented experiment results. It’s a win-win for both experimenters and the data pipeline!</p>
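<p>Here is one way the “configuration object plus templated script” idea could be sketched in Python. Everything beyond the segmentation name, its SQL, and its join columns is an assumption: the <code>SegmentationDefinition</code> class, the <code>TEMPLATE</code> text, and the <code>experiment_units</code> table with its columns are hypothetical. The sketch runs on an in-memory SQLite database rather than BigQuery, and the second condition is written as <code>2 >= rating</code>, which is equivalent to the original <code>rating &lt;= 2</code>.</p>

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentationDefinition:
    """A segmentation as configuration: a name, a SQL query, and join columns."""
    name: str
    sql: str
    join_columns: tuple


ICE_CREAM_LOVERS = SegmentationDefinition(
    name="ice_cream_lovers",
    sql="""SELECT user_id,
                  CASE WHEN rating >= 9 THEN 'Ice cream lover'
                       WHEN 2 >= rating THEN 'Ice cream hater'
                       ELSE 'Indifferent' END AS segment_value
           FROM ice_cream_user_ratings""",
    join_columns=("user_id",),
)

# Hypothetical template: join experimental units to segment data, then count.
TEMPLATE = """
SELECT units.experiment_id, units.variant, seg.segment_value, COUNT(*) AS n_units
FROM experiment_units AS units
JOIN ({segment_sql}) AS seg USING ({join_columns})
GROUP BY units.experiment_id, units.variant, seg.segment_value
ORDER BY units.variant, seg.segment_value
"""


def render(definition):
    """Plug one definition's SQL and join columns into the shared template."""
    return TEMPLATE.format(segment_sql=definition.sql,
                           join_columns=", ".join(definition.join_columns))


def run_example():
    """Build tiny made-up tables and compute segmented unit counts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ice_cream_user_ratings (user_id INTEGER, rating INTEGER);
        INSERT INTO ice_cream_user_ratings VALUES (1, 10), (2, 1), (3, 5), (4, 9);
        CREATE TABLE experiment_units (experiment_id TEXT, variant TEXT, user_id INTEGER);
        INSERT INTO experiment_units VALUES
            ('free_ice_cream', 'control', 1), ('free_ice_cream', 'control', 2),
            ('free_ice_cream', 'treatment', 3), ('free_ice_cream', 'treatment', 4);
    """)
    return conn.execute(render(ICE_CREAM_LOVERS)).fetchall()
```

<p>Because the template only needs a query and a list of join columns, a new segmentation is a new configuration value rather than a new script.</p>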
<h2>Validating segmentation definitions</h2>
<p>With experimenter-defined SQL queries comes a big question: <em>How do we make sure they actually work?</em></p>
<p>As the owners of Catapult’s data pipeline, we want to make sure these experimenter-defined queries won’t cause any damage — like accidentally deleting important tables or causing failures in our pipeline. On the flip side, experimenters creating these segmentations want to know that their SQL will run smoothly in our pipeline, and that the results match what they expected.</p>
<p>To tackle both concerns, every new segmentation goes through a rigorous validation process before it can be deployed to production:</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/df27a6/7313144195/inv_fullxfull.7313144195_ejcyvtnf.jpg?version=0" />
<figcaption>The segmentation validation and deployment workflows are triggered to push new segmentations to production.</figcaption>
</figure>
<p>First, when an experimenter defines a new segmentation in the Segmentation Creation tool and clicks “Validate”, it triggers a <strong>validation workflow</strong> through the workflow orchestration API. This workflow receives the experimenter's segmentation SQL query and metadata such as join keys. Then, each task in the validation process checks different parts of the segmentation query to make sure the SQL follows our rules — like having the correct output schema and using only approved join keys. If anything fails, an error message is sent back to the experimenter through the Segmentation Creation tool. This workflow-orchestration–based validation ensures that segmentations are accurate, safe, and ready to run before deployment. Once the SQL validation finishes successfully, we generate a sample result in <a href="https://cloud.google.com/bigquery">BigQuery</a> for the experimenter to review.</p>
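<p>The kinds of checks mentioned above (read-only SQL, approved join keys, correct output schema) might be sketched in Python as follows. This is an illustration under assumptions, not our validation workflow: the allow-list in <code>APPROVED_JOIN_KEYS</code>, the keyword filter, and the expectation that the output columns are the join columns followed by <code>segment_value</code> are all hypothetical, and SQLite stands in for BigQuery.</p>

```python
import re
import sqlite3

APPROVED_JOIN_KEYS = {"user_id", "browser_id"}  # hypothetical allow-list
FORBIDDEN = re.compile(
    r"\b(DROP|DELETE|INSERT|UPDATE|ALTER|CREATE|TRUNCATE)\b", re.I)


def validate_segmentation(sql, join_columns, conn):
    """Return a list of error messages; an empty list means every check passed."""
    errors = []
    if not sql.lstrip().upper().startswith("SELECT"):
        errors.append("query must be a single SELECT statement")
    if ";" in sql or FORBIDDEN.search(sql):
        errors.append("query contains a disallowed statement")
    unapproved = sorted(set(join_columns) - APPROVED_JOIN_KEYS)
    if unapproved:
        errors.append(f"unapproved join keys: {unapproved}")
    if errors:
        return errors  # never send a suspicious query to the database

    try:
        # LIMIT 0 makes the engine compile the query and report its
        # output columns without returning any rows.
        cursor = conn.execute(f"SELECT * FROM ({sql}) LIMIT 0")
    except sqlite3.Error as exc:
        return [f"query failed to compile: {exc}"]

    columns = [col[0] for col in cursor.description]
    expected = list(join_columns) + ["segment_value"]
    if columns != expected:
        errors.append(f"expected output columns {expected}, got {columns}")
    return errors
```

<p>Returning messages instead of raising fits the flow described above, where any failure is reported back to the experimenter in the Segmentation Creation tool.</p>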
<p>Once the experimenter has reviewed the sample results and is happy with them, they can click “Deploy”, which triggers an automated <strong>deployment workflow</strong> to push the new segmentation to production. This workflow adds the validated SQL to the segmentation definition repository and automatically creates a pull request. After a final validation of the segmentation SQL query, the pull request is automatically merged, and the new segmentation is immediately deployed to production. The system also manages all dependencies, ensuring that all upstream data is ready before running the segmentation in production.</p>
<p>Once a segmentation is in production, it is ready to be used in experiments!</p>
<h2>Building a more flexible pipeline with dynamic task generation</h2>
<p>After new segmentations were defined, the next challenge was to ensure that our data pipeline could automatically pick them up and process them — <em>without needing any code changes</em>.</p>
<p>The problem was that Catapult’s data pipeline was set up as a workflow with a fixed set of tasks for a predefined set of segmentations. This meant adding a new segmentation required updating the pipeline code to add new tasks.</p>
<p>To solve this, we leveraged <strong>dynamic task generation</strong> in our workflow to create tasks <em>at runtime</em> based on dynamic inputs — like the list of active segmentations — without knowing in advance how many tasks are needed.</p>
<p>This was a perfect fit for us, since it enabled us to process a changing set of segmentations each day as independent, parallel work streams, all running the same underlying logic. This approach also helped us scale up the number of supported segmentations without duplicating code.</p>
<p>Here is how we utilized dynamic task generation in our data pipeline:</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/ffeecb/7265361036/inv_fullxfull.7265361036_s6sx41ze.jpg?version=0" />
</figure>
<ol>
<li>First, the workflow makes an API call to get the current list of active segmentations from ongoing experiments. </li>
<li>Then, using dynamic task generation, it spins up a set of identical task groups, one for each segmentation, all running independently and in parallel.</li>
<li>Within each task group:
<ol>
<li>It first uses sensors to wait for the segmentation’s upstream data to be ready.</li>
<li>Once the upstream data is ready, it kicks off subsequent SQL scripts that use the segmentation’s SQL query and join columns to connect segment data to experimental units, and aggregate them for each experiment. </li>
<li>Finally, it outputs the segmented experiment results to the Experiment Results table that is then used to display the results in the Catapult UI.</li>
</ol></li>
</ol>
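<p>The steps above can be sketched in plain Python. This is not the API of any particular workflow orchestrator: a thread pool stands in for parallel task groups, and <code>fetch_active_segmentations</code>, <code>upstream_ready</code>, <code>run_task_group</code>, and the two extra segmentation names are hypothetical stand-ins. One segmentation is deliberately left “not ready” to show that its failure stays inside its own task group.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_active_segmentations():
    """Stand-in for the API call that lists currently active segmentations."""
    return ["ice_cream_lovers", "recent_ice_cream", "listing_price_tier"]


def upstream_ready(name):
    """Stand-in for a sensor; a real one would poll until the data lands."""
    return name != "recent_ice_cream"


def run_task_group(name):
    """One task group: wait on upstream data, compute, write results."""
    if not upstream_ready(name):
        raise TimeoutError(f"upstream data for {name} not ready")
    # A real group would run the templated SQL and write to the results table.
    return f"results written for {name}"


def run_pipeline():
    """Create one task group per segmentation; the count is known only at runtime."""
    names = fetch_active_segmentations()
    statuses = {}
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        futures = {name: pool.submit(run_task_group, name) for name in names}
        for name, future in futures.items():
            try:
                statuses[name] = future.result()
            except Exception as exc:  # a failure stays inside its own group
                statuses[name] = f"failed: {exc}"
    return statuses
```

<p>Keying the results by segmentation name mirrors the idea that downstream jobs can wait on exactly the segmentations they need.</p>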
<p>Here is a side-by-side comparison of our data pipeline before and after using dynamic task generation:</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/3f4cd3/7313270611/inv_fullxfull.7313270611_emjtnrct.jpg?version=0" />
</figure>
<p>Overall, this new design offers several benefits:</p>
<ol>
<li><strong>Automated processing</strong>: New segmentations are automatically detected and processed in the data pipeline. No code changes required!</li>
<li><strong>Improved failure isolation</strong>: Each segmentation runs independently, so if one fails, it doesn’t affect the others, making the pipeline more reliable.</li>
<li><strong>Simplified downstream dependencies</strong>: Each segmentation has a unique name, allowing downstream jobs to wait for the specific tasks they need. This means they can start as soon as their required segmentations are ready, without having to wait around for the entire batch to finish.</li>
</ol>
<h2>The Impact: 3X more segmentations, faster experiment insights</h2>
<p>Within the first three months of launching the new Segmentation Creation tool, experimenters created <em>34 new segmentations</em>. The number of segmentations our experimentation system supports <strong>more than tripled</strong>, from 13 hard-coded segmentations to a total of 47. We knew our experimenters would love the ability to define their own segmentations, but we didn’t expect them to love it <em>this</em> much!</p>
<p>Since the launch, teams across Etsy are gaining insights faster than before. In fact, 78% of experimenters say they are <strong>saving 2-5 hours a month</strong> by analyzing their experiments with custom segmentations in Catapult. Having a Segmentation Library as a single source of truth means that teams can easily reuse existing segmentations instead of reinventing the wheel every time, making it way easier to share insights across different business areas. Plus, with faster access to experiment insights, teams are able to make decisions and iterate on their products more quickly than ever!</p>
<h2>Conclusion: Building a better Etsy through flexible experimentation</h2>
<p>In summary, by enabling experimenters to define their own segmentations, and by building data pipelines smart enough to adapt to those experimenter-defined ideas, we’ve given teams across Etsy the power to shape their own experiment analysis and get actionable insights quickly.</p>
<p>Why does this matter? Because every bit of flexibility we add to experimentation leads to a better experience for Etsy buyers and sellers. <strong>When teams can iterate quickly and confidently, they build products that genuinely serve our community — helping Etsy thrive as a platform where creativity and commerce meet.</strong></p>
<p>Some key lessons we learned along the way:</p>
<ul>
<li>Moving from hard-coded logic to a <strong>configuration-driven pipeline</strong> has truly opened up tons of new possibilities and expanded what our experimentation platform can do. We’re already using this approach to make other parts of experimentation more flexible, using Segmentations as an example to follow.</li>
<li>Having a group of experimenters as early <strong>“pilot testers”</strong> was a key part of our success. They worked closely with us from the start, testing out early versions of the tool and giving us honest feedback, which helped us build the Segmentation Creation Tool to be as user-friendly as possible. Our pilot testers have also become champions who helped spread the word on this feature to other teams. The segmentations they have created served as examples for others to follow. Thank you, our pilot testers!</li>
<li>Although Segmentations has been adopted by many teams across Etsy since its release, we know there are still teams out there who haven’t discovered this tool yet! We are committed to finding new ways to <strong>share the knowledge</strong> and help teams discover how this tool can unlock the insights they need. </li>
</ul>
<p>We hope that our learnings will inspire you to turn experiment analysis from a fixed process into a flexible, self-serviceable platform that can keep up with whatever questions your team asks next!</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Building Etsy Buyer Profiles with LLMs]]></title>
                <link>https://www.etsy.com/codeascraft/building-etsy-buyer-profiles-with-llms?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Isobel Scott]]></dc:creator>
                <pubDate>Wed, 3 Sep 2025 17:40:15 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/building-etsy-buyer-profiles-with-llms?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>Every day, shoppers from Etsy's community of nearly 90M buyers visit our marketplace to search for unique, handmade, and vintage items. But with over 100 million listings, how do we help each buyer find exactly what they're looking for? Traditional search and recommendation systems often fall short of capturing the nuanced interests that make each Etsy buyer unique, from specific styles to aesthetic preferences. At Etsy, understanding our buyers' interests is central to delivering engaging, personalized experiences. </p>
<p>Recently, we explored enhancing our personalization by leveraging large language models (LLMs) to create detailed buyer profiles based on buyers’ browsing and purchasing behaviors. We strive towards privacy by design and build these exploratory models with buyer privacy in mind. While this work is still in experimentation, we’re excited to share our early efforts and what we’ve learned.</p>
<h1>Understanding Buyer Profiles</h1>
<p>Buyer profiles anonymously capture nuanced interests for each user, like preferred styles and product categories. For example, do they tend to prefer minimalist styles? Do they mostly shop for home decor, or jewelry? They also offer insight into specific shopping missions a buyer may be on, like finding the perfect fall sweater. These profiles help us better understand our users, making the Etsy experience feel tailored to each individual buyer.</p>
<h1>Technical Implementation</h1>
<p>Our process begins by retrieving user activity data and listing information from internal sources like our feature store and BigQuery. This includes users’ recent searches, item views, purchases, and favorites. </p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/d90c38/7167110206/inv_fullxfull.7167110206_rvsp064p.jpg?version=0" />
</figure>
<p>Once we have the users’ session data with interactions, we then prompt the LLM to interpret this data and generate structured buyer profiles.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/ec98a6/7167223046/inv_fullxfull.7167223046_f1d4hdhg.jpg?version=0" />
<figcaption>Buyer profile data structure</figcaption>
</figure>
<p>In the case that the LLM does not have enough supporting data to infer a user's categorical interests (i.e. these interests would also not be obvious to a human looking at this user's interactions), these fields and the confidence hashmap can be empty. The LLM will always include an explanation and any observed interests in the explanation section of the buyer profile. </p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/3289ad/7167112652/inv_fullxfull.7167112652_4veuwows.jpg?version=0" />
</figure>
<h1>Bringing Personalization to Scale</h1>
<p>Initially, using this method to generate profiles for our roughly 90 million buyers would have been exceedingly costly and taken weeks to update. 
We optimized this by:</p>
<ul>
<li>Shifting our listings data source from API endpoints to BigQuery tables that are clustered and partitioned for efficient querying.</li>
<li>Decreasing the volume of input tokens. Initially, we were including about 2 years of session data. Now, we’ve reduced this to just the last 9 months. The 9-month timeframe also allowed us to lighten the prompt corrections (and cached input tokens) by reducing the weight of holiday shopping.</li>
<li>Increasing LLM and BigQuery batch sizes for data processing.</li>
<li>Introducing parallel processing with managed concurrency to avoid request rate limits.</li>
<li>Scaling up computational resources for session data retrieval and LLM processing tasks.</li>
</ul>
<p>These improvements reduced buyer profile generation time dramatically — from 21 days down to 3 days for 10 million users.
Cost management was also crucial. By adjusting the prompt to get high quality results with a smaller model, we significantly lowered the cost and made large-scale personalization economically feasible. Through these various cost management techniques, we were able to reduce the estimated cost by 94% per million users.</p>
<p>Additionally, to scale and schedule buyer profile refreshes, we used Airflow as an orchestration tool. By batching and staggering tasks by user_id, we parallelize some tasks while avoiding too many concurrent requests on BigQuery &amp; OpenAI APIs (as visualized in this Directed Acyclic Graph).</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/4c215d/7215106915/inv_fullxfull.7215106915_9syjqngx.jpg?version=0" />
<figcaption>A snapshot of a running buyer profile DAG with staggered batching</figcaption>
</figure>
<h1>Experiments and Applications</h1>
<p>With these powerful insights in hand, we’ve started to explore several key applications of buyer profiles that would make the Etsy search experience feel unique to each shopper.</p>
<h2>Query Rewriting</h2>
<p>Query rewriting is the process of transforming a user’s submitted search query on the backend to better represent their intent to the underlying search systems. We already use query rewriting for different use cases at Etsy. It can correct spelling, add specifics to ambiguous queries, adjust niche terminology, or complete incomplete queries (where the user accidentally pressed enter before they finished writing their query).</p>
<p>To further personalize Etsy’s search experience, we explored enriching search queries with predicted interests from buyer profiles. For instance, a simple query like "cool posters" becomes "cool posters + hippie|boho|vintage|nature," significantly boosting relevance.</p>
<h3>Search results without personalization</h3>
<p><img alt="" src="https://i.etsystatic.com/inv/8042d1/7215156101/inv_fullxfull.7215156101_i7tyi5gw.jpg?version=0" /> </p>
<h3>Search results with personalization using example buyer profile</h3>
<p><img alt="" src="https://i.etsystatic.com/inv/57f38c/7215157261/inv_fullxfull.7215157261_3eozeu9r.jpg?version=0" /> </p>
<h2>Refinement Pills</h2>
<p>On the Etsy website, “refinement pills” work as an interactive form of query reformulation. When a user clicks on an option, the path will be prefixed to the user’s query and a new search will be executed. A mix of different types of refinement pills is generated: single pills (just one word) and grouped pills (a dropdown of options). With buyer profiles, we can present high-confidence buyer interests as clickable filters or refinement pills in the search interface, enabling users to effortlessly refine their searches based on predicted preferences.</p>
<h3>Refinement pills without personalization</h3>
<p><img alt="" src="https://i.etsystatic.com/inv/0065d9/7215158991/inv_fullxfull.7215158991_4vahuj39.jpg?version=0" /> </p>
<h3>Refinement pills with personalization using example buyer profile</h3>
<p><img alt="" src="https://i.etsystatic.com/inv/c6eec8/7167281650/inv_fullxfull.7167281650_ovxaywx2.jpg?version=0" /></p>
<p>For example, if a user clicks on a “daisy” refinement pill as in the example above, the search query "car accessories" would be reformulated to "daisy car accessories" and the new, more specific search would be executed.</p>
<h1>Measuring Success</h1>
<h2>Validating Profile Accuracy</h2>
<p>As we work to roll out buyer profile-based personalization, understanding the accuracy of our predictions is crucial. Through systematic experimentation, we're working to establish how well our LLM-generated profiles reflect actual user interests and shopping behaviors.</p>
<h2>Key Accuracy Metrics</h2>
<ul>
<li>Click-through Rate (CTR) Lift: Comparing CTR on personalized search results versus baseline helps us quantify immediate engagement improvements</li>
<li>Conversion Rate Impact: Tracking purchase rates from personalized searches reveals whether we're surfacing items users actually want to buy</li>
<li>Refinement Pill Engagement: Monitoring clicks on suggested refinement pills helps measure how well we've predicted user interests</li>
<li>Search Query Reformulation Success: Analyzing subsequent user actions after query rewriting indicates whether our enrichments align with user intent</li>
</ul>
<h2>Profile Refresh Strategy</h2>
<p>We'll work to maintain profile accuracy through several methods:</p>
<ul>
<li>Dynamically adjusting refresh timing based on user activity levels, including the number and frequency of recent interactions. For example, does the user click on the proposed refinement pills? Are the personalized items viewed?</li>
<li>Detecting interest drift when search patterns change significantly</li>
<li>Taking into account seasonal considerations, such as the winter holiday shopping season</li>
<li>Monitoring performance degradation signals when personalization effectiveness drops below thresholds of engagement listed above</li>
</ul>
<h1>Future Work</h1>
<h2>The Cold Start User Problem</h2>
<p>To extend this personalized experience to new users who lack extensive session data, we experimented with "inheritance profiles" using collaborative filtering. By matching early interaction signals (like brief session data or initial search terms) to existing profiles of similar users, we could predict interests for new users earlier in their relationship with Etsy.</p>
<p>This user-based collaborative filtering approach would allow us to have profiles for as many users as possible, even those with minimal behavioral data. Below is a diagram for a sample new user who might inherit aspects of a similar profile as the one above, based on just a few searches.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/ef8f8c/7215162515/inv_fullxfull.7215162515_o2x0riwx.jpg?version=0" />
</figure>
<h1>Conclusion</h1>
<p>As we continue to refine these profiles and expand their applications across the Etsy marketplace, we're excited about the possibilities, from more intuitive search experiences to discovery features that surface hidden gems aligned with each buyer's unique tastes. </p>
<p>With nearly 90 million active buyers and over 100 million listings, the challenge of personalization at Etsy's scale is immense, but that's exactly what makes it worth solving. Every improvement we make impacts millions of shopping journeys, helping buyers discover that perfect vintage find, custom wedding gift, or miniature dragon they didn't even know they were looking for. </p>
<h1>Acknowledgements</h1>
<p>Huge thanks to Orson Adams, David Blincoe, Jugal Gala, Davis Kim, Haoming Chen, Yinlin Fu, Julia Zhou, and the entire Search Mission Understanding team for their invaluable support and guidance.</p>]]></description>
                <content:encoded><![CDATA[<p>Every day, shoppers from Etsy's community of nearly 90M buyers visit our marketplace to search for unique, handmade, and vintage items. But with over 100 million listings, how do we help each buyer find exactly what they're looking for? Traditional search and recommendation systems often fall short of capturing the nuanced interests that make each Etsy buyer unique, from specific styles to aesthetic preferences. At Etsy, understanding our buyers' interests is central to delivering engaging, personalized experiences. </p>
<p>Recently, we explored enhancing our personalization by leveraging large language models (LLMs) to create detailed buyer profiles based on buyers’ browsing and purchasing behaviors. We strive towards privacy by design and build these exploratory models with buyer privacy in mind. While this work is still in experimentation, we’re excited to share our early efforts and what we’ve learned.</p>
<h1>Understanding Buyer Profiles</h1>
<p>Buyer profiles anonymously capture nuanced interests for each user, like preferred styles and product categories. For example, do they tend to prefer minimalist styles? Do they mostly shop for home decor, or jewelry? They also offer insight into specific shopping missions a buyer may be on, like finding the perfect fall sweater. These profiles help us better understand our users, making the Etsy experience feel tailored to each individual buyer.</p>
<h1>Technical Implementation</h1>
<p>Our process begins by retrieving user activity data and listing information from internal sources like our feature store and BigQuery. This includes users’ recent searches, item views, purchases, and favorites. </p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/d90c38/7167110206/inv_fullxfull.7167110206_rvsp064p.jpg?version=0" />
</figure>
<p>Once we have the users’ session data with interactions, we then prompt the LLM to interpret this data and generate structured buyer profiles.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/ec98a6/7167223046/inv_fullxfull.7167223046_f1d4hdhg.jpg?version=0" />
<figcaption>Buyer profile data structure</figcaption>
</figure>
<p>In the case that the LLM does not have enough supporting data to infer a user's categorical interests (i.e. these interests would also not be obvious to a human looking at this user's interactions), these fields and the confidence hashmap can be empty. The LLM will always include an explanation and any observed interests in the explanation section of the buyer profile. </p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/3289ad/7167112652/inv_fullxfull.7167112652_4veuwows.jpg?version=0" />
</figure>
<h1>Bringing Personalization to Scale</h1>
<p>Initially, using this method to generate profiles for our roughly 90 million buyers would have been exceedingly costly and taken weeks to update. 
We optimized this by:</p>
<ul>
<li>Shifting our listings data source from API endpoints to BigQuery tables that are clustered and partitioned for efficient querying.</li>
<li>Decreasing the volume of input tokens. Initially, we were including about 2 years of session data. Now, we’ve reduced this to just the last 9 months. The 9-month timeframe also allowed us to lighten the prompt corrections (and cached input tokens) by reducing the weight of holiday shopping.</li>
<li>Increasing LLM and BigQuery batch sizes for data processing.</li>
<li>Introducing parallel processing with managed concurrency to avoid request rate limits.</li>
<li>Scaling up computational resources for session data retrieval and LLM processing tasks.</li>
</ul>
<p>These improvements reduced buyer profile generation time dramatically — from 21 days down to 3 days for 10 million users.
Cost management was also crucial. By adjusting the prompt to get high quality results with a smaller model, we significantly lowered the cost and made large-scale personalization economically feasible. Through these various cost management techniques, we were able to reduce the estimated cost by 94% per million users.</p>
<p>Additionally, to scale and schedule buyer profile refreshes, we used Airflow as an orchestration tool. By batching and staggering tasks by user_id, we parallelize some tasks while avoiding too many concurrent requests on BigQuery &amp; OpenAI APIs (as visualized in this Directed Acyclic Graph).</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/4c215d/7215106915/inv_fullxfull.7215106915_9syjqngx.jpg?version=0" />
<figcaption>A snapshot of a running buyer profile DAG with staggered batching</figcaption>
</figure>
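<p>As a rough illustration of batching with managed concurrency (not the Airflow DAG itself), the sketch below splits user IDs into batches and caps how many batches run at once with a semaphore. The batch size, the concurrency limit, and all function names are hypothetical, and the <code>sleep</code> call is only a stand-in for the BigQuery read and the LLM request.</p>

```python
import asyncio


def make_batches(user_ids, batch_size):
    """Split user IDs into consecutive batches of at most batch_size."""
    return [user_ids[i:i + batch_size] for i in range(0, len(user_ids), batch_size)]


async def process_batch(batch, semaphore, stats):
    """Process one batch while holding a concurrency slot."""
    async with semaphore:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0)  # stand-in for the data read and the LLM request
        stats["running"] -= 1
        return len(batch)


async def run_all(user_ids, batch_size, max_concurrent):
    """Run every batch, never exceeding max_concurrent batches in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    stats = {"running": 0, "peak": 0}
    batches = make_batches(user_ids, batch_size)
    done = await asyncio.gather(*(process_batch(b, semaphore, stats) for b in batches))
    return sum(done), stats["peak"]


total, peak = asyncio.run(run_all(list(range(10)), batch_size=3, max_concurrent=2))
print(total, peak)
```

<p>The semaphore is what keeps the number of in-flight requests below a rate limit: all batches are scheduled up front, but only <code>max_concurrent</code> of them can be doing work at any moment.</p>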
<h1>Experiments and Applications</h1>
<p>With these powerful insights in hand, we’ve started to explore several key applications of buyer profiles that would make the Etsy search experience feel unique to each shopper.</p>
<h2>Query Rewriting</h2>
<p>Query rewriting is the process of transforming a user’s submitted search query on the backend to better represent their intent to the underlying search systems. We already use query rewriting for different use cases at Etsy. It can correct spelling, add specifics to ambiguous queries, adjust niche terminology, or complete incomplete queries (where the user accidentally pressed enter before they finished writing their query).</p>
<p>To further personalize Etsy’s search experience, we explored enriching search queries with predicted interests from buyer profiles. For instance, a simple query like "cool posters" becomes "cool posters + hippie|boho|vintage|nature," significantly boosting relevance.</p>
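<p>A minimal sketch of this kind of enrichment is shown below. It assumes a profile with a list of interests and a per-interest confidence map; the field names <code>interests</code> and <code>confidence</code>, the threshold, and the term cap are hypothetical choices for illustration, not the production schema.</p>

```python
def rewrite_query(query, profile, min_confidence=0.7, max_terms=4):
    """Append high-confidence profile interests to a query as OR-ed terms."""
    confidence = profile.get("confidence", {})
    terms = [
        term for term in profile.get("interests", [])
        if confidence.get(term, 0.0) >= min_confidence
    ]
    terms = terms[:max_terms]
    if not terms:
        # Empty or low-confidence profiles leave the query untouched.
        return query
    return f"{query} + {'|'.join(terms)}"


# Hypothetical example profile.
profile = {
    "interests": ["hippie", "boho", "vintage", "nature", "minimalist"],
    "confidence": {"hippie": 0.9, "boho": 0.85, "vintage": 0.8, "nature": 0.75, "minimalist": 0.3},
}
print(rewrite_query("cool posters", profile))
```

<p>With this example profile the low-confidence "minimalist" interest is dropped, and a buyer with an empty profile simply gets the original query back.</p>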
<h3>Search results without personalization</h3>
<p><img alt="" src="https://i.etsystatic.com/inv/8042d1/7215156101/inv_fullxfull.7215156101_i7tyi5gw.jpg?version=0" /> </p>
<h3>Search results with personalization using example buyer profile</h3>
<p><img alt="" src="https://i.etsystatic.com/inv/57f38c/7215157261/inv_fullxfull.7215157261_3eozeu9r.jpg?version=0" /> </p>
<h2>Refinement Pills</h2>
<p>On the Etsy website, “refinement pills” work as an interactive form of query reformulation. When a user clicks on an option, the path will be prefixed to the user’s query and a new search will be executed. A mix of different types of refinement pills is generated: single pills (just one word) and grouped pills (a dropdown of options). With buyer profiles, we can present high-confidence buyer interests as clickable filters or refinement pills in the search interface, enabling users to effortlessly refine their searches based on predicted preferences.</p>
<h3>Refinement pills without personalization</h3>
<p><img alt="" src="https://i.etsystatic.com/inv/0065d9/7215158991/inv_fullxfull.7215158991_4vahuj39.jpg?version=0" /> </p>
<h3>Refinement pills with personalization using example buyer profile</h3>
<p><img alt="" src="https://i.etsystatic.com/inv/c6eec8/7167281650/inv_fullxfull.7167281650_ovxaywx2.jpg?version=0" /></p>
<p>For example, if a user clicks on a “daisy” refinement pill as in the example above, the search query "car accessories" would be reformulated to "daisy car accessories" and the new, more specific search would be executed.</p>
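<p>The reformulation step can be sketched as follows. The profile shape (a <code>confidence</code> map from interest to score), the threshold, and both function names are assumptions made for illustration only.</p>

```python
def build_pills(profile, min_confidence=0.8, max_pills=5):
    """Pick the highest-confidence interests to show as single refinement pills."""
    confidence = profile.get("confidence", {})
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    return [term for term in ranked if confidence[term] >= min_confidence][:max_pills]


def apply_pill(query, pill):
    """Prefix the clicked pill to the query, producing the reformulated search."""
    return f"{pill} {query}"


# Hypothetical example profile.
pills = build_pills({"confidence": {"daisy": 0.9, "boho": 0.85, "goth": 0.2}})
print(pills)
print(apply_pill("car accessories", pills[0]))
```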
<h1>Measuring Success</h1>
<h2>Validating Profile Accuracy</h2>
<p>As we work to roll out buyer profile-based personalization, understanding the accuracy of our predictions is crucial. Through systematic experimentation, we're working to establish how well our LLM-generated profiles reflect actual user interests and shopping behaviors.</p>
<h2>Key Accuracy Metrics</h2>
<ul>
<li>Click-through Rate (CTR) Lift: Comparing CTR on personalized search results versus baseline helps us quantify immediate engagement improvements</li>
<li>Conversion Rate Impact: Tracking purchase rates from personalized searches reveals whether we're surfacing items users actually want to buy</li>
<li>Refinement Pill Engagement: Monitoring clicks on suggested refinement pills helps measure how well we've predicted user interests</li>
<li>Search Query Reformulation Success: Analyzing subsequent user actions after query rewriting indicates whether our enrichments align with user intent</li>
</ul>
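<p>For the rate-based metrics above, lift is commonly reported relative to the baseline. The helper below is a small sketch of that arithmetic; the function names are hypothetical and the counts are made-up numbers, not experiment results.</p>

```python
def rate(successes, trials):
    """Fraction of trials that succeeded, e.g. clicks / impressions."""
    return successes / trials if trials else 0.0


def relative_lift(treatment_rate, control_rate):
    """Relative change of the treatment rate over the control (baseline) rate."""
    if control_rate == 0:
        raise ValueError("control rate must be non-zero")
    return (treatment_rate - control_rate) / control_rate


# Made-up counts for illustration only.
control_ctr = rate(400, 10_000)
personalized_ctr = rate(460, 10_000)
lift = relative_lift(personalized_ctr, control_ctr)
print(f"{lift:.1%}")
```

<p>The same calculation applies to conversion rate by swapping clicks for purchases. In practice a lift figure would be read alongside a significance test rather than on its own.</p>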
<h2>Profile Refresh Strategy</h2>
<p>We'll work to maintain profile accuracy through several methods:</p>
<ul>
<li>Dynamically adjusting refresh timing based on user activity levels, including the number and frequency of recent interactions. For example, does the user click on the proposed refinement pills? Are the personalized items viewed?</li>
<li>Detecting interest drift when search patterns change significantly</li>
<li>Taking into account seasonal considerations, such as the winter holiday shopping season</li>
<li>Monitoring performance degradation signals when personalization effectiveness drops below thresholds of engagement listed above</li>
</ul>
<h1>Future Work</h1>
<h2>The Cold Start User Problem</h2>
<p>To extend this personalized experience to new users who lack extensive session data, we experimented with "inheritance profiles" using collaborative filtering. By matching early interaction signals (like brief session data or initial search terms) to existing profiles of similar users, we could predict interests for new users earlier in their relationship with Etsy.</p>
<p>This user-based collaborative filtering approach would allow us to have profiles for as many users as possible, even those with minimal behavioral data. Below is a diagram for a sample new user who might inherit aspects of a similar profile as the one above, based on just a few searches.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/ef8f8c/7215162515/inv_fullxfull.7215162515_o2x0riwx.jpg?version=0" />
</figure>
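<p>One simplified way to picture profile inheritance is a nearest-neighbor match on early search terms. The sketch below uses Jaccard similarity as a stand-in for a full collaborative filtering model; the data layout, the similarity threshold, and the function names are all hypothetical.</p>

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of search terms."""
    a, b = set(a), set(b)
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def inherit_profile(new_user_terms, existing_users, min_similarity=0.2):
    """Return interests from the most similar existing user, or None if no good match."""
    best_name, best_score = None, 0.0
    for name, user in existing_users.items():
        score = jaccard(new_user_terms, user["search_terms"])
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or min_similarity > best_score:
        return None
    return {
        "inherited_from": best_name,
        "interests": list(existing_users[best_name]["interests"]),
        "similarity": best_score,
    }


# Hypothetical existing users with established profiles.
existing = {
    "u1": {"search_terms": ["boho poster", "vintage lamp", "macrame"], "interests": ["boho", "vintage"]},
    "u2": {"search_terms": ["gaming mouse pad"], "interests": ["gaming"]},
}
print(inherit_profile(["boho poster", "macrame"], existing))
print(inherit_profile(["wedding ring"], existing))
```

<p>Returning <code>None</code> when nothing is similar enough mirrors the empty-profile behavior described earlier: it is better to have no inherited interests than poorly supported ones.</p>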
<h1>Conclusion</h1>
<p>As we continue to refine these profiles and expand their applications across the Etsy marketplace, we're excited about the possibilities, from more intuitive search experiences to discovery features that surface hidden gems aligned with each buyer's unique tastes. </p>
<p>With nearly 90 million active buyers and over 100 million listings, the challenge of personalization at Etsy's scale is immense, but that's exactly what makes it worth solving. Every improvement we make impacts millions of shopping journeys, helping buyers discover that perfect vintage find, custom wedding gift, or miniature dragon they didn't even know they were looking for. </p>
<h1>Acknowledgements</h1>
<p>Huge thanks to Orson Adams, David Blincoe, Jugal Gala, Davis Kim, Haoming Chen, Yinlin Fu, Julia Zhou, and the entire Search Mission Understanding team for their invaluable support and guidance.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Context engineering case studies: Etsy-specific question answering]]></title>
                <link>https://www.etsy.com/codeascraft/context-engineering-case-studies-etsy-specific-question-answering?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Jerome Bellegarda]]></dc:creator>
                <pubDate>Tue, 19 Aug 2025 16:04:41 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/context-engineering-case-studies-etsy-specific-question-answering?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p><em>This post investigates the benefits and limitations of prompt engineering in two instances of AI-assisted onboarding relying on large language model (LLM) technology. Of particular interest is how truthful (and therefore reliable) LLM-generated answers turn out to be in the context of Etsy-specific question answering. Among other insights, we find that asking the LLM to identify specific source snippets is a good way to flag potential hallucinations.</em></p>
<p>Over the past few years, pre-trained large-scale/foundation language models such as OpenAI’s <a href="https://openai.com/index/introducing-o3-and-o4-mini/">o-series</a> [1] and Google’s <a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/">Gemini family</a> [2] have revolutionized the field of natural language processing (NLP). Trained on vast amounts of text, images, code, audio, and videos, such models encapsulate a great deal of world knowledge, which can be called upon to perform a wide range of downstream tasks, such as sentiment analysis, language translation, and natural language inference, among many others.</p>
<p>The canonical way to improve the task performance of a pre-trained general language model, when it needs specific knowledge beyond its original training, is called <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning">fine-tuning</a> [3]. The model in a sense receives supplementary training, typically applied only in its last layers, on a relevant, labeled dataset. This is a robust but somewhat expensive approach, and only possible when there’s a sufficiently large dataset to support it. With the greatly increased size and complexity of LLMs, though, <strong>prompt-based tuning</strong> has emerged as an effective and much less costly alternative. The parameters of the model remain unchanged, and instead it’s through the prompt—the instructions given the model to perform its task—that tuning is done. At its simplest a prompt is just a question, but prompts can also incorporate specific chunks of text, provide the model with task-related constraints or preferences, and in general shape the model toward producing reliable responses. <em>Prompt engineering</em> encompasses the art and science of tuning prompts so they yield optimal results on a given task.</p>
<p>One of the places at Etsy where an assistive AI could be useful is onboarding: in principle, an LLM could answer a range of questions about both Etsy’s internal and external policies and procedures. But rather than ramp up an effort to fine-tune an onboarding model with a dataset of Etsy documents, we wanted to assess whether prompts alone could elicit truthful answers. We decided to start with a limited-scale pilot project, focusing on answering questions specific to the Travel &amp; Entertainment (T&amp;E) domain. On the one hand, this is a well-circumscribed domain with clear and unambiguous rules. On the other hand, most Etsy employees still have questions for every trip. This article summarizes our journey of discovery, and discusses in particular the extent to which the addition of simple explicit instructions works to mitigate LLM errors.</p>
<h1>Question answering</h1>
<p>Perhaps the most critical aspect of a question answering system is its reliability, i.e., whether it is able to provide a truthful answer to any in-domain question. In the AI-assisted onboarding use case considered here, we want new Etsy employees to be confident that their questions about the T&amp;E policy are answered correctly.</p>
<p>The first step was to feed Etsy-specific data into the LLM. If we were fine-tuning, we would update (a subset of) model weight parameters from an appropriate collection of Etsy internal documents related to T&amp;E. Prompt engineering, on the other hand, freezes the model weights, treating the LLM as a black box. A number of such black-box tuning techniques exist in the field, which we review in the Appendix to this article. Prompt-based tuning was an attractive proposition in our case because all that it required was simply an adequate representation of task-specific documents.</p>
<p>We therefore folded all T&amp;E documents into the rich latent space of the LLM (also known as the embedding space). In practice, this can be done efficiently by simply updating the underlying index to account for the newly folded data. Then, as indicated in the Appendix, question answering proceeded by leveraging <a href="https://ai.google.dev/gemini-api/docs/embeddings">embedding-based search</a> [6]. For every query, embedding the initial formulation led to relevant Etsy-specific text sections, which were then added to the question to form the final prompt.</p>
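<p>The retrieve-then-prompt flow can be illustrated with a toy example. Everything here is an assumption made for the sketch: word counts stand in for a real embedding model, the snippets are placeholders rather than actual policy text, and the prompt layout is hypothetical.</p>

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def build_prompt(question, sections, top_k=2):
    """Retrieve the sections closest to the question and add them to the prompt."""
    q = embed(question)
    ranked = sorted(sections, key=lambda s: cosine(q, embed(s)), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    return "Context:\n" + context + "\n\nQuestion: " + question


# Placeholder snippets for illustration, not actual policy text.
SECTIONS = [
    "Etsy pays the corporate card company directly on behalf of the employee.",
    "Checked bag fees on business trips are reimbursable.",
    "Book flights through the travel portal.",
]
prompt = build_prompt("Who pays the balance on my corporate card?", SECTIONS, top_k=1)
print(prompt)
```

<p>Only the retrieval step changes when new documents arrive: the index of embedded sections is updated, while the model weights stay frozen.</p>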
<p>We tested the resulting system’s performance on a manually curated set of 40 question-and-answer pairs. For each question in the test set, we compared the answer generated by the LLM with the answer we had extracted from the relevant policy document to form a judgment of answer quality. In the authors’ estimation, the LLM answered approximately 86% of the questions correctly, or at least satisfactorily enough that there was no need for further prompt maneuvering. In the remaining 14% of cases, the LLM generated an answer which was either wrong or misleading. For example, the LLM asserted with high confidence that it's the cardholder who is responsible for the balance on a corporate credit card:</p>
<p><strong>Q:</strong> <em>Who pays the balance on my corporate card after my expense report is approved?</em></p>
<p><strong>Correct answer:</strong> <em>Etsy pays the corporate card company directly on behalf of the Employee.</em></p>
<p><strong>LLM answer:</strong> <em>The cardholder is responsible for paying the balance on their corporate card after their expense report is approved.</em></p>
<p>The statement is wrong under the terms of the corporate card agreement. Similarly, the LLM falsely claimed that Etsy doesn't reimburse the cost of checking bags on a business trip (noting in its answer that "<em>the context information provided does not mention any expenses related to checking bags</em>," another falsehood).</p>
<h1>Mitigating hallucinations</h1>
<p>The kind of LLM falsehood illustrated in the previous section (wrong, but confidently stated) has become known as a “hallucination.” Prompt engineering offers us a number of ways of addressing and overcoming LLM hallucinations. We can craft narrower and more explicit input instructions. We can augment prompts with relevant contextual information. We can even ask the LLM to answer as if it occupied a particular role, say, data analyst or financial manager (role prompting).</p>
<p>In the case of the above hallucination about our corporate cards, we first attempted the explicit-instruction route, asking the LLM to be clear about what it didn’t know:</p>
<p><strong>Q:</strong> <em>Say "I have no idea" when you don't know the answer. Who pays the balance on my corporate card after my expense report is approved?</em></p>
<p><strong>A:</strong> <em>I have no idea.</em></p>
<p>We at least avoid the hallucination, but not in a way that will be very helpful to the user. Also note that the LLM misses the correct answer, which is actually present in the collection of documents.</p>
<p>Another try, prompting the LLM for chain-of-thought reasoning, finally led to the correct answer:</p>
<p><strong>Q:</strong> <em>Say "I have no idea" when you don't know the answer. Who pays the balance on my corporate card after my expense report is approved, and why do you think so?</em></p>
<p><strong>A:</strong> <em>The balance on your corporate card is paid by Etsy. This information is mentioned on page 42.</em></p>
<p>The above is a good illustration of how three different prompts for the same question can lead to widely divergent answers. When the internal state of the LLM is confused about a particular topic, letting it express uncertainty in the answer can often mitigate undue overconfidence. In addition, prompting for chain-of-thought reasoning is often beneficial, as it provides a way to structure the prompt with logical milestones to ensure that the LLM follows a well-circumscribed reasoning path. While it may not seem very intuitive that just adding "why do you think so" should change an all-but unresponsive answer into a correct one, the mere fact of asking for an explanation raises the bar in terms of fact checking – a side benefit of prompt engineering which doesn't require additional investment.</p>
<h1>Etsy community forums</h1>
<p>Encouraged by those findings, we decided to expand our investigation to an external use case: question answering based on historical data from Etsy seller community forums. Unlike the T&amp;E data we’ve just seen, Etsy forums contain publicly available content. Sellers ask questions such as: “What are the most effective ways to optimize my Etsy shop’s SEO to increase visibility and sales?” These questions are then answered by either Etsy or community members. As a result, compared to the T&amp;E data, Etsy forum data is substantially more complex: much less homogeneous (both in terms of style and scope), with answers by nature much more opinionated. As such, assessing the accuracy of LLM-based question answering is a little trickier for community forums than for the T&amp;E policy.</p>
<p>To note, because forums are a space where sellers engage with and offer information to one another, there may be some divergence between answers provided in the forums and official Etsy policies &amp; practices. For the purposes of our investigation, we measured only whether the LLM could accurately surface responses based on the input data it was given. A high score would indicate a helpful tool not only at seller onboarding time, but also to help both Etsy and community members routinely search this growing community database in a faster and more accurate fashion.</p>
<p>We proceeded exactly as before, first folding recent Etsy forum data into the latent space and then updating the underlying index to account for the newly folded data. Similar to the previous case study, we tested performance on a manually curated set of 50 question-and-answer pairs, comparing LLM-generated answers with answers extracted from the relevant Etsy forum to form a judgment of answer quality.</p>
<p>In the authors’ estimation, the LLM answered approximately 72% of the questions correctly before any prompt maneuvering. In the remaining 28% of cases, the LLM generated an answer which was either wrong or misleading. Given the higher heterogeneity of the data, it is not overly surprising that the percentage of poor answers doubles.</p>
<p>Comparing manually extracted and automatically generated answers is informative, not just from the point of view of seeing natural language outcomes, but also from the perspective of analyzing how such outcomes are pieced together from tidbits of information in latent space. Everything else being equal, the LLM tends to do better when a query formulation is close to the exact wording used in reference documents, and worse when the answer depends on multiple conditions being satisfied in turn.</p>
<p>In a handful of cases where the LLM stayed relatively vague and avoided providing specific information, enabling detailed query expansion occasionally led to a hallucination. This observation has concrete implications when it comes to chain-of-thought prompting: specific logical milestones must be congruent with query expansion chunks to maximize alignment.</p>
<h1>Limitations</h1>
<p>The forums Q&amp;A use case also helped surface some of the limitations of prompt engineering. For that data, we observed that the approach advocated earlier (allowing for uncertainty, providing more context, asking for an explanation) was not always effective.</p>
<p>For example, we initially thought that the following case:</p>
<p><strong>Q:</strong> <em>Help, where’s the download for 1099-K? There’s NO download button!</em></p>
<p><strong>A:</strong> <em>It is not possible to answer this question without more information. It is unclear what the 1099-K is, what it is used for, and where it can be downloaded from.</em></p>
<p>...would be readily addressed through more advanced prompt engineering focusing on additional contextual information. However, adding a contextual preamble did not help:</p>
<p><strong>Q:</strong> <em><strong>In the United States, Form 1099-K "Payment Card and Third Party Network Transactions" is a variant of Form 1099 used to report payments received through reportable payment card transactions and/or settlement of third-party payment network transactions.</strong> Help, where’s the download for 1099-K? There’s NO download button!</em></p>
<p><strong>A:</strong> <em>The context information provided does not include any information about downloading Form 1099-K. To find the download button for Form 1099-K, you will need to search for it on the website of the Internal Revenue Service (IRS) or other relevant government agency.</em></p>
<p>The preamble “<em><strong>In the United States, Form 1099-K [...] transactions.</strong></em>” was clearly not sufficient to lead to a useful answer to the original question.</p>
<p>Other times, the LLM punted with the expression: "<em>Without prior knowledge, it is impossible to determine...</em>" -- which was untrue, since the information was indeed available. We conclude that, even with the current crop of advanced reasoning models, when complex reasoning is required, chain-of-thought prompting must be carefully set up. In such situations, what seemed to help most was asking for specific source snippets, to determine where in the forums the LLM was looking for answers.</p>
<h1>Key learnings</h1>
<p>Our experiments with two flavors of AI-assisted onboarding (internal and external) indicate that, most of the time, prompt engineering works reasonably well as a practical alternative to parameter-based fine-tuning. Prompt engineering is conceptually simple and can often be leveraged to fix cases where LLMs generate non-factual information. It has limitations, however, so care should still be taken when assessing answer truthfulness. Asking the LLM to disclose specific sources is a good way to flag potential hallucinations.</p>
<h1>References</h1>
<p>[1] OpenAI blog. “<a href="https://openai.com/index/introducing-o3-and-o4-mini/">Introducing OpenAI o3 and o4-mini.</a>” 16 April 2025.</p>
<p>[2] Google DeepMind blog. “<a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/">Gemini 2.5: Our most intelligent AI model.</a>” 25 March 2025.</p>
<p>[3] Google Cloud Documentation. “<a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning">Tune Gemini models by using supervised fine-tuning.</a>” </p>
<p>[4] T. Brown et al. (2020). “<a href="https://arxiv.org/abs/2005.14165">Language models are few-shot learners.</a>” Proc. 2020 NeurIPS, Vol. 33, pp. 1877-1901.</p>
<p>[5] T. Gao, A. Fisch, and D. Chen (2021). “<a href="https://arxiv.org/abs/2012.15723">Making pre-trained language models better few-shot learners.</a>” Proc. 2021 ACL/IJCNLP, pp. 3816-3830.</p>
<p>[6] Google AI for Developers. “<a href="https://ai.google.dev/gemini-api/docs/embeddings">Embeddings.</a>”</p>
<p>[7] Google AI for Developers. “<a href="https://ai.google.dev/gemini-api/docs/prompting-strategies">Prompt design strategies.</a>” </p>
<h1>Appendix</h1>
<p>Various instantiations of prompt-based tuning have evolved over the years. The purpose of this Appendix is to give a little historical background and perspective on the set of practices that have come to be known, collectively, as prompt engineering.</p>
<p>Prompt-based tuning originally developed as an alternative form of model fine-tuning for situations where only few-shot demonstration examples are available for downstream tasks [4].  In such situations, prompt tuning makes it possible to use the same robust model for all tasks, as opposed to producing a separate, more brittle model for each one. This approach typically treats the model as a black box and encodes either the input as trainable sequences or the output as training features. It has been shown to drastically outperform conventional transfer learning procedures in the low resource setting [5].</p>
<p>More recently, prompt engineering has gained further prominence in the literature as a way to circumvent parameter-based fine-tuning of LLMs. Such fine-tuning is extremely challenging with an LLM, as any backpropagation through the entire model requires substantial computational resources. With LLMs being trained on an ever increasing amount and diversity of data, leading to an increasingly richer latent space, there is correspondingly less need to do the hard and costly work of updating model weights.</p>
<p>Prompt-based tuning only requires that task-specific documents be ingested and folded into that rich latent space. This can be done efficiently by simply updating the underlying index to account for the newly folded data. Once this is done, tuning typically relies on embedding-based search [6]: every prompt is converted to a suitable embedding in latent space by the embeddings API, text sections generally relevant to the prompt are identified using similarity in embedding space, the prompt is augmented with the most relevant text sections, and the LLM responds accordingly.</p>
<p>In this context, prompt engineering can be viewed as a form of query expansion, a well-established process to improve information retrieval. On top of that, further tuning often involves the equivalent of query reformulation, typically achieved through instruction prompting, role prompting, and few-shot prompting. For a list of best practices in this area, see "<a href="https://ai.google.dev/gemini-api/docs/prompting-strategies">Prompt design strategies</a>" [7].</p>]]></description>
                <content:encoded><![CDATA[<p><em>This post investigates the benefits and limitations of prompt engineering in two instances of AI-assisted onboarding relying on large language model (LLM) technology. Of particular interest is how truthful (and therefore reliable) LLM-generated answers turn out to be in the context of Etsy-specific question answering. Among other insights, we find that asking the LLM to identify specific source snippets is a good way to flag potential hallucinations.</em></p>
<p>Over the past few years, pre-trained large-scale/foundation language models such as OpenAI’s <a href="https://openai.com/index/introducing-o3-and-o4-mini/">o-series</a> [1] and Google’s <a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/">Gemini family</a> [2] have revolutionized the field of natural language processing (NLP). Trained on vast amounts of text, images, code, audio, and videos, such models encapsulate a great deal of world knowledge, which can be called upon to perform a wide range of downstream tasks, such as sentiment analysis, language translation, and natural language inference, among many others.</p>
<p>The canonical way to improve the task performance of a pre-trained general language model, when it needs specific knowledge beyond its original training, is called <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning">fine-tuning</a> [3]. The model in a sense receives supplementary training, typically applied only in its last layers, on a relevant, labeled dataset. This is a robust but somewhat expensive approach, and only possible when there’s a sufficiently large dataset to support it. With the greatly increased size and complexity of LLMs, though, <strong>prompt-based tuning</strong> has emerged as an effective and much less costly alternative. The parameters of the model remain unchanged, and instead it’s through the prompt—the instructions given the model to perform its task—that tuning is done. At its simplest a prompt is just a question, but prompts can also incorporate specific chunks of text, provide the model with task-related constraints or preferences, and in general shape the model toward producing reliable responses. <em>Prompt engineering</em> encompasses the art and science of tuning prompts so they yield optimal results on a given task.</p>
<p>One of the places at Etsy where an assistive AI could be useful is onboarding: in principle, an LLM could answer a range of questions about both internal and external Etsy policies and procedures. But rather than ramp up an effort to fine-tune an onboarding model with a dataset of Etsy documents, we wanted to assess whether prompts alone could elicit truthful answers. We decided to start with a limited-scale pilot project, focusing on answering questions specific to the Travel &amp; Entertainment (T&amp;E) domain. On the one hand, this is a well circumscribed domain with clear and unambiguous rules. On the other hand, most Etsy employees still have questions for every trip. This article summarizes our journey of discovery, and discusses in particular the extent to which the addition of simple explicit instructions works to mitigate LLM errors.</p>
<h1>Question answering</h1>
<p>Perhaps the most critical aspect of a question answering system is its reliability, i.e., whether it is able to provide a truthful answer to any in-domain question. In the AI-assisted onboarding use case considered here, we want new Etsy employees to be confident that their questions about the T&amp;E policy are answered correctly.</p>
<p>The first step was to feed Etsy-specific data into the LLM. If we were fine-tuning, we would update (a subset of) model weight parameters from an appropriate collection of Etsy internal documents related to T&amp;E. Prompt engineering, on the other hand, freezes the model weights, treating the LLM as a black box. A number of such black-box tuning techniques exist in the field, which we review in the Appendix to this article. Prompt-based tuning was an attractive proposition in our case because all that it required was simply an adequate representation of task-specific documents.</p>
<p>We therefore folded all T&amp;E documents into the rich latent space of the LLM (also known as the embedding space). In practice, this can be done efficiently by simply updating the underlying index to account for the newly folded data. Then, as indicated in the Appendix, question answering proceeded by leveraging <a href="https://ai.google.dev/gemini-api/docs/embeddings">embedding-based search</a> [6]. For every query, embedding the initial formulation led to relevant Etsy-specific text sections, which were then added to the question to form the final prompt.</p>
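<p>As a rough illustration of the retrieve-then-augment flow described above, here is a minimal sketch. It is not the system used in this post: the <code>embed</code> function is a stand-in bag-of-words vector rather than a real embeddings API, the prompt layout is an assumption, and the second and third sections are placeholder text. Only the corporate card sentence comes from the example discussed below.</p>

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding (hypothetical): a bag-of-words count vector."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # a Counter returns 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question, sections, top_k=2):
    """Rank sections by similarity to the question and prepend the best ones."""
    q = embed(question)
    ranked = sorted(sections, key=lambda s: cosine(q, embed(s)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return "Context information:\n" + context + "\n\nQuestion: " + question


sections = [
    "Etsy pays the corporate card company directly on behalf of the Employee.",
    "Placeholder text about booking flights.",  # illustrative filler
    "Placeholder text about hotel stays.",      # illustrative filler
]
question = "Who pays the balance on my corporate card?"
prompt = build_prompt(question, sections, top_k=1)
print(prompt)
```

<p>In a real deployment the stand-in <code>embed</code> would be replaced by calls to an embeddings service and the linear scan by an index lookup, but the shape of the final prompt (retrieved sections followed by the question) stays the same.</p>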
<p>We tested the resulting system’s performance on a manually curated set of 40 question-and-answer pairs. For each question in the test set, we compared the answer generated by the LLM with the answer we had extracted from the relevant policy document to form a judgment of answer quality. In the authors’ estimation, the LLM answered approximately 86% of the questions correctly, or at least satisfactorily enough that there was no need for further prompt maneuvering. In the remaining 14% of cases, the LLM generated an answer which was either wrong or misleading. For example, the LLM asserted with high confidence that it's the cardholder who is responsible for the balance on a corporate credit card:</p>
<p><strong>Q:</strong> <em>Who pays the balance on my corporate card after my expense report is approved?</em></p>
<p><strong>Correct answer:</strong> <em>Etsy pays the corporate card company directly on behalf of the Employee.</em></p>
<p><strong>LLM answer:</strong> <em>The cardholder is responsible for paying the balance on their corporate card after their expense report is approved.</em></p>
<p>The statement is wrong under the terms of the corporate card agreement. Similarly, the LLM falsely claimed that Etsy doesn't reimburse the cost of checking bags on a business trip (noting in its answer that "<em>the context information provided does not mention any expenses related to checking bags</em>," another falsehood).</p>
<h1>Mitigating hallucinations</h1>
<p>LLM falsehoods of the kind illustrated in the previous section (wrong, but confidently stated) have become known as “hallucinations.” Prompt engineering offers us a number of ways of addressing and overcoming them. We can craft narrower and more explicit input instructions. We can augment prompts with relevant contextual information. We can even ask the LLM to answer as if it occupied a particular role, say, data analyst or financial manager (role prompting).</p>
<p>In the case of the above hallucination about our corporate cards, we first attempted the explicit-instruction route, asking the LLM to be clear about what it didn’t know:</p>
<p><strong>Q:</strong> <em>Say "I have no idea" when you don't know the answer. Who pays the balance on my corporate card after my expense report is approved?</em></p>
<p><strong>A:</strong> <em>I have no idea.</em></p>
<p>We at least avoid the hallucination, but not in a way that will be very helpful to the user. Also note that the LLM misses the correct answer, which is actually present in the collection of documents.</p>
<p>Another try, prompting the LLM for chain-of-thought reasoning, finally led to the correct answer:</p>
<p><strong>Q:</strong> <em>Say "I have no idea" when you don't know the answer. Who pays the balance on my corporate card after my expense report is approved, and why do you think so?</em></p>
<p><strong>A:</strong> <em>The balance on your corporate card is paid by Etsy. This information is mentioned on page 42.</em></p>
<p>The above is a good illustration of how three different prompts for the same question can lead to widely divergent answers. When the internal state of the LLM is confused about a particular topic, letting it express uncertainty in the answer can often mitigate undue overconfidence. In addition, prompting for chain-of-thought reasoning is often beneficial, as it provides a way to structure the prompt with logical milestones to ensure that the LLM follows a well-circumscribed reasoning path. While it may not seem very intuitive that just adding "why do you think so" should change an all-but unresponsive answer into a correct one, the mere fact of asking for an explanation raises the bar in terms of fact checking – a side benefit of prompt engineering which doesn't require additional investment.</p>
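<p>The three prompt formulations compared above differ only in small wrappers around the same question. The sketch below makes that explicit; the function names are ours and purely illustrative, while the instruction text and the question are the ones quoted in this section.</p>

```python
UNCERTAINTY_INSTRUCTION = 'Say "I have no idea" when you don\'t know the answer.'


def plain_prompt(question):
    """First attempt: the question as asked."""
    return question


def uncertainty_prompt(question):
    """Second attempt: allow the model to admit it does not know."""
    return UNCERTAINTY_INSTRUCTION + " " + question


def explanation_prompt(question):
    """Third attempt: also ask the model to justify its answer."""
    stem = question.rstrip("?")
    return UNCERTAINTY_INSTRUCTION + " " + stem + ", and why do you think so?"


question = "Who pays the balance on my corporate card after my expense report is approved?"
for build in (plain_prompt, uncertainty_prompt, explanation_prompt):
    print(build(question))
```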
<h1>Etsy community forums</h1>
<p>Encouraged by those findings, we decided to expand our investigation to an external use case: question answering based on historical data from Etsy seller community forums. Unlike the T&amp;E data we’ve just seen, Etsy forums contain publicly available content. Sellers ask questions such as: “What are the most effective ways to optimize my Etsy shop’s SEO to increase visibility and sales?” These questions are then answered by either Etsy or community members. As a result, compared to the T&amp;E data, Etsy forum data is substantially more complex: much less homogeneous (both in terms of style and scope), with answers by nature much more opinionated. As such, assessing the accuracy of LLM-based question answering is a little trickier for community forums than for the T&amp;E policy.</p>
<p>To note, because forums are a space where sellers engage with and offer information to one another, there may be some divergence between answers provided in the forums and official Etsy policies &amp; practices. For the purposes of our investigation, we measured only whether the LLM could accurately surface responses based on the input data it was given. A high score would indicate a helpful tool not only at seller onboarding time, but also to help both Etsy and community members routinely search this growing community database in a faster and more accurate fashion.</p>
<p>We proceeded exactly as before, first folding recent Etsy forum data into the latent space and then updating the underlying index to account for the newly folded data. Similar to the previous case study, we tested performance on a manually curated set of 50 question-and-answer pairs, comparing LLM-generated answers with answers extracted from the relevant Etsy forum to form a judgment of answer quality.</p>
<p>In the authors’ estimation, the LLM answered approximately 72% of the questions correctly before any prompt maneuvering. In the remaining 28% of cases, the LLM generated an answer which was either wrong or misleading. Given the higher heterogeneity of the data, it is not overly surprising that the percentage of poor answers doubles.</p>
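<p>For readers who want to reproduce this kind of tally, a minimal sketch follows. The list of per-question judgments is hypothetical: the post reports only percentages, and a split of 36 acceptable answers out of 50 is simply the count consistent with the 72% figure above.</p>

```python
def answer_rates(judgments):
    """Return (percent correct, percent wrong or misleading) for a list of booleans."""
    total = len(judgments)
    correct = sum(1 for ok in judgments if ok)
    return 100 * correct / total, 100 * (total - correct) / total


# Hypothetical manual judgments for the 50 forum questions (True = acceptable answer).
forum_judgments = [True] * 36 + [False] * 14
correct_pct, poor_pct = answer_rates(forum_judgments)
print(correct_pct, poor_pct)

# Ratio of the forum error rate to the 14% reported for the T&E pilot.
print(poor_pct / 14)
```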
<p>Comparing manually extracted and automatically generated answers is informative, not just from the point of view of seeing natural language outcomes, but also from the perspective of analyzing how such outcomes are pieced together from tidbits of information in latent space. Everything else being equal, the LLM tends to do better when a query formulation is close to the exact wording used in reference documents, and worse when the answer depends on multiple conditions being satisfied in turn.</p>
<p>In a handful of cases where the LLM stayed relatively vague and avoided providing specific information, enabling detailed query expansion occasionally led to a hallucination. This observation has concrete implications when it comes to chain-of-thought prompting: specific logical milestones must be congruent with query expansion chunks to maximize alignment.</p>
<h1>Limitations</h1>
<p>The forums Q&amp;A use case also helped surface some of the limitations of prompt engineering. For that data, we observed that the approach advocated earlier (allowing for uncertainty, providing more context, asking for an explanation) was not always effective.</p>
<p>For example, we initially thought that the following case:</p>
<p><strong>Q:</strong> <em>Help, where’s the download for 1099-K? There’s NO download button!</em></p>
<p><strong>A:</strong> <em>It is not possible to answer this question without more information. It is unclear what the 1099-K is, what it is used for, and where it can be downloaded from.</em></p>
<p>...would be readily addressed through more advanced prompt engineering focusing on additional contextual information. However, adding a contextual preamble did not help:</p>
<p><strong>Q:</strong> <em><strong>In the United States, Form 1099-K "Payment Card and Third Party Network Transactions" is a variant of Form 1099 used to report payments received through reportable payment card transactions and/or settlement of third-party payment network transactions.</strong> Help, where’s the download for 1099-K? There’s NO download button!</em></p>
<p><strong>A:</strong> <em>The context information provided does not include any information about downloading Form 1099-K. To find the download button for Form 1099-K, you will need to search for it on the website of the Internal Revenue Service (IRS) or other relevant government agency.</em></p>
<p>The preamble “<em><strong>In the United States, Form 1099-K [...] transactions.</strong></em>” was clearly not sufficient to lead to a useful answer to the original question.</p>
<p>Other times, the LLM punted with the expression: "<em>Without prior knowledge, it is impossible to determine...</em>" -- which was untrue, since the information was indeed available. We conclude that, even with the current crop of advanced reasoning models, when complex reasoning is required, chain-of-thought prompting must be carefully set up. In such situations, what seemed to help most was asking for specific source snippets, to determine where in the forums the LLM was looking for answers.</p>
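<p>One way to act on requested source snippets is to check them mechanically against the text that was actually supplied as context. The sketch below is an assumption about how such a check could look, not a description of our tooling: it presumes the LLM was asked to quote its sources and that the quotes have already been parsed into a list. The example strings are taken from the corporate card case earlier in this post.</p>

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def unsupported_snippets(cited_snippets, context_sections):
    """Return the cited snippets that appear in none of the context sections."""
    haystack = [normalize(section) for section in context_sections]
    return [
        snippet
        for snippet in cited_snippets
        if not any(normalize(snippet) in section for section in haystack)
    ]


context = ["Etsy pays the corporate card company directly on behalf of the Employee."]
cited = [
    "Etsy pays the corporate card company directly",
    "The cardholder is responsible for paying the balance",
]
flagged = unsupported_snippets(cited, context)
print(flagged)  # snippets with no support in the context are candidates for review
```

<p>A snippet that cannot be found in the supplied context does not prove a hallucination, but it is a cheap signal that the answer deserves a closer look.</p>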
<h1>Key learnings</h1>
<p>Our experiments with two flavors of AI-assisted onboarding (internal and external) indicate that, most of the time, prompt engineering works reasonably well as a practical alternative to parameter-based fine-tuning. Prompt engineering is conceptually simple and can often be leveraged to fix cases where LLMs generate non-factual information. It has limitations, however, so care should still be taken when assessing answer truthfulness. Asking the LLM to disclose specific sources is a good way to flag potential hallucinations.</p>
<h1>References</h1>
<p>[1] OpenAI blog. “<a href="https://openai.com/index/introducing-o3-and-o4-mini/">Introducing OpenAI o3 and o4-mini.</a>” 16 April 2025.</p>
<p>[2] Google DeepMind blog. “<a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/">Gemini 2.5: Our most intelligent AI model.</a>” 25 March 2025.</p>
<p>[3] Google Cloud Documentation. “<a href="https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning">Tune Gemini models by using supervised fine-tuning.</a>” </p>
<p>[4] T. Brown et al. (2020). “<a href="https://arxiv.org/abs/2005.14165">Language models are few-shot learners.</a>” Proc. 2020 NeurIPS, Vol. 33, pp. 1877-1901.</p>
<p>[5] T. Gao, A. Fisch, and D. Chen (2021). “<a href="https://arxiv.org/abs/2012.15723">Making pre-trained language models better few-shot learners.</a>” Proc. 2021 ACL/IJCNLP, pp. 3816-3830.</p>
<p>[6] Google AI for Developers. “<a href="https://ai.google.dev/gemini-api/docs/embeddings">Embeddings.</a>”</p>
<p>[7] Google AI for Developers. “<a href="https://ai.google.dev/gemini-api/docs/prompting-strategies">Prompt design strategies.</a>” </p>
<h1>Appendix</h1>
<p>Various instantiations of prompt-based tuning have evolved over the years. The purpose of this Appendix is to give a little historical background and perspective on the set of practices that have come to be known, collectively, as prompt engineering.</p>
<p>Prompt-based tuning originally developed as an alternative form of model fine-tuning for situations where only few-shot demonstration examples are available for downstream tasks [4].  In such situations, prompt tuning makes it possible to use the same robust model for all tasks, as opposed to producing a separate, more brittle model for each one. This approach typically treats the model as a black box and encodes either the input as trainable sequences or the output as training features. It has been shown to drastically outperform conventional transfer learning procedures in the low resource setting [5].</p>
<p>More recently, prompt engineering has gained further prominence in the literature as a way to circumvent parameter-based fine-tuning of LLMs. Such fine-tuning is extremely challenging with an LLM, as any backpropagation through the entire model requires substantial computational resources. With LLMs being trained on an ever increasing amount and diversity of data, leading to an increasingly richer latent space, there is correspondingly less need to do the hard and costly work of updating model weights.</p>
<p>Prompt-based tuning only requires that task-specific documents be ingested and folded into that rich latent space. This can be done efficiently by simply updating the underlying index to account for the newly folded data. Once this is done, tuning typically relies on embedding-based search [6]: every prompt is converted to a suitable embedding in latent space by the embeddings API, text sections generally relevant to the prompt are identified using similarity in embedding space, the prompt is augmented with the most relevant text sections, and the LLM responds accordingly.</p>
<p>In this context, prompt engineering can be viewed as a form of query expansion, a well-established process to improve information retrieval. On top of that, further tuning often involves the equivalent of query reformulation, typically achieved through instruction prompting, role prompting, and few-shot prompting. For a list of best practices in this area, see "<a href="https://ai.google.dev/gemini-api/docs/prompting-strategies">Prompt design strategies</a>" [7].</p>]]></content:encoded>
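<p>To make the three reformulation techniques concrete, here is a small sketch that assembles a prompt from an optional role, an optional instruction, and optional few-shot examples. The <code>compose_prompt</code> helper and its Q/A layout are illustrative assumptions rather than a prescribed format; the role, instruction, and worked example reuse material from the main text.</p>

```python
def compose_prompt(question, role=None, instruction=None, examples=()):
    """Combine role, instruction, and few-shot examples ahead of the question."""
    parts = []
    if role:
        parts.append("You are " + role + ".")        # role prompting
    if instruction:
        parts.append(instruction)                    # instruction prompting
    for example_q, example_a in examples:            # few-shot prompting
        parts.append("Q: " + example_q + "\nA: " + example_a)
    parts.append("Q: " + question + "\nA:")
    return "\n\n".join(parts)


prompt = compose_prompt(
    "Does Etsy reimburse the cost of checking bags on a business trip?",
    role="a financial manager",
    instruction='Say "I have no idea" when you don\'t know the answer.',
    examples=[
        (
            "Who pays the balance on my corporate card after my expense report is approved?",
            "Etsy pays the corporate card company directly on behalf of the Employee.",
        )
    ],
)
print(prompt)
```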
            </item>
        
            <item>
                <title><![CDATA[Behind the Scenes - A Glimpse into Tax Calculations]]></title>
                <link>https://www.etsy.com/codeascraft/behind-the-scenes---a-glimpse-to-tax-calculations?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Nancy Arnold]]></dc:creator>
                <pubDate>Sat, 29 Mar 2025 06:03:23 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/behind-the-scenes---a-glimpse-to-tax-calculations?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>In the past, sellers were responsible for managing and fulfilling their own tax obligations. 
However, more and more jurisdictions are now requiring marketplaces such as Etsy to collect the tax from buyers and remit the tax to the relevant authorities. Etsy now plays an active role in collecting tax from buyers and remitting it all over the world. In this post, I will walk you through our tax calculation infrastructure and how we adapted to the ongoing increase in traffic and business needs over the years.</p>
<p><strong>The tax calculation workflow</strong></p>
<p>We determine tax whenever a buyer adds an item to their Etsy shopping cart. The tax determination is based on buyer and seller location and product category, and a set of tax rules and mappings. To handle the details of these calculations we partner with Vertex, and issue a call to their <a href="https://www.vertexinc.com/">tax engine</a> via the Quotation Request API to get the right amount to show in our buyer's cart. </p>
<p>Vertex ensures accurate and efficient tax management and continuously updates the tax rules and rates for jurisdictions around the world. The two main API calls we use are Quotation Request and DistributeTaxRequest SOAP calls.</p>
<p>When the buyer proceeds to payment, an order is created, and we call back to Vertex with a DistributeTaxRequest sending the order information and tax details. We sync information with Vertex through the order fulfillment lifecycle. To keep things up to date in case an order is canceled or a refund needs to be issued later on, we send the details of cancellations and refunds to the tax engine via DistributeTaxRequest. This ensures that when Vertex generates tax reports for us they will be based on a complete record of all the relevant transactions. 
Etsy collects the tax from the buyers and remits that tax to the taxing authority, when required. </p>
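<p>The workflow above boils down to a mapping from points in the order lifecycle to the two request types. The sketch below captures that mapping only; the event names and the <code>request_type_for</code> helper are hypothetical, and no actual Vertex client code or SOAP payload is shown.</p>

```python
from enum import Enum


class OrderEvent(Enum):
    """Hypothetical lifecycle events, named for illustration only."""
    ITEM_ADDED_TO_CART = "item_added_to_cart"
    ORDER_CREATED = "order_created"
    ORDER_CANCELED = "order_canceled"
    REFUND_ISSUED = "refund_issued"


# Which request type is issued at each point, following the description above.
REQUEST_TYPE_BY_EVENT = {
    OrderEvent.ITEM_ADDED_TO_CART: "Quotation Request",
    OrderEvent.ORDER_CREATED: "DistributeTaxRequest",
    OrderEvent.ORDER_CANCELED: "DistributeTaxRequest",
    OrderEvent.REFUND_ISSUED: "DistributeTaxRequest",
}


def request_type_for(event):
    """Return the name of the request type sent to the tax engine for an event."""
    return REQUEST_TYPE_BY_EVENT[event]


for event in OrderEvent:
    print(event.value, "->", request_type_for(event))
```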
<p><strong>Generating tax details for reporting and audit purposes</strong></p>
<p>Vertex comes with a variety of report formats out of the box, and gives us tools to define our own. When Etsy calls the Distribute Tax API, Vertex saves the information we pass to it as raw metadata in its tax journal database. A daily cron job in Vertex then moves this data to the transaction detail table, populating it with tax info. When reports and audit data are generated, we download these reports and import to Etsy’s bigdata and the workflow completes.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/b7e136/5660475277/inv_fullxfull.5660475277_6aqobaes.jpg?version=0" />
</figure>
<p><strong>Mapping the Etsy taxonomy to tax categories</strong></p>
<p>Etsy maintains product categories to help our buyers find exactly the items they're looking for. </p>
<p>To determine whether transactions are taxed or exempt it's not enough to know item prices and buyer locations: we have to map our product categories to Vertex's rule drivers. That was an effort involving not just engineering but also our tax and analytics teams, and with the wide range of Etsy taxonomy categories it was no small task.</p>
<p><strong>Handling increased API traffic</strong></p>
<p>Coping with the continuous increase in traffic while maintaining a checkout experience free of delays has been a constant challenge. Of the upgrades we made, the most important were switching to multiple instances for Vertex calls and shadowing.</p>
<p><strong>Multiple Instance upgrade</strong></p>
<p>In our initial integration, we used the same Vertex instance for Quotation and Distribute calls, and that same instance was also responsible for generating reports. Report generation started to affect our checkout experience. </p>
<p>Reports are generally used by our tax team and they run them on a regular basis. But on top of that, we also run daily reports to feed the data captured by Vertex back into our own system for analytics purposes.</p>
<p>We solved this by routing the quotation calls to one instance and the distribute calls to another. This maintained a clear separation of functionality and avoided interference between the two processes. We had to align the configurations between the instances as well. 
Splitting up the quotation and distribution calls also opened the door to horizontal scaling: we can now add as many instances of each type as we need and load balance the requests between them. </p>
<p>For example, when a request type lists multiple instances, we load balance between them using the cart_id for quotations and the receipt_id for distributes, i.e. cart_id % quotation_instance_count.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/55e80d/5612390760/inv_fullxfull.5612390760_2x0ox160.jpg?version=0" />
</figure>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/54571d/5660478985/inv_fullxfull.5660478985_h59bbdge.jpg?version=0" />
</figure>
<p><strong>Shadow logging</strong></p>
<p>Shadow logging the requests helped us simulate the stress on Vertex and monitor the checkout experience. We have used this technique multiple times in the past.</p>
<p>Whenever we faced a situation like adding five hundred thousand more listings whose taxes would pass through the Vertex engine, we were concerned that the increase in traffic might impact the buyer experience. To ensure it wouldn't, we tested for a period of time by slowly ramping up shadow requests to Vertex. "Shadow requests" are test requests that we send to Vertex from orders, but without applying the calculated tax details to buyers' carts. This simulates the load on Vertex while we monitor the cart checkout experience. </p>
<p>Once shadowing had shown us how well Vertex handled the increased traffic, we could be confident that releasing the feature would not have any performance implications.</p>
<p><strong>Conclusion</strong></p>
<p>Given the growing volume of traffic and data involved, we will have to keep improving our design to support it. We've also had to address analytics, reporting, configuration sync, and more in designing the system, but we'll leave that story for next time.</p>]]></description>
                <content:encoded><![CDATA[<p>In the past, sellers were responsible for managing and fulfilling their own tax obligations. 
However, more and more jurisdictions are now requiring marketplaces such as Etsy to collect the tax from buyers and remit the tax to the relevant authorities. Etsy now plays an active role in collecting tax from buyers and remitting it all over the world. In this post, I will walk you through our tax calculation infrastructure and how we adapted to the ongoing increase in traffic and business needs over the years.</p>
<p><strong>The tax calculation workflow</strong></p>
<p>We determine tax whenever a buyer adds an item to their Etsy shopping cart. The tax determination is based on buyer and seller location and product category, and a set of tax rules and mappings. To handle the details of these calculations we partner with Vertex, and issue a call to their <a href="https://www.vertexinc.com/">tax engine</a> via the Quotation Request API to get the right amount to show in our buyer's cart. </p>
<p>Vertex ensures accurate and efficient tax management and continuously updates the tax rules and rates for jurisdictions around the world. The two main SOAP calls we use are Quotation Request and DistributeTaxRequest.</p>
<p>When the buyer proceeds to payment, an order is created, and we call back to Vertex with a DistributeTaxRequest sending the order information and tax details. We sync information with Vertex through the order fulfillment lifecycle. To keep things up to date in case an order is canceled or a refund needs to be issued later on, we send the details of cancellations and refunds to the tax engine via DistributeTaxRequest. This ensures that when Vertex generates tax reports for us they will be based on a complete record of all the relevant transactions. 
Etsy collects the tax from the buyers and remits that tax to the taxing authority, when required. </p>
<p><strong>Generating tax details for reporting and audit purposes</strong></p>
<p>Vertex comes with a variety of report formats out of the box, and gives us tools to define our own. When Etsy calls the Distribute Tax API, Vertex saves the information we pass to it as raw metadata in its tax journal database. A daily cron job in Vertex then moves this data to the transaction detail table, populating it with tax info. Once reports and audit data are generated, we download them and import them into Etsy’s big data systems, which completes the workflow.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/b7e136/5660475277/inv_fullxfull.5660475277_6aqobaes.jpg?version=0" />
</figure>
<p><strong>Mapping the Etsy taxonomy to tax categories</strong></p>
<p>Etsy maintains product categories to help our buyers find exactly the items they're looking for. </p>
<p>To determine whether transactions are taxed or exempt it's not enough to know item prices and buyer locations: we have to map our product categories to Vertex's rule drivers. That was an effort involving not just engineering but also our tax and analytics teams, and with the wide range of Etsy taxonomy categories it was no small task.</p>
<p><strong>Handling increased API traffic</strong></p>
<p>Coping with the continuous increase in traffic while maintaining a checkout experience free of delays has been a constant challenge. Of the upgrades we made, the most important were switching to multiple instances for Vertex calls and shadowing.</p>
<p><strong>Multiple Instance upgrade</strong></p>
<p>In our initial integration, we used the same Vertex instance for Quotation and Distribute calls, and that same instance was also responsible for generating reports. Report generation started to affect our checkout experience. </p>
<p>Reports are generally used by our tax team and they run them on a regular basis. But on top of that, we also run daily reports to feed the data captured by Vertex back into our own system for analytics purposes.</p>
<p>We solved this by routing the quotation calls to one instance and the distribute calls to another. This maintained a clear separation of functionality and avoided interference between the two processes. We had to align the configurations between the instances as well. 
Splitting up the quotation and distribution calls also opened the door to horizontal scaling: we can now add as many instances of each type as we need and load balance the requests between them. </p>
<p>For example, when a request type lists multiple instances, we load balance between them using the cart_id for quotations and the receipt_id for distributes, i.e. cart_id % quotation_instance_count.</p>
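<p>The modulo routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instance hostnames, the INSTANCES mapping, and the pick_instance helper are hypothetical names, not part of our production code.</p>
<pre><code class="language-python"># Hypothetical instance pools; the hostnames are placeholders, not real endpoints.
INSTANCES = {
    "quotation": ["vertex-quote-0.example", "vertex-quote-1.example", "vertex-quote-2.example"],
    "distribute": ["vertex-dist-0.example", "vertex-dist-1.example"],
}


def pick_instance(request_type, routing_id):
    """Return the instance for a request using routing_id % instance_count."""
    pool = INSTANCES[request_type]
    return pool[routing_id % len(pool)]


# Quotations are keyed by cart_id, distributes by receipt_id.
quote_host = pick_instance("quotation", 1234567)
dist_host = pick_instance("distribute", 987654321)
</code></pre>
<p>Because the same id always maps to the same instance, repeated requests for one cart or receipt land on the same host for as long as the pool size stays unchanged.</p>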
<figure>
<img alt="" src="https://i.etsystatic.com/inv/55e80d/5612390760/inv_fullxfull.5612390760_2x0ox160.jpg?version=0" />
</figure>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/54571d/5660478985/inv_fullxfull.5660478985_h59bbdge.jpg?version=0" />
</figure>
<p><strong>Shadow logging</strong></p>
<p>Shadow logging the requests helped us simulate the stress on Vertex and monitor the checkout experience. We have used this technique multiple times in the past.</p>
<p>Whenever we faced a situation like adding five hundred thousand more listings whose taxes would pass through the Vertex engine, we were concerned that the increase in traffic might impact the buyer experience. To ensure it wouldn't, we tested for a period of time by slowly ramping up shadow requests to Vertex. "Shadow requests" are test requests that we send to Vertex from orders, but without applying the calculated tax details to buyers' carts. This simulates the load on Vertex while we monitor the cart checkout experience. </p>
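<p>One way to picture a gradual shadow ramp is a percentage gate in front of the shadow call. The Python sketch below is an assumption-laden illustration, not our actual mechanism: the ramp_percent parameter, the bucketing by cart_id % 100, and the should_shadow and handle_cart helpers are all hypothetical.</p>
<pre><code class="language-python">def should_shadow(cart_id, ramp_percent):
    """Deterministically select roughly ramp_percent of carts for shadow requests."""
    bucket = cart_id % 100  # assumed bucketing scheme, for illustration only
    return ramp_percent > bucket


def handle_cart(cart_id, ramp_percent, send_to_vertex):
    """Send a shadow request when selected, and never apply its result to the cart."""
    if should_shadow(cart_id, ramp_percent):
        send_to_vertex(cart_id)  # response is discarded; the call only generates load
        return "shadowed"
    return "skipped"


# Simulate a 10% ramp over 1000 carts, recording which carts were shadowed.
sent = []
results = [handle_cart(cart_id, 10, sent.append) for cart_id in range(1000)]
shadow_count = results.count("shadowed")
</code></pre>
<p>Raising ramp_percent step by step increases the simulated load while the buyer-facing cart stays untouched, which is the property the shadow approach relies on.</p>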
<p>Once shadowing had shown us how well Vertex handled the increased traffic, we could be confident that releasing the feature would not have any performance implications.</p>
<p><strong>Conclusion</strong></p>
<p>Given the growing volume of traffic and data involved, we will have to keep improving our design to support it. We've also had to address analytics, reporting, configuration sync, and more in designing the system, but we'll leave that story for next time.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Adopting Jetpack Compose for Etsy’s Android App]]></title>
                <link>https://www.etsy.com/codeascraft/adopting-jetpack-compose-for-etsy39s-android-app?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[John Kalmi]]></dc:creator>
                <pubDate>Tue, 28 Jan 2025 10:25:11 -0500</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/adopting-jetpack-compose-for-etsy39s-android-app?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>One of our <a href="https://www.etsy.com/about">Guiding Principles</a> at Etsy is that we “commit to our craft.” This means that we have a culture of learning, in which we’re constantly looking for opportunities to improve and learn, adopt industry best practices, and share our findings with our colleagues and our community. </p>
<p>As part of that process, Etsy recently adopted <a href="https://developer.android.com/compose">Jetpack Compose</a> – Android's modern toolkit for defining native UIs – as our preferred means of building our Android app. The process of adoption consisted of a gradual expansion in the size and complexity of features built using Compose, eventually culminating in a full rewrite of one of the primary screens in the app. The results of that rewrite gave us the confidence to recommend Compose as the primary tool for our Android engineers to build UIs going forward.</p>
<h2>Adoption</h2>
<p>Our engineers are always investigating the latest industry trends and technologies, but in this case a more structured approach was warranted due to the central nature of UI toolkits in the development process. Several engineers on the Android team were assigned to study the existing Compose documentation and examples we had used in prior builds and then create a short curriculum based on what they learned. Over several months, the team held multiple information sessions with the entire Android group, showing how to use Compose to build simple versions of some of our real app screens.</p>
<figure>
<img alt="A sample curriculum section" src="https://i.etsystatic.com/inv/b7961f/6581128905/inv_fullxfull.6581128905_lyn8t9v6.jpg?version=0" />
<figcaption>Part of our in-house curriculum for learning Jetpack Compose via small modules. Each module built upon the previous module to build more complex versions of various features in our real app.</figcaption>
</figure>
<p>Next, our Design Systems team started creating Compose versions of our internal UI toolkit components, with the goal of having a complete Compose implementation of our design system before major adoption. Compose is designed for <a href="https://developer.android.com/develop/ui/compose/migrate/interoperability-apis">interoperability</a> with our existing toolkit, XML Views, providing an uncomplicated migration path that enables us to start using these new toolkit components in our existing XML Views with minimal disruption. This was our first chance to validate that the performance of Compose would be as good as or better than our existing toolkit components. This also gave the wider Android community at Etsy a chance to start using Compose in their day-to-day work and get comfortable with the new patterns Compose introduced.</p>
<figure>
<img alt="A partial list of design system components" src="https://i.etsystatic.com/inv/fbc7f1/6634642913/inv_fullxfull.6634642913_n94tu2z7.jpg?version=0" title="" height="400" />
<figcaption>A partial list of the design system components our team was able to make available in Compose.</figcaption>
</figure>
<p>Our Design Systems team also made heavy use of one of Compose’s most powerful features: <a href="https://developer.android.com/develop/ui/compose/tooling/previews">Previews</a>. Compose Previews allow a developer to visualize Composables in as many configurations as they want using arbitrary test data, all without having to run the app on a device. Every time the team made a change to a Design Systems Composable, they could validate the effect in a wide range of scenarios. </p>
<p>After a few months of building and adopting toolkit components in Compose, our team felt it was time for a more significant challenge: rebuilding an entire screen. To prevent inadvertently causing a disruption for buyers or sellers on Etsy, we chose a heavily used screen only available in our backend for development builds. This step exposed us to a much wider scope of concerns: navigation, system UI, data fetching using <a href="https://developer.android.com/kotlin/coroutines">coroutines</a> from our API, and the orchestration of multiple Compose components interacting with each other. Using Kotlin <a href="https://developer.android.com/kotlin/flow">Flows</a>, we worked out how to structure our business and UI logic around a unidirectional data flow, a key unlock for future integration of Compose with <a href="https://www.etsy.com/codeascraft/macrame-untangling-the-knot-on-the-android-listing-screen">Macramé</a> – our standard architecture for use across all screens in the Etsy app.</p>
<p>With a full internal screen under our belts, it was time to put Compose in front of real users. A few complex bottom sheets were the next pieces of our app to get the Compose treatment. For the first time, we exposed a major part of our UI, now fully written in Compose, to buyers and sellers on Etsy. We also paired a simple version of our Macramé architecture with these bottom sheets to prove that the two were compatible. </p>
<figure>
<img alt="A Compose bottom sheet" src="https://i.etsystatic.com/inv/891968/6533015670/inv_fullxfull.6533015670_ql85bqq3.jpg?version=0" title="" height="400" />
<figcaption>A bottom sheet fully using Compose hosted inside of a screen built using Views.</figcaption>
</figure>
<p>After successfully rolling out bottom sheets using Compose, we saw an opportunity to adopt Compose on a larger scale in the Shop screen. The existing Shop screen code was confusing to follow and very difficult to run experiments on – limiting our ability to help sellers improve their virtual storefronts. Compose and Macramé held the promise of addressing all these concerns.</p>
<figure>
<img alt="The Compose Shop screen" src="https://i.etsystatic.com/inv/bf3afe/6634639379/inv_fullxfull.6634639379_5fmde656.jpg?version=0" title="" height="400" />
<figcaption>The Shop screen, fully built using Compose.</figcaption>
</figure>
<p>In about three months, our small team completed the rebuild. Our first order of business was to run an A/B experiment on the Shop screen to compare old vs. new. The results gave Compose even better marks than we had hoped for. Initial screen rendering time improved by 5%, and subjective interactions with the Shop screen, like taps and scrolls, were quicker and more fluid. <strong>User analytics showed the new screen improved conversion rate, add-to-cart actions, checkout starts, shop favoriting, listing views, and more – meaning these changes made a tangible, positive impact for our sellers.</strong></p>
<p>For the engineers tasked with coding the Shop screen, the results were just as impressive. An internal survey of engineers who had worked with the Shop screen before the rewrite showed a significant improvement in overall developer satisfaction. Building features required fewer lines of code, our respondents told us, and thanks to the Macramé architecture, testing was much easier and enabled us to greatly increase test coverage of business logic. Similar to what we learned during the development of our Design System components, Compose Previews were called out as a superpower for covering edge cases, and engineers said they were excited to work in a codebase that now featured a modern toolkit.</p>
<h2>Learnings</h2>
<p>We've learned quite a lot about Compose on our path to adopting it:</p>
<ul>
<li>Because of the unidirectional data flow of our Macramé architecture and stateless components built with Compose, state is decoupled from the UI and business logic is isolated and testable. The combination of Macramé and Compose has become the standard way we build features for our app.</li>
<li>Colocation of layout and display logic allows for much easier manipulation of spacing, margins, and padding when working with complex display logic. Dynamic spacing is extremely difficult to do with XML layouts alone, and requires code in separate files to keep it all in sync.</li>
<li>Creating previews of all possible Compose states using mock data has eliminated a large source of rework, bugs, and bad experiences for our buyers. </li>
<li>Our team found it easier to build lazy-style lists in Compose compared to managing all the pieces involved with using <a href="https://developer.android.com/develop/ui/views/layout/recyclerview">RecyclerView</a>, especially horizontal lazy lists. </li>
<li>Interoperability between Compose and Views in both directions enabled a gradual adoption of Compose. </li>
<li>Animation of Composables can be triggered automatically by data changes – no writing extra code to start and stop the animations properly.</li>
</ul>
<p>While no individual tool is perfect, we’re excited about the opportunities and efficiencies Compose has unlocked for our teams. As with any new technology, there's a learning curve, and some bumps along the way. One issue we found was in a third-party library we use. While the library has support for Compose, at the time of the Shop screen conversion that support was still in alpha. After extensive testing, we decided to move forward using the alpha version, but an incompatibility could have forced us to find an alternative solution. Another learning is that LazyRows and LazyColumns, while similar in some respects to RecyclerView, come with their own specific way of handling keys and item reuse. This new lazy list paradigm has occasionally tripped us up and resulted in some unexpected behavior.</p>
<h2>Conclusion</h2>
<p>We’re thrilled with our team’s progress and outcomes in adopting this new toolkit. We’ve now fully rewritten several key UI screens, including Listing, Favorites, Search, and Cart using Compose, with more to come. Compose has given us a set of tools that lets us be more productive when delivering new features to our buyers, and its gradual rollout in our codebase is a tangible example of the Etsy team's commitment to our craft.</p>]]></description>
                <content:encoded><![CDATA[<p>One of our <a href="https://www.etsy.com/about">Guiding Principles</a> at Etsy is that we “commit to our craft.” This means that we have a culture of learning, in which we’re constantly looking for opportunities to improve and learn, adopt industry best practices, and share our findings with our colleagues and our community. </p>
<p>As part of that process, Etsy recently adopted <a href="https://developer.android.com/compose">Jetpack Compose</a> – Android's modern toolkit for defining native UIs – as our preferred means of building our Android app. The process of adoption consisted of a gradual expansion in the size and complexity of features built using Compose, eventually culminating in a full rewrite of one of the primary screens in the app. The results of that rewrite gave us the confidence to recommend Compose as the primary tool for our Android engineers to build UIs going forward.</p>
<h2>Adoption</h2>
<p>Our engineers are always investigating the latest industry trends and technologies, but in this case a more structured approach was warranted due to the central nature of UI toolkits in the development process. Several engineers on the Android team were assigned to study the existing Compose documentation and examples we had used in prior builds and then create a short curriculum based on what they learned. Over several months, the team held multiple information sessions with the entire Android group, showing how to use Compose to build simple versions of some of our real app screens.</p>
<figure>
<img alt="A sample curriculum section" src="https://i.etsystatic.com/inv/b7961f/6581128905/inv_fullxfull.6581128905_lyn8t9v6.jpg?version=0" />
<figcaption>Part of our in-house curriculum for learning Jetpack Compose via small modules. Each module built upon the previous module to build more complex versions of various features in our real app.</figcaption>
</figure>
<p>Next, our Design Systems team started creating Compose versions of our internal UI toolkit components, with the goal of having a complete Compose implementation of our design system before major adoption. Compose is designed for <a href="https://developer.android.com/develop/ui/compose/migrate/interoperability-apis">interoperability</a> with our existing toolkit, XML Views, providing an uncomplicated migration path that enables us to start using these new toolkit components in our existing XML Views with minimal disruption. This was our first chance to validate that the performance of Compose would be as good as or better than our existing toolkit components. This also gave the wider Android community at Etsy a chance to start using Compose in their day-to-day work and get comfortable with the new patterns Compose introduced.</p>
<figure>
<img alt="A partial list of design system components" src="https://i.etsystatic.com/inv/fbc7f1/6634642913/inv_fullxfull.6634642913_n94tu2z7.jpg?version=0" title="" height="400" />
<figcaption>A partial list of the design system components our team was able to make available in Compose.</figcaption>
</figure>
<p>Our Design Systems team also made heavy use of one of Compose’s most powerful features: <a href="https://developer.android.com/develop/ui/compose/tooling/previews">Previews</a>. Compose Previews allow a developer to visualize Composables in as many configurations as they want using arbitrary test data, all without having to run the app on a device. Every time the team made a change to a Design Systems Composable, they could validate the effect in a wide range of scenarios. </p>
<p>After a few months of building and adopting toolkit components in Compose, our team felt it was time for a more significant challenge: rebuilding an entire screen. To prevent inadvertently causing a disruption for buyers or sellers on Etsy, we chose a heavily used screen only available in our backend for development builds. This step exposed us to a much wider scope of concerns: navigation, system UI, data fetching using <a href="https://developer.android.com/kotlin/coroutines">coroutines</a> from our API, and the orchestration of multiple Compose components interacting with each other. Using Kotlin <a href="https://developer.android.com/kotlin/flow">Flows</a>, we worked out how to structure our business and UI logic around a unidirectional data flow, a key unlock for future integration of Compose with <a href="https://www.etsy.com/codeascraft/macrame-untangling-the-knot-on-the-android-listing-screen">Macramé</a> – our standard architecture for use across all screens in the Etsy app.</p>
<p>With a full internal screen under our belts, it was time to put Compose in front of real users. A few complex bottom sheets were the next pieces of our app to get the Compose treatment. For the first time, we exposed a major part of our UI, now fully written in Compose, to buyers and sellers on Etsy. We also paired a simple version of our Macramé architecture with these bottom sheets to prove that the two were compatible. </p>
<figure>
<img alt="A Compose bottom sheet" src="https://i.etsystatic.com/inv/891968/6533015670/inv_fullxfull.6533015670_ql85bqq3.jpg?version=0" title="" height="400" />
<figcaption>A bottom sheet fully using Compose hosted inside of a screen built using Views.</figcaption>
</figure>
<p>After successfully rolling out bottom sheets using Compose, we saw an opportunity to adopt Compose on a larger scale in the Shop screen. The existing Shop screen code was confusing to follow and very difficult to run experiments on – limiting our ability to help sellers improve their virtual storefronts. Compose and Macramé held the promise of addressing all these concerns.</p>
<figure>
<img alt="The Compose Shop screen" src="https://i.etsystatic.com/inv/bf3afe/6634639379/inv_fullxfull.6634639379_5fmde656.jpg?version=0" title="" height="400" />
<figcaption>The Shop screen, fully built using Compose.</figcaption>
</figure>
<p>In about three months, our small team completed the rebuild. Our first order of business was to run an A/B experiment on the Shop screen to compare old vs. new. The results gave Compose even better marks than we had hoped for. Initial screen rendering time improved by 5%, and subjective interactions with the Shop screen, like taps and scrolls, were quicker and more fluid. <strong>User analytics showed the new screen improved conversion rate, add-to-cart actions, checkout starts, shop favoriting, listing views, and more – meaning these changes made a tangible, positive impact for our sellers.</strong></p>
<p>For the engineers tasked with coding the Shop screen, the results were just as impressive. An internal survey of engineers who had worked with the Shop screen before the rewrite showed a significant improvement in overall developer satisfaction. Building features required fewer lines of code, our respondents told us, and thanks to the Macramé architecture, testing was much easier and enabled us to greatly increase test coverage of business logic. Similar to what we learned during the development of our Design System components, Compose Previews were called out as a superpower for covering edge cases, and engineers said they were excited to work in a codebase that now featured a modern toolkit.</p>
<h2>Learnings</h2>
<p>We've learned quite a lot about Compose on our path to adopting it:</p>
<ul>
<li>Because of the unidirectional data flow of our Macramé architecture and stateless components built with Compose, state is decoupled from the UI and business logic is isolated and testable. The combination of Macramé and Compose has become the standard way we build features for our app.</li>
<li>Colocation of layout and display logic allows for much easier manipulation of spacing, margins, and padding when working with complex display logic. Dynamic spacing is extremely difficult to do with XML layouts alone, and requires code in separate files to keep it all in sync.</li>
<li>Creating previews of all possible Compose states using mock data has eliminated a large source of rework, bugs, and bad experiences for our buyers. </li>
<li>Our team found it easier to build lazy-style lists in Compose compared to managing all the pieces involved with using <a href="https://developer.android.com/develop/ui/views/layout/recyclerview">RecyclerView</a>, especially horizontal lazy lists. </li>
<li>Interoperability between Compose and Views in both directions enabled a gradual adoption of Compose. </li>
<li>Animation of Composables can be triggered automatically by data changes – no writing extra code to start and stop the animations properly.</li>
</ul>
<p>While no individual tool is perfect, we’re excited about the opportunities and efficiencies Compose has unlocked for our teams. As with any new technology, there's a learning curve, and some bumps along the way. One issue we found was in a third-party library we use. While the library has support for Compose, at the time of the Shop screen conversion that support was still in alpha. After extensive testing, we decided to move forward using the alpha version, but an incompatibility could have forced us to find an alternative solution. Another learning is that LazyRows and LazyColumns, while similar in some respects to RecyclerView, come with their own specific way of handling keys and item reuse. This new lazy list paradigm has occasionally tripped us up and resulted in some unexpected behavior.</p>
<h2>Conclusion</h2>
<p>We’re thrilled with our team’s progress and outcomes in adopting this new toolkit. We’ve now fully rewritten several key UI screens, including Listing, Favorites, Search, and Cart using Compose, with more to come. Compose has given us a set of tools that lets us be more productive when delivering new features to our buyers, and its gradual rollout in our codebase is a tangible example of the Etsy team's commitment to our craft.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Machine Learning in Content Moderation at Etsy]]></title>
                <link>https://www.etsy.com/codeascraft/machine-learning-in-content-moderation-at-etsy?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[David Azcona]]></dc:creator>
                <pubDate>Tue, 6 Aug 2024 07:15:03 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/machine-learning-in-content-moderation-at-etsy?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>At Etsy, we’re focused on elevating the best of our marketplace to help creative entrepreneurs grow their businesses. We continue to invest in making Etsy a safe and trusted place to shop, so sellers’ extraordinary items can shine.</p>
<p>Today, there are more than 100 million unique items available for sale on our marketplace, and our vibrant global community is made up of over 90 million active buyers and 7 million active sellers, the majority of whom are women and sole owners of their creative businesses. To support this growing community, our Trust &amp; Safety team of Product, Engineering, Data, and Operations experts is dedicated to keeping Etsy's marketplace safe by enforcing our policies and removing potentially violating or infringing items at scale.</p>
<p>For that, we make use of community reporting and automated controls for removing this potentially violating content. In order to continue to scale and enhance our detections through innovative products and technologies, we also leverage state-of-the-art Machine Learning solutions which we have already used to identify and remove over 100,000 violations during the past year on our marketplace. In this article, we are going to describe one of our systems to detect policy violations that utilizes supervised learning, a family of algorithms that uses data to train their models to recognize patterns and predict outcomes.</p>
<h1>Datasets</h1>
<p>In Machine Learning, data is one of the variables we have the most control over. Extracting data and building trustworthy datasets is a crucial step in any learning problem. In Trust &amp; Safety, we are determined to keep our marketplace and users safe by identifying violations to our policies.</p>
<p>For that, we log and annotate potential violations, which lets us collect datasets reliably. In our approach, these annotations are translated into positives (items that were indeed violations) and negatives (items that were found not to violate a given policy). The latter are also known as hard negatives: they are close to our positives and help us better learn how to separate the two sets.</p>
<p>In addition, we add easy or soft negatives by adding random items to our datasets. This gives our models further general examples of listings that do not violate any policy, which are the majority in our marketplace, and improves generalizability. The number of easy negatives to add is a hyper-parameter to tune: more of them means longer training time and a smaller share of positive examples.</p>
<p>For each training example, we extract multimodal signals, both text and images, from our listings.</p>
<p>Then, we split our datasets by time using progressive evaluation, to mimic our production use case and learn to adapt to recent behavior. The data is split into a training set, used to train our models and learn patterns; a validation set, used to fine-tune training hyper-parameters such as the learning rate and to check for over-fitting; and a test set, used to report our metrics in an unbiased manner.</p>
<h1>Model Architecture</h1>
<p>After usual transformations and extraction of a set of offline features from our datasets, we are all set to start training our Machine Learning model.</p>
<p>The goal is to predict whether a given listing violates any of our predefined policies or none of them. For that, we added a neutral class that represents no violation, into which the majority of our listings fall. This is a typical design pattern for these types of problems.</p>
<p>Our model architecture includes a text encoder and an image encoder to learn representations (aka embeddings) for each modality. </p>
<p>Our text encoder currently employs a <a href="https://arxiv.org/abs/1810.04805">BERT</a>-based architecture to extract contextual representations of our text inputs. In addition, to reduce compute time, we leverage <a href="https://arxiv.org/abs/1909.11942">ALBERT</a>, a lighter BERT with 90% fewer parameters because its transformer blocks share them. Our initial lightweight representation used an in-house model trained for Search use cases. This allowed us to quickly start iterating and learning from this problem.</p>
<p>Our image encoder currently employs <a href="https://arxiv.org/abs/1905.11946">EfficientNet</a>, a very efficient and accurate Convolutional Neural Network (CNN). Our initial lightweight representation used an in-house model for category classification using CNNs. We are experimenting with transformer-based architectures similar to our text encoders, namely <a href="https://arxiv.org/abs/2010.11929">vision transformers</a>, but so far they have not significantly improved performance.</p>
<p>Inspired by <a href="https://arxiv.org/abs/1904.09078">EmbraceNet</a>, our architecture then further learns more constrained representations for both text and image embeddings separately, before they are concatenated to form a unique multimodal representation. This is then sent to a final softmax activation that maps logits to probabilities for our internal use.</p>
<p>In addition, to address the imbalanced nature of this problem, we leverage <a href="https://arxiv.org/pdf/1708.02002">focal loss</a>, which puts more weight on hard, misclassified examples.</p>
<p>Figure 1 shows our model architecture with late concatenation of our text and image encoders and final output probabilities on an example.</p>
<figure>
<img alt="Model Architecture" src="https://i.etsystatic.com/inv/68753a/6158554996/inv_fullxfull.6158554996_9u5voz5s.jpg?version=0" title="Model Architecture" />
<figcaption>Model Architecture. Image is obtained from <a href="https://unsplash.com/@charlesdeluvio">@charlesdeluvio</a> on Unsplash</figcaption>
</figure>
<h1>Model Evaluation</h1>
<p>First, we experimented and iterated by training our model offline. To evaluate its performance, we established benchmarks based on the business goal of minimizing the impact on well-intentioned sellers while successfully detecting offending listings on the platform. This results in a typical evaluation trade-off between precision and recall: precision is the fraction of listings flagged as violations that really are violations, and recall is the fraction of actual violations that get flagged. However, recall is not possible to compute exactly, as it’s not feasible to manually review the millions of new listings per day, so we had to settle for a proxy for recall based on what has been annotated.</p>
<p>Once we had a viable candidate to test in production, we deployed our model as an endpoint and built a service, callable via an API, that performs pre-processing and post-processing steps before and after the call to that endpoint.</p>
<p>Then, we ran an A/B test to measure its performance in production using a canary release approach: we slowly rolled out our new detection system to a small percentage of traffic and kept increasing it while we validated an increase in our metrics and no unexpected computation overload.</p>
<p>Afterwards, we iterated and every time we had a promising offline candidate, named challenger, that improved our offline performance metrics, we A/B tested it with respect to our current model, named champion. We designed guidelines for model promotion to increase our metrics and our policy coverage. </p>
<p>Now, we monitor and observe our model predictions and trigger re-training when our performance degrades.</p>
<h1>Results</h1>
<p>Our supervised learning system has been continually learning as we train frequently, run experiments with new datasets and model architectures, A/B test them and deploy them in production. We have added violations as additional classes to our model. As a result, we have identified and removed more than 100,000 violations using these methodologies, in addition to other tools and services that continue to detect and remove violations. </p>
<p>This is one of several approaches we use to identify potentially offending content; others include explicitly using the policy information and leveraging the latest in Large Language Models (LLMs) and Generative AI. Stay tuned!</p>
<p>"To infinity and beyond!" –Buzz Lightyear, Toy Story</p>]]></description>
                <content:encoded><![CDATA[<p>At Etsy, we’re focused on elevating the best of our marketplace to help creative entrepreneurs grow their businesses. We continue to invest in making Etsy a safe and trusted place to shop, so sellers’ extraordinary items can shine.</p>
<p>Today, there are more than 100 million unique items available for sale on our marketplace, and our vibrant global community is made up of over 90 million active buyers and 7 million active sellers, the majority of whom are women and sole owners of their creative businesses. To support this growing community, our Trust &amp; Safety team of Product, Engineering, Data, and Operations experts is dedicated to keeping Etsy's marketplace safe by enforcing our policies and removing potentially violating or infringing items at scale.</p>
<p>For that, we make use of community reporting and automated controls for removing this potentially violating content. In order to continue to scale and enhance our detections through innovative products and technologies, we also leverage state-of-the-art Machine Learning solutions which we have already used to identify and remove over 100,000 violations during the past year on our marketplace. In this article, we are going to describe one of our systems to detect policy violations that utilizes supervised learning, a family of algorithms that uses data to train their models to recognize patterns and predict outcomes.</p>
<h1>Datasets</h1>
<p>In Machine Learning, data is one of the variables we have the most control over. Extracting data and building trustworthy datasets is a crucial step in any learning problem. In Trust &amp; Safety, we are determined to keep our marketplace and users safe by identifying violations to our policies.</p>
<p>For that, we log and annotate potential violations, which lets us collect datasets reliably. In our approach, these annotations are translated into positives (items that were indeed violations) and negatives (items that were found not to violate a given policy). The latter are also known as hard negatives: they are close to our positives and help us better learn how to separate the two sets.</p>
<p>In addition, we add easy or soft negatives by adding random items to our datasets. This gives our models further general examples of listings that do not violate any policy, which are the majority in our marketplace, and improves generalizability. The number of easy negatives to add is a hyper-parameter to tune: more of them means longer training time and a smaller share of positive examples.</p>
<p>For each training example, we extract multimodal signals, both text and images, from our listings.</p>
<p>Then, we split our datasets by time using progressive evaluation, to mimic our production use case and learn to adapt to recent behavior. The data is split into a training set, used to train our models and learn patterns; a validation set, used to fine-tune training hyper-parameters such as the learning rate and to check for over-fitting; and a test set, used to report our metrics in an unbiased manner.</p>
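<p>As a minimal sketch of a chronological split (not our production pipeline), the idea can be illustrated as follows. The function name <code>split_by_time</code>, the cut-off dates, and the listing ids are all hypothetical and chosen only for illustration.</p>

```python
from datetime import date

def split_by_time(examples, train_end, validation_end):
    """Split (timestamp, example) pairs chronologically.

    Everything up to train_end is training data, the next window up to
    validation_end is validation data, and the most recent examples are
    held out as test data.
    """
    train, validation, test = [], [], []
    for when, example in sorted(examples, key=lambda pair: pair[0]):
        if when > validation_end:
            test.append(example)
        elif when > train_end:
            validation.append(example)
        else:
            train.append(example)
    return train, validation, test

# Hypothetical annotated listings: (annotation date, listing id).
examples = [
    (date(2024, 3, 9), "listing-c"),
    (date(2024, 1, 5), "listing-a"),
    (date(2024, 2, 20), "listing-b"),
    (date(2024, 4, 1), "listing-d"),
]
train, validation, test = split_by_time(
    examples, train_end=date(2024, 2, 29), validation_end=date(2024, 3, 31)
)
```

<p>Because the test set always contains the most recent examples, the reported metrics reflect how a model trained on older data behaves on newer listings, which is the situation it faces in production.</p>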
<h1>Model Architecture</h1>
<p>After usual transformations and extraction of a set of offline features from our datasets, we are all set to start training our Machine Learning model.</p>
<p>The goal is to predict whether a given listing violates any of our predefined policies or none of them. For that, we added a neutral class that represents no violation, into which the majority of our listings fall. This is a typical design pattern for these types of problems.</p>
<p>Our model architecture includes a text encoder and an image encoder to learn representations (aka embeddings) for each modality. </p>
<p>Our text encoder currently employs a <a href="https://arxiv.org/abs/1810.04805">BERT</a>-based architecture to extract contextual representations of our text inputs. In addition, to reduce compute time, we leverage <a href="https://arxiv.org/abs/1909.11942">ALBERT</a>, a lighter BERT with 90% fewer parameters because its transformer blocks share them. Our initial lightweight representation used an in-house model trained for Search use cases. This allowed us to quickly start iterating and learning from this problem.</p>
<p>Our image encoder currently employs <a href="https://arxiv.org/abs/1905.11946">EfficientNet</a>, a very efficient and accurate Convolutional Neural Network (CNN). Our initial lightweight representation used an in-house model for category classification using CNNs. We are experimenting with transformer-based architectures similar to our text encoders, namely <a href="https://arxiv.org/abs/2010.11929">vision transformers</a>, but so far they have not significantly improved performance.</p>
<p>Inspired by <a href="https://arxiv.org/abs/1904.09078">EmbraceNet</a>, our architecture then further learns more constrained representations for both text and image embeddings separately, before they are concatenated to form a unique multimodal representation. This is then sent to a final softmax activation that maps logits to probabilities for our internal use.</p>
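<p>The late-concatenation step can be sketched in plain Python. This is a toy illustration under stated assumptions, not our model: the embeddings are two-dimensional, the weights are made up, the per-modality layers that precede concatenation are omitted, and the helper names <code>dense</code> and <code>classify</code> are hypothetical.</p>

```python
import math

def softmax(logits):
    """Map logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def dense(vector, weights, bias):
    """One linear layer: weights is a list of rows, one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

def classify(text_embedding, image_embedding, weights, bias):
    """Late fusion: concatenate the per-modality embeddings, then classify."""
    fused = list(text_embedding) + list(image_embedding)
    return softmax(dense(fused, weights, bias))

# Toy numbers only: 2-d text embedding, 2-d image embedding, 3 classes
# (class 0 plays the role of the neutral "no violation" class).
weights = [
    [0.5, -0.2, 0.1, 0.3],
    [-0.4, 0.6, 0.2, -0.1],
    [0.1, 0.1, -0.3, 0.2],
]
bias = [0.2, 0.0, -0.1]
probabilities = classify([1.0, 0.5], [0.2, 0.8], weights, bias)
```

<p>The output is one probability per class, summing to one, with the neutral class treated like any other class by the softmax.</p>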
<p>In addition, to address the imbalanced nature of this problem, we leverage <a href="https://arxiv.org/pdf/1708.02002">focal loss</a>, which puts more weight on hard, misclassified examples.</p>
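<p>To make the effect concrete, here is a small sketch of the focal loss formula from the linked paper for a single example, without the optional class-weighting term. The default <code>gamma=2.0</code> is the value the paper reports as working best; the value used in our training is not stated here, so treat it as an assumption.</p>

```python
import math

def focal_loss(probabilities, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t) ** gamma * log(p_t).

    p_t is the predicted probability of the true class. gamma=0 gives
    ordinary cross-entropy; larger gamma down-weights easy examples.
    """
    p_t = probabilities[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss([0.9, 0.05, 0.05], target=0)  # confidently correct
hard = focal_loss([0.1, 0.6, 0.3], target=0)    # badly misclassified
cross_entropy_easy = focal_loss([0.9, 0.05, 0.05], target=0, gamma=0.0)
```

<p>The confidently correct example contributes far less loss than it would under plain cross-entropy, so the abundant easy negatives do not drown out the rare, hard cases.</p>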
<p>Figure 1 shows our model architecture with late concatenation of our text and image encoders and final output probabilities on an example.</p>
<figure>
<img alt="Model Architecture" src="https://i.etsystatic.com/inv/68753a/6158554996/inv_fullxfull.6158554996_9u5voz5s.jpg?version=0" title="Model Architecture" />
<figcaption>Model Architecture. Image is obtained from <a href="https://unsplash.com/@charlesdeluvio">@charlesdeluvio</a> on Unsplash</figcaption>
</figure>
<h1>Model Evaluation</h1>
<p>First, we experimented and iterated by training our model offline. To evaluate its performance, we established benchmarks based on the business goal of minimizing the impact on well-intentioned sellers while successfully detecting offending listings on the platform. This results in a typical evaluation trade-off between precision and recall: precision is the fraction of listings flagged as violations that really are violations, and recall is the fraction of actual violations that get flagged. However, recall is not possible to compute exactly, as it’s not feasible to manually review the millions of new listings per day, so we had to settle for a proxy for recall based on what has been annotated.</p>
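<p>A short sketch of the two metrics, using sets of hypothetical listing ids. How the recall proxy is actually built is not described in this post; here the annotated violations simply stand in for the unknown full set of violations.</p>

```python
def precision_recall(predicted, actual):
    """predicted / actual are sets of listing ids flagged / truly violating."""
    true_positives = len(predicted.intersection(actual))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical ids. "annotated_violations" stands in for the full set of
# violations, which is unknown, so the second number is only a recall proxy.
flagged = {"a", "b", "c", "d"}
annotated_violations = {"b", "c", "d", "e", "f"}
precision, recall_proxy = precision_recall(flagged, annotated_violations)
```

<p>Raising the decision threshold typically flags fewer listings, which tends to raise precision and lower recall; that is the trade-off the benchmarks have to balance.</p>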
<p>Once we had a viable candidate to test in production, we deployed our model as an endpoint and built a service, callable via an API, that performs pre-processing and post-processing steps before and after the call to that endpoint.</p>
<p>Then, we ran an A/B test to measure its performance in production using a canary release approach: we slowly rolled out our new detection system to a small percentage of traffic and kept increasing it while we validated an increase in our metrics and no unexpected computation overload.</p>
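<p>One common way to implement a gradual rollout is deterministic hash bucketing, sketched below. This is an illustration of the general technique, not a description of how our traffic is actually assigned; the function name <code>in_canary</code> and the use of listing ids as the bucketing key are assumptions.</p>

```python
import hashlib

def in_canary(listing_id, rollout_percent):
    """Deterministically assign a listing to the canary group.

    Hashing keeps the assignment stable, so raising rollout_percent only
    adds listings to the canary group and never removes any.
    """
    digest = hashlib.sha256(str(listing_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return rollout_percent > bucket

listing_ids = range(10_000)
at_5 = {i for i in listing_ids if in_canary(i, 5)}
at_25 = {i for i in listing_ids if in_canary(i, 25)}
```

<p>Because every listing in the 5% group is also in the 25% group, metrics gathered at the earlier stage remain comparable as the rollout widens.</p>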
<p>Afterwards, we iterated and every time we had a promising offline candidate, named challenger, that improved our offline performance metrics, we A/B tested it with respect to our current model, named champion. We designed guidelines for model promotion to increase our metrics and our policy coverage. </p>
<p>Now, we monitor and observe our model predictions and trigger re-training when our performance degrades.</p>
<h1>Results</h1>
<p>Our supervised learning system has been continually learning as we train frequently, run experiments with new datasets and model architectures, A/B test them and deploy them in production. We have added violations as additional classes to our model. As a result, we have identified and removed more than 100,000 violations using these methodologies, in addition to other tools and services that continue to detect and remove violations. </p>
<p>This is one of several approaches we use to identify potentially offending content; others include explicitly using the policy information and leveraging the latest in Large Language Models (LLMs) and Generative AI. Stay tuned!</p>
<p>"To infinity and beyond!" –Buzz Lightyear, Toy Story</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Enhancing Cloud Usage Forecasting, Monitoring & Optimizing]]></title>
                <link>https://www.etsy.com/codeascraft/enhancing-cloud-usage-forecasting-monitoring--optimizing?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Anthony Tambasco]]></dc:creator>
                <pubDate>Mon, 17 Jun 2024 09:58:16 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/enhancing-cloud-usage-forecasting-monitoring--optimizing?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>In 2020, Etsy concluded its migration from an on-premise data center to the Google Cloud Platform (GCP). During this transition, a dedicated team of program managers ensured the migration's success. Post-migration, this team evolved into the Etsy FinOps team, dedicated to maximizing the organization's cloud value by fostering collaborations within and outside the organization, particularly with our Cloud Providers.</p>
<p>Positioned within the Engineering organization under the Chief Architect, the FinOps team operates independently of any one Engineering org or function and optimizes globally rather than locally. This positioning, combined with Etsy's robust engineering culture focused on efficiency and craftsmanship, has fostered what we believe is a mature and successful FinOps practice at Etsy.</p>
<p><strong>Forecast Methodology
</strong></p>
<p>A critical aspect of our FinOps approach is a strong forecasting methodology. A reliable forecast establishes an expected spending baseline against which we track actual spending, enabling us to identify deviations. We classify costs into distinct buckets:</p>
<ul>
<li>Core Infrastructure: Includes the costs of infrastructure and services essential for operating the Etsy.com website.</li>
<li>Machine Learning &amp; Product Enablement: Encompasses costs related to services supporting machine learning initiatives like search, recommendations, and advertisements.</li>
<li>Data Enablement: Encompasses costs related to shared platforms for data collection, data processing and workflow orchestration.</li>
<li>Dev: Encompasses non-production resources.</li>
</ul>
<p>The FinOps forecasting model relies on a trailing Cost Per <a href="https://support.google.com/analytics/answer/2731565?hl=en#zippy=%252Cin-this-article">Visit</a> (CPV) metric. While CPV provides valuable insights into changes, it's not without limitations:</p>
<ul>
<li>A meaningful portion of web traffic to Etsy involves non-human activity, like web crawlers, that’s not accounted for in CPV.</li>
<li>Some services have weaker correlations to user visits.</li>
<li>Dev, data, and ML training costs lack direct correlations to visits and are susceptible to short-term spikes during POCs, experiments or big data workflows.</li>
<li>A/B tests for new features can lead to short-term CPV increases, potentially resulting in long-term CPV changes upon successful feature launches.</li>
</ul>
<p>Periodically, we run regression tests to validate whether CPV should drive our forecasts. In addition to visits, we have looked into headcount, GMV (Gross Merchandise Value), and revenue as independent variables. Thus far, visits have consistently exhibited the highest correlation to costs.</p>
<p><strong>Monitoring and Readouts
</strong></p>
<p>We monitor costs using internal tools built on BigQuery and Looker. Customized dashboards for all of our Engineering teams display cost trends, CPV, and breakdowns by labels and workflows. Additionally, we've set up alerts to identify sudden spikes or gradual week-over-week/month-over-month growth.</p>
<p>Collaboration with the Finance department occurs weekly to compare actual costs against forecasts, identifying discrepancies for timely corrections. Furthermore, the FinOps team conducts recurring meetings with major cost owners and monthly readouts for Engineering and Product leadership to review forecasted figures and manage cost variances.</p>
<p>While we track costs at the organization/cost center level, we don't charge costs back to the teams. This both lowers our overhead and more importantly, provides flexibility to make tradeoffs that enable Engineering velocity. </p>
<p><strong>Cost Increase Detection &amp; Mitigation
</strong></p>
<p>Maintaining a healthy CPV involves swiftly identifying and mitigating cost increases. To achieve this, we follow these steps:</p>
<ul>
<li>Analysis: Gather information on the increase's source, whether from specific cloud products, workflows, or usage pattern changes (i.e., variance in resource utilization).</li>
<li>Collaboration: Engage relevant teams, sharing insights and seeking additional context.</li>
<li>Validation: Validate cost increases from product launches or internal changes, securing buy-in from leadership if needed. </li>
<li>Mitigation: Unexpected increases undergo joint troubleshooting, where we outline and assign action items to owners, until issues are resolved.</li>
<li>Communication: Inform our finance partners about recent cost trends and their incorporation into the expected spend forecast post-confirmation or resolution with teams and engineering leadership.</li>
</ul>
<p><strong>Cost Optimization Initiatives
</strong></p>
<p>Another side of maintaining a healthy CPV involves cost optimization, offsetting increases from product launches. Ideas for cost-saving come as a result of collaboration between FinOps and engineering teams, with the Architecture team validating and implementing efficiency improvements. Notably, we focus on the engineering or business impact of the cost optimization rather than solely on savings, recognizing that inefficiencies often signal larger problems.</p>
<p>Based on effort vs. value evaluations, some ideas are added to backlogs, while major initiatives warrant dedicated squads. Below is a breakdown of some of the major wins we have had in the last year or so.</p>
<ul>
<li>GCS Storage Optimization - In 2023 we stood up a squad focused on optimizing Etsy’s use of GCS, as it has been one of the largest growth areas for us over the past few years. The squad delivered a number of improvements, including improved monitoring of usage, automation features for Data engineers, implementation of TTLs that match data access patterns/business needs, and the adoption of Intelligent tiering. Due to these efforts, Etsy’s GCS usage is now lower than it was two years ago.</li>
<li>Compute Optimization - Migrated over 90% of Etsy infrastructure that is serving traffic to the latest and greatest CPU platform.  This improved our serving latency while reducing cost.</li>
<li>Increased Automation for model deployment - In an effort to improve the developer experience, our machine learning enablement team developed a tool to automate the compute configurations for new models being deployed, which also ended up saving us money.</li>
<li>Network Compression - Enabling network compression between our high throughput services both improved the  latency profile and drastically reduced the networking cost. </li>
</ul>
<p><strong>What's Next
</strong></p>
<p>While our core infrastructure spend is well understood, our focus is on improving visibility into our Machine Learning platform's spend. As these systems are shared across teams, dissecting costs tied to individual product launches is challenging. Enhanced visibility will help us refine our ROI analysis of product experiments and pinpoint future areas of opportunity for optimization.</p>]]></description>
                <content:encoded><![CDATA[<p>In 2020, Etsy concluded its migration from an on-premise data center to the Google Cloud Platform (GCP). During this transition, a dedicated team of program managers ensured the migration's success. Post-migration, this team evolved into the Etsy FinOps team, dedicated to maximizing the organization's cloud value by fostering collaborations within and outside the organization, particularly with our Cloud Providers.</p>
<p>Positioned within the Engineering organization under the Chief Architect, the FinOps team operates independently of any one Engineering org or function and optimizes globally rather than locally. This positioning, combined with Etsy's robust engineering culture focused on efficiency and craftsmanship, has fostered what we believe is a mature and successful FinOps practice at Etsy.</p>
<p><strong>Forecast Methodology
</strong></p>
<p>A critical aspect of our FinOps approach is a strong forecasting methodology. A reliable forecast establishes an expected spending baseline against which we track actual spending, enabling us to identify deviations. We classify costs into distinct buckets:</p>
<ul>
<li>Core Infrastructure: Includes the costs of infrastructure and services essential for operating the Etsy.com website.</li>
<li>Machine Learning &amp; Product Enablement: Encompasses costs related to services supporting machine learning initiatives like search, recommendations, and advertisements.</li>
<li>Data Enablement: Encompasses costs related to shared platforms for data collection, data processing and workflow orchestration.</li>
<li>Dev: Encompasses non-production resources.</li>
</ul>
<p>The FinOps forecasting model relies on a trailing Cost Per <a href="https://support.google.com/analytics/answer/2731565?hl=en#zippy=%252Cin-this-article">Visit</a> (CPV) metric. While CPV provides valuable insights into changes, it's not without limitations:</p>
<ul>
<li>A meaningful portion of web traffic to Etsy involves non-human activity, like web crawlers, that’s not accounted for in CPV.</li>
<li>Some services have weaker correlations to user visits.</li>
<li>Dev, data, and ML training costs lack direct correlations to visits and are susceptible to short-term spikes during POCs, experiments or big data workflows.</li>
<li>A/B tests for new features can lead to short-term CPV increases, potentially resulting in long-term CPV changes upon successful feature launches.</li>
</ul>
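<p>As a rough sketch of how a trailing CPV can turn into a spend forecast: divide recent cost by recent visits, then multiply by the visits you expect. The function names, the two-month window, and every figure below are made up for illustration and do not reflect our actual model or numbers.</p>

```python
def trailing_cpv(costs, visits, window):
    """Cost per visit over the most recent `window` periods."""
    recent_cost = sum(costs[-window:])
    recent_visits = sum(visits[-window:])
    return recent_cost / recent_visits

def forecast_cost(costs, visits, window, expected_visits):
    """Project spend as trailing CPV times the expected visit count."""
    return trailing_cpv(costs, visits, window) * expected_visits

# Made-up monthly figures, for illustration only.
monthly_cost = [100.0, 110.0, 120.0, 130.0]
monthly_visits = [1000, 1000, 1200, 1300]
cpv = trailing_cpv(monthly_cost, monthly_visits, window=2)
projected = forecast_cost(monthly_cost, monthly_visits, window=2, expected_visits=1500)
```

<p>The limitations listed above show up directly in this arithmetic: any cost that does not scale with visits, such as a short-lived ML training spike, shifts the trailing ratio and therefore the projection.</p>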
<p>Periodically, we run regression tests to validate whether CPV should drive our forecasts. In addition to visits, we have looked into headcount, GMV (Gross Merchandise Value), and revenue as independent variables. Thus far, visits have consistently exhibited the highest correlation to costs.</p>
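<p>A comparison of candidate cost drivers can be sketched with a plain Pearson correlation. This is a simplified stand-in for the regression tests mentioned above, and the series below are invented for illustration.</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up monthly series, for illustration only.
cost = [100.0, 112.0, 119.0, 133.0, 140.0]
candidates = {
    "visits": [1000, 1100, 1200, 1300, 1400],
    "headcount": [50, 50, 51, 51, 50],
}
correlations = {name: pearson(series, cost) for name, series in candidates.items()}
best_driver = max(correlations, key=correlations.get)
```

<p>In this toy data the visits series tracks cost most closely, so it would be picked as the forecasting driver; with real data the same comparison would be rerun periodically.</p>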
<p><strong>Monitoring and Readouts
</strong></p>
<p>We monitor costs using internal tools built on BigQuery and Looker. Customized dashboards for all of our Engineering teams display cost trends, CPV, and breakdowns by labels and workflows. Additionally, we've set up alerts to identify sudden spikes or gradual week-over-week/month-over-month growth.</p>
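<p>A week-over-week growth alert can be sketched in a few lines. The 10% threshold and the weekly figures are assumptions for illustration; our actual alerts run on the BigQuery and Looker tooling described above.</p>

```python
def week_over_week_alerts(weekly_costs, threshold=0.10):
    """Return (week_index, growth) for weeks that grew more than threshold."""
    alerts = []
    for week in range(1, len(weekly_costs)):
        previous, current = weekly_costs[week - 1], weekly_costs[week]
        growth = (current - previous) / previous
        if growth > threshold:
            alerts.append((week, growth))
    return alerts

# Made-up weekly spend, for illustration only.
weekly_costs = [100.0, 102.0, 101.0, 125.0, 126.0]
alerts = week_over_week_alerts(weekly_costs)
```

<p>A check like this catches sudden spikes; gradual growth needs a longer comparison window, for example month over month, because each individual week can stay under the threshold.</p>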
<p>Collaboration with the Finance department occurs weekly to compare actual costs against forecasts, identifying discrepancies for timely corrections. Furthermore, the FinOps team conducts recurring meetings with major cost owners and monthly readouts for Engineering and Product leadership to review forecasted figures and manage cost variances.</p>
<p>While we track costs at the organization/cost center level, we don't charge costs back to the teams. This both lowers our overhead and more importantly, provides flexibility to make tradeoffs that enable Engineering velocity. </p>
<p><strong>Cost Increase Detection &amp; Mitigation
</strong></p>
<p>Maintaining a healthy CPV involves swiftly identifying and mitigating cost increases. To achieve this, we follow these steps:</p>
<ul>
<li>Analysis: Gather information on the increase's source, whether from specific cloud products, workflows, or usage pattern changes (i.e., variance in resource utilization).</li>
<li>Collaboration: Engage relevant teams, sharing insights and seeking additional context.</li>
<li>Validation: Validate cost increases from product launches or internal changes, securing buy-in from leadership if needed. </li>
<li>Mitigation: Unexpected increases undergo joint troubleshooting, where we outline and assign action items to owners, until issues are resolved.</li>
<li>Communication: Inform our finance partners about recent cost trends and their incorporation into the expected spend forecast post-confirmation or resolution with teams and engineering leadership.</li>
</ul>
<p><strong>Cost Optimization Initiatives
</strong></p>
<p>Another side of maintaining a healthy CPV involves cost optimization, offsetting increases from product launches. Ideas for cost-saving come as a result of collaboration between FinOps and engineering teams, with the Architecture team validating and implementing efficiency improvements. Notably, we focus on the engineering or business impact of the cost optimization rather than solely on savings, recognizing that inefficiencies often signal larger problems.</p>
<p>Based on effort vs. value evaluations, some ideas are added to backlogs, while major initiatives warrant dedicated squads. Below is a breakdown of some of the major wins we have had in the last year or so.</p>
<ul>
<li>GCS Storage Optimization - In 2023 we stood up a squad focused on optimizing Etsy’s use of GCS, as it has been one of the largest growth areas for us over the past few years. The squad delivered a number of improvements, including improved monitoring of usage, automation features for Data engineers, implementation of TTLs that match data access patterns/business needs, and the adoption of Intelligent tiering. Due to these efforts, Etsy’s GCS usage is now lower than it was two years ago.</li>
<li>Compute Optimization - Migrated over 90% of Etsy infrastructure that is serving traffic to the latest and greatest CPU platform.  This improved our serving latency while reducing cost.</li>
<li>Increased Automation for model deployment - In an effort to improve the developer experience, our machine learning enablement team developed a tool to automate the compute configurations for new models being deployed, which also ended up saving us money.</li>
<li>Network Compression - Enabling network compression between our high throughput services both improved the  latency profile and drastically reduced the networking cost. </li>
</ul>
<p><strong>What's Next
</strong></p>
<p>While our core infrastructure spend is well understood, our focus is on improving visibility into our Machine Learning platform's spend. As these systems are shared across teams, dissecting costs tied to individual product launches is challenging. Enhanced visibility will help us refine our ROI analysis of product experiments and pinpoint future areas of opportunity for optimization.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Efficient Visual Representation Learning And Evaluation]]></title>
                <link>https://www.etsy.com/codeascraft/efficient-visual-representation-learning-and-evaluation?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Eden Dolev]]></dc:creator>
                <pubDate>Fri, 7 Jun 2024 13:23:02 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/efficient-visual-representation-learning-and-evaluation?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>Etsy features a diverse marketplace of unique handmade and vintage items. It’s a visually diverse marketplace as well, and computer vision has become increasingly important to Etsy as a way of enhancing our users’ shopping experience. We’ve developed applications like visual search and visually similar recommendations that can offer buyers an additional path to find what they’re looking for, powered by machine learning models that encode images as vector representations.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/e81dea/6003603188/inv_fullxfull.6003603188_fgpr44tc.jpg?version=0" />
<figcaption>Figure 1. Visual representations power applications such as visual search and visually similar recommendations</figcaption>
</figure>
<p>Learning expressive representations through deep neural networks, and being able to leverage them in downstream tasks at scale, is a costly technical challenge. The infrastructure required to train and serve large models is expensive, as is the iterative process that refines them and optimizes their performance. The solution is often to train deep learning architectures offline and use the pre-computed pretrained visual representations in downstream tasks served online. (We wrote about this in a <a href="https://www.etsy.com/codeascraft/leveraging-real-time-user-actions-to-personalize-etsy-ads">previous blog post</a> on personalization from real-time sequences and diversity of representations.) In any application where a query image representation is inferred online, it's important that you have low latency, memory-aware models. Efficiency becomes paramount to the success of these models in the product. We can think about efficiency in deep learning along multiple axes: efficiency in model architecture, model training, evaluation and serving.</p>
<h2>Model Architecture</h2>
<p>The EfficientNet family of models features a convolutional neural network architecture. It uniformly scales network width, depth, and resolution using a fixed set of coefficients. By allowing practitioners to start from a limited resource budget and scale up for better accuracy as more resources are available, EfficientNet provides a great starting point for visual representations. We began our trials with EfficientNetB0, the smallest model in the EfficientNet family. We saw good performance and low latency with this model, but the industry and research community have touted Vision Transformers (ViT) as having better representations. We decided to give that a try.</p>
<p>Transformers lack the spatial inductive biases of CNNs, but they outperform CNNs when trained on large enough datasets and may be more robust to domain shifts. ViT decomposes the image into a sequence of patches (16×16, for example) and applies a transformer architecture to incorporate more global information. However, due to the massive number of parameters and the compute-heavy attention mechanism, ViT-based architectures can be many times slower to train and to run inference with than lightweight convolutional networks. Despite the challenges, more efficient ViT architectures have recently begun to emerge, featuring clever pooling, layer dropping, efficient normalization, and efficient attention or hybrid CNN-transformer designs.</p>
<p>We employ the EfficientFormer-l3 to take advantage of these ViT improvements. The EfficientFormer architecture achieves efficiency by downsampling across multiple blocks and employing attention only in the last stage. The way we derive the image representation differs from the standard vision transformer, where embeddings are extracted from the first token of the output. Instead, we extract the attention from the last block for the eight heads and perform average pooling over the sequence. In Figure 2 we illustrate these different attention weights with heat maps overlaid on an image, showing how each of the eight heads learns to focus on a different salient part.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/8cb94e/6051704113/inv_fullxfull.6051704113_h0y3tmqi.jpg?version=0" />
<figcaption>Figure 2. Probing the EfficientFormer-l3 pre-trained visual representations through attention heat maps.</figcaption>
</figure>
<h2>Model Training</h2>
<h3>Fine-Tuning</h3>
<p>With our pre-trained backbones in place, we can gain further efficiencies via fine-tuning. For the EfficientNetB0 CNN, that means replacing the final convolutional layer and attaching a d-dimensional embedding layer followed by m classification heads, where m is the number of tasks. The embedding head consists of a new convolutional layer with the desired final representation dimension, followed by a batch normalization layer, a swish activation, and a global average pooling layer to aggregate the convolutional output into a single vector per example. To train EfficientNetB0, the newly attached layers are trained from scratch for one epoch with the backbone layers frozen, to avoid excessive computation and overfitting. We then unfreeze 75 layers from the top of the backbone and fine-tune for nine additional epochs, for efficient learning. At inference time we remove the classification heads and extract the output of the pooling layer as the final representation.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/c1a749/6003627480/inv_fullxfull.6003627480_731l7h49.jpg?version=0" />
</figure>
<p>To fine-tune the EfficientFormer ViT we stick with the pretraining resolution of 224×224, since using sequences longer than the recommended 384×384 in ViT leads to larger training budgets. To extract the embedding we average pool the last hidden state. Then classification heads are added as with the CNN, with batch normalization being swapped for layer normalization.</p>
<h3>Multitask Learning</h3>
<p>In a previous <a href="https://www.etsy.com/codeascraft/from-image-classification-to-multitask-modeling-building-etsys-search-by-image-feature">blog post</a> we described how we built a multitask learning framework to generate visual representations for Etsy's search-by-image experience. The training architecture is shown in Figure 3.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/846ae7/6003634602/inv_fullxfull.6003634602_scbrb3bf.jpg?version=0" />
<figcaption>Figure 3. A multitask training architecture for visual representations. The dataset sampler combines examples from an arbitrary number of datasets corresponding to respective classification heads. The embedding is extracted before the classification heads.</figcaption>
</figure>
<p>Multitask learning is an efficiency inducer. Representations encode commonalities, and they perform well in diverse downstream tasks when they are learned using common attributes as multiple supervision signals. A representation learned in single-task classification to the item’s taxonomy, for example, will be unable to capture visual attributes: colors, shapes, materials. We employ four classification tasks: a top-level taxonomy task with 15 top-level categories of the Etsy taxonomy tree as labels; a fine-grained taxonomy task, with 1,000 fine-grained leaf node item categories as labels; a primary color task; and a fine-grained taxonomy task (review photos), where each example is a buyer-uploaded review photo of a purchased item with 100 labels sampled from fine-grained leaf node item categories.</p>
<p>We are able to train both EfficientNetB0 and EfficientFormer-l3 on standard 16GB GPUs (we used two P100 GPUs). For comparison, a full-sized ViT requires a larger GPU with 40GB of memory, such as an A100, which can increase training costs significantly. We provide detailed hyperparameter information for fine-tuning either backbone in our <a href="https://arxiv.org/pdf/2305.13399">article</a>.</p>
<h2>Evaluating Visual Representations</h2>
<p>We define and implement an evaluation scheme for visual representations, based on three nearest neighbor retrieval tasks, to track and guide model training. After each training epoch, a callback is invoked to compute and log the recall for each retrieval task. Each retrieval dataset is split into two smaller datasets: “queries” and “candidates.” The candidates dataset is used to construct a brute-force nearest neighbor index, and the queries dataset is used to look up the index. The index is constructed on the fly after each epoch to account for embeddings changing between training epochs. Each lookup yields K nearest neighbors. We compute Recall@5 and Recall@10 using both historical implicit user interactions (such as “visually-similar ad clicks”) and ground truth datasets of product photos taken from the same listing (“intra-item”). The recall callbacks can also be used for early stopping of training to enhance efficiency.</p>
<p>The intra-item retrieval evaluation dataset consists of groups of seller-uploaded images of the same item. The query and candidate examples are randomly selected seller-uploaded images of an item. A candidate image is considered a positive example if it is associated with the same item as the query. In the “intra-item with reviews” dataset, the query image is a randomly selected buyer-uploaded review image of an item, with seller-uploaded images providing candidate examples. The dataset of visually similar ad clicks associates seller-uploaded primary images with primary images of items that have been clicked in the visually similar surface on mobile. Here, a candidate image is considered a positive example for some query image if a user viewing the query image has clicked it. Each evaluation dataset contains 15,000 records for building the index and 5,000 query images for the retrieval phase. </p>
<p>We also leverage generative AI for an experimental new evaluation scheme. From ample, multilingual historical text query logs, we build a new retrieval dataset that bridges the semantic gap between text-based queries and clicked image candidates. Text-to-image generative stable diffusion makes the information retrieval process language-agnostic, since an image is worth a thousand (multilingual) words. A stable diffusion model generates high-quality images which become image queries. The candidates are images from clicked items corresponding to the source text query in the logs. One caveat is that the dataset is biased toward the search-by-text production system that produced the logs; only a search-by-image-from-text system would produce truly relevant evaluation logs. The source-candidate image pairs form the new retrieval evaluation dataset which is then used within a retrieval callback.</p>
<p>Of course, users entering the same text query may have very different ideas in mind of, say, the garment they’re looking for. So for each query we generate several images: formally, a random sample of length 𝑛 from the posterior distribution over all possible images that can be generated from the seed text query. We pre-condition our generation on a uniform “fashion style.” In a real-world scenario, both the text-to-image query generation and the image query inference for retrieval happen in real time, which means efficient backbones are necessary. We randomly select one of the 𝑛 generated images to replace the text query with an image query in the evaluation dataset. This is a hybrid evaluation method: the error inherent in the text-to-image diffusion model generation is encapsulated in the visually similar recommendation error rate. Future work may include prompt engineering to improve the text query prompt itself, which as input by the user can be short and lacking in detail.</p>
<p>Large memory requirements and high inference latency are challenges in using text-to-image generative models at scale. We employ an open-source stable diffusion model accelerated through token merging and float16 inference. Compared to the standard stable diffusion implementation available at the time we built the system, this method speeds up inference by 50% with a 5x reduction in memory consumption, though results depend on the underlying patched model. We can generate 500 images per hour with one T4 GPU (no parallelism) using the patched stable diffusion pipeline. With parallelism we can achieve further speedup. Figure 4 shows that for the English text query “black bohemian maxi dress with orange floral pattern” the efficient stable diffusion pipeline generates five image query candidates. The generated images include pleasant variations with some detail loss. Interestingly, mostly the facial details of the fashion model are affected, while the garment pattern remains clear. In some cases degradation might prohibit display, but efficient generative technology is being perfected at a fast pace, and prompt engineering helps the generative process as well.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/58c960/6051716515/inv_fullxfull.6051716515_dw6bc33l.jpg?version=0" />
<figcaption>Figure 4. Text-to-image generation using a generative diffusion model, from equivalent queries in English and French</figcaption>
</figure>
<h2>Efficient Inference and Downstream Tasks</h2>
<p>Especially when it comes to latency-sensitive applications like visually similar recommendations and search, efficient inference is paramount: otherwise, we risk loss of impressions and a poor user experience. We can think of inference along two axes: online inference of the image query and efficient retrieval of top-k most similar items via approximate nearest neighbors. The dimension of the learned visual representation impacts the efficient retrieval design as well, and the smaller 256-dimensional embedding derived from EfficientNetB0 presents an advantage. EfficientNetB0 is hard to beat in terms of accuracy-to-latency trade-offs for online inference, with ~5M parameters and around 1.7ms latency on iPhone 12. The EfficientFormer-l3 has ~30M parameters and gets around 2.7ms latency on iPhone 12 with higher accuracy (while for example MobileViT-XS scores around 7ms with a third of accuracy; very large ViTs are not considered since latencies are prohibitive). In offline evaluation, the EfficientFormer-l3-derived embedding achieves a lift of around +5% in Intra-L Recall@5, +17% in Intra-R Recall@5, and +1.8% in Visually Similar Ad clicks Recall@5.</p>
<p>We performed A/B testing on the EfficientNetB0 multitask variant across visual applications at Etsy with good results. Additionally, the EfficientFormer-l3 visual representations led to a +0.65% lift in CTR, and a similar lift in purchase rate, in a first visually-similar-ads experiment when compared to the production variant of EfficientNetB0. When included in sponsored search downstream rankers, the visual representations led to a +1.26% lift in post-click purchase rate. Including the efficient visual representation in Ads Information Retrieval (AIR), an embedding-based retrieval method used to retrieve similar item ad recommendations, caused an increase in click-recall@100 of 8%. And when we used these representations to compute image similarity and included them directly in the last-pass ranking function, we saw a +6.25% lift in clicks.</p>
<p>The first use of EfficientNetB0 visual embeddings was in visually similar ad recommendations on mobile. This led to a +1.92% increase in ad return-on-spend on iOS and a +1.18% increase in post-click purchase rate on Android. The same efficient embedding model backed the first search-by-image shopping experience at Etsy. Users search using photos taken with their mobile phone’s camera and the query image embedding is inferred efficiently online, which we discussed in a previous <a href="https://www.etsy.com/codeascraft/from-image-classification-to-multitask-modeling-building-etsys-search-by-image-feature">blog post</a>. </p>
<p>Learning visual representations is of paramount importance in visually rich e-commerce and online fashion recommendations. Learning them efficiently is a challenging goal made possible by advances in the field of efficient deep learning in computer vision. If you'd like a more in-depth discussion of this work, please see our full paper, accepted to the <a href="https://fashionxrecsys.github.io/fashionxrecsys-2023/">#fashionXrecsys workshop at the Recsys 2023</a> conference.</p>]]></description>
                <content:encoded><![CDATA[<p>Etsy features a diverse marketplace of unique handmade and vintage items. It’s a visually diverse marketplace as well, and computer vision has become increasingly important to Etsy as a way of enhancing our users’ shopping experience. We’ve developed applications like visual search and visually similar recommendations that can offer buyers an additional path to find what they’re looking for, powered by machine learning models that encode images as vector representations.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/e81dea/6003603188/inv_fullxfull.6003603188_fgpr44tc.jpg?version=0" />
<figcaption>Figure 1. Visual representations power applications such as visual search and visually similar recommendations</figcaption>
</figure>
<p>Learning expressive representations through deep neural networks, and being able to leverage them in downstream tasks at scale, is a costly technical challenge. The infrastructure required to train and serve large models is expensive, as is the iterative process that refines them and optimizes their performance. The solution is often to train deep learning architectures offline and use the precomputed visual representations in downstream tasks served online. (We wrote about this in a <a href="https://www.etsy.com/codeascraft/leveraging-real-time-user-actions-to-personalize-etsy-ads">previous blog post</a> on personalization from real-time sequences and diversity of representations.) In any application where a query image representation is inferred online, it's important to have low-latency, memory-aware models. Efficiency becomes paramount to the success of these models in the product. We can think about efficiency in deep learning along multiple axes: efficiency in model architecture, model training, evaluation, and serving.</p>
<h2>Model Architecture</h2>
<p>The EfficientNet family of models features a convolutional neural network architecture. It uniformly scales network width, depth, and resolution using a fixed set of coefficients. By allowing practitioners to start from a limited resource budget and scale up for better accuracy as more resources are available, EfficientNet provides a great starting point for visual representations. We began our trials with EfficientNetB0, the smallest model in the EfficientNet family. We saw good performance and low latency with this model, but the industry and research community have touted Vision Transformers (ViT) as having better representations. We decided to give that a try.</p>
<p>Transformers lack the spatial inductive biases of CNNs, but they outperform CNNs when trained on large enough datasets and may be more robust to domain shifts. ViT decomposes the image into a sequence of patches (16×16, for example) and applies a transformer architecture to incorporate more global information. However, due to the massive number of parameters and the compute-heavy attention mechanism, ViT-based architectures can be many times slower to train and to run inference with than lightweight convolutional networks. Despite the challenges, more efficient ViT architectures have recently begun to emerge, featuring clever pooling, layer dropping, efficient normalization, and efficient attention or hybrid CNN-transformer designs.</p>
<p>We employ the EfficientFormer-l3 to take advantage of these ViT improvements. The EfficientFormer architecture achieves efficiency by downsampling across multiple blocks and employing attention only in the last stage. The way we derive the image representation differs from the standard vision transformer, where embeddings are extracted from the first token of the output. Instead, we extract the attention from the last block for the eight heads and perform average pooling over the sequence. In Figure 2 we illustrate these different attention weights with heat maps overlaid on an image, showing how each of the eight heads learns to focus on a different salient part.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/8cb94e/6051704113/inv_fullxfull.6051704113_h0y3tmqi.jpg?version=0" />
<figcaption>Figure 2. Probing the EfficientFormer-l3 pre-trained visual representations through attention heat maps.</figcaption>
</figure>
<h2>Model Training</h2>
<h3>Fine-Tuning</h3>
<p>With our pre-trained backbones in place, we can gain further efficiencies via fine-tuning. For the EfficientNetB0 CNN, that means replacing the final convolutional layer and attaching a d-dimensional embedding layer followed by m classification heads, where m is the number of tasks. The embedding head consists of a new convolutional layer with the desired final representation dimension, followed by a batch normalization layer, a swish activation, and a global average pooling layer to aggregate the convolutional output into a single vector per example. To train EfficientNetB0, the newly attached layers are trained from scratch for one epoch with the backbone layers frozen, to avoid excessive computation and overfitting. We then unfreeze 75 layers from the top of the backbone and fine-tune for nine additional epochs, for efficient learning. At inference time we remove the classification heads and extract the output of the pooling layer as the final representation.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/c1a749/6003627480/inv_fullxfull.6003627480_731l7h49.jpg?version=0" />
</figure>
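<p>The two-phase schedule described above for EfficientNetB0 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name <code>trainable_layers</code>, the zero-based epoch numbering, and the idea of addressing backbone layers by index are hypothetical and not taken from our training code.</p>

```python
def trainable_layers(epoch, num_backbone_layers, unfreeze_top=75, warmup_epochs=1):
    """Return the indices of backbone layers that are trainable in this epoch.

    Hypothetical helper: epochs are zero-based, and layer index 0 is the layer
    closest to the input. The newly attached heads are always trainable and
    are not listed here.
    """
    # Phase 1: the whole backbone stays frozen while the new layers warm up.
    if epoch in range(warmup_epochs):
        return []
    # Phase 2: unfreeze only the top `unfreeze_top` layers of the backbone.
    first = max(0, num_backbone_layers - unfreeze_top)
    return list(range(first, num_backbone_layers))


# Ten epochs in total: one frozen warm-up epoch, then nine fine-tuning epochs.
# The layer count of 200 is a made-up number for illustration.
schedule = [len(trainable_layers(epoch, num_backbone_layers=200)) for epoch in range(10)]
```

<p>With these toy numbers, <code>schedule</code> holds the count of trainable backbone layers per epoch: none in the first epoch, then 75 in each of the remaining nine.</p>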
<p>To fine-tune the EfficientFormer ViT we stick with the pretraining resolution of 224×224, since using sequences longer than the recommended 384×384 in ViT leads to larger training budgets. To extract the embedding we average pool the last hidden state. Then classification heads are added as with the CNN, with batch normalization being swapped for layer normalization.</p>
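<p>As a minimal sketch of the pooling step, assuming the last hidden state is available as a plain list of per-token vectors (the function name <code>average_pool</code> and the toy values are ours, for illustration only):</p>

```python
def average_pool(hidden_states):
    """Average a sequence of token vectors into a single embedding vector.

    hidden_states is a list of tokens, each a list of floats of equal length.
    """
    num_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    # Mean over the sequence axis, one output value per embedding dimension.
    return [sum(token[i] for token in hidden_states) / num_tokens for i in range(dim)]


# Toy last hidden state: a sequence of 2 tokens with 3 dimensions each.
embedding = average_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
```

<p>In a real model the same reduction would run as a tensor operation over the sequence axis; the list version only makes the arithmetic explicit.</p>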
<h3>Multitask Learning</h3>
<p>In a previous <a href="https://www.etsy.com/codeascraft/from-image-classification-to-multitask-modeling-building-etsys-search-by-image-feature">blog post</a> we described how we built a multitask learning framework to generate visual representations for Etsy's search-by-image experience. The training architecture is shown in Figure 3.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/846ae7/6003634602/inv_fullxfull.6003634602_scbrb3bf.jpg?version=0" />
<figcaption>Figure 3. A multitask training architecture for visual representations. The dataset sampler combines examples from an arbitrary number of datasets corresponding to respective classification heads. The embedding is extracted before the classification heads.</figcaption>
</figure>
<p>Multitask learning is an efficiency inducer. Representations encode commonalities, and they perform well in diverse downstream tasks when they are learned using common attributes as multiple supervision signals. A representation learned in single-task classification to the item’s taxonomy, for example, will be unable to capture visual attributes: colors, shapes, materials. We employ four classification tasks: a top-level taxonomy task with 15 top-level categories of the Etsy taxonomy tree as labels; a fine-grained taxonomy task, with 1,000 fine-grained leaf node item categories as labels; a primary color task; and a fine-grained taxonomy task (review photos), where each example is a buyer-uploaded review photo of a purchased item with 100 labels sampled from fine-grained leaf node item categories.</p>
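<p>To make the role of the dataset sampler in Figure 3 more concrete, here is a small sketch of how examples from several labeled datasets might be interleaved so that each one is routed to its own classification head. The uniform choice of task, the function name <code>multitask_sampler</code>, and the toy image ids and labels are assumptions made for this example, not a description of our production sampler.</p>

```python
import random


def multitask_sampler(datasets, num_examples, seed=0):
    """Yield (task_name, example) pairs drawn from several labeled datasets.

    datasets maps a task name to a list of (image_id, label) examples. Each
    draw picks a task uniformly at random (an assumption), then an example
    from that task's dataset, so each head only sees its own labels.
    """
    rng = random.Random(seed)
    task_names = sorted(datasets)
    for _ in range(num_examples):
        task = rng.choice(task_names)
        yield task, rng.choice(datasets[task])


# Toy stand-ins for the four classification tasks described above.
datasets = {
    "top_level_taxonomy": [("img1", "clothing"), ("img2", "jewelry")],
    "fine_grained_taxonomy": [("img3", "maxi_dress")],
    "primary_color": [("img4", "black"), ("img5", "orange")],
    "fine_grained_taxonomy_reviews": [("img6", "maxi_dress")],
}
batch = list(multitask_sampler(datasets, num_examples=8))
```

<p>Each element of <code>batch</code> carries the task name alongside the example, which is what lets a training loop apply the loss of the matching head only.</p>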
<p>We are able to train both EfficientNetB0 and EfficientFormer-l3 on standard 16GB GPUs (we used two P100 GPUs). For comparison, a full-sized ViT requires a larger GPU with 40GB of memory, such as an A100, which can increase training costs significantly. We provide detailed hyperparameter information for fine-tuning either backbone in our <a href="https://arxiv.org/pdf/2305.13399">article</a>.</p>
<h2>Evaluating Visual Representations</h2>
<p>We define and implement an evaluation scheme for visual representations, based on three nearest neighbor retrieval tasks, to track and guide model training. After each training epoch, a callback is invoked to compute and log the recall for each retrieval task. Each retrieval dataset is split into two smaller datasets: “queries” and “candidates.” The candidates dataset is used to construct a brute-force nearest neighbor index, and the queries dataset is used to look up the index. The index is constructed on the fly after each epoch to account for embeddings changing between training epochs. Each lookup yields K nearest neighbors. We compute Recall@5 and Recall@10 using both historical implicit user interactions (such as “visually-similar ad clicks”) and ground truth datasets of product photos taken from the same listing (“intra-item”). The recall callbacks can also be used for early stopping of training to enhance efficiency.</p>
<p>The intra-item retrieval evaluation dataset consists of groups of seller-uploaded images of the same item. The query and candidate examples are randomly selected seller-uploaded images of an item. A candidate image is considered a positive example if it is associated with the same item as the query. In the “intra-item with reviews” dataset, the query image is a randomly selected buyer-uploaded review image of an item, with seller-uploaded images providing candidate examples. The dataset of visually similar ad clicks associates seller-uploaded primary images with primary images of items that have been clicked in the visually similar surface on mobile. Here, a candidate image is considered a positive example for some query image if a user viewing the query image has clicked it. Each evaluation dataset contains 15,000 records for building the index and 5,000 query images for the retrieval phase. </p>
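<p>The retrieval evaluation above can be illustrated with a short brute-force sketch. It makes several assumptions that the text does not specify: cosine similarity as the distance measure, a hit-rate reading of recall (a query counts as a hit if at least one positive appears in its top K), and the intra-item rule for positives. The function names and the toy embeddings are hypothetical.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_k(queries, candidates, k):
    """Fraction of queries with at least one positive among their top-k neighbors.

    queries and candidates are lists of (item_id, embedding) pairs. A candidate
    counts as positive when it shares the query's item_id (the intra-item rule).
    """
    hits = 0
    for query_item, query_vec in queries:
        # Brute force: score every candidate, then keep the k most similar.
        ranked = sorted(
            candidates,
            key=lambda cand: cosine_similarity(query_vec, cand[1]),
            reverse=True,
        )
        top_k_items = [item_id for item_id, _ in ranked[:k]]
        if query_item in top_k_items:
            hits += 1
    return hits / len(queries)


# Toy data: three candidate images and two query images in a 2-d embedding space.
candidates = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.7, 0.7])]
queries = [("a", [0.9, 0.1]), ("b", [0.8, 0.2])]
recall_at_1 = recall_at_k(queries, candidates, k=1)
recall_at_3 = recall_at_k(queries, candidates, k=3)
```

<p>In a training callback, the candidate embeddings would be recomputed after every epoch before calling a function like this, since the index has to reflect the current model weights.</p>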
<p>We also leverage generative AI for an experimental new evaluation scheme. From ample, multilingual historical text query logs, we build a new retrieval dataset that bridges the semantic gap between text-based queries and clicked image candidates. Text-to-image generative stable diffusion makes the information retrieval process language-agnostic, since an image is worth a thousand (multilingual) words. A stable diffusion model generates high-quality images which become image queries. The candidates are images from clicked items corresponding to the source text query in the logs. One caveat is that the dataset is biased toward the search-by-text production system that produced the logs; only a search-by-image-from-text system would produce truly relevant evaluation logs. The source-candidate image pairs form the new retrieval evaluation dataset which is then used within a retrieval callback.</p>
<p>Of course, users entering the same text query may have very different ideas in mind of, say, the garment they’re looking for. So for each query we generate several images: formally, a random sample of length 𝑛 from the posterior distribution over all possible images that can be generated from the seed text query. We pre-condition our generation on a uniform “fashion style.” In a real-world scenario, both the text-to-image query generation and the image query inference for retrieval happen in real time, which means efficient backbones are necessary. We randomly select one of the 𝑛 generated images to replace the text query with an image query in the evaluation dataset. This is a hybrid evaluation method: the error inherent in the text-to-image diffusion model generation is encapsulated in the visually similar recommendation error rate. Future work may include prompt engineering to improve the text query prompt itself, which as input by the user can be short and lacking in detail.</p>
<p>Large memory requirements and high inference latency are challenges in using text-to-image generative models at scale. We employ an open-source stable diffusion model accelerated through token merging and float16 inference. Compared to the standard stable diffusion implementation available at the time we built the system, this method speeds up inference by 50% with a 5x reduction in memory consumption, though results depend on the underlying patched model. We can generate 500 images per hour with one T4 GPU (no parallelism) using the patched stable diffusion pipeline. With parallelism we can achieve further speedup. Figure 4 shows that for the English text query “black bohemian maxi dress with orange floral pattern” the efficient stable diffusion pipeline generates five image query candidates. The generated images include pleasant variations with some detail loss. Interestingly, mostly the facial details of the fashion model are affected, while the garment pattern remains clear. In some cases degradation might prohibit display, but efficient generative technology is being perfected at a fast pace, and prompt engineering helps the generative process as well.</p>
<figure>
<img alt="" src="https://i.etsystatic.com/inv/58c960/6051716515/inv_fullxfull.6051716515_dw6bc33l.jpg?version=0" />
<figcaption>Figure 4. Text-to-image generation using a generative diffusion model, from equivalent queries in English and French</figcaption>
</figure>
<h2>Efficient Inference and Downstream Tasks</h2>
<p>Especially when it comes to latency-sensitive applications like visually similar recommendations and search, efficient inference is paramount: otherwise, we risk loss of impressions and a poor user experience. We can think of inference along two axes: online inference of the image query and efficient retrieval of top-k most similar items via approximate nearest neighbors. The dimension of the learned visual representation impacts the efficient retrieval design as well, and the smaller 256-dimensional embedding derived from EfficientNetB0 presents an advantage. EfficientNetB0 is hard to beat in terms of accuracy-to-latency trade-offs for online inference, with ~5M parameters and around 1.7ms latency on iPhone 12. The EfficientFormer-l3 has ~30M parameters and gets around 2.7ms latency on iPhone 12 with higher accuracy (while for example MobileViT-XS scores around 7ms with a third of accuracy; very large ViTs are not considered since latencies are prohibitive). In offline evaluation, the EfficientFormer-l3-derived embedding achieves a lift of around +5% in Intra-L Recall@5, +17% in Intra-R Recall@5, and +1.8% in Visually Similar Ad clicks Recall@5.</p>
<p>We performed A/B testing on the EfficientNetB0 multitask variant across visual applications at Etsy with good results. Additionally, the EfficientFormer-l3 visual representations led to a +0.65% lift in CTR, and a similar lift in purchase rate, in a first visually-similar-ads experiment when compared to the production variant of EfficientNetB0. When included in sponsored search downstream rankers, the visual representations led to a +1.26% lift in post-click purchase rate. Including the efficient visual representation in Ads Information Retrieval (AIR), an embedding-based retrieval method used to retrieve similar item ad recommendations, caused an increase in click-recall@100 of 8%. And when we used these representations to compute image similarity and included them directly in the last-pass ranking function, we saw a +6.25% lift in clicks.</p>
<p>The first use of EfficientNetB0 visual embeddings was in visually similar ad recommendations on mobile. This led to a +1.92% increase in ad return-on-spend on iOS and a +1.18% increase in post-click purchase rate on Android. The same efficient embedding model backed the first search-by-image shopping experience at Etsy. Users search using photos taken with their mobile phone’s camera and the query image embedding is inferred efficiently online, which we discussed in a previous <a href="https://www.etsy.com/codeascraft/from-image-classification-to-multitask-modeling-building-etsys-search-by-image-feature">blog post</a>. </p>
<p>Learning visual representations is of paramount importance in visually rich e-commerce and online fashion recommendations. Learning them efficiently is a challenging goal made possible by advances in the field of efficient deep learning in computer vision. If you'd like a more in-depth discussion of this work, please see our full paper, accepted to the <a href="https://fashionxrecsys.github.io/fashionxrecsys-2023/">#fashionXrecsys workshop at the Recsys 2023</a> conference.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Macramé: Untangling the Knot on the Etsy Android Listing Screen]]></title>
                <link>https://www.etsy.com/codeascraft/macrame-untangling-the-knot-on-the-android-listing-screen?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[John Kalmi]]></dc:creator>
                <pubDate>Mon, 18 Mar 2024 12:19:15 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/macrame-untangling-the-knot-on-the-android-listing-screen?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>Easily the most important and complex screen in the Buy on Etsy Android app is the listing screen, where all key information about an item for sale in the Etsy marketplace is displayed to buyers. Far from just a title and description, a price and a few images, over the years the listing screen has come to aggregate ratings and reviews, seller and shipping and stock information, and gained a variety of personalization and recommendation features. As information-rich as it is, as central as it is to the buying experience, for product teams the listing screen is an irresistible place to test out new methods and approaches. In just the last three years, apps teams have run nearly 200 experiments on it, often with multiple teams building and running experiments in parallel.</p>
<p>Eventually, with such a high velocity of experiment and code change, the listing screen started showing signs of stress. Its architecture was inconsistent and not meant to support a codebase expanding so much and so rapidly in size and complexity. Given the relative autonomy of Etsy app development teams, there ended up being a lot of reinventing the wheel, lots of incompatible patterns getting layered atop one another; in short the code resembled a giant plate of spaghetti. The main listing Fragment file alone had over 4000 lines of code in it!</p>
<p>Code that isn’t built for testability doesn’t test well, and test coverage for the listing screen was low. VERY low. Our legacy architecture made it hard for developers to add tests for business logic, and the tests that did get written were complex and brittle, and often caused continuous integration failures for seemingly unrelated changes. Developers would skip tests when it seemed too costly to write and maintain them, those skipped tests made the codebase harder for new developers to onboard into or work with confidently, and the result was a vicious circle that would lead to even less test coverage.</p>
<h2>Introducing Macramé</h2>
<p>We decided that our new architecture for the listing screen, which we’ve named <a href="https://www.etsy.com/search?q=macram%25C3%25A9">Macramé</a>, would be based on immutable data propagated through a reactive UI. Reactive frameworks are widely deployed and well understood, and we could see a number of ways that reactivity would help us untangle the spaghetti. We chose to emulate architectures like Spotify’s <a href="https://spotify.github.io/mobius/">Mobius</a>, molded to fit the shape of Etsy’s codebase and its business requirements.</p>
<p>At the core of the architecture is an immutable State object that represents our data model. State for the listing screen is passed to the UI as a single data object via a <a href="https://developer.android.com/kotlin/flow/stateflow-and-sharedflow">StateFlow</a> instance; each time a piece of the data model changes the UI re-renders. Updates to State can be made either from a background thread or from the main UI thread, and using StateFlow ensures that all updates reach the main UI thread. When the data model for a screen is large, as it is for the listing screen, updating the UI from a single object makes things much simpler to test and reason about than if multiple separate models are making changes independently. And that simplicity lets us streamline the rest of the architecture.</p>
<p>When changes are made to the State, the monolithic data model gets transformed into a list of smaller models that represent what will actually be shown to the user, in vertical order on the screen. The code below shows an example of state held in the Buy Box section of the screen, along with its smaller Title sub-component.</p>
<pre><code>data class BuyBox(
    val title: Title,
    val price: Price,
    val saleEndingSoonBadge: SaleEndingSoonBadge,
    val unitPricing: UnitPricing,
    val vatTaxDescription: VatTaxDescription,
    val transparentPricing: TransparentPricing,
    val firstVariation: Variation,
    val secondVariation: Variation,
    val klarnaInfo: KlarnaInfo,
    val freeShipping: FreeShipping,
    val estimatedDelivery: EstimatedDelivery,
    val quantity: Quantity,
    val personalization: Personalization,
    val expressCheckout: ExpressCheckout,
    val cartButton: CartButton,
    val termsAndConditions: TermsAndConditions,
    val ineligibleShipping: IneligibleShipping,
    val lottieNudge: LottieNudge,
    val listingSignalColumns: ListingSignalColumns,
    val shopBanner: ShopBanner,
)

data class Title(
    val text: String,
    val textInAlternateLanguage: String? = null,
    val isExpanded: Boolean = false,
) : ListingUiModel()</code></pre>
<p>In our older architecture, the screen was based on a single scrollable View. All data was bound and rendered during the View's initial layout pass, which created a noticeable pause the first time the screen was loaded. In the new screen, a RecyclerView is backed by a ListAdapter, which allows for asynchronous diffs of the data changes, avoiding the need to rebind portions of the screen that aren't receiving updates. Each of the vertical elements on the screen (title, image gallery, price, etc.) is represented by its own ViewHolder, which binds whichever of the smaller data models the element relies on.</p>
<p>In this code, the BuyBox is transformed into a vertical list of ListingUiModels to display in the RecyclerView.</p>
<pre><code>fun BuyBox.toUiModels(): List&lt;ListingUiModel&gt; {
    return listOf(
        price,
        title,
        shopBanner,
        listingSignalColumns,
        unitPricing,
        vatTaxDescription,
        transparentPricing,
        klarnaInfo,
        estimatedDelivery,
        firstVariation,
        secondVariation,
        quantity,
        personalization,
        ineligibleShipping,
        cartButton,
        expressCheckout,
        termsAndConditions,
        lottieNudge,
    )
}</code></pre>
<p>An Event dispatching system handles user actions, which are represented by a sealed Event class. The use of <a href="https://www.etsy.com/codeascraft/sealed-classes-opened-my-mind">sealed classes</a> for Events, coupled with Kotlin "when" statements mapping Events to Handlers, provides compile-time safety to ensure all of the pieces are in place to handle the Event properly. These Events are fed to a single Dispatcher queue, which is responsible for routing Events to the Handlers that are registered to receive them.</p>
<p>Handlers perform a variety of tasks: starting asynchronous network calls, dispatching more Events, dispatching SideEffects, or updating State. We want to make it easy to reason about what Handlers are doing, so our architecture promotes keeping their scope of responsibility as small as possible. Simple Handlers are simple to write tests for, which leads to better test coverage and improved developer confidence.</p>
<p>In the example below, a click handler on the listing title sets a State property that tells the UI to display an expanded title:</p>
<pre><code>class TitleClickedHandler constructor() {

    fun handle(state: ListingViewState.Listing): ListingEventResult.StateChange {
        val buyBox = state.buyBox 
        return ListingEventResult.StateChange(
            state = state.copy(
                buyBox = buyBox.copy(
                    title = buyBox.title.copy(isExpanded = true)
                )
            )
        )
    }
}</code></pre>
<p>SideEffects are a special type of Event used to represent, typically, one-time operations that need to interact with the UI but aren’t considered pure business logic: showing dialogs, logging events, performing navigation or showing Snackbar messages. SideEffects end up being routed to the Fragment to be handled.</p>
<p>Take the scenario of a user clicking on a listing's Add to Cart button. The Handler for that Event might:</p>
<ul>
<li>dispatch a SideEffect to log the button click</li>
<li>start an asynchronous network call to update the user’s cart</li>
<li>update the State to show a loading indicator while the cart update finishes</li>
</ul>
<p>While the network call is running on a background thread, the Dispatcher is free to handle other Events that may be in the queue. When the network call completes in the background, a new Event will be dispatched with either a success or failure result. A different Handler is then responsible for handling both the success and failure Events.</p>
<p>This diagram illustrates the flow of Events, SideEffects, and State through the architecture:</p>
<figure>
<img alt="Macramé Architecture" src="https://i.etsystatic.com/inv/044fad/5843012537/inv_fullxfull.5843012537_l4a37ye8.jpg?version=0" title="Macramé Architecture" />
<figcaption>Figure 1. A flow chart illustrating system components (blue boxes) and how events and state changes (yellow boxes) flow between them.</figcaption>
</figure>
<h2>Results</h2>
<p>The rewrite process took five months, with as many as five Android developers working on the project at once. One challenge we faced along the way was keeping the new listing screen up to date with all of the experiments being run on the old listing screen while development was in progress. The team also had to create a suite of tests that could comprehensively cover the diversity of listings available on Etsy, to ensure that we didn’t forget any features or break any.</p>
<p>With the rewrite complete, the team ran an A/B experiment against the existing listing screen to test both performance and user behavior between the two versions. Though the new listing screen felt qualitatively quicker than the old listing screen, we wanted to understand how users would react to subtle changes in the new experience.</p>
<p>We instrumented both the old and the new listing screens to measure performance changes from the refactor. The new screen performed even better than expected. Time to First Content decreased by 18%, from 1585 ms to 1298 ms. This speedup resulted in increases of 2.4% in the average number of listings viewed by buyers, 0.43% in add-to-carts, 2% in searches, and 3.3% in buyer review photo views.</p>
<p>On the developer side, unit test coverage increased from single-digit percentages to a whopping 76% code coverage of business logic classes. This result validates our decision to put nearly all business logic into Handler classes, each responsible for handling just a single Event at a time. We built a robust collection of tools for generating test States in a variety of common configurations, so writing unit tests for the Handlers is as simple as generating an input event and validating that the correct State and SideEffects are produced.</p>
<p>Creating any new architecture involves making tradeoffs, and this project was no exception. Macramé is under active development, and we have a few pieces of feedback on our agenda to be addressed:</p>
<ul>
<li>There is some amount of boilerplate still needed to correctly wire up a new Event and Handler, and we'd like to make that go away.</li>
<li>The ability of Handlers to dispatch their own Events sometimes makes debugging complex Handler interactions more difficult than previous formulations of the same business logic.</li>
<li>On a relatively simple screen, the architecture can feel like overkill.</li>
</ul>
<p>Adding new features correctly to the listing screen is now the easy thing to do. The dual benefit of increasing business metrics while also increasing developer productivity and satisfaction has resulted in the Android team expanding the usage of Macramé to two more of the key screens in the app (Cart and Shop), both of which completely rewrote their UI using <a href="https://developer.android.com/jetpack/compose">Jetpack Compose</a>: but those are topics for future Code as Craft posts.</p>]]></description>
                <content:encoded><![CDATA[<p>Easily the most important and complex screen in the Buy on Etsy Android app is the listing screen, where all key information about an item for sale in the Etsy marketplace is displayed to buyers. Far from just a title and description, a price and a few images, over the years the listing screen has come to aggregate ratings and reviews, seller and shipping and stock information, and gained a variety of personalization and recommendation features. As information-rich as it is, as central as it is to the buying experience, for product teams the listing screen is an irresistible place to test out new methods and approaches. In just the last three years, apps teams have run nearly 200 experiments on it, often with multiple teams building and running experiments in parallel.</p>
<p>Eventually, with such a high velocity of experiment and code change, the listing screen started showing signs of stress. Its architecture was inconsistent and not meant to support a codebase expanding so much and so rapidly in size and complexity. Given the relative autonomy of Etsy app development teams, there ended up being a lot of reinventing the wheel, lots of incompatible patterns getting layered atop one another; in short the code resembled a giant plate of spaghetti. The main listing Fragment file alone had over 4000 lines of code in it!</p>
<p>Code that isn’t built for testability doesn’t test well, and test coverage for the listing screen was low. VERY low. Our legacy architecture made it hard for developers to add tests for business logic, and the tests that did get written were complex and brittle, and often caused continuous integration failures for seemingly unrelated changes. Developers would skip tests when it seemed too costly to write and maintain them, those skipped tests made the codebase harder for new developers to onboard into or work with confidently, and the result was a vicious circle that would lead to even less test coverage.</p>
<h2>Introducing Macramé</h2>
<p>We decided that our new architecture for the listing screen, which we’ve named <a href="https://www.etsy.com/search?q=macram%25C3%25A9">Macramé</a>, would be based on immutable data propagated through a reactive UI. Reactive frameworks are widely deployed and well understood, and we could see a number of ways that reactivity would help us untangle the spaghetti. We chose to emulate architectures like Spotify’s <a href="https://spotify.github.io/mobius/">Mobius</a>, molded to fit the shape of Etsy’s codebase and its business requirements.</p>
<p>At the core of the architecture is an immutable State object that represents our data model. State for the listing screen is passed to the UI as a single data object via a <a href="https://developer.android.com/kotlin/flow/stateflow-and-sharedflow">StateFlow</a> instance; each time a piece of the data model changes the UI re-renders. Updates to State can be made either from a background thread or from the main UI thread, and using StateFlow ensures that all updates reach the main UI thread. When the data model for a screen is large, as it is for the listing screen, updating the UI from a single object makes things much simpler to test and reason about than if multiple separate models are making changes independently. And that simplicity lets us streamline the rest of the architecture.</p>
<p>When changes are made to the State, the monolithic data model gets transformed into a list of smaller models that represent what will actually be shown to the user, in vertical order on the screen. The code below shows an example of state held in the Buy Box section of the screen, along with its smaller Title sub-component.</p>
<pre><code>data class BuyBox(
    val title: Title,
    val price: Price,
    val saleEndingSoonBadge: SaleEndingSoonBadge,
    val unitPricing: UnitPricing,
    val vatTaxDescription: VatTaxDescription,
    val transparentPricing: TransparentPricing,
    val firstVariation: Variation,
    val secondVariation: Variation,
    val klarnaInfo: KlarnaInfo,
    val freeShipping: FreeShipping,
    val estimatedDelivery: EstimatedDelivery,
    val quantity: Quantity,
    val personalization: Personalization,
    val expressCheckout: ExpressCheckout,
    val cartButton: CartButton,
    val termsAndConditions: TermsAndConditions,
    val ineligibleShipping: IneligibleShipping,
    val lottieNudge: LottieNudge,
    val listingSignalColumns: ListingSignalColumns,
    val shopBanner: ShopBanner,
)

data class Title(
    val text: String,
    val textInAlternateLanguage: String? = null,
    val isExpanded: Boolean = false,
) : ListingUiModel()</code></pre>
<p>In our older architecture, the screen was based on a single scrollable View. All data was bound and rendered during the View's initial layout pass, which created a noticeable pause the first time the screen was loaded. In the new screen, a RecyclerView is backed by a ListAdapter, which allows for asynchronous diffs of the data changes, avoiding the need to rebind portions of the screen that aren't receiving updates. Each of the vertical elements on the screen (title, image gallery, price, etc.) is represented by its own ViewHolder, which binds whichever of the smaller data models the element relies on.</p>
<p>In this code, the BuyBox is transformed into a vertical list of ListingUiModels to display in the RecyclerView.</p>
<pre><code>fun BuyBox.toUiModels(): List&lt;ListingUiModel&gt; {
    return listOf(
        price,
        title,
        shopBanner,
        listingSignalColumns,
        unitPricing,
        vatTaxDescription,
        transparentPricing,
        klarnaInfo,
        estimatedDelivery,
        firstVariation,
        secondVariation,
        quantity,
        personalization,
        ineligibleShipping,
        cartButton,
        expressCheckout,
        termsAndConditions,
        lottieNudge,
    )
}</code></pre>
<p>An Event dispatching system handles user actions, which are represented by a sealed Event class. The use of <a href="https://www.etsy.com/codeascraft/sealed-classes-opened-my-mind">sealed classes</a> for Events, coupled with Kotlin "when" statements mapping Events to Handlers, provides compile-time safety to ensure all of the pieces are in place to handle the Event properly. These Events are fed to a single Dispatcher queue, which is responsible for routing Events to the Handlers that are registered to receive them.</p>
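<p>To make the compile-time guarantee concrete, here is a minimal sketch of the pattern. The names <code>ListingEvent</code>, <code>QuantitySelected</code>, and <code>handlerNameFor</code> are illustrative stand-ins, not the production classes, and a plain in-memory queue stands in for the real Dispatcher.</p>
<pre><code>// Hypothetical Event hierarchy; the real Events are not shown in this post.
sealed class ListingEvent {
    object TitleClicked : ListingEvent()
    data class QuantitySelected(val quantity: Int) : ListingEvent()
    object AddToCartClicked : ListingEvent()
}

// `when` is used as an expression over a sealed type, so the compiler
// rejects this function if a new Event subclass is added without a branch.
fun handlerNameFor(event: ListingEvent): String = when (event) {
    is ListingEvent.TitleClicked -&gt; "TitleClickedHandler"
    is ListingEvent.QuantitySelected -&gt; "QuantitySelectedHandler"
    is ListingEvent.AddToCartClicked -&gt; "AddToCartClickedHandler"
}

fun main() {
    // A simple FIFO queue standing in for the single Dispatcher queue.
    val queue = ArrayDeque&lt;ListingEvent&gt;()
    queue.addLast(ListingEvent.TitleClicked)
    queue.addLast(ListingEvent.QuantitySelected(quantity = 2))

    // Events are routed one at a time, in the order they were dispatched.
    while (queue.isNotEmpty()) {
        val event = queue.removeFirst()
        println("Routing to ${handlerNameFor(event)}")
    }
}</code></pre>
<p>Deleting any one branch of the <code>when</code> turns a forgotten Handler into a build error rather than a runtime surprise.</p>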
<p>Handlers perform a variety of tasks: starting asynchronous network calls, dispatching more Events, dispatching SideEffects, or updating State. We want to make it easy to reason about what Handlers are doing, so our architecture promotes keeping their scope of responsibility as small as possible. Simple Handlers are simple to write tests for, which leads to better test coverage and improved developer confidence.</p>
<p>In the example below, a click handler on the listing title sets a State property that tells the UI to display an expanded title:</p>
<pre><code>class TitleClickedHandler constructor() {

    fun handle(state: ListingViewState.Listing): ListingEventResult.StateChange {
        val buyBox = state.buyBox 
        return ListingEventResult.StateChange(
            state = state.copy(
                buyBox = buyBox.copy(
                    title = buyBox.title.copy(isExpanded = true)
                )
            )
        )
    }
}</code></pre>
<p>SideEffects are a special type of Event used to represent, typically, one-time operations that need to interact with the UI but aren’t considered pure business logic: showing dialogs, logging events, performing navigation or showing Snackbar messages. SideEffects end up being routed to the Fragment to be handled.</p>
<p>Take the scenario of a user clicking on a listing's Add to Cart button. The Handler for that Event might:</p>
<ul>
<li>dispatch a SideEffect to log the button click</li>
<li>start an asynchronous network call to update the user’s cart</li>
<li>update the State to show a loading indicator while the cart update finishes</li>
</ul>
<p>While the network call is running on a background thread, the Dispatcher is free to handle other Events that may be in the queue. When the network call completes in the background, a new Event will be dispatched with either a success or failure result. A different Handler is then responsible for handling both the success and failure Events.</p>
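<p>As a rough sketch of this scenario, the two Handlers can be modeled as pure functions that take the current State and return a new State plus any SideEffects. All names here (<code>CartState</code>, <code>CartUpdateFinished</code>, <code>HandlerResult</code>, and so on) are assumptions made for illustration, and the network call itself is left out: only the Events it would produce are modeled.</p>
<pre><code>// Hypothetical, heavily simplified State for the cart portion of the screen.
data class CartState(val isLoading: Boolean = false, val itemCount: Int = 0)

// Hypothetical SideEffects that would be routed to the Fragment.
sealed class SideEffect {
    data class LogEvent(val name: String) : SideEffect()
    data class ShowSnackbar(val message: String) : SideEffect()
}

// Hypothetical Events dispatched when the background network call completes.
sealed class CartUpdateFinished {
    data class Success(val newItemCount: Int) : CartUpdateFinished()
    data class Failure(val message: String) : CartUpdateFinished()
}

data class HandlerResult(
    val state: CartState,
    val sideEffects: List&lt;SideEffect&gt; = emptyList(),
)

// First Handler: log the click and show a loading indicator.
// Starting the network call is omitted from this sketch.
fun handleAddToCartClicked(state: CartState): HandlerResult =
    HandlerResult(
        state = state.copy(isLoading = true),
        sideEffects = listOf(SideEffect.LogEvent("add_to_cart_clicked")),
    )

// Second Handler: covers both the success and the failure Event.
fun handleCartUpdateFinished(state: CartState, event: CartUpdateFinished): HandlerResult =
    when (event) {
        is CartUpdateFinished.Success -&gt; HandlerResult(
            state = state.copy(isLoading = false, itemCount = event.newItemCount),
        )
        is CartUpdateFinished.Failure -&gt; HandlerResult(
            state = state.copy(isLoading = false),
            sideEffects = listOf(SideEffect.ShowSnackbar(event.message)),
        )
    }

fun main() {
    val afterClick = handleAddToCartClicked(CartState())
    check(afterClick.state.isLoading)

    val afterSuccess = handleCartUpdateFinished(
        afterClick.state,
        CartUpdateFinished.Success(newItemCount = 1),
    )
    check(!afterSuccess.state.isLoading)
    check(afterSuccess.state.itemCount == 1)
}</code></pre>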
<p>This diagram illustrates the flow of Events, SideEffects, and State through the architecture:</p>
<figure>
<img alt="Macramé Architecture" src="https://i.etsystatic.com/inv/044fad/5843012537/inv_fullxfull.5843012537_l4a37ye8.jpg?version=0" title="Macramé Architecture" />
<figcaption>Figure 1. A flow chart illustrating system components (blue boxes) and how events and state changes (yellow boxes) flow between them.</figcaption>
</figure>
<h2>Results</h2>
<p>The rewrite process took five months, with as many as five Android developers working on the project at once. One challenge we faced along the way was keeping the new listing screen up to date with all of the experiments being run on the old listing screen while development was in progress. The team also had to create a suite of tests that could comprehensively cover the diversity of listings available on Etsy, to ensure that we didn’t forget any features or break any.</p>
<p>With the rewrite complete, the team ran an A/B experiment against the existing listing screen to test both performance and user behavior between the two versions. Though the new listing screen felt qualitatively quicker than the old listing screen, we wanted to understand how users would react to subtle changes in the new experience.</p>
<p>We instrumented both the old and the new listing screens to measure performance changes from the refactor. The new screen performed even better than expected. Time to First Content decreased by 18%, from 1585 ms to 1298 ms. This speedup resulted in increases of 2.4% in the average number of listings viewed by buyers, 0.43% in add-to-carts, 2% in searches, and 3.3% in buyer review photo views.</p>
<p>On the developer side, unit test coverage increased from single-digit percentages to a whopping 76% code coverage of business logic classes. This result validates our decision to put nearly all business logic into Handler classes, each responsible for handling just a single Event at a time. We built a robust collection of tools for generating test States in a variety of common configurations, so writing unit tests for the Handlers is as simple as generating an input event and validating that the correct State and SideEffects are produced.</p>
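<p>A Handler test in this style might look something like the sketch below. It is self-contained for illustration only: the trimmed-down <code>ListingState</code>, the <code>testListingState</code> builder, and the use of plain <code>check</code> calls in place of a test framework are all assumptions, not the actual tooling.</p>
<pre><code>// Trimmed-down, hypothetical versions of the models shown earlier.
data class Title(val text: String, val isExpanded: Boolean = false)
data class BuyBox(val title: Title)
data class ListingState(val buyBox: BuyBox)

class TitleClickedHandler {
    // Returns a new State with the title expanded; the input is never mutated.
    fun handle(state: ListingState): ListingState {
        val buyBox = state.buyBox
        return state.copy(
            buyBox = buyBox.copy(
                title = buyBox.title.copy(isExpanded = true)
            )
        )
    }
}

// Hypothetical helper that generates a State in a common configuration.
fun testListingState(
    titleText: String = "Handmade plant hanger",
    isExpanded: Boolean = false,
): ListingState = ListingState(BuyBox(Title(titleText, isExpanded)))

fun main() {
    val initial = testListingState()

    val updated = TitleClickedHandler().handle(initial)

    // The Handler produced the expected State...
    check(updated.buyBox.title.isExpanded)
    check(updated.buyBox.title.text == initial.buyBox.title.text)
    // ...and left the original immutable State untouched.
    check(!initial.buyBox.title.isExpanded)
    println("TitleClickedHandler test passed")
}</code></pre>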
<p>Creating any new architecture involves making tradeoffs, and this project was no exception. Macramé is under active development, and we have a few pieces of feedback on our agenda to be addressed:</p>
<ul>
<li>There is some amount of boilerplate still needed to correctly wire up a new Event and Handler, and we'd like to make that go away.</li>
<li>The ability of Handlers to dispatch their own Events sometimes makes debugging complex Handler interactions more difficult than previous formulations of the same business logic.</li>
<li>On a relatively simple screen, the architecture can feel like overkill.</li>
</ul>
<p>Adding new features correctly to the listing screen is now the easy thing to do. The dual benefit of increasing business metrics while also increasing developer productivity and satisfaction has resulted in the Android team expanding the usage of Macramé to two more of the key screens in the app (Cart and Shop), both of which completely rewrote their UI using <a href="https://developer.android.com/jetpack/compose">Jetpack Compose</a>: but those are topics for future Code as Craft posts.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[How We Built The Deals Tab in Swift UI]]></title>
                <link>https://www.etsy.com/codeascraft/how-we-built-the-deals-tab-in-swift-ui?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Pierre Rodgers]]></dc:creator>
                <pubDate>Tue, 5 Mar 2024 17:22:54 -0500</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/how-we-built-the-deals-tab-in-swift-ui?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<h1><em>Balancing Engineering Ambition with Product Realism</em></h1>
<h2>Introduction</h2>
<p>In July of 2023, Etsy’s App Updates team, responsible for the Updates feed in Etsy’s mobile apps, set off with an ambitious goal: to revamp the Updates tab to become Deals, a home for a shopper’s coupons and sales, in time for Cyber Week 2023.</p>
<p>The Updates tab had been around for years, and in the course of its evolution ended up serving multiple purposes. It was a hub for updates about a user’s favorite shops and listings, but it was also increasingly a place to start new shopping journeys. Not all updates were created equal. The most acted-upon updates in the tab were coupons offered for abandoned cart items, which shoppers loved.</p>
<p>We spotted an opportunity to clarify intentions for our users: by refactoring favorite-based updates into the Favorites tab, and (more boldly), by recentering Updates and transforming it into a hub for a buyer’s deals.</p>
<h2>Technical Opportunity</h2>
<p>While investigating the best way to move forward with the Deals implementation, iOS engineers on the team advocated for developing a new tab from the ground up. Although it meant greater initial design and architecture effort, an entirely new tab built on modern patterns would let us avoid relying on Objective-C and on internal frameworks like SDL (server-driven layout), which is present in many legacy Etsy app screens, comes with a variety of scalability and performance issues, and is in the process of being phased out.</p>
<p>At the same time, we needed a shippable product by October. Black Friday and Cyber Week loomed on the horizon in November, and it would be a missed opportunity, for us and for our users, not to have the Deals tab ready to go. Our ambition to use modern, not yet road-tested technologies would have to balance with realism about the needs of the product, and we were conscious of maintaining that balance throughout the course of development.</p>
<h2>In comes Swift UI and Tuist!</h2>
<p>Two new frameworks were front of mind when starting this project: <a href="https://developer.apple.com/xcode/swiftui/">Swift UI</a> and <a href="https://tuist.io/">Tuist</a>. Swift UI provides a clear, declarative framework for UI development, and makes it easy for engineers to break down views into small, reusable components. Maybe Swift UI’s biggest benefit is its built-in view previews: in tandem with componentization, it becomes a very straightforward process to build a view out of smaller pieces and preview at every step of the way. </p>
<p>Our team had experimented with Swift UI in the past, but with scopes limited to small views, such as headers. Confident as we were about the framework, we expected that building out a whole screen in Swift UI would present us with some initial hurdles to overcome. In fact, one hurdle presented itself right away. In a decade-old codebase, not everything is optimized for use with newer technologies. The build times we saw for our Swift UI previews, which were almost long enough to negate the framework’s other benefits, testified to that fact.</p>
<p>This is where Tuist comes in. Our App Enablement team, which has been hard at work over the past few years modernizing the Etsy codebase, has adopted Tuist as a way of taming the monolith by making it modular. Any engineer at Etsy can declare a Tuist module in their project and start working on it, importing parts of the larger codebase only as they need them. (For more on Etsy’s usage of Tuist, check out <a href="https://tuist.io/blog/2023/11/30/etsy-and-tuist/">this article</a> by Mike Simons from the App Enablement team.) Moving our work for the Deals tab into a Swift-based Tuist module gave us what it took to make a preview-driven development process practical: our previews build nearly instantly, and so long as we’re only making changes in our framework the app recompiles with very little delay.</p>
<figure>
<img alt="Deals UI Framework Diagram" src="https://i.etsystatic.com/inv/cab771/5865916855/inv_fullxfull.5865916855_49mug0jy.jpg?version=0" title="Deals UI Framework Diagram" />
<figcaption><strong>Figure 1.</strong> A view of a goal end state of a modular Etsy codebase, with a first layer of core modules (in blue), and a second layer of client-facing modules that combine to build the Etsy app.</figcaption>
</figure>
<h2>Our architecture</h2>
<p>The Deals tab comprises a number of modules for any given Etsy user, including a Deals Just for You module with abandoned cart coupons, and a module that shows a user their favorite listings that are on sale. Since the screen is just a list of modules, the API returns them as an array of typed items with the following structure:</p>
<pre><code>{
    "type": "&lt;ModuleName&gt;",
    "&lt;ModuleName&gt;": { ... }
}</code></pre>
<p>Assigning each module a type enables us to parse it correctly on the client, and moves us away from the anonymous component-based API models we had used in the past. Many models are still used across modules, however. These include, but are not limited to, buttons, headers and footers, and listing cards. To parse a new module, we either have to build a new component if it doesn't exist yet, or reuse one that does. Adding a footer to a module, for example, can be as simple as:</p>
<pre><code>// Model
{
    "type": "my_module",
    "my_module": {
        "target_listing": { &lt;ListingCard&gt; },
        "recommended_listings": [ &lt;ListingCard&gt; ],
        "footer": { &lt;Footer&gt; } // Add footer here
    }
}

// View 
var body: some View {
    VStack {
        ListingView(listing: targetListing)
        ListingCarouselView(listings: recommendedListings)
        MyFooterView(footer: footer) // Add footer here
    }
}</code></pre>
<p>We also used Decodable implementations for our API parsing, leading to faster, clearer code and an easier way to handle optionals. With Etsy’s internal <em>APIv3</em> framework built on top of Apple’s Decodable protocol, it is very straightforward to define a model, decide what is and isn’t optional, and let the container handle the rest. For example, if the footer is optional but the target and recommended listings are required, decoding would look like this:</p>
<pre><code>init(from decoder: Decoder) throws {
    let container = try decoder.containerV3(keyedBy: CodingKeys.self)

    // These will throw if they aren't included in the response
    self.targetListing = try container.requireV3(forKey: .targetListing)
    self.recommendedListings = try container.requireV3(forKey: .recommendedListings)

    // Footer is optional
    self.footer = container.decodeV3(forKey: .footer) 
}</code></pre>
<p>As for laying out the view on the screen, we used a Swift UI <em>List</em> to make the most of the under-the-hood cell reuse that <em>List</em> provides.</p>
<figure>
<img alt="Deals Hosting Controller Diagram" src="https://i.etsystatic.com/inv/fda52c/5865924623/inv_fullxfull.5865924623_5o6kigqf.jpg?version=0" title="Deals Hosting Controller Diagram" />
<figcaption><strong>Figure 2.</strong> On the left-hand side, a diagram of how the DealsUI view is embedded in the Etsy app. On the right-hand side, a diagram of how the DealsUI framework takes the API response and renders a list of module views with individual components.</figcaption>
</figure>
<h2>Previews, previews, more previews</h2>
<p>If we were going to maintain a good development pace, we needed to figure out a clean way to use Swift previews. Previewing a small component, like a header that takes a string, is simple enough: just initialize the header view with the header string. For more complex views, though, it gets cumbersome to build a mock API response every time you need to preview. This complexity is only amplified when previewing an entire Deals module.</p>
<p>To streamline the process, we decided to add a <code>Previews</code> enum to our more complex models. A good example of this is in the Deals Just for You coupon cards. These cards display an image or an array of images, a few lines of custom text (depending on the coupon type), and a button. Our previews enum for this API model looks like this:</p>
<pre><code>// In an extension to DealsForYouCard

enum Previews {
    static var shopCouponThreeImage: ResponseModels.DealsForYouCard {
        let titleText = "IrvingtonWoodworksStudio"
        let images = [...] // Three images
        let button = ResponseModels.Button(
            buttonText: "10% off shop",
            action: .init(...)
        )

        return ResponseModels.DealsForYouCard(
            button: button, 
            saleBadge: "20% off", 
            titleText: titleText,
            subtitleText: "Favorited shop", 
            action: .init(...), 
            images: images
        )
    }

    static var listingCoupon: ResponseModels.DealsForYouCard {
        ...    
    }
}</code></pre>
<p>Then, previewing a variety of coupon cards is as straightforward as:</p>
<pre><code>#Preview {
    DealsForYouCardView(coupon: .Previews.listingCoupon)
}

#Preview {
    DealsForYouCardView(coupon: .Previews.shopCouponThreeImage)
}</code></pre>
<p>The other perk of this architecture is that it makes it very easy to nest previews, for example when previewing an entire module. To build preview data for the <em>Deals for You</em> module, we can use our coupon examples this way:</p>
<pre><code>// In an extension to DealsForYouModule
enum Previews {
    static var mockModule: ResponseModels.DealsForYouModule {
        let items: [ResponseModels.DealsForYouCard] = [.Previews.listingCoupon, .Previews.shopCouponThreeImage, .Previews.shopCouponTwoImage]
        let header = ResponseModels.DealsForYouHeader(title: "Deals just for you")

        return .init(header: header, items: items)
    }
}</code></pre>
<p>These enums are brief, clear, and easy to understand; they allow us to lean into the benefits of modularity. </p>
<p>This architecture, along with our Decodable models, also enabled us to clear a roadblock that used to occur when our team had to wait for API support before we could build new modules. For example, both the <em>Similar Items on Sale</em> and <em>Extra Special Deals</em> modules in the Deals tab were built via previews, and were ready approximately two weeks before the corresponding API work was complete, helping us meet deadlines and not have to wait for a new App Store release.</p>
<p>By taking full advantage of Swift UI's modularity and previewability, not only were we able to prove out a set of new technologies, we also exceeded product expectations by significantly beating our deadlines even with the initial overhead of adopting the framework.</p>
<h2>Challenges: UIKit interoperability</h2>
<p>Particularly when it came to tasks like navigation and favoriting, interfacing between our module and the Etsy codebase could pose challenges. An assumption that we had as engineers going into this project was that the code to open a listing page, for example, would just be readily available to use; this was not the case, however. Most navigation code within the Etsy codebase is handled by an Objective C class called EtsyScreenController. While in the normal target, it’s as straightforward as calling a function to open a listing page, that functionality was not available to us in our Deals module.</p>
<p>One option would have been to build our own navigation logic using Swift UI Navigation stacks; we weren’t trying to reinvent the wheel, however. To balance product deadlines and keep things as simple as possible, we decided not to be dogmatic, and to handle navigation outside of our framework. We did this by building a custom <code>@Environment</code> struct, called <code>DealsAction</code>, which passes off responsibility for navigation back to the main target, and uses the new Swift <code>callAsFunction()</code> <a href="https://www.hackingwithswift.com/swift/5.2/callasfunction">feature</a> so we can treat this struct like a function in our views. </p>
<p>We have a concept of a <code>DealsAction</code> type in our API response, which enables us to match an action with an actionable part of the screen. For example, a button response has an action that will be executed when a user taps the button. The DealsAction handler takes that action, and uses our existing UIKit code to perform it. </p>
<p>The Deals tab is wrapped in a <code>UIHostingController</code> in the main Etsy target, so when setting up the Swift UI view, we also set the <code>DealsAction</code> environment object using a custom view modifier:</p>
<pre><code>let dealsView = DealsView()
                .handleDealsAction {
                    [weak self] in self?.handleAction(action: $0)
                }

...

func handleAction(action: DealsAction) {
    // UIKit code to execute action
}</code></pre>
<p>Then, when we need to perform an action from a Swift UI view, the action handler is present at any layer within the view hierarchy within the Deals tab. Performing the action is as simple as:</p>
<pre><code>@Environment(\.handleDealsAction) var handleDealsAction: DealsAction
...
MyButton(title: buttonText, fillWidth: false) {
    handleDealsAction(model.button?.action)
}</code></pre>
<p>We reused this pattern for other existing functionality that was only available in the main target. For example, we built environment objects for favoriting listings, following a shop, and logging performance metrics. This pattern allows us to include environment objects as needed, and it simplifies adding action handling to any view. Instead of rebuilding this functionality in our Tuist module in pure Swift, which could have taken multiple sprints, we struck a balance between modernization and the need to meet product deadlines.</p>
<h2>Challenges: Listing Cards</h2>
<p>The listing card view is a common component used across multiple screens within the Etsy app. This component was originally written in Objective-C and throughout the years has been enhanced to support multiple configurations and layouts, and to be available for A/B testing. It also has built-in functionality like favoriting, which engineers shouldn't have to reimplement each time they want to present a listing card.</p>
<figure>
<img alt="Listing Card Configuration" src="https://i.etsystatic.com/inv/b069d7/5817867384/inv_fullxfull.5817867384_15cbo86b.jpg?version=0" title="Listing Card Configuration" />
<figcaption><strong>Figure 3.</strong> A diagram of how listing card views are conventionally built in UIKit, using configuration options and the analytics framework to combine various UIKit subviews.</figcaption>
</figure>
<p>It's been our practice to reuse this same single component and make small modifications to support changes in the UI, as per product or experimentation requirements. This means that many functionalities, such as favoriting, long-press menus, and image manipulation, are heavily coupled with this single component, many parts of which are still written in Objective C.</p>
<p>Early in the process of developing the new tab, we decided to scope out a way of supporting conventional listing card designs—ones that matched existing cards elsewhere in the app—without having to rebuild the entire card component in Swift UI. We knew a rebuild would eventually be necessary, since we expected to have to support listing cards that differed significantly from the standard designs, but the scope of such a rebuild was a known unknown. To balance our deadlines, we decided to push this more ambitious goal until we knew we had product bandwidth.</p>
<p>Since the listing card view is heavily coupled with old parts of the codebase, however, it wasn’t as simple as just typing <code>import ListingCard</code> and flying along. We faced two challenges: first, the API model for a listing card couldn’t be imported into our module, and second, the view couldn’t be imported for simple use in a <code>UIViewRepresentable</code> wrapper. To address these, we deferred responsibility back up to the UIKit view controller. Our models for a listing card component look something like this:</p>
<pre><code>struct ListingCard {
    public let listingCardWrapper: ListingCardWrapper

    let listingCard: TypedListingCard
}</code></pre>
<p>The model is parsed in two ways: as a wrapper, where it is parsed as an untyped dictionary that will eventually be used to initialize our legacy listing card model, and as a <code>TypedListingCard</code>, which is used only within the Deals tab module.</p>
<figure>
<img alt="Deals Card Builder pattern" src="https://i.etsystatic.com/inv/d036b7/5817868364/inv_fullxfull.5817868364_4wx1xll9.jpg?version=0" title="Deals Card Builder pattern" />
<figcaption><strong>Figure 4.</strong> A diagram of how a UIKit listing card builder is passed from the main target to the Deals framework for rendering listing cards.</figcaption>
</figure>
<p>To build the listing card view, we pass in a view builder to the SwiftUI <code>DealsView</code> initializer in the hosting controller code. Here, we are in the full Etsy app codebase, meaning that we have access to the legacy listing card code. When we need to build a listing card, we use this view builder as follows:</p>
<pre><code>var body: some View {
    LazyVGrid(...) {
        ForEach(listings) { listing in
            cardViewBuilder(listing) // Returns a UIViewRepresentable
        }
    }
}</code></pre>
<p>There was some initial overhead involved in getting these cards set up, but it was worth it to guarantee that engineering unknowns in a Swift UI rewrite wouldn’t block us and compromise our deadlines. Once built, the support for legacy cards enabled us to reuse them easily wherever they were needed. In fact, legacy support was one of the things that helped us move faster than we expected, and it became possible to stretch ourselves and build at least some listing cards in the Deals tab entirely in Swift UI. This meant that writing the wrapper ultimately gave us the space we needed to avoid having to rely solely on the wrapper! </p>
<h2>Conclusion</h2>
<p>After just three months of engineering work, the Deals tab was built and ready to go, even beating product deadlines. While it took some engineering effort to overcome initial hurdles, as well as the switch in context from working in UIKit in the main target to working in Swift UI in our own framework, once we had solutions to those challenges and could really take advantage of the new architecture, we saw a very substantial increase in our engineering velocity.</p>
<p>Instead of taking multiple sprints to build, new modules could take just one sprint or less; front-end work was decoupled from API work using Previews, which meant we no longer had to wait for mock responses or even API support at all; and maybe most important, it was fun to use Swift UI’s clear and straightforward declarative UI building, and see our changes in real time!</p>
<p>From a product perspective, the Deals tab was a great success: buyers converted their sessions more frequently, and we saw an increase in visits to the Etsy app. The tab was rolled out to all users in mid-October, and has seen significant engagement, particularly during Black Friday and Cyber Monday.</p>
<p>By being bold and by diving confidently into new frameworks that we expected to see benefits from, we improved engineer experience and not just met but beat our product deadlines. More teams at Etsy are using Swift UI and Tuist in their product work now, thanks to the success of our undertaking, sometimes using our patterns to work through hurdles, sometimes creating their own. We expect to see more of this kind of modernization start to make its way into the codebase. As we iterate on the Deals tab over the next year, and make it even easier for buyers to find the deals that mean the most to them, we look forward to continuing to work in the same spirit.</p>
<p><em>Special thanks to Vangeli Ontiveros for the diagrams in this article, and a huge shoutout to the whole App Deals team for their hard work on this project!</em></p>]]></description>
                <content:encoded><![CDATA[<h1><em>Balancing Engineering Ambition with Product Realism</em></h1>
<h2>Introduction</h2>
<p>In July of 2023, Etsy’s App Updates team, responsible for the Updates feed in Etsy’s mobile apps, set off with an ambitious goal: to revamp the Updates tab to become Deals, a home for a shopper’s coupons and sales, in time for Cyber Week 2023.</p>
<p>The Updates tab had been around for years, and in the course of its evolution ended up serving multiple purposes. It was a hub for updates about a user’s favorite shops and listings, but it was also increasingly a place to start new shopping journeys. Not all updates were created equal. The most acted-upon updates in the tab were coupons offered for abandoned cart items, which shoppers loved.</p>
<p>We spotted an opportunity to clarify intentions for our users: by refactoring favorite-based updates into the Favorites tab, and (more boldly), by recentering Updates and transforming it into a hub for a buyer’s deals.</p>
<h2>Technical Opportunity</h2>
<p>While investigating the best way to move forward with the Deals implementation, iOS engineers on the team advocated for developing a new tab from the ground up. Although it meant greater initial design and architecture effort, an entirely new tab built on modern patterns would let us avoid relying on Objective C, as well as internal frameworks like SDL (server-driven layout), which is present in many legacy Etsy app screens and comes with a variety of scalability and performance issues, and is in the process of being phased out. </p>
<p>At the same time, we needed a shippable product by October. Black Friday and Cyber Week loomed on the horizon in November, and it would be a missed opportunity, for us and for our users, not to have the Deals tab ready to go. Our ambition to use modern, not yet road-tested technologies would have to balance with realism about the needs of the product, and we were conscious of maintaining that balance throughout the course of development.</p>
<h2>In comes Swift UI and Tuist!</h2>
<p>Two new frameworks were front of mind when starting this project: <a href="https://developer.apple.com/xcode/swiftui/">Swift UI</a> and <a href="https://tuist.io/">Tuist</a>. Swift UI provides a clear, declarative framework for UI development, and makes it easy for engineers to break down views into small, reusable components. Maybe Swift UI’s biggest benefit is its built-in view previews: in tandem with componentization, it becomes a very straightforward process to build a view out of smaller pieces and preview at every step of the way. </p>
<p>Our team had experimented with Swift UI in the past, but with scopes limited to small views, such as headers. Confident as we were about the framework, we expected that building out a whole screen in Swift UI would present us with some initial hurdles to overcome. In fact, one hurdle presented itself right away. In a decade-old codebase, not everything is optimized for use with newer technologies. The build times we saw for our Swift UI previews, which were almost long enough to negate the framework’s other benefits, testified to that fact.</p>
<p>This is where Tuist comes in. Our App Enablement team, which has been hard at work over the past few years modernizing the Etsy codebase, has adopted Tuist as a way of taming the monolith by making it modular. Any engineer at Etsy can declare a Tuist module in their project and start working on it, importing parts of the larger codebase only as they need them. (For more on Etsy’s usage of Tuist, check out <a href="https://tuist.io/blog/2023/11/30/etsy-and-tuist/">this article</a> by Mike Simons from the App Enablement team.) Moving our work for the Deals tab into a Swift-based Tuist module gave us what it took to make a preview-driven development process practical: our previews build nearly instantly, and so long as we’re only making changes in our framework, the app recompiles with very little delay.</p>
<figure>
<img alt="Deals UI Framework Diagram" src="https://i.etsystatic.com/inv/cab771/5865916855/inv_fullxfull.5865916855_49mug0jy.jpg?version=0" title="Deals UI Framework Diagram" />
<figcaption><strong>Figure 1.</strong> A view of a goal end state of a modular Etsy codebase, with a first layer of core modules (in blue), and a second layer of client-facing modules that combine to build the Etsy app.</figcaption>
</figure>
<h2>Our architecture</h2>
<p>The Deals tab comprises a number of modules for any given Etsy user, including a Deals Just for You module with abandoned cart coupons, and a module that shows a user their favorite listings that are on sale. Since the screen is just a list of modules, the API returns them as an array of typed items with the following structure:</p>
<pre><code>{
    "type": "&lt;ModuleName&gt;",
    "&lt;ModuleName&gt;": { ... }
}</code></pre>
<p>Assigning each module a type enables us to parse it correctly on the client, and moves us away from the anonymous component-based API models we had used in the past. Many models are still used across modules, however. These include, but are not limited to, buttons, headers and footers, and listing cards. To parse a new module, we either have to build a new component if it doesn't exist yet, or reuse one that does. Adding a footer to a module, for example, can be as simple as:</p>
<pre><code>// Model
{
    "type": "my_module",
    "my_module": {
        "target_listing": { &lt;ListingCard&gt; },
        "recommended_listings": [ &lt;ListingCard&gt; ],
        "footer": { &lt;Footer&gt; } // Add footer here
    }
}

// View 
var body: some View {
    VStack {
        ListingView(listing: targetListing)
        ListingCarouselView(listings: recommendedListings)
        MyFooterView(footer: footer) // Add footer here
    }
}</code></pre>
<p>We also used Decodable implementations for our API parsing, leading to faster, clearer code and an easier way to handle optionals. With Etsy’s internal <em>APIv3</em> framework built on top of Apple’s Decodable protocol, it is very straightforward to define a model and decide what is and isn’t optional, and let the container handle the rest. For example, if the footer is optional, but the target and recommended listings are required, decoding would look like this:</p>
<pre><code>init(from decoder: Decoder) throws {
    let container = try decoder.containerV3(keyedBy: CodingKeys.self)

    // These will throw if they aren't included in the response
    self.targetListing = try container.requireV3(forKey: .targetListing)
    self.recommendedListings = try container.requireV3(forKey: .recommendedListings)

    // Footer is optional
    self.footer = container.decodeV3(forKey: .footer) 
}</code></pre>
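<p>For readers without access to <em>APIv3</em>, the typed-module idea can be approximated with Apple's standard <code>Decodable</code> alone. What follows is a minimal sketch under stated assumptions, not our production code: <code>DealsModule</code>, <code>DynamicKey</code>, the <code>unknown</code> case, and the simplified <code>String</code> listings are all hypothetical names chosen for illustration.</p>
<pre><code>import Foundation

// Hypothetical, simplified models; real listing cards carry far more data.
struct Footer: Decodable {
    let text: String
}

struct MyModule: Decodable {
    let targetListing: String
    let recommendedListings: [String]
    let footer: Footer? // Optional: the synthesized decoder tolerates a missing key

    enum CodingKeys: String, CodingKey {
        case targetListing = "target_listing"
        case recommendedListings = "recommended_listings"
        case footer
    }
}

// One case per module type; the "type" field selects the payload key to decode.
enum DealsModule: Decodable {
    case myModule(MyModule)
    case unknown(String) // Lets older clients skip modules they don't know yet

    private struct DynamicKey: CodingKey {
        var stringValue: String
        var intValue: Int? { nil }
        init?(stringValue: String) { self.stringValue = stringValue }
        init?(intValue: Int) { return nil }
    }

    init(from decoder: Decoder) throws {
        let container = try decoder.container(keyedBy: DynamicKey.self)
        let type = try container.decode(String.self, forKey: DynamicKey(stringValue: "type")!)
        switch type {
        case "my_module":
            // The payload lives under a key named after the type itself
            let payloadKey = DynamicKey(stringValue: type)!
            self = .myModule(try container.decode(MyModule.self, forKey: payloadKey))
        default:
            self = .unknown(type)
        }
    }
}

let json = """
[
  {"type": "my_module", "my_module": {"target_listing": "Mug", "recommended_listings": ["Bowl"]}},
  {"type": "future_module", "future_module": {}}
]
"""
let modules = try JSONDecoder().decode([DealsModule].self, from: Data(json.utf8))</code></pre>
<p>Falling back to an <code>unknown</code> case is one possible design choice: it keeps a single unrecognized module from failing the whole response, at the cost of silently dropping content the client cannot render.</p>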
<p>As for laying out the view on the screen, we used a Swift UI <em>List</em> to make the most of the under-the-hood cell reuse that <em>List</em> uses.</p>
<figure>
<img alt="Deals Hosting Controller Diagram" src="https://i.etsystatic.com/inv/fda52c/5865924623/inv_fullxfull.5865924623_5o6kigqf.jpg?version=0" title="Deals Hosting Controller Diagram" />
<figcaption><strong>Figure 2.</strong> On the left-hand side, a diagram of how the DealsUI view is embedded in the Etsy app. On the right-hand side, a diagram of how the DealsUI framework takes the API response and renders a list of module views with individual components.</figcaption>
</figure>
<h2>Previews, previews, more previews</h2>
<p>If we were going to maintain a good development pace, we needed to figure out a clean way to use Swift previews. Previewing a small component, like a header that takes a string, is simple enough: just initialize the header view with the header string. For more complex views, though, it gets cumbersome to build a mock API response every time you need to preview. This complexity is only amplified when previewing an entire Deals module.</p>
<p>To streamline the process, we decided to add a <code>Previews</code> enum to our more complex models. A good example of this is in the Deals Just for You coupon cards. These cards display an image or an array of images, a few lines of custom text (depending on the coupon type), and a button. Our previews enum for this API model looks like this:</p>
<pre><code>// In an extension to DealsForYouCard

enum Previews {
    static var shopCouponThreeImage: ResponseModels.DealsForYouCard {
        let titleText = "IrvingtonWoodworksStudio"
        let images = [...] // Three images
        let button = ResponseModels.Button(
            buttonText: "10% off shop",
            action: .init(...)
        )

        return ResponseModels.DealsForYouCard(
            button: button, 
            saleBadge: "20% off", 
            titleText: titleText,
            subtitleText: "Favorited shop", 
            action: .init(...), 
            images: images
        )
    }

    static var listingCoupon: ResponseModels.DealsForYouCard {
        ...    
    }
}</code></pre>
<p>Then, previewing a variety of coupon cards is as straightforward as:</p>
<pre><code>#Preview {
    DealsForYouCardView(coupon: .Previews.listingCoupon)
}

#Preview {
    DealsForYouCardView(coupon: .Previews.shopCouponThreeImage)
}</code></pre>
<p>The other perk of this architecture is that it makes it very easy to nest previews, for example when previewing an entire module. To build preview data for the <em>Deals for You</em> module, we can use our coupon examples this way:</p>
<pre><code>// In an extension to DealsForYouModule
enum Previews {
    static var mockModule: ResponseModels.DealsForYouModule {
        let items: [ResponseModels.DealsForYouCard] = [.Previews.listingCoupon, .Previews.shopCouponThreeImage, .Previews.shopCouponTwoImage]
        let header = ResponseModels.DealsForYouHeader(title: "Deals just for you")

        return .init(header: header, items: items)
    }
}</code></pre>
<p>These enums are brief, clear, and easy to understand; they allow us to lean into the benefits of modularity. </p>
<p>This architecture, along with our Decodable models, also enabled us to clear a roadblock that used to occur when our team had to wait for API support before we could build new modules. For example, both the <em>Similar Items on Sale</em> and <em>Extra Special Deals</em> modules in the Deals tab were built via previews, and were ready approximately two weeks before the corresponding API work was complete, helping us meet deadlines and not have to wait for a new App Store release.</p>
<p>By taking full advantage of Swift UI's modularity and previewability, not only were we able to prove out a set of new technologies, we also exceeded product expectations by significantly beating our deadlines even with the initial overhead of adopting the framework.</p>
<h2>Challenges: UIKit interoperability</h2>
<p>Particularly when it came to tasks like navigation and favoriting, interfacing between our module and the Etsy codebase could pose challenges. An assumption that we had as engineers going into this project was that the code to open a listing page, for example, would just be readily available to use; this was not the case, however. Most navigation code within the Etsy codebase is handled by an Objective C class called EtsyScreenController. While in the normal target, it’s as straightforward as calling a function to open a listing page, that functionality was not available to us in our Deals module.</p>
<p>One option would have been to build our own navigation logic using Swift UI Navigation stacks; we weren’t trying to reinvent the wheel, however. To balance product deadlines and keep things as simple as possible, we decided not to be dogmatic, and to handle navigation outside of our framework. We did this by building a custom <code>@Environment</code> struct, called <code>DealsAction</code>, which passes off responsibility for navigation back to the main target, and uses the new Swift <code>callAsFunction()</code> <a href="https://www.hackingwithswift.com/swift/5.2/callasfunction">feature</a> so we can treat this struct like a function in our views. </p>
<p>We have a concept of a <code>DealsAction</code> type in our API response, which enables us to match an action with an actionable part of the screen. For example, a button response has an action that will be executed when a user taps the button. The DealsAction handler takes that action, and uses our existing UIKit code to perform it. </p>
<p>The Deals tab is wrapped in a <code>UIHostingController</code> in the main Etsy target, so when setting up the Swift UI view, we also set the <code>DealsAction</code> environment object using a custom view modifier:</p>
<pre><code>let dealsView = DealsView()
                .handleDealsAction {
                    [weak self] in self?.handleAction(action: $0)
                }

...

func handleAction(action: DealsAction) {
    // UIKit code to execute action
}</code></pre>
<p>Then, when we need to perform an action from a Swift UI view, the action handler is present at any layer within the view hierarchy within the Deals tab. Performing the action is as simple as:</p>
<pre><code>@Environment(\.handleDealsAction) var handleDealsAction: DealsAction
...
MyButton(title: buttonText, fillWidth: false) {
    handleDealsAction(model.button?.action)
}</code></pre>
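<p>The snippets above show the call sites but not the declaration of the handler itself. One plausible shape, sketched here with standard SwiftUI APIs (<code>EnvironmentKey</code>, <code>EnvironmentValues</code>, and <code>callAsFunction</code>), is shown below. The payload type <code>ActionModel</code>, its <code>deepLink</code> field, and the key name <code>DealsActionKey</code> are assumptions made for illustration; only <code>DealsAction</code> and <code>handleDealsAction</code> come from our code.</p>
<pre><code>import SwiftUI

// Hypothetical stand-in for the action payload decoded from the API response.
struct ActionModel: Equatable {
    let deepLink: String
}

// A callable struct: callAsFunction lets views write handleDealsAction(action).
struct DealsAction {
    var handler: (ActionModel) -> Void = { _ in }

    func callAsFunction(_ action: ActionModel?) {
        // Components without an action are ignored rather than treated as errors
        guard let action else { return }
        handler(action)
    }
}

private struct DealsActionKey: EnvironmentKey {
    // The default is a no-op handler, which keeps previews working without a host app
    static var defaultValue: DealsAction { DealsAction() }
}

extension EnvironmentValues {
    var handleDealsAction: DealsAction {
        get { self[DealsActionKey.self] }
        set { self[DealsActionKey.self] = newValue }
    }
}

extension View {
    // Custom modifier the hosting controller uses to inject its UIKit-backed handler
    func handleDealsAction(_ handler: @escaping (ActionModel) -> Void) -> some View {
        environment(\.handleDealsAction, DealsAction(handler: handler))
    }
}</code></pre>
<p>Because the environment key defaults to a no-op, any view that reads <code>handleDealsAction</code> still renders in a preview even when no handler has been injected.</p>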
<p>We reused this pattern for other existing functionality that was only available in the main target. For example, we built environment objects for favoriting listings, following a shop, and logging performance metrics. This pattern allows us to include environment objects as needed, and it simplifies adding action handling to any view. Instead of rebuilding this functionality in our Tuist module in pure Swift, which could have taken multiple sprints, we struck a balance between modernization and the need to meet product deadlines.</p>
<h2>Challenges: Listing Cards</h2>
<p>The listing card view is a common component used across multiple screens within the Etsy app. This component was originally written in Objective-C and throughout the years has been enhanced to support multiple configurations and layouts, and to be available for A/B testing. It also has built-in functionality like favoriting, which engineers shouldn't have to reimplement each time they want to present a listing card.</p>
<figure>
<img alt="Listing Card Configuration" src="https://i.etsystatic.com/inv/b069d7/5817867384/inv_fullxfull.5817867384_15cbo86b.jpg?version=0" title="Listing Card Configuration" />
<figcaption><strong>Figure 3.</strong> A diagram of how listing card views are conventionally built in UIKit, using configuration options and the analytics framework to combine various UIKit subviews.</figcaption>
</figure>
<p>It's been our practice to reuse this same single component and make small modifications to support changes in the UI, as per product or experimentation requirements. This means that many functionalities, such as favoriting, long-press menus, and image manipulation, are heavily coupled with this single component, many parts of which are still written in Objective C.</p>
<p>Early in the process of developing the new tab, we decided to scope out a way of supporting conventional listing card designs—ones that matched existing cards elsewhere in the app—without having to rebuild the entire card component in Swift UI. We knew a rebuild would eventually be necessary, since we expected to have to support listing cards that differed significantly from the standard designs, but the scope of such a rebuild was a known unknown. To balance our deadlines, we decided to push this more ambitious goal until we knew we had product bandwidth.</p>
<p>Since the listing card view is heavily coupled with old parts of the codebase, however, it wasn’t as simple as just typing <code>import ListingCard</code> and flying along. We faced two challenges: first, the API model for a listing card couldn’t be imported into our module, and second, the view couldn’t be imported for simple use in a <code>UIViewRepresentable</code> wrapper. To address these, we deferred responsibility back up to the UIKit view controller. Our models for a listing card component look something like this:</p>
<pre><code>struct ListingCard {
    public let listingCardWrapper: ListingCardWrapper

    let listingCard: TypedListingCard
}</code></pre>
<p>The model is parsed in two ways: as a wrapper, where it is parsed as an untyped dictionary that will eventually be used to initialize our legacy listing card model, and as a <code>TypedListingCard</code>, which is used only within the Deals tab module.</p>
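<p>As a rough illustration of that dual parse, the sketch below reads the same JSON payload twice using only Foundation: once into an untyped dictionary and once into a typed model. It assumes the wrapper is a plain <code>[String: Any]</code> and that the typed card has just two fields (<code>listingId</code> and <code>title</code>); both are simplifications, and the <code>init(data:)</code> initializer is hypothetical.</p>
<pre><code>import Foundation

// Hypothetical typed model used only inside the Deals module.
struct TypedListingCard: Decodable {
    let listingId: Int
    let title: String
}

struct ListingCard {
    // Untyped copy, handed back to the main target to build the legacy model
    let listingCardWrapper: [String: Any]
    // Typed copy, used by SwiftUI views in the module
    let listingCard: TypedListingCard

    init(data: Data) throws {
        let decoder = JSONDecoder()
        decoder.keyDecodingStrategy = .convertFromSnakeCase
        listingCard = try decoder.decode(TypedListingCard.self, from: data)

        guard let raw = try JSONSerialization.jsonObject(with: data) as? [String: Any] else {
            throw DecodingError.dataCorrupted(
                DecodingError.Context(codingPath: [], debugDescription: "Expected a JSON object")
            )
        }
        listingCardWrapper = raw
    }
}</code></pre>
<p>Parsing twice costs some extra work per card, but it keeps the module free of any dependency on the legacy model type.</p>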
<figure>
<img alt="Deals Card Builder pattern" src="https://i.etsystatic.com/inv/d036b7/5817868364/inv_fullxfull.5817868364_4wx1xll9.jpg?version=0" title="Deals Card Builder pattern" />
<figcaption><strong>Figure 4.</strong> A diagram of how a UIKit listing card builder is passed from the main target to the Deals framework for rendering listing cards.</figcaption>
</figure>
<p>To build the listing card view, we pass in a view builder to the SwiftUI <code>DealsView</code> initializer in the hosting controller code. Here, we are in the full Etsy app codebase, meaning that we have access to the legacy listing card code. When we need to build a listing card, we use this view builder as follows:</p>
<pre><code>var body: some View {
    LazyVGrid(...) {
        ForEach(listings) { listing in
            cardViewBuilder(listing) // Returns a UIViewRepresentable
        }
    }
}</code></pre>
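<p>To show how the injected builder fits together end to end, here is a small self-contained sketch. The names <code>Listing</code>, <code>ListingGridView</code>, and <code>LegacyCard</code> are hypothetical, and a <code>UILabel</code> stands in for the real legacy card view. The closure returns <code>AnyView</code> purely to keep the example short; a generic parameter would avoid the type erasure.</p>
<pre><code>import SwiftUI
import UIKit

// Hypothetical listing model; Identifiable so ForEach can iterate it directly.
struct Listing: Identifiable {
    let id: Int
    let title: String
}

// Lives in the Deals framework and knows nothing about the legacy card code.
struct ListingGridView: View {
    let listings: [Listing]
    // Injected by the hosting controller in the main target
    let cardViewBuilder: (Listing) -> AnyView

    var body: some View {
        ScrollView {
            LazyVGrid(columns: [GridItem(.flexible()), GridItem(.flexible())]) {
                ForEach(listings) { listing in
                    cardViewBuilder(listing)
                }
            }
        }
    }
}

// Lives in the main target; a UILabel stands in for the legacy UIKit card.
struct LegacyCard: UIViewRepresentable {
    let title: String

    func makeUIView(context: Context) -> UILabel {
        UILabel()
    }

    func updateUIView(_ label: UILabel, context: Context) {
        label.text = title
    }
}

// How the host might wire the two together
let grid = ListingGridView(
    listings: [Listing(id: 1, title: "Walnut bowl")],
    cardViewBuilder: { AnyView(LegacyCard(title: $0.title)) }
)</code></pre>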
<p>There was some initial overhead involved in getting these cards set up, but it was worth it to guarantee that engineering unknowns in a Swift UI rewrite wouldn’t block us and compromise our deadlines. Once built, the support for legacy cards enabled us to reuse them easily wherever they were needed. In fact, legacy support was one of the things that helped us move faster than we expected, and it became possible to stretch ourselves and build at least some listing cards in the Deals tab entirely in Swift UI. This meant that writing the wrapper ultimately gave us the space we needed to avoid having to rely solely on the wrapper! </p>
<h2>Conclusion</h2>
<p>After just three months of engineering work, the Deals tab was built and ready to go, even beating product deadlines. While it took some engineering effort to overcome initial hurdles, as well as the switch in context from working in UIKit in the main target to working in Swift UI in our own framework, once we had solutions to those challenges and could really take advantage of the new architecture, we saw a very substantial increase in our engineering velocity.</p>
<p>Instead of taking multiple sprints to build, new modules could take just one sprint or less; front-end work was decoupled from API work using Previews, which meant we no longer had to wait for mock responses or even API support at all; and maybe most important, it was fun to use Swift UI’s clear and straightforward declarative UI building, and see our changes in real time!</p>
<p>From a product perspective, the Deals tab was a great success: buyers converted their sessions more frequently, and we saw an increase in visits to the Etsy app. The tab was rolled out to all users in mid-October, and has seen significant engagement, particularly during Black Friday and Cyber Monday.</p>
<p>By being bold and by diving confidently into new frameworks that we expected to see benefits from, we improved engineer experience and not just met but beat our product deadlines. More teams at Etsy are using Swift UI and Tuist in their product work now, thanks to the success of our undertaking, sometimes using our patterns to work through hurdles, sometimes creating their own. We expect to see more of this kind of modernization start to make its way into the codebase. As we iterate on the Deals tab over the next year, and make it even easier for buyers to find the deals that mean the most to them, we look forward to continuing to work in the same spirit.</p>
<p><em>Special thanks to Vangeli Ontiveros for the diagrams in this article, and a huge shoutout to the whole App Deals team for their hard work on this project!</em></p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[The AR Measuring Box: Etsy's answer to Big Tape Measure]]></title>
                <link>https://www.etsy.com/codeascraft/the-ar-measuring-box-etsys-answer-to-big-tape-measure?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Pedro Michel]]></dc:creator>
                <pubDate>Mon, 18 Sep 2023 16:11:58 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/the-ar-measuring-box-etsys-answer-to-big-tape-measure?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<p>A little while ago, Etsy introduced a new feature in its iOS app that could place Etsy sellers' artwork on a user's wall using Apple's Augmented Reality (AR) tools. It let them visualize how a piece would look in their space, and even gave them an idea of its size options. When we launched the feature as a beta, it was only available in "wall art"-related categories, and after the initial rollout we were eager to expand it to work with more categories. </p>
<p>What differentiates Etsy is the nature of our sellers’ unique items. Our sellers create offerings that can be personalized in any number of ways, and they often hand-make orders based on demand. Taking the same approach we did with wall art and attempting to show 3D models of millions of Etsy items – many of which could be further customized – would be a huge undertaking. Nevertheless, with inspiration from <a href="https://careers.etsy.com/global/en/guiding-principles">Etsy's Guiding Principles</a>, we decided to dig deeper into the feature. What could we improve in the way it worked behind the scenes? What about it would make for a compelling extension into the rest of our vast marketplace?</p>
<p>We took steps to improve how we parse seller-provided data, and we used this data with Apple’s AR technology to make it easy for Etsy users to understand the size and scale of an object they might want to buy. We decided we could make tape measures obsolete (or at least not quite as essential) for our home-decor shoppers by building an AR tool to let them visualize–conveniently, accurately, and with minimal effort–how an item would fit in their space.</p>
<h3>Improving dimension parsing</h3>
<p>In our <a href="https://codeascraft.com/2020/06/23/chaining-ios-machine-learning-computer-vision-and-augmented-reality-to-make-the-magical-real/">original post on the wall art experience</a>, we mentioned the complexity involved in doing things like inferring an item's dimensions from text in its description. Etsy allows sellers to add data about dimensions in a structured way when they create a listing, but that wasn't always the case, and some sellers still provide those details in places like the description or the item's title. The solution was to create a <a href="https://en.wikipedia.org/wiki/Regular_expression">regex</a>-based parser in the iOS App that would glean dimensions (width and height) by sifting through a small number of free-form fields–title, description, customization information, overview–looking for specific patterns. We were satisfied being able to catch most of the formats in which our sellers reported dimensions, handling variable positions of values and units (<code>3 in x 5 in</code> vs. <code>3 x 5 in</code>), different long and short names of units, special unit characters (‘, “), and so on, in all the different languages that Etsy supports.</p>
<p>Migrating our parsing functionality to the API backend was a first step towards making the AR measuring tool platform-independent, so we could bring it to our Android App as well. It would also be a help in development, since we could iterate improvements to our regex patterns faster outside the <a href="https://www.etsy.com/codeascraft/how-etsy-ships-apps?ref=codeascraft">app release schedule</a>. And we’d get more consistent dimensions because we'd be able to cache the results instead of having to parse them live on the client at each visit.</p>
<p>We knew that an extended AR experience would need to reliably show our users size options for items that had them, so we prioritized the effort to parse out dimensions from variations in listings. We sanitized free-form text input fields that might contain dimensions—inputs like title or description—so that we could catch a wider range of formats. (Several different characters can be used to write quotation marks, used as shorthand for inches and feet, and we needed to handle special characters for new lines, fraction ligatures like ½ or ¼, etc.) Our regex pattern was split and updated so it could detect:</p>
<ul>
<li>Measurement units in plural forms (inches, feet, etc.);</li>
<li>Incorrect spellings (e.g. "foots");</li>
<li>Localization of measurement units in the languages spoken by Etsy’s users ("meters", "metros", and "mètres" in English, Spanish, and French, respectively);</li>
<li>Other formats in which dimensions are captured by sellers like dimensions with unit conversions in parentheses (e.g. <code>12 in x 12 in (30 cm x 30 cm)</code>) or with complex measurements in the imperial system (<code>3’6”</code>).</li>
</ul>
<p>Making our dimension parsing more robust and bringing it server-side had several ancillary benefits. We were able to maintain the functionality of our iOS app while removing a lot of client-side code, even in Etsy’s <a href="https://developer.apple.com/app-clips/">App Clip</a>, where size is a matter of utmost importance. And though regex processing isn’t that processor-intensive, the view feature performed better once we implemented server-side caching of parsed dimensions. We figured we could even take the parsing offline (rather than parsing every listing on every visit) by running a backfill process to store dimensions in our database and deliver them to the App along with item details.</p>
<p>We found, thanks to decoupling our parser work from the App release cycle, that we were able to test hypotheses faster and iterate at a quicker pace. So we could proceed to some improvements that would have been quite resource-intensive if we had tried to implement them on the native app side. Sellers often provide dimensions in inconsistent units, for instance, or they might add the same data multiple times in different fields, when there are variations in properties like material or color. We worked out ways to de-duplicate this data during parsing, to minimize the number of size options we show users. (Though where we find dimensions that are specifically associated with variations, we make sure to retain them, since those will more directly correlate with offering prices.) And we made it possible to prioritize structured dimension data, where sellers have captured it in dedicated fields, as a more reliable source of truth than free-form parsing.</p>
<h3>Measuring in 3D</h3>
<h4>The box</h4>
<p>With this new and improved dimension data coming to us from the server, we had to figure out the right way to present it in 3D in iOS. The display needed to be intuitive, so our users would know more or less at a glance what the tool was and how to interact with it. Ultimately, we decided to present a rectangular prism-type object scaled to the parsed dimensions we have for a given listing. <a href="https://developer.apple.com/scenekit/">Apple's SceneKit framework</a> – specifically its SCNBox class – is what creates this box, which of course we style with the Etsy Orange look. So that users understand the box's purpose, we make sure to display the length on each side. We use SceneKit's SCNNode class to create the pills displaying our measurements.</p>
<div align="center"><img src="https://i.etsystatic.com/inv/1e68db/5342416361/inv_fullxfull.5342416361_dh5zp24h.jpg?version=0" alt="An orange rectangular prism placed in between furniture with measurements shown on each edge of the prism" width="300" /></div>
<p><br /> Users drag or tap the measuring box to move it around, and it can rotate on all axes – all made possible by having a different animation for each type of rotation using SCNActions. Rotation is a must-have feature: when we place the measuring box in a user's space, we may not always be able to get the orientation correct. We might, as in the illustration below, place a side table vertically on the floor instead of horizontally. Our users would have a poor experience of the measuring tool if they couldn't adjust for that. (Note that you may see some blinking yellow dots when you try out the AR experience: these are called feature points, and they're useful for debugging, to give us an idea of what surfaces are successfully being detected.)</p>
<div align="center">
    <img src="https://media2.giphy.com/media/7azdqVEhSRGCthLpob/giphy.gif" alt="Tall measuring box placed on the ground that is showing the size of the item (a chest of drawers shown on the top left in a small ). The user taps on an arrow to flip the box so that it is horizontal, matching the shape of the chest of drawers." width="300" />
</div>
<p><br /></p>
<h4>Environment occlusion</h4>
<p>In addition to ensuring the box would be scaled correctly, we wanted it to  "sit" as realistically as possible in the real world, so we built in <a href="https://developer.apple.com/documentation/arkit/content_anchors/visualizing_and_interacting_with_a_reconstructed_scene?language=objc">scene occlusion</a>. When a user places the measuring box in a room with other furniture, scene occlusion lets it interact with real-life objects as if the box were actually there. Users get valuable information this way about how an item will fit in their space. Will that end table go between the wall and couch? Will it be tall enough to be visible from behind the couch? (As demonstrated below, the table will indeed be tall enough.)</p>
<div align="center">
       <img src="https://i.etsystatic.com/inv/ea3000/5294262044/inv_fullxfull.5294262044_t18b2i0j.jpg?version=0" alt="Measuring box placed behind couch. Only the top of the box is visible as the rest is being occluded by the couch." width="300" />
</div>
<p><br />
Environment occlusion became a possibility with <a href="https://en.wikipedia.org/wiki/Lidar">LiDAR</a>, a method of determining depth using laser light. Although LiDAR has been around for a few decades, used to map everything from archeological sites to agricultural fields, Apple only included LiDAR scanners in iPhones and iPads beginning in 2020, with the 4th-generation iPad Pro and the iPhone 12 Pro.</p>
<p>An iPhone’s LiDAR scanner retrieves depth information from the area it scans and converts it into a series of vertices which connect to form a mesh (or a surface). To add occlusion to our SceneKit-backed AR feature, we convert the mesh into a 3D object and place it (invisibly to the user) in the environment shown on their phone. As the LiDAR scanner measures more of the environment, we have more meshes to convert into objects and place in 3D. The video below shows an AR session where for debugging purposes we assign a random color to the detected mesh objects. Each different colored outline shown over a real-world object represents a different mesh. Notice how, as we scan more of the room, the device adds more mesh objects as it continues drawing out the environment.</p>
<div align="center">
    <img src="https://media0.giphy.com/media/bH2wThJwmV7XIXdD8c/giphy.gif" alt="The desk and monitor shown become outlined in different colors. As the iPhone camera moves around to scan more of the room, more of the furniture becomes covered in colorful outlines." width="300" />
</div>
<p><br />
The user's device uses these mesh objects to know when and how to occlude the measuring box. Essentially, these mesh objects help determine where the measuring box is relative to all the real-world items and surfaces it should respect. Taking advantage of occlusion gives our users an especially realistic AR experience. In the side-by-side comparison below, the video on the left shows how mesh objects found in the environment determine what part of the measuring box will be hidden as the camera moves in front of the desk. The video on the right shows the exact same thing, but with the mesh objects hidden. </p>
<table>
<thead>
<tr>
<th>Mesh objects are visible</th>
<th>Mesh objects are hidden</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://media1.giphy.com/media/Ulo2PPWfPxLFcvkKs0/giphy.gif" alt="The measuring box is placed under a desk. The desk, as well as the monitor, floor, and most of the environment are digitally covered in different colors. As the camera moves to the side of the desk, more of the box becomes occluded by the side of the desk." width="300" /></td>
<td><img src="https://media1.giphy.com/media/R1waynZYyozQ5vxoJz/giphy.gif" alt="The measuring box is placed under a desk similar to the previous video. As the camera moves to the side of the desk, more of the box becomes occluded by the side of the desk making it seem like the box is actually in the room." width="300" /></td>
</tr>
</tbody>
</table>
<h3>Closing thoughts</h3>
<p>This project took a 2D concept, our Wall View experience, and literally extended it into 3-dimensional space using Apple’s newest AR tools. And though the preparatory work we did improving our dimension parser may not be anything to look at, without the consistency and accuracy of that parsed information this newly realistic and interactive tool would not have been possible. Nearly a million Etsy items now have real-size AR functionality added to them, viewed by tens of thousands of Etsy users every week. As our marketplace evolves and devices become more powerful, working on features like this only increases our appetite for more and brings us closer to providing our users with a marketplace that lets them make the most informed decision about their purchases effortlessly. </p>
<p><em>Special shoutout to Jacob Van Order and Siri McClean as well as the rest of our team for their work on this.</em></p>]]></description>
                <content:encoded><![CDATA[<p>A little while ago, Etsy introduced a new feature in its iOS app that could place Etsy sellers' artwork on a user's wall using Apple's Augmented Reality (AR) tools. It let them visualize how a piece would look in their space, and even gave them an idea of its size options. When we launched the feature as a beta, it was only available in "wall art"-related categories, and after the initial rollout we were eager to expand it to work with more categories. </p>
<p>What differentiates Etsy is the nature of our sellers’ unique items. Our sellers create offerings that can be personalized in any number of ways, and they often hand-make orders based on demand. Taking the same approach we did with wall art and attempting to show 3D models of millions of Etsy items – many of which could be further customized – would be a huge undertaking. Nevertheless, with inspiration from <a href="https://careers.etsy.com/global/en/guiding-principles">Etsy's Guiding Principles</a>, we decided to dig deeper into the feature. What could we improve in the way it worked behind the scenes? What about it would make for a compelling extension into the rest of our vast marketplace?</p>
<p>We took steps to improve how we parse seller-provided data, and we used this data with Apple’s AR technology to make it easy for Etsy users to understand the size and scale of an object they might want to buy. We decided we could make tape measures obsolete (or at least not quite as essential) for our home-decor shoppers by building an AR tool to let them visualize–conveniently, accurately, and with minimal effort–how an item would fit in their space.</p>
<h3>Improving dimension parsing</h3>
<p>In our <a href="https://codeascraft.com/2020/06/23/chaining-ios-machine-learning-computer-vision-and-augmented-reality-to-make-the-magical-real/">original post on the wall art experience</a>, we mentioned the complexity involved in doing things like inferring an item's dimensions from text in its description. Etsy allows sellers to add data about dimensions in a structured way when they create a listing, but that wasn't always the case, and some sellers still provide those details in places like the description or the item's title. The solution was to create a <a href="https://en.wikipedia.org/wiki/Regular_expression">regex</a>-based parser in the iOS App that would glean dimensions (width and height) by sifting through a small number of free-form fields–title, description, customization information, overview–looking for specific patterns. We were satisfied being able to catch most of the formats in which our sellers reported dimensions, handling variable positions of values and units (<code>3 in x 5 in</code> vs. <code>3 x 5 in</code>), different long and short names of units, special unit characters (‘, “), and so on, in all the different languages that Etsy supports.</p>
<p>Migrating our parsing functionality to the API backend was a first step towards making the AR measuring tool platform-independent, so we could bring it to our Android App as well. It would also be a help in development, since we could iterate improvements to our regex patterns faster outside the <a href="https://www.etsy.com/codeascraft/how-etsy-ships-apps?ref=codeascraft">app release schedule</a>. And we’d get more consistent dimensions because we'd be able to cache the results instead of having to parse them live on the client at each visit.</p>
<p>We knew that an extended AR experience would need to reliably show our users size options for items that had them, so we prioritized the effort to parse out dimensions from variations in listings. We sanitized free-form text input fields that might contain dimensions—inputs like title or description—so that we could catch a wider range of formats. (Several different characters can be used to write quotation marks, used as shorthand for inches and feet, and we needed to handle special characters for new lines, fraction ligatures like ½ or ¼, etc.) Our regex pattern was split and updated so it could detect:</p>
<ul>
<li>Measurement units in plural forms (inches, feet, etc.);</li>
<li>Incorrect spellings (e.g. "foots");</li>
<li>Localization of measurement units in the languages spoken by Etsy’s users ("meters", "metros", and "mètres" in English, Spanish, and French, respectively);</li>
<li>Other formats in which dimensions are captured by sellers like dimensions with unit conversions in parentheses (e.g. <code>12 in x 12 in (30 cm x 30 cm)</code>) or with complex measurements in the imperial system (<code>3’6”</code>).</li>
</ul>
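<p>To make the sanitize-then-match idea concrete, here is a minimal sketch in Python. It is not our production parser: the unit table, the character replacements, and the names <code>sanitize</code> and <code>parse_dimensions</code> are all assumptions made for illustration, and compound imperial values like <code>3’6”</code> are deliberately left out.</p>

```python
import re

# Hypothetical lookup tables; the real parser's unit lists are not shown here.
UNIT_ALIASES = {
    "in": "in", "inch": "in", "inches": "in", '"': "in",
    "ft": "ft", "foot": "ft", "feet": "ft", "foots": "ft", "'": "ft",
    "cm": "cm", "centimeter": "cm", "centimeters": "cm",
    "m": "m", "meter": "m", "meters": "m", "metros": "m", "mètres": "m",
}
FRACTIONS = {"½": ".5", "¼": ".25", "¾": ".75"}
QUOTES = {"\u2018": "'", "\u2019": "'", "\u2032": "'",
          "\u201c": '"', "\u201d": '"', "\u2033": '"'}

# Longest unit names first, so "inches" is tried before "in".
_UNIT = "|".join(sorted((re.escape(u) for u in UNIT_ALIASES), key=len, reverse=True))
_NUM = r"\d*\.?\d+"
# width [unit] x height unit; a unit must not be followed by another letter.
PATTERN = re.compile(
    rf"({_NUM})\s*(?:({_UNIT})(?![a-z]))?\s*x\s*({_NUM})\s*({_UNIT})(?![a-z])"
)

def sanitize(text):
    """Normalize fraction ligatures, curly quotes, whitespace, and case."""
    for src, dst in {**FRACTIONS, **QUOTES}.items():
        text = text.replace(src, dst)
    return " ".join(text.split()).lower()

def parse_dimensions(text):
    """Return a list of ((width, unit), (height, unit)) pairs found in text."""
    results = []
    for w, wu, h, hu in PATTERN.findall(sanitize(text)):
        height_unit = UNIT_ALIASES[hu]
        # "3 x 5 in": the width has no unit of its own, so it inherits the height's.
        width_unit = UNIT_ALIASES[wu] if wu else height_unit
        results.append(((float(w), width_unit), (float(h), height_unit)))
    return results
```

<p>With this sketch, <code>parse_dimensions("3 x 5 in")</code> and <code>parse_dimensions("3 in x 5 in")</code> yield the same pair, and a string like <code>12 inches x 12 inches (30 cm x 30 cm)</code> yields two pairs, one per unit system.</p>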
<p>Making our dimension parsing more robust and bringing it server-side had several ancillary benefits. We were able to maintain the functionality of our iOS app while removing a lot of client-side code, even in Etsy’s <a href="https://developer.apple.com/app-clips/">App Clip</a>, where size is a matter of utmost importance. And though regex processing isn’t that processor-intensive, the view feature performed better once we implemented server-side caching of parsed dimensions. We figured we could even take the parsing offline (rather than parsing every listing on every visit) by running a backfill process to store dimensions in our database and deliver them to the App along with item details.</p>
<p>We found, thanks to decoupling our parser work from the App release cycle, that we were able to test hypotheses faster and iterate at a quicker pace. So we could proceed to some improvements that would have been quite resource-intensive if we had tried to implement them on the native app side. Sellers often provide dimensions in inconsistent units, for instance, or they might add the same data multiple times in different fields, when there are variations in properties like material or color. We worked out ways to de-duplicate this data during parsing, to minimize the number of size options we show users. (Though where we find dimensions that are specifically associated with variations, we make sure to retain them, since those will more directly correlate with offering prices.) And we made it possible to prioritize structured dimension data, where sellers have captured it in dedicated fields, as a more reliable source of truth than free-form parsing.</p>
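<p>One way to picture the de-duplication step is to convert every parsed size to a common unit and drop any size that lands within a small tolerance of one already kept. The sketch below, in Python, does only that. The function name <code>dedupe_sizes</code>, the conversion table, and the half-centimeter tolerance are assumptions for illustration; the sketch does not cover retaining variation-specific dimensions or preferring structured fields.</p>

```python
# Conversion factors to centimeters, used only to compare sizes.
TO_CM = {"in": 2.54, "ft": 30.48, "cm": 1.0, "m": 100.0}

def dedupe_sizes(sizes, tolerance_cm=0.5):
    """sizes: list of (width, height, unit). Returns unique sizes, first seen wins."""
    kept = []      # sizes in their original units
    kept_cm = []   # the same sizes converted to centimeters
    for width, height, unit in sizes:
        w_cm = width * TO_CM[unit]
        h_cm = height * TO_CM[unit]
        # A size is a duplicate if both sides are within tolerance of a kept size.
        duplicate = any(
            tolerance_cm >= max(abs(w_cm - kw), abs(h_cm - kh))
            for kw, kh in kept_cm
        )
        if not duplicate:
            kept.append((width, height, unit))
            kept_cm.append((w_cm, h_cm))
    return kept
```

<p>Given <code>12 in x 12 in</code>, <code>30 cm x 30 cm</code>, and <code>1 ft x 1 ft</code> for the same listing, this keeps only the first, since all three describe roughly the same physical size.</p>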
<h3>Measuring in 3D</h3>
<h4>The box</h4>
<p>With this new and improved dimension data coming to us from the server, we had to figure out the right way to present it in 3D in iOS. The display needed to be intuitive, so our users would know more or less at a glance what the tool was and how to interact with it. Ultimately, we decided to present a rectangular prism-type object scaled to the parsed dimensions we have for a given listing. <a href="https://developer.apple.com/scenekit/">Apple's SceneKit framework</a> – specifically its SCNBox class – is what creates this box, which of course we style with the Etsy Orange look. So that users understand the box's purpose, we make sure to display the length on each side. We use SceneKit's SCNNode class to create the pills displaying our measurements.</p>
<div align="center"><img src="https://i.etsystatic.com/inv/1e68db/5342416361/inv_fullxfull.5342416361_dh5zp24h.jpg?version=0" alt="An orange rectangular prism placed in between furniture with measurements shown on each edge of the prism" width="300" /></div>
<p><br /> Users drag or tap the measuring box to move it around, and it can rotate on all axes – all made possible by having a different animation for each type of rotation using SCNActions. Rotation is a must-have feature: when we place the measuring box in a user's space, we may not always be able to get the orientation correct. We might, as in the illustration below, place a side table vertically on the floor instead of horizontally. Our users would have a poor experience of the measuring tool if they couldn't adjust for that. (Note that you may see some blinking yellow dots when you try out the AR experience: these are called feature points, and they're useful for debugging, to give us an idea of what surfaces are successfully being detected.)</p>
<div align="center">
    <img src="https://media2.giphy.com/media/7azdqVEhSRGCthLpob/giphy.gif" alt="Tall measuring box placed on the ground that is showing the size of the item (a chest of drawers shown on the top left in a small ). The user taps on an arrow to flip the box so that it is horizontal, matching the shape of the chest of drawers." width="300" />
</div>
<p><br /></p>
<h4>Environment occlusion</h4>
<p>In addition to ensuring the box would be scaled correctly, we wanted it to  "sit" as realistically as possible in the real world, so we built in <a href="https://developer.apple.com/documentation/arkit/content_anchors/visualizing_and_interacting_with_a_reconstructed_scene?language=objc">scene occlusion</a>. When a user places the measuring box in a room with other furniture, scene occlusion lets it interact with real-life objects as if the box were actually there. Users get valuable information this way about how an item will fit in their space. Will that end table go between the wall and couch? Will it be tall enough to be visible from behind the couch? (As demonstrated below, the table will indeed be tall enough.)</p>
<div align="center">
       <img src="https://i.etsystatic.com/inv/ea3000/5294262044/inv_fullxfull.5294262044_t18b2i0j.jpg?version=0" alt="Measuring box placed behind couch. Only the top of the box is visible as the rest is being occluded by the couch." width="300" />
</div>
<p><br />
Environment occlusion became a possibility with <a href="https://en.wikipedia.org/wiki/Lidar">LiDAR</a>, a method of determining depth using laser light. Although LiDAR has been around for a few decades, used to map everything from archeological sites to agricultural fields, Apple only included LiDAR scanners in iPhones and iPads beginning in 2020, with the 4th-generation iPad Pro and the iPhone 12 Pro.</p>
<p>An iPhone’s LiDAR scanner retrieves depth information from the area it scans and converts it into a series of vertices which connect to form a mesh (or a surface). To add occlusion to our SceneKit-backed AR feature, we convert the mesh into a 3D object and place it (invisibly to the user) in the environment shown on their phone. As the LiDAR scanner measures more of the environment, we have more meshes to convert into objects and place in 3D. The video below shows an AR session where for debugging purposes we assign a random color to the detected mesh objects. Each different colored outline shown over a real-world object represents a different mesh. Notice how, as we scan more of the room, the device adds more mesh objects as it continues drawing out the environment.</p>
<div align="center">
    <img src="https://media0.giphy.com/media/bH2wThJwmV7XIXdD8c/giphy.gif" alt="The desk and monitor shown become outlined in different colors. As the iPhone camera moves around to scan more of the room, more of the furniture becomes covered in colorful outlines." width="300" />
</div>
<p><br />
The user's device uses these mesh objects to know when and how to occlude the measuring box. Essentially, these mesh objects help determine where the measuring box is relative to all the real-world items and surfaces it should respect. Taking advantage of occlusion gives our users an especially realistic AR experience. In the side-by-side comparison below, the video on the left shows how mesh objects found in the environment determine what part of the measuring box will be hidden as the camera moves in front of the desk. The video on the right shows the exact same thing, but with the mesh objects hidden. </p>
<table>
<thead>
<tr>
<th>Mesh objects are visible</th>
<th>Mesh objects are hidden</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://media1.giphy.com/media/Ulo2PPWfPxLFcvkKs0/giphy.gif" alt="The measuring box is placed under a desk. The desk, as well as the monitor, floor, and most of the environment are digitally covered in different colors. As the camera moves to the side of the desk, more of the box becomes occluded by the side of the desk." width="300" /></td>
<td><img src="https://media1.giphy.com/media/R1waynZYyozQ5vxoJz/giphy.gif" alt="The measuring box is placed under a desk similar to the previous video. As the camera moves to the side of the desk, more of the box becomes occluded by the side of the desk making it seem like the box is actually in the room." width="300" /></td>
</tr>
</tbody>
</table>
<h3>Closing thoughts</h3>
<p>This project took a 2D concept, our Wall View experience, and literally extended it into 3-dimensional space using Apple’s newest AR tools. And though the preparatory work we did improving our dimension parser may not be anything to look at, without the consistency and accuracy of that parsed information this newly realistic and interactive tool would not have been possible. Nearly a million Etsy items now have real-size AR functionality added to them, viewed by tens of thousands of Etsy users every week. As our marketplace evolves and devices become more powerful, working on features like this only increases our appetite for more and brings us closer to providing our users with a marketplace that lets them make the most informed decision about their purchases effortlessly. </p>
<p><em>Special shoutout to Jacob Van Order and Siri McClean as well as the rest of our team for their work on this.</em></p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[The So-fine Real-time ML Paradigm]]></title>
                <link>https://www.etsy.com/codeascraft/the-so-fine-real-time-ml-paradigm?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Kyle Gallatin]]></dc:creator>
                <pubDate>Tue, 5 Sep 2023 15:55:41 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/the-so-fine-real-time-ml-paradigm?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<h2>Introduction</h2>
<p>Each year, Etsy hosts an event known as “CodeMosaic” - an internal hackathon in which Etsy admin quickly propose and build bold advances in our technology across a number of different themes. People across Etsy source ideas, organize into teams, and then have 2-3 days to build innovative proofs-of-concept that might deliver big wins for Etsy’s buyers and sellers, or improve internal engineering systems and workflows.</p>
<p>Besides being a ton of fun, CodeMosaic is a time for engineers to pilot novel ideas. Our team’s project this year was extremely ambitious - we wanted to build a system for stateful machine learning (ML) model training and online machine learning. While our ML pipelines are no stranger to streaming data, we currently don’t have any models that learn in an online context - that is, that can have their weights updated in near-real time.</p>
<p>Stateful training updates an already-trained ML model artifact incrementally, sparing the cost of retraining models from scratch. <em>Online learning</em> updates model weights in production rather than via batch processes. Combined, the two approaches can be extremely powerful. <a href="https://arxiv.org/abs/2107.07106">A study conducted by Grubhub in 2021</a> reported that a shift to stateful online learning saw up to a 45x reduction in costs with a 20% increase in metrics, and I’m all about saving money to make money.</p>
<figure>
<img alt="Stateful continuous retraining and online learning" src="https://i.etsystatic.com/inv/d8a51b/5272462297/inv_fullxfull.5272462297_1e6jr1i2.jpg?version=0" title="Stateful continuous retraining and online learning" />
</figure>
<h2>Day 1 - Planning</h2>
<p>Of course, building such a complex system would be no easy task. The ML pipelines we use to generate training data from user actions require a number of offline, scheduled batch jobs. As a result it takes quite a while, 40 hours at a minimum, for user actions to be reflected in a model’s weights.</p>
<figure>
<img alt="current pipeline" src="https://i.etsystatic.com/inv/0d445c/5272464395/inv_fullxfull.5272464395_mtgyerd2.jpg?version=0" title="current pipeline" />
</figure>
<p>To make this project a success over the course of three days, we needed to scope our work tightly across three streams:</p>
<ol>
<li><strong>Real-time training data</strong> - the task here was to circumvent the batch jobs responsible for our current training data and get attributions (user actions) right from the source.</li>
<li><strong>A service to consume the data stream and learn incrementally</strong> - today, we heavily leverage TensorFlow for model training. We needed to be able to load a model's weights into memory, read data from a stream, update that model, and incrementally push it out to be served online. </li>
<li><strong>Evaluation</strong> - we'd have to make a case for our approach by validating its performance benefits over our current batch processes. </li>
</ol>
<p>No matter how much we limited the scope it wasn't going to be easy, but we broke into three subteams reflecting each track of work and began moving towards implementation.</p>
<h2>Day 2 - Implementation</h2>
<p>The real-time training data team began by looking far upstream of the batch jobs that compute training data - at Etsy’s Beacon Main Kafka stream, which contains bot-filtered events. By using Kafka SQL and some real-time calls to our streaming feature platform, Rivulet, we figured we could put together a realistic approach to solving this part of the problem.</p>
<p>Of course, as with all hackathon ideas it was easier said than done. Much of our feature data uses the binary Avro data format for serialization, and finding the proper schema for deserializing and joining this data was troublesome. The team spent most of the second day munging the data in an attempt to join all the proper sources across platforms. And though we weren't able to write the output to a new topic, the team actually did manage to join multiple data sources in a way that generated real-time training data!</p>
<p>Meanwhile the team focusing on building the consumer service to actually learn from the model faced a different kind of challenge: decision making. What type of model were we going to use? Knowing we weren’t going to be able to use the actual training data stream yet - how would we mock it? Where and how often should we push new model artifacts out? </p>
<p>After significant discussion, we decided to try using an Ad Ranking model as we had an Ads ML engineer in our group and the Ads models take a long time to train - meaning we could squeeze a lot of benefit out of them by implementing continuous training. The engineers in the group began to structure code that pulled an older Ads model into memory and made incremental updates to the weights to satisfy the second requirement. </p>
<p>That meant that all we had left to handle was the most challenging task - evaluation. None of this architecture would mean anything if a model that was trained online performed worse than the model retrained daily in batch. Evaluating a model with more training periods is also more difficult, as in each period we’d need to run the model on some held-out data in order to get an accurate reading without data leakage.</p>
<figure>
<img alt="oof" src="https://i.etsystatic.com/inv/171da0/5224262572/inv_fullxfull.5224262572_694cztlm.jpg?version=0" title="oof" />
</figure>
<p>Instead of performing an extremely laborious and time-intensive evaluation for continuous training like the one outlined above, we chose to have a bit more fun with it. After all, it was a hackathon! What if we made it a competition? Pick a single high-performing Etsy ad and see which surfaced it first, our continuously trained model or the boring old batch-trained one?</p>
<p>We figured if we could get a continuously trained model to recommend a high-performing ad sooner, we’d have done the job! So we set about searching for a high-performing Etsy ad and training data that would allow us to validate our work. </p>
<p>Of course, by the time we were even deciding on an appropriate advertised listing, it was the end of day two, and it was pretty clear the idea wasn’t going to play out before it was time for presentations. But still a fun thought, right?</p>
<h2>Presentation takeaways and impact</h2>
<p>Day 3 gives you a small window for tidying up work and slides, followed by team presentations. At this point, we loosely had these three things:</p>
<ol>
<li>Training data from much earlier in our batch processing pipelines</li>
<li>A Kafka consumer that could almost update a TensorFlow model incrementally </li>
<li>A few click attributions and data for a specific listing</li>
</ol>
<figure>
<img alt="final product" src="https://i.etsystatic.com/inv/b8e099/5272467237/inv_fullxfull.5272467237_68m5p29m.jpg?version=0" title="final product" />
</figure>
<p>In the hackathon spirit, we phoned it in and pivoted towards focusing on the theoretical impact of what we’d been able to achieve!</p>
<p>The first important potential area of impact was cost savings. We estimated that removing the daily “cold-start” training and replacing it with continuous training would save about <strong>$212K annually in Google Cloud costs</strong> for the four models in ads alone.</p>
<p>This is a huge potential win - especially when coupled with the likely metrics gains coming from more reactive models. After all, if we were able to get events to models 40 hours earlier, who knows how much better our ranking could get! </p>
<h2>Future directions and conclusion</h2>
<p>Like many hackathon projects, there's no shortage of hurdles to getting this work into a production state. Aside from the infrastructure required to actually architect a continuous-training pipeline, we’d need a significant number of high-quality checks and balances to ensure that updating models in real time didn’t lead to sudden degradations in performance. The amount of development, number of parties involved, and the breadth of expertise to get this into production would surely be extensive. However, as ML continues to mature, we should be able to enable more complex architectures with less overhead.</p>]]></description>
                <content:encoded><![CDATA[<h2>Introduction</h2>
<p>Each year, Etsy hosts an event known as “CodeMosaic” - an internal hackathon in which Etsy admin propose and quickly build bold advances in our technology across a number of different themes. People across Etsy source ideas, organize into teams, and then have 2-3 days to build innovative proofs of concept that might deliver big wins for Etsy’s buyers and sellers, or improve internal engineering systems and workflows.</p>
<p>Besides being a ton of fun, CodeMosaic is a time for engineers to pilot novel ideas. Our team’s project this year was extremely ambitious - we wanted to build a system for stateful machine learning (ML) model training and online machine learning. While our ML pipelines are no stranger to streaming data, we currently don’t have any models that learn in an online context - that is, that can have their weights updated in near-real time.</p>
<p><em>Stateful training</em> updates an already-trained ML model artifact incrementally, sparing the cost of retraining models from scratch. <em>Online learning</em> updates model weights in production rather than via batch processes. Combined, the two approaches can be extremely powerful. <a href="https://arxiv.org/abs/2107.07106">A study conducted by Grubhub in 2021</a> reported that a shift to stateful online learning saw up to a 45x reduction in costs with a 20% increase in metrics, and I’m all about saving money to make money.</p>
<figure>
<img alt="Stateful continuous retraining and online learning" src="https://i.etsystatic.com/inv/d8a51b/5272462297/inv_fullxfull.5272462297_1e6jr1i2.jpg?version=0" title="Stateful continuous retraining and online learning" />
</figure>
<h2>Day 1 - Planning</h2>
<p>Of course, building such a complex system would be no easy task. The ML pipelines we use to generate training data from user actions require a number of offline, scheduled batch jobs. As a result it takes quite a while, 40 hours at a minimum, for user actions to be reflected in a model’s weights.</p>
<figure>
<img alt="current pipeline" src="https://i.etsystatic.com/inv/0d445c/5272464395/inv_fullxfull.5272464395_mtgyerd2.jpg?version=0" title="current pipeline" />
</figure>
<p>To make this project a success over the course of three days, we needed to scope our work tightly across three streams:</p>
<ol>
<li><strong>Real-time training data</strong> - the task here was to circumvent the batch jobs responsible for our current training data and get attributions (user actions) right from the source.</li>
<li><strong>A service to consume the data stream and learn incrementally</strong> - today, we heavily leverage TensorFlow for model training. We needed to be able to load a model's weights into memory, read data from a stream, update that model, and incrementally push it out to be served online. </li>
<li><strong>Evaluation</strong> - we'd have to make a case for our approach by validating its performance benefits over our current batch processes. </li>
</ol>
<p>No matter how much we limited the scope it wasn't going to be easy, but we broke into three subteams reflecting each track of work and began moving towards implementation.</p>
<h2>Day 2 - Implementation</h2>
<p>The real-time training data team began by looking far upstream of the batch jobs that compute training data - at Etsy’s Beacon Main Kafka stream, which contains bot-filtered events. By using Kafka SQL and some real-time calls to our streaming feature platform, Rivulet, we figured we could put together a realistic approach to solving this part of the problem.</p>
<p>Of course, as with all hackathon ideas, it was easier said than done. Much of our feature data uses the binary Avro data format for serialization, and finding the proper schema for deserializing and joining this data was troublesome. The team spent most of the second day munging the data in an attempt to join all the proper sources across platforms. And though we weren't able to write the output to a new topic, the team actually did manage to join multiple data sources in a way that generated real-time training data!</p>
<p>Meanwhile, the team focused on building the consumer service that would actually learn from the stream faced a different kind of challenge: decision making. What type of model were we going to use? Knowing we weren’t going to be able to use the actual training data stream yet - how would we mock it? Where and how often should we push new model artifacts out?</p>
<p>After significant discussion, we decided to try using an Ad Ranking model as we had an Ads ML engineer in our group and the Ads models take a long time to train - meaning we could squeeze a lot of benefit out of them by implementing continuous training. The engineers in the group began to structure code that pulled an older Ads model into memory and made incremental updates to the weights to satisfy the second requirement. </p>
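<p>As a rough illustration of what "incremental updates to the weights" can look like, here is a minimal sketch in plain Python rather than TensorFlow. Everything in it is an assumption made for illustration: the feature names, the starting weights, the mocked event stream, and the choice of a logistic model with one SGD step per event. It is not the hackathon code.</p>

```python
import math

def predict(weights, features):
    """Click probability from a logistic model; features maps name to value."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

def update(weights, features, label, learning_rate=0.1):
    """One SGD step on log loss, applied in place to the existing weights."""
    error = predict(weights, features) - label
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * error * value
    return weights

# Hypothetical weights standing in for an older batch-trained model.
weights = {"bias": -1.0, "price_match": 0.5}

# Mocked stream of (features, clicked) events standing in for a Kafka topic.
stream = [
    ({"bias": 1.0, "price_match": 1.0}, 1),
    ({"bias": 1.0, "price_match": 0.0}, 0),
    ({"bias": 1.0, "price_match": 1.0}, 1),
]

probe = {"bias": 1.0, "price_match": 1.0}
before = predict(weights, probe)
for features, label in stream:
    update(weights, features, label)  # stateful: no retraining from scratch
after = predict(weights, probe)
```

<p>Because the weights are carried over from the previous state, each event only costs one small update, which is the property that makes continuous training attractive for models that are slow to train from scratch.</p>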
<p>That meant that all we had left to handle was the most challenging task - evaluation. None of this architecture would mean anything if a model that was trained online performed worse than the model retrained daily in batch. Evaluating a model with more training periods is also more difficult, as in each period we’d need to run the model on some held-out data in order to get an accurate reading without data leakage.</p>
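<p>One common way to structure this kind of per-period evaluation is "test-then-train" (often called prequential evaluation): score each new period with the model as it stood before seeing that period, then learn from it. The sketch below assumes a toy click-rate model and made-up labels purely to show the loop; it is not the evaluation we ran.</p>

```python
import math

def log_loss(probability, label):
    """Binary log loss with clipping to avoid log(0)."""
    eps = 1e-12
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

class RunningRateModel:
    """Toy online model: predicts the smoothed click rate seen so far."""

    def __init__(self):
        self.clicks = 1.0  # prior: one click ...
        self.views = 2.0   # ... in two views

    def predict(self):
        return self.clicks / self.views

    def learn(self, labels):
        self.clicks += sum(labels)
        self.views += len(labels)

def prequential_losses(model, periods):
    """Score each period before learning from it, so labels never leak
    into the evaluation of their own period."""
    losses = []
    for labels in periods:
        p = model.predict()
        losses.append(sum(log_loss(p, y) for y in labels) / len(labels))
        model.learn(labels)
    return losses

# Made-up click labels for three consecutive training periods.
periods = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
losses = prequential_losses(RunningRateModel(), periods)
```

<p>Running the same loop for a batch-retrained model would give two loss curves over the same periods that can be compared directly.</p>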
<figure>
<img alt="oof" src="https://i.etsystatic.com/inv/171da0/5224262572/inv_fullxfull.5224262572_694cztlm.jpg?version=0" title="oof" />
</figure>
<p>Instead of performing an extremely laborious and time-intensive evaluation for continuous training like the one outlined above, we chose to have a bit more fun with it. After all, it was a hackathon! What if we made it a competition? Pick a single high-performing Etsy ad and see which surfaced it first, our continuously trained model or the boring old batch-trained one?</p>
<p>We figured if we could get a continuously trained model to recommend a high-performing ad sooner, we’d have done the job! So we set about searching for a high-performing Etsy ad and training data that would allow us to validate our work. </p>
<p>Of course, by the time we were even deciding on an appropriate advertised listing, it was the end of day two, and it was pretty clear the idea wasn’t going to play out before it was time for presentations. But still a fun thought, right?</p>
<h2>Presentation takeaways and impact</h2>
<p>Day 3 gives you a small window for tidying up work and slides, followed by team presentations. At this point, we loosely had these three things:</p>
<ol>
<li>Training data from much earlier in our batch processing pipelines</li>
<li>A Kafka consumer that could almost update a TensorFlow model incrementally </li>
<li>A few click attributions and data for a specific listing</li>
</ol>
<figure>
<img alt="final product" src="https://i.etsystatic.com/inv/b8e099/5272467237/inv_fullxfull.5272467237_68m5p29m.jpg?version=0" title="final product" />
</figure>
<p>In the hackathon spirit, we phoned it in and pivoted towards focusing on the theoretical impact of what we’d been able to achieve!</p>
<p>The first important potential area of impact was cost savings. We estimated that removing the daily “cold-start” training and replacing it with continuous training would save about <strong>$212K annually in Google Cloud costs</strong> for the four models in ads alone.</p>
<p>This is a huge potential win - especially when coupled with the likely metrics gains coming from more reactive models. After all, if we were able to get events to models 40 hours earlier, who knows how much better our ranking could get! </p>
<h2>Future directions and conclusion</h2>
<p>Like many hackathon projects, there's no shortage of hurdles to getting this work into a production state. Aside from the infrastructure required to actually architect a continuous-training pipeline, we’d need a significant number of high-quality checks and balances to ensure that updating models in real time didn’t lead to sudden degradations in performance. The amount of development, number of parties involved, and the breadth of expertise to get this into production would surely be extensive. However, as ML continues to mature, we should be able to enable more complex architectures with less overhead.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[Leveraging Real-Time User Actions to Personalize Etsy Ads]]></title>
                <link>https://www.etsy.com/codeascraft/leveraging-real-time-user-actions-to-personalize-etsy-ads?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Alaa Awad]]></dc:creator>
                <pubDate>Fri, 14 Jul 2023 15:54:41 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/leveraging-real-time-user-actions-to-personalize-etsy-ads?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<h3>Introduction</h3>
<p>Personalization is vital to connect our unique marketplace to the right buyer at the right time. Etsy has recently introduced a novel, general approach to personalizing ML models based on encoding and learning from short-term (one-hour) sequences of user actions through a reusable three-component deep learning module, the <em>adSformer Diversifiable Personalization Module (ADPM)</em>. We describe in detail our method in our <a href="https://arxiv.org/pdf/2302.01255.pdf">recent paper</a>, with an emphasis on personalizing the CTR (clickthrough rate) and PCCVR (post-click conversion rate) ranking models we use in Etsy Ads. Here, we'd like to present a brief overview.</p>
<p>Etsy offers its sellers the opportunity to place sponsored listings as a supplement to the organic results returned by Etsy search. For sellers and buyers alike, it’s important that those sponsored listings be as relevant to the user’s intent as possible. As Figure 1 suggests, when it comes to search, a “jacket” isn't always just any jacket:</p>
<div align="center"><figure><img alt="Ad results for the query jacket" src="https://i.etsystatic.com/inv/89d8d6/5048078712/inv_fullxfull.5048078712_he5407ld.jpg?version=0" title="Ad results for the query jacket" /></figure></div>
<div align="center"><figure><img alt="Personalized Ad results for the query jacket" src="https://i.etsystatic.com/inv/e8f01e/5048080966/inv_fullxfull.5048080966_stopi8fs.jpg?version=0" title="Personalized Ad results for the query jacket" /><figcaption>Figure 1: Ad results for the query jacket for a user who has recently interacted with men's leather jackets. In the top row, the results without personalized ranking; in the bottom row, the results with session personalization.</figcaption></figure></div>
<p>For ads to be relevant, they need to be personalized.</p>
<p>If we define a “session” as a one-hour shopping window, and make a histogram of the total number of listings viewed across a sample of sessions (Fig. 2), we see that a power law distribution emerges. The vast majority of users interact with only a small number of listings before leaving their sessions.</p>
<div align="center"><figure><img alt="A histogram of listing views in a user session" src="https://i.etsystatic.com/inv/dbc69e/5048107434/inv_fullxfull.5048107434_1h2fhofh.jpg?version=0" title="A histogram of listing views in a user session" /><figcaption>Figure 2: A histogram of listing views in a user session. Most users see fewer than ten listings in a one-hour shopping window.</figcaption></figure></div>
<p>Understood simply in terms of listing views, it might seem that session personalization would be an insurmountable challenge. To overcome this challenge we leverage a rich stream of user actions surrounding those views and communicating intent, for example: search queries, item favorites, views, add-to-carts, and purchases. Our rankers can optimize the shopping experience in the moment by utilizing streaming features being made available within seconds of these user actions. </p>
<p>Consider a hypothetical sequence of lamps viewed by a buyer within the last hour.</p>
<figure><img alt="Figure 3: An example of a user session with the sequence of items viewed over time." src="https://i.etsystatic.com/inv/4fd6b3/5096337205/inv_fullxfull.5096337205_5ws08oq7.jpg?version=0" title="Figure 3: An example of a user session with the sequence of items viewed over time." /><figcaption>Figure 3: An example of a user session with the sequence of items viewed over time. <strong>70s orange lamp ---&gt; retro table lamp --&gt; vintage mushroom lamp</strong></figcaption></figure><p>Not only is the buyer looking within a particular set of lamps (orange, mushroom-shaped), but they arrived at these lamps through a sequence of query refinements. The search content itself contains information about the visual and textual similarities between the listings, and the order in which the queries occur adds another dimension of information. The content and the sequence of events can be used together to infer what is driving the user’s current interest in lamps.</p>
<h3>adSformer Diversifiable Personalization Module</h3>
<p>The adSformer Diversifiable Personalization Module (ADPM), illustrated on the left hand side of Figure 4, is Etsy's solution for using temporal and content signals for session personalization. A dynamic representation of the user is generated from a sequence of the user's most recent streamed actions. The input sequence contains item IDs, queries issued and categories viewed. We consider the item IDs, queries, and categories as “entities” that have recent interactions within the session. For each of these entities we consider different types of actions within a user session–views, recent cart-adds, favorites, and purchases–and we encode each type of entity/action pair separately. This lets us capture fine-grained information about the user's interests in their current session.</p>
<figure><img alt="the ADPM architecture." src="https://i.etsystatic.com/inv/624571/5048121990/inv_fullxfull.5048121990_5800uxly.jpg?version=0" title="the ADPM architecture." /><figcaption>Figure 4: On the left, a stack representing the ADPM architecture. The right part of the figure is a blown-out illustration of the adSformer Encoder component.</figcaption></figure><p>Through ablation studies we found that ADPM’s three components work together symbiotically to outperform experiments where each component is considered independently. Furthermore, in deployed applications, the diversity of learned signals improves robustness to input distribution shifts. It also leads to more relevant personalized results, because we understand the user from multiple perspectives. Here is how the three components operate:</p>
<ol><li>
<p><strong>Component One: The adSformer Encoder</strong>
The adSformer encoder component uses one or more custom adSformer blocks illustrated in the right panel of Figure 4. This component learns a deep, expressive representation of the one-hour input sequence. The adSformer block modifies the standard transformer block in the attention literature by adding a final global max pooling layer. The pooling layer downsamples the block’s outputs by extracting the most salient signals from the sequence representation instead of outputting the fully concatenated standard transformer output. Formally, for a user, for a one-hour sequence S of viewed item IDs, the adSformer encoder is defined as the output of a stack of layers <em>g(x)</em>, where <em>x</em> is the output of each previous layer and <em>o1</em> is the component’s output. The first layer is an embedding of item and position.
<img alt="eq1" src="https://i.etsystatic.com/inv/1d68a5/5096364413/inv_fullxfull.5096364413_5e8gg3sk.jpg?version=0" title="eq1" /></p>
</li>
<li>
<p><strong>Component Two: Pretrained Representations.</strong> Component two employs pretrained embeddings of item IDs that users have interacted with, together with average pooling, to encode the one-hour sequence of user actions. Depending on downstream performance and availability, we choose from multimodal (AIR) representations and visual representations. Thus component two encodes rich image, text and multimodal signals from all the items in the sequence. The advantage of leveraging pretrained item embeddings is that these rich representations are learned efficiently offline using complex deep learning architectures that would not be feasible online in real time. Formally, for a given one-hour sequence of <em>m1hr</em> item IDs with pretrained d-dimensional embedding vectors <em>e</em>, we compute a sequence representation as
<img alt="equation2" src="https://i.etsystatic.com/inv/fe740e/5096377215/inv_fullxfull.5096377215_ap3e17ll.jpg?version=0" title="equation2" /></p>
</li>
<li><strong>Component Three: Representations Learned "On the Fly"</strong>
The third component of ADPM introduces representations learned for each sequence from scratch in its own vector space as part of the downstream models. This component learns lightweight representations for many different sequences for which we do not have pretrained representations available, for example sequences of favorited shop IDs. Formally, for <em>z</em> one-hour sequences of entities acted upon <em>S</em> we learn embeddings for each entity and sequence to obtain the component’s output <em>o3</em> as
<img alt="equation3" src="https://i.etsystatic.com/inv/ee0131/5048154642/inv_fullxfull.5048154642_6hhvnysd.jpg?version=0" title="equation3" /></li>
</ol><p>The intermediary outputs of the three components are concatenated to form the final ADPM output, the dynamic user representation <em>u</em>. This user representation is then concatenated to the input vector in various rankers or recommenders we want to real-time personalize. Formally, for one-hour variable length sequences of user actions <em>S</em>, and ADPM’s components outputs <em>o</em>
<img alt="equation4" src="https://i.etsystatic.com/inv/fca820/5096384679/inv_fullxfull.5096384679_4gim0xk5.jpg?version=0" title="equation4" /></p>
<p>From a software perspective, the module is implemented as a TensorFlow Keras module which can easily be employed in downstream models through a simple import statement.</p>
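<p>To make the data flow concrete, here is a minimal sketch in plain Python of how the three component outputs could be pooled and concatenated into the user representation <em>u</em>. It is a simplification under stated assumptions: the transformer layers of the adSformer encoder are omitted (only its final global max pooling is shown), average pooling for component three is an assumption, and all function names and numbers are hypothetical.</p>

```python
def max_pool(sequence):
    """Global max pooling: per-dimension maximum over sequence positions."""
    return [max(column) for column in zip(*sequence)]

def mean_pool(sequence):
    """Average pooling: per-dimension mean over sequence positions."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]

def adpm_output(encoded_sequence, pretrained_sequence, learned_sequences):
    """Concatenate the three component outputs into one user vector."""
    # o1: stand-in for the adSformer encoder, reduced to its final pooling step.
    o1 = max_pool(encoded_sequence)
    # o2: average of pretrained item embeddings in the one-hour sequence.
    o2 = mean_pool(pretrained_sequence)
    # o3: one pooled vector per "on the fly" sequence (pooling choice assumed).
    o3 = [value for seq in learned_sequences for value in mean_pool(seq)]
    return o1 + o2 + o3

# Hypothetical two-dimensional embeddings for a short session.
encoded = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
pretrained = [[1.0, 2.0], [3.0, 4.0]]
learned = [[[0.0, 2.0], [2.0, 0.0]]]

u = adpm_output(encoded, pretrained, learned)
```

<p>The resulting vector is what would then be concatenated to the rest of a downstream ranker's input.</p>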
<h3>Pretrained Representation Learning</h3>
<p>The second component of the ADPM includes pretrained representations. We rely on several pretrained representations: image embeddings, text embeddings, and multimodal item representations.</p>
<p><strong>Visual Representations</strong>
In Etsy Ads, we employ image signals across a variety of tasks, such as visually similar candidate generation, search by image, as inputs for learning other pretrained representations, and in the ADPM's second component. To effectively leverage the rich signal encoded in Etsy Ads images we train image embeddings in a multitask classification learning paradigm. By using multiple classification heads, such as taxonomy, color, and material, our representations are able to capture more diverse information about the image. So far we have derived great benefit from our multitask visual embeddings, trained using a lightweight EfficientNetB0 architecture, and weights pretrained on ImageNet as backbone. We replaced the final layer with a 256-dimensional convolutional block, which becomes the output embedding. We apply image random rotation, translation, zoom, and a color contrast transformation to augment the dataset during training. We are currently in the process of updating the backbone architectures to efficient vision transformers to further improve the quality of the image representations and the benefits derived in downstream applications, including the ADPM.</p>
<p><strong>Ads Information Retrieval Representations </strong>
Ads Information Retrieval (AIR) item representations encode an item ID through a metric learning approach, which aims to learn a distance function or similarity metric between two items. Standard approaches to metric learning include siamese networks, contrastive loss, and triplet loss. However, we found more interpretable results using a sampled in-batch softmax loss. This method treats each batch as a classification problem pairing all the items in a batch that were co-clicked. A pseudo-two-tower architecture encodes the source items and the candidate items, with the two towers sharing all trainable weights. Each item tower captures and encodes information about an item’s title, image, primary color, attributes, category, etc. This information diversity is key to our personalization outcomes. By leveraging a variety of data sources, the system can identify patterns and insights that would be missed by a more limited set of inputs.</p>
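<p>For readers unfamiliar with in-batch softmax, the following sketch shows the basic idea in plain Python: for each source item, its co-clicked candidate is the positive class and the other candidates in the batch act as negatives. It is a simplified illustration and leaves out any sampling-bias correction; the function names and the toy embeddings are assumptions, not the AIR training code.</p>

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_softmax_loss(source_embeddings, candidate_embeddings):
    """Mean cross-entropy where row i's positive is candidate i and the
    remaining candidates in the batch serve as negatives."""
    total = 0.0
    for i, source in enumerate(source_embeddings):
        logits = [dot(source, candidate) for candidate in candidate_embeddings]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(source_embeddings)

sources = [[1.0, 0.0], [0.0, 1.0]]

# Candidates aligned with their co-clicked sources give a low loss ...
aligned = in_batch_softmax_loss(sources, [[2.0, 0.0], [0.0, 2.0]])
# ... while mismatched pairs give a high one.
swapped = in_batch_softmax_loss(sources, [[0.0, 2.0], [2.0, 0.0]])
```

<p>Minimizing this loss pulls co-clicked items together in the embedding space and pushes apart items that merely happened to share a batch.</p>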
<h3>ADPM-Personalized Sponsored Search</h3>
<p>ADPM’s effectiveness and generality are demonstrated in the way we use it to personalize the CTR prediction model in EtsyAds’ Sponsored Search. The ADPM encodes reverse-chronological sequences of recent user actions (in the sliding one-hour window we've discussed), anywhere on Etsy, for both logged-in and logged-out users. We concatenate ADPM’s output, the dynamic user representation, to the rest of the wide input vector in the CTR model. To fully leverage this even wider input vector, a deep and cross (DCN) interaction module is included in the overall CTR architecture. If we remove the DCN module, the CTR model’s ROC-AUC drops by 1.17%. The architecture of the ADPM-personalized CTR prediction model employed by EtsyAds in sponsored search is given in Figure 5. (We also employ the ADPM to personalize the PCCVR model with a similar architecture, which naturally led to ensembling the two models in a multitask architecture, a topic beyond the scope of this blog post.)</p>
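<p>As background on what a cross module computes, here is a sketch of a single cross layer in the form published in the original Deep and Cross Network paper, where each layer multiplies the input vector by a learned scalar projection of the current layer and adds a residual. Which DCN variant our production model uses is not covered here, so treat the formulation, the function name, and the toy numbers as assumptions.</p>

```python
def cross_layer(x0, xl, weights, bias):
    """One cross layer: x_next = x0 * (xl . w) + b + xl."""
    scale = sum(w * v for w, v in zip(weights, xl))  # scalar projection xl . w
    return [a * scale + b + c for a, b, c in zip(x0, bias, xl)]

# Toy two-dimensional input; real inputs would be the wide concatenated vector.
x0 = [1.0, 2.0]
x1 = cross_layer(x0, x0, weights=[0.5, 0.25], bias=[0.0, 0.0])
```

<p>Stacking such layers builds explicit feature interactions of increasing degree while adding only a weight and a bias vector per layer.</p>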
<figure><img alt="ADPM used in a downstream ranking model" src="https://i.etsystatic.com/inv/bfbf9a/5096396023/inv_fullxfull.5096396023_szxvnyfk.jpg?version=0" title="ADPM used in a downstream ranking model" /><figcaption>Figure 5: An example of how the ADPM is used in a downstream ranking model</figcaption></figure><p>The ADPM-personalized CTR and PCCVR models outperformed the CTR and PCCVR non-personalized production baselines by +2.66% and +2.42%, respectively, in offline Area Under the Receiver Operating Characteristic Curve (ROC-AUC). Following the robust online gains in A/B tests, we deployed the ADPM-personalized sponsored search system to 100% of traffic. </p>
<h3>Conclusion</h3>
<p>The adSformer diversifiable personalization module (ADPM) is a scalable, general approach to model personalization from short-term sequences of recent user actions. Its use in sponsored search to personalize our ranking and bidding models is a milestone for EtsyAds, and is delivering greater relevance in sponsored placements for the millions of buyers and sellers that Etsy's marketplace brings together. If you would like more details about ADPM, please see our <a href="https://arxiv.org/pdf/2302.01255.pdf">paper</a>.</p>]]></description>
                <content:encoded><![CDATA[<h3>Introduction</h3>
<p>Personalization is vital to connect our unique marketplace to the right buyer at the right time. Etsy has recently introduced a novel, general approach to personalizing ML models based on encoding and learning from short-term (one-hour) sequences of user actions through a reusable three-component deep learning module, the <em>adSformer Diversifiable Personalization Module (ADPM)</em>. We describe in detail our method in our <a href="https://arxiv.org/pdf/2302.01255.pdf">recent paper</a>, with an emphasis on personalizing the CTR (clickthrough rate) and PCCVR (post-click conversion rate) ranking models we use in Etsy Ads. Here, we'd like to present a brief overview.</p>
<p>Etsy offers its sellers the opportunity to place sponsored listings as a supplement to the organic results returned by Etsy search. For sellers and buyers alike, it’s important that those sponsored listings be as relevant to the user’s intent as possible. As Figure 1 suggests, when it comes to search, a “jacket” isn't always just any jacket:</p>
<div align="center"><figure><img alt="Ad results for the query jacket" src="https://i.etsystatic.com/inv/89d8d6/5048078712/inv_fullxfull.5048078712_he5407ld.jpg?version=0" title="Ad results for the query jacket" /></figure></div>
<div align="center"><figure><img alt="Personalized Ad results for the query jacket" src="https://i.etsystatic.com/inv/e8f01e/5048080966/inv_fullxfull.5048080966_stopi8fs.jpg?version=0" title="Personalized Ad results for the query jacket" /><figcaption>Figure 1: Ad results for the query jacket for a user who has recently interacted with men's leather jackets. In the top row, the results without personalized ranking; in the bottom row, the results with session personalization.</figcaption></figure></div>
<p>For ads to be relevant, they need to be personalized.</p>
<p>If we define a “session” as a one-hour shopping window, and make a histogram of the total number of listings viewed across a sample of sessions (Fig. 2), we see that a power law distribution emerges. The vast majority of users interact with only a small number of listings before leaving their sessions.</p>
<div align="center"><figure><img alt="A histogram of listing views in a user session" src="https://i.etsystatic.com/inv/dbc69e/5048107434/inv_fullxfull.5048107434_1h2fhofh.jpg?version=0" title="A histogram of listing views in a user session" /><figcaption>Figure 2: A histogram of listing views in a user session. Most users see fewer than ten listings in a one-hour shopping window.</figcaption></figure></div>
<p>Understood simply in terms of listing views, it might seem that session personalization would be an insurmountable challenge. To overcome this challenge we leverage a rich stream of user actions surrounding those views and communicating intent, for example: search queries, item favorites, views, add-to-carts, and purchases. Our rankers can optimize the shopping experience in the moment by utilizing streaming features being made available within seconds of these user actions. </p>
<p>Consider a hypothetical sequence of lamps viewed by a buyer within the last hour.</p>
<figure><img alt="Figure 3: An example of a user session with the sequence of items viewed over time." src="https://i.etsystatic.com/inv/4fd6b3/5096337205/inv_fullxfull.5096337205_5ws08oq7.jpg?version=0" title="Figure 3: An example of a user session with the sequence of items viewed over time." /><figcaption>Figure 3: An example of a user session with the sequence of items viewed over time. <strong>70s orange lamp ---&gt; retro table lamp --&gt; vintage mushroom lamp</strong></figcaption></figure><p>Not only is the buyer looking within a particular set of lamps (orange, mushroom-shaped), but they arrived at these lamps through a sequence of query refinements. The search content itself contains information about the visual and textual similarities between the listings, and the order in which the queries occur adds another dimension of information. The content and the sequence of events can be used together to infer what is driving the user’s current interest in lamps.</p>
<h3>adSformer Diversifiable Personalization Module</h3>
<p>The adSformer Diversifiable Personalization Module (ADPM), illustrated on the left hand side of Figure 4, is Etsy's solution for using temporal and content signals for session personalization. A dynamic representation of the user is generated from a sequence of the user's most recent streamed actions. The input sequence contains item IDs, queries issued and categories viewed. We consider the item IDs, queries, and categories as “entities” that have recent interactions within the session. For each of these entities we consider different types of actions within a user session–views, recent cart-adds, favorites, and purchases–and we encode each type of entity/action pair separately. This lets us capture fine-grained information about the user's interests in their current session.</p>
<figure><img alt="the ADPM architecture." src="https://i.etsystatic.com/inv/624571/5048121990/inv_fullxfull.5048121990_5800uxly.jpg?version=0" title="the ADPM architecture." /><figcaption>Figure 4: On the left, a stack representing the ADPM architecture. The right part of the figure is a blown-out illustration of the adSformer Encoder component.</figcaption></figure><p>Through ablation studies we found that ADPM’s three components work together symbiotically to outperform experiments where each component is considered independently. Furthermore, in deployed applications, the diversity of learned signals improves robustness to input distribution shifts. It also leads to more relevant personalized results, because we understand the user from multiple perspectives. Here is how the three components operate:</p>
<ol><li>
<p><strong>Component One: The adSformer Encoder</strong>
The adSformer encoder component uses one or more custom adSformer blocks illustrated in the right panel of Figure 4. This component learns a deep, expressive representation of the one-hour input sequence. The adSformer block modifies the standard transformer block in the attention literature by adding a final global max pooling layer. The pooling layer downsamples the block’s outputs by extracting the most salient signals from the sequence representation instead of outputting the fully concatenated standard transformer output. Formally, for a user, for a one-hour sequence S of viewed item IDs, the adSformer encoder is defined as the output of a stack of layers <em>g(x)</em>, where <em>x</em> is the output of each previous layer and <em>o1</em> is the component’s output. The first layer is an embedding of item and position.
<img alt="eq1" src="https://i.etsystatic.com/inv/1d68a5/5096364413/inv_fullxfull.5096364413_5e8gg3sk.jpg?version=0" title="eq1" /></p>
</li>
<li>
<p><strong>Component Two: Pretrained Representations.</strong> Component two employs pretrained embeddings of the item IDs that a user has interacted with, combined with average pooling, to encode the one-hour sequence of user actions. Depending on downstream performance and availability, we choose from multimodal (AIR) representations and visual representations. Thus component two encodes rich image, text and multimodal signals from all the items in the sequence. The advantage of leveraging pretrained item embeddings is that these rich representations are learned efficiently offline using complex deep learning architectures that would not be feasible online in real time. Formally, for a given one-hour sequence of <em>m1hr</em> item IDs with pretrained d-dimensional embedding vectors <em>e</em>, we compute a sequence representation as
<img alt="equation2" src="https://i.etsystatic.com/inv/fe740e/5096377215/inv_fullxfull.5096377215_ap3e17ll.jpg?version=0" title="equation2" /></p>
</li>
<li><strong>Component Three: Representations Learned "On the Fly"</strong>
The third component of ADPM introduces representations learned for each sequence from scratch in its own vector space as part of the downstream models. This component learns lightweight representations for many different sequences for which we do not have pretrained representations available, for example sequences of favorited shop ids. Formally, for z one-hour sequences of entities acted upon <em>S</em> we learn embeddings for each entity and sequence to obtain the component’s output <em>o3</em> as
<img alt="equation3" src="https://i.etsystatic.com/inv/ee0131/5048154642/inv_fullxfull.5048154642_6hhvnysd.jpg?version=0" title="equation3" /></li>
</ol><p>The intermediary outputs of the three components are concatenated to form the final ADPM output, the dynamic user representation <em>u</em>. This user representation is then concatenated to the input vector in various rankers or recommenders we want to real-time personalize. Formally, for one-hour variable length sequences of user actions <em>S</em>, and ADPM’s components outputs <em>o</em>
<img alt="equation4" src="https://i.etsystatic.com/inv/fca820/5096384679/inv_fullxfull.5096384679_4gim0xk5.jpg?version=0" title="equation4" /></p>
<p>From a software perspective, the module is implemented as a TensorFlow Keras module which can easily be employed in downstream models through a simple import statement.</p>
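<p>The way the three component outputs are pooled and concatenated can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Keras module itself: the encoder states are taken as given rather than computed by a transformer block, and the use of average pooling for component three is our assumption, since the text above does not specify how those learned embeddings are reduced. All function names are hypothetical.</p>

```python
def global_max_pool(sequence):
    """Keep the largest value of each dimension across the sequence."""
    return [max(column) for column in zip(*sequence)]


def average_pool(sequence):
    """Average each dimension across the sequence."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]


def adpm_output(encoder_states, pretrained_embeddings, learned_sequences):
    """Concatenate the three component outputs into a user representation u."""
    # Component one: global max pooling over the (assumed) encoder states.
    o1 = global_max_pool(encoder_states)
    # Component two: average pooling over pretrained item embeddings.
    o2 = average_pool(pretrained_embeddings)
    # Component three: one pooled vector per learned sequence
    # (average pooling here is an assumption made for illustration).
    o3 = []
    for sequence in learned_sequences:
        o3.extend(average_pool(sequence))
    return o1 + o2 + o3


u = adpm_output(
    encoder_states=[[0.1, 0.9], [0.7, 0.2], [0.4, 0.5]],
    pretrained_embeddings=[[1.0, 3.0], [3.0, 5.0]],
    learned_sequences=[[[2.0], [4.0]], [[1.0]]],
)
```

<p>The length of <em>u</em> is the sum of the component widths, which is why concatenating it onto a ranker's input makes that input noticeably wider.</p>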
<h3>Pretrained Representation Learning</h3>
<p>The second component of the ADPM includes pretrained representations. We rely on several pretrained representations: image embeddings, text embeddings, and multimodal item representations.</p>
<p><strong>Visual Representations</strong>
In Etsy Ads, we employ image signals across a variety of tasks, such as visually similar candidate generation, search by image, as inputs for learning other pretrained representations, and in the ADPM's second component. To effectively leverage the rich signal encoded in Etsy Ads images we train image embeddings in a multitask classification learning paradigm. By using multiple classification heads, such as taxonomy, color, and material, our representations are able to capture more diverse information about the image. So far we have derived great benefit from our multitask visual embeddings, trained using a lightweight EfficientNetB0 architecture with weights pretrained on ImageNet as the backbone. We replaced the final layer with a 256-dimensional convolutional block, which becomes the output embedding. We apply random image rotation, translation, zoom, and a color contrast transformation to augment the dataset during training. We are currently in the process of updating the backbone architectures to efficient vision transformers to further improve the quality of the image representations and the benefits derived in downstream applications, including the ADPM.</p>
<p><strong>Ads Information Retrieval Representations </strong>
Ads Information Retrieval (AIR) item representations encode an item ID through a metric learning approach, which aims to learn a distance function or similarity metric between two items. Standard approaches to metric learning include siamese networks, contrastive loss, and triplet loss. However, we found more interpretable results using a sampled in-batch softmax loss. This method treats each batch as a classification problem, pairing the items in the batch that were co-clicked. A pseudo-two-tower architecture encodes the source items and the candidate items, and the two towers share all trainable weights. Each item tower captures and encodes information about an item’s title, image, primary color, attributes, category, etc. This information diversity is key to our personalization outcomes. By leveraging a variety of data sources, the system can identify patterns and insights that would be missed by a more limited set of inputs.</p>
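<p>As a rough illustration of an in-batch softmax loss, the sketch below scores every source item against every candidate in the batch with a dot product and treats the co-clicked candidate at the same index as the correct class. It is a simplified stand-in: it assumes the embeddings have already been produced by the shared tower, it omits any sampling-bias correction, and the function name is hypothetical.</p>

```python
import math


def in_batch_softmax_loss(source_vectors, candidate_vectors):
    """Mean cross-entropy where the positive for source i is candidate i.

    Every other candidate in the batch serves as a negative for source i.
    """
    losses = []
    for i, source in enumerate(source_vectors):
        # Dot-product similarity between this source and each candidate.
        logits = [
            sum(s * c for s, c in zip(source, candidate))
            for candidate in candidate_vectors
        ]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


sources = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_softmax_loss(sources, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_softmax_loss(sources, [[0.0, 1.0], [1.0, 0.0]])
```

<p>The loss is lower when each source embedding sits closest to its own co-clicked candidate, which is the behavior the training objective rewards.</p>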
<h3>ADPM-Personalized Sponsored Search</h3>
<p>ADPM’s effectiveness and generality are demonstrated in the way we use it to personalize the CTR prediction model in EtsyAds’ Sponsored Search. The ADPM encodes reverse-chronological sequences of recent user actions (in the sliding one-hour window we've discussed), anywhere on Etsy, for both logged-in and logged-out users. We concatenate ADPM’s output, the dynamic user representation, to the rest of the wide input vector in the CTR model. To fully leverage this even wider input vector, a deep and cross (DCN) interaction module is included in the overall CTR architecture. If we remove the DCN module, the CTR model’s ROC-AUC drops by 1.17%. The architecture of the ADPM-personalized CTR prediction model employed by EtsyAds in sponsored search is given in Figure 5. (We also employ the ADPM to personalize the PCCVR model with a similar architecture, which naturally led to ensembling the two models in a multitask architecture, a topic beyond the scope of this blog post.)</p>
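<p>For readers unfamiliar with cross layers, here is a minimal sketch of a single one, following the vector-weight formulation from the original deep and cross network literature. Which DCN variant the production CTR model uses is not stated here, so treat the formula choice, the function name, and the toy values as assumptions.</p>

```python
def cross_layer(x0, xl, weights, bias):
    """One cross layer: x_next = x0 * (xl . w) + b + xl.

    x0 is the original input vector and xl is the previous layer's output.
    """
    # Project the previous layer's output onto the weight vector.
    scalar = sum(x * w for x, w in zip(xl, weights))
    # Scale the original input by that projection, then add bias and residual.
    return [a * scalar + b + c for a, b, c in zip(x0, bias, xl)]


x0 = [1.0, 2.0]
x1 = cross_layer(x0, x0, weights=[0.5, 0.5], bias=[0.0, 0.0])
```

<p>Stacking such layers multiplies the original input into each step, which lets the model form explicit feature interactions across a wide input vector.</p>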
<figure><img alt="ADPM used in a downstream ranking model" src="https://i.etsystatic.com/inv/bfbf9a/5096396023/inv_fullxfull.5096396023_szxvnyfk.jpg?version=0" title="ADPM used in a downstream ranking model" /><figcaption>Figure 5: An example of how the ADPM is used in a downstream ranking model</figcaption></figure><p>The ADPM-personalized CTR and PCCVR models outperformed the CTR and PCCVR non-personalized production baselines by +2.66% and +2.42%, respectively, in offline Area Under the Receiver Operating Characteristic Curve (ROC-AUC). Following the robust online gains in A/B tests, we deployed the ADPM-personalized sponsored search system to 100% of traffic. </p>
<h3>Conclusion</h3>
<p>The adSformer diversifiable personalization module (ADPM) is a scalable, general approach to model personalization from short-term sequences of recent user actions. Its use in sponsored search to personalize our ranking and bidding models is a milestone for EtsyAds, and is delivering greater relevance in sponsored placements for the millions of buyers and sellers that Etsy's marketplace brings together. If you would like more details about ADPM, please see our <a href="https://arxiv.org/pdf/2302.01255.pdf">paper</a>.</p>]]></content:encoded>
            </item>
        
            <item>
                <title><![CDATA[The Problem with Timeseries Data in Machine Learning Feature Systems]]></title>
                <link>https://www.etsy.com/codeascraft/the-problem-with-timeseries-data-in-machine-learning-feature-systems?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</link>
                <dc:creator><![CDATA[Kyle Gallatin]]></dc:creator>
                <pubDate>Fri, 23 Jun 2023 16:30:41 -0400</pubDate>
                <guid isPermaLink="true">https://www.etsy.com/codeascraft/the-problem-with-timeseries-data-in-machine-learning-feature-systems?utm_source=OpenGraph&amp;utm_medium=PageTools&amp;utm_campaign=Share</guid>
                <description><![CDATA[<h1>Introduction</h1>
<p>The Feature Systems team at Etsy is responsible for the platform and services through which machine learning (ML) practitioners create, manage and consume feature data for their machine learning models. We recently made new real-time features available through our streaming feature platform, Rivulet, where we return things like “most recent add-to-carts.” While timeseries data itself wasn’t new to our system, these newer features from our streaming feature service would be the first timeseries inputs to be supplied to our ML models themselves to inform search, ads, and recommendations use cases.</p>
<p>Not too long after we made these features available to users for ML model training, we received a message from Harshal, an ML practitioner on Recommendations, warning us of "major problems" lying in wait.
<img alt="" src="https://i.etsystatic.com/inv/8b03b2/5028042330/inv_fullxfull.5028042330_7o47f3m8.jpg?version=0" /><em>Figure 1. A user message alerting us to the possibility of "major problems for downstream ML models" in our use of the timestamp datatype.</em></p>
<p>Harshal told us our choice to export real-time features using a timestamp datatype was going to cause problems in downstream models. </p>
<p>The training data that comes from our offline feature store uses the binary Avro file format, which has a logical type called <em>timestamp</em> we used to store these features, with an annotation specifying that they should be at millisecond precision. The problem, we were being informed, is that this Avro logical type would be interpreted differently in different frameworks. Pandas, NumPy, and Spark would read our timestamps, served with millisecond precision, as datetime objects with nanosecond precision, creating the possibility of training/serving skew. In order to prevent mismatches, and the risk they posed of silent failures in production, the recommendation was that we avoid the timestamp type entirely and serve our features as a more basic numeric data type, such as Longs.</p>
<h1>Getting to the root of the issue</h1>
<p>We started the way software engineers usually do, attempting to break down the problem and get to root causes. Before changing data types, we wanted to understand if the misinterpretation of the precision of the timestamp was an issue with Python, Spark, or even a misuse of the Avro timestamp annotation that we were using to specify the millisecond precision. We were hesitant to alter the data type of the feature without an in-depth investigation. After all, timestamp and datetime objects are typically passed around between systems precisely to resolve inconsistencies and improve communication.</p>
<p>We started by attempting to put together a diagram of all the different ways that timestamp features were represented across our systems. The result was a diagram like this:
<img alt="" src="https://i.etsystatic.com/inv/99cd8e/5028043606/inv_fullxfull.5028043606_besvow2f.jpg?version=0" /><em>Figure 2. A diagram of all the objects/interpretations of timestamp features across our systems. Though the user only ever sees microseconds, between system domains we see a diversity of representations.</em></p>
<p>While it was clear Spark and other frameworks <em>weren’t</em> respecting the timestamp annotation that specified millisecond precision, we began to realize that that problem was actually a symptom of a larger issue for our ML practitioners. Timestamp features can take a number of different forms before finally being passed into a model. In itself this isn't really surprising. Every type is language-specific in one way or another - the diagram would look similar if we were going to be serializing integers in Scala and deserializing integers in Python. However, the overall disparity between objects is much greater for complex datetime objects than it is for basic data types. There is simply more room for interpretation with datetime objects, and less certainty about how they translate across system boundaries, and for our use case in training ML models uncertainty was exactly what we <em>did not</em> want. </p>
<p>As we dug deeper into the question, it started to become clear that we weren’t trying to resolve a specific bug or issue, but to reduce the amount of toil for ML practitioners who would be consuming timestamp features long-term. While the ISO 8601 format is <em>massively</em> helpful for sharing datetime and timestamp objects across different systems, it’s less helpful when all you’re looking for <em>is an integer representation at a specific precision</em>.</p>
<p>Since these timestamps were features of a machine learning model, we didn’t need all the complexity that datetime objects and timestamp types offered across systems. The use case for this information was to be fed as an integer of a specific precision into an ML model, and nothing more. Storing timestamps as logical types increased cognitive overhead for ML practitioners and introduced additional risk that training with the wrong precision could degrade model quality during inference.</p>
<h1>Takeaways</h1>
<p>This small request bubbled into a much larger discussion during one of our organization’s architecture working group meetings. Although folks were initially hesitant to change the type of these features, by the end of the meeting there was a broad consensus that it would be desirable to represent datetime features in our system as a primitive numeric type (unix timestamps with millisecond precision) to promote consistency between model training and inference.</p>
<p>Given the wide range of training contexts that all of our features are used in, we decided it was a good idea to promote consistency between training and inference by standardizing on primitive types more generally. Members of the Feature Systems team also expressed a desire to improve documentation around how features are transformed end-to-end throughout the current system to make things easier for customers in the future. </p>
<p>We designed our ML features with abstraction and interoperability in mind, as software engineers do. It’s not that ML isn’t a software engineering practice, but that it’s a domain in which the specific needs of <em>ML software</em> didn’t match our mental model of best practices for the system. Although ML has been around for some time, the rapidly-changing nature of the space means the nuances of many ML-specific guidelines are still ill-defined. I imagine this small cross-section of difficulty applying software practices to ML practices will be the first of many as ML continues its trajectory through software systems of all shapes and sizes.</p>]]></description>
                <content:encoded><![CDATA[<h1>Introduction</h1>
<p>The Feature Systems team at Etsy is responsible for the platform and services through which machine learning (ML) practitioners create, manage and consume feature data for their machine learning models. We recently made new real-time features available through our streaming feature platform, Rivulet, where we return things like “most recent add-to-carts.” While timeseries data itself wasn’t new to our system, these newer features from our streaming feature service would be the first timeseries inputs to be supplied to our ML models themselves to inform search, ads, and recommendations use cases.</p>
<p>Not too long after we made these features available to users for ML model training, we received a message from Harshal, an ML practitioner on Recommendations, warning us of "major problems" lying in wait.
<img alt="" src="https://i.etsystatic.com/inv/8b03b2/5028042330/inv_fullxfull.5028042330_7o47f3m8.jpg?version=0" /><em>Figure 1. A user message alerting us to the possibility of "major problems for downstream ML models" in our use of the timestamp datatype.</em></p>
<p>Harshal told us our choice to export real-time features using a timestamp datatype was going to cause problems in downstream models. </p>
<p>The training data that comes from our offline feature store uses the binary Avro file format, which has a logical type called <em>timestamp</em> we used to store these features, with an annotation specifying that they should be at millisecond precision. The problem, we were being informed, is that this Avro logical type would be interpreted differently in different frameworks. Pandas, NumPy, and Spark would read our timestamps, served with millisecond precision, as datetime objects with nanosecond precision, creating the possibility of training/serving skew. In order to prevent mismatches, and the risk they posed of silent failures in production, the recommendation was that we avoid the timestamp type entirely and serve our features as a more basic numeric data type, such as Longs.</p>
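<p>The size of the potential mismatch is easy to see with plain integer arithmetic. The sketch below does not call Avro, Pandas, or Spark. It simply compares the integer a model would receive for the same instant at millisecond and at nanosecond precision. The variable names and the sample instant are illustrative.</p>

```python
MILLIS_PER_SECOND = 1_000
NANOS_PER_SECOND = 1_000_000_000

epoch_seconds = 1_672_531_200  # 2023-01-01T00:00:00Z

# What a serving path emitting millisecond precision would pass to the model.
served_value = epoch_seconds * MILLIS_PER_SECOND
# What a training path would see if the reader widened the value to nanoseconds.
training_value = epoch_seconds * NANOS_PER_SECOND

# Same instant, but the raw integers differ by a factor of one million.
scale_gap = training_value // served_value
```

<p>A model trained on one scale and served the other would receive inputs far outside the range it learned from, without any error being raised.</p>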
<h1>Getting to the root of the issue</h1>
<p>We started the way software engineers usually do, attempting to break down the problem and get to root causes. Before changing data types, we wanted to understand if the misinterpretation of the precision of the timestamp was an issue with Python, Spark, or even a misuse of the Avro timestamp annotation that we were using to specify the millisecond precision. We were hesitant to alter the data type of the feature without an in-depth investigation. After all, timestamp and datetime objects are typically passed around between systems precisely to resolve inconsistencies and improve communication.</p>
<p>We started by attempting to put together a diagram of all the different ways that timestamp features were represented across our systems. The result was a diagram like this:
<img alt="" src="https://i.etsystatic.com/inv/99cd8e/5028043606/inv_fullxfull.5028043606_besvow2f.jpg?version=0" /><em>Figure 2. A diagram of all the objects/interpretations of timestamp features across our systems. Though the user only ever sees microseconds, between system domains we see a diversity of representations.</em></p>
<p>While it was clear Spark and other frameworks <em>weren’t</em> respecting the timestamp annotation that specified millisecond precision, we began to realize that that problem was actually a symptom of a larger issue for our ML practitioners. Timestamp features can take a number of different forms before finally being passed into a model. In itself this isn't really surprising. Every type is language-specific in one way or another - the diagram would look similar if we were going to be serializing integers in Scala and deserializing integers in Python. However, the overall disparity between objects is much greater for complex datetime objects than it is for basic data types. There is simply more room for interpretation with datetime objects, and less certainty about how they translate across system boundaries, and for our use case in training ML models uncertainty was exactly what we <em>did not</em> want. </p>
<p>As we dug deeper into the question, it started to become clear that we weren’t trying to resolve a specific bug or issue, but to reduce the amount of toil for ML practitioners who would be consuming timestamp features long-term. While the ISO 8601 format is <em>massively</em> helpful for sharing datetime and timestamp objects across different systems, it’s less helpful when all you’re looking for <em>is an integer representation at a specific precision</em>.</p>
<p>Since these timestamps were features of a machine learning model, we didn’t need all the complexity that datetime objects and timestamp types offered across systems. The use case for this information was to be fed as an integer of a specific precision into an ML model, and nothing more. Storing timestamps as logical types increased cognitive overhead for ML practitioners and introduced additional risk that training with the wrong precision could degrade model quality during inference.</p>
<h1>Takeaways</h1>
<p>This small request bubbled into a much larger discussion during one of our organization’s architecture working group meetings. Although folks were initially hesitant to change the type of these features, by the end of the meeting there was a broad consensus that it would be desirable to represent datetime features in our system as a primitive numeric type (unix timestamps with millisecond precision) to promote consistency between model training and inference.</p>
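<p>One way to picture the agreed representation is a small helper that turns whatever arrives into a Unix timestamp in integer milliseconds before it reaches a model. This is a sketch of the idea only. The function name, the unit labels, and the choice to reject naive datetimes are our assumptions, not part of the feature platform.</p>

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
UNITS_PER_MILLI = {"ms": 1, "us": 1_000, "ns": 1_000_000}


def to_epoch_millis(value, unit="ms"):
    """Return a Unix timestamp as an integer number of milliseconds.

    `value` is either a timezone-aware datetime or an integer count of
    `unit` since the epoch. Sub-millisecond detail is truncated.
    """
    if isinstance(value, datetime):
        if value.tzinfo is None:
            raise ValueError("naive datetime: the time zone is ambiguous")
        return (value - EPOCH) // timedelta(milliseconds=1)
    return int(value) // UNITS_PER_MILLI[unit]


from_datetime = to_epoch_millis(datetime(2023, 1, 1, tzinfo=timezone.utc))
from_nanos = to_epoch_millis(1_672_531_200_000_000_000, unit="ns")
```

<p>Both calls describe the same instant and produce the same integer, so the value seen in training matches the value seen at inference regardless of how the source system represented it.</p>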
<p>Given the wide range of training contexts that all of our features are used in, we decided it was a good idea to promote consistency between training and inference by standardizing on primitive types more generally. Members of the Feature Systems team also expressed a desire to improve documentation around how features are transformed end-to-end throughout the current system to make things easier for customers in the future. </p>
<p>We designed our ML features with abstraction and interoperability in mind, as software engineers do. It’s not that ML isn’t a software engineering practice, but that it’s a domain in which the specific needs of <em>ML software</em> didn’t match our mental model of best practices for the system. Although ML has been around for some time, the rapidly-changing nature of the space means the nuances of many ML-specific guidelines are still ill-defined. I imagine this small cross-section of difficulty applying software practices to ML practices will be the first of many as ML continues its trajectory through software systems of all shapes and sizes.</p>]]></content:encoded>
            </item>
        </channel></rss>