<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://engineering.linkedin.com"  xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>LinkedIn Engineering</title>
 <link>http://engineering.linkedin.com</link>
 <description></description>
 <language>en</language>
<item>
 <title>Introduction: Technical Paper on LinkedIn&#039;s A/B Testing Platform</title>
 <link>http://engineering.linkedin.com/ab-testing/introduction-technical-paper-linkedins-ab-testing-platform</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p dir=&quot;ltr&quot;&gt;
    &lt;a href=&quot;https://engineering.linkedin.com/ab-testing/evolution-ab-testing-platform-linkedin&quot;&gt;XLNT&lt;/a&gt; is the end-to-end
    A/B testing platform used at LinkedIn, built not only to serve day-to-day A/B testing needs across the company but also to support the sophisticated use cases that are prevalent in a social network setting. With all the &lt;a href=&quot;https://engineering.linkedin.com/ab-testing/why-experimentation-so-important-linkedin&quot;&gt;lessons learned&lt;/a&gt; from using the platform, we decided to
    take an in-depth look at how we approach A/B testing and write a technical paper. This post is based on our paper and shares how we built the platform,
    dealt with some challenging scenarios, and fostered a strong experimental culture.
&lt;/p&gt;
&lt;br /&gt;&lt;p dir=&quot;ltr&quot;&gt;
    &lt;font size=&quot;3&quot;&gt;&lt;strong&gt;The XLNT Platform&lt;/strong&gt;&lt;/font&gt;
&lt;br /&gt;
XLNT was designed to encompass each of the three steps of the testing process: design, deploy and analyze.
&lt;/p&gt;
&lt;ul&gt;&lt;li dir=&quot;ltr&quot;&gt;
        &lt;p dir=&quot;ltr&quot;&gt;
A highlight of our &lt;strong&gt;design&lt;/strong&gt; capability is flexible targeting. Not only does the platform provide 40+ built-in member attributes stored in &lt;a href=&quot;http://engineering.linkedin.com/tags/voldemort&quot;&gt;Voldemort&lt;/a&gt; for experimenters to leverage, it also lets external attributes be
            onboarded seamlessly and offers an integrated way to use real-time attributes that are available only in the runtime request.
        &lt;/p&gt;
    &lt;/li&gt;
    &lt;li dir=&quot;ltr&quot;&gt;
        &lt;p dir=&quot;ltr&quot;&gt;
            In the &lt;strong&gt;deploy&lt;/strong&gt; stage, we have a straightforward two-step process for implementing an experiment in the application layer, and we have enabled centralized
service configuration that is fully independent of application code releases, leveraging &lt;a href=&quot;http://engineering.linkedin.com/restli/linkedins-restli-moment&quot;&gt;Rest.li&lt;/a&gt; and &lt;a href=&quot;http://data.linkedin.com/blog/2012/10/driving-the-databus&quot;&gt;Databus&lt;/a&gt;.
        &lt;/p&gt;
    &lt;/li&gt;
    &lt;li dir=&quot;ltr&quot;&gt;
        &lt;p dir=&quot;ltr&quot;&gt;
            Finally, &lt;strong&gt;analyzing&lt;/strong&gt; experiments is fully automated, with the pipeline consuming more than 10TB of data and producing more than 150 million summary
            records stored in &lt;a href=&quot;https://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot&quot;&gt;Pinot&lt;/a&gt; on a daily basis. This is
a large-scale join-and-aggregate process enabled by the &lt;a href=&quot;https://engineering.linkedin.com/big-data/open-sourcing-cubert-high-performance-computation-engine-complex-big-data-analytics&quot;&gt;Cubert&lt;/a&gt;
framework, which consumes application logs ETLed to our HDFS clusters from &lt;a href=&quot;http://engineering.linkedin.com/kafka/running-kafka-scale&quot;&gt;Kafka&lt;/a&gt; topics, along with data for 1000+ engagement metrics preprocessed by an
            independent pipeline. A highlight of the analysis pipeline is its ability to enable multi-dimensional analysis in certain scenarios, letting
            experimenters dig deeper and get more actionable insights.
        &lt;/p&gt;
    &lt;/li&gt;
&lt;/ul&gt;&lt;p dir=&quot;ltr&quot;&gt;
    &lt;font size=&quot;3&quot;&gt;&lt;strong&gt;Beyond the Basics&lt;/strong&gt;&lt;/font&gt;
&lt;br /&gt;
We face several challenging A/B testing scenarios at LinkedIn, some of which are specific to experimentation on social networks.
&lt;/p&gt;
&lt;p dir=&quot;ltr&quot;&gt;
    In an organization running hundreds of experiments daily, interactions pose a serious threat to experiment trustworthiness. We use XLNT to address the
    three most common concerns and use cases related to interactions between experiments. While experiments are fully overlapping and orthogonal by default,
    there are simple ways to split traffic so that experiments run disjointly, to perform interaction analysis in a full factorial fashion before
    analyzing each factor separately, and to enable fractional factorial designs, where only certain combinations of the different factors are implemented and
    analyzed.
&lt;/p&gt;
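As an illustration of what "fully overlapping and orthogonal by default" can mean in practice, here is a minimal sketch of salted-hash assignment. This is a common construction for overlapped experiments, not XLNT's actual scheme, and all names are invented:

```python
import hashlib

def assign(member_id, experiment, variants=("control", "treatment")):
    """Hash the member id together with a per-experiment salt so that
    assignments across experiments are statistically independent."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same member is assigned independently in each experiment, and the
# assignment is deterministic: the member always sees the same variant.
a = assign("member-7", "exp-feed-ranking")
b = assign("member-7", "exp-new-button")
assert a in ("control", "treatment") and b in ("control", "treatment")
assert assign("member-7", "exp-feed-ranking") == a
```

Because each experiment uses its own salt, any two experiments split each other's traffic roughly 50/50, which is what makes overlapping experiments analyzable independently.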
&lt;p dir=&quot;ltr&quot;&gt;
    We have enabled testing on guests (based on browser IDs) as well as on other units. One challenge we have resolved is serving a unified experience to users
    switching between member and guest status, while ensuring we have measurements for both. An even more interesting problem arises when there are different
    experimental units within the same entity type, particularly in a social network setting, where the same user can play two different roles, each of which
    needs to be tested separately. In the paper, we highlight this problem with an example based on a “viewer/viewee” experiment and describe the
    bias-variance tradeoff involved.
&lt;/p&gt;
&lt;p dir=&quot;ltr&quot;&gt;
    Offline experiments are integrated into XLNT as well. The challenge here is to avoid selection bias when we run email experiments, experiments coupled with
    email campaigns, and cohort experiments. We can’t simply use active members as the population for email experiments, and we have to correct for (or avoid)
    bias if we want to analyze the effects of the experiments and email campaigns, either jointly or separately. When running cohort analysis, there is a subtlety
    in dynamically updating the cohort selection during the experiment when the selection criteria and the experiment outcome are not independent.
&lt;/p&gt;
&lt;p dir=&quot;ltr&quot;&gt;
    When it comes to network A/B testing, we can’t assume sample responses are independent of the treatment assignment of others. Our solution is based on a
    sampling and estimation framework: in the sampling stage, we partition users into clusters and randomize at the cluster level; in the estimation stage, we
    use more sophisticated estimators. The network A/B tests we have run at LinkedIn using this framework have indicated strong
    network effects.
&lt;/p&gt;
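The sampling stage can be sketched as follows. This toy code only illustrates cluster-level randomization under an assumed user-to-cluster partition; it is not our production implementation, and all names are invented:

```python
import random

def cluster_randomize(user_to_cluster, p_treatment=0.5, seed=42):
    """Assign treatment at the cluster level, so connected users (who
    share a cluster) receive the same variant and interference between
    arms is reduced."""
    rng = random.Random(seed)
    clusters = sorted(set(user_to_cluster.values()))
    cluster_arm = {c: ("treatment" if rng.random() < p_treatment else "control")
                   for c in clusters}
    return {u: cluster_arm[c] for u, c in user_to_cluster.items()}

# Toy partition of a social graph into communities.
users = {"alice": 0, "bob": 0, "carol": 1, "dave": 1, "erin": 2}
assignment = cluster_randomize(users)

# Everyone in the same cluster shares an arm.
assert assignment["alice"] == assignment["bob"]
assert assignment["carol"] == assignment["dave"]
```

Estimating the treatment effect from cluster-randomized data then requires estimators that account for the clustering, which is where the more sophisticated estimators mentioned above come in.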
&lt;br /&gt;&lt;p dir=&quot;ltr&quot;&gt;
    &lt;font size=&quot;3&quot;&gt;&lt;strong&gt;Fostering an Experimental Culture&lt;/strong&gt;&lt;/font&gt;
&lt;br /&gt;
There are several XLNT features and concepts we introduced at LinkedIn to enable us to take education and evangelization past the &quot;classroom&quot;.
&lt;/p&gt;
&lt;p dir=&quot;ltr&quot;&gt;
    We integrated experiment reports with business reporting by using a unified metric definition across the entire organization. This provides the foundation
    that enables other organizations such as Finance to bake A/B test results into business forecasting.
&lt;/p&gt;
&lt;p dir=&quot;ltr&quot;&gt;
    We also introduced site-wide impact, a concept that not only provides a directional signal, but also estimates the size of the global lift that will occur
    when the winning treatment is ramped to 100 percent. We designed this feature so that site-wide impact can be computed from readily available
    summary statistics without doubling our computation effort. A paradox is that for metrics like “CTR”, local and site-wide impact can disagree
    directionally.
&lt;/p&gt;
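To make the CTR paradox concrete, here is a toy example with invented numbers (a Simpson's-paradox-style effect): the treatment raises CTR on every page locally, but shifts impressions toward the lower-CTR page, so the site-wide CTR moves the other way:

```python
# (clicks, impressions) per page; all numbers are hypothetical.
control   = {"pageA": (90, 1000), "pageB": (10, 1000)}
treatment = {"pageA": (20,  200), "pageB": (36, 1800)}

def ctr(clicks, imps):
    return clicks / imps

def global_ctr(pages):
    clicks = sum(c for c, _ in pages.values())
    imps = sum(i for _, i in pages.values())
    return clicks / imps

# CTR improves locally on every page (9% -> 10% and 1% -> 2%)...
for page in control:
    assert ctr(*treatment[page]) > ctr(*control[page])

# ...yet the site-wide CTR drops (5% -> 2.8%), because the treatment
# shifted impressions toward the lower-CTR page.
assert global_ctr(treatment) < global_ctr(control)
```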
&lt;p dir=&quot;ltr&quot;&gt;
    In an effort to simplify multiple testing, we introduced a simple two-step rule of thumb for experimenters to follow that is mathematically equivalent to a
    Bayesian interpretation of people’s prior beliefs about whether a metric would be impacted.
&lt;/p&gt;
&lt;p dir=&quot;ltr&quot;&gt;
    To drive greater transparency regarding experiment launch decisions, we launched Most Impactful Experiments, a tool we built to bubble up notable impacts
    among all experiments for each product metric. We use a three-step algorithm to control the false discovery rate. A couple of key lessons we learned from
    building the feature are shared in the paper.
&lt;/p&gt;
&lt;br /&gt;&lt;p dir=&quot;ltr&quot;&gt;
    For a more in-depth look at LinkedIn’s A/B testing strategy and technology, read &lt;a href=&quot;http://dl.acm.org/citation.cfm?id=2788602&quot; target=&quot;_blank&quot;&gt;the entire paper&lt;/a&gt;.
&lt;/p&gt;
&lt;div&gt;
    &lt;br /&gt;&lt;/div&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/ab-testing&#039; rel=&#039;tag&#039;&gt;A/B Testing&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/xlnt&#039; rel=&#039;tag&#039;&gt;xlnt&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/kdd&#039; rel=&#039;tag&#039;&gt;KDD&lt;/a&gt;&lt;/span&gt;&lt;div class=&quot;field field-name-field-author field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Nanyu Chen&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-profile-url field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author&amp;#039;s LinkedIn Profile URL:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;https://www.linkedin.com/profile/view?id=AAIAAAMlYkUBIffOuX0NByDI_kWIONUeyd5A07g&amp;amp;trk=nav_responsive_tab_profile&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-since field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;LinkedIn Since:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;March 2014&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-avatar field-type-image field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Avatar:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/Nanyu_profile.jpg&quot; width=&quot;400&quot; height=&quot;400&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-title field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Title:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Sr Applied Research Engineer&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-content-for field-type-taxonomy-term-reference field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Content For:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/blog&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Blog&lt;/a&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
</description>
 <pubDate>Thu, 01 Oct 2015 07:00:00 +0000</pubDate>
 <dc:creator>Nanyu Chen</dc:creator>
 <guid isPermaLink="false">399 at http://engineering.linkedin.com</guid>
 <comments>http://engineering.linkedin.com/ab-testing/introduction-technical-paper-linkedins-ab-testing-platform#comments</comments>
</item>
<item>
 <title>Creating Community Around Open Source, Working With Legacy Code, Architecture Hoisting and More</title>
 <link>http://engineering.linkedin.com/publisher-platform/creating-community-around-open-source-working-legacy-code-architecture-hoisting</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;LinkedIn’s publishing platform gives professionals a way to share their personal opinions about topical professional news and interests, including our engineers. Here, we regularly round up some of the best pieces written recently by LinkedIn engineers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/building-communities-todd-palino&quot; target=&quot;_blank&quot;&gt;&quot;Building Communities&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/in/toddpalino&quot; target=&quot;_blank&quot;&gt;Todd Palino&lt;/a&gt;, Staff Site Reliability Engineer at LinkedIn&lt;/strong&gt;
&lt;br&gt;Todd discusses how creating an open and welcoming community is essential to the success and continued development of open source projects. He argues that accepting feedback from others, encouraging discussion, and treating even basic questions or trivial concerns as important contributions ultimately leads to better open source projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/crafting-insanity-working-legacy-code-brendan-drew&quot; target=&quot;_blank&quot;&gt;&quot;Crafting Insanity (or: Working With Legacy Code)&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/in/brendandrew&quot; target=&quot;_blank&quot;&gt;Brendan Drew&lt;/a&gt;, Staff Software Engineer at LinkedIn&lt;/strong&gt;
&lt;br&gt;Legacy code can drive any engineer crazy, but Brendan discusses how having a measured approach toward working with legacy code can make your job – and the job of everyone else who comes in contact with it later – much easier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/architecture-hoisting-david-max&quot; target=&quot;_blank&quot;&gt;&quot;Vigilance, Guide Rails, and Architecture Hoisting&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/in/davidpmax&quot; target=&quot;_blank&quot;&gt;David Max&lt;/a&gt;, Senior Software Engineer at LinkedIn&lt;/strong&gt;
&lt;br&gt;Details matter. Ask the folks at NASA, who lost the Mars Climate Orbiter 15 years ago because no one noticed that one team’s software used English units (pounds and inches) while another’s expected metric. The project’s failure is an example of a common conundrum in large-scale projects: how to reduce risk without imposing too many constraints. David looks at “architecture hoisting,” a development approach that builds risk mitigation into the core code of a project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/i-have-only-one-regret-should-worked-more-jens-pillgram-larsen&quot; target=&quot;_blank&quot;&gt;&quot;I Have Only One Regret: I Should Have Worked More.&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/in/jenspillgram?trk=pulse-det-athr_prof-art_hdr&quot;&gt;Jens Pillgram-Larsen&lt;/a&gt;, Senior Engineering Manager, Development Tools at LinkedIn&lt;/strong&gt;
&lt;br&gt;Work-life balance is a constant battle for nearly everyone. Jens discusses how hard work toward something you’re passionate about can actually be an incredibly enriching part of life, rather than its natural opposite.&lt;/p&gt;
&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/publisher-platform&#039; rel=&#039;tag&#039;&gt;publisher platform&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/content&#039; rel=&#039;tag&#039;&gt;content&lt;/a&gt;&lt;/span&gt;&lt;div class=&quot;field field-name-field-author field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Erran Berger&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-profile-url field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author&amp;#039;s LinkedIn Profile URL:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;https://www.linkedin.com/in/erranberger&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-since field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;LinkedIn Since:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;2009&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-avatar field-type-image field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Avatar:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/Erran%20Berger%20headshot_9.jpg&quot; width=&quot;176&quot; height=&quot;177&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-title field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Title:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Head of Engineering, Content Products&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-content-for field-type-taxonomy-term-reference field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Content For:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/blog&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Blog&lt;/a&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
</description>
 <pubDate>Wed, 30 Sep 2015 19:15:36 +0000</pubDate>
 <dc:creator>Erran Berger</dc:creator>
 <guid isPermaLink="false">400 at http://engineering.linkedin.com</guid>
 <comments>http://engineering.linkedin.com/publisher-platform/creating-community-around-open-source-working-legacy-code-architecture-hoisting#comments</comments>
</item>
<item>
 <title>Espresso Onboarding Experiences: InMail</title>
 <link>http://engineering.linkedin.com/espresso-migration-inmail/espresso-onboarding-experiences-inmail</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;
Fast growth is a happy problem to have, but not an easy one to solve. LinkedIn has experienced rapid member growth over the years, and many of our engineers have witnessed the corresponding explosive data growth in awe. Until recently, LinkedIn relied on a traditional RDBMS as the primary data store for most of our data. Hundreds of terabytes of data were organized into Oracle shards that were incrementally provisioned as member growth continued. Several problems surfaced, including:
&lt;/p&gt;
&lt;ol&gt;
&lt;li style=&quot;margin-bottom:5px;&quot;&gt;&lt;b&gt;Hot Shards&lt;/b&gt; &amp;mdash; Typical Oracle shards were created in the order in which members joined the LinkedIn service. Members who joined early on tended to accumulate more activity over time. This resulted in imbalanced traffic across shards, where some shards saw more traffic than others (i.e. hot shards).&lt;/li&gt;
&lt;li style=&quot;margin-bottom:5px;&quot;&gt;&lt;b&gt;Schema Evolution&lt;/b&gt; &amp;mdash; Data schemas need to evolve all the time to incorporate more information as business requirements change. In Oracle databases (or for any other traditional DBMS for that matter), this typically means a DBA running manual maintenance with &lt;code&gt;ALTER TABLE&lt;/code&gt; queries. This is an error prone and time consuming process as millions of rows are read-locked during the maintenance.&lt;/li&gt;
&lt;li style=&quot;margin-bottom:5px;&quot;&gt;&lt;b&gt;Provisioning&lt;/b&gt; &amp;mdash; Creating additional shards was not automatic. Provisioning a shard translated to manual DBA work, as well as configuration changes from the application team. Coordinating such efforts was often painful.&lt;/li&gt;
&lt;li style=&quot;margin-bottom:5px;&quot;&gt;&lt;b&gt;Cost&lt;/b&gt; &amp;mdash; Specialized hardware and annual software licensing costs were expensive.&lt;/li&gt;
&lt;/ol&gt;
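A toy model illustrates the hot-shard problem. The sharding function and activity distribution below are invented for illustration, not LinkedIn's actual data:

```python
def sequential_shard(member_id, members_per_shard=1000):
    """Shard assignment in join order: the earliest members fill shard 0."""
    return member_id // members_per_shard

def activity(member_id):
    """Hypothetical long-tail activity: earlier joiners have had more
    time to accumulate activity, so they are far more active."""
    return 10_000 // (member_id + 1)

# Accumulate load per shard for 10,000 members across 10 shards.
loads = {}
for member_id in range(10_000):
    shard = sequential_shard(member_id)
    loads[shard] = loads.get(shard, 0) + activity(member_id)

# The shard holding the earliest members runs dramatically hotter than
# the newest shard; hashing members across shards would instead spread
# the heavy members out.
assert loads[0] > 50 * loads[9]
```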
&lt;p&gt;
Fast forward to 2015: most of the major Oracle systems have been migrated over to Espresso, a horizontally scalable NoSQL database developed internally at LinkedIn. We have &lt;a href=&quot;https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store&quot;&gt;written extensively about the Espresso design&lt;/a&gt; in an earlier post. Espresso currently powers all of LinkedIn’s member profile data, InMail, and a subset of our homepage and mobile applications.
&lt;/p&gt;
&lt;p&gt;
How did we get to this point? The migration from the legacy Oracle implementation to Espresso is an interesting topic on its own. The effort was more than just transferring bytes from one database to another; it required a set of carefully designed features and workflows. We hope to turn this into a series of blog posts highlighting this journey. But let’s start with the largest dataset first — InMail.
&lt;/p&gt;

&lt;h2 style=&quot;font-size:1.5em;&quot;&gt;InMail&lt;/h2&gt;
&lt;p&gt;
&lt;a href=&quot;https://www.linkedin.com/static?key=about_inmail&quot;&gt;InMail&lt;/a&gt; is one of the core features that makes LinkedIn an engaging professional network service. It is a messaging service that connects 380M currently registered members, and its data is by far the largest dataset at LinkedIn. At a high level, each member is associated with one mailbox that contains messages received, sent, and archived. InMail is characterized by several access patterns that make it a unique use case.
&lt;/p&gt;

&lt;ul&gt;
&lt;li style=&quot;margin-bottom:5px;&quot;&gt;&lt;b&gt;Mailbox Search&lt;/b&gt; &amp;mdash; A member’s mailbox contains sent/received/archived messages. The InMail application needs to perform a full-text search over all previous messages.&lt;/li&gt;
&lt;li style=&quot;margin-bottom:5px;&quot;&gt;&lt;b&gt;Maintain Counter&lt;/b&gt; &amp;mdash; Each time a message is received, corresponding counters (e.g. number of messages unread) need to be updated. The update of the message and the counters need to be transactional.&lt;/li&gt;
&lt;li style=&quot;margin-bottom:5px;&quot;&gt;&lt;b&gt;Paginated Results&lt;/b&gt; &amp;mdash; 99% of the time, a member is interested in the N most recent messages. The application displays paginated messages and search results in reverse chronological order; very old messages are rarely accessed. This usage pattern in particular can be exploited for optimization.&lt;/li&gt;
&lt;li style=&quot;margin-bottom:5px;&quot;&gt;&lt;b&gt;Write Spikes&lt;/b&gt; &amp;mdash; Invitations sent out to a member’s connections can generate a large number of mailbox writes in a relatively short amount of time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 style=&quot;font-size:1.5em;&quot;&gt;Optimizations&lt;/h2&gt;
&lt;p&gt;
Espresso is a horizontally scalable document store with secondary index support. Even without introducing additional features, the Espresso design provides the necessary functionality for the InMail use case. However, constantly updating hundreds of millions of mailboxes while maintaining a search index does not come free. Several optimizations were implemented as a result:
&lt;/p&gt;
&lt;h3&gt;Time Partitioned Indexes&lt;/h3&gt;
&lt;p&gt;
For the InMail use case, Espresso internally maintains its secondary index using &lt;a href=&quot;https://lucene.apache.org/&quot;&gt;Lucene&lt;/a&gt;, whose segments are stored as MySQL rows. When the application sends a search query for a particular mailbox, Espresso needs to read all index segments stored in &lt;a href=&quot;https://www.mysql.com/&quot;&gt;MySQL&lt;/a&gt; and assemble them into a Lucene index. For mailboxes with a large number of messages, the cost of index assembly becomes increasingly expensive.
&lt;/p&gt;

&lt;div style=&quot;padding:15px;color:#26c;background-color:#fff;font-size:1.4em;;font-family:&#039;san francisco&#039;, &#039;helvetica&#039;,&#039;helvetica neue&#039;;font-weight:200;text-transform:uppercase;border-left:5px solid #26c;margin-bottom:10px;line-height:1.2em;font-style:italic;&quot;&gt;
Most of the members spend time accessing the most recent N messages.
&lt;/div&gt;

&lt;p&gt;
The problem is alleviated by carefully aligning the system with the user&#039;s data access pattern. Since most members spend their time accessing the most recent N messages, it makes sense to organize the indexes into a series of time buckets. Each bucket has a fixed size, and once the number of index segments in a given bucket exceeds a threshold, a new bucket is created. This localizes index access and updates to a relatively small bucket, and effectively speeds up mailbox searches and paginated results.
&lt;/p&gt;
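A minimal sketch of the bucketing idea, using an invented toy structure rather than Espresso's actual Lucene-over-MySQL implementation:

```python
BUCKET_SIZE = 4  # hypothetical fixed bucket capacity

class TimeBucketedIndex:
    """Index segments organized into time buckets: the newest bucket is
    last, and a query over recent messages touches only the newest
    buckets instead of assembling the whole index."""

    def __init__(self):
        self.buckets = [[]]

    def add_segment(self, segment):
        if len(self.buckets[-1]) >= BUCKET_SIZE:
            self.buckets.append([])  # roll over to a fresh bucket
        self.buckets[-1].append(segment)

    def search_recent(self, n_buckets=1):
        """Assemble only the newest n_buckets buckets."""
        segments = []
        for bucket in self.buckets[-n_buckets:]:
            segments.extend(bucket)
        return segments

idx = TimeBucketedIndex()
for i in range(10):
    idx.add_segment(f"seg{i}")

# Ten segments roll into three buckets of at most four segments each...
assert len(idx.buckets) == 3
# ...and a recent-messages query assembles only the newest bucket.
assert idx.search_recent() == ["seg8", "seg9"]
```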

&lt;div style=&quot;margin-left:10%;margin-right:10%&quot;&gt;
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/mailbox.jpeg&quot; style=&quot;width:100%&quot;&gt;
&lt;/div&gt;

&lt;h3&gt;Group Commits&lt;/h3&gt;
&lt;p&gt;
When member invitations are sent out in bulk, the InMail application may generate a large number of requests for a mailbox (sent folder) in a relatively short period of time. This may result in hundreds of concurrent requests trying to update the index for the same mailbox. An index update is preceded by acquiring a write lock for the target mailbox, meaning other concurrent requests for the same mailbox are blocked. In a high-throughput system, such lock contention typically leads to thread pool exhaustion, starving other requests.
&lt;/p&gt;

&lt;div style=&quot;padding:15px;color:#26c;background-color:#fff;font-size:1.4em;;font-family:&#039;san francisco&#039;, &#039;helvetica&#039;,&#039;helvetica neue&#039;;font-weight:200;text-transform:uppercase;border-left:5px solid #26c;margin-bottom:10px;line-height:1.2em;font-style:italic;&quot;&gt;
A bursty write pattern like this kept a few Espresso engineers up at night
&lt;/div&gt;

&lt;p&gt;
After a series of design proposals, we introduced a feature called Group Commit. When a storage node observes a high number of concurrent index writes waiting for the same lock, they are grouped together and executed by a single thread. The group commit strategy significantly increases throughput, since the construction of the index — the initial read of multiple index segment rows to assemble an index — is now performed once per group rather than once per index update request. Use of a single thread also prevents excessive lock contention and starvation. The tradeoff, of course, is increased latency, since individual requests now depend on all requests in the group finishing execution. The increased latency can be compensated for with a client timeout adjustment, and the benefits of group commit far exceed this minor cost.
&lt;/p&gt;
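The batching effect of group commit can be sketched as follows. This toy code (all names invented) shows only the core idea, not Espresso's actual locking and threading:

```python
import threading
from collections import defaultdict, deque

class GroupCommitter:
    """Queue concurrent updates per mailbox; a single drain applies the
    whole group, so the index is assembled once per group rather than
    once per request."""

    def __init__(self):
        self.pending = defaultdict(deque)
        self.lock = threading.Lock()
        self.assemblies = 0  # how many times an index was (re)built

    def submit(self, mailbox, update):
        with self.lock:
            self.pending[mailbox].append(update)

    def drain(self, mailbox):
        """A single thread applies every queued update for the mailbox."""
        with self.lock:
            group = list(self.pending.pop(mailbox, []))
        if group:
            self.assemblies += 1  # one index assembly for the whole group
        return group

committer = GroupCommitter()
for i in range(100):
    committer.submit("mailbox-42", f"msg-{i}")

applied = committer.drain("mailbox-42")
assert len(applied) == 100    # all 100 updates applied...
assert committer.assemblies == 1  # ...with a single index assembly
```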

&lt;h3&gt;Materialized Aggregate&lt;/h3&gt;
&lt;p&gt;
InMail maintains several counters per mailbox. For example, when a new message is written to a mailbox, the number of unread messages is incremented by one. The unread message counter for a mailbox is decremented as one of the unread messages changes its status &lt;code&gt;(isUnread == false)&lt;/code&gt;.
&lt;/p&gt;
&lt;p&gt;
It’s very tempting to think that a transactional write can satisfy this feature: the application could couple each message insert with a counter increment and make a &lt;a href=&quot;https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store&quot;&gt;transactional MULTI-PUT&lt;/a&gt; Espresso request. Even though the atomicity of the two writes would be guaranteed in a single cluster, this solution does not work for multi-data-center deployments. It is entirely possible that an update originating from a remote data center overwrites the local counter update. Depending on the order of events, this can produce counter drift. &lt;/p&gt;

&lt;div style=&quot;padding:15px;color:#26c;background-color:#fff;font-size:1.4em;;font-family:&#039;san francisco&#039;, &#039;helvetica&#039;,&#039;helvetica neue&#039;;font-weight:200;text-transform:uppercase;border-left:5px solid #26c;margin-bottom:10px;line-height:1.2em;font-style:italic;&quot;&gt;
The workaround is to push the counter computation down to the storage level.
&lt;/div&gt;

&lt;p&gt;
We have defined a declarative mechanism to register trigger-like predicates and (albeit limited) aggregate functions when defining a document schema. As updates are performed in each storage node, the aggregates are transactionally recomputed according to the conditions defined by the predicates. The aggregate results (i.e. counters) themselves are not replicated across data centers, since the storage nodes in each data center can simply recompute their own mailbox counters with respect to all local and remote updates. This feature enables Espresso to maintain precise counters across multiple data centers.
&lt;/p&gt;
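As a toy illustration of storage-level recomputation (field names here are invented; the real declarative schema is shown in the gists below), each node derives the counter from the rows it has applied, local and replicated alike, so the counters converge without being replicated themselves:

```python
def recompute_unread(mailbox_rows):
    """COUNT()-style aggregate over the predicate isUnread == True,
    evaluated against all locally applied rows."""
    return sum(1 for row in mailbox_rows if row["isUnread"])

# Rows as seen by one data center after applying local and remote updates.
rows = [
    {"msgId": 1, "isUnread": True},
    {"msgId": 2, "isUnread": False},  # marked read locally
    {"msgId": 3, "isUnread": True},   # arrived via cross-DC replication
]

# The counter is derived, not replicated, so it cannot drift as long as
# every node eventually applies the same set of updates.
assert recompute_unread(rows) == 2
```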

&lt;p&gt;
The following declaration maintains the number of unread messages through a &lt;code&gt;COUNT()&lt;/code&gt; aggregate. For each update, the number of rows that meet the predicate condition &lt;code&gt;(isUnread == true)&lt;/code&gt; is computed and written to the &lt;code&gt;unreadCount&lt;/code&gt; field.
&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/anonymous/e13ce0e58942b355666c.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;
The &lt;code&gt;SUM()&lt;/code&gt; is another aggregate that is available in Espresso. For example, it can be used to sum over the total amount of bytes for all messages in a mailbox. 
&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/anonymous/8f5d59da46f28339e76b.js&quot;&gt;&lt;/script&gt;

&lt;h3&gt;Personal Data Routing&lt;/h3&gt;
&lt;p&gt;
From a technical standpoint, deploying a petabyte-scale cluster is not inherently different from deploying a smaller cluster. However, a deployment of this size needs careful consideration along another dimension &amp;mdash; &lt;a href=&quot;https://en.wikipedia.org/wiki/Capital_expenditure&quot;&gt;CAPEX&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
A typical Espresso cluster is deployed to three data centers, forming a data-everywhere, active/active topology. A write can originate in any data center, and convergence is reached through cross-data-center replication. Espresso storage nodes within a data center also have a replication factor of 3, meaning each partition typically has 1 master and 2 slaves. With cost in the equation, the Espresso team needed to answer a hard question &amp;mdash; ‘How much redundancy do we really need?’
&lt;/p&gt;

&lt;div style=&quot;padding:15px;color:#26c;background-color:#fff;font-size:1.4em;;font-family:&#039;san francisco&#039;, &#039;helvetica&#039;,&#039;helvetica neue&#039;;font-weight:200;text-transform:uppercase;border-left:5px solid #26c;margin-bottom:10px;line-height:1.2em;font-style:italic;&quot;&gt;
How much redundancy do we really need?
&lt;/div&gt;

&lt;p&gt;
We found that the answer to the cost question hinges on the following observations.
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Two data centers are sufficient for disaster recovery (well, most localized disasters).&lt;/li&gt;
&lt;li&gt;The CAPEX of adding a data-center copy at the petabyte level is very high. Conversely, the savings from keeping the number of copies at two are difficult to ignore.&lt;/li&gt;
&lt;li&gt;Some datasets are more personal than others. A mailbox is only accessed by its owner. As long as a member&#039;s traffic is consistently routed to the same data center as their mailbox, we can afford to reduce the geographic distribution of this dataset.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;
For this reason, a strategic decision was made to limit copies of each member&#039;s mailbox to two data centers, as opposed to &lt;i&gt;data-everywhere&lt;/i&gt;.
&lt;/p&gt;

&lt;div style=&quot;margin-left:5%;margin-right:5%;&quot;&gt;
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/logical.jpeg&quot; style=&quot;width:100%;&quot;&gt;
&lt;/div&gt;
&lt;p&gt;
Without going into too much detail, this simply means that the data for a given mailbox will not be found in all data centers.
&lt;/p&gt;

&lt;div style=&quot;padding:15px;color:#26c;background-color:#fff;font-size:1.4em;;font-family:&#039;san francisco&#039;, &#039;helvetica&#039;,&#039;helvetica neue&#039;;font-weight:200;text-transform:uppercase;border-left:5px solid #26c;margin-bottom:10px;line-height:1.2em;font-style:italic;&quot;&gt;
A logical data store backed by at most two data centers.
&lt;/div&gt;

&lt;p&gt;
We built a routing layer called Personal Data Routing (PDR) for this purpose. Regardless of which data center a request originates from, the service layer can look up a special routing table and forward the request to a logical data store backed by at most two physical data centers. With PDR, we are able to control the degree of redundancy at the data-center level. InMail mailboxes are currently divided into two logical stores &amp;mdash; USE and USW &amp;mdash; which has yielded a significant reduction in footprint (and therefore cost).
&lt;/p&gt;
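&lt;p&gt;
The lookup can be sketched in a few lines of Python. The store names USE and USW come from the post; the member IDs, data-center names, and table contents below are purely illustrative, and in production the routing table lives in a service rather than a dict.
&lt;/p&gt;

```python
# Hypothetical sketch of Personal Data Routing (PDR): a routing table
# maps each member's mailbox to a logical store, and each logical
# store is backed by at most two physical data centers.

LOGICAL_STORES = {
    "USE": ["dc-east-1", "dc-east-2"],   # hypothetical data center names
    "USW": ["dc-west-1", "dc-west-2"],
}

# Illustrative routing table: member id -> logical store.
routing_table = {1001: "USE", 1002: "USW"}

def route(member_id, origin_dc):
    """Return the data centers serving this member's mailbox,
    regardless of which data center the request originated in."""
    store = routing_table[member_id]
    return LOGICAL_STORES[store]

# A request from any origin lands on the same two data centers.
assert route(1001, "dc-west-1") == ["dc-east-1", "dc-east-2"]
assert route(1001, "dc-east-1") == ["dc-east-1", "dc-east-2"]
```

The key property is that the origin of the request never affects the answer, so member traffic converges on the member's two home data centers.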

&lt;h2 style=&quot;font-size:1.5em;&quot;&gt;Migration&lt;/h2&gt;
&lt;h3&gt;Basic Idea&lt;/h3&gt;
&lt;p&gt;
When the InMail optimizations were tested and ready to go, it was time for the big move. The basic idea of moving data is not too different from moving boxes when you move to a new house &amp;mdash; we pack, ship, load, clean up, and reorganize.
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;b&gt;Pack Boxes&lt;/b&gt; &amp;mdash; ETL the data from Oracle to Hadoop, and transform the result into Espresso-partitioned data.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Ship &amp; Load&lt;/b&gt; &amp;mdash; Copy and load the partitioned Espresso data to all storage node replicas without taking any down time.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Cleanup &amp; Reorganize&lt;/b&gt; &amp;mdash; Remove import-related files, then replay the remaining events from Oracle (often called the delta) that have accumulated since the ETL was generated.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;Coordination and Workflow&lt;/h3&gt;
&lt;p&gt;
As the migration was about to start, there was a sense of urgency shared across teams, as the operational complexity, licensing cost, and hardware footprint of maintaining the Oracle instances were taking a toll each day. However, a timely migration was not simply a matter of ‘fast execution’; it required coordinated effort between multiple teams. It took careful planning from the Oracle DBAs, InMail engineers, and Espresso engineers to come up with an optimal schedule. Eventually, the teams agreed to proceed with the following schedule:
&lt;/p&gt;
&lt;div style=&quot;margin-left:5%;margin-right:5%;&quot;&gt;
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/workflow.jpeg&quot; style=&quot;width:100%;&quot;&gt;
&lt;/div&gt;
&lt;p&gt;
Although the pipeline looks relatively simple, there were subtle factors to consider in terms of sizing individual batches:
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The amount of time spent generating and loading data is directly proportional to the amount of delta catchup required afterwards (that is, more data piles up over time).&lt;/li&gt;
&lt;li&gt;The pipeline is bound by the largest batch in the workflow.&lt;/li&gt;
&lt;li&gt;The pipeline runs optimally if the batches are relatively equal in size.&lt;/li&gt;
&lt;/ol&gt;
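&lt;p&gt;
These observations can be illustrated with a toy pipeline model (the numbers are hypothetical, not our actual batch sizes). Each batch flows through three stages &amp;mdash; ETL, load, catchup &amp;mdash; and each stage handles one batch at a time, with stage time proportional to batch size:
&lt;/p&gt;

```python
# Toy model of a pipelined workflow: each batch passes through the
# stages ETL -> load -> catchup, each stage processes one batch at a
# time, and stage time is proportional to batch size (arbitrary units).

def makespan(batch_sizes, num_stages=3):
    """Finish time of the last batch leaving the last stage."""
    finish = [0] * num_stages              # when each stage becomes free
    for size in batch_sizes:
        prev_stage_done = 0                # this batch's finish upstream
        for s in range(num_stages):
            start = max(prev_stage_done, finish[s])
            finish[s] = start + size       # stage time ~ batch size
            prev_stage_done = finish[s]
    return finish[-1]

# Same total work (20 units per stage) in both cases, but one
# oversized batch stalls every stage behind it.
print(makespan([5, 5, 5, 5]))   # 30: balanced batches
print(makespan([2, 14, 2, 2]))  # 48: bound by the largest batch
```

With equal batches every stage stays busy once the pipeline fills; a single oversized batch idles the stages around it, which is why we aimed for roughly equal batches.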
&lt;p&gt;
We ended up choosing about five Oracle shards per batch, equivalent to about 25 million member mailboxes. At the end of each batch, the stakeholders gathered to checkpoint and then moved on to the next. When the pipelined workflow was at full throttle, every team was working fully in parallel.
&lt;/p&gt;

&lt;h3&gt;ETL from Oracle to Hadoop&lt;/h3&gt;
&lt;p&gt;The process of generating the ETL from the Oracle databases was owned by the DBA team. The DBAs took one shard from each of several Oracle instances so that the ETL could be generated in parallel. Since shard sizes differed to some degree, there was some mixing and matching of shards so that the total size would stay relatively constant between batches. The Oracle dump was written to HDFS in a Hadoop cluster for additional transformation.&lt;/p&gt;
&lt;p&gt;The InMail engineering team then ran a &lt;a href=&quot;https://pig.apache.org/&quot;&gt;Pig&lt;/a&gt; script that transformed the Oracle dump into Espresso-ready data. Specifically:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The monolithic Oracle dump was transformed into partitioned data (1,024 partitions for InMail), using the same partitioning hash function as Espresso.&lt;/li&gt;
&lt;li&gt;In addition to the baseline data, secondary-index segments were also generated, so that index lookups could be enabled immediately after the import.&lt;/li&gt;
&lt;li&gt;Each partition was converted into a tab-delimited data file, ready to be loaded into a MySQL server with the &lt;code&gt;LOAD DATA INFILE&lt;/code&gt; syntax. Binary portions of the data were hex-encoded.&lt;/li&gt;
&lt;/ul&gt;
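&lt;p&gt;
The transformation steps above can be sketched in Python. The modulo hash and the row layout below are illustrative stand-ins, not Espresso&#039;s actual partitioning function or the InMail schema:
&lt;/p&gt;

```python
# Sketch of the dump-to-Espresso transformation: route each row to one
# of 1024 partitions by hashing the mailbox key, hex-encode binary
# fields, and emit tab-delimited lines for LOAD DATA INFILE.
# The hash function and row layout are illustrative stand-ins.

NUM_PARTITIONS = 1024

def partition_for(mailbox_id):
    """Stand-in for Espresso's partitioning hash function."""
    return mailbox_id % NUM_PARTITIONS

def to_load_line(mailbox_id, subject, body_bytes):
    """One tab-delimited line; binary data is hex-encoded."""
    return "\t".join([str(mailbox_id), subject, body_bytes.hex()])

partitions = {}
for mid, subj, body in [(7, "hello", b"\x00\x01"), (1031, "re: hello", b"\xff")]:
    partitions.setdefault(partition_for(mid), []).append(
        to_load_line(mid, subj, body))

# 7 and 1031 hash to the same partition (1031 % 1024 == 7), so both
# mailboxes land in the same partition file.
assert partition_for(7) == partition_for(1031)
```

Because the same hash function is used on both sides, a partition file produced on Hadoop lands on exactly the storage nodes that own that Espresso partition.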

&lt;h3&gt;Copying and Bulk Loading Data&lt;/h3&gt;
&lt;p&gt;At the time of the migration, the HDFS deployment in use was not able to sustain a high read throughput. A few trial runs also suggested that increasing concurrency further degraded HDFS read performance. The Espresso cluster was not co-located with the Hadoop cluster, which was another limitation. We therefore looked for a solution that kept reads from HDFS to a minimum.&lt;/p&gt;
&lt;p&gt;The target Espresso cluster had a replication factor of 3. Instead of all replicas performing the reads, we limited direct interaction with HDFS to a single replica per group. After that replica pulled the necessary partitions to its local file system, we let the other two replicas copy from it using compression and network pipes (e.g. &lt;code&gt;tar | gzip | netcat&lt;/code&gt;). This resulted in a hierarchical distribution of the input data. The copies between replicas were dramatically faster than the HDFS reads, since they were done within the same network, and the slow HDFS performance was overcome with this workaround.&lt;/p&gt;

&lt;div style=&quot;margin-left:5%;margin-right:5%;&quot;&gt;
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/BulkCopy.003.jpg&quot; style=&quot;width:100%;&quot;&gt;
&lt;/div&gt;
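&lt;p&gt;
A sketch of the peer-to-peer copy step follows. The Python below only constructs the shell pipelines in question; the host names and port are hypothetical, and netcat flags vary by version:
&lt;/p&gt;

```python
# Sketch of the hierarchical distribution step: one replica per group
# reads from HDFS, then its peers copy from it over the local network
# through a compressed pipe. Host names and port are hypothetical.

def sender_cmd(data_dir, peer_host, port=9000):
    """Pipeline run on the replica that already holds the files."""
    return f"tar -cf - {data_dir} | gzip | nc {peer_host} {port}"

def receiver_cmd(port=9000):
    """Pipeline run on each peer replica to receive and unpack."""
    return f"nc -l {port} | gzip -d | tar -xf -"

print(sender_cmd("/espresso/partition_0007", "replica-b"))
print(receiver_cmd())
```

Streaming the archive through the pipe avoids materializing a compressed copy on disk, which matters when the partitions being shipped are large.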

&lt;p&gt;
Once each storage node had obtained the partitioned data it required, the actual load was done with a simple call to MySQL’s &lt;code&gt;LOAD DATA INFILE&lt;/code&gt; statement. The storage nodes were already taking write traffic for the partitions that had completed the migration, so we wanted to satisfy two requirements while the bulk load took place.
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;b&gt;No downtime.&lt;/b&gt; No read/write impact on the partitions that had already completed the migration.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;At least two replicas in service.&lt;/b&gt; This guaranteed that mastership handoff could still take place.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
Since the bulk-load process was I/O intensive and would interfere with service quality, we decided to take each instance offline to perform the bulk load. With Helix &amp;mdash; the distributed-system coordination service that Espresso uses &amp;mdash; we were able to programmatically disable one storage node after another for the bulk-loading maintenance. If the node undergoing maintenance happened to be a master, one of the slave replicas was automatically promoted to master without service interruption. &lt;/p&gt;

&lt;div style=&quot;margin-left:5%;margin-right:5%;&quot;&gt;
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/BulkCopy.001.jpg&quot; style=&quot;width:100%&quot;&gt;
&lt;/div&gt;

&lt;p&gt;
After some trial runs, we were fully confident that each replica could run the bulk import with high throughput. There were 12 replica groups (slices) in the InMail cluster. We repeatedly took one node out of each slice for maintenance, effectively loading 12 nodes at a time.
&lt;/p&gt;
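&lt;p&gt;
The rolling schedule can be sketched as follows. Node names are illustrative, and in production Helix drives the disable/enable transitions rather than a loop like this:
&lt;/p&gt;

```python
# Sketch of the rolling bulk-load schedule: nodes are grouped into
# slices (replica groups), and one node per slice is taken offline in
# each round, so every slice keeps two of its three replicas serving.
# Node names are illustrative; Helix drives this in production.

NUM_SLICES = 12
NODES_PER_SLICE = 3  # replication factor

slices = [[f"node-{s}-{r}" for r in range(NODES_PER_SLICE)]
          for s in range(NUM_SLICES)]

def maintenance_rounds(slices):
    """Yield one node per slice per round (12 loading at a time)."""
    for r in range(len(slices[0])):
        yield [slice_nodes[r] for slice_nodes in slices]

for round_nodes in maintenance_rounds(slices):
    # Exactly one node per slice is offline, so each slice still has
    # two replicas in service and mastership handoff remains possible.
    assert len(round_nodes) == NUM_SLICES
```

Three rounds of 12 nodes cover the whole cluster while honoring both requirements listed above.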

&lt;div style=&quot;padding:15px;color:#26c;background-color:#fff;font-size:1.4em;;font-family:&#039;san francisco&#039;, &#039;helvetica&#039;,&#039;helvetica neue&#039;;font-weight:200;text-transform:uppercase;border-left:5px solid #26c;margin-bottom:10px;line-height:1.2em;font-style:italic;&quot;&gt;
At all times, we maintained full availability with a redundancy factor of at least 2.
&lt;/div&gt;

&lt;h3&gt;Replaying the Delta Writes&lt;/h3&gt;
&lt;p&gt;
After the bulk load, the InMail engineers went back to the Oracle shards to collect the new writes (the delta) that had accumulated since the last ETL point. The volume of delta writes was relatively low, so the catch-up phase simply involved replaying the events to the Espresso router. This was done through a small dedicated cluster designed to take the Oracle delta writes and replay them as Espresso requests. Once the catch-up was fully complete, the InMail team flipped the switch that made Espresso the source of truth for the newly migrated shards. For safety, dual writes to Oracle and Espresso were maintained throughout the migration process.
&lt;/p&gt;

&lt;h2 style=&quot;font-size:1.5em;&quot;&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;
The migration was swift. We migrated more than 200 million mailboxes (at the time of migration) in less than three months while maintaining full availability. The InMail Oracle instances are now fully decommissioned, eliminating complexity, operability problems, and cost. &lt;/p&gt;

&lt;div style=&quot;padding:15px;color:#26c;background-color:#fff;font-size:1.4em;;font-family:&#039;san francisco&#039;, &#039;helvetica&#039;,&#039;helvetica neue&#039;;font-weight:200;text-transform:uppercase;border-left:5px solid #26c;margin-bottom:10px;line-height:1.2em;font-style:italic;&quot;&gt;
The migration of InMail was a valuable learning experience for many stakeholders.
&lt;/div&gt;

&lt;p&gt;
The Espresso team learned what it takes to serve the largest dataset at LinkedIn, and was able to introduce creative optimizations as a result. The experience also showed that Espresso is ready to take on big challenges. Multiple teams collaborated as one unit, demonstrating how a culture that values teamwork can have real business impact.
&lt;/p&gt;

&lt;p&gt;
The InMail migration was a team effort. Many thanks to the Oracle DBA team and the InMail team (aka the COMM team) for great execution. Numerous engineers on the Espresso team spent sleepless nights coming up with the optimizations and working on the migration. We would also like to thank &lt;a href=&quot;https://www.linkedin.com/in/alex&quot;&gt;Alex Vauthey&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/arnoldgreg&quot;&gt;Greg Arnold&lt;/a&gt; for their leadership, and &lt;a href=&quot;https://www.linkedin.com/in/mammadz&quot;&gt;Mammad Zadeh&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/dimitrovivo&quot;&gt;Ivo Dimitrov&lt;/a&gt; for their clear vision and guidance.
&lt;/p&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/espresso-migration-inmail&#039; rel=&#039;tag&#039;&gt;Espresso Migration InMail&lt;/a&gt;&lt;/span&gt;&lt;div class=&quot;field field-name-field-embedded-images field-type-image field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/mailbox.jpeg&quot; width=&quot;791&quot; height=&quot;621&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/logical.jpeg&quot; width=&quot;1079&quot; height=&quot;705&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/workflow.jpeg&quot; width=&quot;999&quot; height=&quot;493&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/BulkCopy.003.jpg&quot; width=&quot;876&quot; height=&quot;551&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/BulkCopy.001.jpg&quot; width=&quot;941&quot; height=&quot;377&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-author field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Eun-Gyu Kim&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-profile-url field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author&amp;#039;s LinkedIn Profile URL:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;https://www.linkedin.com/pub/eun-gyu-kim/6/b54/5b2&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-since field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;LinkedIn Since:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Jan 2013&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-avatar field-type-image field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Avatar:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/0242959_0.jpg&quot; width=&quot;200&quot; height=&quot;200&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-title field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Title:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Staff Software Engineer&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-content-for field-type-taxonomy-term-reference field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Content For:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/blog&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Blog&lt;/a&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
</description>
 <pubDate>Tue, 29 Sep 2015 07:00:00 +0000</pubDate>
 <dc:creator>Eun-Gyu Kim</dc:creator>
 <guid isPermaLink="false">398 at http://engineering.linkedin.com</guid>
 <comments>http://engineering.linkedin.com/espresso-migration-inmail/espresso-onboarding-experiences-inmail#comments</comments>
</item>
<item>
 <title>Bridging Batch and Streaming Data Ingestion with Gobblin</title>
 <link>http://engineering.linkedin.com/big-data/bridging-batch-and-streaming-data-ingestion-gobblin</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;style&gt;
  .centered {text-align: center;}
  .captioned {text-align: center; display: block; font-style: italic; font-size: 11px;}
&lt;/style&gt;

&lt;h1&gt;Genesis&lt;/h1&gt;&lt;br&gt;

&lt;p&gt;
Less than a year ago, we &lt;a href=&quot;https://engineering.linkedin.com/data-ingestion/gobblin-big-data-ease&quot;&gt;introduced Gobblin&lt;/a&gt;, a unified ingestion framework, to the world of Big Data. Since then, we’ve shared ongoing progress through a &lt;a href=&quot;http://www.slideshare.net/ShirshankaDas/linkedin-49299589&quot;&gt;talk&lt;/a&gt; at Hadoop Summit and a &lt;a href=&quot;http://www.vldb.org/pvldb/vol8/p1764-qiao.pdf&quot;&gt;paper&lt;/a&gt; at VLDB. Today, we’re announcing the open source release of &lt;a href=&quot;https://github.com/linkedin/gobblin/tree/gobblin_0.5.0&quot;&gt;Gobblin 0.5.0&lt;/a&gt;, a big milestone that includes &lt;a href=&quot;http://kafka.apache.org/&quot;&gt;Apache Kafka&lt;/a&gt; integration.
&lt;/p&gt;
&lt;p&gt;
Our motivations for building Gobblin stemmed from our operational challenges in building and maintaining disparate pipelines for different data sources across batch and streaming ecosystems. At one point, we were running more than 15 different kinds of pipelines, each with their own idiosyncrasies around error modes, data quality capabilities, scaling, and performance characteristics. Our guiding vision for Gobblin has been to build a framework that can support data movement across streaming and batch sources and sinks without requiring a specific persistence or storage technology. 
&lt;/p&gt;

&lt;p style=&quot;text-align:center&quot;&gt;
&lt;img style=&quot;border: 0px none; width: 500px; height: auto;&quot; alt=&quot;Gobblin ingest ecosystem&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/gobblin-ingest-ecosystem.png&quot;&gt;
&lt;/p&gt;

&lt;p&gt;
Our first target sink was Hadoop’s ubiquitous HDFS storage system and that has been our focus for most of last year. All of LinkedIn’s data (hundreds of terabytes per day) needs to get aggregated into Hadoop before being combined in interesting ways to build insightful &lt;a href=&quot;https://www.linkedin.com/people/pymk&quot;&gt;data products&lt;/a&gt;, surface meaningful business insights for executive and analyst reporting, and provide &lt;a href=&quot;https://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin&quot;&gt;experimentation-focused analysis&lt;/a&gt;. At LinkedIn, Gobblin is currently integrated with more than a dozen data sources including Salesforce, Google Analytics, Amazon S3, Oracle, &lt;a href=&quot;http://www.slideshare.net/amywtang/li-espresso-sigmodtalk&quot;&gt;LinkedIn Espresso&lt;/a&gt;, MySQL, SQL Server, SFTP, Apache Kafka, patent and publication sources, &lt;a href=&quot;https://commoncrawl.org/&quot;&gt;CommonCrawl&lt;/a&gt;, etc.
&lt;/p&gt;

&lt;h1&gt;Open Source&lt;/h1&gt;&lt;br&gt;

&lt;p&gt;
We open-sourced Gobblin earlier this year and we’re excited by the amount of engagement and activity on &lt;a href=&quot;https://github.com/linkedin/gobblin&quot;&gt;GitHub&lt;/a&gt; as well as our &lt;a href=&quot;https://groups.google.com/d/forum/gobblin-users&quot;&gt;discussion group&lt;/a&gt; since the very early days. In the past few months, contributors from different companies and continents have committed important &lt;a href=&quot;https://github.com/linkedin/gobblin/pull/46&quot;&gt;bug fixes&lt;/a&gt;. Additionally, the community has contributed important features such as a &lt;a href=&quot;https://github.com/linkedin/gobblin/pull/122&quot;&gt;byte-oriented Kafka extractor&lt;/a&gt; and &lt;a href=&quot;https://github.com/linkedin/gobblin/pull/57&quot;&gt;S3 integration&lt;/a&gt;. The 0.5.0 release has two big features: a) production-grade integration with Kafka as a data source and b) &lt;a href=&quot;https://github.com/linkedin/gobblin/wiki/Gobblin%20Metrics%20Architecture&quot;&gt;support&lt;/a&gt; for operational monitoring and metadata integration. 
&lt;/p&gt;

&lt;h1&gt;Bye Bye Camus, Hello Gobblin&lt;/h1&gt;&lt;br&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/linkedin/camus&quot;&gt;Camus&lt;/a&gt; was built by LinkedIn specifically to get Kafka data into Hadoop. However, over the years it has accumulated a fair bit of technical debt which would have taken us quite a bit of work to unwind and would be duplicative of work that we’re already doing in Gobblin. Most of the issues were related to operability, data integrity and flexibility to take advantage of different execution frameworks.&lt;/p&gt;

&lt;p style=&quot;text-align:center&quot;&gt;
&lt;img style=&quot;border: 0px none; width: 500px; height: auto;&quot; alt=&quot;Gobblin Kafka operator pipeline&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/gobblin-kafka-pipeline.png&quot;&gt;
&lt;/p&gt;

&lt;p&gt;
In the past few months, we’ve integrated Kafka as a supported data source for Gobblin. The figure above illustrates the Gobblin operator pipeline for Kafka ingestion. Compared to Camus, this gives us better support for robust hourly compaction, simpler configuration, and overall uniformity in debugging and analyzing ingestion performance and failures across all source types.
&lt;/p&gt;

&lt;p&gt;
At LinkedIn, Gobblin is currently ingesting about a thousand Kafka topics that stream an aggregate of hundreds of terabytes per day. Over the next quarter, we plan to migrate all Camus flows into Gobblin. The current execution framework that we’re running in production is based on MapReduce but this lays the foundation for us to move to different frameworks in the near future. 
&lt;/p&gt;

&lt;h1&gt;What’s Next: The Path to Continuous Ingestion&lt;/h1&gt;&lt;br&gt;
&lt;p&gt;
One of the biggest challenges with building a single ingestion framework for both batch and streaming is dealing with impedance mismatches between the source, the sink and the execution environment.
&lt;/p&gt;

&lt;p&gt;
As described earlier, we currently run MapReduce-based batch ingestion jobs in production every 10 minutes on Hadoop to pull data from Kafka into HDFS, publishing the data every hour. This has served us well because these batches are simple, idempotent, retriable units of work. However, there is an interesting efficiency cost. Every time a batch is set up, it needs to acquire schemas for all of the topics it is going to ingest and work with the resource scheduler to set up mappers; once mapper slots are acquired, it starts up the JVMs, pulls data down for a few minutes, persists checkpoints to disk, and then tears down. This cycle repeats every 10 minutes. We observed that during really busy periods in the cluster, we were spending a lot of time in the setup phase compared to how long we were actually ingesting. &lt;/p&gt;
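&lt;p&gt;
A back-of-the-envelope view makes the cost concrete. The setup times below are hypothetical, not measured production figures; only the 10-minute cycle length comes from the description above:
&lt;/p&gt;

```python
# Back-of-the-envelope view of batch-mode overhead. The setup times
# are hypothetical; each 10-minute cycle spends a fixed chunk on
# setup/teardown, which is pure loss relative to a continuously
# running ingestion job.

CYCLE_MINUTES = 10

def overhead_fraction(setup_minutes):
    """Fraction of each cycle not spent ingesting."""
    return setup_minutes / CYCLE_MINUTES

# During busy periods setup stretches (scheduler queuing, JVM
# startup), so the useful fraction of each cycle shrinks further.
for setup in (2, 5):
    print(f"{setup} min setup -> {overhead_fraction(setup):.0%} overhead")
```

Even a modest fixed setup cost becomes a large fraction of a short cycle, which is the core argument for moving to continuous ingestion.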

&lt;p&gt;This motivated us to move toward a classic streaming-based model where we could ingest continuously. However, there is a different efficiency cost here: if you provision your ingestion job to support the average throughput of the aggregate streams, data lag will suffer during peak times; if you provision for maximum throughput, resource utilization will be sub-optimal during off-peak times. Since we run these jobs in a large, shared, multi-tenant Hadoop cluster, we don’t want to hog resources without good reason. The ideal deployment is one where we can run Gobblin in continuous ingestion mode, with the option to elastically expand or shrink the cluster as incoming data increases and decreases, to maintain a configurable data lag.
&lt;/p&gt;

&lt;p style=&quot;text-align:center&quot;&gt;
&lt;img style=&quot;border: 0px none; width: 500px; height: auto;&quot; alt=&quot;Gobblin continuous ingest architecture&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/gobblin-continuous-ingest.png&quot;&gt;
&lt;/p&gt;

&lt;p&gt;
To implement this, we are leveraging two projects, &lt;a href=&quot;http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html&quot;&gt;Apache YARN&lt;/a&gt; for macro-level container allocation and &lt;a href=&quot;http://helix.apache.org/&quot;&gt;Apache Helix&lt;/a&gt; for micro-level resource assignment, fault-tolerance and re-allocation. We’ve &lt;a href=&quot;http://www.slideshare.net/KanakBiscuitwala/finegrained-scheduling-with-helix-apachecon-na-2014&quot;&gt;previously talked&lt;/a&gt; about how these two projects can be combined to create auto-scaling distributed systems. Helix allows us to bin-pack work-units within the acquired YARN containers and supports elastic scaling on demand by releasing or acquiring new containers from YARN depending on the performance requirements. &lt;a href=&quot;https://github.com/linkedin/gobblin/pull/339&quot;&gt;This work&lt;/a&gt; is currently in flight and we’re planning to roll this out in production next quarter. This will bring further latency reductions in our ingestion from streaming sources, enable resource utilization efficiencies and allow us to integrate with streaming sinks seamlessly. The Helix framework allows us the flexibility to support other container management frameworks like &lt;a href=&quot;http://mesos.apache.org/&quot;&gt;Mesos&lt;/a&gt;, &lt;a href=&quot;http://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt;, etc. We welcome contributions from the open source community in this direction. Stay tuned for a future blog post with a more in-depth discussion of the design and implementation of this feature.
&lt;/p&gt;

&lt;h1&gt;Team, Community and Outreach&lt;/h1&gt;&lt;br&gt;
&lt;p&gt;
&lt;center&gt;
&lt;img style=&quot;border: 0px none; width: 500px; height: auto;&quot; alt=&quot;Gobblin team at LinkedIn&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/gobblin-team-at-linkedin.jpg&quot;&gt;&lt;/center&gt;
&lt;i&gt;&lt;center&gt;The Gobblin team from left to right: (sitting) Chavdar Botev, Issac Buenrostro, Ying Dai &lt;br&gt;
(standing) Pradhan Cadabam, Min Tu, Ziyang Liu, Yinan Li, Sahil Takiar, Abhishek Tiwari&lt;/center&gt;&lt;/i&gt;
&lt;/p&gt;

&lt;p&gt;
We would like to acknowledge the impactful contributions of &lt;a href=&quot;https://www.linkedin.com/in/narasimhareddyv&quot;&gt;Narasimha Reddy Veeramreddy&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/pub/ken-goodhope/b/753/b6a&quot;&gt;Ken Goodhope&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/pub/henry-haiying-cai/0/246/792&quot;&gt;Henry Cai&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/pub/lin-qiao/4/48b/222&quot;&gt;Lin Qiao&lt;/a&gt; to the Gobblin project over the years. At LinkedIn, we’ve been fortunate to have stellar partners who have all helped make Gobblin a better product. We would like to give a shout out to our data services and Hadoop operations team, the Espresso and Kafka teams, the Bizo and Lynda teams, and the Content ingestion team. Externally, we’re excited to see the community adoption of Gobblin and are working on making it even easier to use and extend. Special thanks to &lt;a href=&quot;https://www.linkedin.com/in/kapilsurlaker&quot;&gt;Kapil Surlaker&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/in/arnoldgreg&quot;&gt;Greg Arnold&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/alex&quot;&gt;Alex Vauthey&lt;/a&gt; from the management team for their constant encouragement and support.
&lt;/p&gt;

&lt;p&gt;
We’re going to be talking about Gobblin, &lt;a href=&quot;https://engineering.linkedin.com/pinot/open-sourcing-pinot-scaling-wall-real-time-analytics&quot;&gt;Pinot&lt;/a&gt;, &lt;a href=&quot;http://kafka.apache.org&quot;&gt;Kafka&lt;/a&gt;, &lt;a href=&quot;http://samza.apache.org&quot;&gt;Samza&lt;/a&gt; and our latest invention Dali at LinkedIn’s second annual &lt;a href=&quot;https://linkedinnyc2015.splashthat.com&quot;&gt;Big Data Perspectives&lt;/a&gt; event at our NYC R&amp;D office in the Empire State building this week. Hope to see you there!
&lt;/p&gt;

&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/big-data&#039; rel=&#039;tag&#039;&gt;Big Data&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/gobblin&#039; rel=&#039;tag&#039;&gt;Gobblin&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/hadoop&#039; rel=&#039;tag&#039;&gt;Hadoop&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/tags/kafka&#039; rel=&#039;tag&#039;&gt;Kafka&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/etl&#039; rel=&#039;tag&#039;&gt;ETL&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/tags/open-source&#039; rel=&#039;tag&#039;&gt;Open Source&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/distributed-systems&#039; rel=&#039;tag&#039;&gt;Distributed Systems&lt;/a&gt;&lt;/span&gt;&lt;div class=&quot;field field-name-field-embedded-images field-type-image field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/gobblin-ingest-ecosystem.png&quot; width=&quot;1999&quot; height=&quot;747&quot; alt=&quot;Gobblin ingest ecosystem&quot; title=&quot;Gobblin ingest ecosystem&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/gobblin-continuous-ingest.png&quot; width=&quot;1144&quot; height=&quot;764&quot; alt=&quot;Gobblin continuous ingestion architecture&quot; title=&quot;Gobblin continuous ingestion architecture&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/gobblin-kafka-pipeline.png&quot; width=&quot;1999&quot; height=&quot;1101&quot; alt=&quot;Gobblin Kafka operator pipeline&quot; title=&quot;Gobblin Kafka operator pipeline&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/gobblin-team-at-linkedin.jpg&quot; width=&quot;1929&quot; height=&quot;1649&quot; alt=&quot;Gobblin team at LinkedIn&quot; title=&quot;Gobblin team at LinkedIn&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-author field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Shirshanka Das&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-profile-url field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author&amp;#039;s LinkedIn Profile URL:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;https://www.linkedin.com/in/shirshankadas&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-since field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;LinkedIn Since:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;April 2010&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-avatar field-type-image field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Avatar:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/Shirshanka_Das_01.jpg&quot; width=&quot;2158&quot; height=&quot;2158&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-title field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Title:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Sr. Staff Software Engineer&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-content-for field-type-taxonomy-term-reference field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Content For:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/blog&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Blog&lt;/a&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
</description>
 <pubDate>Mon, 28 Sep 2015 07:00:00 +0000</pubDate>
 <dc:creator>Shirshanka Das</dc:creator>
 <guid isPermaLink="false">397 at http://engineering.linkedin.com</guid>
 <comments>http://engineering.linkedin.com/big-data/bridging-batch-and-streaming-data-ingestion-gobblin#comments</comments>
</item>
<item>
 <title>Rewinder: Interactive Analysis of Hadoop&#039;s Computational Resources</title>
 <link>http://engineering.linkedin.com/hadoop/rewinder-interactive-analysis-hadoops-computational-resources</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;


&lt;span style=&quot;font-style: italic;&quot;&gt;Co-authors:&lt;/span&gt; &lt;br /&gt;&lt;br /&gt;&lt;table style=&quot;border: medium hidden ; width: 100%;&quot; border=&quot;0&quot; cellpadding=&quot;2&quot; cellspacing=&quot;2&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style=&quot;vertical-align: top; text-align: center;&quot;&gt;
            &lt;a href=&quot;https://www.linkedin.com/in/tejathotapalli&quot;&gt;&lt;img style=&quot;border: 0px none ; width: auto; height: 100px;&quot; alt=&quot;Teja&quot; src=&quot;https://media.licdn.com/media/p/5/005/034/041/349b069.jpg&quot; /&gt;&lt;/a&gt;
            &lt;br /&gt;&lt;span style=&quot;font-style: italic;&quot;&gt;&lt;a href=&quot;https://www.linkedin.com/in/tejathotapalli&quot;&gt;Teja Thotapalli&lt;/a&gt;&lt;/span&gt;
            &lt;br /&gt;&lt;/td&gt;

         &lt;td style=&quot;vertical-align: top; text-align: center;&quot;&gt;
            &lt;a href=&quot;https://www.linkedin.com/in/itsmebrian&quot;&gt;&lt;img style=&quot;border: 0px none ; width: auto; height: 100px;&quot; alt=&quot;Brian&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/me.png&quot; /&gt;&lt;/a&gt;
            &lt;br /&gt;&lt;span style=&quot;font-style: italic;&quot;&gt;&lt;a href=&quot;https://www.linkedin.com/in/itsmebrian&quot;&gt;Brian Jue&lt;/a&gt;&lt;/span&gt;                 
            &lt;br /&gt;&lt;/td&gt;

         &lt;td style=&quot;vertical-align: top; text-align: center;&quot;&gt;
            &lt;a href=&quot;https://www.linkedin.com/in/tuhtran&quot;&gt;&lt;img style=&quot;border: 0px none ; width: auto; height: 100px;&quot; alt=&quot;Tu&quot; src=&quot;https://media.licdn.com/media/p/5/005/04a/392/33102d0.jpg&quot; /&gt;&lt;/a&gt;
            &lt;br /&gt;&lt;span style=&quot;font-style: italic;&quot;&gt;&lt;a href=&quot;https://www.linkedin.com/in/tuhtran&quot;&gt;Tu Tran&lt;/a&gt;&lt;/span&gt;
            &lt;br /&gt;&lt;/td&gt;

         &lt;td style=&quot;vertical-align: top; text-align: center;&quot;&gt;
            &lt;a href=&quot;https://www.linkedin.com/in/sandhyaramu&quot;&gt;&lt;img style=&quot;border: 0px none ; width: auto; height: 100px;&quot; alt=&quot;Sandhya&quot; src=&quot;https://media.licdn.com/media/p/3/005/02b/076/30d17f9.jpg&quot; /&gt;&lt;/a&gt;               
            &lt;br /&gt;&lt;span style=&quot;font-style: italic;&quot;&gt;&lt;a href=&quot;https://www.linkedin.com/in/sandhyaramu&quot;&gt;Sandhya Ramu&lt;/a&gt;&lt;/span&gt;
            &lt;br /&gt;&lt;/td&gt;
      &lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;p&gt;As LinkedIn continues to grow in size and stature, the data volume being generated continues to increase at an exponential rate.  In order to gain insights from the massive amounts of structured and unstructured data at LinkedIn, we leverage the Hadoop framework, which has the power to store and process all kinds of large datasets.  Hadoop&#039;s multi-tenancy architecture allows us to address the challenges of a shared storage/compute environment and maintain resource provisioning and service level guarantees for tenants.&lt;/p&gt;

&lt;p&gt;One of the responsibilities of our Hadoop Application Operations team is to monitor and maintain the bulk of the data on the grid, including many of the business engagement workflows that consume the datasets.  It&#039;s a precarious balance: ensuring that the data is generated and propagated across the grid clusters while also adhering to users&#039; dataset consumption timings.  Keeping track of each individual job, its immediate and overall resource utilization, and the potential contention hurdles it may face can be a daunting task.  This blog post explains why we needed to create a tool that could help us better manage the operational aspects of our Hadoop applications.&lt;/p&gt;

&lt;h1&gt;The Problem: Too Many Flows, Not Enough Insights&lt;/h1&gt;
&lt;br /&gt;&lt;p&gt;As an operational team supporting many flows and jobs running on the Hadoop cluster, we often felt the need for insights into how resources were being allocated.  These insights help us analyze job behavior; for example, an hour of heavy cluster load can often be attributed to a single long-running job.&lt;/p&gt;

&lt;p&gt;Currently, the &lt;a href=&quot;http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html&quot;&gt;ResourceManager and Job History Server&lt;/a&gt; are the only two sources that provide visibility into resource utilization, and each has limitations.  The ResourceManager can only show the resource utilization of the cluster at the present time, not in the recent past.  The Job History Server provides information about the resources requested by an application, but only for MapReduce jobs.  While this information is necessary for understanding how things unfolded, it can be a tedious task to stitch together different job activities and their associated resource allocations into a holistic view of resource utilization.&lt;/p&gt;

&lt;h1&gt;Our Solution: Rewinder&lt;/h1&gt;

&lt;p style=&quot;text-align:center&quot;&gt;
&lt;img style=&quot;border: 0px none; width: 200px; height: auto;&quot; alt=&quot;Rewinder logo&quot; src=&quot;https://engineering.linkedin.com/sites/default/files/image07.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In order to ease the operational burden of the Hadoop applications running on the grid clusters, we developed Rewinder, named for its ability to take us back to any minute in time.  To garner insights into resource utilization, Rewinder sifts through the sea of application data and surfaces how computational resources are utilized and how they may affect job processing.&lt;/p&gt;

&lt;p&gt;Rewinder collects raw information about how memory and vCore resources are being utilized in a grid cluster using a rich set of YARN REST APIs.  This raw information is aggregated to give visibility at varying granularities: the entire cluster, the queue, and the user.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;The Rewinder tool comprises four components:&lt;/b&gt;&lt;/p&gt;

&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Extractor:&lt;/strong&gt; A high-frequency job that runs every minute and uses the YARN REST API to get the raw data, essentially the resource utilization of each running Application Master. It also does some basic aggregation.&lt;/li&gt;
   &lt;li&gt;&lt;strong&gt;Reporter:&lt;/strong&gt; A nightly job that consumes the raw data and generates reports with the desired insights.&lt;/li&gt;
   &lt;li&gt;&lt;strong&gt;Housekeeper:&lt;/strong&gt; A nightly job that runs basic housekeeping activities like purging old data and creating new table partitions.&lt;/li&gt;
   &lt;li&gt;&lt;strong&gt;Trigger:&lt;/strong&gt; The driver program that uses the Java Quartz Scheduler to orchestrate all of the above components.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;As of now, all the data is stored in a &lt;a href=&quot;http://www.mysql.com/&quot;&gt;MySQL DB&lt;/a&gt;.  Currently, our dev cluster&#039;s database is 14GB and stores 70 days&#039; worth of data.  We are adding on average 600 thousand records (2GB) of data to the database every day.&lt;/p&gt;
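To make the Extractor's roll-up step concrete, here is a minimal sketch of aggregating per-application allocations to the queue and user grains. The field names are modeled on the YARN ResourceManager `/ws/v1/cluster/apps` response, and the sample data is hypothetical; this is not Rewinder's actual code.

```python
# Sketch: roll up per-Application-Master allocations to queue/user level.
# Field names mirror the YARN RM /ws/v1/cluster/apps JSON (assumption);
# the applications below are invented sample data.
from collections import defaultdict

SAMPLE_APPS = [
    {"id": "app_1", "user": "alice", "queue": "etl",   "allocatedMB": 4096, "allocatedVCores": 2},
    {"id": "app_2", "user": "bob",   "queue": "etl",   "allocatedMB": 2048, "allocatedVCores": 1},
    {"id": "app_3", "user": "alice", "queue": "adhoc", "allocatedMB": 1024, "allocatedVCores": 1},
]

def rollup(apps, key):
    """Aggregate per-application allocations to a coarser grain ('queue' or 'user')."""
    totals = defaultdict(lambda: {"mb": 0, "vcores": 0})
    for app in apps:
        totals[app[key]]["mb"] += app["allocatedMB"]
        totals[app[key]]["vcores"] += app["allocatedVCores"]
    return dict(totals)

by_queue = rollup(SAMPLE_APPS, "queue")  # e.g. "etl" totals 6144 MB, 3 vCores
by_user = rollup(SAMPLE_APPS, "user")
```

Because the lowest grain is retained, the same raw snapshots can serve cluster-, queue-, and user-level views without re-collection.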

&lt;img style=&quot;width: 100%; height: 100%;&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-components.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;One of the challenges we faced was determining how much memory an application was consuming.  Every minute, Rewinder captures the resource allocation for each application and calculates its consumption rate, measured both in megabytes per minute and in vCores per minute.&lt;/p&gt;

&lt;p&gt;It&#039;s important to note that the API does not expose the actual resource consumption, rather it shows the allocated resources from the entire pool.  Once allocated, they cannot be claimed by anyone else until it flows back to pool, consequently, we collect this information at the lowest grain (Application Master) that can be rolled up to multiple levels to gain insights.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Here are a few examples of Rewinder&#039;s capabilities:&lt;/b&gt;&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;You can easily go to any minute in a day and see what jobs are running and how the
resources are being shared among applications, users, and queues.&lt;/li&gt;
&lt;/ul&gt;&lt;img style=&quot;width: 100%; height: 100%;&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-time-traveling.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;For any given time frame, you can see how the resources are shared among users in a queue.  This bubbles up the top resource-consuming users and also lets us compare a given user against everyone else.  You can also see how many applications are in a waiting state and the top ten resource consumers.&lt;/li&gt;
&lt;/ul&gt;&lt;img style=&quot;width: 100%; height: 100%;&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-analyzer-memory.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Rewinder provides insight into how applications waiting for resources are piling up.&lt;/li&gt;
&lt;/ul&gt;&lt;img style=&quot;width: 100%; height: 100%;&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-analyzer-non-running-apps.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The tool also generates reports on resource utilization per user, answering questions like:
      &lt;ul&gt;&lt;li&gt;What does resource utilization look like over the past 30 days?&lt;/li&gt;
         &lt;li&gt;Where does the user rank in resource utilization?&lt;/li&gt;
         &lt;li&gt;How many applications does that user submit every day?&lt;/li&gt;
         &lt;li&gt;What is the average wait time and run time of applications?&lt;/li&gt;
         &lt;li&gt;When, and to which queues, does the user submit most of their jobs?&lt;/li&gt;
         &lt;li&gt;What are the top resource-consuming jobs?&lt;/li&gt;
      &lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;img style=&quot;width: 100%; height: 100%;&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-achievements.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;For each queue or grid, we can see what general resource utilization looks like on weekdays as well as weekends.  We can also see the top resource consumers and the average wait times for applications submitted to the queue.&lt;/li&gt;
&lt;/ul&gt;&lt;img style=&quot;width: 100%; height: 100%;&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-resource-utilization-jj.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;&lt;img style=&quot;width: 100%; height: 100%;&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-resource-utilization-aj.png&quot; /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h1&gt;What&#039;s Next?&lt;/h1&gt;
&lt;br /&gt;&lt;p&gt;We are always looking for ways to improve Rewinder so that it becomes an increasingly effective tool for our users.  Our next steps include streamlining the tool and adding features that tie top-level Hadoop flows, as defined in &lt;a href=&quot;http://data.linkedin.com/opensource/azkaban&quot;&gt;Azkaban&lt;/a&gt;, to the underlying task-level information to make it more complete.&lt;/p&gt;


&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/hadoop&#039; rel=&#039;tag&#039;&gt;Hadoop&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/tags/operations&#039; rel=&#039;tag&#039;&gt;operations&lt;/a&gt;&lt;/span&gt;&lt;div class=&quot;field field-name-field-embedded-images field-type-image field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/image07.png&quot; width=&quot;675&quot; height=&quot;772&quot; alt=&quot;&quot; title=&quot;Time traveling elephant&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-components.png&quot; width=&quot;1358&quot; height=&quot;636&quot; alt=&quot;&quot; title=&quot;Rewinder&amp;#039;s data flow across its components&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-analyzer-memory.png&quot; width=&quot;1776&quot; height=&quot;1594&quot; alt=&quot;&quot; title=&quot;The grid resources and their respective allocations&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-analyzer-non-running-apps.png&quot; width=&quot;1764&quot; height=&quot;880&quot; alt=&quot;&quot; title=&quot;The breakdown of non running applications across a time series&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-achievements.png&quot; width=&quot;1710&quot; height=&quot;1580&quot; alt=&quot;&quot; title=&quot;A comparison of how the user&amp;#039;s usage compares in relation to all other Hadoop users&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-time-traveling.png&quot; width=&quot;1562&quot; height=&quot;1168&quot; alt=&quot;&quot; title=&quot;Rewinder has the ability to playback the resource allocation across any prior time slice&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-resource-utilization-jj.png&quot; width=&quot;1798&quot; height=&quot;1546&quot; alt=&quot;Suzu&quot; title=&quot;Weekday vs weekend resource allocation breakdown&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rewinder-resource-utilization-aj.png&quot; width=&quot;1792&quot; height=&quot;1540&quot; alt=&quot;Moo Moo&quot; title=&quot;Weekday vs weekend resource allocation breakdown&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/me.png&quot; width=&quot;145&quot; height=&quot;168&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-author field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Vamshi Hardageri&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-profile-url field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author&amp;#039;s LinkedIn Profile URL:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;http://www.linkedin.com/in/vamshihardageri&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-since field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;LinkedIn Since:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;March 07, 2012&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-avatar field-type-image field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Avatar:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/Screen%20Shot%202015-09-21%20at%2011.46.52%20AM.png&quot; width=&quot;356&quot; height=&quot;362&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-title field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Title:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Sr. Data Operations Engineer&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-content-for field-type-taxonomy-term-reference field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Content For:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/blog&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Blog&lt;/a&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
</description>
 <pubDate>Wed, 23 Sep 2015 07:00:00 +0000</pubDate>
 <dc:creator>Vamshi Hardageri</dc:creator>
 <guid isPermaLink="false">395 at http://engineering.linkedin.com</guid>
 <comments>http://engineering.linkedin.com/hadoop/rewinder-interactive-analysis-hadoops-computational-resources#comments</comments>
</item>
<item>
 <title>Video: Tools Team Revolutionizes Software Development </title>
 <link>http://engineering.linkedin.com/tools/video-tools-team-revolutionizes-software-development</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;
In the last four years, we’ve dramatically revamped the way we release software. One of the leaders of our tools team,
&lt;a href=&quot;https://www.linkedin.com/in/jenspillgram&quot; target=&quot;_blank&quot;&gt;Jens Pillgram-Larsen&lt;/a&gt;, shares what changed, what the future holds, and how his team evolved along the way.&lt;/p&gt;

&lt;p&gt;


&lt;iframe src=&quot;//www.slideshare.net/slideshow/embed_code/key/mooIgWZLNleCjO&quot; width=&quot;425&quot; height=&quot;355&quot; frameborder=&quot;0&quot; marginwidth=&quot;0&quot; marginheight=&quot;0&quot; scrolling=&quot;no&quot; style=&quot;border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;&quot; allowfullscreen&gt; &lt;/iframe&gt; &lt;div style=&quot;margin-bottom:5px&quot;&gt; &lt;strong&gt; &lt;a href=&quot;//www.slideshare.net/linkedin/linkedin-tools-engineer-looks-to-revolutionize-software-development&quot; title=&quot;LinkedIn Tools Engineer Looks to Revolutionize Software Development&quot; target=&quot;_blank&quot;&gt;LinkedIn Tools Engineer Looks to Revolutionize Software Development&lt;/a&gt; &lt;/strong&gt; from &lt;strong&gt;&lt;a href=&quot;//www.slideshare.net/linkedin&quot; target=&quot;_blank&quot;&gt;LinkedIn&lt;/a&gt;&lt;/strong&gt; &lt;/div&gt;
&lt;/p&gt;
&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/tools&#039; rel=&#039;tag&#039;&gt;tools&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/engineering-culture&#039; rel=&#039;tag&#039;&gt;engineering culture&lt;/a&gt;&lt;/span&gt;&lt;div class=&quot;field field-name-field-author field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Baron Roberts&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-profile-url field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author&amp;#039;s LinkedIn Profile URL:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;https://www.linkedin.com/pub/baron-roberts/0/159/65b&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-since field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;LinkedIn Since:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;April 2013&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-avatar field-type-image field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Avatar:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/392c9bb.jpg&quot; width=&quot;90&quot; height=&quot;90&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-title field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Title:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Staff Software Engineer&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-content-for field-type-taxonomy-term-reference field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Content For:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/blog&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Blog&lt;/a&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
</description>
 <pubDate>Tue, 22 Sep 2015 07:00:00 +0000</pubDate>
 <dc:creator>Baron Roberts</dc:creator>
 <guid isPermaLink="false">382 at http://engineering.linkedin.com</guid>
 <comments>http://engineering.linkedin.com/tools/video-tools-team-revolutionizes-software-development#comments</comments>
</item>
<item>
 <title>The Evolution of A/B Testing Platform at LinkedIn</title>
 <link>http://engineering.linkedin.com/ab-testing/evolution-ab-testing-platform-linkedin</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;     
&lt;em&gt;“Doubt the conventional wisdom unless you can verify it with reason and experiment” - Steve Albini&lt;/em&gt; 
&lt;/p&gt; 
&lt;p&gt;
     At LinkedIn, we experiment with new ideas before trusting our instincts. Experimentation plays an important role in product innovation and business growth. It is an essential ingredient to greater member happiness, stronger business impact and higher talent productivity. 
&lt;/p&gt;
&lt;p&gt;     
XLNT, an internal LinkedIn platform built to support data-driven A/B testing decisions, has gained popularity among various teams and products – every day, hundreds of experiments are running and being studied on XLNT. Over the past year, XLNT has evolved into a full-fledged platform covering &lt;a href=&quot;http://engineering.linkedin.com/ab-testing/why-experimentation-so-important-linkedin&quot;&gt;many aspects of A/B testing&lt;/a&gt;. We have developed many powerful yet easy-to-use features to help teams run better A/B tests and analyze results.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Flexibility: Custom Experimental Unit&lt;/strong&gt; 
&lt;/p&gt;

&lt;p&gt;
Traditionally, we&#039;ve focused our A/B testing efforts on our members in order to improve their experience using LinkedIn. We are, however, extremely focused on growth as well, and try to give guest users the best experience possible so they&#039;ll want to sign up. We do this by letting guests use certain features, like applying for jobs, without signing up, and by making sure that once they do sign up, the process is easy and fluid. In addition, we aim to provide relevant jobs and informative job descriptions to our audience. With this in mind, we have expanded our A/B testing practice to custom experimental units and leveraged XLNT to test products outside our core member platform. With flexible experiment units, we can optimize how we engage guest users, send emails, and present jobs through experimentation on XLNT. Analyzing experiments on guests, emails, jobs, and other custom experiment units has never been easier.
&lt;/p&gt; 

&lt;p style=&quot;text-align:center&quot;&gt; 
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/flexible_unit_2.jpg&quot; alt=&quot;flexible experiment unit on XLNT&quot; align=&quot;middle&quot; width=&quot;550&quot;&gt; 
&lt;/p&gt;

&lt;p&gt;     
&lt;strong&gt;Transparency: Metrics You Follow and Ramp Alert&lt;/strong&gt; 
&lt;/p&gt; 

&lt;p&gt;LinkedIn is a fairly large company with various product teams, and every product team is likely to run multiple experiments at a time. As you can imagine, someone in one product area cannot possibly be aware of all the experiments that other teams are running. Product owners at LinkedIn set up, monitor, and experiment on their essential metrics on XLNT. However, all the products and features on LinkedIn potentially interact with and influence each other. Without a proper channel, locating the source of an impact is as difficult as finding a needle in a haystack.
&lt;/p&gt; 

&lt;p&gt;To provide such a channel, XLNT has developed a feature called “Metrics You Follow”. The “Metrics You Follow” page is where LinkedIn employees can subscribe to metrics and get a list of the experiments that are impacting those metrics. The list of experiments is selected and ranked by a multi-criteria algorithm that takes into account experiment population, effect size, and the metric&#039;s intrinsic volatility. “Metrics You Follow” bridges the gap between metric followers and experiment owners.
&lt;/p&gt;

&lt;p style=&quot;text-align:center&quot;&gt; 
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/mie2.jpg&quot; alt=&quot;MIE on XLNT&quot; align=&quot;middle&quot; width=&quot;550&quot;&gt; 
&lt;/p&gt;


 &lt;p&gt;     
At the end of an A/B experiment, a decision to ramp up or terminate has to be made, and before that decision is made, the experiment owner should review the A/B test results. However, a bad experiment that negatively impacts a metric could still be ramped up, so it is up to XLNT to alert the experiment owner and the metric owner and start the communication between them. The Ramp Alert feature notifies the experiment owner upon a ramp-up request, and the metric owner is notified immediately when an experiment that negatively impacts their metric ramps up in production. With better communication and transparent information, bad experiments can be caught at an early stage.
&lt;/p&gt; 

&lt;p style=&quot;text-align:center&quot;&gt; 
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/ra2.jpg&quot; alt=&quot;Ramp Alert on XLNT&quot; align=&quot;middle&quot; width=&quot;200&quot;&gt; 
&lt;/p&gt;


&lt;p&gt;    
 &lt;strong&gt;Data Driven Decision Making: Post Experiment Power&lt;/strong&gt;
&lt;/p&gt;
&lt;p&gt;Statistical power measures an experiment’s sensitivity to detect an effect that actually exists, and it is essential for business decision-making: an underpowered experiment could have a large negative impact without us knowing it. To help teams run better A/B tests and make more data-driven decisions, we have developed the post experiment power feature on XLNT. Prior to running an experiment, the minimal experiment impact (commonly known as the effect size) one wishes to detect is determined and used to calculate the statistical power. The post experiment power feature on XLNT not only surfaces the current power values for the comparison between treatment and control, but also provides recommendations on how to achieve enough power if the current power is low. These recommendations are based on the metric’s historical data and current experiment information, and can be at the experiment level or the metric level. The experiment-level recommendation tries to achieve enough power for all the important metrics that every team should be aware of when running an experiment; the metric-level recommendation specifically aims to achieve enough power for the metric of interest. These recommendations encourage users to make more informed decisions on their experiments.
&lt;/p&gt; 
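&lt;p&gt;
As a toy illustration (not the XLNT implementation), the power of a two-sided z-test on a difference in means can be computed from the effect size, the metric’s standard deviation and the sample sizes; the function name and defaults below are our own:
&lt;/p&gt;

```python
from statistics import NormalDist

def ab_test_power(effect, std, n_treatment, n_control, alpha=0.05):
    """Power of a two-sided z-test to detect an absolute difference
    `effect` in means, given the metric's standard deviation `std`.
    Illustrative sketch only; names and defaults are our own."""
    nd = NormalDist()
    # Standard error of the difference between the two sample means.
    se = std * (1.0 / n_treatment + 1.0 / n_control) ** 0.5
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)
```

&lt;p&gt;
In this model, growing the sample (e.g. ramping to a larger percentage of traffic) raises the power for the same effect size, which is the kind of recommendation the feature can surface.
&lt;/p&gt;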
&lt;p style=&quot;text-align:center&quot;&gt; 
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/power1.jpg&quot; alt=&quot;Power Feature on XLNT&quot; align=&quot;middle&quot; width=&quot;200&quot;&gt; 
&lt;/p&gt;

&lt;p style=&quot;text-align:center&quot;&gt; 
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/power2.jpg&quot; alt=&quot;Power Feature on XLNT&quot; align=&quot;middle&quot; width=&quot;550&quot;&gt; 
&lt;/p&gt;


&lt;p&gt;     &lt;strong&gt;Customized and On-demand Analysis: XLNT on Demand&lt;/strong&gt; &lt;/p&gt; &lt;p&gt;     The large-scale unified data pipeline generates standard A/B test reports for all experiments and satisfies most experimentation needs. However, one unified report does not answer every question about an experiment. Users sometimes want to dive deep into an experiment and perform advanced analyses, for example, cohort analysis. XLNT on demand, a tool we recently built, provides exactly this kind of customized, on-demand analysis. It allows for customized date ranges, customized metrics, customized dimensions and even customized member sets for experiment analysis, and users can also run complex cohort analyses on it. Because XLNT on demand leverages most of the features of XLNT, it eliminates most of the need for ad-hoc A/B testing analysis.
&lt;/p&gt;

&lt;p style=&quot;text-align:center&quot;&gt; 
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/xod.jpg&quot; alt=&quot;XLNT On Demand&quot; align=&quot;middle&quot; width=&quot;550&quot;&gt; 
&lt;/p&gt;


&lt;p&gt;     Hundreds of experiments run on XLNT each day, and the number of metrics we support has grown to over one thousand. With the goal of providing an easy-to-use, accurate and comprehensive A/B testing solution at scale, more great features are to come as XLNT evolves.
 &lt;/p&gt; 

&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/ab-testing&#039; rel=&#039;tag&#039;&gt;A/B Testing&lt;/a&gt;&lt;/span&gt;&lt;div class=&quot;field field-name-field-embedded-images field-type-image field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/power2.jpg&quot; width=&quot;1600&quot; height=&quot;492&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/ra.jpg&quot; width=&quot;596&quot; height=&quot;994&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/xod.jpg&quot; width=&quot;975&quot; height=&quot;370&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/power1.jpg&quot; width=&quot;451&quot; height=&quot;390&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/flexible_unit_2.jpg&quot; width=&quot;1414&quot; height=&quot;692&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/mie2.jpg&quot; width=&quot;1903&quot; height=&quot;995&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/ra2.jpg&quot; width=&quot;926&quot; height=&quot;998&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-author field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Weitao Duan&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-profile-url field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author&amp;#039;s LinkedIn Profile URL:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;https://www.linkedin.com/in/weitaoduan&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-since field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;LinkedIn Since:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;07/2014&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-avatar field-type-image field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Avatar:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/WeitaoDuan.jpg&quot; width=&quot;300&quot; height=&quot;300&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-title field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Title:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Senior Data Scientist&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-content-for field-type-taxonomy-term-reference field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Content For:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/blog&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Blog&lt;/a&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
</description>
 <pubDate>Mon, 21 Sep 2015 07:00:00 +0000</pubDate>
 <dc:creator>Weitao Duan</dc:creator>
 <guid isPermaLink="false">393 at http://engineering.linkedin.com</guid>
 <comments>http://engineering.linkedin.com/ab-testing/evolution-ab-testing-platform-linkedin#comments</comments>
</item>
<item>
 <title>T-Rex, Luigi, Pac-Man and More: Check Out the New Global Ops Space</title>
 <link>http://engineering.linkedin.com/culture/t-rex-luigi-pac-man-and-more-check-out-new-global-ops-space</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;style&gt;
  .centered {text-align: center;}
  .captioned {text-align: center; display: block; font-style: italic; font-size: 11px;}
&lt;/style&gt;

&lt;p&gt;
When I say “Global Operations” are you picturing rows of grey cubicles and occasional heads popping up over the walls? If so, you’ve never seen the Global Ops space at LinkedIn. The team recently moved into a new location that was pretty plain, but now looks like scenes from Jurassic World, Super Mario, and a Las Vegas lounge, thanks to our recent Rock Your Space contest.

&lt;p&gt;
When the team relocated into the three-story building, we announced that each quarter of a floor – twelve groups in all – would get a budget, design lead, and 10 weeks to decorate their space. The designs were intended as long-term additions and we encouraged teams to let their creativity shine. In late July, we invited judges from across LinkedIn to evaluate the spaces for originality, effort, teamwork, and overall presentation. 

&lt;p&gt;
The end results were pretty fantastic, as you can see in the photos below. The Media Productions Team came in first place with “XL Media Playground,” an interactive space that showcased some of the team’s projects including digital mapping, drones, video editing, green screens, and of course, a bar serving media-themed drinks.&lt;/p&gt;

&lt;p class=&quot;centered&quot;&gt;
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/IMG_4489.JPG&quot; width=&quot;450&quot;/&gt;&lt;br/&gt;
&lt;/p&gt;

&lt;p&gt;
Other themes around the building included the LinkedIn Barcade (our runners up!), Jurassic World, Vegas Club Lounge, and Super Mario Brothers. Check out some of the awesome decor we get to see every day:

&lt;p class=&quot;centered&quot;&gt;
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/rockyourspace7.jpg&quot; width=&quot;450&quot;/&gt;&lt;br/&gt;
&lt;/p&gt;

&lt;p class=&quot;centered&quot;&gt;
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/rockyourspace6.JPG&quot; width=&quot;450&quot;/&gt;&lt;br/&gt;
&lt;/p&gt;

&lt;p class=&quot;centered&quot;&gt;
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/rockyourspace5.JPG&quot; width=&quot;450&quot;/&gt;&lt;br/&gt;
&lt;/p&gt;

&lt;p class=&quot;centered&quot;&gt;
&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/rockyourspace4.jpg&quot; width=&quot;450&quot;/&gt;&lt;br/&gt;
&lt;/p&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/culture&#039; rel=&#039;tag&#039;&gt;culture&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/tags/operations&#039; rel=&#039;tag&#039;&gt;operations&lt;/a&gt;&lt;/span&gt;&lt;div class=&quot;field field-name-field-embedded-images field-type-image field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/IMG_4489.JPG&quot; width=&quot;1600&quot; height=&quot;1200&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rockyourspace7.jpg&quot; width=&quot;1032&quot; height=&quot;581&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rockyourspace6.JPG&quot; width=&quot;960&quot; height=&quot;720&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rockyourspace5.JPG&quot; width=&quot;1600&quot; height=&quot;1200&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/rockyourspace4.jpg&quot; width=&quot;1328&quot; height=&quot;747&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-author field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Christie DeBlasio&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-profile-url field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author&amp;#039;s LinkedIn Profile URL:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;https://www.linkedin.com/in/cdeblasio&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-since field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;LinkedIn Since:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;June 2014&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-avatar field-type-image field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Avatar:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/AAEAAQAAAAAAAAMkAAAAJDRmYTZkYTcyLWQwZDUtNDM4Ni05YWZkLWNhN2E2NmRlOGUyMQ.jpg&quot; width=&quot;295&quot; height=&quot;295&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-title field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Title:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Executive Assistant, Global Operations&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-content-for field-type-taxonomy-term-reference field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Content For:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/blog&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Blog&lt;/a&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
</description>
 <pubDate>Thu, 17 Sep 2015 18:18:37 +0000</pubDate>
 <dc:creator>Christie DeBlasio</dc:creator>
 <guid isPermaLink="false">394 at http://engineering.linkedin.com</guid>
 <comments>http://engineering.linkedin.com/culture/t-rex-luigi-pac-man-and-more-check-out-new-global-ops-space#comments</comments>
</item>
<item>
 <title>Jumping the Gender Gap, A World Without CSS, Antifragile Software Systems, and Other Must Reads</title>
 <link>http://engineering.linkedin.com/publisher-platform/jumping-gender-gap-world-without-css-antifragile-software-systems-and-other-must</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;style&gt;
  .centered {text-align: center;}
  .captioned {text-align: center; display: block; font-style: italic; font-size: 11px;}
&lt;/style&gt;

&lt;p&gt;&lt;i&gt;LinkedIn’s publishing platform gives professionals a way to share their personal opinions about topical professional news and interests, including our engineers. Here, we regularly round up some of the best pieces written recently by LinkedIn engineers.&lt;/i&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/dont-let-being-only-girl-stop-you-thankyourmentor-tiffany-lim&quot; target=&quot;_blank&quot;&gt;&quot;&#039;Don&#039;t Let Being the Only Girl Stop You!&#039; -- #ThankYourMentor&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/in/tiffanywlim&quot; target=&quot;_blank&quot;&gt;Tiffany Lim&lt;/a&gt;, Software Engineer at LinkedIn&lt;/strong&gt;
&lt;br&gt;Tiffany remembers fifth grade, the first time she realized there was a gender gap and how she overcame her fear of being the only girl in the science club. Her elementary school teacher encouraged her to persevere in a male-dominated industry, advice Tiffany has held onto throughout her engineering career. The best mentors don&#039;t just believe in you, Tiffany writes, they show you how to believe in yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/world-without-css-bradley-cypert&quot; target=&quot;_blank&quot;&gt;&quot;A World Without CSS&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/in/bradcypert?trk=pulse-det-athr_prof-art_hdr&quot; target=&quot;_blank&quot;&gt;Bradley Cypert&lt;/a&gt;, UI Engineer at LinkedIn&lt;/strong&gt;
&lt;br&gt;What if CSS had never been invented and we had to live in an (internet) world of unstyled websites? Bradley argues that, while browsers and individual programs might have created their own styling, the result would be a nightmare of conflicting styles and standards for developers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/my-experience-scale-system-pengfei-jason-li&quot; target=&quot;_blank&quot;&gt;&quot;The &#039;Runner&#039;s High&#039; Moment in Programming&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/in/jasonpengfeili&quot; target=&quot;_blank&quot;&gt;Pengfei (Jason) Li&lt;/a&gt;, Senior Software Engineer at LinkedIn&lt;/strong&gt;
&lt;br&gt;Running is therapeutic for many people. There&#039;s no better feeling than overcoming the struggle of a harder-than-anticipated run and coming out on top. Jason draws a parallel between running and programming. Projects don&#039;t always go according to plan, and sometimes reaching the finish line can seem like an impossible task, but a difficult journey makes the end result that much sweeter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/xctest-helper-methods-kyle-sherman&quot; target=&quot;_blank&quot;&gt;&quot;XCTest Helper Methods&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/in/kyledsherman?trk=pulse-det-athr_prof-art_hdr&quot;&gt;Kyle Sherman&lt;/a&gt;, Software Engineer (SlideShare) at LinkedIn&lt;/strong&gt;
&lt;br&gt;Kyle offers advice for anyone writing unit tests in Xcode. If a helper method is called from numerous tests using the XCTest framework and an assertion inside it fails, you will see only a failure in the helper method, with no information about which line in which test triggered it. Kyle offers up his solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/managing-css-transitions-javascript-through-kevin-greene&quot; target=&quot;_blank&quot;&gt;&quot;Managing CSS Transitions with JavaScript through Functional Composition&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/pub/kevin-greene/71/969/a44?trk=pulse-det-athr_prof-art_hdr&quot;&gt;Kevin Greene&lt;/a&gt;, Web Developer at LinkedIn&lt;/strong&gt;
&lt;br&gt;As websites add more animations and transitions, web developers increasingly have to manage CSS through JavaScript. Kevin offers a simple approach built from pure functions to handle this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/antifragile-software-systems-jens-pillgram-larsen&quot; target=&quot;_blank&quot;&gt;&quot;Antifragile Software Systems&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/in/jenspillgram?trk=pulse-det-athr_prof-art_hdr&quot; target=&quot;_blank&quot;&gt;Jens Pillgram-Larsen&lt;/a&gt;, Senior Engineering Manager at LinkedIn&lt;/strong&gt;
&lt;br&gt;It&#039;s impossible to build a perfect software system that never fails. We all know this, intuitively and experientially. Instead of trying to achieve perfection, we should strive to build the perfect process—one that is Antifragile. Jens explains why an Antifragile system is one that becomes better when it is stressed and how it is the system equivalent of adopting a growth mindset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/developer-happiness-david-max&quot; target=&quot;_blank&quot;&gt;&quot;Developer Happiness&quot;&lt;/a&gt;
&lt;br&gt;By &lt;a href=&quot;https://www.linkedin.com/in/davidpmax&quot; target=&quot;_blank&quot;&gt;David Max&lt;/a&gt;, Senior Software Engineer at LinkedIn&lt;/strong&gt;
&lt;br&gt;A person’s perspective on a project or task at work can vary wildly depending on their job title. David takes a look at a recent project where his team moved a task over to a scalable distributed architecture in order to utilize more processing power. The project worked, but resulted in software engineers who were frustrated by the constraints of the new computing framework. David argues that engineers can become better at their jobs by understanding the architecture of the software they work with, which will help them realize why restraints on certain architecture are there and how they can work within them.&lt;/p&gt;
&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/publisher-platform&#039; rel=&#039;tag&#039;&gt;publisher platform&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/content&#039; rel=&#039;tag&#039;&gt;content&lt;/a&gt;&lt;/span&gt;&lt;div class=&quot;field field-name-field-author field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Erran Berger&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-profile-url field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author&amp;#039;s LinkedIn Profile URL:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;https://www.linkedin.com/in/erranberger&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-since field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;LinkedIn Since:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;2009&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-avatar field-type-image field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Avatar:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/Erran%20Berger%20headshot_8.jpg&quot; width=&quot;176&quot; height=&quot;177&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-title field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Title:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Head of Engineering, Content Products&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-content-for field-type-taxonomy-term-reference field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Content For:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/blog&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Blog&lt;/a&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
</description>
 <pubDate>Fri, 11 Sep 2015 07:00:00 +0000</pubDate>
 <dc:creator>Erran Berger</dc:creator>
 <guid isPermaLink="false">392 at http://engineering.linkedin.com</guid>
 <comments>http://engineering.linkedin.com/publisher-platform/jumping-gender-gap-world-without-css-antifragile-software-systems-and-other-must#comments</comments>
</item>
<item>
 <title>The Many Facets of &#039;Faceted Search&#039;</title>
 <link>http://engineering.linkedin.com/faceting/many-facets-faceted-search</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;
&lt;p&gt;
&lt;a href=&quot;https://en.wikipedia.org/wiki/Faceted_search&quot;&gt;Faceted search&lt;/a&gt; is a vital part of LinkedIn’s search experience. It’s a key feature in the exploratory searches done by job seekers, recruiters and market analysts when trying to find the information they need on LinkedIn. It provides structure to search results, which enables fast navigation and discovery.
&lt;/p&gt;
&lt;p&gt;
When it comes to a great search experience, there are two elements that matter most: correctness and performance. Conventional approaches to faceted search sacrifice either too much correctness or don’t perform fast enough.
&lt;/p&gt;
&lt;p&gt;
This post covers a new approach to faceted search based on the inverted index, and compares it to the conventional forward index-based approach. We outline the technical challenges and the advantages of the new approach when dealing with very large data sets, and include some practical results from the largest LinkedIn search index to back up our ideas. It is important to note that this post assumes a basic familiarity with inverted index-based approaches to search.
&lt;/p&gt;

&lt;h1&gt;What is faceted search?&lt;/h1&gt;
&lt;p&gt;
Faceted search is best explained through an example. In the screenshot below, there are two facets, &#039;Location&#039; and &#039;Current Company&#039;, each with multiple facet values. Some facet values, like &#039;LinkedIn&#039;, &#039;Greater New York City Area&#039; and &#039;San Francisco Bay Area&#039;, are selected. If the original query was &#039;Software Engineer&#039;, the facet selections mean the actual results include software engineers who work at LinkedIn and are located in either the Greater New York City Area or the San Francisco Bay Area.
&lt;/p&gt;

&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/search_page.png&quot; /&gt;

&lt;p&gt;
In the above example, the astute reader would have noticed that multiple facet value selections within the same facet are treated as an OR, i.e. the selections for &#039;Location&#039; indicate we want engineers in New York City OR the Bay Area. The counts reflect these semantics: a selection of a facet value within a facet like &#039;Location&#039; will not affect the counts of other locations. If this were not true, the counts of all other locations would become 0 once we select a location like &#039;San Francisco Bay Area&#039;. This is not very useful or intuitive. 
&lt;/p&gt;
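&lt;p&gt;
These counting semantics can be sketched in a few lines. The toy corpus and helper below are our own illustration, not LinkedIn&#039;s implementation: when counting one facet, every selected facet except that one is applied (values within a facet are ORed, facets are ANDed across).
&lt;/p&gt;

```python
# Toy profiles; the fields and values are made up for illustration.
profiles = [
    {"loc": "Greater New York City Area", "company": "LinkedIn"},
    {"loc": "San Francisco Bay Area",     "company": "LinkedIn"},
    {"loc": "San Francisco Bay Area",     "company": "Acme"},
    {"loc": "Seattle",                    "company": "LinkedIn"},
]

def facet_counts(docs, selections, facet):
    """Count values of `facet`, applying every selected facet EXCEPT
    `facet` itself, so sibling values keep non-zero counts."""
    counts = {}
    for doc in docs:
        if all(doc[f] in vals for f, vals in selections.items() if f != facet):
            counts[doc[facet]] = counts.get(doc[facet], 0) + 1
    return counts

selections = {
    "loc": {"Greater New York City Area", "San Francisco Bay Area"},
    "company": {"LinkedIn"},
}
```

&lt;p&gt;
With these selections, counting the location facet still reports a count for Seattle, because the location selection itself is ignored while counting locations.
&lt;/p&gt;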

&lt;h1&gt;The challenges of faceted search&lt;/h1&gt;
&lt;p&gt;
The LinkedIn search stack, called &lt;a href=&quot;https://engineering.linkedin.com/search/did-you-mean-galene&quot;&gt;Galene&lt;/a&gt;, supports early termination by providing features like static rank along with special retrieval queries (which we will discuss in future posts). Early termination enables us to do typeahead searches using an inverted index, and enables queries with lots of hits to execute fast. For example, in typeahead we index two-letter prefixes, which match millions of documents per shard. There is no cost-effective way to provide millisecond latencies for those queries if we were to retrieve and score all documents. 
&lt;/p&gt;
&lt;p&gt;
On the other hand, early termination doesn&#039;t work well with the standard approach to faceting. The standard way to discover facet values and compute their counts is to use the forward index while scoring documents. Facet values are put in priority queues, and the facet values with the highest counts are selected to ultimately be displayed in the left rail of the search page.
&lt;/p&gt;
&lt;p&gt;
If we terminate a search without retrieving all documents, we will have inaccurate counts for low cardinality facet values if we use this forward index based approach. This challenge is compounded if we want to estimate the counts of high cardinality facet values through some sort of sampling. 
&lt;/p&gt;
&lt;p&gt;
In summary, doing facet counting using the forward index means that we cannot do two things at the same time:
&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Approximate facet counts for large cardinality facet values to improve performance&lt;/li&gt;
	&lt;li&gt;Guarantee exact values for low counts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
The second requirement is very important: when the count is low (small enough to fit on a couple of search pages), the numbers should be exact, while discrepancies in high counts matter far less. It doesn&#039;t matter whether we have 10200 or 10215 results, but it does matter whether we have 5 or 6.
&lt;/p&gt;

&lt;h1&gt;Our solution&lt;/h1&gt;
&lt;p&gt;
Thus we had to come up with a different algorithm for faceting. We chose to use inverted index posting lists to count facet values. Using the inverted index for facet counting lets us guarantee exact counts for low-cardinality facet values while estimating the counts for high-cardinality values, providing a significant performance boost. Our approach also retains the option to early terminate when scoring documents.
&lt;/p&gt;
&lt;p&gt;
First, we split the faceted search problem into two components: discovery and counting. Facet discovery is the process of deciding which values to display for a given facet. For example, the &#039;Current Company&#039; facet requires us to &#039;discover&#039; a list of companies that we are going to display. This list is based on the input query, so the companies shown for the query &#039;Software Engineer&#039; will be different from the companies shown for the query &#039;Mechanical Engineer&#039;.
&lt;/p&gt;
&lt;p&gt;
Our decision was to use the forward index to discover facet values during scoring and the inverted index to count them. These two steps are sequential, as counting can&#039;t start until at least some facet values are discovered.
&lt;/p&gt;

&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/architecture_0.png&quot; /&gt;

&lt;p&gt;
The typical search cluster setup consists of multiple index partitions hosted on search nodes, plus a broker, which fans out requests and gathers results. When faceting is involved, the broker typically makes two requests: the first performs regular scoring and discovers facet values along the way. Once the broker has gathered all facet values from the search nodes, it issues a subsequent request to count the top N values for each facet.
&lt;/p&gt;

&lt;h1&gt;Facet discovery&lt;/h1&gt;
&lt;p&gt;
When using early termination, the search query is typically augmented to retrieve the most relevant documents while skipping non-relevant ones. This way, facet values from non-relevant documents might be skipped during the retrieval stage, or discarded during the top N selection on the broker, which introduces a relevance component into the way we select facet values to display.
&lt;/p&gt;
&lt;p&gt;
The problem arises when a facet selection is made. In this case we cannot use the search query to discover facets, because the facet selection becomes part of the query, while it should not be considered during the discovery of the selected facet.
&lt;/p&gt;
&lt;p&gt;
Let&#039;s say we have the facet value &#039;LinkedIn&#039; selected for the &#039;Current Company&#039; facet. Let&#039;s also say that our early termination limit is 100 and we need at least 50 documents to discover facets. Essentially, we need to execute the following queries:
&lt;ol&gt;
	&lt;li&gt;(+Engineer +LinkedIn) [100]&lt;/li&gt;
	&lt;li&gt;+Engineer [50]&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
	&lt;li&gt;&#039;+&#039; means that a clause is required. With two or more clauses, it can be thought of as a boolean AND.&lt;/li&gt;
	&lt;li&gt;[100] and [50] are early termination limits&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
Query 1 will be used for scoring as well as for the discovery of facets other than Current Company. Query 2 will be used for the discovery of the Current Company facet.
It&#039;s possible to multiplex these two queries into a single one:
&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;(+Engineer ?LinkedIn[100]) [50]&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
	&lt;li&gt;&#039;?&#039; means that a clause is optional. With two or more clauses, it can be thought of as a boolean OR. In this case it is combined with a required clause, so it becomes fully optional, meaning it does not have to match at all.&lt;/li&gt;
	&lt;li&gt;[100] means early termination limit on ?LinkedIn clause.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
The query above retrieves the same documents as the two separate queries, but does so much more efficiently in a single pass. The necessity of multiplexing becomes evident when two or more facets are selected; fortunately, similar logic applies in that case as well.
&lt;/p&gt;
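&lt;p&gt;To make the multiplexing concrete, the following hypothetical Python sketch (not Galene&#039;s actual implementation) simulates the single-pass query over a toy corpus, with the limits scaled down from [100] and [50] to [4] and [2] so that early termination is visible:&lt;/p&gt;

```python
def multiplexed(docs, limit_both, limit_eng):
    """docs: list of (docid, term_set) in docid order.
    Simulates (+Engineer ?LinkedIn[limit_both]) [limit_eng]."""
    hits, n_both, n_eng = [], 0, 0
    for docid, terms in docs:
        if 'Engineer' not in terms:
            continue                          # '+' clause: Engineer is required
        is_both = 'LinkedIn' in terms
        take = False
        if limit_eng > n_eng:                 # still discovering Current Company
            take = True
        if is_both and limit_both > n_both:   # still scoring / other facets
            take = True
        if take:
            hits.append(docid)
        n_eng += 1
        if is_both:
            n_both += 1
        if n_eng >= limit_eng and n_both >= limit_both:
            break                             # both limits reached: terminate early
    return hits

# Toy corpus; limits scaled down to [4] and [2].
docs = [(1, {'Engineer', 'LinkedIn'}), (2, {'Engineer'}), (3, {'LinkedIn'}),
        (4, {'Engineer', 'LinkedIn'}), (5, {'Engineer'}),
        (6, {'Engineer', 'LinkedIn'}), (7, {'Engineer', 'LinkedIn'})]
result = multiplexed(docs, 4, 2)
```

&lt;p&gt;A document is emitted while either limit is unmet, and the pass stops once both limits are reached, yielding exactly the union of the two separate early-terminated queries.&lt;/p&gt;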

&lt;h1&gt;Using the inverted index for facet counting&lt;/h1&gt;
&lt;p&gt;
To use the inverted index for facet counting, we created a special counting query which can be executed like any other query that utilizes the posting lists of the inverted index. For example, the following query will count the engineers in Google, Facebook and LinkedIn by essentially traversing the posting lists for the terms &#039;Engineer&#039;, &#039;Google&#039;, &#039;Facebook&#039;, and &#039;LinkedIn&#039;:
&lt;/p&gt;
&lt;pre&gt;
FACETS Google Facebook LinkedIn QUERY Engineer
&lt;/pre&gt;
&lt;p&gt;
The result of this query is a set of (facet value, facet count) pairs. QUERY denotes the original query issued by the user, in this case &#039;Engineer&#039;, and is called the query condition. When counting, each search node essentially executes the following query:
&lt;/p&gt;
&lt;pre&gt;
+Engineer +(?Google ?Facebook ?LinkedIn)
&lt;/pre&gt;
&lt;p&gt;
This query matches only documents which contribute to at least one facet value count. For each matched document, we increment the count of the corresponding facet value. Since we have an &#039;OR&#039; expression, each matched document can increment multiple facet value counts.
&lt;/p&gt;

&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/posting_lists.png&quot; /&gt;

&lt;p&gt;
Consider the example above. It shows an inverted index with 4 terms and 4 documents with docids 1-4. The circled docids indicate that the counting query produced a match.
&lt;/p&gt;

&lt;p&gt;
The first document matches the query condition and also matches &#039;Google&#039;. The second document matches only &#039;Google&#039; and is not counted. The third document matches the query condition and &#039;LinkedIn&#039;. The fourth document matches everything. The counting query therefore matches documents 1, 3, and 4, and after counting is finished it produces the following counts: Google 2, Facebook 1, LinkedIn 2.
&lt;/p&gt;
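&lt;p&gt;The example above can be reproduced with a short sketch. This is an illustrative Python model of posting list traversal, not the production counting code:&lt;/p&gt;

```python
from collections import defaultdict

# Posting lists from the example: 4 terms, docids 1-4.
postings = {
    'Engineer': [1, 3, 4],   # the query condition
    'Google':   [1, 2, 4],
    'Facebook': [4],
    'LinkedIn': [3, 4],
}

def count_facets(postings, query_term, facet_terms):
    """Model of +query_term +(?f1 ?f2 ...): count each facet value over
    documents that also match the query condition."""
    query_docs = set(postings[query_term])
    counts = defaultdict(int)
    matched = set()
    for term in facet_terms:
        for docid in postings[term]:
            if docid in query_docs:       # '+' query condition must match
                counts[term] += 1         # one doc may bump several facets
                matched.add(docid)
    return dict(counts), sorted(matched)

counts, matched = count_facets(postings, 'Engineer',
                               ['Google', 'Facebook', 'LinkedIn'])
# counts == {'Google': 2, 'Facebook': 1, 'LinkedIn': 2}; matched == [1, 3, 4]
```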

&lt;p&gt;
When facet values are selected, the facet counting query becomes more complicated, but it can still be constructed and executed in a single pass as above, matching only those documents that contribute to at least one facet value count.
&lt;/p&gt;

&lt;h1&gt;Counting approximation&lt;/h1&gt;
&lt;p&gt;
The approach above is correct, but performs poorly for queries that match millions of documents per shard.
&lt;/p&gt;
&lt;p&gt;
This is where the inverted index pays further dividends: its posting lists come with skip lists, which we can leverage to sample a subset of the index and produce estimated counts for high cardinality facet values.
&lt;/p&gt;
&lt;p&gt;
When using sampling, the counting query can be rewritten to:
&lt;/p&gt;
&lt;pre&gt;
+Engineer 
+(
    ?(+Google +Google_Sampling) 
    ?(+Facebook +Facebook_Sampling) 
    ?(+LinkedIn +LinkedIn_Sampling) 
    ?Engineer_Sampling
)
&lt;/pre&gt;
&lt;p&gt;
There are many ways to implement a sampling iterator. We&#039;ve experimented with quite a few approaches; the following one is quite performant while providing very good accuracy at the same time.
&lt;/p&gt;
&lt;p&gt;
The idea is to compute the facet value count as:
&lt;/p&gt;

&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/formula1.png&quot; /&gt;

&lt;p&gt;
where D(x) is a facet value density function: a continuous function defined on [0, max doc] that takes values in the range [0, 1].
&lt;/p&gt;
&lt;p&gt;
For performance&#039;s sake we use linear interpolation to approximate D(x), which provides good enough accuracy. We split the document id space into R equal ranges of size S = max doc / R. The sampling iterator then returns up to the first F documents from each range. Since the document id space is contiguous, the implementation of the sampling iterator boils down to a few arithmetic operations. We then calculate the density value D(x) at the beginning of each range r as M(r) / F, where M(r) is the number of sampled documents in range r that matched the corresponding facet value iterator. Finally, we integrate the density function D(x) to compute the facet count:
&lt;/p&gt;

&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/formula2.png&quot; /&gt;

&lt;p&gt;
Since this approach retrieves documents from across the entire posting list, it works very well regardless of the distribution of documents. This matters because we have observed that our static rank tends to make posting lists dense at the front and sparse at the end; a posting list with document ids 1, 2, 4, 6, 20, 100, 1000 is quite typical.
&lt;/p&gt;
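&lt;p&gt;Because the sampled document id space is contiguous, advancing the sampling iterator really is just arithmetic. A minimal hypothetical sketch, with S the range size and F the sampled prefix per range:&lt;/p&gt;

```python
def next_sampled(docid, S, F):
    """Smallest sampled docid that is at least `docid`: each range of size S
    contributes only its first F docids to the sample."""
    if docid % S >= F:                 # past the sampled prefix of this range
        return (docid // S + 1) * S    # jump to the start of the next range
    return docid
```

&lt;p&gt;A clause like +Google +Google_Sampling can then leapfrog between the term&#039;s posting list iterator and this function.&lt;/p&gt;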

&lt;h2&gt;Example&lt;/h2&gt;

&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/graph.png&quot; /&gt;

&lt;p&gt;
Let&#039;s say we have document ids in the range 0-400. We split them into R = 4 equal intervals of 100 documents each, and consider the first F = 40 documents from each interval to compute the density function D(x).
&lt;/p&gt;
&lt;p&gt;
Let&#039;s say that:
&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;the first interval matched 10 documents against the facet value iterator&lt;/li&gt;
	&lt;li&gt;the second matched 5 documents&lt;/li&gt;
	&lt;li&gt;the third matched 2 documents&lt;/li&gt;
	&lt;li&gt;the fourth matched 1 document&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;
We calculate the area under the function D(x) using the trapezoidal rule, evaluating 400/2/40/4 * (10 + 5 + 5 + 2 + 2 + 1 + 1 + 0) = 32.5
&lt;/p&gt;
&lt;p&gt;
Rounding up, the final approximated count is reported as 33.
&lt;/p&gt;
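&lt;p&gt;The arithmetic above can be verified with a few lines of Python; this sketch hard-codes the example&#039;s numbers and covers only the integration step:&lt;/p&gt;

```python
import math

S, F = 100, 40            # interval size and sampled prefix size per interval
M = [10, 5, 2, 1]         # per-interval matches against the facet value iterator

# D(x) is linearly interpolated: D = M(r)/F at the start of interval r,
# falling to 0 past the last interval.  Trapezoidal rule over width-S intervals:
Mp = M + [0]
count = S * sum(Mp[i] + Mp[i + 1] for i in range(len(M))) / (2 * F)
# count == 32.5; the reported count is math.ceil(count) == 33
```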

&lt;h1&gt;Results&lt;/h1&gt;
&lt;p&gt;
The following study was done based on query logs from the main member search product at LinkedIn. We applied our approximation algorithm to each query and compared the results against the algorithm with approximations turned off.
&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;Capacity: total runtime decreased by a factor of 7&lt;/li&gt;
	&lt;li&gt;Latencies: p50 decreased by 1.2x, p90 by 1.4x, p95 by 2.6x, p99 by 11.5x&lt;/li&gt;
&lt;/ul&gt;

&lt;img src=&quot;http://engineering.linkedin.com/sites/default/files/results.png&quot; /&gt;

&lt;p&gt;
The plot above displays the percentage error of the estimated counts. The X axis is the exact count we would get without any approximation; the Y axis is the percentage error of the approximation. The plot shows that our error is never more than 30%, with p95 = 5% and p99 = 17%. It&#039;s important to realize that approximation kicks in only above a configured threshold (45 in the plot above), so the error for exact counts below 45 is guaranteed to be 0%.
&lt;/p&gt;
	
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;
To summarize, our goal with the new faceting was threefold:
&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;To retain the option of early-terminating searches when the result set is large.&lt;/li&gt;
	&lt;li&gt;To guarantee exact counts for low cardinality facet values.&lt;/li&gt;
	&lt;li&gt;To improve the performance of counting high cardinality facet values by providing estimated counts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;
We achieved these goals by splitting the faceting operation into two separate phases: discovery and counting. We used the traditional forward index scan approach for discovery, and used an inverted index traversal with sampling for the counting phase. With this approach we were able to achieve the right balance of correctness and performance. 
&lt;/p&gt;
&lt;p&gt;
Results show tremendous gains in capacity and in high-percentile latencies. The key to this approach is the flexibility it provides: different knobs can be tuned to trade off correctness against performance.
&lt;/p&gt;

&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;
Many thanks to search infrastructure engineers, who worked on faceting, especially &lt;a href=&quot;https://www.linkedin.com/in/apurva1618&quot;&gt;Apurva Mehta&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/in/niranjanbalasubramanian&quot;&gt;Niranjan Balasubramanian&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/pub/yingchao-liu/49/bab/111&quot;&gt;Yingchao Liu&lt;/a&gt;, &lt;a href=&quot;https://www.linkedin.com/in/jessechang&quot;&gt;Choongsoon Jesse Chang&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/pub/michael-chernyak/1/54/34b&quot;&gt;Michael Chernyak&lt;/a&gt;.
&lt;/p&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/faceting&#039; rel=&#039;tag&#039;&gt;Faceting&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/taxonomy/term/78&#039; rel=&#039;tag&#039;&gt;Search&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/galene&#039; rel=&#039;tag&#039;&gt;Galene&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/information-retrieval&#039; rel=&#039;tag&#039;&gt;Information Retrieval&lt;/a&gt;&lt;/span&gt;&lt;span class=&#039;tag&#039;&gt;&lt;a href=&#039;/infrastructure&#039; rel=&#039;tag&#039;&gt;infrastructure&lt;/a&gt;&lt;/span&gt;&lt;div class=&quot;field field-name-field-embedded-images field-type-image field-label-hidden&quot;&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/search_page.png&quot; width=&quot;580&quot; height=&quot;421&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/architecture_0.png&quot; width=&quot;400&quot; height=&quot;243&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/posting_lists.png&quot; width=&quot;400&quot; height=&quot;433&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/formula1.png&quot; width=&quot;98&quot; height=&quot;52&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/formula2.png&quot; width=&quot;580&quot; height=&quot;52&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item odd&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/results.png&quot; width=&quot;580&quot; height=&quot;408&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/graph.png&quot; width=&quot;400&quot; height=&quot;251&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-author field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Dmytro Ivchenko&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-profile-url field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author&amp;#039;s LinkedIn Profile URL:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;https://www.linkedin.com/in/dmytroivchenko&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-linkedin-since field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;LinkedIn Since:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;05/2012&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-avatar field-type-image field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Avatar:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;img typeof=&quot;foaf:Image&quot; src=&quot;http://engineering.linkedin.com/sites/default/files/avatar.jpg&quot; width=&quot;400&quot; height=&quot;400&quot; alt=&quot;&quot; /&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-title field-type-text field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Author Title:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;Sr Staff Engineer&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
&lt;div class=&quot;field field-name-field-content-for field-type-taxonomy-term-reference field-label-above&quot;&gt;
      &lt;div class=&quot;field-label&quot;&gt;Content For:&amp;nbsp;&lt;/div&gt;
    &lt;div class=&quot;field-items&quot;&gt;
          &lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/blog&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Blog&lt;/a&gt;&lt;/div&gt;
      &lt;/div&gt;
&lt;/div&gt;
</description>
 <pubDate>Thu, 10 Sep 2015 07:00:00 +0000</pubDate>
 <dc:creator>Dmytro Ivchenko</dc:creator>
 <guid isPermaLink="false">390 at http://engineering.linkedin.com</guid>
 <comments>http://engineering.linkedin.com/faceting/many-facets-faceted-search#comments</comments>
</item>
</channel>
</rss>
