asemanfar

Lock retries with memcached

Mon, 19 Jul 2010 00:09:45 -0700

I've seen multiple implementations of locks using memcached that use exponential retry algorithms and have yet to figure out the reasoning behind it. From what I understand, exponential back off is used in TCP packet retries, and I think that's where it first came from but I don't see how it applies to locks implemented with memcached. I'm going to try to explain why I think exponential retry is wrong in this situation.

Analogy
Let's say you need to do laundry, and you walk to the laundry room to acquire a lock on the laundry machine. As soon as you get to the door, another person is also about to walk into the door. You both step and your shoulders collide with each other and the door frame. Neither of you make it through, so you wait and try again, he goes first then you. 38 minutes later, you need to move your laundry from the washer to the dryer, so you walk over but the dryer is taken. So you look at the timer, wait 12 more minutes, and move your laundry over.

These are two different types of contention, where the main difference is that the contention in the latter case doesn't prevent progress of both parties.

Memcached Locks
Typically, when you're using memcached to lock a resource, it's to ensure operations are performed on some resource serially. In an ideal world, processes would enter a queue for rights to perform an operation on a specific resource and each would be notified when it's their turn to perform their operation. But that level of synchronization is non-trivial in a large system, so instead all the processes poll a memcached key for rights to perform their operation.

Analysis
With an exponential retry algorithm, a process will wait longer and longer between tries until the resource is available. Let's see if this makes sense with 2 processes requesting a lock on the same resource: process A and B both need to use resource 1. Process A gets the lock first and intends to perform an operation on resource 1 that has a mean duration 200ms and a std dev of 70ms. 10ms later, process B needs resource 1, so it tries to acquire the lock but fails. At this point in time, we can calculate the probability that this resource is unlocked after t ms given the mean and std dev. We can calculate when we should try next, and bound that by how often we want to tax our system (ops/sec). An exponential retry algorithm will try at 100ms, then at 300ms (+200), 700ms, 1500ms. But we know that if it's not released after 300ms, it's likely that it will be available soon after that, and very unlikely that process A needed 700ms given the 200ms mean and 70ms std dev.

So why can't we use math to calculate how often we retry, similar to looking at the time left on the dryer in the laundry analogy? We can pretty easily track how many processes are waiting for a lock using a memcached key and incr/decr, and can easily calculate the operation duration distribution. And even if we don't keep track of the waiters, we can still use a linear algorithm and it won't be making the incorrect conclusion that the exponential algorithm did.

Implementation
Here is one implementation the described algorithm: http://gist.github.com/480851

The goal of this algorithm is to minimize the amount of time a resource spends unlocked despite a process that could use it, while keeping in mind reasonable pressure on the system.

I did some tests with the algorithm back when I wrote in Nov 2009, but don't have the data or code anymore. I intend on revisiting this to collect data on the distribution of tries, time spent waiting, and the amount of time the resource spends unlocked. I'll update this post when I get time to do that.

Using a bloom filter to optimize a SQL query

Mon, 08 Feb 2010 07:08:48 -0800

Several weeks ago, we started hitting some bottle necks with our database. We have 3 MySQL machines: 1 master, 1 online slave and 1 offline slave. The online slave started lagging pretty badly, about 0.46 seconds per second. This post is about one of the things I did to alleviate the load on this machine and eliminate slave lag.

The slave database was only used for a couple types of queries: comments and user search. This meant that people would post comments, see them (write-through to memcache) and then see them disappear next time they came back (they didn't exist on the slave yet). In addition, our user search page would be out-dated and individuals' details would be incorrect. In order to find out which of these queries was a bigger contributor to slave lag, I watched the output of 'show full processlist' and looked for queries that took a long time (you can alternatively enable the slow query log, but that has a bit of overhead and MySQL 5.0 requires a restart).

The offending query quickly became clear:
SELECT * FROM users WHERE facebook_uid in () and installed = 1;



Installed means they have added our application on Facebook

This query is executed to find your list of Facebook friends who are also playing our game. The way MySQL executes this query is by looking up each value in the list of integers in the index on facebook_uid, which for many people is several hundred to thousands of integers, and checking the installed field on those that had a row. The biggest problem with this query is that a lot of time was wasted looking up people who were not in the database or are not installed; we couldn't avoid querying for the people who are playing the game since we needed at least a little bit of information about them to show to the user.

We needed a way to efficiently reduce this huge list of users to remove the ones who would not be part of the result set. There are 10s of millions of installed users, so we can't simply keep a set in memory and have it be faster than MySQL's index lookups. I decided to use a bloom filter, which is one of the most memory efficient ways to check set membership. In this case, the set is defined as Facebook User IDs whose installed field is equal to '1' in our database. Using a bloom filter, we were able to significantly reduce the number of users we checked in the database; and because bloom filters can't have false negatives, we'd never incorrectly exclude an installed user from that query.

There are a couple bloom filter implementations for Ruby, but neither seemed to provide a service interface. So I decided to write one: http://github.com/arya/bloom_filter. This bloom filter has been in production for a couple weeks now. It's currently doing adds in an average of 2.92 milliseconds, and filters an average of 317 integers in average 12.7 milliseconds.

The bloom filter can be used either as a service, or in process. Check out the readme for details.

ActiveRecord bug with reload and new_record

Wed, 27 Jan 2010 06:02:59 -0800

In ActiveRecord, the new_record? method returns true when the object represents a record that is not persisted. This is implemented by a trivial flag set when you initialize a new record with Model.new. The reload method in ActiveRecord calls find and passes the primary key (usually id) for that model, replaces the attributes hash with the newly retrieved (and clears some internal caches). A bug arises when you try to call reload on a new record after you've set a value for id.

After you initialize a new record, set its id, and then call reload which subsequently loads all the attributes from the database (if it exists), the AR object is left in a weird state; when you save it tries to insert despite the fact that the record knows it came from the database and will raise an error.

record_a = Model.new(:some_attribute => "some value") record_b = Model.new(:some_attribute => "some other value") record_a.id = record_b.id = 1 record_a.save record_b.reload # this reloads the record with the attributes from record_a record_b.save # this tries to do an insert and throws ActiveRecord::RecordNotUnique

The fix is relatively simple: set @new_record to false in the reload method.

I should explain why I'm doing this in the first place and when this would happen to you. I ran into this in a highly concurrent environment where the primary key of a record is not an auto increment column. When two processes do a select to find a record that doesn't exist yet and try to create it, they will race to create that record in the DB; losing that race means you'll get ActiveRecord::RecordNotUnique. After this patch goes through, you'll be able to call reload and continue as you were. If you are writing an application where the primary key is not auto increment, you should be aware of this race case wherever you create records of that model and rescue ActiveRecord::RecordNotUnique.

RackProctitle

Tue, 26 Jan 2010 06:04:24 -0800

The CodeRack Contest recently came to a close and RackProctitle came in 3rd place.

RackProctitle allows you to see what request a given web server process is handling. By running "watch -n 0.1 ‘ps aux | grep “USER\|rack” ’ you can see all your rack processes, what request each is currently handling and for how long (or if they’re idle), what their last request was and how long it took, how many requests have been handled in the life of this process, and the requests in queue (if it's receiving more than 1 request at a time).

It ends up being really useful to diagnose problems. Example output [screenshot]:
rack/14a0b0 [1/682]: handling 0.0s /users/6014086 You can set a prefix, which will replace “rack” in the line above, and you can set an APPLICATION_VERSION constant which will be after the prefix. You can use this to distinguish between different rack processes if you run multiple applications on the same machine. In the example above, APPLICATION_VERSION is set to the first six characters of the git commit hash.

It's a gem:
sudo gem install rack_proctitle In some file somewhere in your application (optional): APPLICATION_VERSION = File.read(File.join(RAILS_ROOT, “REVISION”))[0,6] # if you use Capistrano, which creates a REVISION file for you. In your rackup file (prefix is optional):
use RackProctitle, :prefix => “application_name” We use watch to get a continuously updating status of our rack processes: watch -n 0.1 ‘ps aux | grep “USER\|application_name” | grep -v grep ; uptime; hostname’ The watch linux utility is one of those things that is hard to Google, so here's the link if it's not installed or not in your package manager: http://procps.sourceforge.net/

Concept and output formatting is based on Alexander Staubo’s mongrel_proctitle but has more features, is thread safe, and is Rack compatible. For more information on why mongrel_proctitle is not thread safe, read this post.

As always, feature requests and bug reports always welcome.