<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
  <title>James on Software</title>
  <subtitle>Thoughts on software, testing...</subtitle>
  <link href="http://jamesgolick.com/" rel="self" />
  <link href="http://jamesgolick.com/" />
  <updated>2014-09-30T12:24:48-04:00</updated>
  <author>
    <name>James Golick</name>
    <email>jamesgolick@gmail.com</email>
  </author>
  <id>http://jamesgolick.com/</id>
  
  <entry>
    <title>Introducing packagecloud.io</title>
    <link href="/2014/4/13/introducing-packagecloud.html" />
    <id>tag:jamesgolick.com,2014-04-13:1397438597</id>
    <updated>2014-04-13T21:23:17-04:00</updated>
    <content type="html">&lt;p&gt;Every operation of any sufficient complexity winds up accumulating a bunch of internal debs, rpms, rubygems, and whatever other package types are relevant to the company. Unfortunately, finding a place to put them is usually a big headache.&lt;/p&gt;

&lt;p&gt;The tooling for managing package repositories is buggy, frustrating, and outdated at best. Even if you do manage to get a repository set up, getting HTTPS, GPG signing, backups, and all the other stuff right takes forever, and even then, deployment of new packages is usually some bizarre process involving &lt;tt&gt;scp&lt;/tt&gt; and running a script.&lt;/p&gt;

&lt;p&gt;Having managed this process (poorly) several times ourselves, &lt;a href=&quot;http://twitter.com/joedamato&quot;&gt;Joe&lt;/a&gt; and I set out to try to come up with a better solution. We've been in private beta for a couple of months, and we're finally ready to do a public release.&lt;/p&gt;

&lt;p&gt;If you've ever managed a package repository before, I bet this'll make you as excited as it makes me:&lt;/p&gt;

&lt;div style=&quot;text-align: center&quot;&gt;
  &lt;img src=&quot;/images/packagecloud-multi.png&quot;/&gt;
&lt;/div&gt;




&lt;br /&gt;


&lt;p&gt;Check it out at: &lt;a href=&quot;https://packagecloud.io&quot;&gt;packagecloud.io&lt;/a&gt; and join us in #packagecloud on freenode. We'd love to hear your feedback.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Women in Open Source</title>
    <link href="/2013/10/16/women-in-open-source.html" />
    <id>tag:jamesgolick.com,2013-10-16:1381935656</id>
    <updated>2013-10-16T11:00:56-04:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;tl;dr&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Doing open source work is a path to success in the software industry.&lt;/li&gt;
&lt;li&gt;Women are underrepresented in open source, and are therefore largely missing out on this substantial opportunity.&lt;/li&gt;
&lt;li&gt;The open source world is daunting for everyone, and it's especially bad for women.&lt;/li&gt;
&lt;li&gt;I am offering my time to mentor a woman who is interested in getting involved in open source &amp;mdash; particularly deeply technical and/or low level work. Also, &lt;a href=&quot;#informal&quot;&gt;this&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The last paragraph explains how to get in touch if you're interested.&lt;/li&gt;
&lt;/ul&gt;


&lt;hr/&gt;


&lt;p&gt;The work that I've done in open source has led to nearly every major leap I've made since I started working as a programmer. It's how I got my last job (an executive position), how I get speaking gigs, and how I met nearly every one of my current set of friends and mentors.&lt;/p&gt;

&lt;p&gt;That last point is important: research shows a link between something called &quot;social capital&quot; and career advancement. Your circle of friends &amp;mdash; a representation of your social capital &amp;mdash; acts as a source of information, resources, and credibility &lt;sup&gt;[&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;]&lt;/sup&gt;. In other words, your friends teach you stuff, help you out, and if they're successful and prominent, they make you look good (Twitter retweets might make this an especially big deal in our world).&lt;/p&gt;

&lt;p&gt;I've experienced this first hand. In my last job, I was &amp;mdash; quite frankly &amp;mdash; failing miserably until I met a couple of guys who helped me understand how to think about what I was doing in the right way. With their help, I turned things around so well that I get invited to speak all over the world about my successes in that area. I met those guys as a result of open source work. Quite a snowball.&lt;/p&gt;

&lt;p&gt;I have personal examples for days, but thankfully, you don't need to take my word for it. The bibliography for &lt;sup&gt;[&lt;a href=&quot;#foot1&quot;&gt;1&lt;/a&gt;]&lt;/sup&gt; is literally a list of research on this topic.&lt;/p&gt;

&lt;hr /&gt;


&lt;p&gt;Women are terribly underrepresented in open source &lt;sup&gt;[&lt;a href=&quot;#foot2&quot;&gt;2&lt;/a&gt;]&lt;/sup&gt;. Since open source is an effective way to accumulate social capital in the software world, women's underrepresentation is an enormous missed opportunity for career advancement, success, and &amp;mdash; frankly &amp;mdash; making money.&lt;/p&gt;

&lt;p&gt;I get it, though: the open source world is inhospitable to basically everyone, and it's a lot worse for women. Whether we consciously intend it to be or not, open source is a boys club. Although the men I hang around with have been very welcoming and inclusive towards women who are interested in joining our circle, being passively inclusive isn't enough. We need to actively recruit and mentor women until we reach critical mass &lt;sup&gt;[&lt;a href=&quot;#foot3&quot;&gt;3&lt;/a&gt;]&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;That's why I've decided to offer my time as a mentor to a woman who is interested in getting involved in open source &amp;mdash; particularly in something deeply technical and/or low level, since that's my area of interest and one where women are particularly poorly represented. Specifically, I'm talking about working on things like databases (distributed or otherwise), memory allocators, virtual machines, operations tooling, etc. The kinds of things I &lt;a href=&quot;/&quot;&gt;write&lt;/a&gt; about on this blog.&lt;/p&gt;

&lt;p&gt;I'm willing to offer half a day a week of my time for whatever I can do to help, whether it's teaching, pair programming, debugging, connecting with smart people I know, or any and every combination of those things and whatever else. I'm new at this, so we'll have to figure it out as we go.&lt;/p&gt;

&lt;p&gt;&lt;a name=&quot;informal&quot;&gt;&lt;/a&gt;
If you don't have time for &amp;mdash; or interest in &amp;mdash; open source work, but feel that you might benefit from more informal regular chats (IMs, IRC, whatever) about whatever it is that you're working on, I'd love to help in any way I can, so definitely still get in touch. This is the type of relationship that I have with many of my mentors and it's tremendously useful.&lt;/p&gt;

&lt;p&gt;Drop me an &lt;a href=&quot;mailto:jamesgolick@gmail.com&quot;&gt;email&lt;/a&gt; with a bit of background about who you are and the type of thing you might be interested in working on. &lt;strong&gt;You don't have to already be working on low-level or deeply technical stuff.&lt;/strong&gt; A few years ago, I wasn't either. All you need to be successful is interest and motivation.&lt;/p&gt;

&lt;ol style=&quot;font-size:12px&quot;&gt;
  &lt;li&gt;&lt;a name=&quot;foot1&quot;&gt;&lt;/a&gt;&lt;a href=&quot;http://www.aom.pace.edu/amj/April2001/seibert.pdf&quot;&gt;A Social Capital Theory of Career Success, Scott E. Seibert, Maria L. Kraimer and Robert C. Liden&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot2&quot;&gt;&lt;/a&gt;&lt;a href=&quot;http://www.flosspols.org/deliverables/FLOSSPOLS-D16-Gender_Integrated_Report_of_Findings.pdf&quot;&gt;Gender: Integrated Report of Findings, Dawn Nafus, James Leach, Bernhard Krieger&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;foot3&quot;&gt;&lt;/a&gt;Incidentally, the research shows that mentoring is a key contributor to career advancement, higher salaries, and more job satisfaction. It also shows that women typically have less access to mentoring, arguably a key facet of the glass ceiling.
    &lt;ol&gt;
      &lt;li&gt;&lt;a href=&quot;http://onlinelibrary.wiley.com/doi/10.1111/j.1468-0432.2010.00521.x/pdf&quot;&gt;Impressing for Success: A Gendered Analysis of a Key Social Capital Accumulation Strategy, Savita Kumra, Susan Vinnicombe&lt;/a&gt;&lt;/li&gt;
    &lt;/ol&gt;
  &lt;/li&gt;
&lt;/ol&gt;

</content>
  </entry>
  
  <entry>
    <title>How tcmalloc Works</title>
    <link href="/2013/5/19/how-tcmalloc-works.html" />
    <id>tag:jamesgolick.com,2013-05-19:1368979797</id>
    <updated>2013-05-19T12:09:57-04:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;tl;dr&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is a long blog post that goes into a lot of detail about one of the highest performance memory allocators around.&lt;/li&gt;
&lt;li&gt;You should probably read &lt;a href=&quot;http://gperftools.googlecode.com/svn/trunk/&quot;&gt;the code&lt;/a&gt; instead of reading this article.&lt;/li&gt;
&lt;li&gt;If you aren't familiar with the topic of memory allocation, you should read my last blog post &lt;a href=&quot;/2013/5/15/memory-allocators-101.html&quot;&gt;Memory Allocators 101&lt;/a&gt; first.&lt;/li&gt;
&lt;/ul&gt;


&lt;hr /&gt;


&lt;p&gt;&lt;code&gt;tcmalloc&lt;/code&gt; is a memory allocator that's optimized for high concurrency situations. The &lt;code&gt;tc&lt;/code&gt; in &lt;code&gt;tcmalloc&lt;/code&gt; stands for &lt;code&gt;thread cache&lt;/code&gt; &amp;mdash; the mechanism through which this particular allocator is able to satisfy certain (often most) allocations locklessly. It's probably the most well-conceived piece of software I've ever had the pleasure of reading, and although I can't realistically cover every detail, I'll do my best to go over the important points.&lt;/p&gt;

&lt;p&gt;Like most modern allocators, &lt;code&gt;tcmalloc&lt;/code&gt; is page-oriented, meaning that the internal unit of measure is usually pages rather than bytes. This has the effect of making it easier to reduce fragmentation, and increase locality in various ways. It also makes keeping track of metadata far simpler. &lt;code&gt;tcmalloc&lt;/code&gt; defines a page as &lt;code&gt;8192&lt;/code&gt; bytes&lt;sup&gt;&lt;a href=&quot;#footnote1&quot;&gt;[1]&lt;/a&gt;&lt;/sup&gt;, which is actually 2 pages on most Linux systems.&lt;/p&gt;

&lt;p&gt;Chunks can be thought of as divided into two top-level categories. &quot;Small&quot; chunks are smaller than &lt;code&gt;kMaxPages&lt;/code&gt; (defaults to 128) and are further divided into size classes and satisfied by the thread caches or the central per-size class caches. &quot;Large&quot; chunks are &lt;code&gt;&amp;gt;= kMaxPages&lt;/code&gt; and are always satisfied by the central &lt;code&gt;PageHeap&lt;/code&gt;.&lt;/p&gt;
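To make the split concrete, here's a toy sketch in Python (the page size and kMaxPages default come from this post; the function name is mine, not tcmalloc's):

```python
PAGE_SIZE = 8192      # tcmalloc's logical page size, per the post
K_MAX_PAGES = 128     # default boundary between "small" and "large"

def allocation_route(size_bytes):
    """Return which part of the allocator would satisfy a request."""
    pages = -(-size_bytes // PAGE_SIZE)  # ceiling division to whole pages
    if pages >= K_MAX_PAGES:
        return "PageHeap (large)"
    return "thread cache / central cache (small)"
```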

&lt;h3&gt;Size Classes&lt;/h3&gt;

&lt;p&gt;By default, &lt;code&gt;tcmalloc&lt;/code&gt; creates &lt;code&gt;86&lt;/code&gt; size classes for &quot;small&quot; chunks, each of which has several important properties that define thread cache behaviour as well as fragmentation and waste characteristics.&lt;/p&gt;

&lt;p&gt;The number of pages allocated at once for a particular size class is one such property. It is carefully defined such that transfers between the central and thread caches are within a range that strikes a balance between wasting chunks sitting around unused in thread caches, and having to go to the central cache too often, causing contention for its lock. The code which determines this number also guarantees that the amount of waste per size class is at most &lt;code&gt;12.5%&lt;/code&gt;, and that the alignment guarantees of the &lt;code&gt;malloc&lt;/code&gt; API are respected.&lt;/p&gt;
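One way to see where the waste bound comes from: if each size class is at most an eighth bigger than the previous one, rounding a request up to its class can't waste much. This toy table is illustrative only (the real SizeMap computation is far more careful about alignment, page counts, and small sizes):

```python
import bisect

def build_size_classes(base=128, max_size=4096):
    """Toy size-class table where each class grows by an eighth."""
    classes = [base]
    while max_size > classes[-1]:
        classes.append(classes[-1] + max(1, classes[-1] // 8))
    return classes

def round_to_class(size, classes):
    """Smallest class big enough to hold `size`."""
    return classes[bisect.bisect_left(classes, size)]
```

For every request size in the table's range, the fraction of the allocated chunk that goes unused stays under 12.5 percent.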

&lt;p&gt;Size class data is stored in &lt;code&gt;SizeMap&lt;/code&gt;, which is the first thing to be initialized on startup.&lt;/p&gt;

&lt;h3&gt;Thread Caches&lt;/h3&gt;

&lt;p&gt;Each thread cache is a lazily initialized &lt;a href=&quot;http://en.wikipedia.org/wiki/Thread-local_storage&quot;&gt;thread-local&lt;/a&gt; data structure containing one (singly-linked) free list per size class, along with metadata tracking the current total size of its contents.&lt;/p&gt;

&lt;p&gt;Allocations and deallocations from thread caches are lockless and &lt;a href=&quot;http://en.wikipedia.org/wiki/Time_complexity&quot;&gt;constant-time&lt;/a&gt; in the best case. If the thread cache doesn't already contain a chunk for the size class that is being allocated, it has to fetch some chunks for that class from the central cache, of which there is one per size class. If the thread cache becomes too full (more on what that means in a second) on deallocation, chunks are migrated back to the central cache. Each central cache has its own lock to reduce contention during such migrations.&lt;/p&gt;
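A rough sketch of that fast path. The class name and batch size here are invented, and the real thread cache uses intrusive lists and size accounting, but the shape is the same: pop locally, refill from the central cache on a miss, and keep deallocation purely local:

```python
from collections import defaultdict, deque

class ToyThreadCache:
    BATCH = 4  # chunks fetched from the central cache per miss (made up)

    def __init__(self, central):
        self.central = central                # size class id to deque of chunks
        self.free_lists = defaultdict(deque)  # per-class local free lists

    def allocate(self, size_class):
        lst = self.free_lists[size_class]
        if not lst:                           # miss: take the (locked) slow path
            for _ in range(self.BATCH):
                if self.central[size_class]:
                    lst.append(self.central[size_class].popleft())
        return lst.popleft() if lst else None # hit path needs no lock at all

    def deallocate(self, size_class, chunk):
        self.free_lists[size_class].append(chunk)  # always local, O(1)
```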

&lt;p&gt;As chunks are migrated in and out of a thread cache, it bounds its own size in two interesting ways.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, there is an overall cap on the combined size of all the thread caches. Each cache keeps track of its total contents as chunks are migrated to and from the central caches, as well as allocated or deallocated. Initially, each cache is assigned an equal share of the overall total. However, as some caches inevitably need more or less space, there is a clever algorithm whereby one cache can &quot;steal&quot; unused space from one of its neighbours.&lt;/li&gt;
&lt;li&gt;Second, each free list has a maximum size, which gets increased in an interesting way as objects are migrated into it from the central cache. If the list exceeds its maximum size, chunks are released to the central cache.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;If a thread cache has exceeded its maximum size after a migration from the central cache or on deallocation, it first attempts to find some extra headroom in its own free-lists by checking to see if they have any excess that can be released to the central caches. Chunks are considered excess if they have been added to a free list since the last allocation that the list satisfied&lt;sup&gt;&lt;a href=&quot;#footnote2&quot;&gt;[2]&lt;/a&gt;&lt;/sup&gt;. If it can't free up any space that way, it will attempt to &quot;steal&quot; space from one of its neighbouring thread caches, which requires holding the &lt;code&gt;pageheap_lock&lt;/code&gt;.&lt;/p&gt;
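The excess heuristic can be sketched like this. Here a plain length snapshot stands in for the low-water-style counter the real implementation uses, but the idea is the same: anything pushed since the last pop is fair game to give back:

```python
from collections import deque

class FreeListWithExcess:
    def __init__(self):
        self.items = deque()
        self.length_at_last_pop = 0  # snapshot taken whenever we satisfy an allocation

    def push(self, chunk):
        self.items.append(chunk)

    def pop(self):
        chunk = self.items.pop()
        self.length_at_last_pop = len(self.items)
        return chunk

    def release_excess(self):
        """Drop chunks added since the last pop; return how many were dropped."""
        excess = max(0, len(self.items) - self.length_at_last_pop)
        for _ in range(excess):
            self.items.pop()
        return excess
```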

&lt;p&gt;Central caches have their own system for managing space across all the caches in the system. Each is capped at either &lt;code&gt;1MB&lt;/code&gt; of chunks or 1 entry, whichever is greater. As central caches need more space, they can &quot;steal&quot; it from their neighbours, using a similar mechanism to the one employed by thread caches. If a thread cache attempts to migrate objects back to a central cache that is full and unable to acquire more space, the central cache will release those objects to the &lt;code&gt;PageHeap&lt;/code&gt;, which is where it got them in the first place.&lt;/p&gt;

&lt;h3&gt;Page Heap&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;PageHeap&lt;/code&gt; can be thought of as the root of the whole system. When chunks aren't floating around the caches or allocated in the running application, they're living in one of the &lt;code&gt;PageHeap&lt;/code&gt;'s free lists. This is where chunks are allocated in the first place, using &lt;code&gt;TCMalloc_SystemAlloc&lt;/code&gt; and ultimately released back to the operating system, using &lt;code&gt;TCMalloc_SystemRelease&lt;/code&gt;. It's also where &quot;large&quot; allocations are satisfied and provides the interface for tracking heap metadata.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;PageHeap&lt;/code&gt; manages &lt;code&gt;Span&lt;/code&gt; objects, which represent a contiguous run of pages. Each &lt;code&gt;Span&lt;/code&gt; has several important properties.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;PageID start&lt;/code&gt; is the start address of the memory the &lt;code&gt;Span&lt;/code&gt; describes. &lt;code&gt;PageID&lt;/code&gt; is &lt;code&gt;typedef&lt;/code&gt;'d to &lt;code&gt;uintptr_t&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Length length&lt;/code&gt; is the number of pages in the &lt;code&gt;Span&lt;/code&gt;. &lt;code&gt;Length&lt;/code&gt; is also &lt;code&gt;typedef&lt;/code&gt;'d to &lt;code&gt;uintptr_t&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Span *next&lt;/code&gt; and &lt;code&gt;Span *prev&lt;/code&gt; are pointers for when the &lt;code&gt;Span&lt;/code&gt; is in one of the doubly linked free-lists in the &lt;code&gt;PageHeap&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A bunch more stuff, but this post is getting really long.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The &lt;code&gt;PageHeap&lt;/code&gt; has &lt;code&gt;kMaxPages + 1&lt;/code&gt; free lists &amp;mdash; one for each span length from &lt;code&gt;0...kMaxPages&lt;/code&gt; and one for lengths greater than that. The lists are doubly linked and split into &lt;code&gt;normal&lt;/code&gt; and &lt;code&gt;returned&lt;/code&gt; sections.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;normal&lt;/code&gt; section contains &lt;code&gt;Span&lt;/code&gt;s whose pages are definitely mapped in to the process's address space.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;returned&lt;/code&gt; section contains &lt;code&gt;Span&lt;/code&gt;s whose pages have been returned to the operating system using &lt;code&gt;madvise&lt;/code&gt; with &lt;code&gt;MADV_FREE&lt;/code&gt;. The OS is free to reclaim those pages as necessary. However, if the application uses that memory before it has been reclaimed, the call to &lt;code&gt;madvise&lt;/code&gt; is effectively negated. Even in the case that the memory &lt;em&gt;has&lt;/em&gt; been reclaimed, the kernel will remap those addresses to a freshly zeroed region of memory. So, not only is it safe to reuse pages that have been returned, it's an important strategy for reducing heap fragmentation.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The &lt;code&gt;PageHeap&lt;/code&gt; also contains the &lt;code&gt;PageMap&lt;/code&gt;, which is a &lt;a href=&quot;http://en.wikipedia.org/wiki/Radix_tree&quot;&gt;radix tree&lt;/a&gt; that maps addresses to their respective &lt;code&gt;Span&lt;/code&gt; objects, and the &lt;code&gt;PageMapCache&lt;/code&gt;, which maps a chunk's &lt;code&gt;PageID&lt;/code&gt; to its size class for chunks that are in the cache system. This is the mechanism through which &lt;code&gt;tcmalloc&lt;/code&gt; stores its metadata, rather than in headers and footers attached to the actual pointers. Although it is somewhat less space efficient, it is substantially more cache efficient, since all of the involved data structures are slab allocated.&lt;/p&gt;

&lt;p&gt;Allocations from the &lt;code&gt;PageHeap&lt;/code&gt; are performed via &lt;code&gt;PageHeap::New(Length n)&lt;/code&gt;, where &lt;code&gt;n&lt;/code&gt; is the number of pages being requested.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, the free lists for lengths &lt;code&gt;&amp;gt;= n&lt;/code&gt; (unless &lt;code&gt;n&lt;/code&gt; is &lt;code&gt;&amp;gt;= kMaxPages&lt;/code&gt;) are traversed looking for a &lt;code&gt;Span&lt;/code&gt; big enough to satisfy &lt;code&gt;n&lt;/code&gt;. If one is found, it is removed from the list and returned. This type of allocation is best-fit, but because it's not address-ordered, it is suboptimal as far as fragmentation is concerned &amp;mdash; presumably a performance tradeoff. The &lt;code&gt;normal&lt;/code&gt; lists are all checked before moving on to the &lt;code&gt;returned&lt;/code&gt; lists. I'm not sure exactly why.&lt;/li&gt;
&lt;li&gt;If none of those lists have a fitting &lt;code&gt;Span&lt;/code&gt;, the large lists are traversed, looking for an address-ordered best fit. This algorithm is &lt;code&gt;O(n)&lt;/code&gt; across all the &lt;code&gt;Span&lt;/code&gt;s in both large lists, which can get very expensive in situations where concurrency is fluctuating dramatically and the heap has become fragmented. I have written &lt;a href=&quot;https://code.google.com/p/gperftools/issues/detail?id=532&amp;amp;thanks=532&amp;amp;ts=1369179481&quot;&gt;a patch&lt;/a&gt; which reorganizes the large lists into a &lt;a href=&quot;http://en.wikipedia.org/wiki/Skip_list&quot;&gt;skip list&lt;/a&gt; if they exceed a configurable total size to improve large allocation performance for applications which encounter this circumstance.&lt;/li&gt;
&lt;li&gt;If the &lt;code&gt;Span&lt;/code&gt; that has been found is at least one page bigger than the requested allocation, it is split into a chunk sufficient to satisfy the allocation, and whatever is left over is re-added to the appropriate free list before returning the newly allocated chunk.&lt;/li&gt;
&lt;li&gt;If no suitable &lt;code&gt;Span&lt;/code&gt; is found, the &lt;code&gt;PageHeap&lt;/code&gt; attempts to grow itself by at least &lt;code&gt;n&lt;/code&gt; pages before starting the process again from the beginning. If it is unsuccessful at finding a suitable chunk the second time around, it returns &lt;code&gt;NULL&lt;/code&gt;, which ultimately results in &lt;code&gt;ENOMEM&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
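Ignoring locks, heap growth, and my skip list patch, the search above boils down to something like this. It's a sketch with invented names and dict-based spans, not tcmalloc's code:

```python
K_MAX_PAGES = 128

class ToyPageHeap:
    def __init__(self):
        self.free_lists = {n: [] for n in range(1, K_MAX_PAGES)}  # length to spans
        self.large = []                                           # spans of length >= kMaxPages

    def new(self, n):
        """Allocate a span of n pages, or None (real code would grow, then ENOMEM)."""
        if K_MAX_PAGES > n:
            for length in range(n, K_MAX_PAGES):   # smallest sufficient length first
                if self.free_lists[length]:
                    return self._carve(self.free_lists[length].pop(), n)
        fits = [s for s in self.large if s["length"] >= n]  # large path: best fit
        if not fits:
            return None
        best = min(fits, key=lambda s: s["length"])
        self.large.remove(best)
        return self._carve(best, n)

    def _carve(self, span, n):
        """Split an oversized span and re-file the remainder."""
        leftover = span["length"] - n
        if leftover > 0:
            rest = {"start": span["start"] + n, "length": leftover}
            target = self.large if leftover >= K_MAX_PAGES else self.free_lists[leftover]
            target.append(rest)
        return {"start": span["start"], "length": n}
```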


&lt;p&gt;Deallocations to the &lt;code&gt;PageHeap&lt;/code&gt; are performed via &lt;code&gt;PageHeap::Delete(Span* span)&lt;/code&gt;. Their effect is that the &lt;code&gt;Span&lt;/code&gt; is merged in to the appropriate free-list.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, the adjacent &lt;code&gt;Span&lt;/code&gt; objects (both left and right) are acquired from the &lt;code&gt;PageMap&lt;/code&gt;. If either or both of them are free, they are removed from whatever free-list they happen to be on and coalesced together with &lt;code&gt;span&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Then, &lt;code&gt;span&lt;/code&gt; is prepended to whichever free list it now belongs on.&lt;/li&gt;
&lt;li&gt;Finally, the &lt;code&gt;PageHeap&lt;/code&gt; checks to see whether it's time to release memory to the operating system, and releases some if it is.&lt;/li&gt;
&lt;/ul&gt;
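The coalescing step can be sketched like this, with a plain dict of free spans standing in for the PageMap lookup (the real thing is a radix tree keyed by page number):

```python
def coalesce(span, free_spans):
    """Merge `span` (start, length) with free left/right neighbours.

    `free_spans` maps start page to length. Returns the merged span and
    records it in `free_spans`, like prepending to the right free list.
    """
    start, length = span
    right = start + length                 # a free right neighbour starts here
    if right in free_spans:
        length += free_spans.pop(right)
    for left_start, left_len in list(free_spans.items()):
        if left_start + left_len == start: # a free left neighbour ends here
            free_spans.pop(left_start)
            start, length = left_start, left_len + length
            break
    free_spans[start] = length
    return (start, length)
```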


&lt;p&gt;Each time a &lt;code&gt;Span&lt;/code&gt; is returned to the &lt;code&gt;PageHeap&lt;/code&gt;, its member &lt;code&gt;scavenge_counter_&lt;/code&gt; is decremented by the &lt;code&gt;length&lt;/code&gt; of that &lt;code&gt;Span&lt;/code&gt;. If &lt;code&gt;scavenge_counter_&lt;/code&gt; drops below &lt;code&gt;0&lt;/code&gt;, the last &lt;code&gt;Span&lt;/code&gt; on one of the free lists or the &lt;code&gt;large&lt;/code&gt; list is released: it is removed from the &lt;code&gt;normal&lt;/code&gt; section and added to the appropriate &lt;code&gt;returned&lt;/code&gt; section for possible reuse later. &lt;code&gt;scavenge_counter_&lt;/code&gt; is then reset to &lt;code&gt;min(kMaxReleaseDelay, (1000.0 / FLAGS_tcmalloc_release_rate) * number_of_pages_released)&lt;/code&gt;. So, tuning &lt;code&gt;FLAGS_tcmalloc_release_rate&lt;/code&gt; has a substantial effect on when memory gets released.&lt;/p&gt;
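As straight arithmetic, the reset formula looks like this. The constant value here is made up for illustration; check the source for the real kMaxReleaseDelay:

```python
K_MAX_RELEASE_DELAY = 1024  # illustrative placeholder, not the real constant

def next_scavenge_counter(pages_released, release_rate):
    """The reset formula quoted above. A higher release_rate yields a
    smaller counter, so memory is released to the OS more often."""
    return min(K_MAX_RELEASE_DELAY, (1000.0 / release_rate) * pages_released)
```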

&lt;h3&gt;Conclusions&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;This blog post is incredibly long. Congratulations for getting here. And yet I barely feel like I've covered anything.&lt;/li&gt;
&lt;li&gt;If this kind of problem is interesting to you, I &lt;em&gt;highly&lt;/em&gt; recommend reading the &lt;a href=&quot;http://gperftools.googlecode.com/svn/trunk/&quot;&gt;source code&lt;/a&gt;. Although &lt;code&gt;tcmalloc&lt;/code&gt; is very complex, the code is extremely approachable and well commented. I barely know &lt;code&gt;C++&lt;/code&gt; and was still able to write a substantial patch. Particularly with this blog post as a guide, there's not much to be afraid of.&lt;/li&gt;
&lt;li&gt;I'll cover &lt;code&gt;jemalloc&lt;/code&gt; in a future episode.&lt;/li&gt;
&lt;li&gt;Listen to my (and &lt;a href=&quot;http://timetobleed.com&quot;&gt;Joe Damato&lt;/a&gt;'s) &lt;a href=&quot;http://realtalk.io&quot;&gt;podcast&lt;/a&gt; &amp;mdash; it's about this kind of stuff.&lt;/li&gt;
&lt;/ul&gt;


&lt;ol style=&quot;font-size: 12px;&quot;&gt;
  &lt;li&gt;&lt;a name=&quot;footnote1&quot;&gt;&lt;/a&gt;Unless the experimental feature &lt;code&gt;TCMALLOC_LARGE_PAGES&lt;/code&gt; is enabled.&lt;/li&gt;
  &lt;li&gt;&lt;a name=&quot;footnote2&quot;&gt;&lt;/a&gt;This is sort of a simplification of a more complicated system, but should be good enough for this purpose.&lt;/li&gt;
&lt;/ol&gt;




&lt;br/&gt;

</content>
  </entry>
  
  <entry>
    <title>Memory Allocators 101</title>
    <link href="/2013/5/15/memory-allocators-101.html" />
    <id>tag:jamesgolick.com,2013-05-15:1368632222</id>
    <updated>2013-05-15T11:37:02-04:00</updated>
    <content type="html">&lt;p&gt;For the last few weeks, I've been working on a couple of patches to tcmalloc, Google's super high performance memory allocator. I'm going to post about them soon, but first I thought it would be cool to give some background about what a memory allocator actually does. So, if you've ever wondered what happens when you call &lt;code&gt;malloc&lt;/code&gt; or &lt;code&gt;free&lt;/code&gt;, read on.&lt;/p&gt;

&lt;hr /&gt;


&lt;p&gt;A memory allocator's responsibility is to manage free blocks of memory. If you've never read a &lt;code&gt;malloc&lt;/code&gt; implementation, you may have assumed that calling &lt;code&gt;free&lt;/code&gt; simply causes memory to be released to the operating system. But acquiring memory from the OS has a cost, so allocators tend to keep free chunks around for a while for possible re-use before deciding to release them.&lt;/p&gt;

&lt;p&gt;Managing &lt;code&gt;free&lt;/code&gt;d memory is an incredibly interesting and hard problem with two main concerns: performance and reducing heap fragmentation / waste:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do we organize free blocks of memory such that we can quickly locate a sufficiently large block (or determine that we lack one) when someone calls &lt;code&gt;malloc&lt;/code&gt; without making calls to &lt;code&gt;free&lt;/code&gt; prohibitively expensive?&lt;/li&gt;
&lt;li&gt;What can we do to reduce fragmentation and waste in the face of sometimes drastically changing allocation patterns over the lifetime of a (potentially long-running) program? It's worth noting that heap fragmentation can have a substantial impact on CPU cache efficiency.&lt;/li&gt;
&lt;li&gt;As a bonus, there's also the matter of concurrency, but that's probably beyond the scope of this post.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The most fun part of this problem is that our two primary objectives are often in direct opposition. For example, keeping one linked list of free blocks per allocation size (say, rounded up to number of &lt;a href=&quot;http://en.wikipedia.org/wiki/Page_(computer_memory)&quot;&gt;pages&lt;/a&gt;) can make calls to &lt;code&gt;malloc&lt;/code&gt; best case constant time, but unless some waste is accepted and chunks are kept around long enough to be reused, the worst case path will be taken more often than not.&lt;/p&gt;
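Here's a tiny simulation of that segregated-list design, showing the constant-time hit path and the slow path on a miss. The names and the page size are mine, and a counter stands in for the OS:

```python
from collections import defaultdict

PAGE = 4096

def pages_for(size):
    return -(-size // PAGE)  # round a byte count up to whole pages

class SegregatedFreeLists:
    def __init__(self):
        self.bins = defaultdict(list)  # page count to free chunk addresses
        self.brk = 0                   # pretend program break

    def malloc(self, size):
        n = pages_for(size)
        if self.bins[n]:
            return self.bins[n].pop()  # best case: constant time
        addr = self.brk                # worst case: grow the heap
        self.brk += n * PAGE
        return addr

    def free(self, addr, size):
        self.bins[pages_for(size)].append(addr)  # constant time
```

Note the tradeoff in action: chunks sitting in a bin are waste until a same-sized request reuses them, but releasing them eagerly would force the worst case path on every allocation.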

&lt;p&gt;Of course, there also are a multitude of other issues to consider, such as how to decide to release memory to the operating system, and how to avoid becoming the bottleneck in a concurrent program (I'm looking at you, glibc). And the implementation details are interesting, too.&lt;/p&gt;

&lt;h3&gt;Implementation&lt;/h3&gt;

&lt;p&gt;A very basic &lt;code&gt;malloc&lt;/code&gt; implementation might use the Linux system call &lt;a href=&quot;http://linux.die.net/man/2/sbrk&quot;&gt;&lt;code&gt;sbrk(2)&lt;/code&gt;&lt;/a&gt; to acquire memory from the operating system and a linked list to store free chunks. That would make calls to &lt;code&gt;free&lt;/code&gt; constant time, but &lt;code&gt;malloc&lt;/code&gt; would be &lt;code&gt;O(n)&lt;/code&gt;.&lt;/p&gt;
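A sketch of that basic design, with a counter standing in for sbrk(2) and a Python list standing in for the free list (real allocators do this with pointer arithmetic in C, of course):

```python
class NaiveAllocator:
    def __init__(self):
        self.free_list = []  # (addr, size) pairs, unordered
        self.brk = 0         # simulated program break

    def sbrk(self, n):
        addr = self.brk
        self.brk += n
        return addr

    def malloc(self, size):
        for i, (addr, chunk_size) in enumerate(self.free_list):
            if chunk_size >= size:       # first fit: O(n) scan
                self.free_list.pop(i)
                return addr
        return self.sbrk(size)           # nothing fits: extend the heap

    def free(self, addr, size):
        self.free_list.append((addr, size))  # O(1): just push
```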

&lt;p&gt;Of course, the allocator needs to store metadata about each chunk it manages, such as its size, free/in-use status, free-list pointer(s), etc. But since you can't exactly call &lt;code&gt;malloc&lt;/code&gt; in an allocator, it's common to store metadata in a &quot;header&quot; that just precedes the address that is handed to the application. So, if the header is 16 bytes in size, then the header would start at &lt;code&gt;ptr - 16&lt;/code&gt;. Pointer arithmetic galore.&lt;/p&gt;

&lt;div style=&quot;text-align:center;&quot;&gt;&lt;img src=&quot;/images/malloc_header.png&quot; /&gt;&lt;/div&gt;


&lt;br/&gt;




&lt;script src=&quot;https://gist.github.com/jamesgolick/5593158.js?file=header.c&quot;&gt;&lt;/script&gt;


&lt;p&gt;In order to reduce fragmentation and promote memory reuse, it's common for &lt;code&gt;malloc&lt;/code&gt; implementations to attempt to coalesce freed blocks of memory with adjacent ones, if those also happen to be free. If metadata is being stored in a header, then it's easy to determine the size and status of the chunk to the right.&lt;/p&gt;

&lt;div style=&quot;text-align:center;&quot;&gt;&lt;img src=&quot;/images/malloc_right.png&quot; /&gt;&lt;/div&gt;


&lt;br/&gt;




&lt;script src=&quot;https://gist.github.com/jamesgolick/5593158.js?file=right.c&quot;&gt;&lt;/script&gt;


&lt;p&gt;But the header doesn't provide any way of determining the size of the chunk to the left, so coalescing &lt;code&gt;malloc&lt;/code&gt; implementations frequently put the size of each block in a footer, which is typically sized to fit a &lt;code&gt;size_t&lt;/code&gt;. Then finding the chunk to the left would look something like this:&lt;/p&gt;

&lt;div style=&quot;text-align:center;&quot;&gt;&lt;img src=&quot;/images/malloc_left.png&quot; /&gt;&lt;/div&gt;


&lt;br/&gt;




&lt;script src=&quot;https://gist.github.com/jamesgolick/5593158.js?file=left.c&quot;&gt;&lt;/script&gt;


&lt;p&gt;Things get even more complicated because subsequent invocations of system calls like &lt;code&gt;sbrk&lt;/code&gt; or &lt;code&gt;mmap&lt;/code&gt; aren't guaranteed to return contiguous virtual addresses. So, when looking for chunks to coalesce, care has to be taken to make sure that invalid pointers aren't dereferenced by adding to a pointer that's on the edge of what's being managed.&lt;/p&gt;

&lt;p&gt;Typically this means creating and maintaining a separate data structure with which to keep track of the regions of virtual address space that the allocator is managing. Some allocators, such as &lt;code&gt;tcmalloc&lt;/code&gt;, simply store their metadata in that data structure rather than in headers and footers, which avoids a lot of error-prone pointer arithmetic.&lt;/p&gt;

&lt;hr /&gt;


&lt;p&gt;I could probably continue writing about this forever, but this seems like a good place to stop for now. If your interest is piqued and you'd like to learn more about memory allocators, I highly recommend diving into writing your own &lt;code&gt;malloc&lt;/code&gt; implementation. It's a challenging project, but it's fun and it'll give you a lot of insight into an important part of how your computer works.&lt;/p&gt;

&lt;p&gt;Soon I'll follow up on this post with one about the allocator work I've been doing lately. Also, if you're interested in this kind of stuff, check out my new &lt;a href=&quot;http://realtalk.io&quot;&gt;podcast&lt;/a&gt; where &lt;a href=&quot;http://timetobleed.com&quot;&gt;Joe Damato&lt;/a&gt; and I talk about systems programming. We'll definitely be covering allocators in the next few weeks sometime.&lt;/p&gt;

&lt;p&gt;Some allocator resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://man7.org/linux/man-pages/man3/malloc.3.html&quot;&gt;malloc(3) man page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.cs.cmu.edu/afs/cs/academic/class/15213-f10/www/lectures/17-allocation-basic.pdf&quot;&gt;http://www.cs.cmu.edu/afs/cs/academic/class/15213-f10/www/lectures/17-allocation-basic.pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://goog-perftools.sourceforge.net/doc/tcmalloc.html&quot;&gt;tcmalloc docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/jeremie-koenig/glibc/tree/master-beware-rebase/malloc&quot;&gt;glibc malloc source&lt;/a&gt; &amp;mdash; this is some random github repo, so I have no idea how much it's been fucked with&lt;/li&gt;
&lt;/ul&gt;

</content>
  </entry>
  
  <entry>
    <title>Introducing The Real Talk Podcast</title>
    <link href="/2013/4/28/introducing-the-real-talk-podcast.html" />
    <id>tag:jamesgolick.com,2013-04-29:1367209602</id>
    <updated>2013-04-29T00:26:42-04:00</updated>
    <content type="html">&lt;p&gt;[Joe Damato] and I have released the inaugural episode of our new, highly technical podcast realtalk.io.&lt;/p&gt;

&lt;p&gt;We will be doing frequent technical deep dives and releasing our conversations raw and unedited with all errors, omissions, awkward pauses, and curse words intact.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href=&quot;http://realtalk.io&quot;&gt;website&lt;/a&gt;, tune in on soundcloud, and &lt;a href=&quot;http://feeds.feedburner.com/realtalkio&quot;&gt;subscribe&lt;/a&gt;.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>MRI's Method Caches</title>
    <link href="/2013/4/14/mris-method-caches.html" />
    <id>tag:jamesgolick.com,2013-04-14:1365955274</id>
    <updated>2013-04-14T12:01:14-04:00</updated>
    <content type="html">&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Method resolution is expensive, so method caches are crucial to invocation performance.&lt;/li&gt;
&lt;li&gt;Your Ruby code probably calls methods kind of often, so invocation performance matters.&lt;/li&gt;
&lt;li&gt;MRI's method cache invalidation strategy is quite naive, leading to very low hit rates in most Ruby code.&lt;/li&gt;
&lt;li&gt;I wrote &lt;a href=&quot;https://github.com/jamesgolick/ruby/tree/jamesgolick&quot;&gt;some patches&lt;/a&gt; that substantially improve the situation.&lt;/li&gt;
&lt;li&gt;This blog post is surprisingly uninflammatory.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;The Long Version&lt;/h3&gt;

&lt;p&gt;One of MRI's big performance problems is that method cache expiry is global. That is, any time you make a change to any class anywhere, the entire VM's method caches get busted at the same time. This is why you'll frequently hear people saying that &quot;calling &lt;code&gt;Object#extend&lt;/code&gt; is bad&quot;.&lt;/p&gt;

&lt;p&gt;Actually, it's not just &lt;code&gt;Object#extend&lt;/code&gt;. &lt;a href=&quot;https://twitter.com/charliesome&quot;&gt;Charlie Somerville&lt;/a&gt; put together what I believe to be an &lt;a href=&quot;http://charlie.bz/blog/things-that-clear-rubys-method-cache&quot;&gt;exhaustive list&lt;/a&gt; of things that clear MRI's method caches. Method cache busting is so pervasive that it's almost impossible to avoid using somebody's code that does it. Disaster.&lt;/p&gt;

&lt;p&gt;Let's back up for a second, though. What is a method cache and why are they important?&lt;/p&gt;

&lt;h3&gt;Method Cache Basics&lt;/h3&gt;

&lt;p&gt;Take the following class hierarchy:&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/jamesgolick/5347185.js&quot;&gt;&lt;/script&gt;


&lt;p&gt;Internally, MRI stores methods in a hash table on the &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/include/ruby/ruby.h#L630&quot;&gt;&lt;code&gt;RClass struct&lt;/code&gt;&lt;/a&gt;. When you call an inherited method on a descendent, MRI has to walk up the class hierarchy to find it, checking the method table at each step to see if there's anything there to call.&lt;/p&gt;

&lt;p&gt;So, in our above example, if we wanted to call &lt;code&gt;hello&lt;/code&gt; on an instance of &lt;code&gt;E&lt;/code&gt;, MRI would have to execute method lookups on &lt;code&gt;E&lt;/code&gt;, &lt;code&gt;D&lt;/code&gt;, &lt;code&gt;C&lt;/code&gt;, and &lt;code&gt;B&lt;/code&gt;, only to finally find the method in &lt;code&gt;A&lt;/code&gt;'s method table.&lt;/p&gt;
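&lt;p&gt;The walk itself is easy to model. Here's a rough C sketch with made-up structs (MRI's real lookup deals with a hash table of methods per class, not a single method slot):&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of a class: at most one defined method per class, plus a
 * superclass pointer. Resolution walks the chain until some class
 * defines the method, counting how many method tables it checked. */
typedef struct toy_class {
    const char *name;
    const char *method; /* NULL if this class defines nothing */
    struct toy_class *super;
} toy_class;

static const toy_class *resolve(const toy_class *klass, const char *meth,
                                int *tables_checked) {
    for (const toy_class *c = klass; c != NULL; c = c->super) {
        (*tables_checked)++;
        if (c->method != NULL && strcmp(c->method, meth) == 0)
            return c;
    }
    return NULL; /* NoMethodError territory */
}
```

&lt;p&gt;Calling &lt;code&gt;hello&lt;/code&gt; on an instance of &lt;code&gt;E&lt;/code&gt; in this model checks five method tables before finding the method on &lt;code&gt;A&lt;/code&gt; &amp;mdash; that repeated walking is exactly the cost a method cache lets you skip.&lt;/p&gt;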

&lt;p&gt;It turns out that method resolution is actually quite expensive, which is why method caches exist. Rather than resolving a method each time we want to call it, we cache a reference to the method somewhere we can get to it cheaply, substantially reducing the cost of subsequent invocations.&lt;/p&gt;

&lt;p&gt;But Ruby is dynamic, so those caches can't necessarily live forever. If we call &lt;code&gt;String#gsub&lt;/code&gt;, for example, and then &lt;code&gt;undef&lt;/code&gt; it without expiring the method cache, it'll still be reachable. Cache invalidation is hard, as we know, so MRI takes a somewhat brute force approach.&lt;/p&gt;

&lt;h3&gt;How MRI's Method Caches Work&lt;/h3&gt;

&lt;p&gt;Currently, MRI has two types of method caches. Ruby code is compiled down to MRI's &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/insns.def&quot;&gt;instructions&lt;/a&gt;. Each instruction has some data associated with it. The &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/insns.def#L998&quot;&gt;send&lt;/a&gt; instruction has an &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/vm_core.h#L128&quot;&gt;&lt;code&gt;iseq_inline_cache_entry&lt;/code&gt;&lt;/a&gt;, which acts as an inline method cache.&lt;/p&gt;

&lt;p&gt;You can read the logic &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/vm_insnhelper.c#L1369&quot;&gt;here&lt;/a&gt;. Basically, it works like this: look at the inline cache. &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/vm_insnhelper.c#L1374-L1375&quot;&gt;If it's valid&lt;/a&gt;, &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/vm_insnhelper.c#L1376&quot;&gt;use it&lt;/a&gt;. If not, &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/vm_insnhelper.c#L1379-1382&quot;&gt;go actually look up the method and cache it&lt;/a&gt;. Pretty much exactly what you'd expect.&lt;/p&gt;

&lt;p&gt;In the case of an inline instruction cache miss, there's actually a &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/vm_method.c#L25&quot;&gt;secondary, global method cache&lt;/a&gt;. Oddly, though, the global method cache is limited to &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/vm_method.c#L8&quot;&gt;2048&lt;/a&gt; entries, and its semantics for deciding what to keep and what to dump are &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/vm_method.c#L10&quot;&gt;essentially random&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It's not as unlikely as you might hope for two methods in a tight loop to be clobbering each other's entries in the global method cache table.&lt;/p&gt;

&lt;p&gt;Both caches' entries have a field that stores the &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/094f2f438ae79f0b9afe9b4f5966c5bf1a6a3d9c/vm_insnhelper.h#L224&quot;&gt;&lt;code&gt;ruby_vm_global_state_version&lt;/code&gt;&lt;/a&gt; from when they were filled. A cache entry is considered valid if it matches the &lt;code&gt;klass&lt;/code&gt; pointer and the current &lt;code&gt;ruby_vm_global_state_version&lt;/code&gt;. So, incrementing the global state version by &lt;code&gt;1&lt;/code&gt; invalidates all of the inline instruction caches as well as the global method cache.&lt;/p&gt;
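&lt;p&gt;In other words, validity is just a two-field comparison, and global invalidation is a single increment. A simplified model in C (hypothetical struct names; the real definitions live in &lt;code&gt;vm_core.h&lt;/code&gt; and &lt;code&gt;vm_insnhelper.c&lt;/code&gt;):&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* One version number per VM; bumping it invalidates every cache entry
 * at once, because no entry's stored version can match anymore. */
static uint64_t ruby_vm_global_state_version = 1;

typedef struct {
    const void *klass;   /* receiver's class when the entry was filled */
    uint64_t vm_state;   /* global version when the entry was filled */
    const void *method;  /* the resolved method */
} toy_cache_entry;

static void cache_fill(toy_cache_entry *e, const void *klass,
                       const void *method) {
    e->klass = klass;
    e->vm_state = ruby_vm_global_state_version;
    e->method = method;
}

static int cache_valid(const toy_cache_entry *e, const void *klass) {
    return e->klass == klass &&
           e->vm_state == ruby_vm_global_state_version;
}
```

&lt;p&gt;So a single &lt;code&gt;ruby_vm_global_state_version++&lt;/code&gt; is all it takes to expire every inline cache and the entire global method cache at once.&lt;/p&gt;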

&lt;p&gt;This has the effect of making invalidation very cheap, but far reaching. Whenever you make a change to any class, call extend, or do any of the other things detailed in &lt;a href=&quot;http://charlie.bz/blog/things-that-clear-rubys-method-cache&quot;&gt;Charlie's article&lt;/a&gt;, all of the method caches that have built up since your program started become invalid and you have to repay the cost of method resolution all over again.&lt;/p&gt;

&lt;h3&gt;The Numbers&lt;/h3&gt;

&lt;p&gt;After years of complaining about Ruby's method caching behaviour, I finally decided to instrument it a couple of weeks ago. I found that for our application, the method cache was being invalidated at least 20 times per request, and that around 10% of our request profile was spent performing method resolution. For our application, the cost of Ruby's global method cache invalidation was extremely high.&lt;/p&gt;

&lt;p&gt;The average cost of each method resolution for our production application is around one microsecond, which doesn't sound like a lot but it adds up. We were seeing at least 8000 cache misses per request, totalling 8ms or more.&lt;/p&gt;

&lt;p&gt;As part of my instrumentation patchset, I also created a mechanism that logs a stacktrace each time the method cache is invalidated. I found that the majority of invalidations in our app were from inside of ActiveRecord - &lt;a href=&quot;https://github.com/rails/rails/pull/10058&quot;&gt;some&lt;/a&gt; easier to fix than others. Many were also caused by random gems doing things like instantiating &lt;code&gt;OpenStruct&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;At this point, it started seeming somewhat impractical to go and patch rails and all these other gems that I use, so I decided to investigate the amount of effort that would be required to actually solve the problem in MRI.&lt;/p&gt;

&lt;h3&gt;Hierarchical Method Cache Invalidation&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;I'll be using &lt;code&gt;class&lt;/code&gt; to mean &lt;code&gt;class&lt;/code&gt; or &lt;code&gt;module&lt;/code&gt; here, since they have the same backing structure in the VM.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Ruby's inheritance tree is a &lt;a href=&quot;http://en.wikipedia.org/wiki/Directed_acyclic_graph&quot;&gt;directed acyclic graph&lt;/a&gt;, and the semantics of method resolution mean that a change to a given class only affects it and its descendents. So, in an ideal scenario, we would only need to invalidate the method caches for those branches of the inheritance tree.&lt;/p&gt;

&lt;p&gt;I've written a patch for MRI that implements such an algorithm, and it's currently serving 100% of our production traffic. We've seen around a 9% reduction in average latency with this patch. Others who've tried it haven't seen such big jumps. Your mileage may vary.&lt;/p&gt;

&lt;p&gt;The algorithm is actually quite simple (credit to Charlie Nutter / JRuby for the idea). We keep a 64-bit, monotonically increasing global sequence counter (one per VM instance). Every time we allocate a new &lt;code&gt;RClass&lt;/code&gt;, we increment the sequence and assign the class that unique value.&lt;/p&gt;

&lt;p&gt;Method and inline cache entries are tagged with the class's sequence value when they're filled. When a class is modified, we traverse the class hierarchy downwards from the modification point, assigning each class a new sequence number.&lt;/p&gt;

&lt;p&gt;A method cache entry is considered valid if its sequence number matches the current sequence number of the class. So, if the class or one of its parents has been modified since the cache entry was created, the class will have a new sequence number, and the entry will therefore have been invalidated.&lt;/p&gt;
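&lt;p&gt;Here's a toy version of the scheme in C (made-up structs with a fixed-size subclass array; the real patch hangs this state off &lt;code&gt;RClass&lt;/code&gt;):&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each class carries a sequence number drawn from a global counter.
 * Modifying a class renumbers it and every descendent, so any cache
 * entry tagged with an old sequence silently becomes invalid. */
static uint64_t vm_seq = 0;

typedef struct toy_class {
    uint64_t seq;
    struct toy_class *subclasses[8];
    int nsubclasses;
} toy_class;

static void class_init(toy_class *c, toy_class *super) {
    c->seq = ++vm_seq;
    c->nsubclasses = 0;
    if (super != NULL)
        super->subclasses[super->nsubclasses++] = c;
}

/* Invalidation: walk only the subtree rooted at the modified class. */
static void class_modified(toy_class *c) {
    c->seq = ++vm_seq;
    for (int i = 0; i < c->nsubclasses; i++)
        class_modified(c->subclasses[i]);
}
```

&lt;p&gt;Cache entries filled against classes outside the modified subtree keep matching sequence numbers, so they survive &amp;mdash; which is the whole point.&lt;/p&gt;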

&lt;h3&gt;Performance&lt;/h3&gt;

&lt;p&gt;For our application, this cache invalidation strategy substantially reduces the number of method cache misses we see in production and has reduced request latency by ~8-9%, but there &lt;em&gt;are&lt;/em&gt; tradeoffs involved. Since invalidation requires a graph traversal, it's a lot more expensive than the current strategy of merely incrementing an integer.&lt;/p&gt;

&lt;p&gt;If your application makes frequent modifications to classes and modules which have a large number of descendents, the cost of invalidation may outweigh the increase in method cache hit rate. That said, I would imagine that such modifications are relatively uncommon and should be considered a bad practice either way.&lt;/p&gt;

&lt;p&gt;It's also worth noting here that while this patch will likely improve the performance of apps that employ the strategy of extending arbitrary objects to implement &lt;a href=&quot;http://en.wikipedia.org/wiki/Data,_context_and_interaction&quot;&gt;DCI&lt;/a&gt;, that pattern is still a performance problem, because it creates tons of one-off metaclasses whose methods wind up being mostly uncacheable.&lt;/p&gt;

&lt;h3&gt;The Code&lt;/h3&gt;

&lt;p&gt;My patchset includes several things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subclass tracking: &lt;code&gt;Class#subclasses&lt;/code&gt;, and &lt;code&gt;Module#included_in&lt;/code&gt;. Rails implements this with an O(n) traversal of &lt;code&gt;ObjectSpace&lt;/code&gt;. With my patches, that's no longer necessary.&lt;/li&gt;
&lt;li&gt;Hierarchical method cache invalidation: the subject of this whole article.&lt;/li&gt;
&lt;li&gt;Method cache instrumentation: &lt;code&gt;RubyVM::MethodCache&lt;/code&gt; has several useful singleton methods you may want to track, including &lt;code&gt;hits&lt;/code&gt;, &lt;code&gt;misses&lt;/code&gt;, &lt;code&gt;miss_time&lt;/code&gt;, &lt;code&gt;invalidation_time&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;You can find the code in &lt;a href=&quot;https://github.com/jamesgolick/ruby/tree/jamesgolick&quot;&gt;my branch&lt;/a&gt; or install it with &lt;code&gt;rvm install jamesgolick&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I am planning to submit these patches back upstream, but I have to port them to Ruby 2.0 first, so I guess that's my next project. Huge thanks and credit to &lt;a href=&quot;https://twitter.com/tmm1&quot;&gt;Aman Gupta&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/charliesome&quot;&gt;Charlie Somerville&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/headius&quot;&gt;Charles Nutter&lt;/a&gt;, and &lt;a href=&quot;https://github.com/funny-falcon&quot;&gt;funny-falcon&lt;/a&gt; for all their code, help, and testing!&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>The Cost of Ruby 1.9.3's GC::Profiler</title>
    <link href="/2012/11/19/the-cost-of-ruby-1.9.3-s-gc-profiler.html" />
    <id>tag:jamesgolick.com,2012-11-19:1353356029</id>
    <updated>2012-11-19T15:13:49-05:00</updated>
    <content type="html">&lt;p&gt;This is a long one, and y'all are busy I'm sure so here's the tl;dr:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you run ruby in production, you need to keep track of GC stats.&lt;/li&gt;
&lt;li&gt;Ruby 1.9.3's &lt;a href=&quot;http://www.ruby-doc.org/core-1.9.3/GC/Profiler.html&quot;&gt;&lt;code&gt;GC::Profiler&lt;/code&gt;&lt;/a&gt; does a bunch of really weird shit.

&lt;ul&gt;
&lt;li&gt;It keeps a 104-byte sample of every GC run since it was enabled, forever.&lt;/li&gt;
&lt;li&gt;Calling &lt;a href=&quot;http://www.ruby-doc.org/core-1.9.3/GC/Profiler.html#method-c-total_time&quot;&gt;&lt;code&gt;GC::Profiler.total_time&lt;/code&gt;&lt;/a&gt; loops over every sample in memory to calculate the total.&lt;/li&gt;
&lt;li&gt;The space used to keep those samples in memory is &lt;strong&gt;never freed&lt;/strong&gt;. However, it does get reused when you call &lt;a href=&quot;http://www.ruby-doc.org/core-1.9.3/GC/Profiler.html#method-c-clear&quot;&gt;&lt;code&gt;GC::Profiler.clear&lt;/code&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Therefore: if you are using &lt;code&gt;GC::Profiler&lt;/code&gt; in production, and you're not calling &lt;code&gt;GC::Profiler.clear&lt;/code&gt; regularly, you're leaking a substantial amount of memory (&amp;gt;1GB / machine for us), slowing down garbage collection somewhat, and the cost of retrieving the stats (&lt;code&gt;GC::Profiler.total_time&lt;/code&gt;) will continue to grow without bound until the process is restarted&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;I am working on an alternative, low-overhead GC Profiler that is designed to be run in production. It's called &lt;code&gt;GC::BasicProfiler&lt;/code&gt;. You can find the patch &lt;a href=&quot;https://github.com/jamesgolick/ruby/commit/576cba1e79842f7c5ee80d3668958e1571da13d7#L0R3992&quot;&gt;here&lt;/a&gt; and follow development &lt;a href=&quot;https://github.com/jamesgolick/ruby/commit/576cba1e79842f7c5ee80d3668958e1571da13d7#L0R3992&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Also, you may want to check out &lt;a href=&quot;https://github.com/thecodeshop/ruby/commits/tcs-ruby_1_9_3&quot;&gt;this fork&lt;/a&gt; for some backports from ruby 2.0 &amp;mdash; including the COW-friendly garbage collector. Good stuff.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;The Long Version&lt;/h3&gt;

&lt;p&gt;Ruby's GC is a steaming pile of shit &amp;mdash; but that's not news to anybody. If you're running ruby in production, tracking GC behaviour is essential so that you can minimize its effects on perceived performance (hence &lt;a href=&quot;http://unicorn.bogomips.org/Unicorn/OobGC.html&quot;&gt;oob_gc&lt;/a&gt;, etc). Fortunately, Ruby 1.9.3 ships with &lt;a href=&quot;http://www.ruby-doc.org/core-1.9.3/GC/Profiler.html&quot;&gt;GC::Profiler&lt;/a&gt;, which provides detailed instrumentation on GC runs.&lt;/p&gt;

&lt;p&gt;Over the last month or so, I've been working on some rails performance tooling. Last night, I noticed that requests with my instrumentation enabled were taking around an order of magnitude longer than those without it. Weird. So, I installed &lt;a href=&quot;https://github.com/tmm1/perftools.rb&quot;&gt;perftools.rb&lt;/a&gt; and &lt;a href=&quot;https://github.com/bhb/rack-perftools_profiler&quot;&gt;rack-perftools_profiler&lt;/a&gt; and got a really surprising result (irrelevant lines omitted):&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/4113649.js?file=profile1.txt&quot;&gt;&lt;/script&gt;


&lt;p&gt;Apparently calling &lt;code&gt;GC::Profiler.total_time&lt;/code&gt; is so slow that more than 50% of request time was spent in there? Is that actually possible? My instrumentation calls &lt;code&gt;GC::Profiler.total_time&lt;/code&gt; frequently under the assumption that it's inexpensive, but obviously that was a faulty assumption unless perftools.rb is wrong. Let's take a look at the implementation. (The code in context is &lt;a href=&quot;https://github.com/ruby/ruby/blob/ruby_1_9_3/gc.c#L3627-3647&quot;&gt;here&lt;/a&gt;.)&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/4113649.js?file=total_time.c&quot;&gt;&lt;/script&gt;


&lt;p&gt;There's a loop in &lt;code&gt;total_time&lt;/code&gt;? What the fuck is going on here?&lt;/p&gt;

&lt;p&gt;Turns out that if you have &lt;code&gt;GC::Profiler&lt;/code&gt; enabled, the VM records a &lt;a href=&quot;https://github.com/ruby/ruby/blob/ruby_1_9_3/gc.c#L106-124&quot;&gt;&lt;code&gt;gc_profile_record&lt;/code&gt;&lt;/a&gt; every time the garbage collector runs.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/4113649.js?file=gc_profile_record.c&quot;&gt;&lt;/script&gt;


&lt;p&gt;Then, when you call &lt;code&gt;total_time&lt;/code&gt;, it loops over all of the &lt;code&gt;gc_profile_record&lt;/code&gt;s that have been created in order to sum the total. According to my profile, &lt;code&gt;total_time&lt;/code&gt; was responsible for more than 50% of request time. How many &lt;code&gt;gc_profile_record&lt;/code&gt;s could there actually be?&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/4113649.js?file=gdb1.txt&quot;&gt;&lt;/script&gt;


&lt;p&gt;Oh. Well I guess that explains that. So &amp;mdash; stupid question, but when do these things get freed? Apparently only in &lt;a href=&quot;https://github.com/ruby/ruby/blob/ruby_1_9_3/gc.c#L479-506&quot;&gt;&lt;code&gt;rb_objspace_free&lt;/code&gt;&lt;/a&gt; which only ever gets called &lt;a href=&quot;https://github.com/ruby/ruby/blob/ruby_1_9_3/vm.c#L1624&quot;&gt;when the VM terminates&lt;/a&gt;, so the answer is &lt;em&gt;never&lt;/em&gt;. Cool.&lt;/p&gt;
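&lt;p&gt;Putting the pieces together, the shape of the problem is simple: one record is appended per GC run, the records are never freed, and &lt;code&gt;total_time&lt;/code&gt; is an O(n) walk over all of them. A toy model (hypothetical struct; the real &lt;code&gt;gc_profile_record&lt;/code&gt; in gc.c has a couple dozen fields):&lt;/p&gt;

```c
#include <assert.h>
#include <stddef.h>

/* Every profiler-enabled GC run appends one of these; total_time then
 * sums over all of them, so its cost grows for the life of the process
 * unless GC::Profiler.clear resets the count. */
typedef struct {
    double gc_time; /* seconds spent in this GC run */
} toy_gc_profile_record;

static double toy_total_time(const toy_gc_profile_record *records,
                             size_t count) {
    double total = 0.0;
    for (size_t i = 0; i < count; i++) /* O(n) on every single call */
        total += records[i].gc_time;
    return total;
}
```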

&lt;p&gt;Upon further investigation, it's pretty clear that this whole system was designed with the expectation that you'd call &lt;a href=&quot;http://www.ruby-doc.org/core-1.9.3/GC/Profiler.html#method-c-clear&quot;&gt;&lt;code&gt;GC::Profiler.clear&lt;/code&gt;&lt;/a&gt; regularly. The profiler keeps its samples in an array at &lt;code&gt;objspace-&amp;gt;profile.record&lt;/code&gt; that it &lt;a href=&quot;https://github.com/ruby/ruby/blob/ruby_1_9_3/gc.c#L168-171&quot;&gt;increases in size by 1000&lt;/a&gt; every time it runs out of space.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/4113649.js?file=sample-array-size-increase.c&quot;&gt;&lt;/script&gt;


&lt;p&gt;&lt;strong&gt;If you don't call &lt;code&gt;GC::Profiler.clear&lt;/code&gt;, that array keeps increasing in size forever.&lt;/strong&gt; This is not documented. Obviously.&lt;/p&gt;

&lt;p&gt;On our production systems, unicorn workers that have been running for a few hours had an &lt;code&gt;objspace-&amp;gt;profile.size&lt;/code&gt; of around 350000. On x86_64, &lt;code&gt;sizeof(struct gc_profile_record)&lt;/code&gt; == 104, so around 35MB of overhead per process multiplied by 25 processes per machine for a total of nearly 1GB per machine &amp;mdash; after only 3 hours. That will grow forever until the processes are restarted.&lt;/p&gt;
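&lt;p&gt;The back-of-the-envelope arithmetic for those numbers, written out (same figures as above, nothing new):&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* 350,000 retained records at 104 bytes each, across 25 unicorn
 * workers per machine. */
static const uint64_t records_per_process = 350000;
static const uint64_t record_size = 104; /* sizeof(struct gc_profile_record) on x86_64 */
static const uint64_t processes_per_machine = 25;

static uint64_t bytes_per_process(void) {
    return records_per_process * record_size;
}

static uint64_t bytes_per_machine(void) {
    return bytes_per_process() * processes_per_machine;
}
```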

&lt;p&gt;That's the bad news.&lt;/p&gt;

&lt;h3&gt;The good news: GC::BasicProfiler&lt;/h3&gt;

&lt;p&gt;Ultimately, &lt;code&gt;GC::Profiler&lt;/code&gt; was designed to provide detailed information about every GC run &amp;mdash; probably for the VM implementers to use when tuning the GC (haha yeah right). Seriously, though, somebody probably wants that level of detail; it just isn't me. For those of us who simply want to keep track of GC stats on our production applications, we need a less expensive implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/jamesgolick/ruby/commit/576cba1e79842f7c5ee80d3668958e1571da13d7&quot;&gt;&lt;code&gt;GC::BasicProfiler&lt;/code&gt;&lt;/a&gt; is a first step towards something like that. It has a very simple, low-overhead implementation, and &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/tcs-ruby_1_9_3/gc.c#L3915-3921&quot;&gt;&lt;code&gt;GC::BasicProfiler.total_time&lt;/code&gt;&lt;/a&gt; works the way you might expect.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/4113649.js?file=gc_basic_profile_total_time.c&quot;&gt;&lt;/script&gt;


&lt;p&gt;Enabling and disabling &lt;code&gt;BasicProfiler&lt;/code&gt; works exactly the same as &lt;code&gt;Profiler&lt;/code&gt; but you don't need to call &lt;code&gt;clear&lt;/code&gt; to avoid leaking memory. In fact, there's no &lt;code&gt;clear&lt;/code&gt; method at all.&lt;/p&gt;

&lt;p&gt;If you're interested in following the development of this patch, it'll be &lt;a href=&quot;https://github.com/jamesgolick/ruby/blob/tcs-ruby_1_9_3&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;One more lol for the road&lt;/h3&gt;

&lt;script src=&quot;https://gist.github.com/4113649.js?file=lol.txt&quot;&gt;&lt;/script&gt;


&lt;p&gt;It's the little things. Don't worry, though &amp;mdash; fixed in &lt;code&gt;BasicProfiler&lt;/code&gt;.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/4113649.js?file=trulth.txt&quot;&gt;&lt;/script&gt;

</content>
  </entry>
  
  <entry>
    <title>Moving On</title>
    <link href="/2012/9/5/moving-on.html" />
    <id>tag:jamesgolick.com,2012-09-05:1346869989</id>
    <updated>2012-09-05T14:33:09-04:00</updated>
    <content type="html">&lt;p&gt;Almost four years ago, I was speaking at a software engineering conference in Montreal. At the speakers lunches, I met up with one of the founders of the conference, and we immediately hit it off. He told me about his growing company, and a month later, the consulting firm I'd been running was closed, our office vacant, and I had joined BitLove (the company that runs FetLife &amp;mdash; which was then known as Protose) as CTO. It's bittersweet to announce that as of a few weeks ago, I've decided to move on.&lt;/p&gt;

&lt;p&gt;Over the last four years, I played a huge role in every part of running FetLife. In addition to being responsible for our technology, I made business and product decisions, helped design features, wrote copy, communicated with the community, worked on support stuff, and more. I've always had an interest in &amp;mdash; and read about &amp;mdash; all this stuff, but actually having the opportunity to participate in it, make real mistakes, and have real successes was incredible.&lt;/p&gt;

&lt;p&gt;As a technologist, it's hard to imagine somewhere I could've grown more quickly. When I joined, I knew a few things about writing Rails apps. While I was there, I got the opportunity to do everything with every part of the stack. I learned how to make it all run in production for a big user base and a lot of traffic, with a tiny team.&lt;/p&gt;

&lt;p&gt;When I joined, we had ~100k users (I got user ID 129315) and Rails was serving around 50 million requests a month. Since then, our user base grew to over 1.5 million users &amp;mdash; our traffic to almost 500 million pageviews a month and over 1 billion Rails requests (not to mention requests to other services like chat). We did it with an engineering team that hovered around 2 people (including me).&lt;/p&gt;

&lt;p&gt;I'm really proud of the engineering work I did at BitLove. We operated an extremely high throughput MySQL installation, and I was able to &lt;a href=&quot;/2012/7/18/innodb-kernel-mutex-contention-and-memory-allocators.html&quot;&gt;solve&lt;/a&gt; various InnoDB scalability limitations that we encountered. I implemented a web-based IM system (similar to Facebook Chat) that hundreds of thousands of people use to send tens of millions of messages every month. I built everything from the presence and routing implementation in Erlang to the UI in Javascript. I also designed and built an extremely stable and fast activity stream architecture, almost single-handedly ran operations for the ~40 machine cluster in around 2 hours a week, and perhaps most importantly, I &lt;a href=&quot;/2012/7/7/how-to-lose-100-pounds.html&quot;&gt;dramatically improved my health&lt;/a&gt; in the process.&lt;/p&gt;

&lt;p&gt;The work I did at BitLove certainly represents the biggest challenges and accomplishments of my life and career to date. It was a wild ride with many ups and downs &amp;mdash; everything they promised a startup would be. So it was incredibly difficult to leave a growing and successful company that I had a big hand in building. But it's time for new challenges.&lt;/p&gt;

&lt;h3&gt;So what's next?&lt;/h3&gt;

&lt;p&gt;I've got a couple of really amazing opportunities on the table right now that I'm super excited about. Because I get so heavily invested in my work and like to stick around companies for many years, this is a very big decision, and I'm not taking it lightly. I'm certainly open to hearing about any opportunities you think I might be a fit for, so do get in touch!&lt;/p&gt;

&lt;h3&gt;Consulting&lt;/h3&gt;

&lt;p&gt;In the meantime, I'm available for consulting work. If your company needs help with any of the kinds of things I discussed above &amp;mdash; especially performance and scalability stuff, we should chat. &lt;a href=&quot;mailto:jamesgolick@gmail.com&quot;&gt;Email me&lt;/a&gt;.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>InnoDB kernel_mutex Contention and Memory Allocators</title>
    <link href="/2012/7/18/innodb-kernel-mutex-contention-and-memory-allocators.html" />
    <id>tag:jamesgolick.com,2012-07-18:1342651301</id>
    <updated>2012-07-18T18:41:41-04:00</updated>
    <content type="html">&lt;p&gt;&lt;em&gt;tl;dr: We found that in our case, contention for InnoDB's &lt;code&gt;kernel_mutex&lt;/code&gt; was caused by contention for a malloc arena lock. We fixed it by moving to tcmalloc. Instructions on how to do that &lt;a href=&quot;#preload-instructions&quot;&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We recently doubled the IO throughput capacity of our near-capacity MySQL master by adding a second RAID controller, and striping the two together. As we were climbing up to a record throughput peak the following weekend, there was a major db latency spike (&gt;3x).&lt;/p&gt;

&lt;p&gt;A look at &lt;code&gt;SHOW ENGINE INNODB STATUS&lt;/code&gt; indicated quite a bit of contention for InnoDB's &lt;code&gt;kernel_mutex&lt;/code&gt;.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/3146976.js?file=innodb-status&quot;&gt;&lt;/script&gt;


&lt;p&gt;&lt;em&gt;Note: the contention I observed was actually considerably worse than what I pasted above, but I didn't save the output, so this is all I have to show.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;kernel_mutex&lt;/code&gt; has &lt;a href=&quot;http://blogs.innodb.com/wp/2011/04/mysql-5-6-innodb-scalability-fix-kernel-mutex-removed/&quot;&gt;been removed&lt;/a&gt; in MySQL 5.6, but that's unfortunately not ready for production. As a workaround, the Percona guys &lt;a href=&quot;http://www.mysqlperformanceblog.com/2011/12/02/kernel_mutex-problem-or-double-throughput-with-single-variable/&quot;&gt;suggest&lt;/a&gt; modifying &lt;code&gt;innodb_sync_spin_loops&lt;/code&gt;, which had absolutely no effect for our workload. They also &lt;a href=&quot;http://www.mysqlperformanceblog.com/2011/12/02/kernel_mutex-problem-cont-or-triple-your-throughput/&quot;&gt;suggest&lt;/a&gt; lowering &lt;code&gt;innodb_thread_concurrency&lt;/code&gt;, which did reduce contention, but it also reduced concurrency, which left us right back where we started.&lt;/p&gt;

&lt;p&gt;I pulled out my &lt;a href=&quot;http://poormansprofiler.org/&quot;&gt;poor man's profiler&lt;/a&gt; to see if I could figure out exactly what was holding the lock and what it was doing with it. Here are the stacks I got.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/3146976.js?file=gistfile1.txt&quot;&gt;&lt;/script&gt;


&lt;p&gt;Immediately, we can see that lots of stuff is waiting on locks inside of malloc/free-related functions. After reading through the MySQL sources, it was clear that this thread was holding the &lt;code&gt;kernel_mutex&lt;/code&gt;.&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/3146976.js?file=lock-holder&quot;&gt;&lt;/script&gt;


&lt;p&gt;&lt;em&gt;Note: all links to glibc code below are specifically to the version that we are using.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Reading through &lt;a href=&quot;https://github.com/cdepillabout/glibc/blob/e28c88707ef0529593fccedf1a94c3fce3df0ef3/malloc/malloc.c#L4763&quot;&gt;&lt;code&gt;_int_free&lt;/code&gt;&lt;/a&gt; in the glibc sources seemed to indicate that there was only &lt;a href=&quot;https://github.com/cdepillabout/glibc/blob/e28c88707ef0529593fccedf1a94c3fce3df0ef3/malloc/malloc.c#L2362&quot;&gt;one lock&lt;/a&gt; (&lt;code&gt;malloc_state-&amp;gt;mutex&lt;/code&gt;) in there.&lt;/p&gt;

&lt;p&gt;Our glibc &lt;em&gt;was&lt;/em&gt; built with &lt;a href=&quot;https://github.com/cdepillabout/glibc/blob/e28c88707ef0529593fccedf1a94c3fce3df0ef3/malloc/Makefile#L128&quot;&gt;&lt;code&gt;--enable-experimental-malloc&lt;/code&gt;&lt;/a&gt;, which is supposed to &lt;a href=&quot;https://github.com/cdepillabout/glibc/blob/e28c88707ef0529593fccedf1a94c3fce3df0ef3/malloc/arena.c&quot;&gt;reduce contention&lt;/a&gt; by dividing the heap into multiple arenas, each with its own lock (at least as far as I understand it &amp;mdash; and I'm far from an expert).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://google-perftools.googlecode.com/svn/trunk/doc/tcmalloc.html&quot;&gt;tcmalloc&lt;/a&gt; is a malloc implementation from &lt;a href=&quot;http://google-perftools.googlecode.com/&quot;&gt;google-perftools&lt;/a&gt; that satisfies small malloc requests without locks by using a per-thread cache. Using tcmalloc should mean that the allocations inside the &lt;code&gt;kernel_mutex&lt;/code&gt; are (at least mostly) lockless.&lt;/p&gt;
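&lt;p&gt;The core idea is easy to sketch. Here's a hypothetical single-size-class version in C (tcmalloc's real implementation is far more sophisticated, with central free lists and size-class spans): frees push blocks onto a thread-local list, and allocations pop from it without taking any lock, only falling back to the shared allocator when the cache is empty.&lt;/p&gt;

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-thread cache for a single 64-byte size class. */
enum { BLOCK_SIZE = 64 };

typedef struct free_block {
    struct free_block *next;
} free_block;

static _Thread_local free_block *cache_head = NULL;

static void *cached_alloc(void) {
    if (cache_head != NULL) {      /* fast path: no lock, no syscall */
        free_block *b = cache_head;
        cache_head = b->next;
        return b;
    }
    return malloc(BLOCK_SIZE);     /* slow path: the shared allocator */
}

static void cached_free(void *ptr) {
    free_block *b = ptr;           /* push onto this thread's list */
    b->next = cache_head;
    cache_head = b;
}
```

&lt;p&gt;Because the list is thread-local, the fast path never contends with another thread &amp;mdash; which is exactly what makes the allocations under &lt;code&gt;kernel_mutex&lt;/code&gt; stop fighting over a shared arena lock.&lt;/p&gt;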

&lt;p&gt;&lt;a name=&quot;preload-instructions&quot;&gt;&lt;/a&gt;
Here's how to &lt;code&gt;LD_PRELOAD&lt;/code&gt; tcmalloc.&lt;/p&gt;

&lt;p&gt;Put this in &lt;code&gt;/usr/local/bin/mysqld_wrapper&lt;/code&gt;:&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/3146976.js?file=mysqld_wrapper.sh&quot;&gt;&lt;/script&gt;


&lt;p&gt;Put this fragment in my.cnf:&lt;/p&gt;

&lt;script src=&quot;https://gist.github.com/3146976.js?file=my.cnf&quot;&gt;&lt;/script&gt;
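&lt;p&gt;&lt;em&gt;Again in case the gist doesn't render: the idea is to point &lt;tt&gt;mysqld_safe&lt;/tt&gt; at the wrapper so that mysqld is launched with tcmalloc preloaded. This fragment is a sketch under that assumption:&lt;/em&gt;&lt;/p&gt;

```ini
[mysqld_safe]
# Hypothetical fragment: have mysqld_safe launch the wrapper instead of
# mysqld directly, so the LD_PRELOAD takes effect.
mysqld=/usr/local/bin/mysqld_wrapper
```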


&lt;p&gt;Since we moved to tcmalloc, all of the contention for the &lt;code&gt;kernel_mutex&lt;/code&gt; has completely disappeared. We're also seeing better performance overall and using ~15% less memory in total. This fix probably isn't applicable in all cases, but if you're seeing &lt;code&gt;kernel_mutex&lt;/code&gt; contention, it's worth using your poor man's profiler to see whether swapping allocators might help.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>How to Lose 100 Pounds</title>
    <link href="/2012/7/7/how-to-lose-100-pounds.html" />
    <id>tag:jamesgolick.com,2012-07-07:1341683833</id>
    <updated>2012-07-07T13:57:13-04:00</updated>
    <content type="html">&lt;p&gt;I've struggled with my weight for nearly my entire life. I went from being chubby in elementary school to overweight in high school to obese in university. At my biggest, I was almost 280 pounds (I'm 5'6&quot;). Finally, around 5 years ago, I got a spark of inspiration that ultimately led to me dropping a total of 110 pounds (and counting). Here's how I did it.&lt;/p&gt;

&lt;p&gt;But first, the obligatory before and after shots:&lt;/p&gt;

&lt;p&gt;
  &lt;img src=&quot;http://farm2.staticflickr.com/1410/1476468549_ed43518cbc_z.jpg?zz=1&quot; width=&quot;350&quot; /&gt;
  &lt;img src=&quot;https://flpics0.a.ssl.fastly.net/129/129315/0004b750-74d0-e92b-397e-2b8ad87ed1f7_720.jpg&quot; width=&quot;350&quot; /&gt;
&lt;/p&gt;


&lt;h3&gt;Motivation&lt;/h3&gt;

&lt;p&gt;Losing weight requires an enormous amount of motivation. You're going to have to change your lifestyle and make real sacrifices. It's going to be hard. Motivation will help you continue to justify the changes you've made, and prevent you from slipping back into old habits.&lt;/p&gt;

&lt;p&gt;Funny enough, I actually got my first seed of motivation from pneumonia. I was 278 pounds at the time. After three horrible, bed-ridden weeks, I was down to 258. It was painful, but it taught me the most important weight loss lesson of all: it's possible.&lt;/p&gt;

&lt;p&gt;Like a lot of other kids from my generation, I grew up overweight. When you can't remember a time when you weren't, being fat is a part of your identity. So, silly as it sounds, I think there was a part of me that believed that weight loss was impossible on some level &amp;mdash; or at least that the amount of weight I needed to lose was insurmountable.&lt;/p&gt;

&lt;p&gt;If you only take one thing away from this article, let it be that. You can lose weight. No matter how fucked up your metabolism (more on that later), no matter how long you've been overweight, it &lt;i&gt;is&lt;/i&gt; possible.&lt;/p&gt;

&lt;h3&gt;Strategies&lt;/h3&gt;

&lt;p&gt;I'm going to talk about a few of the strategies, diets, and other random things that I have tried because I think people will find them interesting. But I'll give you an easy way out of reading the rest of this article just in case you're already bored. Ready? Here it is.&lt;/p&gt;

&lt;p&gt;STOP EATING PROCESSED FOOD. THAT INCLUDES SUGAR, WHEAT PRODUCTS, SUGAR REPLACEMENTS LIKE SUCRALOSE, ASPARTAME, ETC, AND EVERYTHING ELSE YOU'RE THINKING OF THAT MIGHT BE AN EXCEPTION. EXCEPT STEVIA. YOU CAN HAVE STEVIA.&lt;/p&gt;

&lt;p&gt;Ok, so with that yelling out of the way, here's a bit about my journey.&lt;/p&gt;

&lt;h3&gt;Briefly On Exercise&lt;/h3&gt;

&lt;p&gt;I'm going to keep this short. Exercise has never helped me lose weight. For much of the time that I was grossly overweight, I was also extremely physically active, often whitewater kayaking or downhill skiing for several hours 4 or 5 days a week, and continuing to put on fat. Despite conventional wisdom to the contrary, exercise isn't an effective weight loss strategy &lt;i&gt;for me&lt;/i&gt;.&lt;/p&gt;

&lt;h3&gt;Portion Control&lt;/h3&gt;

&lt;p&gt;After I lost the pneumonia weight, I was literally terrified that I might put it back on. So I decided to try eating less. I ate all the same things, but avoided going back for seconds. I ate pasta, pizza, and dessert until I was full, but not stuffed. I lost 20 more pounds over a few months. Then, it leveled off.&lt;/p&gt;

&lt;p&gt;That, really, is the story of my weight loss effort. Strategies and diets that work for a while and then plateau. Sometimes, it's possible to break through a plateau, but other times, you need to up your game with better eating.&lt;/p&gt;

&lt;p&gt;I tried for another six or so months to break through the portion control plateau. It never happened. I was actually feeling pretty good about where I was, though, so I didn't really make much of an effort to progress for a few more months.&lt;/p&gt;

&lt;h3&gt;Lower Carb Diet&lt;/h3&gt;

&lt;p&gt;Shortly after moving from Montreal to Vancouver, I started seeing a personal trainer, hoping to accelerate my progress on the scale and in the gym. She had me keep a food journal, and immediately picked up on the amount of carbs that I was eating back then. I was vegetarian at the time, and I was eating tons of breads and pastas. She told me to eat more vegetables, and tofu, and watch my carb intake. I lost about 20 pounds before plateauing &lt;i&gt;hard&lt;/i&gt;.&lt;/p&gt;

&lt;p&gt;On this diet, I was still eating bread, pasta, and sugar, just less. And after a while, I found it impossible to continue losing weight. So I started looking for other solutions.&lt;/p&gt;

&lt;h3&gt;Eat to Live&lt;/h3&gt;

&lt;p&gt;Eat to Live is an all vegan diet designed by Dr. Joel Fuhrman. Only fruits, vegetables, legumes, nuts, and seeds are allowed; no oils, dairy, sugar, or even juices (bit of an exaggeration, but for our purposes this is accurate enough) are permitted. I had very mixed results on Eat to Live. I did lose about 15 pounds, but I had a very difficult time keeping it off, and found it very difficult to eat enough food to feel full for more than an hour at a time. I found that I was constantly eating, and still often feeling starvingly hungry.&lt;/p&gt;

&lt;p&gt;With that said, I actually know a lot of people who've had great success on ETL including my ex-girlfriend, who I was living with at the time (so we were eating nearly identically, though me significantly more than her), and my good friend &lt;a href=&quot;http://gilesbowkett.blogspot.com&quot;&gt;Giles&lt;/a&gt;, who actually introduced me to the book. Which brings me to another one of my weight loss conclusions.&lt;/p&gt;

&lt;p&gt;Everybody's body is different. Some people have amazing success on a diet, while others are incapable of losing weight. I no longer believe that there's one perfect diet out there that suits everybody. Your mileage will vary with every approach.&lt;/p&gt;

&lt;p&gt;The only consistent thing I've been able to identify across all my friends and family who've lost weight is avoiding processed foods.&lt;/p&gt;

&lt;h3&gt;Psoriasis and acne&lt;/h3&gt;

&lt;p&gt;An interesting aside here is that ETL led me to discover that it's possible to control psoriasis with diet. The medical community doesn't seem to be aware of this, but I am completely psoriasis free after years of being covered in it.&lt;/p&gt;

&lt;p&gt;At first, I thought that it was the greens that caused my skin to clear up, but since then, I've realized that it's a balance of factors. Greens &lt;i&gt;do&lt;/i&gt; help, but merely avoiding processed foods is enough to keep me completely psoriasis free. That being said, I started drinking coffee again a little while ago, and noticed that a small amount of psoriasis came back. Upping my intake of greens seems to make it clear up. So, it's a bit of a balancing act.&lt;/p&gt;

&lt;p&gt;Oh also, I'm extremely prone to acne, but I've found that avoiding high &lt;a href=&quot;http://www.glycemicindex.com/&quot;&gt;glycemic index&lt;/a&gt; foods keeps my face and body completely clear of pimples.&lt;/p&gt;

&lt;h3&gt;On Vegetarianism&lt;/h3&gt;

&lt;p&gt;I'm definitely going to get hate mail for this, but here goes anyway. I was vegetarian for most of my weight loss journey. My conclusion was ultimately that vegetarianism made it significantly more difficult to lose weight. Here's why.&lt;/p&gt;

&lt;p&gt;At home, cooking my own meals from my own groceries, vegetarianism was perfectly fine. But, every time I ate in a restaurant, on the street, or even at a friend's place, my options were nearly invariably some combination of pasta, bread, and sugar. I probably have the shittiest metabolism in the world, but when I eat that stuff, I gain weight. Lots of it.&lt;/p&gt;

&lt;p&gt;I really enjoy eating in restaurants, which made the whole thing all the more difficult. During the whole time that I was on Eat to Live, I would painstakingly lose 7 or 8 pounds by religiously sticking to the diet for a month, then travel to a conference for a week and gain 15. It was frustrating to say the least, which led me to the very difficult conclusion that I needed to at least try breaking my nearly ten years of vegetarianism.&lt;/p&gt;

&lt;h3&gt;My Current Diet&lt;/h3&gt;

&lt;p&gt;My current diet is really simple: no processed carbs (that includes 'carbless' sugar replacements except stevia). I go through periods where I eat a ton of fruits and vegetables, but lately, I've mostly been eating meat and fish.&lt;/p&gt;

&lt;p&gt;Do I miss chocolate and ice cream? Definitely. But I eat guilt-free bacon or chicken wings whenever I feel like it, and seeing results makes the sacrifice more than worthwhile.&lt;/p&gt;

&lt;p&gt;This diet means that when I go out to eat (which I do regularly), I can have a steak without feeling guilty. I tell people that I'm allergic to sugar and flour, which gives me a reasonable excuse for being the pain in the ass guy who has to ask the waiter about the ingredients in every dish on the menu. I'd encourage you to tell similar lies if they help you stick to a diet.&lt;/p&gt;

&lt;p&gt;There've been a few periods over the last year and a half where I've started eating bread again and gained back a bunch of weight. In April of this year, though, I finally committed to this diet as a more permanent lifestyle, and have stuck with it ever since. I've dropped around 40 pounds since then, and I'm not stopping until I can see my abs.&lt;/p&gt;

&lt;h3&gt;Conclusions&lt;/h3&gt;

&lt;p&gt;Everybody's body is different. Your friends may have had success with diet X, but you may not. Don't let that discourage you. You'll find something that works.&lt;/p&gt;

&lt;p&gt;The best diet is the one that you can stick to, even if the weight loss is slower. If a diet fights against your lifestyle, it's going to be that much harder to maintain. That was my problem with ETL. I love to eat out and I travel a lot, so I couldn't stick to it. And at the end of the day, I didn't lose weight. The less you have to change your lifestyle to accomplish your goals, the better your chances of success.&lt;/p&gt;

&lt;p&gt;You can lose weight, still enjoy the food you eat, and even go out to restaurants while you do it. Obviously, you won't be able to eat everything you're eating now, because if you could, you'd already be thin. But, it'll be a sacrifice worth making. The best thing you've ever done.&lt;/p&gt;
</content>
  </entry>
  
</feed>
