<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:thr="http://purl.org/syndication/thread/1.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
    <title>{ work }</title>
    
    
    <link rel="alternate" type="text/html" href="http://work.tinou.com/" />
    <id>tag:typepad.com,2003:weblog-59125</id>
    <updated>2011-04-03T04:42:25-07:00</updated>
    
    <generator uri="http://www.typepad.com/">TypePad</generator>
    <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/typepad/tinou/work" /><feedburner:info uri="typepad/tinou/work" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://hubbub.api.typepad.com/" /><entry>
        <title>Memcached for Dummies</title>
        <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/tinou/work/~3/eIxeCpyhIt0/memcached-for-dummies.html" />
        <link rel="replies" type="text/html" href="http://work.tinou.com/2011/04/memcached-for-dummies.html" thr:count="1" thr:updated="2011-04-07T02:06:09-07:00" />
        <id>tag:typepad.com,2003:post-6a00d83451ce5a69e20147e3b42c7a970b</id>
        <published>2011-04-03T04:42:25-07:00</published>
        <updated>2011-04-03T04:26:25-07:00</updated>
        <summary>The other day I was chatting with a colleague about Memcached. Eviction policy came up, and I casually mentioned that Memcache isn't strictly LRU. But a quick Bing search said Memcache is LRU, like this Wikipedia entry. Hmm, I was...</summary>
        <author>
            <name>Tinou Bao</name>
        </author>
        
        
<content type="xhtml" xml:lang="en-US" xml:base="http://work.tinou.com/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>The other day I was chatting with a colleague about <a href="http://memcached.org/" target="_blank">Memcached</a>.  Eviction policy came up, and I casually mentioned that Memcache isn't <em>strictly</em> LRU.  But a quick Bing search said Memcache <em>is</em> LRU, like <a href="http://en.wikipedia.org/wiki/Memcached" target="_blank">this Wikipedia entry</a>.  Hmm, I was 99.9% sure Memcache <em>is not</em> LRU, something to do with how it manages memory, but maybe I was wrong all these years.  After reading through some Danga mailing lists and documentation, the answer is, Memcached is LRU <em>per slab class</em>, but not globally LRU.  </p>
<p>So what does this exactly mean.  Imagine starting Memcached with these command-line arguments to override the defaults,</p>
<blockquote>
<p><span style="font-family: courier new,courier; font-size: 8pt;">memcached -vv -m 1 -I 1k -f 3 -M</span></p>
<p><span style="font-family: courier new,courier; font-size: 8pt;">-vv</span><br /><span style="font-family: courier new,courier; font-size: 8pt;">for verbose output</span></p>
<p><span style="font-family: courier new,courier; font-size: 8pt;">-m 1</span><br /><span style="font-family: courier new,courier; font-size: 8pt;">maximum 1 megabyte cache</span></p>
<p><span style="font-family: courier new,courier; font-size: 8pt;">-I 1k</span><br /><span style="font-family: courier new,courier; font-size: 8pt;">1 kilobyte (1024 byte) page size</span></p>
<p><span style="font-family: courier new,courier; font-size: 8pt;">-f 3</span><br /><span style="font-family: courier new,courier; font-size: 8pt;">chunk size growth factor of 3 </span></p>
<p><span style="font-family: courier new,courier; font-size: 8pt;">-M</span><br /><span style="font-family: courier new,courier; font-size: 8pt;">return error on memory exhausted (rather than removing items)</span><br /><span style="font-family: courier new,courier; font-size: 8pt;">just to see things more explicitly</span></p>
</blockquote>
<p><br />The output will be something like this,</p>
<blockquote>
<p><span style="font-family: courier new,courier; font-size: 8pt;">slab class   1: chunk size        96 perslab      10</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> slab class   2: chunk size       288 perslab       3</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> slab class   3: chunk size      1024 perslab       1</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;26 server listening (auto-negotiate)</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;27 server listening (auto-negotiate)</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;28 send buffer was 9216, now 5592405</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;29 send buffer was 9216, now 5592405</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;28 server listening (udp)</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;28 server listening (udp)</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;28 server listening (udp)</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;28 server listening (udp)</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;29 server listening (udp)</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;29 server listening (udp)</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;29 server listening (udp)</span><br /><span style="font-family: courier new,courier; font-size: 8pt;"> &lt;29 server listening (udp)</span></p>
</blockquote>
<p><br />The visual representation below may be helpful, </p>
<p><a href="http://tinou.blogs.com/.a/6a00d83451ce5a69e2014e6058a5b5970c-pi" style="display: inline;"><img alt="Memcache" border="0" class="asset  asset-image at-xid-6a00d83451ce5a69e2014e6058a5b5970c" src="http://tinou.blogs.com/.a/6a00d83451ce5a69e2014e6058a5b5970c-800wi" title="Memcache" /></a></p>
<p>We have a cache with a max memory limit of 1 megabyte.  This 1 megabyte of memory is divided into 1 kilobyte (1024 byte) pages.  These 1024 byte pages are assigned, as needed, on a first come, first serve basis, to the three slab classes.  Page assignment to a slab class is permanent.  The three slab classes are determined by the page size and the chunk size growth factor.  If you change the page size and/or chunk size growth factor you'll get a different slab class configuration.</p>
<p>With our settings, slab 1 has a chunk size of 96 bytes (the smallest, initial chunk size).  There are 10 chunks per page (11 chunks would be more than the 1024 byte page size) in slab 1.  Slab 2 has a chunk size of 288 (96 x 3 growth factor).  There are only 3 chunks per page in slab 2.  Finally, slab 3 has a chunk size equal to the page size and obviously only 1 chunk per page.</p>
<p>Each chunk can hold an item up to its size.  By item, I mean the key, the value and some overhead.  The item's size determines which slab it is stored in.  Small items, say 40 or 60 bytes, are put in slab 1.  Leftover storage is wasted.  That is, putting a 60 byte item in a 96 byte chunk results in 36 byte wasted space.  These 36 bytes are not available for other use.  Slab 2 holds items between 96-288 bytes.  Slab 3 holds items between 288-1024 bytes.  With our arguments, we could not store items larger than 1024 bytes.</p>
<p>So that's an overview of how Memcached is layed out.  Pages, chunks, slabs.  Now imagine we store a bunch of small items and a bunch of large items but no medium items until all pages from the page pool are used.   </p>
<p><a href="http://tinou.blogs.com/.a/6a00d83451ce5a69e2014e8733e215970d-pi" style="display: inline;"><img alt="Memcache2" border="0" class="asset  asset-image at-xid-6a00d83451ce5a69e2014e8733e215970d" src="http://tinou.blogs.com/.a/6a00d83451ce5a69e2014e8733e215970d-800wi" title="Memcache2" /></a> <br />Slab 1 might have 800 pages and slab 3 might have 200 pages (those numbers are not exact, won't add up).  Everything is good, until we try to store a medium size item.</p>
<p><a href="http://tinou.blogs.com/.a/6a00d83451ce5a69e20147e3b3fa0c970b-pi" style="display: inline;"><img alt="Memcache3" border="0" class="asset  asset-image at-xid-6a00d83451ce5a69e20147e3b3fa0c970b" src="http://tinou.blogs.com/.a/6a00d83451ce5a69e20147e3b3fa0c970b-800wi" title="Memcache3" /></a> <br />Memcached will give us an out of memory error (or won't store the item  w/o the -M option) because slab class 2 has no available pages to use.  Pages from slab class 1 and slab class 3 can't be re-assigned to slab 2.  You can think of Memcached as having separate internal caches defined by the slab classes.  </p>
<p>Finally, this brings us back to the LRU question.  Since each slab class is essentially a separate cache, there is no  global LRU, only slab class LRU.  The least recently used item won't get evicted for a new item if the items are not in the same slab class.  Depending on your storage/access patterns, you could end up with one slab class constantly evicting recently used items, while another slab having a bunch of old items that just sit around. </p>
<p><a href="http://tinou.blogs.com/.a/6a00d83451ce5a69e20147e3b41bd5970b-pi" style="display: inline;"><img alt="Memcache4" border="0" class="asset  asset-image at-xid-6a00d83451ce5a69e20147e3b41bd5970b" src="http://tinou.blogs.com/.a/6a00d83451ce5a69e20147e3b41bd5970b-800wi" title="Memcache4" /></a></p>
<p>Which brings me to another point, items/chunks are not actively reclaimed/expired.  Memached does not have a background thread that explicitly expires items, reclaiming used chunks for new items.  When a slab needs a chunk and there are no more pages, Memcache will look at the queue tail for items to evict.  Memcache will make a best effort to evict expired items (items you've explicitly set to expire after some time).  In scenario 1, item 2, an expired item, is evicted.  However, in scenario 2, item 1, which has not yet expired, will be evicted, even though item 4 would seem like the better candidate.  But since item 4 is not near the tail, Memcached stops looking and just expires item 1. </p>
<p>Another example of leaky abstraction?  The Memcached API is very simple, straightforward.  But at a certain point, you need implementation knowledge to make sure things are behaving as expected.</p></div>
</content>



    <feedburner:origLink>http://work.tinou.com/2011/04/memcached-for-dummies.html</feedburner:origLink></entry>
    <entry>
        <title>It's All Garbage</title>
        <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/tinou/work/~3/CebLmbW0RjE/a-uni%EF%AC%81ed-theory-of-garbage-collection.html" />
        <link rel="replies" type="text/html" href="http://work.tinou.com/2011/03/a-uni%EF%AC%81ed-theory-of-garbage-collection.html" thr:count="0" />
        <id>tag:typepad.com,2003:post-6a00d83451ce5a69e20147e35607e9970b</id>
        <published>2011-03-20T12:17:54-07:00</published>
        <updated>2011-03-19T21:19:48-07:00</updated>
        <summary>Amazon.com says I bought Programming Ruby (2nd edition) on November 4, 2004. Seven short years of Java later I'm finally putting it to use. One immediate question is, how does garbage collection work in Ruby? Is it a tracing or...</summary>
        <author>
            <name>Tinou Bao</name>
        </author>
        <category scheme="http://www.sixapart.com/ns/types#category" term="Algorithm" />
        <category scheme="http://www.sixapart.com/ns/types#category" term="Languages" />
        <category scheme="http://www.sixapart.com/ns/types#category" term="Programming" />
        
        
<content type="xhtml" xml:lang="en-US" xml:base="http://work.tinou.com/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Amazon.com says I bought <em>Programming Ruby</em> (2nd edition) on November 4, 2004.  Seven short years of Java later I'm finally putting it to use.  One immediate question is, how does garbage collection work in Ruby?  Is it a tracing or reference counting collector?  Is it generational, and if so, how many generations?  These questions eventually led me to an interesting paper by David Bacon, Perry Cheng and VT Rajan on <em>A Unified Theory of Garbage Collection</em>.</p>
<p style="text-align: left;">Universally, garbage collectors are viewed as either tracing or reference counting.  The Sun HotSpot VM, for example, is a tracing collector.  It traverses the object graph from the roots, finding reachable live objects and reclaiming unreachable dead objects.  PHP, on the other hand, uses reference counting to decide which objects can be safely garbage collected.  This table summarizes the typical characterization of each collector type,</p>
<table border="1" cellpadding="5" cellspacing="0" style="margin-left: auto; margin-right: auto;">
<tbody>
<tr>
<td> </td>
<td>Tracing        <br /></td>
<td>Reference Counting</td>
</tr>
<tr>
</tr>
<tr>
<td>Collection Style <br /></td>
<td>Batch<br /></td>
<td>Incremental<br /></td>
</tr>
<tr>
<td>Cost per Mutation <br /></td>
<td>None<br /></td>
<td>High<br /></td>
</tr>
<tr>
<td>Throughput <br /></td>
<td>High<br /></td>
<td>Low<br /></td>
</tr>
<tr>
<td>Pause Times<br /></td>
<td>Long<br /></td>
<td>Short<br /></td>
</tr>
<tr>
<td>Real Time <br /></td>
<td>No<br /></td>
<td>Yes<br /></td>
</tr>
<tr>
<td>Collects Cycles         <br /></td>
<td>Yes<br /></td>
<td>No<br /></td>
</tr>
</tbody>
</table>
<p style="text-align: left;">The first observation is, once optimizations are added to the basic tracing and reference counting collectors, they begin to converge.  For example, with deferred reference counting there is a periodic scanning of stack references to reduce per-mutation cost.  A generational tracing collector, on the other hand, imposes a per-mutation overhead but gains shorter pause times.  So even though collectors are typically described as one or the other, we should really view them as hybrids.</p>
<p style="text-align: left;">The second interesting observation is, tracing algorithms are the duals of the reference counting algorithms.  Both types of algorithms share the same structure.</p>
<table border="1" cellpadding="5" cellspacing="0" style="margin-left: auto; margin-right: auto;">
<tbody>
<tr>
<td> </td>
<td>Tracing        <br /></td>
<td>Reference Counting</td>
</tr>
<tr>
</tr>
<tr>
<td>Starting Point<br /></td>
<td>Roots    <br /></td>
<td>Anti-Roots<br /></td>
</tr>
<tr>
<td>Graph Traversal<br /></td>
<td>Forward from roots       <br /></td>
<td>Forward from anti-roots<br /></td>
</tr>
<tr>
<td>Objects Traversed<br /></td>
<td>Live<br /></td>
<td>Dead<br /></td>
</tr>
<tr>
<td>Initial Reference Count<br /></td>
<td>Low (0)        <br /></td>
<td>High<br /></td>
</tr>
<tr>
<td>Reference Count Reconstruction <br /></td>
<td>Addition<br /></td>
<td>Subtraction<br /></td>
</tr>
<tr>
<td>Extra Iteration         <br /></td>
<td>Sweep Phase<br /></td>
<td>Trial Deletion<br /></td>
</tr>
</tbody>
</table>
<p style="text-align: left;">Tracing collectors start at the root, traverse foward, finding live objects.  Initially all reference counts are zero (an underestimate) but live objects have their reference count incremented to their true value (keeping a bit flag is just an optimization of actual count).  Reference counting collectors start from non-root objects, traversing forward to find dead objects.  Reference counts are initially high (an overestimation that includes references from dead objects) but decremented until the true reference count is obtained.  Ultimately, the difference is just the initial value (an underestimate or overestimate) and whether we increment or decrement to obtain the true reference count.</p>
<p style="text-align: left;">So that was interesting, but lets return my the original question about Ruby garbage collection and add a few other languages in for fun.  Here's a quick summary.  Don't quote me on it, corrections appreciated. </p>
<table border="1" cellpadding="5" cellspacing="0" style="margin-left: auto; margin-right: auto;">
<tbody>
<tr>
<td>Language</td>
<td>Garbage Collection (for reference/popular implementation - others may be available)<br /></td>
</tr>
<tr>
<td>Erlang</td>
<td>tracing - per process (very small processes), compacting then generational optimization<br /></td>
</tr>
<tr>
<td>Haskell</td>
<td>tracing - generational (3), 512KB nursery<br /></td>
</tr>
<tr>
<td>Java</td>
<td>tracing - generational (2), serial, concurrent, concurrent + incremental<br /></td>
</tr>
<tr>
<td>Perl</td>
<td>reference counting - does not handle cycles<br /></td>
</tr>
<tr>
<td>PHP</td>
<td>reference countng  - 5.3 handles cycles w/ "on the fly" cycle detector, prior versions did not<br /></td>
</tr>
<tr>
<td>Python</td>
<td>reference counting - handles cycles w/ tracing cycle detector<br /></td>
</tr>
<tr>
<td>Ruby</td>
<td>tracing - non-generational<br /></td>
</tr>
<tr>
<td>Scala</td>
<td>see Java</td>
</tr>
</tbody>
</table></div>
</content>



    <feedburner:origLink>http://work.tinou.com/2011/03/a-uni%EF%AC%81ed-theory-of-garbage-collection.html</feedburner:origLink></entry>
    <entry>
        <title>Causal Atomic Broadcast</title>
        <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/tinou/work/~3/Q78taGWQ2PM/causal-atomic-broadcast.html" />
        <link rel="replies" type="text/html" href="http://work.tinou.com/2011/03/causal-atomic-broadcast.html" thr:count="0" />
        <id>tag:typepad.com,2003:post-6a00d83451ce5a69e2014e86ca103f970d</id>
        <published>2011-03-17T16:14:25-07:00</published>
        <updated>2011-03-17T17:35:50-07:00</updated>
        <summary>Spent the other day reading up on ZooKeeper a little. ZooKeeper is a distributed, open-source coordination service for distributed applications. Out of the box ZooKeeper can be used for name service, configuration and group membership. Building on ZooKeeper's primitives, you...</summary>
        <author>
            <name>Tinou Bao</name>
        </author>
        <category scheme="http://www.sixapart.com/ns/types#category" term="Distributed Computing" />
        
        
<content type="xhtml" xml:lang="en-US" xml:base="http://work.tinou.com/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Spent the other day reading up on ZooKeeper a little.  ZooKeeper is a distributed, open-source coordination service for     distributed applications.  Out of the box ZooKeeper can be used for name service, configuration and group membership.  Building on ZooKeeper's primitives, you can create barriers, locks, queues, etc.</p>
<p><a href="http://tinou.blogs.com/.a/6a00d83451ce5a69e2014e5feee374970c-pi" style="display: inline;"><img alt="Screen shot 2011-03-17 at 1.57.23 PM" border="0" class="asset  asset-image at-xid-6a00d83451ce5a69e2014e5feee374970c image-full" src="http://tinou.blogs.com/.a/6a00d83451ce5a69e2014e5feee374970c-800wi" title="Screen shot 2011-03-17 at 1.57.23 PM" /></a></p>
<p>Sounds interesting enough, but what got me really interested was ZooKeeper's atomic broadcast, the guts of ZooKeeper.  Atomic broadcast is very closely related to consensus (apparently equivalent in certain asynchronous systems).  Both are fundamental problems in distributed systems.</p>
<p>So what exactly is atomic broadcast?  To answer this, lets work our way from Reliable Broadcast.</p>
<p><strong>Reliable Broadcast</strong></p>
<p>Reliable Broadcast is the weakest type of fault-tolerant broadcast.  Informally, if a correct process broadcasts a message then all correct processes eventually receive that message.  Reliable Broadcast doesn't impose any message delivery ordering.</p>
<p><strong>FIFO Broadcast</strong></p>
<p>FIFO Broadcast is reliable broadcast that satisifies FIFO Order.  If a process broadcasts message m<sub>1</sub> before message m<sub>2</sub>, then no correct process delivers m<sub>2</sub> before it delivers m<sub>1</sub>.  For example, if message m<sub>1</sub> is a deposit of $100 into your banking account and message m<sub>2</sub> is a subsequent withdrawal of $75, you most definitely want FIFO Broadcast, otherwise your bank will charge you an overdraft fee (on top of their ridiculous <a href="http://blogs.wsj.com/deals/2011/03/16/its-coming-the-5-atm-fee/" target="_blank">$5 ATM fee</a>).</p>
<p><strong>Causal Broadcast</strong></p>
<p>Sometimes FIFO Order is not enough because FIFO ordering too limited in context (a single process).  You need Causal Order to guarantee that if the broadcast of message m<sub>1</sub> "happens before" or "causally precedes" the broadcast of message m<sub>2</sub>, then no correct process delivers m<sub>2</sub> before it delivers m<sub>1</sub>.</p>
<p>For example, imagine three processes a, b and c.  Process a broadcasts m<sub>1a</sub>, "Banks to charge $5 ATM fee."  Process b delivers m<sub>1a</sub> then broadcasts message m<sub>1b</sub>, "That's outrageous!"  Without Causal Order, process c could deliver message m<sub>1b</sub> before m<sub>1a</sub>,</p>
<p>m<sub>1b</sub> : "That's outrageous!"<br />m<sub>1a</sub> : "Banks to charge $5 ATM fee."</p>
<p>which doesn't make sense (what is outrageous?).  Causal Order requires delivery of m<sub>1a</sub> before m<sub>1b</sub>.</p>
<p><strong>Atomic Broadcast</strong></p>
<p>Causal Order imposes a partial ordering in the system, so messages without causal relationships are logically concurrent and do not have any delivery order guarantees.  This can be problematic in some cases.  For example, imagine two replicated databases DB<sub>1</sub> and DB<sub>2</sub>.  Process a broadcasts message m<sub>1a</sub>, "deposit $100."  Process b broadcasts message m<sub>1b</sub>, "charge 10% fee."  Since there is no causal relationship between the messages, this situation may arise,</p>
<p>DB<sub>1</sub> : $0 + $100 - ($100 * 10%) = $90<br />DB<sub>2</sub> : $0 - ($0 * 10%) + $100 = $100</p>
<p>which is obviously not desirably.</p>
<p>Atomic Broadcast imposes a Total Order on the system so that all messages are delivered in the same order, whatever that order maybe.  So in this example, it's either $90 or $100 but never $90 in one database and $100 in the other.</p>
<p>We can further classify Atomic Broadcast by the actual ordering it imposes.  FIFO Atomic Broadcast is then Reliable Broadcast that is FIFO Order and Total Order.  Causal Atomic Broadcast is then Reliable Broadcast that is Causal Order and Total Order.  Causal Atomic Broadcast is the strongest guarantee we've examined.</p>
<p><strong>Summary</strong></p>
<p>This diagram nicely summarizes Reliable Broadcast and the orderings above (taken from "A Modular Approach to Fault-Tolerant Broadcasts and Related Problems," which the above is basically a summary of).</p>
<p><a href="http://tinou.blogs.com/.a/6a00d83451ce5a69e2014e86c9fa00970d-pi" style="display: inline;"><img alt="Screen shot 2011-03-17 at 3.11.41 PM" border="0" class="asset  asset-image at-xid-6a00d83451ce5a69e2014e86c9fa00970d image-full" src="http://tinou.blogs.com/.a/6a00d83451ce5a69e2014e86c9fa00970d-800wi" title="Screen shot 2011-03-17 at 3.11.41 PM" /></a> <br />So there you have, atomic broadcast.</p></div>
</content>



    <feedburner:origLink>http://work.tinou.com/2011/03/causal-atomic-broadcast.html</feedburner:origLink></entry>
    <entry>
        <title>Sequential Random Sampling</title>
        <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/tinou/work/~3/0h5oDF_u_Ng/sequential-random-sampling.html" />
        <link rel="replies" type="text/html" href="http://work.tinou.com/2011/01/sequential-random-sampling.html" thr:count="0" />
        <id>tag:typepad.com,2003:post-6a00d83451ce5a69e20147e1daa5fd970b</id>
        <published>2011-01-22T12:11:26-08:00</published>
        <updated>2011-01-22T12:11:26-08:00</updated>
        <summary>Time flies when you're having fun (even when you're not). It's been more than a year since my last work related blog. Writing about random computer science and programming topics is like going to the gym; once you get off...</summary>
        <author>
            <name>Tinou Bao</name>
        </author>
        <category scheme="http://www.sixapart.com/ns/types#category" term="Algorithm" />
        
        
<content type="xhtml" xml:lang="en-US" xml:base="http://work.tinou.com/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Time flies when you're having fun (even when you're not).  It's been more than a year since my last work related blog.  Writing about random computer science and programming topics is like going to the gym; once you get off the routine it's very easy to just sit on your literal and figurative ass.</p>
<p>For my first topic back I want to relay some sequential random sampling algorithms I came across this week.  It started out when we needed some random samples from the database.  Performance issues aside, MySQL lets you do</p>
<p style="text-align: center;"><code>select column from table order by rand() limit n </code></p>
<p>to select n random rows from a table.  The randomness of the rand() function aside, someting troubled me about this approach.  Won't your results be skewed toward the lower rows when you get duplicate random values?</p>
<p>So I was searching around for some random sampling algorithms and ran across a few from the works of Jeffrey Vitter I found rather interesting.</p>
<p><strong>Algorithm S</strong></p>
<p>Sequentially go through N elements to select n random samples.  For each element generate an independent random variable U between 0 and 1 and test that NU &gt; n.  If true, skip, otherwise select for the sample and decrement n.  In either case decrement N.</p>
<p><strong>Algorithms A-D</strong></p>
<p>These algorithms optimize Algorithm S by skipping elements.  Define a skip function S(n,N) that will tell us how many elements to skip.</p>
<p>Algorithm S and its optimizations are for sequential sampling where N is known.  But sometimes N is not known or it's prohibitive to determine N.  Then we turn to reservoir samping. </p>
<p><strong>Algorithm R</strong></p>
<p>There is a reservoir R of n sample elements.  The invariant is that the reservoir is a random sample of all the elements we've processed thus far.  To maintain the invariant, as we sequentially process each element generate an independent random variable between 0 and the current element.  Then either put the current element into the reservoir or ignore it.</p>
<p><strong>Algorithms X, Y and Z</strong></p>
<p>Optimization of algorithm R so we can skip over elements, much like Algorithms A-D.</p>
<p> </p></div>
</content>



    <feedburner:origLink>http://work.tinou.com/2011/01/sequential-random-sampling.html</feedburner:origLink></entry>
    <entry>
        <title>Tries, John Mitchell, Memcache and Read-Only PHP Sessions</title>
        <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/tinou/work/~3/aofFEhvF3nA/tries-john-mitchell-memcache-and-readonly-php-sessions.html" />
        <link rel="replies" type="text/html" href="http://work.tinou.com/2009/10/tries-john-mitchell-memcache-and-readonly-php-sessions.html" thr:count="0" />
        <id>tag:typepad.com,2003:post-6a00d83451ce5a69e20120a64613a5970b</id>
        <published>2009-10-31T21:54:38-07:00</published>
        <updated>2010-12-23T01:20:28-08:00</updated>
        <summary>Too busy with work to blog lately, but here are some random thoughts I've been meaning to write about. Burst Tries Several prominent (Google ranking) articles on implementing autocomplete suggest using a trie data structure. Unfortunately, none of those articles...</summary>
        <author>
            <name>Tinou Bao</name>
        </author>
        
        
<content type="xhtml" xml:lang="en-US" xml:base="http://work.tinou.com/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Too busy with work to blog lately, but here are some random thoughts I've been meaning to write about. </p>
<p><strong>Burst Tries</strong></p>
<p>Several prominent (Google ranking) articles on implementing autocomplete suggest using a <a href="http://en.wikipedia.org/wiki/Trie">trie data structure</a>.  Unfortunately, none of those articles point out how horribly space consuming tries are.  A naive, non-compacting trie structure is exponential in space.  With an a-z alphabet you have 26 exponent word length nodes.  That's a lot of nodes.  If you want the best of both worlds, the memory requirement of a binary search tree and the speed of a trie, try the burst trie.  Burst tries: A Fast, Efficient Data Structure for String Keys by Steffen Heinz, Justin Zobel and Hugh E. Williams.</p>
<p><strong>John Mitchell</strong></p>
<p>In my <a href="http://work.tinou.com/2009/07/wtf-is-fbounded-polymorphism.html">blog on F-Bounded Polymorphism</a> I forgot to credit <a href="http://theory.stanford.edu/people/jcm/">John Mitchell</a> as one of the authors of F-Bounded Polymorphism for Object-Oriented Programming.  This was purely accidental and has nothing to do with the fact that he's a Stanford guy.</p>
<p><strong>Memcached</strong></p>
<p>Memcached is awesome.  Locks, CAS, is there anything it can't do?  PHP's support for memcached makes PHP just barely tolerable as a language.  The "M" in LAMP needs to be officially changed to Memcache, not MySQL.</p>
<p><strong>PHP</strong></p>
<p>What a mess of a language.  It has deviated too far from its original goals.  Really miss type safety, because, "after all, run time is a little late to find out whether you have a landing gear." <a href="http://archive.eiffel.com/doc/manuals/technology/typing/paper/page.html">(Eiffel dude</a>)  Really miss all of Java's built in libraries.  I find it incredible that no one using PHP has had the need for a red-black tree, because I can't seem to find an implementation of one.  I do like PHP's request processing model better than Java's, though.  Adios threads, synchronized, volatile.   </p>
<p><strong>Read Only Sessions</strong></p>
<p>I find it really puzzling that PHP doesn't have a notion of read-only sessions.  Seems like this would be a no-brainer considering that PHP web apps are typically read dominant.  And you can't write your own quickly because there's no function to decode a session string, except to decode it and <em>overwrite</em>  the current session.  Someone did provide a function to decode the session string <a href="http://php.oregonstate.edu/manual/en/function.session-decode.php">here</a>, but I'm not sure if I fully trust it to decode everything correctly.</p>
<p>Think that's about it for now.</p></div>
</content>



    <feedburner:origLink>http://work.tinou.com/2009/10/tries-john-mitchell-memcache-and-readonly-php-sessions.html</feedburner:origLink></entry>
    <entry>
        <title>Pure Inefficiencies</title>
        <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/tinou/work/~3/D32EyDuEd14/pure-inefficiencies.html" />
        <link rel="replies" type="text/html" href="http://work.tinou.com/2009/08/pure-inefficiencies.html" thr:count="0" />
        <id>tag:typepad.com,2003:post-6a00d83451ce5a69e20120a4ddb4ea970b</id>
        <published>2009-08-09T20:08:13-07:00</published>
        <updated>2009-08-09T20:08:13-07:00</updated>
        <summary>Skimmed a paper titled "Pure versus Impure Lisp" by Nicholas Pippenger the other day. Impure Lisp dialects allow mutations. Pippenger wanted to know if pure Lisp, without mutation, was less "powerful". Under two conditions he showed the following, A computation...</summary>
        <author>
            <name>Tinou Bao</name>
        </author>
        <category scheme="http://www.sixapart.com/ns/types#category" term="Functional Programming" />
        
        
<content type="xhtml" xml:lang="en-US" xml:base="http://work.tinou.com/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Skimmed a paper titled "Pure versus Impure Lisp" by Nicholas Pippenger the other day.  Impure Lisp dialects allow mutations.  Pippenger wanted to know if pure Lisp, without mutation, was less "powerful".  Under two conditions he showed the following,   </p><div style="text-align: center;"><a href="http://tinou.blogs.com/.a/6a00d83451ce5a69e20120a4dd709e970b-pi" style="display: inline;"><img alt="Picture 1" border="0" class="at-xid-6a00d83451ce5a69e20120a4dd709e970b " src="http://tinou.blogs.com/.a/6a00d83451ce5a69e20120a4dd709e970b-800wi" title="Picture 1" /></a> <br /></div><p>A computation in impure Lisp upper bounded by O(n) is lower bounded in pure Lisp by big-omega n log n.  Basically, given the same problem there's a theoretical lower limit to pure Lisp's efficiency compared to impure Lisp.  </p><p>But then the guys at <a href="http://sneezy.cs.nott.ac.uk/fplunch/">FP Lunch</a> lead me to a follow-up paper titled "More Haste, Less Speed: Lazy Versus Eager Evaluation".  In it, Bird, Jone and de Moor showed that if you changed the pure Lisp evaluator to pure <strong><em>and</em></strong> lazy as oppose to Pippenger's original pure and eager then you can get the same linear time as the impure program.  Interestingly enough, the authors noted that lazy evaluation is often implemented with mutations behind the scenes (via memoization), so it's really a discussion of restricted mutation.  This is along the lines of an earlier posting I made about the <a href="http://work.tinou.com/2009/07/the-problem-with-immutability.html">problem with immutability</a> and how compilers can, behind the scenes, solve some of the inefficiencies of immutability, but then it becomes a philosophical debate on what it really means to be immutable (or pure).</p><p>And that concludes another installment of interesting topics I'll never need to know.</p></div>
</content>



    <feedburner:origLink>http://work.tinou.com/2009/08/pure-inefficiencies.html</feedburner:origLink></entry>
    <entry>
        <title>Fail Fast, Debugging and Systems that Never Stop</title>
        <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/tinou/work/~3/frgYHGk11_w/fail-fast-debugging-and-systems-that-never-stop.html" />
        <link rel="replies" type="text/html" href="http://work.tinou.com/2009/08/fail-fast-debugging-and-systems-that-never-stop.html" thr:count="1" thr:updated="2009-11-13T11:30:42-08:00" />
        <id>tag:typepad.com,2003:post-6a00d83451ce5a69e20120a52bb5ff970c</id>
        <published>2009-08-07T18:43:27-07:00</published>
        <updated>2009-08-07T18:43:27-07:00</updated>
        <summary>If you're a Java guy you're probably familiar with the phrase "fail fucking fast" in the context of Iterators (I added the fucking part). But fail fast is much more than just iterators, it's a general design principle--and a very...</summary>
        <author>
            <name>Tinou Bao</name>
        </author>
        <category scheme="http://www.sixapart.com/ns/types#category" term="Concurrency" />
        <category scheme="http://www.sixapart.com/ns/types#category" term="Design" />
        <category scheme="http://www.sixapart.com/ns/types#category" term="Distributed Computing" />
        <category scheme="http://www.sixapart.com/ns/types#category" term="Programming" />
        <category scheme="http://www.sixapart.com/ns/types#category" term="Software" />
        
        
<content type="xhtml" xml:lang="en-US" xml:base="http://work.tinou.com/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>If you're a Java guy you're probably familiar with the phrase "fail fucking fast" in the context of Iterators (I added the fucking part).  But fail fast is much more than just iterators, it's a general design principle--and a very important one at that.  </p><p><a href="http://en.wikipedia.org/wiki/Jim_Gray_%28computer_scientist%29">Jim Gray</a> on fail fast: "it [the process] should either function correctly or it should detect the fault, signal failure and stop operating."  You can replace "the process" with component, class, etc.  Whatever you designate as your logical unit, the fail fast principle remains.  What you absolutely don't want to do is detect the fault, <strong>don't</strong> signal failure and <strong>continue</strong>.  This is why <a href="http://work.tinou.com/2009/05/pardon-the-interruption-please-dont-swallow-1.html">swallowing exceptions is one of my all time pet peeves</a>.  Fail fast is often described in terms of fault detection latency, the time between initial error detection and the fault.  You want your latency to be small.  </p><p>There are two advantages to failing fast. First, easier debugging and analysis.  It complicates things greatly if your code detects errors but continues.  Eventually something will go wrong but now you can't easily track down the root cause.  I recently worked with a component (one that shall remain nameless) that doesn't follow the fail fast principle.  During initialization the component knew it was missing a configuration file but instead of halting it continued on.  When the component was finally invoked things obviously did not work, but the root cause was not obvious.  Ended up wasting two days tracking down the root cause, when it really should have taken 30 seconds.  </p><p>Second, ability to build "systems that never stop".  Failing fast is so important to building highly available systems that <a href="http://www.sics.se/%7Ejoe/">Joe Armstrong</a> talks about it at great lengths in his presentation on "systems that never stop (and Erlang)."  In this presentation he presents 6 laws,</p><ol>
<li>Isolation</li>
<li>Concurrency</li>
<li>Failure Detection</li>
<li>Fault Identification</li>
<li>Live Code Upgrade</li>
<li>Stable Storage</li>
</ol>
<p>Without failure detection you can't fail fast, and if you can't fail fast you can't achieve isolation.  Real world example: Amazon S3's massive outage last year.  A huge embarrassment to Amazon's otherwise great cloud stack.  Their code didn't have adequate failure detection, allowing a single corrupt bit to propagate throughout the system.  The result was hours of downtime while engineers scrambled.  If Amazon had been able to detect the single corrupt bit then they could have isolated the failing component instead of letting the error bring down the entire system.</p><p>Few of us will be coding S3 like systems, but it doesn't mean we can't follow the fail fast approach.  It will make debugging easier and just might prevent similar embarrassments.  I think some programmers don't like to fail fast because they worry they'll be blamed for the failure, so just pretend things are good and pray nothing bad happens downstream.  I suppose it depends on how much you believe in God.  </p></div>
</content>



    <feedburner:origLink>http://work.tinou.com/2009/08/fail-fast-debugging-and-systems-that-never-stop.html</feedburner:origLink></entry>
    <entry>
        <title>WTF is F-Bounded Polymorphism</title>
        <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/tinou/work/~3/2zJCFtz21j8/wtf-is-fbounded-polymorphism.html" />
        <link rel="replies" type="text/html" href="http://work.tinou.com/2009/07/wtf-is-fbounded-polymorphism.html" thr:count="3" thr:updated="2009-11-15T13:35:46-08:00" />
        <id>tag:typepad.com,2003:post-6a00d83451ce5a69e201157139317b970c</id>
        <published>2009-07-23T23:39:09-07:00</published>
        <updated>2009-10-26T09:47:35-07:00</updated>
        <summary>Here's an example of where I read random articles just because the title is very catchy: F-Bounded Polymorphism for Object-Oriented Programming by Peter Canning, William Cook, Walter Hill, Walter Olthoff and John Mitchell. Polymorphism is already confusing enough. Just look...</summary>
        <author>
            <name>Tinou Bao</name>
        </author>
        <category scheme="http://www.sixapart.com/ns/types#category" term="Languages" />
        
        
<content type="xhtml" xml:lang="en-US" xml:base="http://work.tinou.com/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>Here's an example of where I read random articles just because the title is very catchy: F-Bounded Polymorphism for Object-Oriented Programming by Peter Canning, William Cook, Walter Hill, Walter Olthoff and John Mitchell.  Polymorphism is already confusing enough.  Just look at the Wikipedia entry on <a href="http://en.wikipedia.org/wiki/Type_polymorphism">type polymorphism</a>: "this article may be confusing or unclear to readers".</p><p><a href="http://tinou.blogs.com/.a/6a00d83451ce5a69e20115713906bb970c-pi" style="display: inline;"><img alt="Picture 1" border="0" class="at-xid-6a00d83451ce5a69e20115713906bb970c " src="http://tinou.blogs.com/.a/6a00d83451ce5a69e20115713906bb970c-500pi" title="Picture 1" /><br /></a></p><p><span style="text-decoration: underline;" />There's parametric polymorphism, let-polymorphism, impredicative polymorphism, ad-hoc polymorphism, intensional polymorphism and subtype polymorphism.  So WTF is F-bounded polymorphism then.  If there's no Wikipedia entry does it really exist?  Despite no mention in Wikipedia F-bounded polymorphism is pretty important and I'm using it everyday without knowing it.</p><p>Lets start with "bounded quantification".  With bounded quantification you can do something like this</p><p>
</p><pre style="border-style: dotted; border-width: 1px; margin: 10px 15px; padding: 5px;">Moveable {<br /> Moveable move(int x, int y)<br />}<br /><br />Moveable moveMeOneInch(Moveable m) {...}<br /></pre><p>
The method moveMeOneInch takes in some a Moveable, moves it one inch, and returns a Moveable.  It is polymorphic and will work on Cars, Trains, Points--anything that's Moveable.  All good, except we'll have to do some unsafe type casting.</p><pre style="border-style: dotted; border-width: 1px; margin: 10px 15px; padding: 5px;">Car c = ...<br />Car movedCar = (Car) moveMeOneInch(c);<br /></pre><p>

But with Java 5 we can do something like this</p><pre style="border-style: dotted; border-width: 1px; margin: 10px 15px; padding: 5px;">&lt;T extends Moveable&gt; T moveMeOneInch(T m) {...}<br /><br />Car c = ...<br />Airplane p = ...<br /><br />Car movedCar = moveMeOneInch(c);<br />Airplane movedAirplane = moveMeOneInch(p);<br /></pre><p>and get the type safety we want, because "run time is a little late to find out whether you have a landing gear".  What we've done is created an "F-bound". </p><pre style="border-style: dotted; border-width: 1px; margin: 10px 15px; padding: 5px;">F-Moveable[t] = { move : int, int -&gt; t }<br />F-Moveable[Car] = { move : int, int -&gt; Car }<br />F-Moveable[Airplane] = { move : int, int -&gt; Airplane }<br /></pre>
<p>Note the type appears on both sides.  </p><p>Here's a real world example.  The Collections utility class has a max method that takes in a Collection of Comparables and return the max of the collection.</p><pre style="border-style: dotted; border-width: 1px; margin: 10px 15px; padding: 5px;">public static &lt;T extends Object &amp; Comparable&lt;? super T&gt;&gt; <br /> T max(Collection&lt;? extends T&gt; coll) {...}<br /><br />//Integer implements Comparable&lt;Integer&gt;<br />Collection&lt;Integer&gt; ints = ...<br />Integer i = Collections.max(ints);<br /></pre>
<p>Here we bound the type of the collection and the type is part of the bound for Comparable.  A collection of integers, and integers are comparable to integers.</p><p>That's F-bounded polymorphism in a nutshell.  Now you can say, "yea, I like Java, it supports F-bounded polymorphism."</p></div>
</content>



    <feedburner:origLink>http://work.tinou.com/2009/07/wtf-is-fbounded-polymorphism.html</feedburner:origLink></entry>
    <entry>
        <title>Scala, FUD and the PCP Drug</title>
        <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/tinou/work/~3/Kev-FJXOU8U/scala-fud-and-the-pcp-drug.html" />
        <link rel="replies" type="text/html" href="http://work.tinou.com/2009/07/scala-fud-and-the-pcp-drug.html" thr:count="3" thr:updated="2009-07-16T09:03:53-07:00" />
        <id>tag:typepad.com,2003:post-6a00d83451ce5a69e201157206ce89970b</id>
        <published>2009-07-14T19:28:51-07:00</published>
        <updated>2009-07-14T21:32:34-07:00</updated>
        <summary>When something new comes along the established players often spread fear, uncertainty and doubt (FUD) in an attempt to remain dominant. But what's the opposite of FUD? I propose praise, certainty and perfection: PCP. Every now and then something new...</summary>
        <author>
            <name>Tinou Bao</name>
        </author>
        <category scheme="http://www.sixapart.com/ns/types#category" term="Languages" />
        <category scheme="http://www.sixapart.com/ns/types#category" term="Opinion" />
        
        
<content type="xhtml" xml:lang="en-US" xml:base="http://work.tinou.com/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>When something new comes along the established players often spread fear, uncertainty and doubt (FUD) in an attempt to remain dominant.  But what's the opposite of FUD?  I propose <em><strong>praise, certainty and perfection</strong></em>: <strong>PCP</strong>.  Every now and then something new comes along that so enamors early adopters they uncontrollably praise its merits, the certainty of its value and its utter perfection.  Like a drug, PCP is dangerous unless carefully supervised.
</p><p>
Take Scala, for example. There's so much PCP surround Scala I'm about to overdose.<a href="http://www.dzone.com/links/index.html">  DZone</a>'s top link for the last two days has been a blog titled <a href="http://www.khelll.com/blog/scala/scala-is-my-next-choice/">Scala is my next choice</a>.  It's a nice entry highlighting some of Scala's strengths but the author is too generous with some of the claims.  A quick read might well lead a programmer to conclude Scala will solve all of our programming problems.  Lets look at two claims.
</p>
<p><strong>Efficient</strong>
</p>
<blockquote><p>
Scala is an efficient language, actually due to the latest benchmarks Scala is almost as fast as Java. The JVM implementation of Scala compiles down to bytecode, but during so, the code passes through optimization phaze. An optimization example is the tail recursion optimization, to help you stick to the functional paradigm without sacrificing the performance.
</p></blockquote>
<p>
While the Scala compiler (which I praised <a href="http://work.tinou.com/2009/06/scalable-component-abstractions.html">here</a>) does do tail recursion optimization it's very limited, a performance hack if you will.  The tail call has to be within a final method or local function (and has to be to itself, hence the recursive qualifier).  Rich Dougherty has a nice post on <a href="http://blog.richdougherty.com/2009/04/tail-calls-tailrec-and-trampolines.html">tail calls in Scala</a> that goes into this further.  The fundamental issue is that unlike Scheme or Haskell there's no guarantee of tail calls in Scala because the JVM doesn't support tail calls. See John Rose's post <a href="http://blogs.sun.com/jrose/entry/tail_calls_in_the_vm">here</a> for further information on tail call support in the JVM and the difference between soft tail calls (the Scala hack) and hard tail calls. Bottom line: recursion in Scala will affect your performance.
</p>
<p><strong>Concurrency</strong>
</p>
<blockquote><p>
Thus Scala suggests another concurrency model called Actors, where an actor sends and receives asynchronous messages in it’s inbox instead of sharing data. This is called: shared nothing model. Once you stop thinking about shared data, you relief yourself from the headache of thinking of code synchronization and deadlocks problems.
</p></blockquote>
<p>
I'm not going to argue which concurrency model is better, but it's incorrect to say that you won't get deadlocks using Actors. See James Iry's post on <a href="http://james-iry.blogspot.com/2009/04/erlang-style-actors-are-all-about_16.html">deadlocks in Erlang</a>, or see <a href="http://people.csail.mit.edu/cel/">Charles Leiserson</a>'s <a href="http://www.cilk.com/multicore-e-book/">discussion of deadlocks</a> with MPI, another message based model used by the super computer guys.  Anytime there's a cycle in the dependencies you can get deadlocks, including when you're just passing messages around. Bottom line: Scala won't stop your deadlock problems. 
</p><p>
I don't have anything against Scala but lets not make Scala out to be this magically language that will solve all our performance, <a href="http://work.tinou.com/2009/06/scalas-dirty-little-secret-to-scalability.html">scalability</a> and concurrency problems.</p></div>
</content>



    <feedburner:origLink>http://work.tinou.com/2009/07/scala-fud-and-the-pcp-drug.html</feedburner:origLink></entry>
    <entry>
        <title>Everything is a Function</title>
        <link rel="alternate" type="text/html" href="http://feedproxy.google.com/~r/typepad/tinou/work/~3/thiUeRiwBuI/everything-is-a-function.html" />
        <link rel="replies" type="text/html" href="http://work.tinou.com/2009/07/everything-is-a-function.html" thr:count="3" thr:updated="2010-09-16T10:35:24-07:00" />
        <id>tag:typepad.com,2003:post-6a00d83451ce5a69e2011571f86041970b</id>
        <published>2009-07-12T02:21:37-07:00</published>
        <updated>2009-07-12T02:36:33-07:00</updated>
        <summary>I picked up Real World Haskell last week when Gilad Bracha warned me that I might become unemployable if I don't brush up on my functional programming skills. I found his comment amusing because I have Lisp on my resume...</summary>
        <author>
            <name>Tinou Bao</name>
        </author>
        <category scheme="http://www.sixapart.com/ns/types#category" term="Languages" />
        
        
<content type="xhtml" xml:lang="en-US" xml:base="http://work.tinou.com/">
<div xmlns="http://www.w3.org/1999/xhtml"><p>I picked up <a href="http://www.realworldhaskell.org/blog/">Real World Haskell</a> last week when Gilad Bracha warned me that I might become <a href="http://gbracha.blogspot.com/2009/06/ban-on-imports.html">unemployable</a> if I don't brush up on my functional programming skills.  I found his comment amusing because I have Lisp on my resume and have always wondered if anybody cared that I know Lisp.  Know is an exaggeration;  all I remember is cons, car, cdr and thinking why the f*** are there so many parentheses.  Yet here I am, a decade older, reading up on lambda-calculus and folding to remain marketable.</p><p>One thing that might confuse the imperative programmer when reading Haskell code for the first time is the unfamiliar function definitions.  In Java if you have an "add" method that adds two ints and returns an int you might write it as</p>
<pre style="border-style: dotted; border-width: 1px; margin: 10px 15px; padding: 5px;">add (int, int) : int<br /></pre><p>
or</p>
<pre style="border-style: dotted; border-width: 1px; margin: 10px 15px; padding: 5px;">int add (int a, int b)<br /></pre>
<p>or</p>
<pre style="border-style: dotted; border-width: 1px; margin: 10px 15px; padding: 5px;">(int, int) -&gt; int<br /></pre>
<p>In any case it's clear that add takes two ints and returns an int.  But in Haskell your add function will be </p>
<pre style="border-style: dotted; border-width: 1px; margin: 10px 15px; padding: 5px;">add :: Int -&gt; Int -&gt; Int<br /></pre>
<p>You might guess that the first int is the first argument, the second int is the second argument, and the final int the return type.  But not really.  In Haskell all functions are single argument functions.  Add is a function that takes an int, returns a function that takes an int, returns a function that evaluates to an int.  Here's an example with the map function.  </p><p>
</p><pre style="border-style: dotted; border-width: 1px; margin: 10px 15px; padding: 5px;">map abs [-1,-3,4,-12] <br />[1,3,4,12]<br /><br />:type map<br />map :: (a -&gt; b) -&gt; [a] -&gt; [b]<br /></pre>

<p>It appears that the map function takes two arguments, a mapping function (in this case the abs function that returns the absolute value of a number) and a list, and applies the mapping function to each element of the list.  But as you can see in the type definition map is a function that takes a mapping function (a-&gt;b), returns a function that takes a list [a], returns a function that evaluates to a list [b].  </p><p>This is because the lambda-calculus which Haskell is based upon doesn't have built-in support for multi-argument functions.  It'd be pretty easy to support multi-argument functions according to Benjamin Pierce, but to confuse us imperative programmers they decided it would be easier and better to use higher-order functions (functions that accept and/or return functions). Instead of f = λ(x,y).s we have f = λx.λy.s.  Then we would just write f v w and do two "reductions" instead of writing something like f(v,w).  </p><p>Converting multi-argument functions into higher-order functions is called "currying" after Haskell Curry, for whom the Haskell language is named.  Haskell was a contemporary of Alonzo Church, who developed the lambda-calculus during the 1920s.  While at Princeton Church supervised the doctoral thesis of Alan Turing, the tape machine guy, who committed suicide in 1954 after being prosecuted for his homosexuality.   (That was for all the history buffs out there.)</p><p>Once you see it a few times the notation becomes "normal".  It's actually very elegant (among other things).  Everything is a single argument function.  I mean everything.  You know how Smalltalk guys are always saying everything is an object, well the lambda ultimate guys say everything is a function, including <a href="http://en.wikipedia.org/wiki/Church_encoding">numbers</a> like 23 and 17.  Maybe that's what Java's missing, a pithy tag line.  Java, where everything is a....</p></div>
</content>



    <feedburner:origLink>http://work.tinou.com/2009/07/everything-is-a-function.html</feedburner:origLink></entry>
 
</feed><!-- ph=1 -->

