<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Daniel Lemire's blog</title>
	
	<link>http://lemire.me/blog</link>
	<description>Computer Scientist and Open Scholar: Databases, Information Retrieval, Business Intelligence.</description>
	<lastBuildDate>Mon, 14 May 2012 14:17:21 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/daniel-lemire/atom" /><feedburner:info uri="daniel-lemire/atom" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><geo:lat>45</geo:lat><geo:long>-73</geo:long><creativeCommons:license>http://creativecommons.org/licenses/by-nc-sa/2.0/</creativeCommons:license><feedburner:emailServiceId>daniel-lemire/atom</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><feedburner:feedFlare href="http://www.bloglines.com/sub/http://feeds.feedburner.com/daniel-lemire/atom" src="http://www.bloglines.com/images/sub_modern11.gif">Subscribe with Bloglines</feedburner:feedFlare><feedburner:feedFlare href="http://fusion.google.com/add?feedurl=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://buttons.googlesyndication.com/fusion/add.gif">Subscribe with Google</feedburner:feedFlare><feedburner:feedFlare href="http://www.plusmo.com/add?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://plusmo.com/res/graphics/fbplusmo.gif">Subscribe with Plusmo</feedburner:feedFlare><feedburner:feedFlare href="http://www.thefreedictionary.com/_/hp/AddRSS.aspx?http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://img.tfd.com/hp/addToTheFreeDictionary.gif">Subscribe with The Free Dictionary</feedburner:feedFlare><feedburner:feedFlare href="http://www.bitty.com/manual/?contenttype=rssfeed&amp;contentvalue=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.bitty.com/img/bittychicklet_91x17.gif">Subscribe with Bitty Browser</feedburner:feedFlare><feedburner:feedFlare href="http://www.newsalloy.com/?rss=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.newsalloy.com/subrss3.gif">Subscribe with NewsAlloy</feedburner:feedFlare><feedburner:feedFlare 
href="http://www.live.com/?add=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://tkfiles.storage.msn.com/x1piYkpqHC_35nIp1gLE68-wvzLZO8iXl_JMledmJQXP-XTBOLfmQv4zhj4MhcWEJh_GtoBIiAl1Mjh-ndp9k47If7hTaFno0mxW9_i3p_5qQw">Subscribe with Live.com</feedburner:feedFlare><feedburner:feedFlare href="http://mix.excite.eu/add?feedurl=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://image.excite.co.uk/mix/addtomix.gif">Subscribe with Excite MIX</feedburner:feedFlare><feedburner:feedFlare href="http://download.attensa.com/app/get_attensa.html?feedurl=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.attensa.com/blogs/attensa/WindowsLiveWriter/BadgeredintoBadges_10C02/attensa_feed_button5.gif">Subscribe with Attensa for Outlook</feedburner:feedFlare><feedburner:feedFlare href="http://www.webwag.com/wwgthis.php?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.webwag.com/images/wwgthis.gif">Subscribe with Webwag</feedburner:feedFlare><feedburner:feedFlare href="http://www.podcastready.com/oneclick_bookmark.php?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.podcastready.com/images/podcastready_button.gif">Subscribe with Podcast Ready</feedburner:feedFlare><feedburner:feedFlare href="http://www.flurry.com/pushRssFeed.do?r=fb&amp;url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.flurry.com/images/flurry_rss_logo2.gif">Subscribe with Flurry</feedburner:feedFlare><feedburner:feedFlare href="http://www.wikio.com/subscribe?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.wikio.com/shared/img/add2wikio.gif">Subscribe with Wikio</feedburner:feedFlare><feedburner:feedFlare href="http://www.dailyrotation.com/index.php?feed=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.dailyrotation.com/rss-dr2.gif">Subscribe with Daily Rotation</feedburner:feedFlare><item>
		<title>Summer reading recommendations</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/iTByySzLeMc/</link>
		<comments>http://lemire.me/blog/archives/2012/05/14/summer-reading-recommendations/#comments</comments>
		<pubDate>Mon, 14 May 2012 14:17:21 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4265</guid>
		<description>What came after by Sam Winston is an intriguing scifi novel. It describes a near-future dystopia where a handful of large corporations have taken over the USA. After being a puppet to powerful interests, the government has finally been abolished. In some sense, it is the anti-libertarian novel: what if we let the free market [...]</description>
			<content:encoded><![CDATA[<p><a href="http://www.amazon.com/What-Came-After-ebook/dp/B005V5DJ7U?tag=daniellemires-20" rel="nofollow">What came after</a> by Sam Winston is an intriguing scifi novel. <img src="http://www.whatcameafter.com/Resources/wcaimage.jpeg" style="width:200px;margin:5px;float:right"/> It describes a near-future dystopia where a handful of large corporations have taken over the USA. After being a puppet to powerful interests, the government has finally been abolished. In some sense, it is the anti-<a href="http://en.wikipedia.org/wiki/Libertarianism">libertarian</a> novel: what if we let the <em>free</em> market prevail? Eventually, some large corporations may become so powerful that they can use force to prevent competition. Though overall credible, I found the absence of any state  a bit unbelievable because I view corporations and states as mutually supporting concepts: large corporations may try to control the state, but they rarely try to abolish it.  The hero is out to save his daughter, at first, and then he becomes part of a larger fight. The writing is beautiful. Short sentences. Powerful text. An emotional roller-coaster. The novel would make a great movie. Meanwhile, <a href="http://www.amazon.com/What-Came-After-ebook/dp/B005V5DJ7U?tag=daniellemires-20" rel="nofollow">the e-book is cheap ($3.99)</a>. I expect the author to write a follow-up.</p>
<p>Unless you live under a rock, you have heard of the <a href="http://www.amazon.com/The-Hunger-Games-Trilogy-ebook/dp/B004XJRQUQ/?tag=daniellemires-20" rel="nofollow">Hunger Games trilogy</a> by Suzanne Collins. <img src="http://www.suzannecollinsbooks.com/images/Mockingjaycover-330.jpg" style="width:200px;margin:5px;float:right"/><br />
 They recently made a decent movie out of the first book. As in <em>What came after</em>, the books describe a near-future dystopia where war and oppression have reduced humanity to a few towns supporting a relatively wealthy capital. What I found interesting in these novels is how the main character (Katniss) is an anarchist. That is, she cares for those she loves (her tribe), but she is rather immune to big ideas and propaganda.  This becomes clearer as the story progresses, and many people have hated the ending for this reason. The trilogy is reasonably priced: <a href="http://www.amazon.com/The-Hunger-Games-Trilogy-ebook/dp/B004XJRQUQ/?tag=daniellemires-20" rel="nofollow">you can get all of it for under $20</a>.</p>
<p>In the <a href="http://www.amazon.com/Trilisk-Parker-Interstellar-Travels-ebook/dp/B005Q22AI2?tag=daniellemires-20/?tag=daniellemires-20" rel="nofollow">Trilisk Ruins</a>, Michael McCloskey describes a far future universe where human beings have encountered alien ruins on diverse planets. <img src="http://cache.smashwire.com/bookCovers/6ff1af0c2a9e58cb2946622512472092314c3cac" style="width:200px;margin:5px;float:right"/><br />
 These ruins have obvious commercial value: alien artifacts are immensely valuable. Meanwhile, the government has restricted access to these ruins to its own military. The main character is a xenoarchaeologist who is frustrated by the lack of access to these new findings. She decides to embark with a bunch of pirates/mercenaries who hope to visit new alien ruins before the military can get its hands on them. The novel touches on a common theme in scifi: it is unwise to put your military in charge of first contact with aliens.  <a href="http://www.amazon.com/Trilisk-Parker-Interstellar-Travels-ebook/dp/B005Q22AI2?tag=daniellemires-20" rel="nofollow">You can get the e-book for $2.99</a>.</p>
<p>The <a href="http://www.amazon.com/The-Galactic-Mage-ebook/dp/B006VCZMVS/?tag=daniellemires-20" rel="nofollow">Galactic Mage</a> by John Daulton is a twisted, but fun story. <img src="http://daultonbooks.com/wp-content/uploads/2012/01/cover_only_web1.jpg" style="width:200px;margin:5px;float:right"/> We have human beings from Earth who were in contact with another human civilization across the galaxy. This civilization was apparently wiped out after it sent a warning. A fleet of ships from Earth is assembled to go investigate. Meanwhile, on a remote planet, a bona fide mage has decided to go explore space. He does so by teleporting himself (and his tower) in space. At first, the premise seems unbelievable, and it is, but it is fascinating to see how a mage might explore space. Without any scientific background, he is faced with several challenges such as unimaginable distances. Unfortunately, the novel never quite feels complete: many issues are left unresolved. Nevertheless, it is an entertaining book well suited for teenagers or people looking for a fun little book. <a href="http://www.amazon.com/The-Galactic-Mage-ebook/dp/B006VCZMVS/?tag=daniellemires-20" rel="nofollow">You can grab it for $2.99</a>.</p>
<p>Finally, the <a href="http://www.amazon.com/Empire-of-the-Gods-ebook/dp/B0061C3ER2/?tag=daniellemires-20" rel="nofollow">Empire of the Gods</a> by David Stag is a well written space adventure. <img src="http://www.mathachew.com/wp-content/uploads/2011/12/book_empire_of_the_gods-200x300.jpg" style="width:200px;margin:5px;float:right"/> It describes a universe where most people live in misery under an oppressive and all-knowing government. The leaders derive their power from mysterious rods that seem to give them the powers of gods. I found it very entertaining even though there are obvious flaws. For example, as in the Galactic Mage, we are supposed to believe that the universe is populated by human beings. Thus, as scifi, this is a so-so book, but thankfully, the novel works well as a fantasy. The writing is solid.  <a href="http://www.amazon.com/Empire-of-the-Gods-ebook/dp/B0061C3ER2/?tag=daniellemires-20" rel="nofollow">The e-book is only $0.99</a>.</p>
<p><strong>Disclaimer</strong>: I got Empire of the Gods for free from the author.</p>
<div class="related">
<p>Related posts (automatically generated):</p>
<ul>
		<li><a href="http://lemire.me/blog/archives/2010/07/09/summer-reading-my-recommendations/" rel="bookmark">Summer reading: my recommendations (2010)</a><!-- (14.8)--></li>
	</ul>
</div>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/05/14/summer-reading-recommendations/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/05/14/summer-reading-recommendations/</feedburner:origLink></item>
		<item>
		<title>Punk money: how you can print your own currency… legally</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/n2SJapZktsk/</link>
		<comments>http://lemire.me/blog/archives/2012/04/25/punk-money-how-you-can-print-your-own-currency-legally/#comments</comments>
		<pubDate>Wed, 25 Apr 2012 18:10:16 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4244</guid>
		<description>We all want and need money. However, for many services, paying actual dollars is inefficient. The transaction costs are too high. So we need a system whereby perfect strangers can make deals at a very small transaction cost. For this purpose, people use punk money: You publicly promise a favor in exchange for a service, [...]</description>
			<content:encoded><![CDATA[<p>We all want and need money. However, for many services, paying actual dollars is inefficient. The transaction costs are too high. </p>
<p>So we need a system whereby perfect strangers can make deals at a very small transaction cost. For this purpose,  people use <a href="https://twitter.com/#!/search/realtime/%23punkmoney">punk money</a>:</p>
<ul>
<li>You publicly promise a favor in exchange for a service, you may stipulate the terms.</li>
<li>The Web records your transaction.</li>
<li>Your public reputation guarantees the transaction.</li>
</ul>
<p>Six months ago, I needed a particular piece of software. In exchange for the code, I <a href="https://plus.google.com/105888615414982242080/posts/4JD7YZZK6vf">promised to promote a web site</a> on the Google+ social network.  <a href="https://plus.google.com/112791705546370961242/posts">Joshua Grochow</a> won the contract and I owe him.</p>
<p>Punk money should also be able to solve more systemic problems. For example, it is often hard to find good reviewers for  research papers. To solve this problem, we  create an intermediary between the authors and the reviewers (e.g., a conference or a journal). This intermediary is often supported by a larger organization (e.g., a publisher) seeking financial gain.</p>
<p>As an alternative,  I published the following open contract in the spirit of punk money:</p>
<blockquote>
<ol>
<li> Write a research paper in my general area of expertise.</li>
<li> Send me the paper.</li>
<li> I will read it in a reasonable delay (not 3 freaking months), or tell you that I&#8217;m too busy. (If your paper is really bad, I might also ignore you, politely.)</li>
<li> I will give you feedback.</li>
<li> At some point, I might feel that the paper is quite good, and then I will publicly say so, putting my reputation on the line.</li>
</ol>
<p>In exchange, you have to mention me in the acknowledgements.
</p></blockquote>
<p>So far one researcher took me up on this offer and I reviewed his paper (privately) within a few weeks. If many researchers adopted a similar open contract, we could create a workable lightweight alternative for scientific publishing. It will not replace journals or conferences, but it is also nearly free compared to the current system.</p>
<p>The famous mathematician <a href="http://www.math.rutgers.edu/~zeilberg/pj.html">Doron Zeilberger</a> has used this punk approach to validate some of his research papers. Indeed, he has a set of papers that only appear on his web site. He validates some of them by asking other mathematicians to review his work. The net result is that you can probably trust these papers: after all, some established mathematicians were willing to vouch for the work. Journals cannot offer anything better. In exchange, he acknowledges the other researchers. The transaction happens without an intermediary. </p>
<p>Of course, you could apply the same type of contracts to any type of publication. Perhaps you are willing to review books or novels, and promote them in exchange for some favor. Perhaps you are willing to provide code reviews for open source projects. And so on.</p>
<p>The concept is entirely general. Maybe there is an annoying bug in your favorite open source web browser, and while you cannot fix it yourself, you would be willing to put up a bounty. What could it be? Maybe you are willing to have a pizza delivered to the house of the programmer who provides the fix. Or maybe you will post a poem in honor of whoever fixed the problem. </p>
<p>Punk money has three major characteristics:</p>
<ul>
<li>There is no real intermediary other than the Web. Or rather, the intermediary is light and easily replaceable.</li>
<li>There is a written record.</li>
<li>It is an explicit credit system, not free labor.</li>
</ul>
<p>In contrast, a site like <a href="http://stackoverflow.com/">Stack Overflow</a> allows you to ask a question for free. People who do the hard work of providing a detailed answer get <em>clown money</em> (e.g., reputation points). The intermediary (the owner of Stack Overflow) is not easily replaceable: it is an essential component of the system. Such specialized sites work  amazingly well for specific problems, but punk money has far broader potential.</p>
<p><strong>Note</strong>: I think there should be a <a href="http://en.wikipedia.org/wiki/Punk_Money">Wikipedia article on punk money</a>. It should trace the origin of the term and provide sufficient context so that it stands a chance of meeting Wikipedia standards. I am too lazy to do it, but if you do it I will update this blog post with a link to the new entry and, if you wish, a note crediting your effort.</p>
<p><strong>Further reading</strong>: See <a href="http://www.punkmoney.org/note/195270548140470272">punkmoney.org</a> for a related tool.</p>
<p><strong>Update regarding taxation</strong>: The trades I have in mind are already happening without any fiscal consideration. I already review research papers, that&#8217;s a service I render. I also benefit from this service when I submit a research paper. I also support open source software (including handling bug reports) while benefiting from the open source software services of others (e.g., Linux). Yet there is no taxation involved.</p>
<p><strong>Update regarding the word &#8220;currency&#8221;</strong>: What I describe is a credit system, not a currency. I used the word currency in my title because it sounded good.</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/04/25/punk-money-how-you-can-print-your-own-currency-legally/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/04/25/punk-money-how-you-can-print-your-own-currency-legally/</feedburner:origLink></item>
		<item>
		<title>Computer scientists need to learn about significant digits</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/HO8qTzAbDtE/</link>
		<comments>http://lemire.me/blog/archives/2012/04/20/computer-scientists-need-to-learn-about-significant-digits/#comments</comments>
		<pubDate>Fri, 20 Apr 2012 14:20:23 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4240</guid>
		<description>I probably spend too much time reviewing research papers. It makes me cranky. Nevertheless, one thing that has become absolutely clear to me is that computer scientists do not know about significant digits. When you write that the test took 304.03&amp;#160;s, you are telling me that the 0.03 s is somehow significant (otherwise, why tell [...]</description>
			<content:encoded><![CDATA[<p>I probably spend too much time reviewing research papers. It makes me cranky.</p>
<p>Nevertheless, one thing that has become absolutely clear to me is that computer scientists do not know about <a href="http://en.wikipedia.org/wiki/Significant_digits">significant digits</a>.</p>
<p>When you write that the test took 304.03&nbsp;s, you are telling me that the 0.03 s is somehow significant (otherwise, why tell me about it?). Yet it is almost certainly <strong>insignificant</strong>.</p>
<p>In computer science, you should almost never use more than two significant digits. So 304.03&nbsp;s is indistinguishable from 300&nbsp;s. And 33.14 MB is the same thing as 33 MB.</p>
<p>Why does it matter?</p>
<ul>
<li>Cutting down numbers to their significant digits simplifies the exposition. It is simpler to say that it took 300 s than to say that it took  304.03 s.</li>
<li>Numbers reported with more digits than are significant often lie. Running your program does not take 304.03 s. Maybe it did this one time, but if you run it again, you will get a different number.</li>
</ul>
<p>Please learn to express your experimental results using as few digits as you can.</p>
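<p>As an illustration, here is a minimal Python sketch of rounding a measurement to two significant digits. The helper name <code>round_sig</code> and its <code>digits</code> parameter are hypothetical, chosen for this example; it is one possible way to do it, not a standard library function.</p>

```python
import math


def round_sig(value: float, digits: int = 2) -> float:
    """Round value to the given number of significant digits.

    round_sig is a hypothetical helper written for this example.
    """
    if value == 0:
        return 0.0
    # Position of the leading digit: 2 for 304.03, 1 for 33.14.
    exponent = math.floor(math.log10(abs(value)))
    # A negative ndigits rounds to the left of the decimal point.
    return round(value, digits - 1 - exponent)


print(round_sig(304.03))  # the 304.03 s timing from the text
print(round_sig(33.14))   # the 33.14 MB size from the text
```

<p>With the two numbers used above, this reports 300 s and 33 MB, which is all the precision a single run is likely to support.</p>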
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/04/20/computer-scientists-need-to-learn-about-significant-digits/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/04/20/computer-scientists-need-to-learn-about-significant-digits/</feedburner:origLink></item>
		<item>
		<title>Let us abolish page limits in scientific publications</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/jkg9eIdGL3g/</link>
		<comments>http://lemire.me/blog/archives/2012/04/18/let-us-abolish-page-limits-in-scientific-publications/#comments</comments>
		<pubDate>Wed, 18 Apr 2012 14:26:10 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4226</guid>
		<description>As scientists, we are often subjected to strict page limits. These limits made sense when articles were printed on expensive paper. They are now obsolete. But we still need to print the articles on paper! At least in Computer Science, almost everyone has adopted electronic media. It is cheaper and more convenient. I carry thousands [...]</description>
			<content:encoded><![CDATA[<p>As scientists, we are often subjected to strict page limits. These limits made sense when articles were printed on expensive paper. They are now obsolete.</p>
<ul>
<li><strong>But we still need to print the articles on paper!</strong> At least in Computer Science, almost everyone has adopted electronic media. It is cheaper and more convenient. I carry thousands of research papers on my laptop: I would require a part-time archivist to get the same result with paper. And 99% of all references are a mouse click away. Given a research paper, I can quickly search through it for interesting terms. It is true that paper versions can sometimes be handy. However, we have this marvelous technology called the personal printer. You can get one for $100. And these printers are connected to computers smart enough to print just the pages you need. You need to review the proof of a theorem on paper? Just print out the proof, specifically. Most people who can afford access to printed journals can afford a printer and the printing costs.</li>
<li><strong>Reviewers prefer to review short papers.</strong> It can be more difficult to review a short paper than a long paper. I speak from experience. For example, I am currently reviewing papers for <a href="http://recsys.acm.org/2012/">ACM RecSys</a> where we have two tracks: short and long papers. It takes me just as long to review short papers. Indeed, reading the text itself is not the bottleneck. What takes the bulk of my time?
<ul>
<li>Checking the literature is time consuming. I often ask myself:  did they really advance the state-of-the-art? Other times, I want to check how the submitted manuscript differs from previous work by the same authors.</li>
<li> Reviewing the methodology or the mathematical proof also takes me a long time, especially when the authors have omitted details.</li>
</ul>
<p> If the authors expand unnecessarily on uninteresting aspects of their work, or spend much time reviewing elementary facts, it does not slow me down much because I can easily skip it, as long as the work is well structured. In fact, I find that I can get the gist of an entire Ph.D. thesis, if it is well written, faster than I can understand some short research papers. To summarize: the number of pages is not the primary factor determining how long it takes to review a paper. The problem is not that papers are too long; rather, it is that they are often poorly written.</li>
<li><strong>We want to entice authors to be concise.</strong> Everything else being equal, a concise text will be better written and easier to read than its longer counterpart. However, everything else is not equal. For example, Venkatesh Rao&#8217;s <a href="http://www.ribbonfarm.com/2011/06/08/a-brief-history-of-the-corporation-1600-to-2100/">brief History of the Corporation</a> is a blog post containing 7000 words. It is an order of magnitude larger than most blog posts. Aren&#8217;t Internet users supposed to suffer from attention deficit? Surely, nobody has time for such a long blog post? Yet it has become a classic. It has been extensively covered by various Internet news sites and forums, cited thousands of times. This is no excuse to use long and complicated sentences or to repeat yourself: Rao is an expert writer even though he writes long blog posts. So, while it is true that we have little tolerance for boring ramblings, what matters is less the length of the text, and more how interesting it is.  </li>
</ul>
<p>Thankfully, page limits are going away, slowly. <a href="http://people.csail.mit.edu/marcua/">Adam Marcus</a> sent me a link to the <a href="http://www.acm.org/uist/uist2012/cfp.html">UIST call for papers</a> where they are openly flexible regarding page limits:</p>
<blockquote><p> While we will review papers longer than 10 pages, the contributions must warrant the extra length. </p></blockquote>
<p>Similarly,  <a href="http://www.cs.utah.edu/~regehr/">John Regehr</a> sent me a link to the OOPSLA call for papers:</p>
<blockquote><p>The length of a submitted paper should not be a point of concern for authors. Authors should focus instead on addressing the criteria mentioned above, whether it takes 5 pages or 15 pages. It is, however, the responsibility of the authors to keep the reviewers interested and motivated to read the paper. Reviewers are under no obligation to read all or even a substantial portion of a paper if they do not find the initial part of the paper interesting.</p></blockquote>
<p><strong>Further reading</strong>: Stephen King made a killing with his novel <a href="http://www.amazon.com/The-Stand-Stephen-King/dp/0307743683?tag=daniellemires-20" rel="nofollow">The Stand</a>. Yet it spans nearly 1500 pages. Rao wrote several posts on why he shouldn&#8217;t be expected to use few words: <a href="http://www.ribbonfarm.com/2012/01/11/seeking-density-in-the-gonzo-theater/">Seeking Density in the Gonzo Theater</a> and <a href="http://www.ribbonfarm.com/2012/02/29/just-add-water/">Just Add Water</a>.</p>
<p><strong>Update</strong>: According to an anonymous reader,  copy editing is often charged by the number of pages. So it can cost twice as much to publish a paper twice as long, even if you only publish it electronically.</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/04/18/let-us-abolish-page-limits-in-scientific-publications/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/04/18/let-us-abolish-page-limits-in-scientific-publications/</feedburner:origLink></item>
		<item>
		<title>How to manipulate the masses by language alone</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/wxvdmkVGkvg/</link>
		<comments>http://lemire.me/blog/archives/2012/04/13/how-to-manipulate-the-masses-by-language-alone/#comments</comments>
		<pubDate>Fri, 13 Apr 2012 18:44:49 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4205</guid>
		<description>George Orwell, with his novel 1984, popularized the idea that by changing the language, you could change minds. It is easy to forget that we are routinely victims of this strategy. A fascinating example is the French language itself. I long had this image of the French revolution as the French people, that is, the [...]</description>
			<content:encoded><![CDATA[<p>George Orwell, with his <a href="http://en.wikipedia.org/wiki/Nineteen_Eighty-Four">novel 1984</a>, popularized the idea that by changing the language, you could change minds. It is easy to forget that we are routinely victims of this strategy.</p>
<p>A fascinating example is the French language itself. I long had this image of the French revolution as the French people, that is, the people who spoke French, rising up. But during the revolution in 1789, <a href="http://en.wikipedia.org/wiki/History_of_French">only half the population of France spoke <strong>some</strong> French</a>. The state of France created the French language we know today. It was an act of social engineering to ensure that there would be a united French people.</p>
<p>A widespread instance of this strategy is <a href="http://en.wikipedia.org/wiki/Politically_correct">political correctness</a>. Apparently, it is racist to say that Martin Luther King was black. We don&#8217;t have firemen anymore, have you noticed? We have firefighters.</p>
<p>The term <em>climate change</em> is another fascinating example. Prior to 2003, we talked about <em>global warming</em>. It changed when Frank Luntz, a political consultant, convinced the American president to force people to talk about changes instead of warming, because it feels less threatening.</p>
<p>Another example is &#8220;intellectual property&#8221;.  If &#8220;intellectual property&#8221; is bona fide property, then you should be able to steal it. Can you? <a href="http://en.wikipedia.org/wiki/Dowling_v._United_States_(1985)">The Supreme Court of the United States</a> thinks you can&#8217;t steal intellectual property the same way you can steal cars:</p>
<blockquote><p>(&#8230;) interference with copyright does not easily equate with theft, conversion, or fraud. The infringer of a copyright does not assume physical control over the copyright nor wholly deprive its owner of its use. </p></blockquote>
<p>Yet even judges get confused. Recently a programmer from Goldman Sachs who copied and shared secret software <a href="http://www.wired.com/threatlevel/2012/04/code-not-physical-property">was acquitted of theft charges</a>. Yet, in its verdict, the court writes that he <em>stole purely intangible property embodied in a purely intangible format.</em>  He cannot be convicted of theft, so why use the word in the first place? The intellectual property lobby goes even further when it talks about <em>piracy</em>. (Thankfully, they haven&#8217;t yet prosecuted anyone for actual piracy.) Effectively, they have changed the language: they have gotten us to attribute new meanings to existing words and to associate piracy and theft with the infringement of exclusivity rights.</p>
<p>Scientists often play the same games. For example, to make something sound  serious, just append <em>engineering</em> to it: knowledge engineering, software engineering, data engineering. </p>
<p>Experience has taught me to be suspicious of people who spend too much effort redefining words. They are probably not out to help you think clearly.</p>
<p><strong>Credit</strong>: Thanks to Marc Couture for the legal reference and an inspiring discussion.</p>
<p><strong>Related video</strong>: <a href="http://www.youtube.com/watch?v=rFMl0stqai0">Too Much Copyright</a></p>
<p><strong>Further information</strong>: <a href="http://www.youtube.com/watch?v=CNk_kzQCclo">Euphemistic Language</a> by George Carlin, <a href="http://www.amazon.com/Words-That-Work-What-People/dp/1401302599?tag=daniellemires-20" rel="nofollow">Words That Work</a> by Frank I. Luntz, <a href="http://www.amazon.com/How-Not-Say-What-Mean/dp/0199208395?tag=daniellemires-20" rel="nofollow">How Not To Say What You Mean</a> by R. W. Holder</p>
<p><strong>Update</strong>: I do realize that global warming and climate change refer to different concepts from a scientific point of view. But what people worry about is not so much the change, as change is unavoidable. Rather, we worry about the warming&#8230; don&#8217;t we? Or are some people really set on preventing any kind of climate change?</p>
<div class="related">
<p>Related posts (automatically generated):</p>
<ul>
		<li><a href="http://lemire.me/blog/archives/2005/03/16/d-more-misc/" rel="bookmark">French is the third language in the USA</a><!-- (11.5)--></li>
		<li><a href="http://lemire.me/blog/archives/2008/11/05/selecting-emails-per-language/" rel="bookmark">Selecting emails per language</a><!-- (10.4)--></li>
	</ul>
</div>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/daniel-lemire/atom?a=wxvdmkVGkvg:5iq5k4r9H_k:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/daniel-lemire/atom?i=wxvdmkVGkvg:5iq5k4r9H_k:D7DqB2pKExk" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/wxvdmkVGkvg" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/04/13/how-to-manipulate-the-masses-by-language-alone/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/04/13/how-to-manipulate-the-masses-by-language-alone/</feedburner:origLink></item>
		<item>
		<title>Bit packing is fast, but integer logarithm is slow</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/Pb24G4xu4C4/</link>
		<comments>http://lemire.me/blog/archives/2012/04/05/bit-packing-is-fast-but-integer-logarithm-is-slow/#comments</comments>
		<pubDate>Thu, 05 Apr 2012 19:02:37 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4148</guid>
		<description>In How fast is bit packing?, we saw how to store non-negative integers smaller than 2^N using N bits per integer by a technique called bit packing. A careful C++ bit packing implementation is fast: e.g., over 1 billion integers per second. However, before you pack the integers, you might need to scan them to [...]</description>
			<content:encoded><![CDATA[<p>In <a href="http://lemire.me/blog/archives/2012/03/06/how-fast-is-bit-packing/">How fast is bit packing?</a>, we saw how to store non-negative integers smaller than 2<sup><em>N</em></sup> using  <em>N</em> bits per integer by a technique called bit packing. A careful C++ bit packing implementation is fast: e.g., over 1 billion integers per second.</p>
<p>However, before you pack the integers, you might need to scan them to determine the number of bits needed (<em>N</em>). Unfortunately, this scan is relatively expensive.</p>
<p>Given a positive integer <em>x</em>, we seek  the smallest integer <em>N</em> such that the integer <em>x</em> is less than 2<sup><em>N</em></sup>. The value <em>N</em> is often called the <em>integer logarithm</em> of <em>x</em>.</p>
<p>There are <a href="http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog">several clever techniques</a> to compute the integer logarithm using portable C code. Yet you can do better using processor-specific instructions. The GNU GCC compiler makes this easy with a special function that counts the number of leading zeros for 32-bit integers  (<tt>__builtin_clz</tt>). Even so, it is relatively slow.</p>
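<p>As a rough sketch of the definition above (not the code from the linked repository), the integer logarithm can be written with <tt>__builtin_clz</tt> or with a portable loop. The function names are illustrative, the sketch assumes a 32-bit <tt>unsigned int</tt>, and zero is mapped to 0 because <tt>__builtin_clz</tt> is undefined for a zero argument:</p>

```cpp
// Sketch only: assumes unsigned int is 32 bits wide and a GCC-compatible compiler.

// Smallest N such that x < 2^N, that is, the bit length of x.
// __builtin_clz(0) is undefined, so zero is handled separately.
int integer_logarithm(unsigned int x) {
    if (x == 0) return 0;
    return 32 - __builtin_clz(x);
}

// Portable fallback: shift right until no bits remain, counting the shifts.
int integer_logarithm_portable(unsigned int x) {
    int n = 0;
    while (x != 0) {
        x >>= 1;
        ++n;
    }
    return n;
}
```

<p>Both functions should agree on every input; for example, 255 needs 8 bits and 256 needs 9.</p>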
<p>Thankfully, you can avoid computing the integer logarithm of each integer by a simple test involving a right shift:<br />
<code><br />
if((x>>b) !=0)<br />
  b = integer_logarithm(x);<br />
</code><br />
With proper loop unrolling, this is nearly as fast as bit packing.</p>
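<p>To make the right-shift test concrete, here is a minimal sketch of a scan over an array, without loop unrolling. The names <tt>max_bits_shift_test</tt> and <tt>integer_logarithm</tt> are illustrative, and the sketch assumes a 32-bit <tt>unsigned int</tt>. It also stops once <tt>b</tt> reaches 32, since shifting a 32-bit value by 32 is undefined in C++:</p>

```cpp
// Sketch only: assumes unsigned int is 32 bits wide and a GCC-compatible compiler.

// Smallest N such that x < 2^N (0 for x == 0).
int integer_logarithm(unsigned int x) {
    if (x == 0) return 0;
    return 32 - __builtin_clz(x);
}

// Number of bits needed by the largest of the n values in data.
// The logarithm is recomputed only when a value does not fit in b bits.
int max_bits_shift_test(const unsigned int* data, unsigned long n) {
    int b = 0;
    for (unsigned long i = 0; i < n; ++i) {
        if (b == 32) break;  // maximum width reached; a shift by 32 would be undefined
        if ((data[i] >> b) != 0) {
            b = integer_logarithm(data[i]);
        }
    }
    return b;
}
```

<p>On the values 3, 17, 4, 255, 9, the logarithm is computed only three times (for 3, 17 and 255) and the result is 8.</p>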
<p><strong>Update</strong>: Preston Bannister correctly points out that you can do much better. Simply compute the bitwise OR of all the integers and then compute the integer logarithm of the result. The OR has the same highest set bit as the largest integer, so a single logarithm suffices. It is much, much faster.</p>
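<p>The approach from the update can be sketched as follows. This is an illustration under the same assumptions as the rest of the post (32-bit <tt>unsigned int</tt>, GCC builtins), with hypothetical function names, and not necessarily the code in the repository:</p>

```cpp
// Sketch only: assumes unsigned int is 32 bits wide and a GCC-compatible compiler.

// Smallest N such that x < 2^N (0 for x == 0).
int integer_logarithm(unsigned int x) {
    if (x == 0) return 0;
    return 32 - __builtin_clz(x);
}

// Number of bits needed by the largest of the n values in data.
// OR-ing all values keeps every set bit, so the accumulator has the
// same highest set bit as the maximum value.
int max_bits_or(const unsigned int* data, unsigned long n) {
    unsigned int accumulator = 0;
    for (unsigned long i = 0; i < n; ++i) {
        accumulator |= data[i];
    }
    return integer_logarithm(accumulator);
}
```

<p>The loop body has no branch and a single cheap operation per integer, which is a plausible reason why this version is so much faster than computing one logarithm per integer.</p>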
<p>To experiment with this problem, I wrote a <a href="https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2012/04/05/bit-packing-is-fast-but-integer-logarithm-is-slow">small program</a> which finds the maximum integer logarithm of a large array of random integers. It then packs the integers using this logarithm. </p>
<ul>
<li>I find that I can pack between 1 billion and 2 billion integers per second.</li>
<li>I compute the maximum integer logarithm at a rate of 3 billion integers per second.</li>
</ul>
<p>When plotting the speeds as functions of the actual  maximum integer logarithm, we see that the computation of the logarithm is not sensitive to the value of the actual logarithm, except for the approach based on the <tt>__builtin_clz</tt> function which is slower when the logarithm is less than 8.<br />
<img src="http://lemire.me/blog/wp-content/uploads/2012/04/Screen-Shot-2012-04-05-at-9.26.07-PM.png" /></p>
<p>In my tests, I used the GNU GCC 4.6.2 compiler on an Intel Core i7 processor. My code is <a href="https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2012/04/05/bit-packing-is-fast-but-integer-logarithm-is-slow">freely available</a>.</p>
<p><strong>Conclusion</strong> When packing an array of integers, finding the maximum logarithm can take anywhere from 1/4 to 1/3 of the running time. However, brute force techniques that compute the integer logarithm of every integer are much slower.</p>
<div class="related">
<p>Related posts (automatically generated):</p>
<ul>
		<li><a href="http://lemire.me/blog/archives/2012/03/06/how-fast-is-bit-packing/" rel="bookmark">How fast is bit packing?</a><!-- (17.8)--></li>
	</ul>
</div>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/daniel-lemire/atom?a=Pb24G4xu4C4:_lv-3tkdHGc:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/daniel-lemire/atom?i=Pb24G4xu4C4:_lv-3tkdHGc:D7DqB2pKExk" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/Pb24G4xu4C4" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/04/05/bit-packing-is-fast-but-integer-logarithm-is-slow/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/04/05/bit-packing-is-fast-but-integer-logarithm-is-slow/</feedburner:origLink></item>
		<item>
		<title>It is what you do, not what you own</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/1qDyUBCjNxE/</link>
		<comments>http://lemire.me/blog/archives/2012/04/03/what-you-do-not-what-you-own/#comments</comments>
		<pubDate>Tue, 03 Apr 2012 15:53:32 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4124</guid>
		<description>Over 20 years ago, back when I was in high school, I went on a sailboat trip. I was so impressed that I decided to own a sailboat one day. I realized that a sailboat was expensive, and I guess I thought that owning a boat would not only be cool, it would be a [...]</description>
			<content:encoded><![CDATA[<p> <a style="float:right;margin:5px" href="http://lemire.me/blog/wp-content/uploads/2012/04/535245_259976157425124_100002382257993_585108_60003389_n.jpg"><img src="http://lemire.me/blog/wp-content/uploads/2012/04/535245_259976157425124_100002382257993_585108_60003389_n-300x200.jpg" alt="" title="535245_259976157425124_100002382257993_585108_60003389_n" width="300" height="200"  /></a><br />
Over 20 years ago, back when I was in high school, I went on a sailboat trip. I was so impressed that I decided to own a sailboat one day. I realized that a sailboat was expensive, and I guess I thought that owning a boat would not only be cool, it would be a symbol of my success. (Can you recognize me in the sailboat picture?)</p>
<p><a style="float:right;margin:5px" href="http://lemire.me/blog/wp-content/uploads/2012/04/P1040692.jpg"><img src="http://lemire.me/blog/wp-content/uploads/2012/04/P1040692-300x225.jpg" alt="" title="P1040692" width="300" height="225"  /></a><br />
How did it go?<br />
Today, not only do I own two sailboats, I&#8217;m building a third one. Of course, they are radio-controlled boats, about 4 feet tall and 2 feet long. I find that I really like to design and build these little boats.</p>
<p>Did I fulfill my dream? I am sure my younger self would be unimpressed by the small boats I own, but he would be floored by the two young boys who accompany me when I test them. My dream of owning a sailboat became irrelevant. Today, in 2012, if I wanted to go on a sailboat trip, I would probably rent one. And what would matter most to me is the look on the faces of my boys.</p>
<p>I find that this irrelevance of my earlier dreams is a common pattern throughout my life. My younger self was dreaming  about having things and being someone. He thought this would bring happiness. He was wrong.</p>
<p>Today, I focus on <strong>doing</strong> things. I do not own very much and I am not someone important. But I do fun things. I write this blog, I write research papers, I build boats, I publish software, I play video games with my boys&#8230;  These are the things that matter.</p>
<p><strong>Note</strong>: I am grateful to <a href="http://www.facebook.com/profile.php?id=100002382257993">Maxime Larocque</a> for keeping his old high school pictures and posting them on Facebook.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/daniel-lemire/atom?a=1qDyUBCjNxE:6gxRxxE2RNc:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/daniel-lemire/atom?i=1qDyUBCjNxE:6gxRxxE2RNc:D7DqB2pKExk" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/1qDyUBCjNxE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/04/03/what-you-do-not-what-you-own/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/04/03/what-you-do-not-what-you-own/</feedburner:origLink></item>
		<item>
		<title>Publicly available large data sets for database research</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/St0lsfCLF94/</link>
		<comments>http://lemire.me/blog/archives/2012/03/27/publicly-available-large-data-sets-for-database-research/#comments</comments>
		<pubDate>Tue, 27 Mar 2012 15:07:38 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4102</guid>
		<description>Most database research papers use synthetic data sets. That is, they use random-number generators to create their data on the fly. A popular generator is dbgen from the Transaction Processing Performance Council (TPC). Why is that a problem? We end up working with simplistic models. If we consider the main table generated by dbgen, out [...]</description>
			<content:encoded><![CDATA[<p>Most database research papers use synthetic data sets. That is, they use random-number generators to create their data on the fly. A popular generator is <a href="http://www.tpc.org/tpch/">dbgen</a> from the Transaction Processing Performance Council (TPC).</p>
<p>Why is that a problem? </p>
<ul>
<li>
We end up working with simplistic models. If we consider the main table generated by dbgen, out of 17 columns, 7 have uniform distributions. This almost never happens with real data. Similarly, we often end up with attributes that are perfectly statistically independent. I believe that randomly generated data is not representative of the real data you find in real businesses.
</li>
<li>
Because so many people use essentially the same data generators, we risk having solutions that are optimized for this particular synthetic data. This allows researchers to essentially cheat, sometimes unknowingly.
</li>
</ul>
<p>However, finding suitably large real data sets is difficult. My own research focus is on data warehousing. Most businesses are unwilling to share the data in their data warehouses. Even if they were willing to do so, sharing very large files is inconvenient.</p>
<p>Here are three moderately large data sets that I have used in my research:</p>
<ul>
<li>I found a table of <a href="http://www.google.com/fusiontables/DataSource?dsrcid=224453">wikileaks-related metadata</a> on Google Fusion. By transforming it into a relational table, I was able to create a table with over a million rows. It is tiny by data warehousing standards, but the data is easily accessible.</li>
<li>I like the Canadian Census from 1880. You have data about 4 million Canadians. The data is publicly available and convenient. See my <a href="http://arxiv.org/abs/0909.1346">paper in Information Sciences from 2011</a> for details on how we retrieved and processed it. </li>
<li>There is a weather data set that we have used repeatedly. It is relatively large (9 GB) and freely available as the <a href="http://cdiac.ornl.gov/epubs/ndp/ndp026b/ndp026b.htm">Edited synoptic cloud reports from ships and land stations over the globe</a>. You can retrieve all of the data files from the <a href="http://cdiac.ornl.gov/ftp/ndp026b/">ftp directory</a> and aggregate them in a single table.</li>
</ul>
<p>Can we do better? I asked on the Internet for large tabular data sets (greater than 20 GB) that can be considered representative of business use. I have not found anything that matches my needs yet, but I am investigating further one data set that I consider especially promising:</p>
<ul>
<li><a href="https://plus.google.com/u/0/107121399840634452924/posts">James Long</a> suggested I have a look at the <a href="http://www.census.gov/main/www/cen2000.html">US Census from 2000</a>. I am currently downloading some of their data. I am unsure how big a table I can construct, but there are gigabytes of freely available data.</li>
</ul>
<p>I also received other excellent proposals that I am not pursuing for the time being:</p>
<ul>
<li><a href="https://plus.google.com/u/0/109137147030554669814/posts">Jared Webb</a> suggested I grab one of the <a href="http://www.ehdp.com/vitalnet/datasets.htm">large health data sets</a>. I ended up getting the <a href="http://seer.cancer.gov/">SEER data set</a> from the National Cancer Institute, as recommended by Jared. I got a 210 MB zip file filled with relatively small flat files. The data is freely accessible, but you must fax a signed form. It is not nearly large enough for my needs.</li>
<li><a href="https://plus.google.com/u/0/116496743359717565259/posts">Howard C. Shaw III</a> suggested the <a href="http://www.cs.cmu.edu/~enron/">Enron data set</a>. Unfortunately, the Enron email data set is primarily made of unstructured (text) data. It does not suit my purposes.</li>
<li><a href="https://plus.google.com/u/0/116496743359717565259/posts">Howard</a> and <a href="https://twitter.com/#!/neil_conway">Neil Conway</a> also pointed out that I should look at the <a href="http://aws.amazon.com/datasets">Amazon public data sets</a>. These are moderately large data sets that Amazon makes available to its web services customers. In particular, <a href="https://plus.google.com/u/0/115621817091875458160/posts">Tim Goh</a> suggested I look at the <a href="http://aws.amazon.com/datasets/2320">Freebase data dump</a>. Unfortunately, I am not an Amazon customer and I am uneasy about basing my research on data that is only available through an Amazon subscription. Thankfully, many of the Amazon public data sets are available elsewhere. For example, <a href="http://www.linkedin.com/in/nicolastorzec">Nicolas Torzec</a> told me that Freebase makes <a href="http://download.freebase.com/datadumps/">its database dumps available from its own site</a>. He says that the full dump is 4 GB compressed and that it contains about 50 million objects.</li>
<li><a href="https://plus.google.com/u/0/115621817091875458160/posts">Tim Goh</a> reminded me of the <a href="http://books.google.com/ngrams/datasets">Google Book n-gram data set</a>. (Update: this is different from the Google n-gram data set which is not freely available.) It is not typical data warehousing data however.</li>
<li><a href="https://plus.google.com/u/0/115315167921149712529/posts">Kristina Chodorow</a> pointed me to a page on quora.com containing a long list of <a href="http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public">publicly available data sets</a>. There is a wealth of information there, but I find that a lot of times, access is difficult, or the data is unstructured or very specialized (e.g., web search).</li>
<li><a href="https://plus.google.com/u/0/100411588371481475841/posts">Jeff Green</a> suggested I look at the <a href="http://www.ic.nhs.uk/statistics-and-data-collections">NHS public data sets</a>. I spent some time on their site and could only find small data sets.</li>
<li><a href="https://plus.google.com/u/0/114011283665948992932/posts">Aaron Newton</a>, <a href="https://twitter.com/#!/elehack">Michael Ekstrand</a> and <a href="http://nikete.com/">Nicolás Della Penna</a> suggested  a snapshot of the <a href="http://en.wikipedia.org/wiki/Wikipedia:Database_download">Wikipedia database</a>. Unfortunately, again, much of the data is unstructured. </li>
<li><a href="https://plus.google.com/u/0/103643913746947599094/posts">Brian McFee</a> pointed me to the <a href="http://labrosa.ee.columbia.edu/millionsong/">Million Song dataset</a>. From what I could tell, you have detailed data about a million songs. It is primarily geared toward music information retrieval.</li>
<li><a href="https://plus.google.com/u/0/110433103726971013987/posts">Matt Kevins</a> pointed to the <a href="http://www.google.com/publicdata/directory">Google public data sets</a>. Alas, I could not find out how to download the data sets and I am not sure how large they are.</li>
<li><a href="https://plus.google.com/u/0/106693866707326539937/posts">Aleks Scholz</a> pointed me to the <a href="http://www.ipac.caltech.edu/2mass/releases/allsky/doc/sec1_4.html#ftpdes">all-sky data set</a>. It is a large, freely available, astronomy data set. Most of the data is made of floating-point numbers so it does not fit my immediate needs, but it looks very interesting.</li>
<li><a href="https://plus.google.com/u/0/107861942561128674596/posts">Israel Herraiz</a> suggested I look at the <a href="http://sourcerer.ics.uci.edu/repository.html">Sourcerer project</a>. It looks like a large collection of Java source code. This could support great research projects, but it is not a good match for what I want to do right now.</li>
<li><a href="https://twitter.com/#!/kopfkind">Axel Knauf</a> suggested <a href="http://commoncrawl.org/data/">web crawling data</a>. Probably not a good fit for my current research though it can be very valuable if you work on web information retrieval.</li>
<li><a href="http://home.manhattan.edu/~peter.boothe/">Peter Boothe</a> asked whether I was interested in <a href="http://www.routeviews.org/">BGP routing data count</a>. It looks like network data. I haven&#8217;t looked at it too much. Could be interesting, but it is probably too specialized.</li>
<li><a href="https://twitter.com/#!/kaleidic">Tracy Harms</a> asked about the Netflix data set. Unfortunately, this data set is no longer publicly available and it was only 2 GB. I used it in my 2010 <a href="http://arxiv.org/abs/0901.3751">Data &amp; Knowledge Engineering paper on bitmap indexes</a>.</li>
<li><a href="https://twitter.com/#!/AaronJElmore">Aaron J Elmore</a> pointed me to <a href="http://oltpbenchmark.com/">oltpbenchmark.com</a> for an <a href="http://en.wikipedia.org/wiki/Online_transaction_processing">online transaction processing benchmark</a> framework. My research is not primarily on transactions (OLTP), but this is a very interesting project. They have collected data sets and corresponding workloads. In particular, they link to <a href="http://www.wikibench.eu/wiki/2007-09/">Wikipedia access statistics</a>. It could be very important if you are designing back-end systems for web applications.</li>
<li><a href="http://ocelma.net/">Òscar Celma</a> points to a <a href="http://an.kaist.ac.kr/~haewoon/release/twitter_social_graph/">Twitter social graph</a> which occupies several gigabytes. </li>
</ul>
<p>(If you answered my queries and I have not included you, I am sorry.)</p>
<div class="related">
<p>Related posts (automatically generated):</p>
<ul>
		<li><a href="http://lemire.me/blog/archives/2006/05/26/stxxl-c-standard-template-library-for-extra-large-data-sets/" rel="bookmark">STXXL: C++ Standard Template Library for Extra Large Data Sets</a><!-- (15.2)--></li>
	</ul>
</div>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/daniel-lemire/atom?a=St0lsfCLF94:nDM3DSltJ7U:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/daniel-lemire/atom?i=St0lsfCLF94:nDM3DSltJ7U:D7DqB2pKExk" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/St0lsfCLF94" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/03/27/publicly-available-large-data-sets-for-database-research/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/03/27/publicly-available-large-data-sets-for-database-research/</feedburner:origLink></item>
		<item>
		<title>Do we need copyright?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/sShDMGjQIjc/</link>
		<comments>http://lemire.me/blog/archives/2012/03/22/do-we-need-copyright/#comments</comments>
		<pubDate>Thu, 22 Mar 2012 14:46:03 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4081</guid>
		<description>The concept of property is a social construction. Animals, such as cats, can own a piece of food, or a territory, but only as long as they are able to personally maintain a credible threat of violence. And animals can only defend concrete, physical properties, such as an area, a dead bird or a tree. [...]</description>
			<content:encoded><![CDATA[<p>The concept of property is a social construction. Animals, such as cats, can own a piece of food, or a territory, but only as long as they are able to personally  maintain a credible threat of violence. And animals can only defend concrete, physical properties, such as an area, a dead bird or a tree.</p>
<p>Yet we are trained to hold copyright as a natural right. People who infringe on copyright are labelled as pirates, thieves. We are told that they literally steal from hard-working creators.</p>
<p>Most people learn about copyright in schools. Often the local librarian will lecture students on the evil of photocopying a book. However, as is often the case, schools fail us. We have good reasons to be critical of copyright and we should question the myths that are often reiterated about copyright.</p>
<p><strong>First myth:  Copyright is meant primarily to protect authors. </strong></p>
<p>This is  a lie. </p>
<ul>
<li>State-enforced copyright came about with the <a href="http://en.wikipedia.org/wiki/Statute_of_Anne">Statute of Anne</a> in 1710. It was the result of lobbying by a group of English publishers who sought to regain their monopoly on publishing. Handing out the initial copyright to the authors was a political gesture: the goal has always been to get authors to hand over the copyright to the publisher, effectively giving the publisher a monopoly.
</li>
<li>In most countries, copyright holds for 70 years after the death of the author. Such a long-term copyright cannot possibly be meant to protect authors.
</li>
</ul>
<p><strong>Second myth: Copyright protects the little guy.</strong></p>
<p>Most of the revenue due to copyright goes to wealthy individuals and corporations. Meanwhile, most people who rely on their copyright for a living (writers, musicians, and so on) have low incomes.
</p>
<p><strong>Third myth: Without copyright, there could be no innovation.</strong></p>
<p>Some of the most innovative domains are virtually free from copyright:</p>
<ul>
<li>The fashion industry is effectively copyright-free. Anyone can come up with a new design for a dress. If the design is successful, it will be copied and it is impractical to try to enforce copyright. Thus, fashion designers must constantly out-innovate the competition.</li>
<li>Researchers freely hand over the copyright to publishers in exchange for nothing. Researchers are driven to invent and innovate because their remuneration and social status depend on their reputation. If anything, copyright on research work slows down progress.</li>
</ul>
<p><strong>Fourth myth: We know that copyright makes us collectively  better off.</strong></p>
<p>The evidence points in the opposite direction. Germany had weak copyright laws up until the Copyright Act of 1901. Yet, maybe because of these weak laws, it became a literary and scientific power:</p>
<blockquote><p>(&#8230;), only 1,000 new works appeared annually in England at that time &#8212; 10 times fewer than in Germany &#8212; and this was not without consequences. Höffner believes it was the chronically weak book market that caused England, the colonial power, to fritter away its head start within the span of a century, while the underdeveloped agrarian state of Germany caught up rapidly, becoming an equally developed industrial nation by 1900. (<a href="http://www.spiegel.de/international/zeitgeist/0,1518,710976,00.html">No Copyright Law The Real Reason for Germany&#8217;s Industrial Expansion?</a> by Frank Thadeusz)</p></blockquote>
<p> Your dentist probably does not have access to the latest research papers in dentistry: subscribing to a single scientific journal can cost thousands of dollars a year. Is it any surprise if the general public is poorly informed when copyright is used to keep them away from the best science, leaving them only generic news content and blogs?</p>
<p>Even if you don&#8217;t care about science, you should be concerned with the cost of copyright. For example, have you seen the latest Star Wars movies? They are awful. But that is all we are going to get for at least another 70 years because George Lucas has a monopoly on Star Wars. Without copyright, or with more limited copyright, we would have had several creators competing to build better Star Wars movies.</p>
<p><strong>Fifth myth: Without copyright, authors would not get paid.</strong></p>
<p>Authors do not have to get paid. Scientists and many authors actually pay to be published. Some authors publish for the indirect benefits of their publications, such as an improved reputation.</p>
<p>However, when authors do get paid, a natural model is patronage. That is the model used by most scientists. </p>
<p>&#8220;But, Daniel, you are delusional! Not every writer can find a patron.&#8221; Am I? I have funded several book projects myself. For example, a lady called Kio Stark <a href="http://www.kickstarter.com/projects/1528125592/dont-go-back-to-school-a-handbook-for-learning-any?ref=users">got $38,928 from us to write a handbook on alternatives to schooling</a>. </p>
<p>Several authors get funded on Kickstarter:</p>
<ul>
<li>Rich Burlew received <a href="http://www.kickstarter.com/projects/599092525/the-order-of-the-stick-reprint-drive">$1,254,120 to get back in print an old comic book</a>.</li>
<li>Dennis McKenna received <a href="http://www.kickstarter.com/projects/1862402066/the-brotherhood-of-the-screaming-abyss">$85,750 to write a memoir</a>.</li>
<li>Cory Silverberg received <a href="http://www.kickstarter.com/projects/1809291619/what-makes-a-baby">$65,516 to write a book on where babies come from</a>.</li>
</ul>
<p> In fact, if you think about it for a minute, whenever you buy a book or a movie, you are being a patron to this project. So all work is the result of patronage. <a href="http://craphound.com">Cory Doctorow</a> makes all his novels available for free from his web site. He happens to be one of the most successful writers of his generation. You can be confident that he is doing well financially. It works for him because people are willing to support him: his paying readers are his patrons.</p>
<p> I should add that whenever you follow a link to Amazon.com from my site, and purchase something, I get a percentage of the transaction.  On a good day, I can make $5 with my blog. I could also add ads and  make a few hundred dollars a month. You do not need copyright laws to make some money off a blog: your readers can act as your patrons.</p>
<p><strong>My position</strong>: I see no justification for copyright. I am effectively a writer: I write lecture notes, research articles and blog posts. I get paid without relying on copyright. Instead, I have patrons: funding agencies, students, and blog readers. But if we insist on having copyright, it should at least be limited to a short term (say 5 years or less). </p>
<p><strong>Further reading</strong>: <a href="http://www.paulgraham.com/property.html">Defining Property</a> by Paul Graham and <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2024588">What We Know, What We Don&#8217;t Know, and What Policy-makers Would Like Us to Know About the Economics of Copyright</a> by Ruth Towse. You may also enjoy my blog post <a href="http://lemire.me/blog/archives/2012/01/06/do-we-need-patents/">Do we need patents?</a> </p>
<p><strong>Appendix</strong></p>
<p>American authors have always enjoyed the protection of copyright laws in the USA. Prior to the adoption of the US constitution, authors in the US were subject to the Statute of Anne. However, the works of <em>foreign authors</em> in the USA were considered to be in the public domain up until the end of the 19<sup>th</sup> century. One might think, since copyright is supposedly good for authors, that foreign authors would be penalized by this lack of copyright. It seems that they were not. Dickens made a fortune in the USA despite the lack of copyright. Foreign authors could sell their &#8220;authorization&#8221; and they would frequently negotiate advances in excess of what they could get in Europe. (References:  Khan, <a href="http://www.nber.org/papers/w10271">Does copyright piracy pay</a> and Plant, <a href="http://www.jstor.org/discover/10.2307/2548748?uid=3739464&#038;uid=2&#038;uid=3737720&#038;uid=4&#038;sid=21100680873821">The economic aspects of copyright in books</a>)</p>
<p>What about contemporary examples? Indian intellectual property enforcement has been historically weak. You can readily find copies of American movies made in India, and American studios do not bother suing: Indian courts place a high bar on infringement. Derivative works (such as making a movie or a book out of an existing work) are often not found to be infringing copyright in India whereas they would elsewhere. So how does India fare culturally? Well, outside of America, Indian movie production is unsurpassed (hence the term <a href="http://en.wikipedia.org/wiki/Bollywood">Bollywood</a>).</p>
<p>Similarly, Japan, Korea and Taiwan have maintained weak intellectual property regimes. It is believed that this was a key factor in their economic growth during the second half of the 20<sup>th</sup> century. (Kumar, <a href="http://www.iprcommission.org/papers/pdfs/study_papers/sp1b_kumar_study.pdf">Intellectual Property Rights, Technology and Economic Development: Experiences of Asian Countries</a>)</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/daniel-lemire/atom?a=sShDMGjQIjc:0jnQ3LbkdK8:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/daniel-lemire/atom?i=sShDMGjQIjc:0jnQ3LbkdK8:D7DqB2pKExk" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/sShDMGjQIjc" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/03/22/do-we-need-copyright/feed/</wfw:commentRss>
		<slash:comments>49</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/03/22/do-we-need-copyright/</feedburner:origLink></item>
		<item>
		<title>From counting citations to measuring usage (help needed!)</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/j766Qnn339s/</link>
		<comments>http://lemire.me/blog/archives/2012/03/20/from-counting-citations-to-measuring-usage-help-needed/#comments</comments>
		<pubDate>Tue, 20 Mar 2012 17:26:34 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4073</guid>
		<description>We sometimes measure the caliber of a researcher by how many research papers he wrote. This is silly. While there is some correlation between quantity and quality &amp;#8212; people like Einstein tend to publish a lot &amp;#8212; it can be gamed easily. Moreover, several major researchers have published relatively few papers: John Nash has about [...]</description>
			<content:encoded><![CDATA[<p>We sometimes measure the caliber of a researcher by how many research papers he wrote. This is silly. While there is some correlation between quantity and quality &mdash; people like Einstein tend to publish a lot &mdash; it can be gamed easily. Moreover, several major researchers have published relatively few papers: <a href="http://en.wikipedia.org/wiki/John_Forbes_Nash,_Jr.">John Nash</a> has about two dozen papers in Scopus. Even if you don&#8217;t know much about science, I am sure you can think of a few writers who have written only a couple of books but are still world famous.</p>
<p>A better measure is the number of citations a researcher has received. <a href="http://scholar.google.com/citations?user=q1ja-G8AAAAJ">Google Scholar profiles</a> display  the citation record of researchers prominently. It is a slightly more robust measure, but it is still silly because 90% of citations are shallow: most authors haven&#8217;t even read the paper they are citing. We tend to cite famous authors and famous venues in the hope that some of the prestige will get reflected. </p>
<p>But why stop there? We have the technology to measure the usage made of a cited paper. Some citations are more significant than others: for example, the citing paper can be an extension of the cited paper. Machine learning techniques can measure the impact of your papers based on how much subsequent papers build on your results. Why isn&#8217;t it done?</p>
<p>People object that defining a metric based on machine learning is troublesome. However, we rely daily on spam filters, search engines and recommender systems that we do not fully understand. Measures that are beyond our ability to compute by hand have repeatedly proven useful. Moreover, identifying important citations can have other applications: </p>
<ul>
<li>Google Scholar says that I am cited about 160 times a year. On average, a paper citing me comes out every two days. What does it mean? I don&#8217;t know. I would be interested in identifying quickly which papers make non-trivial use of my ideas. I am sure many researchers would be interested too!</li>
<li>I sometimes stumble on older highly cited papers. I want to quickly identify the significant follow-up papers. Yet I am often faced with a sea of barely relevant papers that merely cited the reference in passing. It would be tremendously useful for me to know which papers have cited the reference meaningfully.</li>
</ul>
<p>Hence, I surveyed the machine learning literature on classifying citations. I found high quality work, but I feel it is an under-appreciated problem. So I got in touch with <a href="http://nova.apperceptual.com/">Peter Turney</a> and <a href="http://web.ncf.ca/andre/">Andre Vellino</a> and we decided to promote this problem further.</p>
<p>Our first step is to collect a data set of papers together with their most important references. We believe that the people best placed to determine which references are crucial are the authors themselves!</p>
<p>So, if you are a published researcher, we ask you to contribute by <a href="https://docs.google.com/spreadsheet/viewform?formkey=dHlDalFfR1AzTXpaRXA2WEVlRUF5b0E6MA#gid=0">filling out our short online form</a>. On this form, you will be asked for your name and a few papers together with an identification of the crucial references for each paper. The form can take less than 30 seconds to fill out.</p>
<p>In exchange, we will publish the data we collect under the <a href="http://opendatacommons.org/licenses/pddl/1-0/">ODC Public Domain Dedication and Licence</a>. If you leave us your email, we will even tell you when the data is publicly available. Such a public high-quality data set should entice a few researchers to write papers. And, of course, I might contribute to such a paper myself.</p>
<p>My long-term goal is simple: I hope that in a couple of years, Google Scholar will differentiate between citations and &#8220;meaningful&#8221; citations.</p>
<p>Now go fill out the <a href="https://docs.google.com/spreadsheet/viewform?formkey=dHlDalFfR1AzTXpaRXA2WEVlRUF5b0E6MA#gid=0">form</a>!</p>
<p><strong>Note</strong>: I have <a href="https://plus.google.com/105888615414982242080/posts/P5afw9AU5FD">an earlier version of this post</a> on Google+ with several insightful comments.</p>
<p><strong>Further reading</strong>: <a href="http://synthese.wordpress.com/2012/03/20/building-a-better-citation-index/">Building a Better Citation Index</a> by Andre Vellino</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/daniel-lemire/atom?a=j766Qnn339s:uml7aXeZYzA:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/daniel-lemire/atom?i=j766Qnn339s:uml7aXeZYzA:D7DqB2pKExk" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/j766Qnn339s" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/03/20/from-counting-citations-to-measuring-usage-help-needed/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/03/20/from-counting-citations-to-measuring-usage-help-needed/</feedburner:origLink></item>
		<item>
		<title>How fast is bit packing?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/0MVXftJEZqY/</link>
		<comments>http://lemire.me/blog/archives/2012/03/06/how-fast-is-bit-packing/#comments</comments>
		<pubDate>Wed, 07 Mar 2012 03:25:35 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4057</guid>
		<description>Integer values are typically stored using 32 bits. Yet if you are given an array of integers between 0 and 131&amp;#160;072, you could store these numbers using as little as 17 bits each&amp;#8212;a net saving of almost 50%. Programmers nearly never store integers in this manner despite the obvious compression benefits. Indeed, bit packing and [...]</description>
			<content:encoded><![CDATA[<p>Integer values are typically stored using 32 bits. Yet if you are given an array of non-negative integers smaller than 131&nbsp;072, you could store these numbers using as little as 17 bits each&mdash;a net saving of almost 50%. </p>
<p>Programmers almost never store integers in this manner despite the obvious compression benefits. After all, bit packing and unpacking are expensive. How expensive? Intuitively, you might think that recovering 32-bit integers from a stream of packed integers must be at least as expensive as copying the 32-bit integers, and possibly much more expensive. If that is your intuition, then you might be wrong. It can be cheaper to recover 32-bit integers from packed 4-bit integers because you only need to load one 32-bit word to unpack 8 integers. </p>
<p>Clearly, packing integers in units of 17 bits is not especially convenient. Indeed, 17 and 32 are <a href="http://en.wikipedia.org/wiki/Coprime">coprime</a>. We expect that it would be much faster to pack and unpack integers in units of 4, 8 or 16 bits, than in units of 17 bits. Indeed it is, but the difference may not be as large as you might think.</p>
<p>I have implemented <a href="http://pastebin.com/ugGnk00p">efficient packing and unpacking routines</a> in C++. To simplify the implementation, we pack and unpack integers in sets of 32 numbers.  I have optimized the code using the GNU GCC 4.6.2 compiler. </p>
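<p>To make the idea concrete, here is a minimal sketch of packing and unpacking a set of 32 integers. It is not the linked code: the names <code>pack32</code> and <code>unpack32</code> are hypothetical, the bit width is assumed to be between 1 and 32, and the loop is written for clarity rather than speed (an optimized version would typically be unrolled for each bit width).</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: packs 32 integers using `bits` bits each.
// Only the low `bits` bits of each input are kept; the result has `bits` words.
std::vector<uint32_t> pack32(const uint32_t *in, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    std::vector<uint32_t> out(bits, 0);
    const uint64_t mask = (1ull << bits) - 1;
    uint64_t buffer = 0;  // bits waiting to be written out
    unsigned filled = 0;  // how many bits are waiting (always < 32 here)
    std::size_t w = 0;
    for (int i = 0; i < 32; ++i) {
        buffer |= (in[i] & mask) << filled;
        filled += bits;
        if (filled >= 32) {  // a full 32-bit word is ready
            out[w++] = static_cast<uint32_t>(buffer);
            buffer >>= 32;
            filled -= 32;
        }
    }
    return out;
}

// Hypothetical helper: recovers the 32 integers written by pack32.
void unpack32(const std::vector<uint32_t> &in, unsigned bits, uint32_t *out) {
    assert(bits >= 1 && bits <= 32 && in.size() >= bits);
    const uint64_t mask = (1ull << bits) - 1;
    uint64_t buffer = 0;     // bits loaded but not yet consumed
    unsigned available = 0;  // how many such bits remain
    std::size_t w = 0;
    for (int i = 0; i < 32; ++i) {
        if (available < bits) {  // load the next packed word
            buffer |= static_cast<uint64_t>(in[w++]) << available;
            available += 32;
        }
        out[i] = static_cast<uint32_t>(buffer & mask);
        buffer >>= bits;
        available -= bits;
    }
}
```

<p>For instance, calling <code>pack32(in, 17)</code> on 32 values yields 17 words, and <code>unpack32</code> with the same bit width recovers the low 17 bits of each value. With a bit width of 4, each packed word holds 8 integers.</p>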
<p>On my MacBook Air (Intel Core i7), I find that the <em>unpacking</em> speed is not very sensitive to the specific number of bits: generally, the smaller the bit width, the faster the unpacking. The <em>packing</em> speed is much faster when the bit width is 8 or 16. Even so, the difference is only a factor of two or so. The results are presented in the next figure. On the y axis, you have the time (smaller is better). On the x axis, we have the number of bits we packed to. For example, when the number of bits is 1, we pack 32 integers into a single 32-bit word. When the number of bits is 32, we have a regular copy.</p>
<p><img src="http://lemire.me/blog/wp-content/uploads/2012/03/blogbitpacking.png" /></p>
<p>I also provide the raw numbers behind the figure in the next table. </p>
<table style="border-collapse:collapse;text-align:center;margin-left:auto;margin-right:auto">
<tr style="border-top:3px solid #ccc;border-bottom:2px solid #ccc;">
<th>bits</th>
<th>pack (ms)</th>
<th>unpack (ms)</th>
</tr>
<tr><td>1</td><td>219</td><td>211</td></tr>
<tr><td>2</td><td>215</td><td>216</td></tr>
<tr><td>3</td><td>210</td><td>205</td></tr>
<tr><td>4</td><td>198</td><td>194</td></tr>
<tr><td>5</td><td>222</td><td>214</td></tr>
<tr><td>6</td><td>229</td><td>218</td></tr>
<tr><td>7</td><td>242</td><td>222</td></tr>
<tr><td>8</td><td>167</td><td>202</td></tr>
<tr><td>9</td><td>252</td><td>240</td></tr>
<tr><td>10</td><td>243</td><td>225</td></tr>
<tr><td>11</td><td>255</td><td>235</td></tr>
<tr><td>12</td><td>246</td><td>231</td></tr>
<tr><td>13</td><td>276</td><td>244</td></tr>
<tr><td>14</td><td>279</td><td>245</td></tr>
<tr><td>15</td><td>304</td><td>255</td></tr>
<tr><td>16</td><td>183</td><td>223</td></tr>
<tr><td>17</td><td>292</td><td>252</td></tr>
<tr><td>18</td><td>297</td><td>256</td></tr>
<tr><td>19</td><td>316</td><td>266</td></tr>
<tr><td>20</td><td>300</td><td>256</td></tr>
<tr><td>21</td><td>329</td><td>280</td></tr>
<tr><td>22</td><td>321</td><td>274</td></tr>
<tr><td>23</td><td>332</td><td>278</td></tr>
<tr><td>24</td><td>299</td><td>257</td></tr>
<tr><td>25</td><td>341</td><td>289</td></tr>
<tr><td>26</td><td>340</td><td>298</td></tr>
<tr><td>27</td><td>352</td><td>295</td></tr>
<tr><td>28</td><td>336</td><td>284</td></tr>
<tr><td>29</td><td>367</td><td>311</td></tr>
<tr><td>30</td><td>357</td><td>299</td></tr>
<tr><td>31</td><td>384</td><td>319</td></tr>
<tr style="border-bottom:3px solid #ccc;"><td>32</td><td>256</td><td>261</td></tr>
</table>
<p><strong>Conclusion</strong>: Bit packing and unpacking can be quite fast. In particular, it can be cheaper to unpack integers from a small number of bits to 32-bit integers than to copy the same 32-bit integers. Exact results will vary depending on your compiler and CPU.</p>
<p><strong>Note</strong>: Strictly speaking my implementation packs the first bits of each integer: it is not assumed that the integers are between 0 and 2<sup>bit</sup>. By adding this assumption, you can improve the packing speed somewhat (at least when the number of bits is not 8 or 16).</p>
<div class="related">
<p>Related posts (automatically generated):</p>
<ul>
		<li><a href="http://lemire.me/blog/archives/2012/04/05/bit-packing-is-fast-but-integer-logarithm-is-slow/" rel="bookmark">Bit packing is fast, but integer logarithm is slow</a><!-- (17.4)--></li>
	</ul>
</div>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/daniel-lemire/atom?a=0MVXftJEZqY:lwQGkCa_eK4:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/daniel-lemire/atom?i=0MVXftJEZqY:lwQGkCa_eK4:D7DqB2pKExk" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/0MVXftJEZqY" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/03/06/how-fast-is-bit-packing/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/03/06/how-fast-is-bit-packing/</feedburner:origLink></item>
		<item>
		<title>I’m an introvert. And that’s ok.</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/45umWjXi5is/</link>
		<comments>http://lemire.me/blog/archives/2012/03/03/im-an-introvert-and-thats-ok/#comments</comments>
		<pubDate>Sat, 03 Mar 2012 23:26:15 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4042</guid>
		<description>I&amp;#8217;m an introvert. That&amp;#8217;s why you don&amp;#8217;t see me at meetings and celebrations. If you do, I&amp;#8217;m in a corner looking awkward. That&amp;#8217;s why I&amp;#8217;m not trying to build a large laboratory of busy graduate students. That&amp;#8217;s why I crave time alone to reflect and think, to write and code&amp;#8230; I am not shy: I [...]</description>
			<content:encoded><![CDATA[<p>I&#8217;m an <a href="http://en.wikipedia.org/wiki/Introvert">introvert</a>. That&#8217;s why you don&#8217;t see me at meetings and celebrations. If you do, I&#8217;m in a corner looking awkward. That&#8217;s why I&#8217;m not trying to build a large laboratory of busy graduate students. That&#8217;s why I crave time alone to reflect and think, to write and code&#8230;</p>
<p>I am not <em>shy</em>: I can talk in front of 200 people without thinking twice about it. I don&#8217;t lack confidence.  I have a large ego&mdash;too large some would say. (I got my wife to read this post and she particularly agrees with this last sentence.)</p>
<p>But my social interactions have a high transaction cost: it takes me time and energy just to start chatting with someone. If I have to chat with dozens of people in a day, I end up exhausted. I can&#8217;t pretend to be your friend on the fly. My brain does not work that way.</p>
<p>I love how it is progressively becoming &#8220;ok&#8221; to be an introvert:</p>
<ul>
<li> Carl King wrote a beautiful <a href="http://www.carlkingdom.com/10-myths-about-introverts">essay</a>: 10 Myths About Introverts.  The last myth is the most important: &#8220;Introverts can fix themselves and become Extroverts&#8221;. Gays cannot become straight. Blacks cannot become white. I cannot become an extrovert.  King&#8217;s essay impressed me so much that I bought his book <a href="http://www.amazon.com/Youre-Creative-Genius-Now-What/dp/1932907920/ref=sr_1_1?ie=UTF8&#038;qid=1330815993&#038;sr=8-1&tag=daniellemires-20" rel="nofollow">So, You’re A Creative Genius</a>: a great read if you are both a creative person and an introvert.</li>
<li>Susan Cain gave a great talk based on her book: <a href="http://www.amazon.com/Quiet-Power-Introverts-World-Talking/dp/0307352145/ref=sr_1_1?s=books&#038;ie=UTF8&#038;qid=1330816086&#038;sr=1-1&tag=daniellemires-20" rel="nofollow">Quiet&mdash;The Power of Introverts in a World That Can&#8217;t Stop Talking</a>. A core message of her talk is that we should recognize our bias against introverts. Not everyone works best in groups.  And that&#8217;s ok. Schools and employers need to stop their attempts to fit introverts in the extrovert mold.
</li>
</ul>
<p><iframe width="560" height="315" src="http://www.youtube.com/embed/c0KYU2j0TM4?rel=0" frameborder="0" allowfullscreen></iframe></p>
<p><strong>Further reading</strong>: Venkatesh Rao (another introvert) penned an intriguing <a href="http://www.ribbonfarm.com/2011/04/07/extroverts-introverts-aspies-and-codies/">analysis</a> of introversion.</p>
<p><strong>Note</strong>: This is an expanded version of a <a href="https://plus.google.com/105888615414982242080/posts/HWkiZFBsLsJ">Google+ post</a>.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/daniel-lemire/atom?a=45umWjXi5is:lhBawIEdJcc:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/daniel-lemire/atom?i=45umWjXi5is:lhBawIEdJcc:D7DqB2pKExk" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/45umWjXi5is" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/03/03/im-an-introvert-and-thats-ok/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/03/03/im-an-introvert-and-thats-ok/</feedburner:origLink></item>
		<item>
		<title>What happens when you get more Ph.D.s?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/8pizIuKQG-0/</link>
		<comments>http://lemire.me/blog/archives/2012/02/20/what-happens-when-you-get-more-ph-d-s/#comments</comments>
		<pubDate>Mon, 20 Feb 2012 16:10:04 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4022</guid>
		<description>Following the fall of the USSR, hundreds of world class mathematicians emigrated to the USA. Intuitively, this should have made American mathematics stronger. Did it? Borjas and Doran examined the problem. Their starting point was the realization that the expertise of Soviet mathematicians differed from the expertise of American mathematicians. For historical reasons, these two [...]</description>
			<content:encoded><![CDATA[<p>Following the fall of the USSR, hundreds of world class mathematicians emigrated to the USA. Intuitively, this should have made American mathematics stronger. Did it?</p>
<p><a href="http://www.hks.harvard.edu/fs/gborjas/publications/working%20papers/BorjasDoranMay2011.pdf">Borjas and Doran</a> examined the problem. Their starting point was the realization that the expertise of Soviet mathematicians differed from the expertise of American mathematicians. For historical reasons, these two communities did different mathematics. So we can examine side-by-side areas of mathematics impacted by the arrival of the new mathematicians and areas unaffected.  Their verdict is that the influx of Soviet mathematicians was unhelpful to academic production:</p>
<blockquote><p>
It is (&#8230;) difficult to find convincing quantitative evidence that there was an improvement in the overall “output” of either the pre-existing American workforce or of that community combined with the Soviet émigrés.
</p></blockquote>
<p>What happened? Basically, the number of research jobs, including tenure-track positions, is mostly independent of the supply of Ph.D.s: Harvard will not hire more professors next year even if all of the Ph.D.s in the world move to Boston. The number of new hires depends on factors such as government funding for research and the number of undergraduate students in research universities. So the new researchers just took the jobs that would have gone to other researchers. Young American researchers had to drop out of research for lack of a position. </p>
<p>This study offers an important policy lesson. Training more Ph.D.s in some targeted areas might fail to improve research output in these areas. In this instance, supply-side economics fails. It might be preferable to create new research jobs instead and attract the Ph.D.s with better salaries.</p>
<p>For example, imagine that the government wants to help cancer research. Providing more scholarships to graduate students either directly, or through grants to professors, sounds sensible. However, it might not have the desired effect at all, according to this study. It would be preferable to use the money to create more research jobs pertaining to cancer research.</p>
<p><strong>Credit</strong>: Thanks to <a href="https://plus.google.com/107467306663817144149/about">Larry Larbear</a> for the reference to this study.</p>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/daniel-lemire/atom?a=8pizIuKQG-0:4Bc1aisANlo:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/daniel-lemire/atom?i=8pizIuKQG-0:4Bc1aisANlo:D7DqB2pKExk" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/8pizIuKQG-0" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/02/20/what-happens-when-you-get-more-ph-d-s/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/02/20/what-happens-when-you-get-more-ph-d-s/</feedburner:origLink></item>
		<item>
		<title>Bitmaps are surprisingly efficient</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/A-gifwY2fWE/</link>
		<comments>http://lemire.me/blog/archives/2012/02/17/bitmaps-are-surprisingly-efficient/#comments</comments>
		<pubDate>Fri, 17 Feb 2012 20:21:12 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=4000</guid>
		<description>Imagine you have to copy an array, and update a few values in the process. What is the most efficient implementation? Let us look at a concrete example. I am given this array: 0,5,1,4,5,1,10,4. I want to create a new array with these values: 0,5,1,40,5,1,100,4. Most programmers would follow the following algorithm: Copy the array [...]</description>
			<content:encoded><![CDATA[<p>Imagine you have to copy an array, and update a few values in the process. What is the most efficient implementation?</p>
<p>Let us look at a concrete example. I am given this array:<br />
<code>0,5,1,4,5,1,10,4.</code><br />
I want to create a new array with these values:<br />
<code>0,5,1,<span style="color:red;font-weight:bold;">40</span>,5,1,<span style="color:red;font-weight:bold;">100</span>,4.</code></p>
<p>Most programmers would follow the following algorithm:</p>
<ol>
<li>Copy the array first;</li>
<li>Iterate over the list of positions and updated values ([3,<span style="color:red;font-weight:bold;">40</span>], [6,<span style="color:red;font-weight:bold;">100</span>]) and correct the values.</li>
</ol>
<p>If there are very few values that need to be updated compared to the size of the array, this approach is probably optimal. The copy of the array itself is very efficient because it is entirely <a href="http://en.wikipedia.org/wiki/Vectorization_%28parallel_computing%29">vectorizable</a>: the processor does not need to copy values one at a time; it can copy two or four at a time.</p>
<p>But what if 10%, 20% or even 30% of the values need to be updated after the copy? Then storing the list of positions can become wasteful. For my toy problem, I have two positions to record (3 and 6): they take 64 bits when stored as 32-bit integers. If I want to be more efficient, I can use 8-bit integers, thus using a total of 16 bits. (Most modern computers favor 32-bit integers, and it is generally not computationally efficient to use integers with anything other than 8, 16, 32 or 64 bits.)</p>
<p>A more memory-conscious approach is to use a bitmap. That is, I store the following value using a binary notation:<br />
<code>000<span style="color:red;font-weight:bold;">1</span>00<span style="color:red;font-weight:bold;">1</span>0</code><br />
I put a 1 at positions 3 and 6 (counting from zero) and a 0 elsewhere. Taking position 0 as the least significant bit, this makes up the integer 72 (8+64). In this manner, I never need more than one bit per value. In this case, I use only 8 bits in total, a saving of 50% compared to the alternative where I store each position using an 8-bit integer.</p>
<p>We need a different implementation however, one where you check the bits of the bitmap before copying.</p>
<ul>
<li>for every position in the array
<ol>
<li>if the bitmap value is 0 then copy the value from the source array;</li>
<li>if the bitmap value is 1 then copy the next available updated value.</li>
</ol>
</li>
</ul>
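<p>The loop above can be sketched as follows. This is a simplified illustration rather than the benchmarked code: the function name <code>copyWithBitmap</code> is hypothetical, the bitmap is assumed to be stored as bytes with position 0 in the least significant bit, and the bits are tested one at a time instead of being processed a byte at a time.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: copies `source`, replacing the value at every position
// whose bit is set in `bitmap` with the next unused entry of `updates`.
// Position i maps to bit (i % 8) of byte (i / 8), least significant bit first.
std::vector<int> copyWithBitmap(const std::vector<int> &source,
                                const std::vector<uint8_t> &bitmap,
                                const std::vector<int> &updates) {
    assert(bitmap.size() * 8 >= source.size());
    std::vector<int> result(source.size());
    std::size_t next = 0;  // index of the next available updated value
    for (std::size_t i = 0; i < source.size(); ++i) {
        const bool updated = ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
        if (updated) {
            assert(next < updates.size());
            result[i] = updates[next++];  // bit is 1: take the updated value
        } else {
            result[i] = source[i];        // bit is 0: plain copy
        }
    }
    return result;
}
```

<p>With the toy example, the source array <code>0,5,1,4,5,1,10,4</code>, the one-byte bitmap 72 and the updated values <code>40,100</code> produce <code>0,5,1,40,5,1,100,4</code>.</p>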
<p>This new algorithm looks inefficient. There is a lot of branching inside a tight loop. Yet the bitmap approach can be faster when the density of updates is high enough (>2%), as the next table shows. </p>
<table style="border-collapse:collapse;text-align:center;margin-left:auto;margin-right:auto">
<tr style="border-top:3px solid #ccc;border-bottom:2px solid #ccc;">
<th>density (%)</th>
<th>&nbsp;&nbsp;time with position list&nbsp;&nbsp;</th>
<th>&nbsp;&nbsp;time with bitmaps&nbsp;&nbsp;</th>
<th>&nbsp;&nbsp;straight copy&nbsp;&nbsp;</th>
</tr>
<tr>
<td>17</td>
<td>			48		</td>
<td>		<strong>26</strong>	</td>
<td>		<strong>24</strong>	</td>
</tr>
<tr>
<td>9</td>
<td>			47		</td>
<td>		<strong>26</strong>	</td>
<td>		<strong>24</strong>	</td>
</tr>
<tr>
<td>6</td>
<td>			45		</td>
<td>		<strong>26</strong>	</td>
<td>		<strong>24</strong>	</td>
</tr>
<tr>
<td>5</td>
<td>			43		</td>
<td>		<strong>26</strong>	</td>
<td>		<strong>24</strong>	</td>
</tr>
<tr>
<td>4</td>
<td>			41		</td>
<td>		<strong>26</strong>	</td>
<td>		<strong>24</strong>	</td>
</tr>
<tr>
<td>3</td>
<td>			38		</td>
<td>		<strong>26</strong>	</td>
<td>		<strong>24</strong>	</td>
</tr>
<tr style="border-bottom:3px solid #ccc;">
<td>2</td>
<td>			35	</td>
<td>		<strong>26</strong>	</td>
<td>		<strong>24</strong>	</td>
</tr>
</table>
<p><a href="http://pastebin.com/fU18McyU">My C++ code is online</a>. I used GNU GCC 4.6.2 with only the -Ofast flag. Hardware-wise, I am using a recent MacBook Air with an Intel Core i7. (I stress that using GCC 4.6 is important. Older compilers might give different results. Also, the Core i7 is a processor with aggressive pipelining: cheaper processors might give different results.)</p>
<p>As you can see, the bitmap approach is nearly optimal: a copy with updated values indicated by a bitmap is almost as fast as a simple copy (within 10%).</p>
<p>(I updated the numbers on Feb. 23rd 2012. Originally, my code processed the bitmaps 32 bits at a time. I found that it was much faster to process them 8 bits at a time, probably because it allows better loop unrolling. I updated the code again on Feb. 27th 2012 after a bug report by Martin Trenkmann, the numbers were slightly updated as well.)</p>
<p><strong>Conclusion</strong>: Indicating exceptions using a bitmap can save memory without any penalty to the running time.</p>
<p><strong>Code</strong>: Source code posted on my blog is available from a <a href="https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog">github repository</a>.</p>
<div class="related">
<p>Related posts (automatically generated):</p>
<ul>
		<li><a href="http://lemire.me/blog/archives/2009/02/03/just-published-java-compressed-bitmap-class/" rel="bookmark">Compressed bitmaps in Java</a><!-- (11.2)--></li>
		<li><a href="http://lemire.me/blog/archives/2008/05/01/i-am-seeking-an-efficient-algorithm-to-group-identical-values-in-an-array/" rel="bookmark">Seeking an efficient algorithm to group identical values</a><!-- (10)--></li>
	</ul>
</div>
<div class="feedflare">
<a href="http://feeds.feedburner.com/~ff/daniel-lemire/atom?a=A-gifwY2fWE:t4-uuhwEpy0:D7DqB2pKExk"><img src="http://feeds.feedburner.com/~ff/daniel-lemire/atom?i=A-gifwY2fWE:t4-uuhwEpy0:D7DqB2pKExk" border="0"></img></a>
</div><img src="http://feeds.feedburner.com/~r/daniel-lemire/atom/~4/A-gifwY2fWE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/02/17/bitmaps-are-surprisingly-efficient/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/02/17/bitmaps-are-surprisingly-efficient/</feedburner:origLink></item>
		<item>
		<title>Effective compression using frame-of-reference and delta coding</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/ChkuEfTWiv0/</link>
		<comments>http://lemire.me/blog/archives/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 20:21:49 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3966</guid>
		<description>Most generic compression techniques are based on variations on run-length encoding (RLE) and Lempel-Ziv compression. Compared to these techniques and on the right data set, frame-of-reference and delta coding can be faster for a comparable compression rate. Mathematically, frame-of-reference and delta coding use the same principle: we apply an invertible transformation that maps a set [...]</description>
			<content:encoded><![CDATA[<p>Most generic compression techniques are based on variations on <a href="http://lemire.me/blog/archives/2009/11/24/run-length-encoding-part-i/">run-length encoding</a> (RLE) and  <a href="http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv">Lempel-Ziv</a> compression. Compared to these techniques and on the right data set, frame-of-reference and delta coding can be faster for a comparable compression rate.  </p>
<p>Mathematically, frame-of-reference and delta coding use the same principle: we apply an invertible transformation that maps a set of (relatively) large integers to mostly smaller integers. (This is a common pattern when compressing data).</p>
<p>Suppose that you wish to compress a sequence of (non-negative) integers. Consider the following sequence:  </p>
<p><code>107,108,110,115,120,125,132,132,131,135.</code></p>
<p>We could store these 10 numbers as 8-bit integers using 80 bits in total. For example, we have that 135 is 10000111 in binary notation.</p>
<p>The frame-of-reference approach begins by computing the range and minimum of the array. We see that the numbers range from 107 to 135. Thus, instead of coding the original sequence, we can subtract 107 from each value and code this difference instead:</p>
<p><code>0, 1, 3, 8, 13, 18, 25, 25, 24, 28.</code></p>
<p>We can code each offset value using no more than 5 bits. For example, 28 is 11100 in binary notation. Of course, we still need to store the minimum value (107) using 8 bits, and we need at least 3 bits to record the fact that only 5 bits per value are used. Nevertheless, the total (8+3+10*5=61) is less than the original 80 bits. In actual compression software, you would decompose the data into larger blocks (say 16, 128 or 2048 values), so the overhead of storing the minimal value would be small. Moreover, there are computational side benefits to this format: if we seek the value 1000, we know it cannot be in the block if its minimum is 107 and we use only 5 bits to store the offset from 107, since no value in the block can exceed 107+31=138.</p>
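<p>As a minimal sketch of the frame-of-reference idea, the following code computes the minimum, the bit width and the offsets for one block. The <code>ForBlock</code> structure and the function names are hypothetical, and the offsets are left unpacked for readability; a real implementation would bit-pack them.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical container for one encoded block.
struct ForBlock {
    uint32_t minimum;               // reference value subtracted from every entry
    unsigned bits;                  // bits needed to store the largest offset
    std::vector<uint32_t> offsets;  // value - minimum (not bit-packed here)
};

ForBlock encodeFrameOfReference(const std::vector<uint32_t> &values) {
    assert(!values.empty());
    const uint32_t lo = *std::min_element(values.begin(), values.end());
    const uint32_t hi = *std::max_element(values.begin(), values.end());
    ForBlock block;
    block.minimum = lo;
    block.bits = 0;
    // Count the bits of the range (largest offset).
    for (uint32_t range = hi - lo; range != 0; range >>= 1) ++block.bits;
    for (uint32_t v : values) block.offsets.push_back(v - lo);
    return block;
}

std::vector<uint32_t> decodeFrameOfReference(const ForBlock &block) {
    std::vector<uint32_t> values;
    for (uint32_t offset : block.offsets) values.push_back(block.minimum + offset);
    return values;
}

// True if `target` can be ruled out from the header alone,
// without looking at any offset.
bool cannotContain(const ForBlock &block, uint32_t target) {
    const uint64_t maxValue =
        static_cast<uint64_t>(block.minimum) + ((1ull << block.bits) - 1);
    return target < block.minimum || target > maxValue;
}
```

<p>On the example sequence, the encoder finds a minimum of 107 and a bit width of 5, and <code>cannotContain(block, 1000)</code> returns true without decoding anything.</p>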
<p>Frame-of-reference works well when the range of values in each block is relatively small. We can sometimes get better compression when the differences between successive values are small. In this case, it is useful to look at these differences (e.g., 108-107=1, 110-108=2, 115-110=5):</p>
<p><code>1,2,5,5,5,7,0,-1,4.</code></p>
<p>Given this set of differences and the initial value (107), we can reconstruct the original sequence. <em>Delta coding</em> is the compression strategy where we store these differences instead of the original values. Some people like to think of delta coding as a predictive scheme: you constantly predict that the next value will be like the previous one, and you just code the difference between your prediction and the observed value.</p>
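<p>The round trip can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a non-empty list of integers, Python 3.8 or later for the <code>initial</code> argument of <code>itertools.accumulate</code>), and the names <code>delta_encode</code> and <code>delta_decode</code> are hypothetical.</p>

```python
from itertools import accumulate


def delta_encode(values):
    """Return the first value and the successive differences."""
    first = values[0]
    deltas = [cur - prev for prev, cur in zip(values, values[1:])]
    return first, deltas


def delta_decode(first, deltas):
    """Rebuild the sequence as a running sum that starts at the first value."""
    return list(accumulate(deltas, initial=first))


values = [107, 108, 110, 115, 120, 125, 132, 132, 131, 135]
first, deltas = delta_encode(values)
print(first, deltas)  # 107 [1, 2, 5, 5, 5, 7, 0, -1, 4]
print(delta_decode(first, deltas) == values)  # True
```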
<p>In binary, the values 1,2,5,7 and 4 can be written as 001, 010, 101, 111, 100. If we did not have a negative value (-1), we could store these differences using only 3 bits per value. The negative value comes from the fact that our values are not entirely sorted (just locally so). However, as we shall see, this single negative value will cause us some trouble. How do we code the -1? </p>
<ul>
<li>The original values are 8-bit values. This means that -1 and 256-1 are the same number (modulo 256). For example, 25+255 modulo 256 is 24, just like 25-1. In effect, we compute differences in an integer ring. The differences become <code>1,2,5,5,5,7,0,255,4.</code> Computing the modulo with a power of two is fast because computers use the binary format natively.
</li>
<li>
If you know the value that was predicted (25 in our case), you know that the differences range from -25 to 230. Thus, a difference <em>x</em> between -25 and 25 is stored as 2<em>x</em> if it is non-negative and as -2<em>x</em>-1 if it is negative. Otherwise, we store it as <em>x</em>+25. One problem with this approach is that it may require much branching: the processor has to constantly check conditions before proceeding further. There may be a substantial penalty to pay when using modern superscalar processors.
</li>
<li>We can replace subtractions by bitwise exclusive or (xor) operations. This bypasses the issue entirely because xoring non-negative integers never generates negative values. The successive xor values are <code>7,2,29,11,5,249,0,7,4.</code> A benefit of the xor operation is that it is symmetric: x xor y is y xor x. This means that if we reverse the order of the original list, we simply reverse the order of the list of differences. Computing the xor is also quite fast.</li>
</ul>
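<p>The three options above can be sketched in Python as follows. This is an illustration rather than a reference implementation: the helper names (<code>mod_decode</code>, <code>fold</code>, <code>unfold</code>, <code>xor_decode</code>) are invented here, the values are assumed to fit in 8 bits, and the folding helper additionally assumes that the predicted value <code>p</code> is at most 127, as in the example where it is 25.</p>

```python
values = [107, 108, 110, 115, 120, 125, 132, 132, 131, 135]
pairs = list(zip(values, values[1:]))  # (previous, current) pairs

# Option 1: differences in the ring of integers modulo 256.
mod_deltas = [(cur - prev) % 256 for prev, cur in pairs]


def mod_decode(first, deltas):
    out = [first]
    for d in deltas:
        out.append((out[-1] + d) % 256)
    return out


# Option 2: fold a signed difference x around the predicted value p.
# Assumes 8-bit values and p no larger than 127 (p = 25 in the example).
def fold(x, p):
    if x in range(0, p + 1):
        return 2 * x       # small non-negative differences get even codes
    if x in range(-p, 0):
        return -2 * x - 1  # small negative differences get odd codes
    return x + p           # remaining (large positive) differences


def unfold(code, p):
    if code in range(2 * p + 1):
        return code // 2 if code % 2 == 0 else -(code + 1) // 2
    return code - p


# Option 3: xor of successive values instead of subtraction.
xor_deltas = [prev ^ cur for prev, cur in pairs]


def xor_decode(first, deltas):
    out = [first]
    for d in deltas:
        out.append(out[-1] ^ d)
    return out


print(mod_deltas)     # [1, 2, 5, 5, 5, 7, 0, 255, 4]
print(fold(-1, 25))   # 1
print(xor_deltas)     # [7, 2, 29, 11, 5, 249, 0, 7, 4]
```

<p>With a predicted value of 25, <code>fold</code> maps the 256 possible differences (-25 to 230) onto the codes 0 to 255 without collisions, which is why <code>unfold</code> can invert it. The two conditional tests in <code>fold</code> are the branching that the second option pays for.</p>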
<p>Once we have the list of differences as non-negative numbers, we can then try to store them by using as few bits as possible. Unfortunately, in our case, we come to the conclusion that we need 8 bits to store each difference. We remarked however that for all but one value, 3 bits per difference would suffice.</p>
<p>So a sensible solution is to code only the 3 least significant bits of each difference: <code>001, 010, 101, 101, 101, 111, 000, 111, 100.</code> We then add a pointer to the second-to-last difference to indicate that it is missing its 5 most significant bits (11111). The cost of coding this exception is about 13 bits, so the total storage cost would be 8+3+9*3+13=51 bits. In this case, delta coding (51 bits) is slightly more compact than frame-of-reference (8+3+9*5=56 bits), and both are preferable to the original 8-bit coding, which used 80 bits.</p>
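<p>Here is a rough Python sketch of this exception idea. It is not the actual NewPFD layout, which stores its exceptions differently. The names <code>split_exceptions</code> and <code>merge_exceptions</code> are hypothetical, and the 13-bit cost per exception is simply the estimate quoted above, taken as a constant.</p>

```python
def split_exceptions(codes, width):
    """Keep the `width` lowest bits of each code; record any overflow separately."""
    base = 2 ** width
    low = [c % base for c in codes]
    # (position, missing high bits) for every code that does not fit in `width` bits
    exceptions = [(i, c // base) for i, c in enumerate(codes) if c // base != 0]
    return low, exceptions


def merge_exceptions(low, exceptions, width):
    """Invert split_exceptions by putting the high bits back in place."""
    codes = list(low)
    for i, high in exceptions:
        codes[i] += high * 2 ** width
    return codes


codes = [1, 2, 5, 5, 5, 7, 0, 255, 4]  # differences modulo 256
low, exceptions = split_exceptions(codes, 3)
print([format(v, "03b") for v in low])
print(exceptions)  # [(7, 31)], and 31 is 11111 in binary

EXCEPTION_BITS = 13  # assumed cost per exception, the estimate from the text
total = 8 + 3 + len(low) * 3 + len(exceptions) * EXCEPTION_BITS
print(total)  # 51
```

<p>Choosing the width is a trade-off: a smaller width saves bits on every value but creates more exceptions, each of which carries a fixed overhead.</p>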
<p>There are many possible variations. For example, you can also use this exception technique with the frame-of-reference approach when almost all values, except for a few, fit in a small range.</p>
<p><strong>Further reading</strong>: the document <a href="http://www.hdfgroup.org/doc_resource/SZIP/">SZIP Compression in HDF Products</a> and the corresponding  <a href="http://public.ccsds.org/publications/archive/120x0g2.pdf">CCSDS 120.0-G-2</a> data compression standard describe the application of delta coding for scientific data. <a href="http://michael.dipperstein.com/">Michael Dipperstein&#8217;s page</a> provides a nice overview of generic compression techniques. The specific exception technique I described is from the NewPFD scheme first described in: </p>
<blockquote><p>
H. Yan, S. Ding, T. Suel, <a href="http://www2009.eprints.org/41/1/p401.pdf">Inverted index compression and query processing with optimized document ordering</a>, in: WWW ’09, 2009.
</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/02/08/effective-compression-using-frame-of-reference-and-delta-coding/</feedburner:origLink></item>
	</channel>
</rss>

