<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Daniel Lemire's blog</title>
	
	<link>http://lemire.me/blog</link>
	<description>Computer Scientist and Open Scholar: Databases, Information Retrieval, Business Intelligence.</description>
	<lastBuildDate>Sat, 04 Feb 2012 01:49:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/daniel-lemire/atom" /><feedburner:info uri="daniel-lemire/atom" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><geo:lat>45</geo:lat><geo:long>-73</geo:long><creativeCommons:license>http://creativecommons.org/licenses/by-nc-sa/2.0/</creativeCommons:license><feedburner:emailServiceId>daniel-lemire/atom</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><feedburner:feedFlare href="http://www.bloglines.com/sub/http://feeds.feedburner.com/daniel-lemire/atom" src="http://www.bloglines.com/images/sub_modern11.gif">Subscribe with Bloglines</feedburner:feedFlare><feedburner:feedFlare href="http://fusion.google.com/add?feedurl=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://buttons.googlesyndication.com/fusion/add.gif">Subscribe with Google</feedburner:feedFlare><feedburner:feedFlare href="http://www.plusmo.com/add?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://plusmo.com/res/graphics/fbplusmo.gif">Subscribe with Plusmo</feedburner:feedFlare><feedburner:feedFlare href="http://www.thefreedictionary.com/_/hp/AddRSS.aspx?http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://img.tfd.com/hp/addToTheFreeDictionary.gif">Subscribe with The Free Dictionary</feedburner:feedFlare><feedburner:feedFlare href="http://www.bitty.com/manual/?contenttype=rssfeed&amp;contentvalue=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.bitty.com/img/bittychicklet_91x17.gif">Subscribe with Bitty Browser</feedburner:feedFlare><feedburner:feedFlare href="http://www.newsalloy.com/?rss=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.newsalloy.com/subrss3.gif">Subscribe with NewsAlloy</feedburner:feedFlare><feedburner:feedFlare 
href="http://www.live.com/?add=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://tkfiles.storage.msn.com/x1piYkpqHC_35nIp1gLE68-wvzLZO8iXl_JMledmJQXP-XTBOLfmQv4zhj4MhcWEJh_GtoBIiAl1Mjh-ndp9k47If7hTaFno0mxW9_i3p_5qQw">Subscribe with Live.com</feedburner:feedFlare><feedburner:feedFlare href="http://mix.excite.eu/add?feedurl=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://image.excite.co.uk/mix/addtomix.gif">Subscribe with Excite MIX</feedburner:feedFlare><feedburner:feedFlare href="http://download.attensa.com/app/get_attensa.html?feedurl=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.attensa.com/blogs/attensa/WindowsLiveWriter/BadgeredintoBadges_10C02/attensa_feed_button5.gif">Subscribe with Attensa for Outlook</feedburner:feedFlare><feedburner:feedFlare href="http://www.webwag.com/wwgthis.php?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.webwag.com/images/wwgthis.gif">Subscribe with Webwag</feedburner:feedFlare><feedburner:feedFlare href="http://www.podcastready.com/oneclick_bookmark.php?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.podcastready.com/images/podcastready_button.gif">Subscribe with Podcast Ready</feedburner:feedFlare><feedburner:feedFlare href="http://www.flurry.com/pushRssFeed.do?r=fb&amp;url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.flurry.com/images/flurry_rss_logo2.gif">Subscribe with Flurry</feedburner:feedFlare><feedburner:feedFlare href="http://www.wikio.com/subscribe?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.wikio.com/shared/img/add2wikio.gif">Subscribe with Wikio</feedburner:feedFlare><feedburner:feedFlare href="http://www.dailyrotation.com/index.php?feed=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.dailyrotation.com/rss-dr2.gif">Subscribe with Daily Rotation</feedburner:feedFlare><item>
		<title>Two rules for teaching in the XXIst century</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/TMHdaTRnGF0/</link>
		<comments>http://lemire.me/blog/archives/2012/01/30/two-rules-for-teaching-in-the-xxist-century/#comments</comments>
		<pubDate>Mon, 30 Jan 2012 14:44:16 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3924</guid>
		<description>Education in the XXth century has been primarily industrial: organize the students in groups under the supervision of a teacher. We all have been in such systems for so long that we take it for granted. How else is anyone to learn? Maybe some can learn differently, but most can&amp;#8217;t because they are unmotivated and [...]</description>
			<content:encoded><![CDATA[<p>Education in the XX<sup>th</sup> century has been primarily industrial: organize the <del datetime="2012-01-27T13:58:46+00:00">workers</del>students in groups under the supervision of a <del datetime="2012-01-27T13:58:46+00:00">manager</del>teacher.</p>
<p>We all have been in such systems for so long that we take it for granted. How else is anyone to learn? Maybe some can learn differently, but most can&#8217;t because they are unmotivated and lazy, they lack the critical skills to differentiate right from wrong on their own and they can&#8217;t assess their own level of expertise. At least, that is what I&#8217;m told, but I think it is unfair.</p>
<p>To me, this is like saying that we have to keep long-time prisoners in jail because they do not know how to organize themselves when given their freedom.</p>
<p>Indeed, if students who went through years of schooling cannot learn on their own, if they cannot assess their own progress, and if they generally cannot organize themselves without supervision, we have to wonder whether schools bear part of the blame. And I think they do: we enroll students in supervised and regimented systems where they are constantly told what to do, constantly tested by others and where they have to follow rigid rules as to what they should learn. It is no surprise that many students cannot work on their own when they leave school. </p>
<p>There are a few broken individuals who never really became adults. They have to be kept in check all the time because they could not survive on their own. But if these constituted the essential part of the human race, we would have gone extinct a long time ago. Our ancestors, not long ago, had to survive in small bands hunting small animals and grabbing whatever they could eat. They had to be incredibly resilient because human beings spread throughout the globe like no other animal species.</p>
<p>To put it bluntly, most people lack autonomy, they can&#8217;t be entrepreneurs, precisely because we have carefully beaten it out of them. I have two young kids and they are crazy. One of them is building a castle out of paper in his room. The project is huge and complicated and he has worked on it for days, on his own, without anyone telling him what to do. He made mistakes (which he explained to me) and he had to fix them. How often do schools let students embark on self-directed projects? Almost never. </p>
<p>My sons are not exceptional. Like other kids their age, they behave in unconventional ways, trying crazy things on their own, having crazy thoughts on their own. Eventually, with enough schooling, they will settle down and do as they are told in a more reliable manner. They will become very good at following directions.</p>
<p>How good will they be at emulating someone like Steve Jobs, who repeatedly broke all rules? I fear for them that their sense of initiative and wonder will be killed by the time they finish their schooling. (Thankfully, I am a crazy dad with crazy ideas, so maybe I will mitigate the damage.)</p>
<p>Hence, as a teacher, I reject the industrial model as much as I can. I believe that, in an ideal world, we would not need any teaching at all. There is hardly anything you can&#8217;t learn through an apprenticeship. For example, if you just helped out Linus Torvalds for a couple of years, you could become an expert programmer. In fact, I suspect you would fare much better than if you just took programming classes.</p>
<p>The problem with apprenticeship is that it scales poorly. How much patience will Linus Torvalds have for kids who hardly know anything about computers? How many could he coach? Would he want to have kids over at his house while he is coding?</p>
<p>We still use the apprenticeship model in graduate school. But to accommodate most students, I still haven&#8217;t thought of a better model than setting up classes. But should the classes be organized like factories with the teacher acting as a middle-manager while students act as factory employees, executing tasks one after the other while we assess and time them? I think not. My teaching philosophy is simple: challenge the student, set him in motion, and provide a model. I try to be as far from the industrial model as I can, while remaining within the accepted boundaries of my job. I have two rules when it comes to teaching:</p>
<ul>
<li><strong>Focus on open-ended assignments and exams</strong>. Many professors are frustrated that students come in only for the grades. Probably because they focus on nice lectures and then hastily prepare some assignments. Turn this problem on its head! Focus on the assignments. If your students are not very autonomous &mdash; and they rarely are &mdash; give several long and challenging assignments (at least 4 or 5 a term). Do make sure however that they know where to get the information they need.  Provide solved problems to help the weaker students.
<p>However, keep the assignments open ended. We all like to grade multiple choice questions, but they are a pedagogical atrocity. In life, there is rarely one best answer: assignments should reflect that.  In some of my classes I use &#8220;programming challenges&#8221;: I make up some difficult problem and ask the students to find the best possible solution. Oftentimes, there is no single ideal solution, but multiple possibilities, all with different trade-offs. Quite often the students ask me to be more precise: I refuse. I tell my students to justify their answer. Over the years, I have been repeatedly impressed by the ingenuity of my students. Many of them are obviously smarter than I am.</p>
<p>What about lectures and lecture notes? They are secondary. In most fields, the content, the information, is already out there. It has been organized several times over by very smart people. Books have been written on most topics. There is a growing set of great talks available on YouTube, Google Video and elsewhere. Your students do not need you to rehash the same content they can find elsewhere, sometimes in better form. Stop lecturing already! Just link to what is out there and encourage your students to find more using a search engine. Only produce content when you really cannot find the equivalent elsewhere. Please link to material beyond the grasp of most of your students: they need to know the limit of their knowledge.</p>
<p>The famous software engineering guru <a href="http://www.sigcse.org/sigcse2012/attendees/keynotes.php">Fred Brooks</a> agrees with me:</p>
<blockquote><p>
The primary job of the teacher is to make learning happen; that is a design task. Most of us learned most of what we know by what we did, not by what we heard or read. A corollary is that the careful designing of exercises, assignments, projects, even quizzes, makes more difference than the construction of lectures. </p></blockquote>
<p>From my years as a student, I hardly remember the lectures. They were overwhelmingly boring. And I soon learned that even if a teacher was remarkably able and he could give me the impression that I understood everything&#8230; this impression was quickly falsified when I tried to work the material on my own.
</li>
<li><strong>Be an authentic role model</strong>.  Knowing that someone ordinary, like your professor, has become a master of the course material means that you, the very-smart-student, can do the same. That&#8217;s the power of emulation.
<p>When <a href="http://en.wikipedia.org/wiki/Sebastian_Thrun">Sebastian Thrun</a> gave his open AI class at Stanford, tens of thousands of students enrolled. Sure enough, the Stanford badge played a role in the popularity of the course, but ultimately, it is Thrun himself, as a role model, that matters. He has now <a href="http://blogs.reuters.com/felix-salmon/2012/01/23/udacity-and-the-future-of-online-universities/">left Stanford</a> to create his own independent organization (<a href="http://www.udacity.com/">Udacity</a>). Thrun must be confident about his success since he left his tenured position at Stanford, reportedly because he cannot stand the regular (industrial-style) teaching required at Stanford. One upcoming course is &#8220;programming a robotic car&#8221;. I have no idea how good the course will be, but it will be motivating for students to attend the class of the world&#8217;s top expert in the field of robotic cars. </p>
<p>The status of the teacher as an expert has always been important. However, the ability of people like Thrun to reach thousands of people every year through their teaching means that there is less of a market for teachers who aren&#8217;t impressive AI researchers.
</li>
</ul>
<p>Unfortunately, as long as I teach within a university, there are a few things I am stuck with:</p>
<ul>
<li>Deadlines: Some students are able to go through the material of a class in 4 weeks. Others would need 16 months. Alas, universities have settled on a fixed number of weeks that everyone must follow. If you complete the course faster, you&#8217;ll still have to wait till the end of the term to get credit. If you need more time, you will have to make special arrangements. Of course, schools follow the factory model: we can&#8217;t have workers come in and finish whenever they want. But outside an industrial setting, I think that deadlines are counterproductive. If I take a class in computing theory and end up proving that P is equal to NP, but I finish my paper a few weeks after the end of the course, I will still fail. Meanwhile, the good student who followed the rules but showed a total lack of initiative and original thinking will go home with a great grade. What do we reward and what do we punish? </li>
<li>Grades: Grades are a very serious matter in schools. <a href="http://activistteacher.blogspot.com/">Denis Rancourt</a>, a top-notch tenured physicist at the University of Ottawa, <a href="http://lemire.me/blog/archives/2009/01/13/must-a-professor-grade-his-students/">was fired after refusing to grade his students</a>. (He would give A+s to everyone.) Grades are effectively the quality control mechanism of schools, where students are the product. Somehow, we have totally integrated the idea that we could sum up an individual by a handful of letters. It sure makes managing people convenient! It all fits nicely in a spreadsheet. Of course, students have adapted by cheating. Schools have reacted by making cheating harder. But I cheated all the way through my undergraduate studies, getting almost perfect scores in all classes. How? I discovered a little trick: at the University of Toronto, all past years&#8217; exams were available at the library. If you took time to study them, you soon found out that, at least in the hard sciences, a given professor would always use the same set of 10 to 20 questions, year after year. So all you had to do was to go to the library, study the questions, prepare them, and voilà! An easy A.  But it is all rather pointless. In theory, grades are used by employers to select the best students, but serious employers don&#8217;t do this. We use grades to select the best candidates for graduate school, but I doubt there is a good correlation between grades as an undergraduate and research ability. I know two top-notch researchers who have admitted getting poor grades as undergraduates. For years, I have served on a government committee that awards post-doctoral fellowships: I am amazed at how poor undergraduate grades are at predicting how well someone might do during his Ph.D. 
Conversely, I have seen many graduate students who had nearly perfect scores throughout their undergraduate studies who are totally unable to show even just a bit of initiative. They do well as long as you always give them precise directions. </li>
</ul>
<p><strong>Credit</strong>: Thanks to <a href="http://www.cs.ubc.ca/~van/">Michiel van de Panne</a> for the reference to Brooks&#8217; quote.</p>
<p><strong>Further reading</strong>: <a href="http://matt-welsh.blogspot.com/2012/01/making-universities-obsolete.html">Making universities obsolete</a>  by Matt Welsh, an interesting fellow who <a href="http://lemire.me/blog/archives/2010/11/22/why-you-may-not-like-your-job-even-though-everyone-envies-you/">left his tenured position at Harvard</a> to go work in industry.</p>
<p><strong>Disclaimer</strong>: Many people are better and more sophisticated teachers than I am. And the industrial model does work remarkably well in some settings. Yet I think that the skills it fails to favor are increasingly important. We have to stop training people for factory jobs that are never coming back.</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/01/30/two-rules-for-teaching-in-the-xxist-century/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/01/30/two-rules-for-teaching-in-the-xxist-century/</feedburner:origLink></item>
		<item>
		<title>Citogenesis in science and the importance of real problems</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/QzOQW8Uc7ec/</link>
		<comments>http://lemire.me/blog/archives/2012/01/27/citogenesis-in-science-and-the-importance-of-real-problems/#comments</comments>
		<pubDate>Fri, 27 Jan 2012 21:05:44 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3931</guid>
		<description>Many papers in Computer Science tell the following story: There is a pre-existing problem P. There are a few relatively simple but effective solutions to problem P. Among them is solution X. We came up with a new solution X+ which is a clever variation on X. It looks good on paper. We ran some experiments [...]</description>
			<content:encoded><![CDATA[<p><a href="http://xkcd.com/978/"><img src="http://imgs.xkcd.com/comics/citogenesis.png" style="float:right;margin:5px;width:200px" /></a><br />
Many papers in Computer Science tell the following story:</p>
<ul>
<li>There is a pre-existing problem <em>P</em>.</li>
<li>There are a few relatively simple but effective solutions to problem <em>P</em>. Among them is solution <em>X</em>.</li>
<li>We came up with a new solution <em>X</em>+ which is a clever variation on <em>X</em>. It looks good on paper.</li>
<li>We ran some experiments and tweaked our results until <em>X</em>+ looked good. We found a clever way to avoid comparing <em>X</em>+  and <em>X</em> directly and fairly, as it might then become obvious that the gains are small, or even negative!  We would gladly report negative results, but then our paper could not be published.</li>
</ul>
<p>It is a very convenient story for reviewers: the story is simple and easy to assess superficially. The problem is that sometimes, especially if the authors are famous and the idea is compelling, the results will spread. People will adopt <em>X</em>+ and cite it in their work. And the more they cite it, the more enticing it is to use <em>X</em>+ as every citation becomes further validation for <em>X</em>+. And why bother with algorithm <em>X</em> given that it is older and <em>X</em>+ is the state-of-the-art?</p>
<p>Occasionally, someone might try both <em>X</em> and <em>X</em>+, and they may report results showing that the gains due to <em>X</em>+ are small, or negative. But they have no incentive to make a big deal of it because they are trying to propose yet another better algorithm (<em>X</em>++).</p>
<p>This process is called <a href="http://xkcd.com/978/">citogenesis</a>. It is what happens when the truth is determined solely by the literature, not by independent experiments. Everyone assumes, implicitly, that <em>X</em>+ is better than <em>X</em>. The beauty of it is that you do not even need anyone to have claimed so. You simply need to say that <em>X</em>+ is currently considered the best technique.</p>
<p>Some claim that <a href="http://lemire.me/blog/archives/2010/09/17/can-science-be-wrong-you-bet/">science is self-correcting</a>. People will stop using <em>X</em>+ or someone will try to make a name for himself by proving that <em>X</em>+ is no better and maybe worse than <em>X</em>. But in a business of science driven by publications, it is not clear why it should happen. Publishing that <em>X</em>+ is no better than <em>X</em> is an <a href="http://lemire.me/blog/archives/2008/10/28/when-in-doubts-prefer-unimpressive-negative-results/">unimpressive negative result</a> and those are rarely presented in prestigious venues. </p>
<p>John Regehr made a similar point about <a href="http://blog.regehr.org/archives/667">our inability to address mistakes</a> in the literature:</p>
<blockquote><p> in many cases an honest retrospective would need to be a bit brutal, for example to indicate which papers really just were not good ideas (of course some of these will have won “best paper” awards). In the old days, these retrospectives would have required a venue willing to publish them, (&#8230;), but today they could be uploaded to arXiv. I would totally read and cite these papers if they existed (&#8230;)</p></blockquote>
<p>But there is hope! If problem <em>P</em> is a real problem, for example, a problem that engineers are trying to solve, then you can get actual and reliable validation. Good software engineers do not trust research papers: they run experiments. Is this algorithm faster, really? They verify.</p>
<p>We can actually see this effect. Talk to any Computer Scientist and he will tell you of clever algorithms that have never been adopted by the industry. Most often, there is an implication that industry is backward and that it should pay more attention to academic results. However, I suspect that in a lot of cases, the engineers have voted against <em>X</em>+ and in favor of <em>X</em> after assessing them, fairly and directly. That is what you do when you are working on real problems and really need good results.</p>
<p><strong>Credit</strong>: This blog post was inspired by a <a href="https://plus.google.com/109547101047038671387/posts/Jkntr2ESycZ">comment made by Phil Jones on Google+</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/01/27/citogenesis-in-science-and-the-importance-of-real-problems/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/01/27/citogenesis-in-science-and-the-importance-of-real-problems/</feedburner:origLink></item>
		<item>
		<title>How to revise research papers after receiving harsh reviews</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/qFnPrh3r1sc/</link>
		<comments>http://lemire.me/blog/archives/2012/01/26/how-to-revise-a-research-papers-after-receiving-harsh-reviews/#comments</comments>
		<pubDate>Thu, 26 Jan 2012 15:33:10 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3917</guid>
		<description>Whether you submit your work to a scientific journal or just post it on a blog, you can expect to receive harsh criticism from time to time. Sometimes you are facing arrogant or ignorant readers. Other times, your work is genuinely flawed. My own work is frequently flawed, as you know if you read this blog. Over [...]</description>
			<content:encoded><![CDATA[<p>Whether you submit your work to a scientific journal or just post it on a blog, you can expect to receive harsh criticism from time to time. Sometimes you are facing arrogant or ignorant readers. Other times, your work is genuinely flawed. My own work is frequently flawed, as you know if you read this blog.</p>
<p>Over time, I have learned that even if the reviewer is wrong, taking the time to respond carefully can be tremendously useful. If you are 100% correct, then you get to build up your confidence and can later answer similar criticism hastily. Very often, however, you did not do everything perfectly. Maybe your arguments and data are correct, but you might have presented them better.</p>
<p>There are specific strategies to deal with harsh reviews:</p>
<ul>
<li>Expose yourself regularly to criticism from total strangers. In my experience, if you rarely publish, you are more likely to have difficulty dealing with criticism. I have been called an idiot, I have had to deal with overly aggressive people and I have been ridiculed on occasion. Of course, I occasionally get depressed after receiving harsh criticism, especially if I thought I had produced great work and feel unappreciated, but I am typically able to recover mentally in minutes or, at least, hours. Part of it is just habit: my brain has learned that harsh criticism does not necessarily signify upcoming pain. </li>
<li>It is critically important to distinguish yourself from your work. If someone repeatedly produces inferior work, his reputation will suffer. However, everyone (even Nobel prize winners) gets it wrong from time to time. It is important to keep in mind that most reviewers do not care that much about you. In fact, they often quickly forget about you while you ruminate over their review.</li>
<li>The best way to address criticism is to take it one comment at a time. If someone finds ten different flaws in your work, don&#8217;t look at it as one message: break it into ten components and address each one separately. This approach scales up linearly: it just takes ten times longer to address ten flaws than one. <a href="http://www.bmartin.cc/pubs/08jspsrr.html">Brian Martin describes it well</a>:<br />
<blockquote><p>
I&#8217;ve found a way to make the revision process easier. I don&#8217;t reread my text, because that just cements my previous approach. Instead, I go through the recommendations of the referees and the editor one by one, making changes. After I finish all those changes, large and small, I print out the whole article and read through it, fixing up expression and making it flow.</p>
<p>Tackling recommendations one by one is important psychologically. Looking at a list of criticisms, sometimes pages of them, can be demoralizing; the task seems too big. Focusing on a single point is easier. Once it&#8217;s done, you can check it off and proceed to the next point, either immediately or tomorrow.</p>
<p>Sometimes responding to a point requires additional work, such as obtaining and reading some new theory or doing some new calculation. It&#8217;s helpful to write down every step that&#8217;s required – for example, (1) order Smith&#8217;s book, (2) read the theory section, (3) write a one-paragraph summary – and tackle them one by one.
</p></blockquote>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/01/26/how-to-revise-a-research-papers-after-receiving-harsh-reviews/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/01/26/how-to-revise-a-research-papers-after-receiving-harsh-reviews/</feedburner:origLink></item>
		<item>
		<title>Open access journals in Computer Science</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/Yh-Du3dWdOc/</link>
		<comments>http://lemire.me/blog/archives/2012/01/25/open-access-journals-in-computer-science/#comments</comments>
		<pubDate>Wed, 25 Jan 2012 19:15:03 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3884</guid>
		<description>Open access journals make articles freely available. Some of them even allow the authors to keep the copyright of their work. It would seem that they offer a compelling alternative to traditional journals, especially if you hope to reach people outside academia. However, open access may allow you to get a free copy of [...]</description>
			<content:encoded><![CDATA[<p>Open access journals make articles freely available. Some of them even allow the authors to keep the copyright of their work. It would seem that they offer a compelling alternative to traditional journals, especially if you hope to reach people outside academia.<br />
However, open access may allow you to get a free copy of an article, but your rights might still be limited. For example, videos on YouTube are freely available, but you are not allowed to copy or reuse them freely.</p>
<p>The directory of open access journals gives a list of over <a href="http://www.doaj.org/doaj?func=subject&#038;cpid=114&#038;uiLanguage=en">300 open access journals in Computer Science</a>. Thus, finding an adequate open access journal where you can submit your work is relatively easy.</p>
<p>However, there are a few sore points.</p>
<p><strong>1. Indexing of open access Computer Science journals is generally weak<br />
</strong></p>
<p>A journal needs to be indexed so that your fellow researchers can find out about your work. Most open access journals will be indexed by Google Scholar, but other indexes are important in Computer Science such as <a href="http://www.informatik.uni-trier.de/~ley/db/">DBLP</a> and the <a href="http://dl.acm.org/">ACM Digital Library</a>. <a href="http://www.scopus.com/home.url">Scopus</a> is also often used by hiring and promotion committees. (Scopus is run by Elsevier.)</p>
<p>As I review the open access journals in Computer Science, I find that indexing is often a sore point. The next table shows that the ACM Digital Library does a poor job of indexing open access journals. In fact, I could find only two open access journals indexed by ACM. This cannot be explained by the prestige of the respective journals: some of the open access journals that ACM fails to index are just as good as, or better than, others it indexes. And, of course, no ACM publication is open access. Quite clearly, ACM is doing little to help open access.</p>
<table style="border-collapse:collapse; font-size:0.8em">
<tr style="border-top:3px solid #ccc;border-bottom:2px solid #ccc;">
<th></th>
<th>DBLP</th>
<th>Scopus</th>
<th>ACM </th>
</tr>
<tr>
<td>Chicago Journal of Theoretical Computer Science </td>
<td>  yes </td>
<td>  </td>
<td> </td>
</tr>
<tr>
<td>Discrete Mathematics and Theoretical Computer Science  </td>
<td> yes </td>
<td>yes </td>
<td> </td>
</tr>
<tr>
<td>Electronic Journal of Combinatorics  </td>
<td>  yes  </td>
<td>yes </td>
<td></td>
</tr>
<tr>
<td>IEEE Data Engineering Bulletin </td>
<td>  yes     </td>
<td>     </td>
<td>    </td>
</tr>
<tr>
<td>Journal of Artificial Intelligence Research</td>
<td> yes </td>
<td>yes</td>
<td>yes</td>
</tr>
<tr>
<td>Journal of Computational Geometry</td>
<td> yes </td>
<td></td>
<td></td>
</tr>
<tr>
<td>Journal of Computers              </td>
<td> yes </td>
<td>    </td>
<td> </td>
</tr>
<tr>
<td>Journal of Emerging Technologies in Web Intelligence </td>
<td></td>
<td> </td>
<td></td>
</tr>
<tr>
<td>Journal of Machine Learning Research  </td>
<td> yes   </td>
<td>yes</td>
<td>yes</td>
</tr>
<tr>
<td>Journal of Universal Computer Science  </td>
<td>yes</td>
<td>yes </td>
<td></td>
</tr>
<tr>
<td>Journal of Graph Algorithms and Applications  </td>
<td>yes </td>
<td>yes </td>
<td></td>
</tr>
<tr>
<td>Open Research Computation </td>
<td> </td>
<td>yes </td>
<td> </td>
</tr>
<tr style="border-bottom:3px solid #ccc;">
<td>Theory of Computing                </td>
<td>yes          </td>
<td>          </td>
<td></td>
</tr>
</table>
<p>Elsevier and Springer allow authors of papers in some regular journals to make them available under an open access format in exchange for a one-time fee. Their journals are typically well indexed so they may offer good alternatives. </p>
<p><strong>2. Many open access Computer Science journals require complete copyright transfer<br />
</strong></p>
<p>A journal does not need complete copyright ownership to publish an article. The only valid justification for requiring that authors give away their copyright is to restrict access. When reviewing open access journals in Computer Science, I see that several of them inexplicably require complete copyright transfer:</p>
<table style="border-collapse:collapse; font-size:0.8em">
<tr style="border-top:3px solid #ccc;border-bottom:2px solid #ccc;">
<th></th>
<th>author keeps copyright</th>
<th>publication fee</th>
</tr>
<tr>
<td>Chicago Journal of Theoretical Computer Science </td>
<td>  yes </td>
<td> </td>
</tr>
<tr>
<td>Discrete Mathematics and Theoretical Computer Science  </td>
<td> no </td>
<td> </td>
</tr>
<tr>
<td>Electronic Journal of Combinatorics  </td>
<td>  yes  </td>
<td></td>
</tr>
<tr>
<td>IEEE Data Engineering Bulletin </td>
<td> no     </td>
<td>     </td>
</tr>
<tr>
<td>Journal of Artificial Intelligence Research</td>
<td> no </td>
<td>none</td>
</tr>
<tr>
<td>Journal of Computational Geometry</td>
<td> yes </td>
<td></td>
</tr>
<tr>
<td>Journal of Computers              </td>
<td> no </td>
<td>   €360 </td>
</tr>
<tr>
<td>Journal of Emerging Technologies in Web Intelligence </td>
<td>no</td>
<td></td>
</tr>
<tr>
<td>Journal of Machine Learning Research  </td>
<td> yes   </td>
<td></td>
</tr>
<tr>
<td>Journal of Universal Computer Science  </td>
<td>no</td>
<td> </td>
</tr>
<tr>
<td>Journal of Graph Algorithms and Applications  </td>
<td>no </td>
<td> </td>
</tr>
<tr>
<td>Open Research Computation </td>
<td>yes </td>
<td> €1195</td>
</tr>
<tr style="border-bottom:3px solid #ccc;">
<td>Theory of Computing                </td>
<td>yes          </td>
<td>          </td>
</tr>
</table>
<p><strong>Conclusion</strong> There is still much room for progress.</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/01/25/open-access-journals-in-computer-science/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/01/25/open-access-journals-in-computer-science/</feedburner:origLink></item>
		<item>
		<title>Should you boycott academic publishers?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/l8-fdo6-FiA/</link>
		<comments>http://lemire.me/blog/archives/2012/01/23/boycott-elsevier/#comments</comments>
		<pubDate>Mon, 23 Jan 2012 21:22:01 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3853</guid>
		<description>There is a growing list of famous scientists who have pledged to boycott Elsevier as a publisher. If I were in charge of Elsevier, I would be very nervous: academic publishers need famous authors more than the famous authors need the publishers. After all, famous scientists could simply post their work online, and people would [...]</description>
			<content:encoded><![CDATA[<p>There is a growing list of <a href="http://thecostofknowledge.com/">famous scientists who have pledged to boycott Elsevier as a publisher</a>. If I were in charge of Elsevier, I would be very nervous: academic publishers need famous authors more than the famous authors need the publishers. After all, famous scientists could simply post their work online, and people would still read it.</p>
<p>Elsevier has committed too many sins to give an exhaustive list: they <a href="http://classic.the-scientist.com/blog/display/55679/">have created fake academic journals</a> so that pharmaceutical corporations could claim that certain <em>facts</em> appeared in a journal, they <a href="http://cstheory.blogoverflow.com/2012/01/boycott-elsevier-for-supporting-sopa/">have sponsored evil regulations</a>, and they <a href="http://blogs.ch.cam.ac.uk/pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/">have restrictive views on what constitutes fair use</a>. Unbelievably, they were also involved in the <a href="http://www.idiolect.org.uk/elsevier/">arms trade</a>. They probably have the devil on their board of directors.</p>
<p>The boycott is currently led by a famous mathematician, <a href="http://gowers.wordpress.com/2012/01/21/elsevier-my-part-in-its-downfall/">Timothy Gowers</a>. Gowers accuses Elsevier of charging exorbitant prices for its journals.</p>
<p>Focusing solely on database-related journals, I decided to look at how much journals charge per article.</p>
<table style="border-top:3px solid #ccc;border-bottom:3px solid #ccc;">
<tr>
<th>journal</th>
<th>publisher</th>
<th><a href="http://www.journalprices.com/">price per article</a></th>
</tr>
<tr>
<td>Distributed and parallel databases</td>
<td>Springer</td>
<td>61.50</td>
</tr>
<tr>
<td>Information systems journal</td>
<td>Wiley</td>
<td>58.16</td>
</tr>
<tr>
<td>Information Systems</td>
<td>Elsevier</td>
<td>53.44</td>
</tr>
<tr>
<td>Knowledge and information systems</td>
<td>Springer</td>
<td>25.39</td>
</tr>
<tr>
<td>Data &amp; knowledge engineering</td>
<td>Elsevier</td>
<td>24.55</td>
</tr>
<tr>
<td>VLDB journal</td>
<td>Springer</td>
<td>22.19</td>
</tr>
<tr>
<td>Information Sciences</td>
<td>Elsevier</td>
<td>21.67</td>
</tr>
<tr>
<td>IEEE Trans. knowledge &amp; data engineering</td>
<td>IEEE</td>
<td>10.80</td>
</tr>
<tr>
<td>ACM Trans. on database systems</td>
<td>ACM</td>
<td>6.64</td>
</tr>
<tr>
<td>SIGMOD Record</td>
<td>ACM</td>
<td>0.00</td>
</tr>
</table>
<p><strong>Observations</strong>: </p>
<ul>
<li>The price distribution appears almost random. I can see no relation between prestige or paper length and prices.</li>
<li>Elsevier is hardly alone in charging high prices for papers. Wiley and Springer are just as expensive. Of course, it is possible that Elsevier ends up charging more through deals and bundling.</li>
<li>ACM is very inexpensive on a per-article basis. However, ACM often asks the authors to pay page charges whereas Elsevier rarely does in my experience.</li>
<li>Though SIGMOD Record is limited to short contributions, its price is unbeatable, and it has no page charges. Moreover, it is generally a well-regarded publication venue among database researchers.</li>
</ul>
<p><strong>My take</strong>: The evidence is strong that high-quality inexpensive journals are possible. Current journals are up to an order of magnitude too expensive. However, Elsevier is selling what we want to buy: prestigious journals that people outside the best schools cannot afford. Just like middle-income Americans get into debt to keep up with the top 1%, colleges increase their library budgets to keep up with Stanford and Harvard.</p>
<p>The solution to overpriced journals is to reduce library purchasing power. Most colleges do not have the infinite budgets Harvard and Stanford have, and they should not act like they do. In fact, if we could reduce the purchasing power of most libraries to zero, then researchers and students would be forced to pay $20 or more per article. You can be quite certain that they would mostly read the cheaper (and more competitive) journals. And Stanford researchers want to be cited by the researchers from the lesser institutions so they would also migrate away from overpriced journals. Reduced budgets would still allow publishers like Elsevier to make generous profits, but they would only profit by offering great services at an affordable price. </p>
<p><strong>Disclaimer</strong>: I am currently reviewing a paper for Pattern Recognition (an Elsevier journal), and I <a href="http://arxiv.org/abs/1008.1715">recently published in Discrete Applied Mathematics</a> (another Elsevier journal).  </p>
<p><strong>Update</strong>: Though you can get articles from SIGMOD Record for free if you go to the SIGMOD Record home page, ACM sells them through its Digital Library for over $10 apiece.</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/01/23/boycott-elsevier/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/01/23/boycott-elsevier/</feedburner:origLink></item>
		<item>
		<title>Use random hashing if you care about security?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/dLvntU-ploE/</link>
		<comments>http://lemire.me/blog/archives/2012/01/17/use-random-hashing-if-you-care-about-security/#comments</comments>
		<pubDate>Tue, 17 Jan 2012 19:40:20 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3830</guid>
		<description>Hashing is a programming technique that maps objects (such as strings) to integers. It is a necessary component of hash tables, one of the most frequently used data structures in Computer Science. Typically, hash tables have the property that looking up or storing a value associated with a key requires constant time. If you use [...]</description>
			<content:encoded><![CDATA[<p>Hashing is a programming technique that maps objects (such as strings) to integers. It is a necessary component of <a href="http://en.wikipedia.org/wiki/Hash_table">hash tables</a>, one of the most frequently used data structures in Computer Science. </p>
<p>Typically, hash tables have the property that looking up or storing a value associated with a key requires constant time. If you use user identifiers to retrieve names and phone numbers, you can scale up to millions and millions of users without performance penalty. However, the worst-case complexity of a hash table is linear: it may need to go through most values each time you want to look up a key. Thankfully, the worst case is typically improbable: it only happens when too many objects hash to the same value. In practice, hash functions are chosen so as to spread hash values uniformly (pseudo-randomly). </p>
<p>Most programming languages, such as Java or C++, use deterministic hash functions. This means that a given string will always hash to the same integer, for all Java software in the whole world. And overall, deterministic hashing works quite well. Unfortunately, deterministic hashing is insecure. If you are building a web application, and hackers know which hash function you are using, they can create a <a href="http://en.wikipedia.org/wiki/Denial-of-service_attack">denial-of-service attack</a> and bring down your application. The gist of it is not complicated: it suffices to ensure that the hash tables fall back on their worst-case performance.</p>
<p>This is very serious: it means that if you rely on the default hash functions of your programming language (e.g., String.hashCode in Java), your application could be at risk. On this issue, Alexander Klink and Julian Wälde issued a well-written <a href="http://www.nruns.com/_downloads/advisory28122011.pdf">security advisory</a>. </p>
<p>The fix is relatively simple: programming languages need to adopt random hashing. In random hashing, every time the software is initialized, a new hash function is picked, at random. This does not make attacks impossible, but it makes them much more difficult.</p>
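<p>To make the idea concrete, here is a minimal sketch of random hashing for strings. It assumes a hypothetical class, <code>SeededStringHash</code>, which is not part of any standard library: the multiplier and the initial value of an iterated hash are drawn at random when the object is created, so an attacker cannot precompute colliding keys against a fixed function. This simple family only illustrates the principle; it is not claimed to resist a determined attacker.</p>

```java
import java.security.SecureRandom;

// Hypothetical helper for illustration; not part of the Java standard library.
final class SeededStringHash {
    private final int multiplier; // forced to be odd
    private final int seed;       // initial hash value, instead of a fixed 0

    // Picks a fresh hash function at random; meant to be called once at start-up.
    SeededStringHash() {
        SecureRandom rng = new SecureRandom();
        this.multiplier = rng.nextInt() | 1;
        this.seed = rng.nextInt();
    }

    // Fixed parameters, useful for tests or for reproducing a bug.
    SeededStringHash(int multiplier, int seed) {
        this.multiplier = multiplier | 1;
        this.seed = seed;
    }

    // Iterated hash: each step combines the previous value with the next character.
    int hash(String s) {
        int h = seed;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i); // int arithmetic wraps modulo 2^32
        }
        return h;
    }
}
```

<p>The second constructor is a design choice that addresses the determinism concern: a run can be replayed by logging the two parameters and passing them back in.</p>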
<p>The problem is not novel. In 2003, Crosby and Wallach raised the issue and many responsible vendors fixed their products. Alas, the only programming languages to adopt random hashing were Ruby and <a href="http://perldoc.perl.org/perlsec.html#Algorithmic-Complexity-Attacks">Perl</a>. Others are more reluctant.</p>
<p>So, how easy is it to hack the hash functions in, say, Java? Java uses an iterated hash function. At each iteration, iterated hash functions compute a new hash value from the preceding hash value and the next character. <a href="http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#hashCode%28%29">Strings in Java</a> are hashed using the  function<br />
<code>F(y,c) = 31 y + c.</code><br />
where y is the previous hash value and c is the current character value. Thus, the hash value of a string made of the characters 65, 66 (corresponding to &#8220;AB&#8221; in ASCII)  is 31 times 65 + 66 which is 2081. </p>
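<p>The iteration can be written out in a few lines. This is a sketch with an illustrative class name (<code>IteratedHash</code>); it starts from 0 and applies F once per character, which is the formula documented for <code>String.hashCode</code>.</p>

```java
// Illustrative class name; the loop follows the documented String.hashCode formula.
final class IteratedHash {
    // One iteration: combine the previous hash value y with the next character c.
    static int step(int y, char c) {
        return 31 * y + c;
    }

    // Hash a whole string, starting from 0; overflow wraps modulo 2^32.
    static int hash(String s) {
        int y = 0;
        for (int i = 0; i < s.length(); i++) {
            y = step(y, s.charAt(i));
        }
        return y;
    }

    public static void main(String[] args) {
        System.out.println(hash("AB"));      // 31 * 65 + 66 = 2081
        System.out.println("AB".hashCode()); // 2081 as well
    }
}
```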
<p>Why does Java use the number 31? The choice is somewhat arbitrary (and 31 might fail to be ideal) but because it is an odd number, the compression function F is <a href="http://arxiv.org/abs/1008.1715"><em>permuting</em>, which helps distribute the hash values more uniformly</a>.</p>
<p>It is fairly hard to construct reasonable strings that collide over 32 bits in Java. However, a modest hash table will use only the first few bits of the hash values. Let us consider only the first 16 bits. It is not difficult to check that the strings  &#8220;Ace&#8221;,&#8221;BDe&#8221;,&#8221;AdF&#8221; and &#8220;BEF&#8221; all have the same hash value in Java.</p>
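<p>The claim is easy to check by hand or in code. Raising a character by one adds 31 times more to the hash than raising the following character by one, so moving from <code>A</code> to <code>B</code> (+1) is cancelled by moving from <code>c</code> to <code>D</code> (-31), and moving from <code>c</code> to <code>d</code> (+1) is cancelled by moving from <code>e</code> to <code>F</code> (-31). For example, 65 × 961 + 99 × 31 + 101 = 65635 and 66 × 961 + 68 × 31 + 101 = 65635. As it happens, these four strings agree on the whole 32-bit value, not just on 16 bits. The sketch below uses an illustrative class name, <code>CollisionCheck</code>.</p>

```java
// Illustrative class name; only String.hashCode from the standard library is used.
final class CollisionCheck {
    static final String[] SEEDS = {"Ace", "BDe", "AdF", "BEF"};

    // True when every string in the array has the same hashCode as the first one.
    static boolean allCollide(String[] strings) {
        for (String s : strings) {
            if (s.hashCode() != strings[0].hashCode()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String s : SEEDS) {
            System.out.println(s + " -> " + s.hashCode()); // each line ends in 65635
        }
        System.out.println(allCollide(SEEDS)); // true
    }
}
```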
<p>Of course, having 4 strings colliding will not disrupt hash tables. But because the hash function is iterated, we can multiply the number of collisions. Indeed, any two same-length sequences of these four colliding strings will also collide. This means that you can construct 16 strings of length 6 all colliding (&#8220;AceAce&#8221;,&#8221;AceBDe&#8221;,&#8221;AceAdF&#8221;, &#8220;AceBEF&#8221;, &#8220;BDeAce&#8221;,&#8221;BDeBDe&#8221;,&#8221;BDeAdF&#8221;, &#8220;BDeBEF&#8221;, &#8220;AdFAce&#8221;,&#8221;AdFBDe&#8221;,&#8221;AdFAdF&#8221;, &#8220;AdFBEF&#8221;, &#8220;BEFAce&#8221;,&#8221;BEFBDe&#8221;,&#8221;BEFAdF&#8221;, and &#8220;BEFBEF&#8221;). You can keep going to 64 strings of length 9. And so on.</p>
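<p>The multiplication of collisions can be sketched as follows, using a hypothetical <code>CollisionGenerator</code> class. Appending a three-character block to a prefix with hash value y yields 31<sup>3</sup> y plus the hash of the block, so equal prefix hashes and equal block hashes give equal results. Each extra block multiplies the number of colliding strings by four; 7 and 8 blocks give 16384 and 65536 strings.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative class name; builds colliding keys from the four blocks in the text.
final class CollisionGenerator {
    static final String[] SEEDS = {"Ace", "BDe", "AdF", "BEF"};

    // All concatenations of `blocks` seed strings: 4^blocks strings of length 3 * blocks.
    static List<String> generate(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int b = 0; b < blocks; b++) {
            List<String> next = new ArrayList<>(result.size() * SEEDS.length);
            for (String prefix : result) {
                for (String seed : SEEDS) {
                    next.add(prefix + seed);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> strings = generate(2);
        System.out.println(strings.size()); // 16
        System.out.println(strings.get(0)); // AceAce
        System.out.println(generate(3).size()); // 64
    }
}
```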
<p>How badly does this impact the performance of a hash table? I tried inserting all the colliding strings in a Java Hashtable container. For comparison, I also inserted randomly chosen strings into either a Hashtable or a TreeMap (tree structure). The net result is that what should be a tiny cost (0.006 s) becomes a massive cost (30 s). A server able to process thousands of queries per second might quickly become bogged down trying to process a couple of queries per second.</p>
<table>
<tr style="background:#ccc">
<th>number of strings</th>
<th>hash table: average time (s)</th>
<th>hash table: worst time (s) </th>
<th>tree: average time (s)</th>
</tr>
<tr style="background:#cdd">
<td>16384</td>
<td>0.002</td>
<td>1.1</td>
<td>0.005</td>
</tr>
<tr style="background:#dee">
<td>65536</td>
<td>0.006</td>
<td>30</td>
<td>0.03</td>
</tr>
</table>
<p>For these tests, I am using a MacBook Air with a 1.8 GHz Intel Core i7 running Java 6. My code is <a href="http://pastebin.com/bznPrDTz">available</a>. </p>
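<p>For readers who want to try a similar experiment, here is an independent sketch; it is not the code linked above, and the class and method names (<code>InsertTiming</code>, <code>collidingKeys</code>, <code>timeInserts</code>) are made up for illustration. It inserts colliding keys into a <code>Hashtable</code> and a <code>TreeMap</code> and reports wall-clock time. Timings depend on the machine and the Java version, so no particular numbers should be expected.</p>

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative benchmark sketch; a single run, with no warm-up or averaging.
final class InsertTiming {
    static final String[] SEEDS = {"Ace", "BDe", "AdF", "BEF"};

    // 4^blocks distinct strings that all share one hashCode.
    static List<String> collidingKeys(int blocks) {
        int count = 1 << (2 * blocks); // 4^blocks
        List<String> keys = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder(3 * blocks);
            // Read i as a base-4 number; each digit selects one block.
            for (int b = 0, digits = i; b < blocks; b++, digits >>= 2) {
                sb.append(SEEDS[digits & 3]);
            }
            keys.add(sb.toString());
        }
        return keys;
    }

    // Inserts every key and returns the elapsed wall-clock time in seconds.
    static double timeInserts(Map<String, Integer> map, List<String> keys) {
        long start = System.nanoTime();
        for (String key : keys) {
            map.put(key, 1);
        }
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        List<String> keys = collidingKeys(7); // 16384 keys
        System.out.println("Hashtable: " + timeInserts(new Hashtable<String, Integer>(), keys) + " s");
        System.out.println("TreeMap:   " + timeInserts(new TreeMap<String, Integer>(), keys) + " s");
    }
}
```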
<p>Why aren&#8217;t programming languages adopting random hashing? A potential issue is that language designers like determinism. They much prefer reproducible bugs. Nevertheless, any expert programmer should be aware of this problem.</p>
<p><strong>Further reading</strong>: </p>
<ul>
<li>Scott A. Crosby and Dan S. Wallach, <a href="http://www.usenix.org/event/sec03/tech/full_papers/crosby/crosby_html/">Denial of Service via Algorithmic Complexity Attacks</a>, Usenix Security&#8217;03.</li>
<li>Daniel Lemire, <a href="http://arxiv.org/abs/1008.1715">The universality of iterated hashing over variable-length strings</a>, to appear in Discrete Applied Mathematics.
</li>
</ul>
<p><strong>Update</strong>: I initially reported that Ruby was the only language to adopt random hashing. In fact, Perl adopted random hashing with version 5.8.1. In version 5.8.2, Perl adopted <a href="http://perldoc.perl.org/perlsec.html#Algorithmic-Complexity-Attacks">a hybrid that switches between a deterministic and a random hash function when needed</a>. (Thanks to Mike Giroux for the pointer.)</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/01/17/use-random-hashing-if-you-care-about-security/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/01/17/use-random-hashing-if-you-care-about-security/</feedburner:origLink></item>
		<item>
		<title>Open science: why is it so hard?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/KK7jvYEc2J4/</link>
		<comments>http://lemire.me/blog/archives/2012/01/10/open-science-is-hard/#comments</comments>
		<pubDate>Wed, 11 Jan 2012 00:20:37 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3796</guid>
		<description>Open access is the idea that scholarship should be accessible to all. Many believe that we should require publicly funded researchers to make their work available to the public. That is, if some professor discovers a new algorithm or a new remedy while on a government grant, you should be able to download and read [...]</description>
			<content:encoded><![CDATA[<div style="float:right;margin:5px"><a href="http://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/0691148902?tag=daniellemires-20/ref=sr_1_2?ie=UTF8&#038;qid=1326235571&#038;sr=8-2&tag=daniellemires-20" rel="nofollow"><img src="http://press.princeton.edu/images/j9517.gif"/></a></div>
<p><a href="http://en.wikipedia.org/wiki/Open_access">Open access</a> is the idea that scholarship should be accessible to all. Many believe that we should require publicly funded researchers to make their work available to the public. That is, if some professor discovers a new algorithm or a new remedy while on a government grant, you should be able to download and read the paper freely.  </p>
<p>To non-scientists, open access may sound like a socialist utopia. Why would anyone give away carefully curated content for free? The problem is that the content is overwhelmingly produced by scientists who have no share of the profit made by the publishers. These scientists are often funded directly or indirectly by the government. Journal editors are typically not paid. Reviewers are almost never paid. In fact, the opposite is true. Over the years, I have given thousands of dollars in page charges or conference registration to publishers. For example, several ACM journals request <a href="http://tosem.acm.org/Authors.html#PageCharges">$60 per page</a> from the authors (so that publishing 30 pages costs $1800). That is right: as a scientist, you are often asked to pay to get your work published. Thankfully, most of these fees are paid by research grants, which often come from the government. </p>
<p>Open access is a problem for publishers, however. In the current system, the publisher has a monopoly on the journals it has published over the years. This means that as long as researchers need access to these journals, the publisher can charge millions for access. Open access kills this monopoly. Certainly publishers can increase their profits by increasing the page charges, but authors can also take their papers to other, more reasonable journals.</p>
<p>Nevertheless, I could never get excited about open access. I find it annoying that I cannot download papers freely, but astrophysicists have already solved this problem without any government intervention or lawsuit. Indeed, nearly all recent astrophysics papers are on <a href="http://arxiv.org/">arXiv</a>. What matters is the culture: physicists care about being read, they love the web. In this sense, open access is a <a href="http://lemire.me/blog/archives/2009/10/21/open-access-is-the-short-sighted-fight/">short-sighted fight</a>. </p>
<p>Thus, a much more significant vision is Nielsen&#8217;s <a href="http://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/0691148902?tag=daniellemires-20" rel="nofollow">open science</a>. Michael Nielsen is arguing for a culture shift in science: from a science obsessed with individual performance (and publications) to a science culture more closely resembling that of open source software or Wikipedia. </p>
<p>I fear, however, that despite all the (well deserved) press that Michael Nielsen&#8217;s <a href="http://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/0691148902?tag=daniellemires-20" rel="nofollow">latest book</a> has been getting, too few people understand the importance of this shift. It is not about becoming hippies. It is not a socialist utopia. On the contrary, the system we have right now is akin to a highly regulated industry. All power is in the hands of the government and a few large organizations (universities, publishers) working in tandem. The barrier to entry is maintained artificially high. Open science is really about creating &#8220;open markets&#8221; with freer exchanges. It has the potential to boost our collective productivity by orders of magnitude through the removal of unneeded friction.</p>
<p>Meanwhile, American corporations are concerned with copyright violations on the Internet. Thus, they are pushing a bill, the <a href="http://en.wikipedia.org/wiki/Stop_Online_Piracy_Act">Stop Online Piracy Act (SOPA)</a>, which would allow the government to shut down any web site that is suspected of violating copyright. Using SOPA, a publisher could have a repository of research papers shut down. While at it, the publishers are also promoting a bill, the <a href="http://www.publishers.org/press/56/">Research Works Act</a>, which would make it illegal for government agencies to require open access from publicly funded researchers. </p>
<p>If you are one of the thousands of members of the  <a href="http://www.acm.org/">Association for Computing Machinery (ACM)</a> or the <a href="http://ieee.org/">Institute of Electrical and Electronics Engineers (IEEE)</a>, then you are indirectly supporting this new legislation. Indeed, the ACM and IEEE are members of the <a href="http://www.publishers.org/">Association of American Publishers (AAP)</a>. The AAP is a lobbyist for both proposals.</p>
<p>And we finally get a hint at why it is so hard to open up science: the business of science has become intertwined with businesses like the publishing business. ACM has to speak both as an association of computing professionals, and as a publishing house.</p>
<p>What should be a critical support service, the publication of results, ends up driving much of our culture. The journals become the science. <a href="http://en.wikipedia.org/wiki/The_medium_is_the_message">The medium becomes the message</a>.</p>
<p> In effect, we have too much <a href="http://lemire.me/blog/archives/2011/10/10/why-arent-we-getting-richer-the-scarring-tissue-theory/">organizational scarring tissue</a> in science. It could be that we need to reboot the system. As a starting point, we should collectively recognize the problem. Repeat after me: scholarship is not a publishing business.</p>
<p><strong>Further reading</strong>: </p>
<ul>
<li><a href="http://lemire.me/blog/archives/2009/06/17/is-open-access-publishing-the-solution-really/">Is Open Access publishing the solution? Really?<br />
</a>
</li>
<li><a href="http://blog.acm.org/president/?p=67">ACM&#8217;s role in public policy<br />
</a>
</li>
<li><a href="http://cameronneylon.net/blog/update-on-publishers-and-sopa-time-for-scholarly-publishers-to-disavow-the-aap/">Time for scholarly publishers to disavow the AAP</a>
</li>
</ul>
<p><strong>Update</strong>: </p>
<p> The ACM charges the authors of any conference for the publication of proceedings. However, they do not require payment for publishing in their journals: instead they request page charges. </p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/01/10/open-science-is-hard/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/01/10/open-science-is-hard/</feedburner:origLink></item>
		<item>
		<title>Do we need patents?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/Feq4GGu4akE/</link>
		<comments>http://lemire.me/blog/archives/2012/01/06/do-we-need-patents/#comments</comments>
		<pubDate>Fri, 06 Jan 2012 20:12:03 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3787</guid>
		<description>Whenever I suggest that patents are harmful, people point to the pharmaceutical industry. The pharmaceutical industry is heavily regulated. Marketing a new drug is a lengthy and expensive process. Moreover, drugs are subject to strict patent laws. The rationalization for patents usually goes like so: developing new drugs is extremely expensive, only large well-funded corporations [...]</description>
			<content:encoded><![CDATA[<p>Whenever I suggest that patents are harmful, people point to the pharmaceutical industry. The pharmaceutical industry is heavily regulated. Marketing a new drug is a lengthy and expensive process. Moreover, drugs are subject to strict patent laws. </p>
<p>The rationalization for patents usually goes like so: developing new drugs is extremely expensive, only large well-funded corporations can do it, and they won&#8217;t innovate unless we grant them a generous monopoly on their discovery.</p>
<p>Granting monopolies, even temporary ones, is expensive. We need to be sure that the gains outweigh the costs. In this case, the rationalization offered by the industry does not stand up to scrutiny:</p>
<ul>
<li>The U.S. and the U.K. have always had strong patent laws protecting chemicals and drugs. Meanwhile, continental Europe had much weaker patent protection. Until fairly recently, you could not patent a drug or a chemical in Germany (until 1967), Switzerland (until 1977) or Italy (until 1978). Where did the pharmaceutical industry thrive before the 1960s? In Germany, Switzerland and Italy. Though Italy was the fifth-largest producer of drugs in the 1970s, its industry is now practically disappearing. </li>
<li>
Without drug patents, India became the fourth largest pharmaceutical producer. It was forced to introduce drug patents in 2005. Haley and Haley (2011) analyzed the effect of the introduction of patents in India:
<blockquote><p>
(&#8230;) the rate of growth in innovation, as measured by investments in R&#038;D, has fallen in India&#8217;s product-patent regime. (&#8230;) This study indicates that product-patent regimes do not necessarily generate greater rates of innovation than process-patent regimes, and may reduce innovation.
</p></blockquote>
</li>
<li>Because companies cannot compete over the efficient production of drugs, they have a strong incentive to produce new drugs. People assume that this means more innovation. However, there is a much cheaper and safer path for drug companies: produce a variation on an existing drug. In this manner, you can go after the competition (by producing variations on their drugs) or artificially extend the life of your own patents (by producing endless variations on your own drugs). And indeed, we can verify that most drug patents concern <em>me-too</em> drugs. In effect, it is more profitable for drug companies to consider themselves as patent portfolios rather than R&#038;D firms.</li>
</ul>
<p>You might argue: aren&#8217;t you happy that the pharmaceutical industry has almost cured a disease like AIDS? I would be, except that the pharmaceutical companies did no such thing. It took an <a href="http://en.wikipedia.org/wiki/David_Ho_%28scientist%29">academic researcher</a> to figure out that a cocktail of drugs was the best option to treat AIDS. In fact, the majority of the funding in biomedical research comes from the government and private non-pharmaceutical funds. I have to ask: if the private sector with generous patent protections is so good at innovating, why do governments and private foundations bother to fund medical research?</p>
<p>We don&#8217;t have worldwide pharmaceutical patents because they benefit most of us. We have them because they benefit the richest 0.1%  at the expense of all others. </p>
<p>Further reading:</p>
<ul>
<li>
Michele Boldrin and David K. Levine, <a href="http://www.amazon.com/Against-Intellectual-Monopoly-Michele-Boldrin/dp/0521127262/ref=sr_1_1?ie=UTF8&#038;qid=1325875662&#038;sr=8-1&tag=daniellemires-20" rel="nofollow">Against Intellectual Monopoly</a>, Cambridge University Press, 2010.
</li>
<li>
George T. Haley, Usha C.V. Haley, <a href="http://dx.doi.org.proxy.hil.unb.ca/10.1016/j.techfore.2011.05.012">The effects of patent-law changes on innovation: The case of India&#8217;s pharmaceutical industry</a>, Technological Forecasting and Social Change, 2011.
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/01/06/do-we-need-patents/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/01/06/do-we-need-patents/</feedburner:origLink></item>
		<item>
		<title>Are you a gold prospector, or a construction worker?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/fSZ6JjuXDAU/</link>
		<comments>http://lemire.me/blog/archives/2012/01/03/are-you-a-gold-prospector-or-a-construction-worker/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 21:54:13 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3774</guid>
		<description>Most work is akin to construction jobs: you work until the house is built. You just have to keep the servers running day after day. You keep writing code day after day. You teach another class. The only risk you take is that there may not be work for you next year. The work is [...]</description>
			<content:encoded><![CDATA[<p>Most work is akin to <em>construction jobs</em>: you work until the house is built. You just have to keep the servers running day after day. You keep writing code day after day. You teach another class. The only risk you take is that there may not be work for you next year. The work is not, in itself, risky. Showing up day after day, and doing your best, is often enough.</p>
<p>Other work is more like gold prospecting. Almost all freely available open source is never used. Almost all research papers are useless. Most books find very few readers. Most blog posts are written in vain. </p>
<p>With gold prospecting, the results are nearly as certain as with construction jobs. Programmers know that their current project won&#8217;t compete against Linux. Mathematicians know that their research paper will be read, at most, by a handful of people. Writers know that their next book won&#8217;t sell more than a few hundred copies.</p>
<p>However, gold prospectors rely on probabilities. Even if your chance of success is only 1%, you are guaranteed (within 1%) to succeed if you try 500 times. If your chance of success is a meager 0.1%, then try 5000 times.</p>
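<p>To check the arithmetic, here is a minimal sketch in Python. It assumes that the attempts are independent and that each has the same chance of success; the function name is mine, chosen for illustration.</p>

```python
def chance_of_at_least_one_success(p, attempts):
    """Probability of one or more successes in `attempts` independent tries,
    each succeeding with probability p."""
    # The only way to fail overall is to fail every single time.
    return 1.0 - (1.0 - p) ** attempts


for p, attempts in [(0.01, 500), (0.001, 5000)]:
    chance = chance_of_at_least_one_success(p, attempts)
    print(f"p = {p}, attempts = {attempts}: {chance:.4f}")
```

<p>In both cases the chance of at least one success comes out above 99%, which is the sense in which success is guaranteed &#8220;within 1%&#8221;.</p>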
<p>That is why so many open-source programmers, scientists and writers thrive despite the odds: they simply keep on trying. It is only a risky business if you cannot try multiple times. You might be tempted to object that writers don&#8217;t try to market their books 500 times. But some do! J. A. Konrath collected nearly 500 rejections from publishers. In <a href="http://jakonrath.blogspot.com/2011/12/list-story-of-rejection.html">a recent blog post</a>, he provides a list of comments by major publishers who all rejected his best-seller The List. Though no publisher would take his book, he is now making $5000 a day from the sales of this self-published book alone. He won because he kept trying long enough. I am convinced that most gold prospectors eventually succeed, if they try long enough.</p>
<p>Shortly after completing my Ph.D. I decided to become an entrepreneur. I did not have contracts lined up. I did not have a business idea. I did not have investors. I did not have business experience. Yet I never ran out of money or work despite the apparent risks. Some projects were failures and I did lose time and money. Other projects were rewarding beyond my expectations. It averages out over time.  </p>
<p>Most prospectors just put their best work out and hope for the best. Sometimes they are lucky and the work is appreciated. Yet it is very difficult to predict success ahead of time. Some of the major research papers, including some work that led to Nobel prizes, were initially rejected. In fact, the number of Nobel-winning research papers that were initially rejected is astounding (Campanario). </p>
<p>Though my arguments appear reasonable enough, consider how hard it is to apply the gold prospector model in real life. Young researchers are often judged by the work they did during the Ph.D.: you only have a few short years that make or break the rest of your career. We are quick to judge a writer based on the market success of his books: it is difficult for us to imagine that some could go from a nobody to a wealthy writer like J. A. Konrath in a couple of years. </p>
<p>We tend to be overconfident in our assessment of gold prospectors. A much easier, and more meaningful, distinction is between gold prospectors and construction workers. Some people are simply not interested in finding gold. They would rather just build houses. Among college professors, you have those who are eager to start yet another research project (gold prospectors) and those who do research when it is a requirement to keep their job (construction workers). Among programmers, you have those who start their own projects, and those who will never touch code again if they can get a management job. Both types can excel at their jobs, but the spirit is different. Prospectors accept failures as part of their business, whereas construction workers would rather avoid failures entirely.</p>
<p><strong>Reference</strong>: Juan Miguel Campanario, <a href="http://www2.uah.es/jmc/nobel/nobel.html">Rejecting Nobel class articles and resisting Nobel class discoveries</a>.</p>
<p><strong>Source</strong>: This post was inspired by a Facebook comment made by <a href="http://www.ribbonfarm.com/about/">Venkatesh Rao</a> and by a face-to-face discussion with <a href="http://www.cs.ubc.ca/~beaudoin/">Philippe Beaudoin</a>. </p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2012/01/03/are-you-a-gold-prospector-or-a-construction-worker/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2012/01/03/are-you-a-gold-prospector-or-a-construction-worker/</feedburner:origLink></item>
		<item>
		<title>My favorite posts from 2011</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/S-mKKjFkRIM/</link>
		<comments>http://lemire.me/blog/archives/2011/12/28/my-favorite-posts-from-2011/#comments</comments>
		<pubDate>Wed, 28 Dec 2011 16:42:35 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3765</guid>
		<description>January: Innovating without permission Not even eventually consistent February: Taking scientific publishing to the next level Ten things Computer Science tells us about bureaucrats March: Know the biases of your operating system Breaking news: HTML+CSS is Turing complete April: How information technology is really built May: Automation will make you obsolete, no matter who you [...]</description>
			<content:encoded><![CDATA[<p>January:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/01/17/innovating-without-permission/">Innovating without permission</a></li>
<li><a href="http://lemire.me/blog/archives/2011/01/31/not-even-eventually-consistent/">Not even eventually consistent</a> </li>
</ul>
<p>February:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/02/11/taking-scientific-publishing-to-the-next-level/">Taking scientific publishing to the next level</a></li>
<li><a href="http://lemire.me/blog/archives/2011/04/21/ten-things-computer-science-tells-us-about-bureaucrats/">Ten things Computer Science tells us about bureaucrats</a></li>
</ul>
<p>March:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/03/23/know-the-biases-of-your-operating-system/">Know the biases of your operating system</a></li>
<li><a href="http://lemire.me/blog/archives/2011/03/08/breaking-news-htmlcss-is-turing-complete/">Breaking news: HTML+CSS is Turing complete</a></li>
</ul>
<p>April:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/04/04/how-information-technology-is-really-built/">How information technology is really built</a></li>
</ul>
<p>May:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/05/27/automation-will-make-your-job-obsolete-no-matter-who-you-are/">Automation will make you obsolete, no matter who you are</a></li>
<li><a href="http://lemire.me/blog/archives/2011/05/18/the-perils-of-filter-then-publish/">The perils of filter-then-publish</a></li>
<li><a href="http://lemire.me/blog/archives/2011/06/08/is-wikipedia-anti-intellectual/">Is Wikipedia anti-intellectual?</a></li>
</ul>
<p>June:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/06/06/why-i-still-program/">Why I still program</a></li>
<li><a href="http://lemire.me/blog/archives/2011/06/14/the-language-interpreters-are-the-new-machines/">The language interpreters are the new machines</a></li>
</ul>
<p>July:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/07/04/the-myth-of-the-unavoidable-hyperspecialization/">The myth of the unavoidable specialization</a></li>
</ul>
<p>August:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/08/15/the-web-is-killing-database-systems/">The Web is killing database systems</a></li>
</ul>
<p>September:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/09/06/science-is-self-regulatory-really/">Science is self-regulatory… really?</a></li>
<li><a href="http://lemire.me/blog/archives/2011/09/19/emerging-knowledge-is-a-private-business/">Emerging knowledge is a private business</a></li>
<li><a href="http://lemire.me/blog/archives/2011/09/13/you-think-that-users-are-faceless-objects-you-are-obsolete/">You think that users are faceless objects? You are obsolete!</a></li>
</ul>
<p>October:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/10/10/why-arent-we-getting-richer-the-scarring-tissue-theory/">Why aren’t we getting richer? The scarring tissue theory</a></li>
<li><a href="http://lemire.me/blog/archives/2011/10/25/it-is-not-where-you-work-but-who-you-work-with/">It is not where you work, but who you work with</a></li>
<li><a href="http://lemire.me/blog/archives/2011/10/23/how-database-design-fails-us-and-what-to-do-about-it/">How database design fails us, and what to do about it</a></li>
</ul>
<p>November:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/11/14/are-relational-databases-good-for-anything-anymore/">Are Relational Databases good for anything anymore?</a></li>
</ul>
<p>December:</p>
<ul>
<li><a href="http://lemire.me/blog/archives/2011/12/05/dealing-with-harsh-criticism/">Dealing with harsh criticism</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2011/12/28/my-favorite-posts-from-2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2011/12/28/my-favorite-posts-from-2011/</feedburner:origLink></item>
		<item>
		<title>Compressing document-oriented databases by rewriting your documents</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/viLbEZr_UL0/</link>
		<comments>http://lemire.me/blog/archives/2011/12/19/compressing-document-oriented-databases-by-rewriting-your-documents/#comments</comments>
		<pubDate>Mon, 19 Dec 2011 17:24:41 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3751</guid>
		<description>The space utilization of relational databases can be estimated quickly. If you create a table made of three columns, each containing an integer, you can expect the database to use roughly 12 bytes per row, plus some overhead. Unless your database is tiny, how you name your columns is irrelevant to the space utilization. Document-oriented [...]</description>
			<content:encoded><![CDATA[<p>The space utilization of relational databases can be estimated quickly. If you create a table made of three columns, each containing an integer, you can expect the database to use roughly 12 bytes per row, plus some overhead. Unless your database is tiny, how you name your columns is irrelevant to the space utilization.</p>
<p>Document-oriented databases such as <a href="http://www.mongodb.org/">MongoDB</a> are not so simple. There is room for optimization. Using short names for attributes is better. </p>
<p>For example, in going from <a href="http://www.json.org/">JSON</a> tuples of the form</p>
<p><code>{date_achat:'1999-06-30',article:'Echasses',<br />
quantite:1,prix:2800}</code></p>
<p>to these tuples where one attribute has a longer name</p>
<p><code>{date_achat:'1999-06-30',article<span style="color:red">fromoutstore</span>:'Echasses',<br />
quantite:1,prix:2800}<br />
</code>
</p>
<p>you increase the space utilization per tuple by 12 bytes (from 105 to 117 bytes per tuple).</p>
<p>The converse is true. Using shorter names is better:</p>
<p><code>{d:'1999-06-30',a:'Echasses',q:1,p:2800}<br />
</code></p>
<p>The space utilization per tuple goes down to 80 bytes (from 105 bytes). This is a saving of over 20%.</p>
<p>It is tempting to do away with the attribute names entirely and save the data as an array:</p>
<p><code>['1999-06-30','Echasses',1,2800]</code></p>
<p>Yet the space utilization remains at 80 bytes because the binary format used by MongoDB (<a href="http://bsonspec.org/">BSON</a>) does not store arrays  concisely.</p>
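<p>A rough way to see where these bytes go is to count the BSON encoding by hand. The sketch below is mine and rests on assumptions: it handles only flat documents holding strings and numbers, and it assumes that every number is stored as a 64-bit double (as happens when documents are entered through the MongoDB shell). The function name <code>bson_size</code> is hypothetical, not part of any driver. The absolute sizes come out lower than the per-tuple figures quoted above, presumably because those also count the <code>_id</code> field and storage overhead, but the differences between the variants can be compared directly.</p>

```python
def bson_size(doc):
    """Bytes needed to encode a flat document as BSON.

    Sketch only: values must be strings or numbers, and numbers are
    assumed to be stored as 64-bit doubles.
    """
    size = 4 + 1  # int32 total length + terminating null byte
    for name, value in doc.items():
        size += 1 + len(name.encode("utf-8")) + 1  # type byte + name + null
        if isinstance(value, str):
            size += 4 + len(value.encode("utf-8")) + 1  # int32 length + bytes + null
        else:
            size += 8  # 64-bit double
    return size


original = {"date_achat": "1999-06-30", "article": "Echasses",
            "quantite": 1, "prix": 2800}
longer = {"date_achat": "1999-06-30", "articlefromoutstore": "Echasses",
          "quantite": 1, "prix": 2800}
shorter = {"d": "1999-06-30", "a": "Echasses", "q": 1, "p": 2800}
# BSON encodes an array as a document whose keys are "0", "1", "2", ...
as_array = {"0": "1999-06-30", "1": "Echasses", "2": 1, "3": 2800}

for label, doc in [("original", original), ("longer", longer),
                   ("shorter", shorter), ("array", as_array)]:
    print(label, bson_size(doc))
```

<p>Under these assumptions, the longer attribute name costs exactly 12 extra bytes, the one-letter names save 25 bytes, and the array is no smaller than the one-letter version because each array index is stored as a one-character key.</p>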
<p>Should we worry about this issue? We live in an era of abundant storage and memory. MongoDB pre-allocates the storage to avoid disk fragmentation. Even the tiniest collection will use 128 MB, and larger collections are stored in 2 GB files: MongoDB is unafraid to waste nearly 2 GB or more. In fact, we might say that it is precisely because we live in such abundance that we can afford to use document-oriented databases. However, <a href="http://stackoverflow.com/questions/2966687/reducing-mongodb-database-file-size">engineers still face problems with space utilization</a>. Hence, it is useful to be aware of the effect that the names you choose will have, especially if you come from a relational database context where name length is irrelevant.</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2011/12/19/compressing-document-oriented-databases-by-rewriting-your-documents/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2011/12/19/compressing-document-oriented-databases-by-rewriting-your-documents/</feedburner:origLink></item>
		<item>
		<title>Dealing with harsh criticism</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/bt7IaFMgYKE/</link>
		<comments>http://lemire.me/blog/archives/2011/12/05/dealing-with-harsh-criticism/#comments</comments>
		<pubDate>Mon, 05 Dec 2011 19:23:53 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3740</guid>
		<description>Scott Adams, of Dilbert fame, once told how Dilbert fared poorly initially. His critics objected that Dilbert was hardly ever funny, except when he appeared at the office. Instead of falling prey to discouragement, Adams decided to portray Dilbert almost exclusively at the office from now on. And Scott became a world-famous millionaire. This is [...]</description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Scott_Adams">Scott Adams</a>, of <a href="http://en.wikipedia.org/wiki/Dilbert">Dilbert</a> fame, once told how Dilbert fared poorly initially. His critics objected that Dilbert was hardly ever funny, except when he appeared at the office. Instead of falling prey to discouragement, Adams decided to portray Dilbert almost exclusively at the office from now on. And Scott became a world-famous millionaire.</p>
<p>This is a perfect example of how to apply mental judo when faced with criticism: turn the negative energy which is opposing you, into positive energy that propels you.</p>
<p>I recently read Graeber&#8217;s <a href="http://www.amazon.com/Debt-First-5-000-Years/dp/1933633867/ref=sr_1_1?ie=UTF8&amp;qid=1323095810&amp;sr=8-1&amp;tag=daniellemires-20" rel="nofollow">Debt: the first 5000 years</a>. I wrote about the book on this blog <a href="http://lemire.me/blog/archives/2011/11/14/where-does-debt-credit-and-currencies-come-from/">a few weeks ago</a>. This is a brilliant, highly original book by an author who is leaving his mark on history. I gave it an excellent review (5 stars) on Amazon, and I recommend that everyone read this challenging book!</p>
<p>Yet the book is flawed. Venkatesh Rao summed up the problem well, though maybe too harshly:</p>
<blockquote><p><em>Debt</em> is not one big story spanning 5000 years, but more like a collection of 5000 little stories and arguments thrown together, with a bigger narrative almost slapped on as an afterthought.</p></blockquote>
<p>How did Graeber react when Rao criticized him online? He <a href="http://www.ribbonfarm.com/2011/12/01/how-the-world-works/#comment-13306">blamed Rao</a> for missing the boat:</p>
<blockquote><p>Sad really. A book is only as good as its readers.</p></blockquote>
<p>This is not the best response. So, how should you react?</p>
<ol>
<li><strong>Don&#8217;t be blinded by the negative. </strong>To be effective critics of your work, people have to study it. In the process, they often identify strengths. Pay close attention to your critics. You will often find out that they are not trashing your work entirely. Consider Scott Adams&#8217; story: Dilbert is funny only while at the office. Ah! So Dilbert <em>is</em> funny while at the office. In the Graeber story, Rao wrote: &#8220;the book provides a lot of astounding value&#8221;. See how Graeber paid no attention to this part?</li>
<li><strong>Be constructive.</strong> Instead of opposing your critics, work with them. They will often identify real weaknesses in your work.  You may not be able to fix these weaknesses (e.g., you cannot re-edit your book), but you should at least be able to take them into account in the future. An answer that I give often is &#8220;I&#8217;m aware of this problem and I plan to write about it in the future&#8221;. Scott Adams&#8217; story is the perfect illustration of this principle: instead of fighting his critics, he improved his work accordingly.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2011/12/05/dealing-with-harsh-criticism/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2011/12/05/dealing-with-harsh-criticism/</feedburner:origLink></item>
		<item>
		<title>3 surprising facts about the computation of scalar products</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/3AFNdS4C3LM/</link>
		<comments>http://lemire.me/blog/archives/2011/11/28/3-surprising-facts-about-the-computation-of-the-scalar-product/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 17:27:07 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3722</guid>
		<description>The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. Given two arrays such as (1,2) and (5,3), the scalar product is the sum of products 1 × 5 + 2 × 3. We have strong incentives to compute the [...]</description>
			<content:encoded><![CDATA[<p>The speed of many algorithms depends on how quickly you can multiply matrices or compute distances. In turn, these computations depend on the scalar product. Given two arrays such as (1,2) and (5,3), the scalar product is the sum of products 1 × 5 + 2 × 3. We have strong incentives to compute the scalar product as quickly as possible. Here are a few facts that I find surprising:</p>
<ul>
<li><strong>Recent processors (e.g., Intel i7) can multiply and add 32-bit numbers in a single CPU cycle or less.</strong> For each component in the arrays, we need to execute one multiplication and one addition. The latency of the multiplication is at least 3 cycles, meaning that we require 3 cycles to complete a multiplication. Similarly, additions require at least one cycle. Yet processors make aggressive use of pipelining: they execute several multiplications simultaneously so that they can produce one result every cycle.</li>
<li><strong>If you work with 64-bit integers and use some recent GNU GCC compilers (e.g., 4.5), you should disable Streaming SIMD Extensions (SSE) for better speed.</strong> In theory, the <a href="http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions">SSE</a> instructions are ideally suited to the computation of scalar products. Yet the cost with 64-bit integers goes from 1.3 cycles per multiplication with SSE2 disabled to 3.4 cycles per multiplication with SSE2 enabled. There is an optimization bug in the otherwise excellent GNU GCC compiler. Update: According to the numbers provided by John Regehr, this problem also affects some Intel compilers.</li>
<li><strong>When using SSE, 64-bit floating-point numbers may be faster than 32-bit floating-point numbers.</strong> Years ago I was told to avoid 64-bit floating-point numbers for performance reasons. It is not automatically good advice on all compilers, especially if you require standards compliance.</li>
</ul>
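<p>For readers who want to see the computation being timed, here is a minimal sketch of the scalar product, with and without manual loop unrolling. It is written in Python only to show the structure; it is not the benchmark code, and Python timings say nothing about the cycle counts reported in this post. The function names are mine.</p>

```python
def scalar_product(x, y):
    """Sum of the component-wise products of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("arrays must have the same length")
    total = 0
    for a, b in zip(x, y):
        total += a * b  # one multiplication and one addition per component
    return total


def scalar_product_unrolled(x, y):
    """Same computation with the loop body manually unrolled four times."""
    if len(x) != len(y):
        raise ValueError("arrays must have the same length")
    n = len(x)
    main = n - n % 4  # largest multiple of 4 that fits in n
    total = 0
    for i in range(0, main, 4):
        total += (x[i] * y[i] + x[i + 1] * y[i + 1]
                  + x[i + 2] * y[i + 2] + x[i + 3] * y[i + 3])
    for i in range(main, n):  # leftover components
        total += x[i] * y[i]
    return total


print(scalar_product((1, 2), (5, 3)))  # 1 * 5 + 2 * 3 = 11
```

<p>With integers, both functions return the same value. With floating-point numbers, the unrolled version adds the products in a different order, so the last digits may differ.</p>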
<p>For my tests, I initially used the flags &#8220;-funroll-loops -O3&#8221; on a recent Intel i7-2600 with the GNU GCC compiler version 4.5. In each instance, I have tested with and without manual loop-unrolling and I only report the best score (in cycles per multiplication). The C code is <a href="http://pastebin.com/DY2KFmX4">available</a>.</p>
<table border="1">
<tbody>
<tr>
<th>computation</th>
<th>with SSE</th>
<th>SSE2 disabled (-mno-sse2)</th>
<th>with AVX (-mavx)</th>
</tr>
<tr>
<td>32-bit integers</td>
<td>1.0</td>
<td>1.1</td>
<td><strong>0.5</strong></td>
</tr>
<tr>
<td>64-bit integers</td>
<td>3.4</td>
<td><strong>1.3</strong></td>
<td>3.4</td>
</tr>
<tr>
<td>float</td>
<td><strong>10</strong></td>
<td><strong>10</strong></td>
<td><strong>10</strong></td>
</tr>
<tr>
<td>double</td>
<td><strong>7.0</strong></td>
<td>170</td>
<td><strong>7.0</strong></td>
</tr>
</tbody>
</table>
<p>Upgrading to GCC 4.6.2 and replacing -O3 by the new -Ofast flag changes the results quite a bit at the expense of reliability (-Ofast disregards standards compliance).</p>
<table border="1">
<tbody>
<tr>
<th>computation</th>
<th>with SSE</th>
<th>SSE2 disabled (-mno-sse2)</th>
<th>with AVX (-mavx)</th>
</tr>
<tr>
<td>32-bit integers</td>
<td>1.0</td>
<td>1.1</td>
<td><strong>0.5</strong></td>
</tr>
<tr>
<td>64-bit integers</td>
<td><strong>1.0</strong></td>
<td>1.1</td>
<td><strong>1.0</strong></td>
</tr>
<tr>
<td>float</td>
<td>0.7</td>
<td>0.9</td>
<td><strong>0.3</strong></td>
</tr>
<tr>
<td>double</td>
<td>5.1</td>
<td><strong>2.5</strong></td>
<td>5.0</td>
</tr>
</tbody>
</table>
<p><strong>Further reading</strong>: See my previous post on this topic, <a href="http://lemire.me/blog/archives/2011/08/11/fast-computation-of-scalar-products-and-some-lessons-in-optimization/">Fast computation of scalar products, and some lessons in optimization</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2011/11/28/3-surprising-facts-about-the-computation-of-the-scalar-product/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2011/11/28/3-surprising-facts-about-the-computation-of-the-scalar-product/</feedburner:origLink></item>
		<item>
		<title>Are Relational Databases good for anything anymore?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/27rHUzAEl0k/</link>
		<comments>http://lemire.me/blog/archives/2011/11/14/are-relational-databases-good-for-anything-anymore/#comments</comments>
		<pubDate>Mon, 14 Nov 2011 17:11:35 +0000</pubDate>
		<dc:creator>Antonio Badia</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3689</guid>
		<description>(Note: thanks to Daniel for letting me use his blog. All comments, questions and criticisms are appreciated, but in this case Daniel is not the guilty party! Antonio) As a database researcher, I was &amp;#8220;brought up&amp;#8221; to think of databases as the ideal tool for data management. There was simply no good reason for anyone [...]</description>
			<content:encoded><![CDATA[<p>(Note: thanks to Daniel for letting me use his blog. All comments, questions and criticisms are appreciated, but in this case Daniel is not the guilty party! <a href="http://louisville.edu/speed/computer/people/faculty/badia">Antonio</a>)</p>
<p>As a database researcher, I was &#8220;brought up&#8221; to think of databases as the ideal tool for data management. There was simply no good reason for anyone with data to manage to use something else. It was with dismay, then, that little by little I came to realize that the importance of databases for data management is diminishing tremendously out there in the real world. The advent of NoSQL was a clear call for anyone paying attention (although in theory NoSQL does not necessarily mean non-relational, in practice it does). The fact that scientists are using anything but (relational) databases for e-science is another sign that the Apocalypse is upon us (just kidding).</p>
<p>It seems clear to me that it must be admitted that there are situations in which a (relational) database is simply not the appropriate tool. This immediately brings forth a question: When is a (relational) database the right tool? My idea to attack this question was to look at the problems people are trying to solve, the processes they use to solve them, and to classify the data those processes deal with along relevant dimensions. Of course, what is relevant is in the eye of the beholder; but there are some characteristics that have traditionally been challenging: dealing with very large data sets, and dealing with irregular data. So I analyzed processes and their data based on those two dimensions.</p>
<ul>
<li><strong>data size:</strong> to classify data by size, I used an entirely pragmatic measure. Roughly, if a data set fits into memory, it is small (up to 2GB or so in today&#8217;s computers); if it fits on a single disk (or, to be exact, a disk array with a single controller), it is medium (up to 1TB or so); if it needs several disks/parallel/distributed systems, it&#8217;s large (note that this last category includes perhaps too much, since it goes from a few TB to PBs and beyond). The motivation for this division is clear: algorithms that can assume that all data fits in memory are &#8216;qualitatively&#8217; different from those that need disk access. And adapting algorithms for parallel/distributed environments presents non-trivial challenges. So yes, these categories are very much dependent on current architecture and technology, and may have to be revised once a great solid state device is developed. But until a revolution comes, that will just mean changing the exact &#8216;boundaries&#8217;. And I think the division is very relevant for people using the processes, especially for those that require interactive (or near interactive) response time.</li>
<li><strong>data complexity:</strong> intuitively, this has to do with the data structure or organization, but admittedly, it is a much harder nut to crack than the previous one. After some thinking, I decided to subdivide the problem along 3 &#8216;sub-dimensions&#8217;:</li>
</ul>
<ul>
<li><em>Conceptual complexity:</em> Intuitively, this is the number of elements (entities, relationships) in a conceptual model of the process&#8217; data. This is clearly a bit vague, since one can build several conceptual models of the same data, but it is not important here which entities one &#8220;sees&#8221; in the data, as to &#8216;how many&#8217;. Of course, both issues are related, but still there is an intuitive sense in which this matters. So, if a conceptual model would identify 1 main entity, everything else being attributes or entities &#8220;strongly connected&#8221; to it, then that data would be considered simple -the idea is that you could keep all data in 1 file, with minimal redundancy. If you can identify several entities, then the complexity is medium. How many? There is no clear-cut number here. One possibility is to say that it&#8217;s not too much for one person to comprehend &#8220;at once&#8221;; based on the famous finding of &#8220;the magical number seven, plus minus two&#8221;, but taking into account that one can get helps from &#8216;external memory devices&#8217; (i.e. pencil and paper), we could set an upper limit of 10 to 20 entities. More than that, and we consider the conceptual complexity large.</li>
<li><em>regularity:</em> relational databases assume that you can define a schema for your data beforehand. However, this is not always possible. There has been much research on dealing with somewhat irregular (semistructured) data, but here I make another distinction that I consider more relevant to today&#8217;s processes: whether the schema is &#8216;closed vocabulary&#8217; (i.e. all possible entities/attributes/relations can be enumerated once and for all) or &#8216;open&#8217;. Schemas (both relational and object-oriented), taxonomies, ontologies, even DTDs and XML Schema-compliant data are closed. Of course, within those there are distinctions. When the schema comes first (relational and object-oriented), I consider that regularity is high (and complexity low); when schema and data may be decoupled (as in semistructured data), I consider that regularity is medium (and so is complexity). On the open vocabulary front, regularity is very low (and complexity is very high). Note that key-value stores are considered &#8216;open&#8217;, since in many of them the value is opaque (i.e. it could be anything). A key-value layout is not really a schema; it&#8217;s just a convenient way to distribute data (keys are made up and used for hashing or sorting; they are not inherent to the data).</li>
<li><em>schema rate of change:</em> even if one has a schema, or at least a closed vocabulary, one may not be able to take full advantage of this fact if the schema keeps on changing. Relational databases implicitly assume that once a schema is created, there is going to be little to no change. At the other extreme, in some modern processes nothing is assumed about how data evolves, and there is complete freedom: two objects in a collection may have completely different attributes. Also, an object may change to the point that it has nothing in common with the original (except the key, of course). If there is no change, or only very infrequent change, I consider complexity low; if some change is allowed, but within certain limits, I consider it medium; if any type of change is allowed, I consider it high.</li>
</ul>
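<p>To make the scheme concrete, here is a minimal sketch in Python of how one might score a data set along these dimensions. The names (<code>Level</code>, <code>size_level</code>, <code>conceptual_level</code>, <code>profile</code>) and the string labels are hypothetical, chosen only for illustration. The thresholds are the rough figures quoted above (2GB, 1TB, and the upper end of the 10 to 20 entity range), and they are assumptions that would shift with hardware.</p>

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


GB = 10**9
TB = 10**12


def size_level(num_bytes, memory_limit=2 * GB, disk_limit=1 * TB):
    """Small if it fits in memory, medium if it fits on one disk, else large."""
    if num_bytes > disk_limit:
        return Level.HIGH
    if num_bytes > memory_limit:
        return Level.MEDIUM
    return Level.LOW


def conceptual_level(num_entities, max_comprehensible=20):
    """One main entity is simple; more than a person can grasp is large."""
    if num_entities > max_comprehensible:
        return Level.HIGH
    if num_entities > 1:
        return Level.MEDIUM
    return Level.LOW


# Complexity implied by regularity: high regularity means low complexity.
REGULARITY_COMPLEXITY = {
    "schema_first": Level.LOW,       # relational, object-oriented
    "decoupled": Level.MEDIUM,       # semistructured data
    "open_vocabulary": Level.HIGH,   # e.g. opaque key-value stores
}

# Complexity implied by how freely the schema may change.
CHANGE_COMPLEXITY = {
    "rare": Level.LOW,
    "bounded": Level.MEDIUM,
    "unrestricted": Level.HIGH,
}


def profile(num_bytes, num_entities, schema_kind, change_kind):
    """Collect the four scores for one data set in a dictionary."""
    return {
        "size": size_level(num_bytes),
        "conceptual": conceptual_level(num_entities),
        "regularity_complexity": REGULARITY_COMPLEXITY[schema_kind],
        "schema_change": CHANGE_COMPLEXITY[change_kind],
    }
```

<p>For instance, <code>profile(300 * GB, 12, "schema_first", "rare")</code> scores a 300GB data set with a dozen entities and a stable, schema-first design as medium in size and conceptual complexity, and low on the other two dimensions.</p>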
<p>Thus, we can analyze a process by looking at the data it consumes and/or generates, to decide what kind of system supports it best. The funny thing is, under this analysis, RDBMS seem adequate only when size and complexity are medium: in particular, data size is medium and some conceptual complexity can be handled, but regularity must be high and schema rate of change must be low. For processes with other characteristics, an RDBMS may not be well suited. Some people will argue that most RDBMS nowadays can handle very large data and irregular data, since commercial systems come with &#8216;cluster&#8217;-based extensions, as well as extensions to handle XML. But many of the problems facing large datasets are not solved by throwing more hardware at them. For high availability, replication is needed, and this brings issues for transaction support. For complex data analysis, the approaches required are often not supported (or not supported well) by SQL; hence, just being able to store the data is not enough. So for many processes, using a distributed RDBMS or a cluster-based RDBMS will still not do. As for extensions to handle objects, XML, and even text, the problem is that each one of these extensions was basically a compromise that yielded unwieldy systems, lost much of the simplicity of the original relational model, and in exchange gave clumsy tools.</p>
<p>What happens in other situations? When your data is low in size and low in complexity, some processes make do with files and some domain-specific programs. This seems to be the case with much e-science. The overhead that a database brings is just not worth it. As for very large data, this seems to be the niche of all these shiny new NoSQL systems.</p>
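<p>Read as a decision rule, the analysis so far might be sketched as follows. This is only one possible reading of the argument, not a definitive procedure: the function name <code>suggest_system</code>, the level labels and the returned strings are all hypothetical, and any combination the discussion does not cover is reported as having no clear fit.</p>

```python
LEVELS = ("low", "medium", "high")


def suggest_system(size, conceptual, regularity, schema_change):
    """Map the four levels to the kind of system the analysis points to.

    Each argument is one of "low", "medium" or "high". Note that
    `regularity` is the regularity itself, so "high" means a schema-first,
    closed-vocabulary design.
    """
    for value in (size, conceptual, regularity, schema_change):
        if value not in LEVELS:
            raise ValueError("unknown level: %r" % (value,))

    # Very large data is treated as the niche of NoSQL systems.
    if size == "high":
        return "NoSQL system"

    # A fixed schema that rarely changes.
    stable = regularity == "high" and schema_change == "low"

    # Small, simple data: plain files and domain-specific programs.
    if size == "low" and conceptual == "low" and stable:
        return "files plus domain-specific programs"

    # Medium-sized data with a stable schema: the relational niche.
    if size == "medium" and stable:
        return "RDBMS"

    # Everything else is not settled by the discussion.
    return "no clear fit"
```

<p>Seen this way, the relational niche is a single branch among several, which is the narrowness at issue.</p>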
<p>So, all in all, it seems that RDBMS are being relegated to a very narrow niche. To be sure, it is still a very profitable one, so there is no big market pressure on database companies (yet, although some of them are already responding, like Oracle with its NoSQL database). One can argue that RDBMS have been &#8216;under threat&#8217; before, and have reacted to it, but as a result RDBMS seem more bloated than ever to some users, while still failing to satisfy their information needs. Maybe it&#8217;s time to rethink the whole thing from scratch?</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2011/11/14/are-relational-databases-good-for-anything-anymore/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2011/11/14/are-relational-databases-good-for-anything-anymore/</feedburner:origLink></item>
		<item>
		<title>Where do debt, credit and currencies come from?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/Ho9d3jI_TFw/</link>
		<comments>http://lemire.me/blog/archives/2011/11/14/where-does-debt-credit-and-currencies-come-from/#comments</comments>
		<pubDate>Mon, 14 Nov 2011 15:53:07 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category />

		<guid isPermaLink="false">http://lemire.me/blog/?p=3674</guid>
		<description>We often believe that primitive cultures lacked currencies and so they engaged in barter. Barter is awfully inconvenient and simply cannot sustain a non-trivial economy. Thus, we conclude that someone invented currencies out of convenience. Later, some astute fellow invented credit and loans. In Debt: the first 5000 years, David Graeber teaches us that we [...]</description>
			<content:encoded><![CDATA[<p>We often believe that primitive cultures lacked currencies and so they engaged in barter. Barter is awfully inconvenient and simply cannot sustain a non-trivial economy. Thus, we conclude that someone invented currencies out of convenience. Later, some astute fellow invented credit and loans.</p>
<p>In <a href="http://www.amazon.com/Debt-First-5-000-Years/dp/1933633867?tag=daniellemires-20" rel="nofollow">Debt: the first 5000 years</a>, <a href="http://en.wikipedia.org/wiki/David_Graeber">David Graeber</a> teaches us that we have the story backward. History followed a pattern that looked more like this:</p>
<ol>
<li>We started out living in <a href="http://en.wikipedia.org/wiki/Smurf">Smurf</a> villages. The Smurfs are little blue fellows who have no use for money. Some contribute more, some contribute less, but nobody is counting. Typically, the more a Smurf contributes, the higher his status. Those who contribute little can be ridiculed or harassed, but everyone gets to eat as long as there is enough food.  When there is a major task, everyone joins up. This is communism and it requires a high level of trust. In short, you must have no incentive to hoard commodities. You may own your own food, but it is understood that anyone from the village who needs it can come and take it from you.</li>
<li>Some form of symbolic currency (such as gold or silver) emerges but it is mostly for ceremonial purposes. For example, you may give some amount of silver to a family so that you can wed their daughter. However, you are not &#8220;buying&#8221; the daughter. In any case, nobody will sell you bread in exchange for silver.</li>
<li>There is always some barter, but mostly between strangers.</li>
<li>As trade expands, it is no longer possible to rely solely on communism and barter. So credit appears. You give a merchant a pile of wood that he will trade for food in the next village. Because you don&#8217;t entirely trust the merchant, you keep a record of what you give him. Maybe you also want to share in the profit, so you ask for interest. This record you have created is new &#8220;money&#8221;. You can trade this record for other things while the merchant is away. Of course, the value of these records depends on your reputation. Kings and priests have a better chance of trading such records because they are better known. These records may refer to some units of a precious metal such as gold, but no precious metal is ever exchanged in practice.</li>
<li>Consumer debt has always been a problem. When people face too much debt, they tend to either rebel or exit the system. Some people might join nomads who later regroup and attack the cities. Eventually, governments have to defend the indebted population or face collapse.</li>
<li>When you raise a large army, you need to pay the soldiers. How? More precisely, how do you get the conquered population to trade with your soldiers? Well, you have typically just acquired a large quantity of gold by confiscating jewels and such. So you create a currency out of it, which you hand over to your soldiers. Then you ask the conquered population to give you tribute (or pay taxes) in the form of this new currency. People are then forced to trade with the soldiers.</li>
</ol>
<p>My thoughts:</p>
<ul>
<li>Money came from credit. Currencies came after.</li>
<li>Gold-backed currencies and corresponding taxes were invented to sustain military might.</li>
<li>The American dollar is backed by American military might.</li>
<li>Consumer debt and usury are closely related to slavery and correlated with the fall of empires.</li>
<li>It is likely that we are genetically geared toward the Smurf-village model, that is, communism. Non-ceremonial barter was probably dangerous and uncommon. It is probably a myth that we are jungle animals who try to maximize the profit from every trade. It is perhaps not surprising that people have so much difficulty with debt.</li>
</ul>
<p><strong>Credit</strong>: I got this book free from <a href="https://plus.google.com/114495700627365584372/about">William Tozier</a>. As far as I can tell, William does not benefit from the sales of this book. Rather, I believe that he wants to promote the debate on these issues.</p>
]]></content:encoded>
			<wfw:commentRss>http://lemire.me/blog/archives/2011/11/14/where-does-debt-credit-and-currencies-come-from/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		<feedburner:origLink>http://lemire.me/blog/archives/2011/11/14/where-does-debt-credit-and-currencies-come-from/</feedburner:origLink></item>
	</channel>
</rss>

