<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Daniel Lemire's blog</title>
	
	<link>http://www.daniel-lemire.com/blog</link>
	<description>Computer Science researcher and Open Scholar: Web, OLAP, Databases, Time Series, Collaborative Filtering, Information Retrieval, e-Learning.</description>
	<lastBuildDate>Wed, 21 Jul 2010 14:06:02 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/daniel-lemire/atom" /><feedburner:info uri="daniel-lemire/atom" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><creativeCommons:license>http://creativecommons.org/licenses/by-nc-sa/2.0/</creativeCommons:license><feedburner:emailServiceId>daniel-lemire/atom</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><feedburner:feedFlare href="http://www.bloglines.com/sub/http://feeds.feedburner.com/daniel-lemire/atom" src="http://www.bloglines.com/images/sub_modern11.gif">Subscribe with Bloglines</feedburner:feedFlare><feedburner:feedFlare href="http://fusion.google.com/add?feedurl=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://buttons.googlesyndication.com/fusion/add.gif">Subscribe with Google</feedburner:feedFlare><feedburner:feedFlare href="http://www.plusmo.com/add?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://plusmo.com/res/graphics/fbplusmo.gif">Subscribe with Plusmo</feedburner:feedFlare><feedburner:feedFlare href="http://www.thefreedictionary.com/_/hp/AddRSS.aspx?http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://img.tfd.com/hp/addToTheFreeDictionary.gif">Subscribe with The Free Dictionary</feedburner:feedFlare><feedburner:feedFlare href="http://www.bitty.com/manual/?contenttype=rssfeed&amp;contentvalue=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.bitty.com/img/bittychicklet_91x17.gif">Subscribe with Bitty Browser</feedburner:feedFlare><feedburner:feedFlare href="http://www.newsalloy.com/?rss=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.newsalloy.com/subrss3.gif">Subscribe with NewsAlloy</feedburner:feedFlare><feedburner:feedFlare 
href="http://www.live.com/?add=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://tkfiles.storage.msn.com/x1piYkpqHC_35nIp1gLE68-wvzLZO8iXl_JMledmJQXP-XTBOLfmQv4zhj4MhcWEJh_GtoBIiAl1Mjh-ndp9k47If7hTaFno0mxW9_i3p_5qQw">Subscribe with Live.com</feedburner:feedFlare><feedburner:feedFlare href="http://mix.excite.eu/add?feedurl=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://image.excite.co.uk/mix/addtomix.gif">Subscribe with Excite MIX</feedburner:feedFlare><feedburner:feedFlare href="http://download.attensa.com/app/get_attensa.html?feedurl=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.attensa.com/blogs/attensa/WindowsLiveWriter/BadgeredintoBadges_10C02/attensa_feed_button5.gif">Subscribe with Attensa for Outlook</feedburner:feedFlare><feedburner:feedFlare href="http://www.webwag.com/wwgthis.php?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.webwag.com/images/wwgthis.gif">Subscribe with Webwag</feedburner:feedFlare><feedburner:feedFlare href="http://www.podcastready.com/oneclick_bookmark.php?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.podcastready.com/images/podcastready_button.gif">Subscribe with Podcast Ready</feedburner:feedFlare><feedburner:feedFlare href="http://www.flurry.com/pushRssFeed.do?r=fb&amp;url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.flurry.com/images/flurry_rss_logo2.gif">Subscribe with Flurry</feedburner:feedFlare><feedburner:feedFlare href="http://www.wikio.com/subscribe?url=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.wikio.com/shared/img/add2wikio.gif">Subscribe with Wikio</feedburner:feedFlare><feedburner:feedFlare href="http://www.dailyrotation.com/index.php?feed=http%3A%2F%2Ffeeds.feedburner.com%2Fdaniel-lemire%2Fatom" src="http://www.dailyrotation.com/rss-dr2.gif">Subscribe with Daily Rotation</feedburner:feedFlare><item>
		<title>Is multiplication slower than addition?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/fJJSYKS_4sA/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/07/19/is-multiplication-slower-than-addition/#comments</comments>
		<pubDate>Mon, 19 Jul 2010 17:09:34 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2659</guid>
		<description>Earlier, I asked whether integer addition was faster than bitwise exclusive or. My tests showed no difference, and nobody contradicted me. However, everyone knows that multiplication is slower than addition, right? In cryptography, there are many papers on how to trade multiplications for additions to speed up software. So? Can you predict which piece of [...]</description>
			<content:encoded><![CDATA[<p>Earlier, I <a href="http://www.daniel-lemire.com/blog/archives/2010/03/12/which-is-fastest-integer-addition-or-xor/">asked</a> whether integer addition was faster than bitwise exclusive or. My tests showed no difference, and nobody contradicted me.</p>
<p>However, everyone knows that multiplication is slower than addition, right? In cryptography, there are many papers on how to trade multiplications for additions to speed up software.</p>
<p>So? Can you predict which piece of code runs faster?</p>
<p><strong>scalar product (N multiplications):</strong><br />
<code><br />
for(int k =0; k &lt; N ; ++k)<br />
answer += vector1[k] * vector2[k];<br />
</code></p>
<p><strong>scalar product two-by-two (N multiplications):</strong><br />
<code> for(int k =0; k &lt; N ; k+=2)<br />
answer += vector1[k] * vector2[k]<br />
+vector1[k+1] * vector2[k+1];</code></p>
<p><strong>non-standard scalar product (N/2 multiplications):</strong><code><br />
for(int k =0; k &lt; N ; k+=2)<br />
answer += ( vector1[k] + vector2[k] )<br />
* ( vector1[k+1] + vector2[k+1] );<br />
</code></p>
<p><strong>just additions (no multiplication):</strong><code><br />
for(int k =0; k &lt; N ; ++k)<br />
answer += vector1[k] + vector2[k];<br />
</code></p>
<p><strong>Answer:</strong> Merely reducing the number of multiplications has no benefit, in these tests. Hence, simple computational cost models (such as counting the number of multiplications) may not hold on modern <a href="http://en.wikipedia.org/wiki/Superscalar">superscalar</a> processors.</p>
<p>My results using GNU GCC 4.2.1 on both a desktop and a laptop:</p>
<table border="1">
<tbody>
<tr>
<th>algorithm</th>
<th>Intel Core i7</th>
<th>Intel Core 2 Duo</th>
</tr>
<tr>
<td>scalar product</td>
<td>0.30</td>
<td>0.39</td>
</tr>
<tr>
<td>scalar product (2&#215;2)</td>
<td>0.25</td>
<td>0.39</td>
</tr>
<tr>
<td>fewer multiplications</td>
<td>0.25</td>
<td>0.39</td>
</tr>
<tr>
<td>just additions</td>
<td>0.16</td>
<td>0.23</td>
</tr>
</tbody>
</table>
<p>Times are in seconds. The source code is available <a href="http://pastebin.com/cdMMLMZm">without pointer arithmetic</a>. The same test with pointer arithmetic gives faster results, but the same conclusion. I tried a <a href="http://pastebin.com/YxfVcvue">similar experiment</a> in Java. It confirms my result.</p>
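<p>To make clear what each of the four loops computes, here is a minimal Python transcription. It is a sketch for checking values, not for timing: the function names are my own, and the two-by-two variants assume an even-length input, as the loops above do. Note that the version with N/2 multiplications does not compute the scalar product; it only serves as a workload with fewer multiplications.</p>

```python
def scalar_product(v1, v2):
    """N multiplications: the standard scalar product."""
    answer = 0
    for k in range(len(v1)):
        answer += v1[k] * v2[k]
    return answer


def scalar_product_2x2(v1, v2):
    """N multiplications, two terms per iteration (assumes an even length)."""
    answer = 0
    for k in range(0, len(v1), 2):
        answer += v1[k] * v2[k] + v1[k + 1] * v2[k + 1]
    return answer


def nonstandard_product(v1, v2):
    """N/2 multiplications; NOT the scalar product (assumes an even length)."""
    answer = 0
    for k in range(0, len(v1), 2):
        answer += (v1[k] + v2[k]) * (v1[k + 1] + v2[k + 1])
    return answer


def just_additions(v1, v2):
    """No multiplication at all."""
    answer = 0
    for k in range(len(v1)):
        answer += v1[k] + v2[k]
    return answer


vector1, vector2 = [1, 2, 3, 4], [5, 6, 7, 8]
print(scalar_product(vector1, vector2))       # 70
print(scalar_product_2x2(vector1, vector2))   # 70
print(nonstandard_product(vector1, vector2))  # 168
print(just_additions(vector1, vector2))       # 36
```

<p>The first two functions agree on every even-length input, which is why the table can compare them directly.</p>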
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/07/19/is-multiplication-slower-than-addition/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/07/19/is-multiplication-slower-than-addition/</feedburner:origLink></item>
		<item>
		<title>General versus domain intelligence</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/nRO8ewN2Clw/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/07/13/general-versus-domain-intelligence/#comments</comments>
		<pubDate>Tue, 13 Jul 2010 13:44:19 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2637</guid>
		<description>Our brains come with hard-wired algorithms. Cats can catch birds or mice without thinking about it. I can grab and eat a strawberry without thinking. The Savanna-IQ Interaction Hypothesis says that general intelligence may originally have evolved as a domain-specific adaptation to deal with evolutionarily novel, nonrecurrent problems. We can derive from this hypothesis that [...]</description>
			<content:encoded><![CDATA[<p>Our brains come with hard-wired algorithms. Cats can catch birds or mice without thinking about it. I can grab and eat a strawberry without thinking. The <a href="http://www.psych-it.com.au/Psychlopedia/article.asp?id=331">Savanna-IQ Interaction Hypothesis</a> says that general intelligence may originally have evolved as a domain-specific adaptation to deal with evolutionarily novel, nonrecurrent problems. We can derive from this hypothesis that people with better general intelligence won&#8217;t be better at routine tasks. In fact, they may fare worse at them! They may only have an edge for novel tasks. Thus, general and domain intelligence may be somewhat separate entities.</p>
<p>How do you recognize people with better general intelligence? They are better at adapting to new settings. They are the first to adopt new strategies. But they may not be very good at baseball or boxing, and they may be socially inept.</p>
<p>Modern Artificial Intelligence (and Machine Learning) is typically domain-specific. My spam filter can detect spam, but it won&#8217;t ever do anything else. Our software has <em>evolved</em> to cope with specific problems. Yet, we still lack software with general intelligence. Trying to build better spam filters may be orthogonal to achieving general intelligence in software. In fact, software with good general intelligence may not do so well at spam filtering.</p>
<p><strong>Reference</strong>: Satoshi Kanazawa, Kaja Perina, <a href="http://personal.lse.ac.uk/Kanazawa/pdfs/PAID2009.pdf">Why night owls are more intelligent</a>, Personality and Individual Differences 47 (2009) 685–690</p>
<p><strong>Further reading</strong>: <a href="http://apperceptual.wordpress.com/2008/10/25/language-cognition-and-evolution-modularity-versus-unity/">Language, Cognition, and Evolution: Modularity versus Unity</a> by Peter Turney</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/07/13/general-versus-domain-intelligence/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/07/13/general-versus-domain-intelligence/</feedburner:origLink></item>
		<item>
		<title>Summer reading: my recommendations</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/0DkobO00u6Q/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/07/09/summer-reading-my-recommendations/#comments</comments>
		<pubDate>Fri, 09 Jul 2010 21:22:31 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2629</guid>
		<description>Containment by Christian Cantrell is an excellent sci-fi novel. And you can grab it nearly for free from the author&amp;#8217;s page. The premise of the book is that humanity built a colony on Venus. Children are told that Earth cannot be reached. Massive research into economical oxygen production is required for long-term survival. Indeed, [...]</description>
			<content:encoded><![CDATA[<p><img style="float: left; margin: 5px; width: 100px;" src="http://www.livingdigitally.net/books/containment/containment_150x225.jpg" alt="containment" /><a href="http://www.livingdigitally.net/containment.html">Containment</a> by Christian Cantrell is an excellent sci-fi novel. And you can <a href="http://www.livingdigitally.net/containment.html">grab it nearly for free</a> from the author&#8217;s page. The premise of the book is that humanity built a colony on Venus. Children are told that Earth cannot be reached. Massive research into economical oxygen production is required for long-term survival. Indeed, plants cannot survive on the surface of Venus. Or can they? Couldn&#8217;t we design special plants that could survive? One of the young researchers sets out to answer the question. Unfortunately, he won&#8217;t like the answer. The plot may not be extraordinary, but there are many things to like for computer nerds. For example, the book is set in a future where we appear to have cheap quantum computing. Or, at least, some very fast computers. One of the consequences is that any sufficiently smart kid can break any encryption. Moreover, it is cheaper to simulate most physical experiments than to actually execute them.</p>
<p><img style="float: left; margin: 5px; width: 100px;" src="http://photo.goodreads.com/books/1171481840m/101869.jpg" alt="atrocity archive" />The <a href="http://en.wikipedia.org/wiki/The_Atrocity_Archives">Atrocity Archives</a> by <a href="http://en.wikipedia.org/wiki/Charles_Stross">Charles Stross</a> is the first in an ongoing series of books. Stross was a software engineer, and it shows. His book reveals many secrets all Computer Scientists should know. For example, do you know why Knuth will never finish <a href="http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming">The Art of Computer Programming</a>, no matter what he tells us? Here&#8217;s a quote:</p>
<blockquote><p>The [Turing] theorem is a hack on discrete number theory that simultaneously disproves the Church-Turing hypothesis (wave if you understood that) and worse, permits NP-complete problems to be converted into P-complete ones. This has several consequences, starting with screwing over most cryptography algorithms—translation: all your bank account are belong to us—and ending with the ability to computationally generate a Dho-Nha geometry curve in real time.</p>
<p>This latter item is just slightly less dangerous than allowing nerds with laptops to wave a magic wand and turn them into hydrogen bombs at will. Because, you see, everything you know about the way this universe works is correct—except for the little problem that this isn&#8217;t the only universe we have to worry about. Information can leak between one universe and another. And in a vanishingly small number of other universes there are things that listen, and talk back—see Al-Hazred, Nietzsche, Lovecraft, Poe, et cetera. The many-angled ones, as they say, live at the bottom of the Mandelbrot set, except when a suitable incantation in the platonic realm of mathematics—computerised or otherwise—draws them forth. (And you thought running that fractal screensaver was good for your computer?)</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/07/09/summer-reading-my-recommendations/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/07/09/summer-reading-my-recommendations/</feedburner:origLink></item>
		<item>
		<title>The five most important algorithms?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/lIgnUXwWaTE/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/07/05/the-five-most-important-algorithms/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 01:42:14 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2621</guid>
		<description>Bernhard Koutschan posted a compilation of the most important algorithms. The goal is to determine the 5 most important algorithms. Out of his list, I would select the following five algorithms: Binary search is the first non-trivial algorithm I remember learning. The Fast Fourier transform (FFT) is an amazing algorithm. Combined with the Convolution theorem, [...]</description>
			<content:encoded><![CDATA[<p>Bernhard Koutschan posted a compilation of the <a href="http://www.risc.jku.at/people/ckoutsch/stuff/e_algorithms.html">most important algorithms</a>. The goal is to determine the 5 most important algorithms. Out of his list, I would select the following five algorithms:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Binary_search_algorithm">Binary search</a> is the first non-trivial algorithm I remember learning.</li>
<li>The <a href="http://en.wikipedia.org/wiki/Fast_Fourier_transform">Fast Fourier transform (FFT)</a> is an amazing algorithm. Combined with the <a href="http://en.wikipedia.org/wiki/Convolution_theorem">Convolution theorem</a>, it lets you do magic.</li>
<li>While <a href="http://en.wikipedia.org/wiki/Hash_function">hashing</a> is not an algorithm, it is one of the most powerful and useful ideas in Computer Science. It takes minutes to explain, but years to master.</li>
<li><a href="http://en.wikipedia.org/wiki/Merge_sort">Merge sort</a> is the most elegant sorting algorithm. You can explain it in three sentences to anyone.</li>
<li>While not an algorithm per se, the <a href="http://en.wikipedia.org/wiki/Singular_Value_Decomposition">Singular Value Decomposition</a> (SVD) is the most important Linear Algebra concept <em>I don&#8217;t remember learning as an undergraduate</em>. (And yes, I went to a <a href="http://www.math.toronto.edu/">good school</a>. And yes, I was an A student.) It can help you <a href="http://en.wikipedia.org/wiki/Pseudoinverse">invert singular matrices</a> and do other similar magic.</li>
</ul>
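<p>As an illustration of how short merge sort is, here is a minimal top-down sketch in Python. The function name and the choice to return a new list are my own; this is one straightforward way to write it, not a tuned implementation. The three sentences: split the list in half, sort each half recursively, then merge the two sorted halves by repeatedly taking the smaller front element.</p>

```python
def merge_sort(values):
    """Return a new sorted list using top-down merge sort."""
    # A list with zero or one element is already sorted.
    if len(values) in (0, 1):
        return list(values)

    # Split in half and sort each half recursively.
    middle = len(values) // 2
    left = merge_sort(values[:middle])
    right = merge_sort(values[middle:])

    # Merge: repeatedly take the smaller front element.
    merged = []
    i = j = 0
    while len(left) > i and len(right) > j:
        if left[i] > right[j]:
            merged.append(right[j])
            j += 1
        else:
            # On ties we take from the left half, which keeps the sort stable.
            merged.append(left[i])
            i += 1

    # One of the halves is exhausted; append what remains of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```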
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/07/05/the-five-most-important-algorithms/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/07/05/the-five-most-important-algorithms/</feedburner:origLink></item>
		<item>
		<title>NoSQL or NoJoin?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/AP69y9-lxik/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/06/28/nosql-or-nojoin/#comments</comments>
		<pubDate>Mon, 28 Jun 2010 13:28:59 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Data Warehousing and OLAP]]></category>
		<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2614</guid>
		<description>Several major players built alternatives to conventional database systems: Google created BigTable, Amazon built Dynamo and Facebook initiated Cassandra. There are many other comparable open source initiatives such as CouchDB and MongoDB. These systems are part of a trend called NoSQL because it is not centered around the SQL language. While there have always been [...]</description>
			<content:encoded><![CDATA[<p>Several major players built alternatives to conventional database systems: Google created <a href="http://en.wikipedia.org/wiki/BigTable">BigTable</a>, Amazon built <a href="http://en.wikipedia.org/wiki/Dynamo_(storage_system)">Dynamo</a> and Facebook initiated <a href="http://en.wikipedia.org/wiki/Apache_Cassandra">Cassandra</a>. There are many other comparable open source initiatives such as <a href="http://en.wikipedia.org/wiki/CouchDB">CouchDB</a> and <a href="http://en.wikipedia.org/wiki/MongoDB">MongoDB</a>. These systems are part of a trend called <a href="http://en.wikipedia.org/wiki/Nosql">NoSQL</a> because it is not centered around the <a href="http://en.wikipedia.org/wiki/Sql">SQL</a> language. While there have always been database systems not based on SQL, the rising popularity of these alternatives in industry is drawing attention.</p>
<p>In <a href="http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext">The &#8220;NoSQL&#8221; Discussion has Nothing to Do With SQL</a>, Stonebraker opposes the <a href="http://en.wikipedia.org/wiki/Nosql">NoSQL trend</a> in these terms:</p>
<blockquote><p>(&#8230;) blinding performance depends on removing overhead. Such overhead has nothing to do with SQL, but instead revolves around traditional implementations of ACID transactions, multi-threading, and disk management.</p></blockquote>
<p>In effect, Stonebraker says that all of the benefits of the NoSQL systems have nothing to do with ditching the SQL language.  Of course, because the current breed of SQL is Turing complete, it is difficult to argue against SQL at the formal level. In theory, all Turing complete languages are interchangeable. You can do everything (bad and good) in SQL.</p>
<p>However, in practice, SQL is based on joins and related low-level issues like foreign keys. SQL entices people to <a href="http://en.wikipedia.org/wiki/Database_normalization">normalize their data</a>. Normalization fragments databases into smaller tables, which is great for data integrity and beneficial for some <a href="http://en.wikipedia.org/wiki/Database_transaction#Transactional_databases">transactional systems</a>. However, joins are expensive. Moreover, joins require strong consistency and fixed schemas.</p>
<p>In turn, avoiding join operations makes it possible to maintain flexible or informal schemas, and to <a href="http://en.wikipedia.org/wiki/Scalability#Scale_horizontally_.28scale_out.29">scale horizontally</a>. Thus, the NoSQL solutions should really be called NoJoin because they are mostly defined by avoidance of the <a href="http://en.wikipedia.org/wiki/Join_(SQL)">join operation</a>.</p>
<p>How do we compute joins? There are two main techniques:</p>
<ul>
<li>When dealing with large tables, you may prefer the <a href="http://en.wikipedia.org/wiki/Sort-merge_join">sort merge</a> algorithm. Because it requires sorting tables, it runs in <em>O</em>(<em>n</em> log <em>n</em>). (If your tables are already sorted in the correct order, sort merge is automatically the best choice.)</li>
<li>For in-memory tables, <a href="http://en.wikipedia.org/wiki/Hash_join">hash joins</a> are preferable because they run in linear time <em>O</em>(<em>n</em>). However, the characteristics of modern hardware are increasingly detrimental to the hash join alternative (see C. Kim, et al. <a href="http://www.vldb.org/pvldb/2/vldb09-257.pdf">Sort vs. Hash revisited</a>. 2009).</li>
</ul>
<p>(It is also possible to use <a href="http://en.wikipedia.org/wiki/Bitmap_index">bitmap indexes</a> to precompute joins.) In any case, short of precomputing the joins, joining large tables is expensive and requires source tables to be consistent.</p>
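<p>The two join techniques can be sketched in a few lines of Python. This is a toy illustration under my own assumptions: each table is a list of <code>(key, payload)</code> tuples held in memory, the join is an equi-join on the key, and the function names and sample data are hypothetical. Real database engines add partitioning, spilling to disk and much more.</p>

```python
from collections import defaultdict


def hash_join(left_rows, right_rows):
    """Equi-join on the key: build a hash table on one side, probe with the other."""
    buckets = defaultdict(list)
    for key, payload in right_rows:  # build phase, one pass over the right table
        buckets[key].append(payload)
    return [
        (key, left_payload, right_payload)
        for key, left_payload in left_rows  # probe phase, one pass over the left table
        for right_payload in buckets.get(key, [])
    ]


def sort_merge_join(left_rows, right_rows):
    """Equi-join on the key: sort both sides, then scan them in lockstep."""
    left = sorted(left_rows, key=lambda row: row[0])
    right = sorted(right_rows, key=lambda row: row[0])
    joined = []
    i = j = 0
    while len(left) > i and len(right) > j:
        left_key, right_key = left[i][0], right[j][0]
        if right_key > left_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            run_end = j
            while len(right) > run_end and right[run_end][0] == left_key:
                run_end += 1
            # Pair every left row having this key with every row in the run.
            while len(left) > i and left[i][0] == left_key:
                for _, right_payload in right[j:run_end]:
                    joined.append((left_key, left[i][1], right_payload))
                i += 1
            j = run_end
    return joined


customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(2, "book"), (1, "pen"), (2, "lamp"), (4, "ink")]
print(hash_join(customers, orders))
print(sort_merge_join(customers, orders))
```

<p>Both calls return the same three rows. The sorting step is what gives sort merge its <em>O</em>(<em>n</em> log <em>n</em>) cost; if the inputs arrive already sorted, only the linear scan remains.</p>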
<p><strong>Conclusion:</strong> SQL is a fine language, but it has some biases that may trap developers. What works well in a business transaction system may fail you in other instances.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/06/28/nosql-or-nojoin/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/06/28/nosql-or-nojoin/</feedburner:origLink></item>
		<item>
		<title>The fallacy of absolute numbers</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/nDVbo2teE7o/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/06/18/the-fallacy-of-absolute-numbers/#comments</comments>
		<pubDate>Fri, 18 Jun 2010 18:11:29 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2608</guid>
		<description>I often come across the following type of arguments in research papers: You could save 3 bits of storage for every value in your database. Surely that&amp;#8217;s irrelevant. Nobody cares about saving 3 bits! You can sort arrays in 10 ms. Surely, that cannot be improved upon? You are already down to 10 ms and [...]</description>
			<content:encoded><![CDATA[<p>I often come across the following type of arguments in research papers:</p>
<ul>
<li>You could save 3 bits of storage for every value in your database. Surely that&#8217;s irrelevant. Nobody cares about saving 3 bits!</li>
<li>You can sort arrays in 10 ms. Surely, that cannot be improved upon? You are already down to 10 ms and nobody cares about such small delays.</li>
</ul>
<p>I hope you can see what is wrong with these statements.</p>
<p>I call it the <strong>fallacy of absolute numbers:</strong> you express a measure or a gain as an absolute value, and then conclude that the result is optimal or nearly optimal because the number appears small (or large).</p>
<p><strong>Remember:</strong> Saving 3 bits of storage out of 6 bits is a 2:1 compression ratio. Sorting in 5 ms instead of 10 ms doubles the speed.</p>
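<p>The arithmetic is simple enough to check in a few lines of Python. The helper names are my own, and the 64-bit case is an extra example I added to show that the same absolute saving can be large or negligible depending on the baseline.</p>

```python
def compression_ratio(bits_before, bits_saved):
    """Ratio of the original size to the size after saving bits_saved bits."""
    return bits_before / (bits_before - bits_saved)


def speedup(time_before_ms, time_after_ms):
    """How many times faster the new running time is."""
    return time_before_ms / time_after_ms


print(compression_ratio(6, 3))             # 2.0, the 2:1 ratio from the text
print(speedup(10, 5))                      # 2.0, twice as fast
print(round(compression_ratio(64, 3), 3))  # 1.049, the same 3 bits barely matter
```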
<p><strong>Disclaimer:</strong> I am sure that someone else has documented this fallacy, but I could not find any reference to it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/06/18/the-fallacy-of-absolute-numbers/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/06/18/the-fallacy-of-absolute-numbers/</feedburner:origLink></item>
		<item>
		<title>Indexing XML</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/ltONA9z5S2g/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/06/16/indexing-xml/#comments</comments>
		<pubDate>Wed, 16 Jun 2010 14:02:21 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Data Warehousing and OLAP]]></category>
		<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2604</guid>
		<description>I&amp;#8217;d like to know a lot more about XML indexing—if only because I really ought to be teaching this topic. So I decided to write a blog post to expose what I know, hoping that some knowledgeable readers will fill me in on what I am missing. Mostly, I expect we are interested in indexing [...]</description>
			<content:encoded><![CDATA[<p>I&#8217;d like to know a lot more about XML indexing—if only because I really ought to be teaching this topic. So I decided to write a blog post to expose what I know, hoping that some knowledgeable readers will fill me in on what I am missing.</p>
<p>Mostly, I expect we are interested in indexing <a href="http://en.wikipedia.org/wiki/XPath">XPath</a> queries. Not only is XPath useful on its own, but it is also the basis for the <a href="http://en.wikipedia.org/wiki/FLWOR">FLWOR</a> expressions in <a href="http://en.wikipedia.org/wiki/XQuery">XQuery</a>.</p>
<p>A typical XPath expression will select only a small fraction of any XML document (such as the value of a particular attribute). Thus, a sensible strategy is to represent the XML documents as tables. There are several possible maps from XML documents to tables. One of the most common is ORDPATH.</p>
<p>In the ORDPATH model, the root node receives the identifier 1, the first node contained in the root node receives the identifier 1.1, the second one receives the identifier 1.2, and so on. Given the ORDPATH identifiers, we can easily determine whether two nodes are neighbors, or whether they have a child-parent relationship.</p>
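<p>Here is a minimal Python sketch of these relationship tests, using the simplified dotted identifiers described above (not the full ORDPATH encoding from the paper). The function names are hypothetical, and I interpret &#8220;neighbors&#8221; as siblings whose last components differ by one. Identifiers are compared as tuples of integers, so that 1.10 is not mistaken for a descendant of 1.1.</p>

```python
def parse(label):
    """Turn a dotted identifier such as '1.3.1' into a tuple of integers."""
    return tuple(int(part) for part in label.split("."))


def is_parent(parent, child):
    """True if child sits directly under parent."""
    return parse(child)[:-1] == parse(parent)


def is_ancestor(ancestor, descendant):
    """True if ancestor is a proper prefix of descendant."""
    a, d = parse(ancestor), parse(descendant)
    return len(d) > len(a) and d[:len(a)] == a


def are_neighbors(first, second):
    """True if both nodes share a parent and their positions differ by one."""
    x, y = parse(first), parse(second)
    return x[:-1] == y[:-1] and abs(x[-1] - y[-1]) == 1


print(is_parent("1.3", "1.3.1"))     # True
print(is_parent("1", "1.3.1"))       # False, 1 is a grandparent
print(is_ancestor("1", "1.3.1"))     # True
print(are_neighbors("1.2", "1.3"))   # True
print(is_ancestor("1.1", "1.10"))    # False, these are siblings
```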
<p>As an example, here&#8217;s an XML document and its (simplified) ORDPATH representation:</p>
<p><code><br />
&lt;liste temps="janvier" &gt;<br />
&lt;bateau /&gt;<br />
&lt;bateau &gt;<br />
&lt;canard /&gt;<br />
&lt;/bateau&gt;<br />
&lt;/liste&gt;<br />
</code></p>
<table border="1">
<tbody>
<tr>
<th>ORDPATH</th>
<th>name</th>
<th>type</th>
<th>value</th>
</tr>
<tr>
<td>1</td>
<td>liste</td>
<td>element</td>
<td>-</td>
</tr>
<tr>
<td>1.1</td>
<td>temps</td>
<td>attribute</td>
<td>janvier</td>
</tr>
<tr>
<td>1.2</td>
<td>bateau</td>
<td>element</td>
<td>-</td>
</tr>
<tr>
<td>1.3</td>
<td>bateau</td>
<td>element</td>
<td>-</td>
</tr>
<tr>
<td>1.3.1</td>
<td>canard</td>
<td>element</td>
<td>-</td>
</tr>
</tbody>
</table>
<p>Given a table, we can easily index it using standard indexes such as B-trees or hash tables. For example, if we index the value column, we can quickly process the XPath expression <code>@temps="janvier"</code>.</p>
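<p>The mapping and the index can be sketched with Python&#8217;s standard <code>xml.etree.ElementTree</code> module. This is an illustration under my own assumptions: the function <code>ordpath_rows</code> is hypothetical, attributes are numbered before child elements (as in the table above), text nodes are ignored, and a plain dictionary stands in for the hash index. The example document is built programmatically rather than parsed from a string.</p>

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def ordpath_rows(element, label="1"):
    """Yield (ordpath, name, type, value) rows for an element and its subtree."""
    yield (label, element.tag, "element", None)
    position = 0
    # Attributes come first, then child elements, sharing one counter.
    for name, value in element.attrib.items():
        position += 1
        yield (f"{label}.{position}", name, "attribute", value)
    for child in element:
        position += 1
        yield from ordpath_rows(child, f"{label}.{position}")


# Build the example document: a liste with two bateau children, one holding a canard.
liste = ET.Element("liste", {"temps": "janvier"})
ET.SubElement(liste, "bateau")
second_bateau = ET.SubElement(liste, "bateau")
ET.SubElement(second_bateau, "canard")

rows = list(ordpath_rows(liste))
for row in rows:
    print(row)

# A hash index from (attribute name, value) to the identifiers of matching rows.
index = defaultdict(list)
for ordpath, name, kind, value in rows:
    if kind == "attribute":
        index[(name, value)].append(ordpath)

print(index[("temps", "janvier")])  # ['1.1']
```

<p>The lookup returns the identifier of the attribute row; dropping its last component (1.1 becomes 1) gives the element that carries the attribute.</p>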
<p>Effectively, we can map XPath and XQuery queries into SQL. This leaves relatively little room for XML-specific indexes. I am certain that XML database designers have even smarter strategies, but do they work significantly better?</p>
<p><strong>Reference</strong>: P. O’Neil, et al. <a href="http://www.cs.umb.edu/~poneil/ordpath.pdf">ORDPATHs: insert-friendly XML node labels</a>. 2004.</p>
<p><strong>Further reading</strong>: <a href="http://www.daniel-lemire.com/blog/archives/2008/12/04/native-xml-databases-have-they-taken-the-world-over-yet/">Native XML databases: have they taken the world over yet?</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/06/16/indexing-xml/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/06/16/indexing-xml/</feedburner:origLink></item>
		<item>
		<title>Lack of steady trajectories and failure</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/_l6qi9ppnt4/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/06/14/lack-of-steady-trajectories-and-failure/#comments</comments>
		<pubDate>Mon, 14 Jun 2010 16:48:03 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2598</guid>
		<description>A common piece of advice given to young researchers is to find a niche (see Michael&amp;#8217;s Branding Your Research). That is certainly good advice. Instead of being another young researcher, you can be the new guy working on topic X. But this seems to happen no matter what: most Ph.D. theses address a narrow topic. I [...]</description>
			<content:encoded><![CDATA[<p>A common piece of advice given to young researchers is to find a niche (see Michael&#8217;s <a href="http://mybiasedcoin.blogspot.com/2010/06/branding-your-research-and-yourself.html">Branding Your Research</a>). That is certainly good advice. Instead of being another young researcher, you can be the new guy working on topic X. But this seems to happen no matter what: most Ph.D. theses address a narrow topic. I believe that the real advice people would like to give is: find yourself a nice topic, and make sure this topic becomes <strong>fashionable</strong>. Of course, this implies that you can somehow predict the future, or have a thesis supervisor with enough clout to either initiate new trends or have inside knowledge of upcoming trends.</p>
<p>A more interesting question is what you should do with the rest of your career, assuming you landed a research job, somehow. Should you find yourself one or two niche topics and stay there for the rest of your life? That is a common strategy. You save precious time: instead of having to skim 100 research articles a year, you may get by with 20 or 30 research articles, or even fewer. Moreover, because you are the leading authority on one or two topics, you can never be caught unaware. You never have to worry about finding new topics: you just keep on iteratively improving whatever you are doing right now. With some luck, you can reuse your funding proposals year after year. Finally, you can quickly get to know everyone who matters regarding these narrow topics. And that is a perfectly good strategy.</p>
<p>The problems begin when we associate <strong>the lack of a steady trajectory with failure</strong>. <strong>Encouraging static research topics leads to conservatism.</strong> Meanwhile, some of the most innovative researchers have cultivated varied interests. Von Neumann was a set theorist, but <a href="http://stepanov.lk.net/mnemo/legende.html">he wrote 20 papers in Physics</a>, and even in Mathematics, he covered a wide range of topics (set theory, logic, topological groups, measure theory, ergodic theory, operator theory, and continuous geometry). Would we have been better off had von Neumann remained a pure set theorist?</p>
<p>And I tend to have more trust in researchers who have their eggs in different baskets. They can afford to be a bit more critical.</p>
<p><strong>Warning:</strong> I am not urging Ph.D. students to change topic repeatedly while writing up their thesis. Finish whatever you start. And be aware that approaching a new research topic can be costly.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/06/14/lack-of-steady-trajectories-and-failure/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/06/14/lack-of-steady-trajectories-and-failure/</feedburner:origLink></item>
		<item>
		<title>Academic publishing is archaic</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/5hLG2IFp0mI/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/06/10/academic-publishing-is-archaic/#comments</comments>
		<pubDate>Thu, 10 Jun 2010 13:57:35 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2593</guid>
		<description>Technological progress tends to increase the available information. Thus, our capacity to manage this information becomes overloaded (hence the term information overload). As Clay Shirky explained: it is not so much an information overload, as a filter failure. The abundance of information is never a problem. The real problem is the lack of efficient strategies [...]</description>
			<content:encoded><![CDATA[<p>Technological progress tends to increase the available information. Thus, our capacity to manage this information becomes overloaded (hence the term <a href="http://en.wikipedia.org/wiki/Information_overload">information overload</a>). As Clay Shirky <a href="http://web2expo.blip.tv/file/1277460/">explained</a>: it is not so much an information overload, as a filter failure. The abundance of information is never a problem. The real problem is the lack of efficient strategies to index, summarize, filter, cross-reference and archive information.</p>
<p>But information overload is nothing new. In <a href="http://muse.jhu.edu/journals/journal_of_the_history_of_ideas/v064/64.1blair.html">Reading Strategies for Coping With Information Overload ca. 1550-1700</a>, Blair surveys the techniques our ancestors invented to cope with the abundance of books:</p>
<ul>
<li>the alphabetical index;</li>
<li>the reference book;</li>
<li>copy and paste (with actual scissors) to save time in note-taking.</li>
</ul>
<p>What I find fascinating is the historical perspective: while still useful, the alphabetical index is hardly exciting anymore. It has been supplanted by full text search (in e-books). There are still reference books (such as dictionaries), but they are being replaced with online tools. Information overload continues to generate many inventions: the search engine (such as Google), the recommender system (as on Amazon.com), and the social networks (such as Twitter). Literally, these tools expand our minds. We become smarter.</p>
<p>Yet, every time I finish writing a research article, I am amazed at how old-fashioned the format is.</p>
<ul>
<li>Research journals still ask for silly metadata such as keywords, even though most researchers rely on full text search.</li>
<li>The format is clearly meant for paper, even though most of my collaborators browse research articles on their computers.</li>
<li>We have silly things like page limitations.</li>
<li>It is excessively difficult to correct or improve a &#8220;published&#8221; article.</li>
</ul>
<p>There is hope. The <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0010663">PLoS One journal</a> presents research articles in an innovative format. The article is interactive: anyone can rate it and comment on it. Many journals allow the authors to upload supplementary material. Yet, I predict that in 20 years, we will look back and think that academic publishing in 2010 was archaic. (I admit that it is not a daring prediction.) There is much room for innovation.</p>
<p><strong>Source:</strong> <a href="http://erikduval.wordpress.com/about/">Erik Duval</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/06/10/academic-publishing-is-archaic/feed/</wfw:commentRss>
		<slash:comments>21</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/06/10/academic-publishing-is-archaic/</feedburner:origLink></item>
		<item>
		<title>Maximizing your impact as a researcher (guest post)</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/hAR2_9eDqes/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/06/04/maximizing-your-impact-as-a-researcher-guest-post/#comments</comments>
		<pubDate>Fri, 04 Jun 2010 16:46:30 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2584</guid>
		<description>The greatest challenge for a researcher is to choose projects that have a good chance of delivering impact. Alain Désilets from NRC—co-author of VoiceGrip, Webitext and the Cross Lingual Wiki Engine—shared his strategies with me: Look at how many workdays per week you can dedicate to research and make that be the number of projects you [...]</description>
			<content:encoded><![CDATA[<p><img style="float: right; margin: 3px; width: 100px;" src="http://wiki-translation.com/img/wiki_up/Alain_desilets_standing.jpg" alt="Alain Désilets" />The greatest challenge for a researcher is to choose projects that have a good chance of delivering impact. <a href="http://wiki-translation.com/tiki-index.php?page=Alain+D%C3%A9silets#Blog">Alain Désilets</a> from <a href="http://www.nrc-cnrc.gc.ca/index.html">NRC</a> (co-author of <a href="http://voicecode.iit.nrc.ca/VoiceCode/uploads/aboutVG.html">VoiceGrip</a>, <a href="http://www.webitext.com/bin/webitext.cgi">Webitext</a> and the <a href="http://wiki-translation.com/tiki-index.php?page=Cross+Lingual+Wiki+Engine+Project&amp;bl">Cross Lingual Wiki Engine</a>) shared his strategies with me:</p>
<ul>
<li>Look at how many workdays per week you can dedicate to research and make that be the number of projects you can work on in parallel. In other words, if you are one of the lucky few who can dedicate 5 days per week doing research, then you have room for 5 projects.</li>
<li>Invest your energy proportionally to the amount of positive feedback you receive for each project. This includes collaboration offers, grants, potential users, and so on.</li>
<li>Never work alone on a project for too long. It&#8217;s OK to start exploring a compelling idea on your own for a couple of months, but if you can&#8217;t convince someone else to work with you on it, maybe it&#8217;s not such a great idea after all. Maybe it&#8217;s technically infeasible, maybe there is no need or market for it, or maybe it&#8217;s just too far ahead of its time. Don&#8217;t completely give up on the idea yet. Put it on ice for now and keep sharing that idea with people until you meet the right people to make it happen with you.</li>
<li>Instead of looking for partnership money which will require you to spend months drafting and revising agreements (who wants to deal with lawyers anyway), look for talented people who have control over their own time, and are willing to invest some of that precious resource working with you on an idea. Don&#8217;t worry about who will own the baby before it&#8217;s actually born (that usually ensures that the work relationship will never get off the ground). Just make sure everyone keeps a lab book documenting who did what so that you will have a basis to argue in a friendly and civilised manner about who owns what share of the baby, if you ever have that nice problem.</li>
<li>Talk to lots of different people from different walks of life about your idea. You never know who will give you the insight or contact you need to advance to the next level on a given project. Of course if you do this, you pretty much give up on the idea of patenting your idea.</li>
<li>Make sure you collocate in time and space as much as you can with your collaborators. There was a time when I had 5 projects (those were the happy days of 5 days of research per week), and I had scheduled things so that on Mondays I would work with Joe on project X, Tuesdays were dedicated to working on project Y with Jane, and so on.</li>
<li>Find an organisation or a type of end user with an interesting problem that you think you could solve using some bleeding edge technology. Become very intimate with the problem, maybe even pretending to do these people&#8217;s job for a day. Once you understand their problem well, don&#8217;t jump right away to the hi-tech solution. Instead, start with the Simplest Thing That Could Possibly Work, and only add complex technology where and when it is needed. This may not get you a publication in a first tier journal, but it greatly increases your odds of developing a system that will actually be used. Plus, when you DO find that you need sophisticated technology, you know exactly why, and what the actual value added is.</li>
<li>Use Agile Development practices which allow you to advance your projects in short, highly focused bursts of a few days (1-day bursts are even possible). Write lots of short &#8220;stories&#8221; that describe things you can accomplish in a day or less, and keep re-prioritizing them so that the ones that currently add the most value to your target users are always at the top. Use Test Driven Development to ensure that your system is always stable and that you can put it aside for a few days or months, yet pick up right where you left off. These kinds of techniques are essential if you want to be able to quickly reallocate your effort depending on how hot your different projects are.</li>
</ul>
<p><strong>Disclaimer</strong>: this post does not necessarily reflect the views of his employer.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/06/04/maximizing-your-impact-as-a-researcher-guest-post/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/06/04/maximizing-your-impact-as-a-researcher-guest-post/</feedburner:origLink></item>
		<item>
		<title>How do we choose research journals?</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/TkqCS4L9Dtk/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/06/03/how-do-we-choose-research-journals/#comments</comments>
		<pubDate>Thu, 03 Jun 2010 19:10:53 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2575</guid>
		<description>The publishing house Elsevier invited me to fill out a survey regarding their journals. As a reward, they gave me a glimpse of their statistics. The three most important considerations when choosing a research journal are (in order): Speed of review process Standard of reviews Overall reputation of the journal And the activity researchers [...]</description>
			<content:encoded><![CDATA[<p>The publishing house Elsevier invited me to fill out a survey regarding their journals. As a reward, they gave me a glimpse of their statistics.</p>
<p>The three most important considerations when choosing a research journal are (in order):</p>
<ol>
<li>Speed of review process</li>
<li>Standard of reviews</li>
<li>Overall reputation of the journal</li>
</ol>
<p>And the activity researchers complained the most about? Peer reviewing manuscripts.</p>
<p>In any case, if you want to build a good journal and attract great papers, make sure you have fast and competent peer review. (Duh!) Meanwhile, having a good printer or a good editorial board is much less important.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/06/03/how-do-we-choose-research-journals/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/06/03/how-do-we-choose-research-journals/</feedburner:origLink></item>
		<item>
		<title>Computer Science is shallow</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/XTXHBxcF-R4/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/05/31/computer-science-is-shallow/#comments</comments>
		<pubDate>Tue, 01 Jun 2010 02:32:21 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2567</guid>
		<description>Zed A. Shaw—author of several books on Ruby and Python—came up with an interesting criticism of Computer Science. He makes some good points: Computer Science is a pointless discipline with no culture. (&amp;#8230;) They rarely teach deep philosophy and instead would rather either teach you what some business down the street wants, or teach you [...]</description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Zed_Shaw">Zed A. Shaw</a>, author of several books on <a href="http://www.amazon.com/s?_encoding=UTF8&amp;search-alias=books&amp;field-author=Zed%20Shaw">Ruby</a> and <a href="http://learnpythonthehardway.org/index">Python</a>, came up with an interesting <a href="http://sheddingbikes.com/posts/1275258018.html">criticism</a> of Computer Science. He makes some good points:</p>
<blockquote><p>Computer Science is a pointless discipline with no culture. (&#8230;) They rarely teach deep philosophy and instead would rather either teach you what some business down the street wants, or teach you their favorite pet language like LISP. (&#8230;) Another way to explain the shallowness of Computer Science is that it&#8217;s the only discipline that eschews paradox. Even mathematics has reams of unanswered questions and potential paradox in its core philosophy. (&#8230;) There&#8217;s an envelope of knowledge so vast in most other disciplines that just when you think you&#8217;ve learned it all you find something else you never knew. This is what makes them interesting.</p></blockquote>
<p>Oh! I think there are many deep and exciting questions in Computer Science. (And not just whether <a href="http://en.wikipedia.org/wiki/P_versus_NP_problem">P is equal to NP</a>.) And do Sociology, Economics and History have more depth? But I agree that Computer Science is too often <a href="http://en.wikipedia.org/wiki/Utilitarianism">utilitarian</a>. Some like to pretend that by catering to the perceived needs of industry, graduates will get better jobs. Unfortunately, too often, the students have to unlearn their so-called &#8220;practical knowledge&#8221; once they leave the campus. The honest truth: <strong>you don&#8217;t need three or four years of college to do great in the software industry</strong>.</p>
<p>Maybe more time should be spent on the deep questions. Here are a few discussion points that come to mind:</p>
<ul>
<li>What is &#8220;meaning&#8221; and how can computation capture or codify it? What does it say about our brain? Is our brain a Turing machine?</li>
<li>Why are some programmers ten times more productive than others?</li>
<li>Can computers extend our intelligence? How intelligent can we become?</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/05/31/computer-science-is-shallow/feed/</wfw:commentRss>
		<slash:comments>27</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/05/31/computer-science-is-shallow/</feedburner:origLink></item>
		<item>
		<title>Sorting is fast and useful</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/qePHnRaObAI/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/05/20/sorting-is-fast-and-useful/#comments</comments>
		<pubDate>Fri, 21 May 2010 02:15:41 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Data Warehousing and OLAP]]></category>
		<category><![CDATA[Science and Technology]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2557</guid>
		<description>I like to sort things. If you should learn one thing about Computer Science, it is that sorting is fast and useful. Here&amp;#8217;s a little example. You want to check quickly whether an integer belongs to a set. Maybe you want to determine whether a userID is valid. The solutions: Use a hash table. Java programmers use [...]</description>
			<content:encoded><![CDATA[<p>I like to <a href="http://arxiv.org/abs/0901.3751">sort things</a>. If you should learn one thing about Computer Science, it is that <strong>sorting is fast and useful</strong>.</p>
<p>Here&#8217;s a little example. You want to check <strong>quickly</strong> whether an integer belongs to a set. Maybe you want to determine whether a userID is valid. The solutions:</p>
<ul>
<li>Use a <a href="http://en.wikipedia.org/wiki/Hash_table">hash table</a>. Java programmers use the <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/HashSet.html">HashSet</a> class.</li>
<li>Use a tree structure such as a <a href="http://en.wikipedia.org/wiki/Red-black_tree">red-black tree</a> or a <a href="http://en.wikipedia.org/wiki/B-tree">B-tree</a>. Java programmers use the <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/TreeSet.html">TreeSet</a> class.</li>
<li>If your set of integers changes rarely, you can sort it, and then try to locate integers using <a href="http://en.wikipedia.org/wiki/Binary_search">binary search</a>.</li>
</ul>
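<p>For illustration, here is a small Python sketch of the hash-table and sorted-array options. The benchmark itself is in Java; the tree option is left out here because the Python standard library has no balanced-tree set. The userID values are made up, and the function name is hypothetical.</p>
<pre>
```python
from array import array
from bisect import bisect_left

# Hypothetical userIDs; the duplicate 7 disappears when the set is built.
valid_ids = [42, 7, 1999, 314, 7, 100]

# Hash-table option.
hash_set = set(valid_ids)

# Sorted-array option: a compact array of 64-bit integers in ascending order.
sorted_ids = array("q", sorted(hash_set))


def contains_sorted(sorted_values, target):
    """Return True when target occurs in the ascending sequence (binary search)."""
    position = bisect_left(sorted_values, target)
    return position != len(sorted_values) and sorted_values[position] == target


for candidate in (314, 315):
    print(candidate, candidate in hash_set, contains_sorted(sorted_ids, candidate))
```
</pre>
<p>The sorted array has to be rebuilt or shifted whenever the set changes, which is why this option only makes sense when updates are rare.</p>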
<p>I <a href="http://pastebin.com/Lmcu9KBw">wrote a Java benchmark</a> to compare the three solutions:</p>
<p style="text-align: center;"><img src="http://lh4.ggpht.com/__I-3q9m-Gqo/S_XjF3C_loI/AAAAAAAABso/d1Mqv1jxjZw/s800/Screen%20shot%202010-05-20%20at%209.33.57%20PM.png" alt="" /></p>
<p>In this benchmark, binary search over a sorted array is <strong>only 10% slower</strong> than the HashSet. Yet, <strong>the sorted array uses half the memory</strong>. Hence, using a sorted array is the clear winner for this problem.</p>
<p>If you think that&#8217;s a little bit silly, consider that <a href="http://en.wikipedia.org/wiki/Column-oriented_DBMS">column-oriented</a> DBMSes like <a href="http://en.wikipedia.org/wiki/Vertica">Vertica</a> use binary search over sorted columns as an indexing technique.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/05/20/sorting-is-fast-and-useful/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/05/20/sorting-is-fast-and-useful/</feedburner:origLink></item>
		<item>
		<title>Chinese researchers publish more research papers</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/8Lg-qYKGzMA/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/05/11/chinese-publish-more-research-papers-than-americans/#comments</comments>
		<pubDate>Tue, 11 May 2010 15:16:32 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2549</guid>
		<description>Funding agencies in Canada seek to emulate American funding agencies by promoting excellence. What this means in concrete terms is that a few professors get most of the resources whereas the bulk of University professors are left with a pittance or nothing. The intuition behind this more competitive approach is that we must catch up with [...]</description>
			<content:encoded><![CDATA[<p>Funding agencies in Canada seek to emulate American funding agencies by promoting <em>excellence</em>. What this means in concrete terms is that a few professors get most of the resources whereas the bulk of University professors are left with a pittance or nothing. The intuition behind this more competitive approach is that we must catch up with American efficiency. We must reward the most productive researchers and stop wasting money on the unproductive ones. (Disclaimer: I am happy with the research grants I have received so far. Luckily, I have been judged to be productive&#8230;)</p>
<p>But how is the American system holding up against the competition? I looked at the countries publishing the most research papers in Computer Science, in 1998 and then in 2008.</p>
<p>1998:</p>
<ol>
<li>USA (14,294 papers)</li>
<li>Japan (2,941 papers)</li>
<li>United Kingdom (2,706 papers)</li>
</ol>
<p>2008:</p>
<ol>
<li>USA (15,744 papers)</li>
<li>China (14,680 papers)</li>
<li>United Kingdom (5,703 papers)</li>
</ol>
<p>It appears that whereas most countries have at least doubled their production of research papers, <strong>the USA has nearly stood still</strong> (about 10% growth, from 14,294 to 15,744 papers). Because these numbers are for 2008, I conjecture that right now, in 2010, <strong>Chinese researchers already publish more than their American counterparts</strong>. Of course, American authors are more cited, but the gap between China and the USA is closing in this respect as well. Interestingly, <strong>Americans also appear to be losing their edge compared to the United Kingdom, France, Germany and Canada</strong>.</p>
<p>While I do not have enough evidence to draw a firm conclusion, I conjecture that an all-or-nothing approach, so common in the USA, may not be so efficient after all. By leaving most University professors behind, you are wasting precious resources. And I fear that by emulating this model, Canada might be losing out too.</p>
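<p>The growth figures can be checked with a few lines of Python. The sketch only uses the counts listed above for the two countries that appear in both rankings; the helper name is hypothetical.</p>
<pre>
```python
# Paper counts copied from the 1998 and 2008 rankings above.
papers = {
    "USA": {1998: 14294, 2008: 15744},
    "United Kingdom": {1998: 2706, 2008: 5703},
}


def growth_ratio(counts):
    """Ratio of the 2008 paper count to the 1998 paper count."""
    return counts[2008] / counts[1998]


for country, counts in papers.items():
    ratio = growth_ratio(counts)
    print(f"{country}: x{ratio:.2f} ({(ratio - 1) * 100:.0f}% growth)")
# USA: x1.10 (10% growth)
# United Kingdom: x2.11 (111% growth)
```
</pre>
<p>Over the same decade, the United Kingdom roughly doubled its output while the USA grew by about a tenth.</p>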
<p><strong>Source:</strong> <a href="http://www.scimagojr.com/countryrank.php?area=1700&amp;category=0&amp;region=all&amp;year=1998&amp;order=it&amp;min=1000&amp;min_type=it">SJR</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/05/11/chinese-publish-more-research-papers-than-americans/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/05/11/chinese-publish-more-research-papers-than-americans/</feedburner:origLink></item>
		<item>
		<title>Acceptance rate versus impact</title>
		<link>http://feedproxy.google.com/~r/daniel-lemire/atom/~3/TxSkG4YDB_8/</link>
		<comments>http://www.daniel-lemire.com/blog/archives/2010/05/10/acceptance-rate-versus-impact/#comments</comments>
		<pubDate>Mon, 10 May 2010 18:18:14 +0000</pubDate>
		<dc:creator>Daniel Lemire</dc:creator>
				<category><![CDATA[Academia/Research]]></category>

		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2546</guid>
		<description>Should you attend the most selective school? Maybe not: Students who attended more selective colleges do not earn more than other students who were accepted and rejected by comparable schools but attended less selective colleges. (Dale and Krueger, Estimating the payoff to attending a more selective college, 1999). Should you present papers in the conference [...]</description>
			<content:encoded><![CDATA[<p>Should you attend the most selective school? Maybe not:</p>
<blockquote><p>Students who attended more selective colleges do not earn more than other students who were accepted and rejected by comparable schools but attended less selective colleges. (Dale and Krueger, <a href="http://ideas.repec.org/p/nbr/nberwo/7322.html">Estimating the payoff to attending a more selective college</a>, 1999).</p></blockquote>
<p>Should you present papers in the conference with the lowest acceptance rate? Looking at this plot, there seems to be little correlation between acceptance rate and <a href="http://en.wikipedia.org/wiki/Impact_factor">impact factor</a>:</p>
<p><img src="http://www.leduotang.com/sylvain/sites/default/files/IFvsAR.jpg" alt="acceptance rate versus impact factor" /></p>
<p>(Source: <a href="http://www.leduotang.com/sylvain/node/56">Sylvain Hallé&#8217;s blog</a>.)</p>
<p><strong>Conclusion:</strong> The best schools or the best conferences may not be those with low acceptance rates.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.daniel-lemire.com/blog/archives/2010/05/10/acceptance-rate-versus-impact/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		<feedburner:origLink>http://www.daniel-lemire.com/blog/archives/2010/05/10/acceptance-rate-versus-impact/</feedburner:origLink></item>
	</channel>
</rss>
