<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DDI</title>
	<atom:link href="https://michaelnielsen.org/ddi/feed/" rel="self" type="application/rss+xml" />
	<link>https://michaelnielsen.org/ddi</link>
	<description>Data-driven intelligence</description>
	<lastBuildDate>Wed, 29 Oct 2014 00:28:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>How the backpropagation algorithm works</title>
		<link>https://michaelnielsen.org/ddi/how-the-backpropagation-algorithm-works/</link>
		
		<dc:creator><![CDATA[Michael Nielsen]]></dc:creator>
		<pubDate>Mon, 14 Apr 2014 19:22:03 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://michaelnielsen.org/ddi/?p=96</guid>

					<description><![CDATA[Chapter 2 of my free online book about &#8220;Neural Networks and Deep Learning&#8221; is now available. The chapter is an in-depth explanation of the backpropagation algorithm. Backpropagation is the workhorse of learning in neural networks, and a key component in modern deep learning systems. Enjoy!]]></description>
										<content:encoded><![CDATA[<p><a href="http://neuralnetworksanddeeplearning.com/chap2.html">Chapter 2</a> of my free online book about <a href="http://neuralnetworksanddeeplearning.com">&#8220;Neural Networks and Deep Learning&#8221;</a> is now available.  The chapter is an in-depth explanation of the backpropagation algorithm.  Backpropagation is the workhorse of learning in neural networks, and a key component in modern deep learning systems.  Enjoy!</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Reinventing Explanation</title>
		<link>https://michaelnielsen.org/ddi/reinventing-explanation/</link>
					<comments>https://michaelnielsen.org/ddi/reinventing-explanation/#comments</comments>
		
		<dc:creator><![CDATA[Michael Nielsen]]></dc:creator>
		<pubDate>Fri, 31 Jan 2014 18:55:08 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://michaelnielsen.org/ddi/?p=94</guid>

					<description><![CDATA[My new essay on the use of digital media to explain scientific ideas is here.]]></description>
										<content:encoded><![CDATA[<p>My new essay on the use of digital media to explain scientific ideas is <a href="https://michaelnielsen.org/reinventing_explanation/index.html">here</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://michaelnielsen.org/ddi/reinventing-explanation/feed/</wfw:commentRss>
			<slash:comments>3</slash:comments>
		
		
			</item>
		<item>
		<title>How the Bitcoin protocol actually works</title>
		<link>https://michaelnielsen.org/ddi/how-the-bitcoin-protocol-actually-works/</link>
					<comments>https://michaelnielsen.org/ddi/how-the-bitcoin-protocol-actually-works/#comments</comments>
		
		<dc:creator><![CDATA[Michael Nielsen]]></dc:creator>
		<pubDate>Fri, 06 Dec 2013 18:37:10 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://michaelnielsen.org/ddi/?p=84</guid>

					<description><![CDATA[Many thousands of articles have been written purporting to explain Bitcoin, the online, peer-to-peer currency. Most of those articles give a hand-wavy account of the underlying cryptographic protocol, omitting many details. Even those articles which delve deeper often gloss over crucial points. My aim in this post is to explain the major ideas behind the&#8230; <a class="more-link" href="https://michaelnielsen.org/ddi/how-the-bitcoin-protocol-actually-works/">Continue reading <span class="screen-reader-text">How the Bitcoin protocol actually works</span></a>]]></description>
										<content:encoded><![CDATA[<p>Many thousands of articles have been written purporting to explain Bitcoin, the online, peer-to-peer currency.  Most of those articles give a hand-wavy account of the underlying cryptographic protocol, omitting many details.  Even those articles which delve deeper often gloss over crucial points.  My aim in this post is to explain the major ideas behind the Bitcoin protocol in a clear, easily comprehensible way.  We&#8217;ll start from first principles, build up to a broad theoretical understanding of how the protocol works, and then dig down into the nitty-gritty, examining the raw data in a Bitcoin transaction.</p>
<p>Understanding the protocol in this detailed way is hard work.  It is tempting instead to take Bitcoin as given, and to engage in speculation about how to get rich with Bitcoin, whether Bitcoin is a bubble, whether Bitcoin might one day mean the end of taxation, and so on.  That&#8217;s fun, but severely limits your understanding. Understanding the details of the Bitcoin protocol opens up otherwise inaccessible vistas.  In particular, it&#8217;s the basis for understanding Bitcoin&#8217;s built-in scripting language, which makes it possible to use Bitcoin to create new types of financial instruments, such as <a href="http://szabo.best.vwh.net/formalize.html">smart contracts</a>. New financial instruments can, in turn, be used to create new markets and to enable new forms of collective human behaviour.  Talk about fun!</p>
<p>I&#8217;ll describe Bitcoin scripting and concepts such as smart contracts in future posts. This post concentrates on explaining the nuts-and-bolts of the Bitcoin protocol.  To understand the post, you need to be comfortable with <a href="http://en.wikipedia.org/wiki/Public-key_cryptography">public key cryptography</a>, and with the closely related idea of <a href="https://en.wikipedia.org/wiki/Digital_signature">digital signatures</a>.  I&#8217;ll also assume you&#8217;re familiar with <a href="https://en.wikipedia.org/wiki/Cryptographic_hash_function">cryptographic hashing</a>.  None of this is especially difficult.  The basic ideas can be taught in freshman university mathematics or computer science classes.  The ideas are beautiful, so if you&#8217;re not familiar with them, I recommend taking a few hours to get familiar.</p>
<p>It may seem surprising that Bitcoin&#8217;s basis is cryptography.  Isn&#8217;t Bitcoin a currency, not a way of sending secret messages?  In fact, the problems Bitcoin needs to solve are largely about securing transactions &#8212; making sure people can&#8217;t steal from one another, or impersonate one another, and so on.  In the world of atoms we achieve security with devices such as locks, safes, signatures, and bank vaults.  In the world of bits we achieve this kind of security with cryptography.  And that&#8217;s why Bitcoin is at heart a cryptographic protocol.</p>
<p>My strategy in the post is to build Bitcoin up in stages.  I&#8217;ll begin by explaining a very simple digital currency, based on ideas that are almost obvious.  We&#8217;ll call that currency <em>Infocoin</em>, to distinguish it from Bitcoin.  Of course, our first version of Infocoin will have many deficiencies, and so we&#8217;ll go through several iterations of Infocoin, with each iteration introducing just one or two simple new ideas. After several such iterations, we&#8217;ll arrive at the full Bitcoin protocol.  We will have reinvented Bitcoin!</p>
<p>This strategy is slower than if I explained the entire Bitcoin protocol in one shot.  But while you can understand the mechanics of Bitcoin through such a one-shot explanation, it would be difficult to understand <em>why</em> Bitcoin is designed the way it is.  The advantage of the slower iterative explanation is that it gives us a much sharper understanding of each element of Bitcoin.</p>
<p>Finally, I should mention that I&#8217;m a relative newcomer to Bitcoin. I&#8217;ve been following it loosely since 2011 (and cryptocurrencies since the late 1990s), but only got seriously into the details of the Bitcoin protocol earlier this year.  So I&#8217;d certainly appreciate corrections of any misapprehensions on my part.  Also in the post I&#8217;ve included a number of &#8220;problems for the author&#8221; &#8211; notes to myself about questions that came up during the writing.  You may find these interesting, but you can also skip them entirely without losing track of the main text.</p>
<h3>First steps: a signed letter of intent</h3>
<p>So how can we design a digital currency?  </p>
<p>On the face of it, a digital currency sounds impossible.  Suppose some person &#8211; let&#8217;s call her Alice &#8211; has some digital money which she wants to spend.  If Alice can use a string of bits as money, how can we prevent her from using the same bit string over and over, thus minting an infinite supply of money?  Or, if we can somehow solve that problem, how can we prevent someone else forging such a string of bits, and using that to steal from Alice?</p>
<p>These are just two of the many problems that must be overcome in order to use information as money.</p>
<p>As a first version of Infocoin, let&#8217;s find a way that Alice can use a string of bits as a (very primitive and incomplete) form of money, in a way that gives her at least some protection against forgery. Suppose Alice wants to give another person, Bob, an infocoin.  To do this, Alice writes down the message &#8220;I, Alice, am giving Bob one infocoin&#8221;.  She then digitally signs the message using a private cryptographic key, and announces the signed string of bits to the entire world.</p>
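<p>To make the sign-and-verify step concrete, here is a toy sketch in Python.  It uses deliberately tiny &#8220;textbook RSA&#8221; numbers, chosen purely for illustration and nothing like what a real system would use; real signing is done with a vetted cryptography library and far larger keys:</p>

```python
import hashlib

# Toy "textbook RSA" parameters -- illustrative only, far too small to be secure.
# n = p * q; e is the public exponent, d the matching private exponent.
p, q = 61, 53
n = p * q          # 3233
e = 17
d = 2753           # satisfies (e * d) % ((p - 1) * (q - 1)) == 1

def toy_sign(message, d, n):
    # Hash the message, reduce it mod n, and "sign" with the private key d.
    digest = int.from_bytes(hashlib.sha256(message.encode()).digest(), "big") % n
    return pow(digest, d, n)

def toy_verify(message, signature, e, n):
    # Anyone holding the public key (e, n) can check the signature.
    digest = int.from_bytes(hashlib.sha256(message.encode()).digest(), "big") % n
    return pow(signature, e, n) == digest

msg = "I, Alice, am giving Bob one infocoin"
sig = toy_sign(msg, d, n)
print(toy_verify(msg, sig, e, n))   # True
# A tampered message is (almost certainly, with these tiny toy numbers) rejected:
print(toy_verify("I, Alice, am giving Eve one infocoin", sig, e, n))
```

<p>The point of the sketch is the asymmetry: verification needs only the public key (e, n), so anyone can confirm that Alice signed the message, but only Alice, who holds d, could have produced the signature in the first place.</p>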
<p>(By the way, I&#8217;m using capitalized &#8220;Infocoin&#8221; to refer to the protocol and general concept, and lowercase &#8220;infocoin&#8221; to refer to specific denominations of the currency.  A similar usage is common, though not universal, in the Bitcoin world.)</p>
<p>This isn&#8217;t terribly impressive as a prototype digital currency!  But it does have some virtues.  Anyone in the world (including Bob) can use Alice&#8217;s public key to verify that Alice really was the person who signed the message &#8220;I, Alice, am giving Bob one infocoin&#8221;.  No-one else could have created that bit string, and so Alice can&#8217;t turn around and say &#8220;No, I didn&#8217;t mean to give Bob an infocoin&#8221;.  So the protocol establishes that Alice truly intends to give Bob one infocoin.  The same fact &#8211; no-one else could compose such a signed message &#8211; also gives Alice some limited protection from forgery.  Of course, <em>after</em> Alice has published her message it&#8217;s possible for other people to duplicate the message, so in that sense forgery is possible.  But it&#8217;s not possible from scratch.  These two properties &#8211; establishment of intent on Alice&#8217;s part, and the limited protection from forgery &#8211; are genuinely notable features of this protocol.</p>
<p>I haven&#8217;t (quite) said exactly what digital money <em>is</em> in this protocol.  To make this explicit: it&#8217;s just the message itself, i.e., the string of bits representing the digitally signed message &#8220;I, Alice, am giving Bob one infocoin&#8221;.  Later protocols will be similar, in that all our forms of digital money will be just more and more elaborate messages [1].</p>
<h3>Using serial numbers to make coins uniquely identifiable</h3>
<p>A problem with the first version of Infocoin is that Alice could keep sending Bob the same signed message over and over.  Suppose Bob receives ten copies of the signed message &#8220;I, Alice, am giving Bob one infocoin&#8221;.  Does that mean Alice sent Bob ten <em>different</em> infocoins?  Was her message accidentally duplicated?  Perhaps she was trying to trick Bob into believing that she had given him ten different infocoins, when the message only proves to the world that she intends to transfer one infocoin.</p>
<p>What we&#8217;d like is a way of making infocoins unique.  They need a label or serial number.  Alice would sign the message &#8220;I, Alice, am giving Bob one infocoin, with serial number 8740348&#8221;.  Then, later, Alice could sign the message &#8220;I, Alice, am giving Bob one infocoin, with serial number 8770431&#8221;, and Bob (and everyone else) would know that a different infocoin was being transferred.</p>
<p>To make this scheme work we need a trusted source of serial numbers for the infocoins.  One way to create such a source is to introduce a <em>bank</em>.  This bank would provide serial numbers for infocoins, keep track of who has which infocoins, and verify that transactions really are legitimate.</p>
<p>In more detail, let&#8217;s suppose Alice goes into the bank, and says &#8220;I want to withdraw one infocoin from my account&#8221;.  The bank reduces her account balance by one infocoin, and assigns her a new, never-before used serial number, let&#8217;s say 1234567.  Then, when Alice wants to transfer her infocoin to Bob, she signs the message &#8220;I, Alice, am giving Bob one infocoin, with serial number 1234567&#8221;.  But Bob doesn&#8217;t just accept the infocoin.  Instead, he contacts the bank, and verifies that: (a) the infocoin with that serial number belongs to Alice; and (b) Alice hasn&#8217;t already spent the infocoin.  If both those things are true, then Bob tells the bank he wants to accept the infocoin, and the bank updates their records to show that the infocoin with that serial number is now in Bob&#8217;s possession, and no longer belongs to Alice.</p>
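<p>The bank&#8217;s bookkeeping can be sketched in a few lines of Python.  The class and method names here are invented for illustration, and signature checking is omitted for brevity; only the serial-number logic from the paragraph above is shown:</p>

```python
class ToyBank:
    """Illustrative central ledger: issues serial numbers and tracks owners."""

    def __init__(self):
        self.next_serial = 1234567
        self.owner = {}            # serial number -> current owner

    def issue(self, account):
        # Withdrawing one infocoin assigns a new, never-before-used serial number.
        serial = self.next_serial
        self.next_serial += 1
        self.owner[serial] = account
        return serial

    def transfer(self, serial, sender, recipient):
        # Bob's two checks: (a) the coin with this serial number belongs to
        # Alice, and (b) she hasn't already spent it.
        if self.owner.get(serial) != sender:
            return False           # wrong owner, or already spent
        self.owner[serial] = recipient
        return True

bank = ToyBank()
coin = bank.issue("Alice")
print(bank.transfer(coin, "Alice", "Bob"))      # True: legitimate transfer
print(bank.transfer(coin, "Alice", "Charlie"))  # False: double spend rejected
```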
<h3>Making everyone collectively the bank</h3>
<p>This last solution looks pretty promising.  However, it turns out that we can do something much more ambitious.  We can eliminate the bank entirely from the protocol.  This changes the nature of the currency considerably.  It means that there is no longer any single organization in charge of the currency.  And when you think about the enormous power a central bank has &#8211; control over the money supply &#8211; that&#8217;s a pretty huge change.</p>
<p>The idea is to make it so <em>everyone</em> (collectively) is the bank. In particular, we&#8217;ll assume that everyone using Infocoin keeps a complete record of which infocoins belong to which person.  You can think of this as a shared public ledger showing all Infocoin transactions.  We&#8217;ll call this ledger the <em>block chain</em>, since that&#8217;s what the complete record will be called in Bitcoin, once we get to it.</p>
<p>Now, suppose Alice wants to transfer an infocoin to Bob.  She signs the message &#8220;I, Alice, am giving Bob one infocoin, with serial number 1234567&#8221;, and gives the signed message to Bob.  Bob can use his copy of the block chain to check that, indeed, the infocoin is Alice&#8217;s to give.  If that checks out then he broadcasts both Alice&#8217;s message and his acceptance of the transaction to the entire network, and everyone updates their copy of the block chain.</p>
<p>We still have the &#8220;where do serial numbers come from&#8221; problem, but that turns out to be pretty easy to solve, and so I will defer it to later, in the discussion of Bitcoin.  A more challenging problem is that this protocol allows Alice to cheat by double spending her infocoin.  She sends the signed message &#8220;I, Alice, am giving Bob one infocoin, with serial number 1234567&#8221; to Bob, and the message &#8220;I, Alice, am giving Charlie one infocoin, with [the same] serial number 1234567&#8221; to Charlie.  Both Bob and Charlie use their copy of the block chain to verify that the infocoin is Alice&#8217;s to spend.  Provided they do this verification at nearly the same time (before they&#8217;ve had a chance to hear from one another), both will find that, yes, the block chain shows the coin belongs to Alice.  And so they will both accept the transaction, and also broadcast their acceptance of the transaction.  Now there&#8217;s a problem.  How should other people update their block chains?  There may be no easy way to achieve a consistent shared ledger of transactions.  And even if everyone can agree on a consistent way to update their block chains, there is still the problem that either Bob or Charlie will be cheated.</p>
<p>At first glance double spending seems difficult for Alice to pull off. After all, if Alice sends the message first to Bob, then Bob can verify the message, and tell everyone else in the network (including Charlie) to update their block chain.  Once that has happened, Charlie would no longer be fooled by Alice. So there is most likely only a brief period of time in which Alice can double spend.  However, it&#8217;s obviously undesirable to have any such period of time.  Worse, there are techniques Alice could use to make that period longer.  She could, for example, use network traffic analysis to find times when Bob and Charlie are likely to have a lot of latency in communication.  Or perhaps she could do something to deliberately disrupt their communications.  If she can slow communication even a little that makes her task of double spending much easier.</p>
<p>How can we address the problem of double spending?  The obvious solution is that when Alice sends Bob an infocoin, Bob shouldn&#8217;t try to verify the transaction alone.  Rather, he should broadcast the possible transaction to the entire network of Infocoin users, and ask them to help determine whether the transaction is legitimate.  If they collectively decide that the transaction is okay, then Bob can accept the infocoin, and everyone will update their block chain.  This type of protocol can help prevent double spending, since if Alice tries to spend her infocoin with both Bob and Charlie, other people on the network will notice, and network users will tell both Bob and Charlie that there is a problem with the transaction, and the transaction shouldn&#8217;t go through.</p>
<p>In more detail, let&#8217;s suppose Alice wants to give Bob an infocoin.  As before, she signs the message &#8220;I, Alice, am giving Bob one infocoin, with serial number 1234567&#8221;, and gives the signed message to Bob. Also as before, Bob does a sanity check, using his copy of the block chain to check that, indeed, the coin currently belongs to Alice.  But at that point the protocol is modified.  Bob doesn&#8217;t just go ahead and accept the transaction.  Instead, he broadcasts Alice&#8217;s message to the entire network.  Other members of the network check to see whether Alice owns that infocoin.  If so, they broadcast the message &#8220;Yes, Alice owns infocoin 1234567, it can now be transferred to Bob.&#8221;  Once enough people have broadcast that message, everyone updates their block chain to show that infocoin 1234567 now belongs to Bob, and the transaction is complete.</p>
<p>This protocol has many imprecise elements at present.  For instance, what does it mean to say &#8220;once enough people have broadcast that message&#8221;?  What exactly does &#8220;enough&#8221; mean here?  It can&#8217;t mean everyone in the network, since we don&#8217;t <em>a priori</em> know who is on the Infocoin network.  For the same reason, it can&#8217;t mean some fixed fraction of users in the network.  We won&#8217;t try to make these ideas precise right now.  Instead, in the next section I&#8217;ll point out a serious problem with the approach as described.  Fixing that problem will at the same time have the pleasant side effect of making the ideas above much more precise.</p>
<h3>Proof-of-work</h3>
<p>Suppose Alice wants to double spend in the network-based protocol I just described.  She could do this by taking over the Infocoin network.  Let&#8217;s suppose she uses an automated system to set up a large number of separate identities, let&#8217;s say a billion, on the Infocoin network.  As before, she tries to double spend the same infocoin with both Bob and Charlie.  But when Bob and Charlie ask the network to validate their respective transactions, Alice&#8217;s sock puppet identities swamp the network, announcing to Bob that they&#8217;ve validated his transaction, and to Charlie that they&#8217;ve validated his transaction, possibly fooling one or both into accepting the transaction.</p>
<p>There&#8217;s a clever way of avoiding this problem, using an idea known as <em>proof-of-work</em>.  The idea is counterintuitive and involves a combination of two ideas: (1) to (artificially) make it <em>computationally costly</em> for network users to validate transactions; and (2) to <em>reward</em> them for trying to help validate transactions.  The reward is used so that people on the network will try to help validate transactions, even though that&#8217;s now been made a computationally costly process.  The benefit of making it costly to validate transactions is that validation can no longer be influenced by the number of network identities someone controls, but only by the total computational power they can bring to bear on validation.  As we&#8217;ll see, with some clever design we can make it so a cheater would need enormous computational resources to cheat, making it impractical.</p>
<p>That&#8217;s the gist of proof-of-work.  But to really understand proof-of-work, we need to go through the details.</p>
<p>Suppose Alice broadcasts to the network the news that &#8220;I, Alice, am giving Bob one infocoin, with serial number 1234567&#8221;.  </p>
<p>As other people on the network hear that message, each adds it to a queue of pending transactions that they&#8217;ve been told about, but which haven&#8217;t yet been approved by the network.  For instance, another network user named David might have the following queue of pending transactions:</p>
<p>I, Tom, am giving Sue one infocoin, with serial number 1201174.</p>
<p>I, Sydney, am giving Cynthia one infocoin, with serial number 1295618.</p>
<p>I, Alice, am giving Bob one infocoin, with serial number 1234567.</p>
<p>David checks his copy of the block chain, and can see that each transaction is valid.  He would like to help out by broadcasting news of that validity to the entire network.</p>
<p>However, before doing that, as part of the validation protocol David is required to solve a hard computational puzzle &#8211; the proof-of-work.  Without the solution to that puzzle, the rest of the network won&#8217;t accept his validation of the transaction.</p>
<p>What puzzle does David need to solve?  To explain that, let <img src='https://s0.wp.com/latex.php?latex=h&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h' title='h' class='latex' /> be a fixed hash function known by everyone in the network &#8211; it&#8217;s built into the protocol.  Bitcoin uses the well-known <a href="https://en.wikipedia.org/wiki/SHA-2">SHA-256</a> hash function, but any cryptographically secure hash function will do.  Let&#8217;s give David&#8217;s queue of pending transactions a label, <img src='https://s0.wp.com/latex.php?latex=l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l' title='l' class='latex' />, just so it&#8217;s got a name we can refer to.  Suppose David appends a number <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> (called the <em>nonce</em>) to <img src='https://s0.wp.com/latex.php?latex=l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l' title='l' class='latex' /> and hashes the combination.  For example, if we use <img src='https://s0.wp.com/latex.php?latex=l+%3D+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l = ' title='l = ' class='latex' /> &#8220;Hello, world!&#8221; (obviously this is not a list of transactions, just a string used for illustrative purposes) and the nonce <img src='https://s0.wp.com/latex.php?latex=x+%3D+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x = 0' title='x = 0' class='latex' /> <a href="https://en.bitcoin.it/wiki/Proof_of_work">then</a> (output is in hexadecimal) </p>
<pre>
h("Hello, world!0") = 
  1312af178c253f84028d480a6adc1e25e81caa44c749ec81976192e2ec934c64
</pre>
<p> The puzzle David has to solve &#8211; the proof-of-work &#8211; is to find a nonce <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> such that when we append <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> to <img src='https://s0.wp.com/latex.php?latex=l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l' title='l' class='latex' /> and hash the combination the output hash begins with a long run of zeroes.  The puzzle can be made more or less difficult by varying the number of zeroes required to solve the puzzle.  A relatively simple proof-of-work puzzle might require just three or four zeroes at the start of the hash, while a more difficult proof-of-work puzzle might require a much longer run of zeros, say 15 consecutive zeroes. In either case, the above attempt to find a suitable nonce, with <img src='https://s0.wp.com/latex.php?latex=x+%3D+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x = 0' title='x = 0' class='latex' />, is a failure, since the output doesn&#8217;t begin with any zeroes at all.  Trying <img src='https://s0.wp.com/latex.php?latex=x+%3D+1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x = 1' title='x = 1' class='latex' /> doesn&#8217;t work either: </p>
<pre>
h("Hello, world!1") = 
  e9afc424b79e4f6ab42d99c81156d3a17228d6e1eef4139be78e948a9332a7d8
</pre>
<p> We can keep trying different values for the nonce, <img src='https://s0.wp.com/latex.php?latex=x+%3D+2%2C+3%2C%5Cldots&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x = 2, 3,\ldots' title='x = 2, 3,\ldots' class='latex' />. Finally, at <img src='https://s0.wp.com/latex.php?latex=x+%3D+4250&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x = 4250' title='x = 4250' class='latex' /> we obtain: </p>
<pre>
h("Hello, world!4250") = 
  0000c3af42fc31103f1fdc0151fa747ff87349a4714df7cc52ea464e12dcd4e9
</pre>
<p> This nonce gives us a string of four zeroes at the beginning of the output of the hash.  This will be enough to solve a simple proof-of-work puzzle, but not enough to solve a more difficult proof-of-work puzzle.</p>
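<p>The search above is easy to reproduce.  Here is a short Python sketch using SHA-256 that recovers the &#8220;Hello, world!&#8221; example, with the four-leading-zeroes difficulty used above (the function name is just for illustration):</p>

```python
import hashlib

def proof_of_work(l, difficulty):
    """Find the smallest nonce x such that sha256(l + str(x)) begins
    with `difficulty` leading hex zeroes."""
    prefix = "0" * difficulty
    x = 0
    while True:
        digest = hashlib.sha256((l + str(x)).encode()).hexdigest()
        if digest.startswith(prefix):
            return x, digest
        x += 1

nonce, digest = proof_of_work("Hello, world!", 4)
print(nonce)   # 4250
print(digest)  # 0000c3af42fc31103f1fdc0151fa747ff87349a4714df7cc52ea464e12dcd4e9
```

<p>Bitcoin&#8217;s actual puzzle is the same idea with a finer dial: instead of counting leading zeroes, it interprets the hash as a number and checks int(digest, 16) <= target for an adjustable target, as described below.</p>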
<p>What makes this puzzle hard to solve is the fact that the output from a cryptographic hash function behaves like a random number: change the input even a tiny bit and the output from the hash function changes completely, in a way that&#8217;s hard to predict.  So if we want the output hash value to begin with 10 zeroes, say, then David will need, on average, to try <img src='https://s0.wp.com/latex.php?latex=16%5E%7B10%7D+%5Capprox+10%5E%7B12%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='16^{10} \approx 10^{12}' title='16^{10} \approx 10^{12}' class='latex' /> different values for <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> before he finds a suitable nonce.  That&#8217;s a pretty challenging task, requiring lots of computational power.</p>
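<p>That estimate is simple arithmetic, and worth checking.  Under the assumption stated above (each hex digit of a good hash behaves like an independent uniform random choice among 16 values), a run of k leading zeroes occurs with probability 16<sup>-k</sup>, so the expected number of attempts is 16<sup>k</sup>:</p>

```python
# Expected number of hash evaluations to find a digest with k leading hex zeroes,
# assuming each hex digit is effectively uniform over 16 values.
for k in (4, 10):
    print(f"{k} leading zeroes: about {16 ** k:,} expected attempts")
```

<p>For k = 10 this gives 16<sup>10</sup> = 1,099,511,627,776, i.e. roughly 10<sup>12</sup>, matching the figure above.</p>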
<p>Obviously, it&#8217;s possible to make this puzzle more or less difficult to solve by requiring more or fewer zeroes in the output from the hash function.  In fact, the Bitcoin protocol gets quite a fine level of control over the difficulty of the puzzle, by using a slight variation on the proof-of-work puzzle described above.  Instead of requiring leading zeroes, the Bitcoin proof-of-work puzzle requires the hash of a block&#8217;s header to be lower than or equal to a number known as the <a href="https://en.bitcoin.it/wiki/Target">target</a>.  This target is automatically adjusted to ensure that a Bitcoin block takes, on average, about ten minutes to validate.  </p>
<p>(In practice there is considerable randomness in how long it takes to validate a block &#8211; sometimes a new block is validated in just a minute or two, other times it may take 20 minutes or even longer. It&#8217;s straightforward to modify the Bitcoin protocol so that the time to validation is much more sharply peaked around ten minutes.  Instead of solving a single puzzle, we can require that multiple puzzles be solved; with some careful design it is possible to considerably reduce the variance in the time to validate a block of transactions.)</p>
<p>Alright, let&#8217;s suppose David is lucky and finds a suitable nonce, <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' />. Celebration!  (He&#8217;ll be rewarded for finding the nonce, as described below).  He broadcasts the block of transactions he&#8217;s approving to the network, together with the value for <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' />.  Other participants in the Infocoin network can verify that <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is a valid solution to the proof-of-work puzzle.  And they then update their block chains to include the new block of transactions.</p>
<p>For the proof-of-work idea to have any chance of succeeding, network users need an incentive to help validate transactions.  Without such an incentive, they have no reason to expend valuable computational power, merely to help validate other people&#8217;s transactions.  And if network users are not willing to expend that power, then the whole system won&#8217;t work.  The solution to this problem is to reward people who help validate transactions.  In particular, suppose we reward whoever successfully validates a block of transactions by crediting them with some infocoins.  Provided the infocoin reward is large enough that will give them an incentive to participate in validation.</p>
<p>In the Bitcoin protocol, this validation process is called <em>mining</em>.  For each block of transactions validated, the successful miner receives a bitcoin reward.  Initially, this was set to be a 50 bitcoin reward.  But for every 210,000 validated blocks (roughly, once every four years) the reward halves.  This has happened just once, to date, and so the current reward for mining a block is 25 bitcoins.  This halving in the rate will continue every four years until the year 2140 CE.  At that point, the reward for mining will drop below <img src='https://s0.wp.com/latex.php?latex=10%5E%7B-8%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='10^{-8}' title='10^{-8}' class='latex' /> bitcoins per block.  <img src='https://s0.wp.com/latex.php?latex=10%5E%7B-8%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='10^{-8}' title='10^{-8}' class='latex' /> bitcoins is actually the minimal unit of Bitcoin, and is known as a <em>satoshi</em>.  So in 2140 CE the total supply of bitcoins will cease to increase.  However, that won&#8217;t eliminate the incentive to help validate transactions.  Bitcoin also makes it possible to set aside some currency in a transaction as a <em>transaction fee</em>, which goes to the miner who helps validate it.  In the early days of Bitcoin transaction fees were mostly set to zero, but as Bitcoin has gained in popularity, transaction fees have gradually risen, and are now a substantial additional incentive on top of the 25 bitcoin reward for mining a block.</p>
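<p>This reward schedule implies Bitcoin&#8217;s famous cap of (just under) 21 million coins, which a few lines of Python make visible.  This is a floating-point approximation for illustration &#8211; the real protocol counts in integer satoshis and rounds the reward down at each halving &#8211; but it is close enough to show the limit:</p>

```python
# Total coins ever created: 210,000 blocks per era, reward halving each era,
# stopping once the reward falls below one satoshi (10**-8 bitcoins).
reward = 50.0
total = 0.0
halvings = 0
while reward >= 1e-8:
    total += 210_000 * reward
    reward /= 2
    halvings += 1
print(halvings)      # 33 reward eras before the reward vanishes
print(round(total))  # 21000000 -- the 21 million coin cap
```

<p>Thirty-three eras of roughly four years each is what puts the final halving around the year 2140.</p>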
<p>You can think of proof-of-work as a competition to approve transactions.  Each entry in the competition costs a little bit of computing power.  A miner&#8217;s chance of winning the competition is (roughly, and with some caveats) equal to the proportion of the total computing power that they control.  So, for instance, if a miner controls one percent of the computing power being used to validate Bitcoin transactions, then they have roughly a one percent chance of winning the competition.  So provided a lot of computing power is being brought to bear on the competition, a dishonest miner is likely to have only a relatively small chance to corrupt the validation process, unless they expend a huge amount of computing resources.</p>
<p>Of course, while it&#8217;s encouraging that a dishonest party has only a relatively small chance to corrupt the block chain, that&#8217;s not enough to give us confidence in the currency.  In particular, we haven&#8217;t yet conclusively addressed the issue of double spending.</p>
<p>I&#8217;ll analyse double spending shortly.  Before doing that, I want to fill in an important detail in the description of Infocoin.  We&#8217;d ideally like the Infocoin network to agree upon the <em>order</em> in which transactions have occurred.  If we don&#8217;t have such an ordering then at any given moment it may not be clear who owns which infocoins. To help do this we&#8217;ll require that new blocks always include a pointer to the last block validated in the chain, in addition to the list of transactions in the block.  (The pointer is actually just a hash of the previous block).  So typically the block chain is just a linear chain of blocks of transactions, one after the other, with later blocks each containing a pointer to the immediately prior block:</p>
<p><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2013/12/block_chain.png" width="310px"></p>
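<p>The hash-pointer idea is easy to sketch. This toy model (not Bitcoin's actual block format) shows why tampering with an old block breaks every later link in the chain:</p>

```python
import hashlib
import json

def make_block(prev_hash, transactions):
    """Each block commits to its predecessor by including the predecessor's hash."""
    block = {"prev": prev_hash, "txs": transactions}
    digest = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block, digest

genesis, h0 = make_block("0" * 64, ["coinbase -> alice: 50"])
block1, h1 = make_block(h0, ["alice -> bob: 10"])

# altering the genesis block changes its hash, so block1's pointer no longer matches
tampered, h0_tampered = make_block("0" * 64, ["coinbase -> alice: 5000"])
print(block1["prev"] == h0_tampered)  # False
```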
<p>Occasionally, a fork will appear in the block chain.  This can happen, for instance, if by chance two miners happen to validate a block of transactions near-simultaneously &#8211; both broadcast their newly-validated block out to the network, and some people update their block chain one way, and others update their block chain the other way:</p>
<p><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2013/12/block_chain_fork.png" width="310px"></p>
<p>This causes exactly the problem we&#8217;re trying to avoid &#8211; it&#8217;s no longer clear in what order transactions have occurred, and it may not be clear who owns which infocoins.  Fortunately, there&#8217;s a simple idea that can be used to remove any forks.  The rule is this: if a fork occurs, people on the network keep track of both forks.  But at any given time, miners only work to extend whichever fork is longest in their copy of the block chain.  </p>
<p>Suppose, for example, that we have a fork in which some miners receive block A first, and some miners receive block B first.  Those miners who receive block A first will continue mining along that fork, while the others will mine along fork B.  Let&#8217;s suppose that the miners working on fork B are the next to successfully mine a block:</p>
<p><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2013/12/block_chain_extended.png" width="390px"></p>
<p>After they receive news that this has happened, the miners working on fork A will notice that fork B is now longer, and will switch to working on that fork.  Presto, in short order work on fork A will cease, and everyone will be working on the same linear chain, and block A can be ignored.  Of course, any still-pending transactions in A will still be pending in the queues of the miners working on fork B, and so all transactions will eventually be validated.</p>
<p>Likewise, it may be that the miners working on fork A are the first to extend their fork.  In that case work on fork B will quickly cease, and again we have a single linear chain.</p>
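<p>The longest-fork rule itself is almost a one-liner. In this sketch, a node adopts an incoming fork only if it is strictly longer than the node's local copy:</p>

```python
def adopt(local_chain, incoming_chain):
    """A node switches to an incoming fork only when it is strictly longer."""
    return incoming_chain if len(incoming_chain) > len(local_chain) else local_chain

fork_a = ["genesis", "A"]
fork_b = ["genesis", "B", "B1"]  # fork B was extended first

# miners who held fork A switch as soon as they see the longer fork B
print(adopt(fork_a, fork_b))  # ['genesis', 'B', 'B1']
print(adopt(fork_b, fork_a))  # fork B miners keep working on fork B
```

Ties leave the local view unchanged, which is why a fork can persist until one side is extended.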
<p>No matter what the outcome, this process ensures that the block chain has an agreed-upon time ordering of the blocks.  In Bitcoin proper, a transaction is not considered confirmed until: (1) it is part of a block in the longest fork, and (2) at least 5 blocks follow it in the longest fork.  In this case we say that the transaction has &#8220;6 confirmations&#8221;.  This gives the network time to come to an agreed-upon ordering of the blocks.  We&#8217;ll also use this strategy for Infocoin.</p>
<p>With the time-ordering now understood, let&#8217;s return to think about what happens if a dishonest party tries to double spend.  Suppose Alice tries to double spend with Bob and Charlie.  One possible approach is for her to try to validate a block that includes both transactions.  Assuming she has one percent of the computing power, she will occasionally get lucky and validate the block by solving the proof-of-work.  Unfortunately for Alice, the double spending will be immediately spotted by other people in the Infocoin network and rejected, even though she has solved the proof-of-work problem.  So that&#8217;s not something we need to worry about.</p>
<p>A more serious problem occurs if she broadcasts two separate transactions in which she spends the same infocoin with Bob and Charlie, respectively.  She might, for example, broadcast one transaction to a subset of the miners, and the other transaction to another set of miners, hoping to get both transactions validated in this way.  Fortunately, in this case, as we&#8217;ve seen, the network will eventually confirm one of these transactions, but not both.  So, for instance, Bob&#8217;s transaction might ultimately be confirmed, in which case Bob can go ahead confidently.  Meanwhile, Charlie will see that his transaction has not been confirmed, and so will decline Alice&#8217;s offer.  So this isn&#8217;t a problem either.  In fact, knowing that this will be the case, there is little reason for Alice to try this in the first place.</p>
<p>An important variant on double spending is if Alice = Bob, i.e., Alice tries to spend a coin with Charlie which she is also &#8220;spending&#8221; with herself (i.e., giving back to herself).  This sounds like it ought to be easy to detect and deal with, but, of course, it&#8217;s easy on a network to set up multiple identities associated with the same person or organization, so this possibility needs to be considered.  In this case, Alice&#8217;s strategy is to wait until Charlie accepts the infocoin, which happens after the transaction has been confirmed 6 times in the longest chain.  She will then attempt to fork the chain before the transaction with Charlie, adding a block which includes a transaction in which she pays herself:</p>
<p><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2013/12/block_chain_cheating.png" width="473px"></p>
<p>Unfortunately for Alice, it&#8217;s now very difficult for her to catch up with the longer fork.  Other miners won&#8217;t want to help her out, since they&#8217;ll be working on the longer fork.  And unless Alice is able to solve the proof-of-work at least as fast as everyone else in the network combined &#8211; roughly, that means controlling more than fifty percent of the computing power &#8211; she will just keep falling further and further behind.  Of course, she might get lucky.  We can, for example, imagine a scenario in which Alice controls one percent of the computing power, but happens to get lucky and finds six extra blocks in a row, before the rest of the network has found any extra blocks.  In this case, she might be able to get ahead, and get control of the block chain.  But this particular event will occur with probability <img src='https://s0.wp.com/latex.php?latex=1%2F100%5E6+%3D+10%5E%7B-12%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/100^6 = 10^{-12}' title='1/100^6 = 10^{-12}' class='latex' />.  A more general analysis along these lines shows that Alice&#8217;s probability of ever catching up is infinitesimal, unless she is able to solve proof-of-work puzzles at a rate approaching all other miners combined.</p>
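<p>One such analysis appears in the original Bitcoin paper, which models the race as a Poisson process. Here is the paper's attacker-success formula in Python, evaluated for an attacker with one percent of the hash power who starts six blocks behind:</p>

```python
import math

def catch_up_probability(q, z):
    """Probability that an attacker with fraction q of the hash power ever
    overtakes the honest chain from z blocks behind (Nakamoto, 2008)."""
    p = 1.0 - q
    if q >= p:
        return 1.0  # a majority attacker always catches up eventually
    lam = z * q / p
    prob = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        prob -= poisson * (1.0 - (q / p) ** (z - k))
    return prob

print(catch_up_probability(0.01, 6))  # vanishingly small
print(catch_up_probability(0.30, 6))  # still small, but very much larger
```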
<p>Of course, this is not a rigorous security analysis showing that Alice cannot double spend. It&#8217;s merely an informal plausibility argument. The <a href="http://bitcoin.org/bitcoin.pdf">original paper</a> introducing Bitcoin did not, in fact, contain a rigorous security analysis, only informal arguments along the lines I&#8217;ve presented here.  The security community is still analysing Bitcoin, and trying to understand possible vulnerabilities.  You can see some of this research <a href="https://en.bitcoin.it/wiki/Research">listed here</a>, and I mention a few related problems in the &#8220;Problems for the author&#8221; below.  At this point I think it&#8217;s fair to say that the jury is still out on how secure Bitcoin is.</p>
<p>The proof-of-work and mining ideas give rise to many questions.  How much reward is enough to persuade people to mine?  How does the change in supply of infocoins affect the Infocoin economy?  Will Infocoin mining end up concentrated in the hands of a few, or many?  If it&#8217;s just a few, doesn&#8217;t that endanger the security of the system? Presumably transaction fees will eventually equilibrate &#8211; won&#8217;t this introduce an unwanted source of friction, and make small transactions less desirable?  These are all great questions, but beyond the scope of this post.  I may come back to the questions (in the context of Bitcoin) in a future post.  For now, we&#8217;ll stick to our focus on understanding how the Bitcoin protocol works.</p>
<h3>Problems for the author</h3>
<ul>
<li> I don&#8217;t understand why double spending can&#8217;t be prevented in a   simpler manner using   <a href="http://en.wikipedia.org/wiki/Two-phase_commit_protocol">two-phase     commit</a>.  Suppose Alice tries to double spend an infocoin with   both Bob and Charlie.  The idea is that Bob and Charlie would each   broadcast their respective messages to the Infocoin network, along   with a request: &#8220;Should I accept this?&#8221;  They&#8217;d then wait some   period &#8211; perhaps ten minutes &#8211; to hear any naysayers who could   prove that Alice was trying to double spend.  If no such nays are   heard (and provided there are no signs of attempts to disrupt the   network), they&#8217;d then accept the transaction.  This protocol needs   to be hardened against network attacks, but it seems to me to be the   core of a good alternate idea.  How well does this work?  What   drawbacks and advantages does it have compared to the full Bitcoin   protocol?
<li> Early in the section I mentioned that there is a natural way of reducing the variance in time required to validate a block of transactions.  If that variance is reduced too much, then it creates an interesting attack possibility.  Suppose Alice tries to fork the chain in such a way that: (a) one fork starts with a block in which Alice pays herself, while the other fork starts with a block in which Alice pays Bob; (b) both blocks are announced nearly simultaneously, so roughly half the miners will attempt to mine each fork; (c) Alice uses her mining power to try to keep the forks of roughly equal length, mining whichever fork is shorter &#8211; this is ordinarily hard to pull off, but becomes significantly easier if the standard deviation of the time-to-validation is much shorter than the network latency; (d) after 5 blocks have been mined on both forks, Alice throws her mining power into making it more likely that Bob&#8217;s transaction is confirmed; and (e) after confirmation of Bob&#8217;s transaction, she then throws her computational power into the other fork, and attempts to regain the lead.  This balancing strategy will have only a small chance of success.  But while the probability is small, it will certainly be much larger than in the standard protocol, with high variance in the time to validate a block.  Is there a way of avoiding this problem?
<li> Suppose Bitcoin mining software always explored nonces starting   with <img src='https://s0.wp.com/latex.php?latex=x+%3D+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x = 0' title='x = 0' class='latex' />, then <img src='https://s0.wp.com/latex.php?latex=x+%3D+1%2C+x+%3D+2%2C%5Cldots&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x = 1, x = 2,\ldots' title='x = 1, x = 2,\ldots' class='latex' />.  If this is done by all   (or even just a substantial fraction) of Bitcoin miners then it   creates a vulnerability.  Namely, it&#8217;s possible for someone to   improve their odds of solving the proof-of-work merely by starting   with some other (much larger) nonce.  More generally, it may be   possible for attackers to exploit any systematic patterns in the way   miners explore the space of nonces.  More generally still, in the   analysis of this section I have implicitly assumed a kind of   symmetry between different miners.  In practice, there will be   asymmetries and a thorough security analysis will need to account for those asymmetries.
</ul>
<h3>Bitcoin</h3>
<p>Let&#8217;s move away from Infocoin, and describe the actual Bitcoin protocol.  There are a few new ideas here, but with one exception (discussed below) they&#8217;re mostly obvious modifications to Infocoin.</p>
<p>To use Bitcoin in practice, you first install a <a href="http://bitcoin.org/en/choose-your-wallet">wallet</a> program on your computer.  To give you a sense of what that means, here&#8217;s a screenshot of a wallet called <a href="https://multibit.org/">MultiBit</a>. You can see the Bitcoin balance on the left &#8212; 0.06555555 Bitcoins, or about 70 dollars at the exchange rate on the day I took this screenshot &#8212; and on the right two recent transactions, which deposited those 0.06555555 Bitcoins:</p>
<p><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2013/12/wallet_transaction.jpg" width="500px"></p>
<p>Suppose you&#8217;re a merchant who has set up an online store, and you&#8217;ve decided to allow people to pay using Bitcoin.  What you do is tell your wallet program to generate a <em>Bitcoin address</em>.  In response, it will generate a public / private key pair, and then hash the public key to form your Bitcoin address:</p>
<p><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2013/12/bitcoin_address.jpg" width="500px"></p>
<p>You then send your Bitcoin address to the person who wants to buy from you.  You could do this in email, or even put the address up publicly on a webpage.  This is safe, since the address is merely a hash of your public key, which can safely be known by the world anyway.  (I&#8217;ll return later to the question of why the Bitcoin address is a hash, and not just the public key.)</p>
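<p>For the curious: a Bitcoin address is the Base58Check encoding of a version byte plus the 20-byte public-key hash (RIPEMD-160 applied to the SHA-256 of the public key). Here's a sketch of the encoding step; the 20-byte hash below is a made-up placeholder, not a real key hash:</p>

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check(version, payload):
    """Base58Check: version byte + payload + 4-byte double-SHA-256 checksum."""
    data = bytes([version]) + payload
    checksum = hashlib.sha256(hashlib.sha256(data).digest()).digest()[:4]
    n = int.from_bytes(data + checksum, "big")
    encoded = ""
    while n:
        n, r = divmod(n, 58)
        encoded = B58_ALPHABET[r] + encoded
    for byte in data + checksum:  # each leading zero byte becomes a '1'
        if byte:
            break
        encoded = "1" + encoded
    return encoded

# hypothetical 20-byte public-key hash, for illustration only
fake_pubkey_hash = bytes(range(20))
address = base58check(0x00, fake_pubkey_hash)
print(address)  # mainnet addresses (version byte 0x00) always start with '1'
```

The checksum is why a mistyped address is almost always rejected by wallet software rather than sending coins into the void.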
<p>The person who is going to pay you then generates a <em>transaction</em>.  Let&#8217;s take a look at the data from an <a href="http://blockexplorer.com/tx/7c402505be883276b833d57168a048cfdf306a926484c0b58930f53d89d036f9">actual   transaction</a> transferring <img src='https://s0.wp.com/latex.php?latex=0.31900000&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0.31900000' title='0.31900000' class='latex' /> bitcoins.  What&#8217;s shown below is very nearly the raw data.  It&#8217;s changed in three ways: (1) the data has been deserialized; (2) line numbers have been added, for ease of reference; and (3) I&#8217;ve abbreviated various hashes and public keys, just putting in the first six hexadecimal digits of each, when in reality they are much longer.  Here&#8217;s the data: </p>
<pre>
1.  {"hash":"7c4025...",
2.  "ver":1,
3.  "vin_sz":1,
4.  "vout_sz":1,
5.  "lock_time":0,
6.  "size":224,
7.  "in":[
8.    {"prev_out":
9.      {"hash":"2007ae...",
10.      "n":0},
11.    "scriptSig":"304502... 042b2d..."}],
12. "out":[
13.   {"value":"0.31900000",
14.    "scriptPubKey":"OP_DUP OP_HASH160 a7db6f OP_EQUALVERIFY OP_CHECKSIG"}]}
</pre>
<p>Let&#8217;s go through this, line by line.</p>
<p>Line 1 contains the hash of the remainder of the transaction, <tt>7c4025...</tt>, expressed in hexadecimal.  This is used as an identifier for the transaction.</p>
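<p>For reference, that identifier is computed by hashing the serialized transaction twice with SHA-256, and the result is conventionally displayed byte-reversed. A sketch, applied to placeholder bytes rather than a genuine serialized transaction:</p>

```python
import hashlib

def txid(serialized_tx):
    """Double SHA-256 of the raw transaction bytes, displayed byte-reversed."""
    digest = hashlib.sha256(hashlib.sha256(serialized_tx).digest()).digest()
    return digest[::-1].hex()

# placeholder bytes standing in for a real serialized transaction
print(txid(b"placeholder transaction bytes"))  # 64 hex digits
```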
<p>Line 2 tells us that this is a transaction in version 1 of the Bitcoin protocol.</p>
<p>Lines 3 and 4 tell us that the transaction has one input and one output, respectively.  I&#8217;ll talk below about transactions with more inputs and outputs, and why that&#8217;s useful. </p>
<p>Line 5 contains the value for <tt>lock_time</tt>, which can be used to control when a transaction is finalized.  For most Bitcoin transactions being carried out today the <tt>lock_time</tt> is set to 0, which means the transaction is finalized immediately.</p>
<p>Line 6 tells us the size (in bytes) of the transaction.  Note that it&#8217;s not the monetary amount being transferred!  That comes later.</p>
<p>Lines 7 through 11 define the input to the transaction.  In particular, lines 8 through 10 tell us that the input is to be taken from the output from an earlier transaction, with the given <tt>hash</tt>, which is expressed in hexadecimal as <tt>2007ae...</tt>. The <tt>n=0</tt> tells us it&#8217;s to be the first output from that transaction; we&#8217;ll see soon how multiple outputs (and inputs) from a transaction work, so don&#8217;t worry too much about this for now.  Line 11 contains the signature of the person sending the money, <tt>304502...</tt>, followed by a space, and then the corresponding public key, <tt>042b2d...</tt>.  Again, these are both in hexadecimal.</p>
<p>One thing to note about the input is that there&#8217;s nothing explicitly specifying how many bitcoins from the previous transaction should be spent in this transaction.  In fact, <em>all</em> the bitcoins from the <tt>n=0</tt>th output of the previous transaction are spent.  So, for example, if the <tt>n=0</tt>th output of the earlier transaction was 2 bitcoins, then 2 bitcoins will be spent in this transaction.  This seems like an inconvenient restriction &#8211; like trying to buy bread with a 20 dollar note, and not being able to break the note down.  The solution, of course, is to have a mechanism for providing change. This can be done using transactions with multiple inputs and outputs, which we&#8217;ll discuss in the next section.</p>
<p>Lines 12 through 14 define the output from the transaction.  In particular, line 13 tells us the value of the output, 0.319 bitcoins. Line 14 is somewhat complicated.  The main thing to note is that the string <tt>a7db6f...</tt> is the Bitcoin address of the intended recipient of the funds (written in hexadecimal).  In fact, line 14 is actually an expression in Bitcoin&#8217;s scripting language.  I&#8217;m not going to describe that language in detail in this post; the important thing to take away now is just that <tt>a7db6f...</tt> is the Bitcoin address.</p>
<p>You can now see, by the way, how Bitcoin addresses the question I swept under the rug in the last section: where do Bitcoin serial numbers come from?  In fact, the role of the serial number is played by transaction hashes.  In the transaction above, for example, the recipient is receiving 0.319 Bitcoins, which come out of the first output of an earlier transaction with hash <tt>2007ae...</tt> (line 9). If you go and look in the block chain for that transaction, you&#8217;d see that its output comes from a still earlier transaction.  And so on.</p>
<p>There are two clever things about using transaction hashes instead of serial numbers.  First, in Bitcoin there&#8217;s not really any separate, persistent &#8220;coins&#8221; at all, just a long series of transactions in the block chain.  It&#8217;s a clever idea to realize that you don&#8217;t need persistent coins, and can just get by with a ledger of transactions. Second, by operating in this way we remove the need for any central authority issuing serial numbers.  Instead, the serial numbers can be self-generated, merely by hashing the transaction.</p>
<p>In fact, it&#8217;s possible to keep following the chain of transactions further back in history.  Ultimately, this process must terminate. This can happen in one of two ways.  The first possibility is that you&#8217;ll arrive at the very first Bitcoin transaction, contained in the so-called <a href="https://en.bitcoin.it/wiki/Genesis_block">Genesis   block</a>.  This is a special transaction, having no inputs, but a 50 Bitcoin output.  In other words, this transaction establishes an initial money supply.  The Genesis block is treated separately by Bitcoin clients, and I won&#8217;t get into the details here, although it&#8217;s along similar lines to the transaction above.  You can see the deserialized raw data <a href="http://blockexplorer.com/rawblock/000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f">here</a>, and read about the Genesis block <a href="https://en.bitcoin.it/wiki/Genesis_block">here</a>.</p>
<p>The second possibility when you follow a chain of transactions back in time is that eventually you&#8217;ll arrive at a so-called <em>coinbase   transaction</em>.  With the exception of the Genesis block, every block of transactions in the block chain starts with a special coinbase transaction.  This is the transaction rewarding the miner who validated that block of transactions.  It uses a similar but not identical format to the transaction above.  I won&#8217;t go through the format in detail, but if you want to see an example, see <a href="http://blockexplorer.com/rawtx/c3facb1e90fdbaf0ee59e342a00e1c82588af138784fabad7398eb9dab3a0e5a">here</a>. You can read a little more about coinbase transactions <a href="https://en.bitcoin.it/wiki/Protocol_specification#Transaction_Verification">here</a>.</p>
<p>Something I haven&#8217;t been precise about above is what exactly is being signed by the digital signature in line 11.  The obvious thing to do is for the payer to sign the whole transaction (apart from the transaction hash, which, of course, must be generated later). Currently, this is <em>not</em> what is done &#8211; some pieces of the transaction are omitted.  This makes some pieces of the transaction <a href="https://en.bitcoin.it/wiki/Transaction_Malleability">malleable</a>, i.e., they can be changed later.  However, this malleability does not include the amounts being paid out, senders and recipients, which can&#8217;t be changed later.  I must admit I haven&#8217;t dug down into the details here. I gather that this malleability is under discussion in the Bitcoin developer community, and there are efforts afoot to reduce or eliminate this malleability.</p>
<h3>Transactions with multiple inputs and outputs</h3>
<p>In the last section I described how a transaction with a single input and a single output works. In practice, it&#8217;s often extremely convenient to create Bitcoin transactions with multiple inputs or multiple outputs.  I&#8217;ll talk below about why this can be useful.  But first let&#8217;s take a look at the data from an <a href="http://blockexplorer.com/tx/99383066a5140b35b93e8f84ef1d40fd720cc201d2aa51915b6c33616587b94f">actual   transaction</a>: </p>
<pre>
1. {"hash":"993830...",
2. "ver":1,
3. "vin_sz":3,
4.  "vout_sz":2,
5.  "lock_time":0,
6.  "size":552,
7.  "in":[
8.    {"prev_out":{
9.      "hash":"3beabc...",
10.        "n":0},
11.     "scriptSig":"304402... 04c7d2..."},
12.    {"prev_out":{
13.        "hash":"fdae9b...",
14.        "n":0},
15.      "scriptSig":"304502... 026e15..."},
16.    {"prev_out":{
17.        "hash":"20c86b...",
18.        "n":1},
19.      "scriptSig":"304402... 038a52..."}],
20.  "out":[
21.    {"value":"0.01068000",
22.      "scriptPubKey":"OP_DUP OP_HASH160 e8c306... OP_EQUALVERIFY OP_CHECKSIG"},
23.    {"value":"4.00000000",
24.      "scriptPubKey":"OP_DUP OP_HASH160 d644e3... OP_EQUALVERIFY OP_CHECKSIG"}]}
</pre>
<p> Let&#8217;s go through the data, line by line.  It&#8217;s very similar to the single-input-single-output transaction, so I&#8217;ll do this pretty quickly.</p>
<p>Line 1 contains the hash of the remainder of the transaction.  This is used as an identifier for the transaction.</p>
<p>Line 2 tells us that this is a transaction in version 1 of the Bitcoin protocol.</p>
<p>Lines 3 and 4 tell us that the transaction has three inputs and two outputs, respectively.</p>
<p>Line 5 contains the <tt>lock_time</tt>.  As in the single-input-single-output case this is set to 0, which means the transaction is finalized immediately.</p>
<p>Line 6 tells us the size of the transaction in bytes.</p>
<p>Lines 7 through 19 define a list of the inputs to the transaction. Each corresponds to an output from a previous Bitcoin transaction.</p>
<p>The first input is defined in lines 8 through 11.  </p>
<p>In particular, lines 8 through 10 tell us that the input is to be taken from the <tt>n=0</tt>th output from the transaction with <tt>hash</tt> <tt>3beabc...</tt>.  Line 11 contains the signature, followed by a space, and then the public key of the person sending the bitcoins.</p>
<p>Lines 12 through 15 define the second input, with a similar format to lines 8 through 11.  And lines 16 through 19 define the third input.</p>
<p>Lines 20 through 24 define a list containing the two outputs from the transaction.</p>
<p>The first output is defined in lines 21 and 22.  Line 21 tells us the value of the output, 0.01068000 bitcoins.  As before, line 22 is an expression in Bitcoin&#8217;s scripting language.  The main thing to take away here is that the string <tt>e8c306...</tt> is the Bitcoin address of the intended recipient of the funds.</p>
<p>The second output is defined in lines 23 and 24, with a similar format to the first output.</p>
<p>One apparent oddity in this description is that although each output has a Bitcoin value associated to it, the inputs do not.  Of course, the values of the respective inputs can be found by consulting the corresponding outputs in earlier transactions.  In a standard Bitcoin transaction, the sum of all the inputs in the transaction must be at least as much as the sum of all the outputs.  (The only exceptions to this principle are the Genesis block and coinbase transactions, both of which add to the overall Bitcoin supply.)  If the inputs sum up to more than the outputs, then the excess is used as a <em>transaction fee</em>.  This is paid to whichever miner successfully validates the block which the current transaction is a part of.</p>
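<p>In code, the fee is just the difference between those sums, where input values are looked up from the earlier transactions' outputs. The input values below are invented for illustration; the outputs match the transaction above (all amounts in satoshis, i.e. hundred-millionths of a bitcoin):</p>

```python
def transaction_fee(input_values, output_values):
    """Fee = inputs minus outputs; a standard transaction may not overspend."""
    fee = sum(input_values) - sum(output_values)
    if fee < 0:
        raise ValueError("outputs exceed inputs: invalid transaction")
    return fee

# hypothetical values for the three inputs (in satoshis)
inputs = [1_068_000, 200_050_000, 200_000_000]
# the two outputs above: 0.01068000 and 4.00000000 bitcoins
outputs = [1_068_000, 400_000_000]
print(transaction_fee(inputs, outputs))  # 50_000 satoshis go to the winning miner
```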
<p>That&#8217;s all there is to multiple-input-multiple-output transactions! They&#8217;re a pretty simple variation on single-input-single-output-transactions.</p>
<p>One nice application of multiple-input-multiple-output transactions is the idea of <em>change</em>.  Suppose, for example, that I want to send you 0.15 bitcoins.  I can do so by spending money from a previous transaction in which I received 0.2 bitcoins.  Of course, I don&#8217;t want to send you the entire 0.2 bitcoins.  The solution is to send you 0.15 bitcoins, and to send 0.05 bitcoins to a Bitcoin address which I own.  Those 0.05 bitcoins are the change.  Of course, it differs a little from the change you might receive in a store, since change in this case is what you pay yourself.  But the broad idea is similar.</p>
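<p>A sketch of the change-making arithmetic, using hypothetical addresses and amounts in satoshis:</p>

```python
def pay_with_change(available, amount, recipient, change_address):
    """Split a previously received output into a payment plus change to myself."""
    if amount > available:
        raise ValueError("insufficient funds")
    outputs = [{"value": amount, "address": recipient}]
    change = available - amount
    if change:
        outputs.append({"value": change, "address": change_address})
    return outputs

# 0.2 bitcoins received earlier; pay 0.15 and send 0.05 back to myself
outs = pay_with_change(20_000_000, 15_000_000, "your-address", "my-change-address")
print(outs)
```

A real wallet would also leave a small amount unassigned as the transaction fee; that detail is omitted here.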
<h3>Conclusion</h3>
<p>That completes a basic description of the main ideas behind Bitcoin. Of course, I&#8217;ve omitted many details &#8211; this isn&#8217;t a formal specification.  But I have described the main ideas behind the most common use cases for Bitcoin.</p>
<p>While the rules of Bitcoin are simple and easy to understand, that doesn&#8217;t mean that it&#8217;s easy to understand all the consequences of the rules.  There is vastly more that could be said about Bitcoin, and I&#8217;ll investigate some of these issues in future posts.</p>
<p>For now, though, I&#8217;ll wrap up by addressing a few loose ends.</p>
<p><strong>How anonymous is Bitcoin?</strong> Many people claim that Bitcoin can be used anonymously.  This claim has led to the formation of marketplaces such as <a href="http://en.wikipedia.org/wiki/Silk_Road_(marketplace)">Silk   Road</a> (and various successors), which specialize in illegal goods. However, the claim that Bitcoin is anonymous is a myth.  The block chain is public, meaning that it&#8217;s possible for anyone to see every Bitcoin transaction ever.  Although Bitcoin addresses aren&#8217;t immediately associated to real-world identities, computer scientists have done a <a href="http://scholar.google.com/scholar?q=de-anonymization">great deal   of work</a> figuring out how to de-anonymize &#8220;anonymous&#8221; social networks.  The block chain is a marvellous target for these techniques.  I will be extremely surprised if the great majority of Bitcoin users are not identified with relatively high confidence and ease in the near future.  The confidence won&#8217;t be high enough to achieve convictions, but will be high enough to identify likely targets.  Furthermore, identification will be retrospective, meaning that someone who bought drugs on Silk Road in 2011 will still be identifiable on the basis of the block chain in, say, 2020.  These de-anonymization techniques are well known to computer scientists, and, one presumes, therefore to the NSA.  I would not be at all surprised if the NSA and other agencies have already de-anonymized many users.  It is, in fact, ironic that Bitcoin is often touted as anonymous.  It&#8217;s not.  Bitcoin is, instead, perhaps the most open and transparent financial instrument the world has ever seen.</p>
<p><strong>Can you get rich with Bitcoin?</strong> Well, maybe.  Tim O&#8217;Reilly <a href="http://radar.oreilly.com/2006/05/my-commencement-speech-at-sims.html">once   said</a>: &#8220;Money is like gas in the car &#8211; you need to pay attention or you’ll end up on the side of the road &#8211; but a well-lived life is not a tour of gas stations!&#8221;  Much of the interest in Bitcoin comes from people whose life mission seems to be to find a <em>really big</em> gas station.  I must admit I find this perplexing.  What is, I believe, much more interesting and enjoyable is to think of Bitcoin and other cryptocurrencies as a way of enabling new forms of collective behaviour.  That&#8217;s intellectually fascinating, offers marvellous creative possibilities, is socially valuable, and may just also put some money in the bank.  But if money in the bank is your primary concern, then I believe that other strategies are much more likely to succeed.</p>
<p><strong>Details I&#8217;ve omitted:</strong> Although this post has described the main ideas behind Bitcoin, there are many details I haven&#8217;t mentioned. One is a nice space-saving trick used by the protocol, based on a data structure known as a <a href="http://en.wikipedia.org/wiki/Merkle_tree">Merkle tree</a>.  It&#8217;s a detail, but a splendid detail, and worth checking out if fun data structures are your thing.  You can get an overview in the <a href="http://bitcoin.org/bitcoin.pdf">original Bitcoin paper</a>. Second, I&#8217;ve said little about the <a href="https://en.bitcoin.it/wiki/Network">Bitcoin network</a> &#8211; questions like how the network deals with denial of service attacks, how nodes <a href="https://en.bitcoin.it/wiki/Satoshi_Client_Node_Discovery">join   and leave the network</a>, and so on.  This is a fascinating topic, but it&#8217;s also something of a mess of details, and so I&#8217;ve omitted it.  You can read more about it at some of the links above.</p>
<p><strong>Bitcoin scripting:</strong> In this post I&#8217;ve explained Bitcoin as a form of digital, online money.  But this is only a small part of a much bigger and more interesting story.  As we&#8217;ve seen, every Bitcoin transaction is associated to a script in the Bitcoin programming language.  The scripts we&#8217;ve seen in this post describe simple transactions like &#8220;Alice gave Bob 10 bitcoins&#8221;.  But the scripting language can also be used to express far more complicated transactions.  To put it another way, Bitcoin is <em>programmable   money</em>. In later posts I will explain the scripting system, and how it is possible to use Bitcoin scripting as a platform to experiment with all sorts of amazing financial instruments.</p>
<p><em>Thanks for reading.  Enjoy the essay?  You can tip me with Bitcoin (!) at address: 17ukkKt1bNLAqdJ1QQv8v9Askr6vy3MzTZ.  You may also   enjoy the   <a href="http://neuralnetworksanddeeplearning.com/chap1.html">first     chapter</a> of my forthcoming book on neural networks and deep   learning, and may wish to   <a href="https://twitter.com/michael_nielsen">follow me on Twitter</a>.</em></p>
<h3>Footnote</h3>
<p>[1] In the United States the question &#8220;Is money a form of speech?&#8221; is an important legal question, because of the protection afforded speech under the US Constitution.  In my (legally uninformed) opinion digital money may make this issue more complicated.  As we&#8217;ll see, the Bitcoin protocol is really a way of standing up before the rest of the world (or at least the rest of the Bitcoin network) and avowing &#8220;I&#8217;m going to give such-and-such a number of bitcoins to so-and-so a person&#8221; in a way that&#8217;s extremely difficult to repudiate.  At least naively, it looks more like speech than exchanging copper coins, say.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://michaelnielsen.org/ddi/how-the-bitcoin-protocol-actually-works/feed/</wfw:commentRss>
			<slash:comments>259</slash:comments>
		
		
			</item>
		<item>
		<title>Neural Networks and Deep Learning: first chapter now live</title>
		<link>https://michaelnielsen.org/ddi/neural-networks-and-deep-learning-first-chapter-now-live/</link>
					<comments>https://michaelnielsen.org/ddi/neural-networks-and-deep-learning-first-chapter-now-live/#comments</comments>
		
		<dc:creator><![CDATA[Michael Nielsen]]></dc:creator>
		<pubDate>Mon, 25 Nov 2013 15:03:37 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://michaelnielsen.org/ddi/?p=82</guid>

					<description><![CDATA[I am delighted to announce that the first chapter of my book &#8220;Neural Networks and Deep Learning&#8221; is now freely available online here. The chapter explains the basic ideas behind neural networks, including how they learn. I show how powerful these ideas are by writing a short program which uses neural networks to solve a&#8230; <a class="more-link" href="https://michaelnielsen.org/ddi/neural-networks-and-deep-learning-first-chapter-now-live/">Continue reading <span class="screen-reader-text">Neural Networks and Deep Learning: first chapter now live</span></a>]]></description>
										<content:encoded><![CDATA[<p>I am delighted to announce that the first chapter of my book &#8220;Neural Networks and Deep Learning&#8221; is now freely available online <a href="http://neuralnetworksanddeeplearning.com/chap1.html">here</a>.</p>
<p>The chapter explains the basic ideas behind neural networks, including how they learn.  I show how powerful these ideas are by writing a short program which uses neural networks to solve a hard problem &#8212; recognizing handwritten digits.  The chapter also takes a brief look at how deep learning works.</p>
<p>The book&#8217;s <a href="http://neuralnetworksanddeeplearning.com/">landing page</a> gives a broader view on the book.  And I&#8217;ve written a more <a href="http://neuralnetworksanddeeplearning.com/about.html">in-depth discussion</a> of the philosophy behind the book.</p>
<p>Finally, if you&#8217;ve read this far I hope you&#8217;ll consider supporting my <a href="http://www.indiegogo.com/projects/neural-networks-and-deep-learning-book-project/">Indiegogo campaign</a> for the book, which will give you access to perks like early drafts of later chapters.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://michaelnielsen.org/ddi/neural-networks-and-deep-learning-first-chapter-now-live/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Why Bloom filters work the way they do</title>
		<link>https://michaelnielsen.org/ddi/why-bloom-filters-work-the-way-they-do/</link>
					<comments>https://michaelnielsen.org/ddi/why-bloom-filters-work-the-way-they-do/#comments</comments>
		
		<dc:creator><![CDATA[Michael Nielsen]]></dc:creator>
		<pubDate>Wed, 26 Sep 2012 22:21:24 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://michaelnielsen.org/ddi/?p=75</guid>

					<description><![CDATA[Imagine you&#8217;re a programmer who is developing a new web browser. There are many malicious sites on the web, and you want your browser to warn users when they attempt to access dangerous sites. For example, suppose the user attempts to access http://domain/etc. You&#8217;d like a way of checking whether domain is known to be&#8230; <a class="more-link" href="https://michaelnielsen.org/ddi/why-bloom-filters-work-the-way-they-do/">Continue reading <span class="screen-reader-text">Why Bloom filters work the way they do</span></a>]]></description>
										<content:encoded><![CDATA[<p>Imagine you&#8217;re a programmer who is developing a new web browser. There are many malicious sites on the web, and you want your browser to warn users when they attempt to access dangerous sites. For example, suppose the user attempts to access <tt>http://domain/etc</tt>. You&#8217;d like a way of checking whether <tt>domain</tt> is known to be a malicious site.  What&#8217;s a good way of doing this?</p>
<p>An obvious naive way is for your browser to maintain a list or set data structure containing all known malicious domains. A problem with this approach is that it may consume a considerable amount of memory. If you know of a million malicious domains, and domains need (say) an average of 20 bytes to store, then you need 20 megabytes of storage. That&#8217;s quite an overhead for a single feature in your web browser.  Is there a better way?</p>
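<p>Here&#8217;s the naive approach as a minimal Python sketch (the domain names are invented for illustration):</p>

```python
# Naive approach: store every known malicious domain outright.
# The domain names are made-up examples.
malicious = {"evil-example.com", "phishy-example.net"}

def is_malicious(domain):
    return domain in malicious

# A million domains at ~20 bytes each costs ~20 megabytes for the
# strings alone, before any set overhead.
```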
<p>In this post I&#8217;ll describe a data structure which provides an excellent way of solving this kind of problem. The data structure is known as a <a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a>.  Bloom filters are much more memory efficient than the naive &#8220;store-everything&#8221; approach, while remaining extremely fast.  I&#8217;ll describe both how Bloom filters work, and also some extensions of Bloom filters to solve more general problems.</p>
<p>Most explanations of Bloom filters cut to the chase, quickly explaining the detailed mechanics of how Bloom filters work.  Such explanations are informative, but I must admit that they made me uncomfortable when I was first learning about Bloom filters.  In particular, I didn&#8217;t feel that they helped me understand <em>why</em> Bloom filters are put together the way they are.  I couldn&#8217;t fathom the mindset that would lead someone to <em>invent</em> such a data structure.  And that left me feeling that all I had was a superficial, surface-level understanding of Bloom filters.</p>
<p>In this post I take an unusual approach to explaining Bloom filters. We <em>won&#8217;t</em> begin with a full-blown explanation.  Instead, I&#8217;ll gradually build up to the full data structure in stages.  My goal is to tell a plausible story explaining how one could invent Bloom filters from scratch, with each step along the way more or less &#8220;obvious&#8221;.  Of course, hindsight is 20-20, and such a story shouldn&#8217;t be taken too literally.  Rather, the benefit of developing Bloom filters in this way is that it will deepen our understanding of why Bloom filters work in just the way they do.  We&#8217;ll explore some alternative directions that plausibly <em>could</em> have been taken &#8211; and see why they don&#8217;t work as well as Bloom filters ultimately turn out to work.  At the end we&#8217;ll understand much better why Bloom filters are constructed the way they are.</p>
<p>Of course, this means that if your goal is just to understand the mechanics of Bloom filters, then this post isn&#8217;t for you.  Instead, I&#8217;d suggest looking at a more conventional introduction &#8211; the <a href="http://en.wikipedia.org/wiki/Bloom_filter">Wikipedia article</a>, for example, perhaps in conjunction with an interactive demo, like the nice one <a href="http://www.jasondavies.com/bloomfilter/">here</a>.  But if your goal is to understand why Bloom filters work the way they do, then you may enjoy the post.</p>
<p><strong>A stylistic note:</strong> Most of my posts are code-oriented.  This post is much more focused on mathematical analysis and algebraic manipulation: the point isn&#8217;t code, but rather how one could come to invent a particular data structure.  That is, it&#8217;s the story <em>behind</em> the code that implements Bloom filters, and as such it requires rather more attention to mathematical detail.</p>
<p><strong>General description of the problem:</strong> Let&#8217;s begin by abstracting away from the &#8220;safe web browsing&#8221; problem that began this post.  We want a data structure which represents a set <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> of objects.  That data structure should enable two operations: (1) the ability to <tt>add</tt> an extra object <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> to the set; and (2) a <tt>test</tt> to determine whether a given object <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is a member of <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />.  Of course, there are many other operations we might imagine wanting &#8211; for example, maybe we&#8217;d also like to be able to <tt>delete</tt> objects from the set.  But we&#8217;re going to start with just these two operations of <tt>add</tt>ing and <tt>test</tt>ing.  Later we&#8217;ll come back and ask whether operations such as <tt>delete</tt>ing objects are also possible.</p>
<p><strong>Idea: store a set of hashed objects:</strong> Okay, so how can we solve the problem of representing <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> in a way that&#8217;s more memory efficient than just storing all the objects in <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />?  One idea is to store hashed versions of the objects in <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />, instead of the full objects.  If the hash function is well chosen, then the hashed objects will take up much less memory, but there will be little danger of making errors when <tt>test</tt>ing whether an object is an element of the set or not.</p>
<p>Let&#8217;s be a little more explicit about how this would work.  We have a set <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> of objects <img src='https://s0.wp.com/latex.php?latex=x_0%2C+x_1%2C+%5Cldots%2C+x_%7B%7CS%7C-1%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_0, x_1, \ldots, x_{|S|-1}' title='x_0, x_1, \ldots, x_{|S|-1}' class='latex' />, where <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> denotes the number of objects in <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />.  For each object we compute an <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' />-bit hash function <img src='https://s0.wp.com/latex.php?latex=h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x_j)' title='h(x_j)' class='latex' /> &#8211; i.e., a hash function which takes an arbitrary object as input, and outputs <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' /> bits &#8211; and the set <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> is represented by the set <img src='https://s0.wp.com/latex.php?latex=%5C%7B+h%28x_0%29%2C+h%28x_1%29%2C+%5Cldots%2C+h%28x_%7B%7CS%7C-1%7D%29+%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\{ h(x_0), h(x_1), \ldots, h(x_{|S|-1}) \}' title='\{ h(x_0), h(x_1), \ldots, h(x_{|S|-1}) \}' class='latex' />. 
We can <tt>test</tt> whether <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is an element of <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> by checking whether <img src='https://s0.wp.com/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' /> is in the set of hashes.  This basic hashing approach requires roughly <img src='https://s0.wp.com/latex.php?latex=m+%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m |S|' title='m |S|' class='latex' /> bits of memory.</p>
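<p>To make this concrete, here&#8217;s a rough Python sketch of the idea.  I use SHA-256 truncated to its low m bits as a stand-in for a well-chosen m-bit hash function; any good hash would do:</p>

```python
import hashlib

M = 32  # m: the number of bits kept from each hash

def h(x, m=M):
    """A stand-in m-bit hash: SHA-256 truncated to its low m bits."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

class HashedSet:
    """Represent a set by storing the m-bit hashes of its members."""
    def __init__(self):
        self.hashes = set()

    def add(self, x):
        self.hashes.add(h(x))

    def test(self, x):
        # Never a false negative; false positives occur on hash collisions.
        return h(x) in self.hashes

hashed = HashedSet()
hashed.add("evil-example.com")
```

<p>The memory used is just the set of hashes &#8211; roughly m bits per stored object, in line with the estimate above.</p>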
<p>(As an aside, in principle it&#8217;s possible to store the set of hashed objects more efficiently, using just <img src='https://s0.wp.com/latex.php?latex=m%7CS%7C-%5Clog%28%7CS%7C%21%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m|S|-\log(|S|!)' title='m|S|-\log(|S|!)' class='latex' /> bits, where <img src='https://s0.wp.com/latex.php?latex=%5Clog&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\log' title='\log' class='latex' /> is to base two.  The <img src='https://s0.wp.com/latex.php?latex=-%5Clog%28%7CS%7C%21%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='-\log(|S|!)' title='-\log(|S|!)' class='latex' /> saving is possible because the ordering of the objects in a set is redundant information, and so in principle can be eliminated using a suitable encoding.  However, I haven&#8217;t thought through what encodings could be used to do this in practice.  In any case, the saving is likely to be minimal, since <img src='https://s0.wp.com/latex.php?latex=%5Clog%28%7CS%7C%21%29++%5Capprox+%7CS%7C+%5Clog+%7CS%7C%2C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\log(|S|!)  \approx |S| \log |S|,' title='\log(|S|!)  \approx |S| \log |S|,' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' /> will usually be quite a bit bigger than <img src='https://s0.wp.com/latex.php?latex=%5Clog+%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\log |S|' title='\log |S|' class='latex' /> &#8211; if that weren&#8217;t the case, then hash collisions would occur all the time.  So I&#8217;ll ignore the terms <img src='https://s0.wp.com/latex.php?latex=-%5Clog%28%7CS%7C%21%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='-\log(|S|!)' title='-\log(|S|!)' class='latex' /> for the rest of this post.  In fact, in general I&#8217;ll be pretty cavalier in later analyses as well, omitting lower order terms without comment.)</p>
<p>A danger with this hash-based approach is that an object <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> outside the set <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> might have the same hash value as an object inside the set, i.e., <img src='https://s0.wp.com/latex.php?latex=h%28x%29+%3D+h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) = h(x_j)' title='h(x) = h(x_j)' class='latex' /> for some <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />.  In this case, <tt>test</tt> will erroneously report that <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is in <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />.  That is, this data structure will give us a <em>false positive</em>.  Fortunately, by choosing a suitable value for <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' />, the number of bits output by the hash function, we can reduce the probability of a false positive as much as we want. To understand how this works, notice first that the probability of <tt>test</tt> giving a false positive is 1 minus the probability of <tt>test</tt> correctly reporting that <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is not in <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />. 
This occurs when <img src='https://s0.wp.com/latex.php?latex=h%28x%29+%5Cneq+h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) \neq h(x_j)' title='h(x) \neq h(x_j)' class='latex' /> for all <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />.  If the hash function is well chosen, then the probability that <img src='https://s0.wp.com/latex.php?latex=h%28x%29+%5Cneq+h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) \neq h(x_j)' title='h(x) \neq h(x_j)' class='latex' /> is <img src='https://s0.wp.com/latex.php?latex=%281-1%2F2%5Em%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(1-1/2^m)' title='(1-1/2^m)' class='latex' /> for each <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />, and these are independent events. Thus the probability of <tt>test</tt> failing is:</p>
<img src='https://s0.wp.com/latex.php?latex=+++p+%3D+1-%281-1%2F2%5Em%29%5E%7B%7CS%7C%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = 1-(1-1/2^m)^{|S|}. ' title='   p = 1-(1-1/2^m)^{|S|}. ' class='latex' />
<p>This expression involves three quantities: the probability <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> of <tt>test</tt> giving a false positive, the number <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' /> of bits output by the hash function, and the number of elements in the set, <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' />.  It&#8217;s a nice expression, but it&#8217;s more enlightening when rewritten in a slightly different form.  What we&#8217;d really like to understand is how many bits of memory are needed to represent a set of size <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' />, with probability <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> of a <tt>test</tt> failing.  To understand that we let <img src='https://s0.wp.com/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\#' title='\#' class='latex' /> be the number of bits of memory used, and aim to express <img src='https://s0.wp.com/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\#' title='\#' class='latex' /> as a function of <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' />.  
Observe that <img src='https://s0.wp.com/latex.php?latex=%5C%23+%3D+%7CS%7Cm&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\# = |S|m' title='\# = |S|m' class='latex' />, and so we can substitute for <img src='https://s0.wp.com/latex.php?latex=m+%3D+%5C%23%2F%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m = \#/|S|' title='m = \#/|S|' class='latex' /> to obtain</p>
<img src='https://s0.wp.com/latex.php?latex=+++p+%3D+1-%5Cleft%281-1%2F2%5E%7B%5C%23%2F%7CS%7C%7D%5Cright%29%5E%7B%7CS%7C%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = 1-\left(1-1/2^{\#/|S|}\right)^{|S|}. ' title='   p = 1-\left(1-1/2^{\#/|S|}\right)^{|S|}. ' class='latex' />
<p>This can be rearranged to express <img src='https://s0.wp.com/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\#' title='\#' class='latex' /> in terms of <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' />:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%3D+%7CS%7C%5Clog+%5Cfrac%7B1%7D%7B1-%281-p%29%5E%7B1%2F%7CS%7C%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# = |S|\log \frac{1}{1-(1-p)^{1/|S|}}. ' title='   \# = |S|\log \frac{1}{1-(1-p)^{1/|S|}}. ' class='latex' />
<p>This expression answers the question we really want answered, telling us how many bits are required to store a set of size <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> with a probability <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> of a <tt>test</tt> failing.  Of course, in practice we&#8217;d like <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> to be small &#8211; say <img src='https://s0.wp.com/latex.php?latex=p+%3D+0.01&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p = 0.01' title='p = 0.01' class='latex' /> &#8211; and when this is the case the expression may be approximated by a more transparent expression:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%5Capprox+%7CS%7C%5Clog+%5Cfrac%7B%7CS%7C%7D%7Bp%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# \approx |S|\log \frac{|S|}{p}. ' title='   \# \approx |S|\log \frac{|S|}{p}. ' class='latex' />
<p>This makes intuitive sense: <tt>test</tt> failure occurs when <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is not in <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />, but <img src='https://s0.wp.com/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' /> is in the hashed version of <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />.  Because this happens with probability <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />, it must be that <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> occupies a fraction <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> of the total space of possible hash outputs.  And so the size of the space of all possible hash outputs must be about <img src='https://s0.wp.com/latex.php?latex=%7CS%7C%2Fp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|/p' title='|S|/p' class='latex' />.  As a consequence we need <img src='https://s0.wp.com/latex.php?latex=%5Clog%28%7CS%7C%2Fp%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\log(|S|/p)' title='\log(|S|/p)' class='latex' /> bits to represent each hashed object, in agreement with the expression above.  </p>
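<p>It&#8217;s easy to check these expressions numerically.  Here&#8217;s a quick sanity check in Python, with illustrative values of my own choosing:</p>

```python
from math import log2

S = 10 ** 6   # |S|: number of objects stored
m = 40        # bits output by the hash function

# Exact false-positive probability: p = 1 - (1 - 1/2^m)^|S|
p = 1 - (1 - 1 / 2 ** m) ** S

exact_bits = m * S                # memory actually used: # = m|S|
approx_bits = S * log2(S / p)     # the approximation # ≈ |S| log(|S|/p)

print(p, exact_bits, approx_bits)
```

<p>For these values the approximation agrees with the exact memory count to well within a percent.</p>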
<p>How memory efficient is this hash-based approach to representing <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />? It&#8217;s obviously likely to be quite a bit better than storing full representations of the objects in <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />.  But we&#8217;ll see later that Bloom filters can be far more memory efficient still.</p>
<p>The big drawback of this hash-based approach is the false positives. Still, for many applications it&#8217;s fine to have a small probability of a false positive.  For example, false positives turn out to be okay for the safe web browsing problem.  You might worry that false positives would cause some safe sites to erroneously be reported as unsafe, but the browser can avoid this by maintaining a (small!)  list of safe sites which are false positives for <tt>test</tt>.</p>
<p><strong>Idea: use a bit array:</strong> Suppose we want to represent some subset <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> of the integers <img src='https://s0.wp.com/latex.php?latex=0%2C+1%2C+2%2C+%5Cldots%2C+999&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0, 1, 2, \ldots, 999' title='0, 1, 2, \ldots, 999' class='latex' />.  As an alternative to hashing or to storing <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> directly, we could represent <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> using an array of <img src='https://s0.wp.com/latex.php?latex=1000&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1000' title='1000' class='latex' /> bits, numbered <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> through <img src='https://s0.wp.com/latex.php?latex=999&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='999' title='999' class='latex' />.  We would set bits in the array to <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' /> if the corresponding number is in <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />, and otherwise set them to <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' />.  It&#8217;s obviously trivial to <tt>add</tt> objects to <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />, and to <tt>test</tt> whether a particular object is in <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> or not.</p>
<p>The memory cost to store <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> in this bit-array approach is <img src='https://s0.wp.com/latex.php?latex=1000&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1000' title='1000' class='latex' /> bits, regardless of how big or small <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> is.  Suppose, for comparison, that we stored <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> directly as a list of 32-bit integers. Then the cost would be <img src='https://s0.wp.com/latex.php?latex=32+%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='32 |S|' title='32 |S|' class='latex' /> bits.  When <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> is very small, this approach would be more memory efficient than using a bit array.  But as <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> gets larger, storing <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> directly becomes much less memory efficient.  We could ameliorate this somewhat by storing elements of <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> using only 10 bits, instead of 32 bits.  But even if we did this, it would still be more expensive to store the list once <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> got beyond one hundred elements.  So a bit array really would be better for modestly large subsets.</p>
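<p>Here&#8217;s the bit-array approach in a few lines of Python:</p>

```python
class BitArraySet:
    """Represent a subset of {0, ..., n-1} as an n-bit array."""
    def __init__(self, n=1000):
        self.bits = bytearray((n + 7) // 8)

    def add(self, j):
        self.bits[j // 8] |= 1 << (j % 8)

    def test(self, j):
        return bool(self.bits[j // 8] & (1 << (j % 8)))

subset = BitArraySet()  # 1000 bits = 125 bytes, however many elements we add
subset.add(3)
subset.add(999)
```

<p>Note that the storage cost is fixed at 1000 bits, no matter how many elements we <tt>add</tt>.</p>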
<p><strong>Idea: use a bit array where the indices are given by hashes:</strong> A problem with the bit array example described above is that we needed a way of numbering the possible elements of <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />, <img src='https://s0.wp.com/latex.php?latex=0%2C1%2C%5Cldots%2C999&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0,1,\ldots,999' title='0,1,\ldots,999' class='latex' />.  In general the elements of <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> may be complicated objects, not numbers in a small, well-defined range.</p>
<p>Fortunately, we can use hashing to number the elements of <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />. Suppose <img src='https://s0.wp.com/latex.php?latex=h%28%5Ccdot%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(\cdot)' title='h(\cdot)' class='latex' /> is an <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' />-bit hash function.  We&#8217;re going to represent a set <img src='https://s0.wp.com/latex.php?latex=S+%3D+%5C%7Bx_0%2C%5Cldots%2Cx_%7B%7CS%7C-1%7D%5C%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S = \{x_0,\ldots,x_{|S|-1}\}' title='S = \{x_0,\ldots,x_{|S|-1}\}' class='latex' /> using a bit array containing <img src='https://s0.wp.com/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' /> elements.  In particular, for each <img src='https://s0.wp.com/latex.php?latex=x_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x_j' title='x_j' class='latex' /> we set the <img src='https://s0.wp.com/latex.php?latex=h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x_j)' title='h(x_j)' class='latex' />th element in the bit array, where we regard <img src='https://s0.wp.com/latex.php?latex=h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x_j)' title='h(x_j)' class='latex' /> as a number in the range <img src='https://s0.wp.com/latex.php?latex=0%2C1%2C%5Cldots%2C2%5Em-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0,1,\ldots,2^m-1' title='0,1,\ldots,2^m-1' class='latex' />.  More explicitly, we can <tt>add</tt> an element <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> to the set by setting bit number <img src='https://s0.wp.com/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' /> in the bit array.  
And we can <tt>test</tt> whether <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is an element of <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> by checking whether bit number <img src='https://s0.wp.com/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' /> in the bit array is set.</p>
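<p>Combining the two earlier sketches gives something like the following (again with truncated SHA-256 as a stand-in for an m-bit hash):</p>

```python
import hashlib

class HashBitArray:
    """A 2^m-bit array in which object x is recorded by setting bit h(x)."""
    def __init__(self, m=20):
        self.m = m
        self.bits = bytearray(2 ** m // 8)

    def _h(self, x):
        # Stand-in m-bit hash, as before: SHA-256 truncated to m bits.
        digest = hashlib.sha256(str(x).encode()).digest()
        return int.from_bytes(digest, "big") % (2 ** self.m)

    def add(self, x):
        j = self._h(x)
        self.bits[j // 8] |= 1 << (j % 8)

    def test(self, x):
        j = self._h(x)
        return bool(self.bits[j // 8] & (1 << (j % 8)))

filt = HashBitArray()        # 2^20 bits = 128 kilobytes
filt.add("evil-example.com")
```

<p>As with the hashed-set scheme, <tt>test</tt> can never give a false negative, since <tt>add</tt>ing an object always sets its bit.</p>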
<p>This is a good scheme, but the <tt>test</tt> can fail to give the correct result, which occurs whenever <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is not an element of <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' />, yet <img src='https://s0.wp.com/latex.php?latex=h%28x%29+%3D+h%28x_j%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) = h(x_j)' title='h(x) = h(x_j)' class='latex' /> for some <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />.  This is exactly the same failure condition as for the basic hashing scheme we described earlier.  By exactly the same reasoning as used then, the failure probability is</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5B%2A%5D+%5C%2C%5C%2C%5C%2C%5C%2C+p+%3D+1-%281-1%2F2%5Em%29%5E%7B%7CS%7C%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   [*] \,\,\,\, p = 1-(1-1/2^m)^{|S|}. ' title='   [*] \,\,\,\, p = 1-(1-1/2^m)^{|S|}. ' class='latex' />
<p>As we did earlier, we&#8217;d like to re-express this in terms of the number of bits of memory used, <img src='https://s0.wp.com/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\#' title='\#' class='latex' />.  This works differently than for the basic hashing scheme, since the number of bits of memory consumed by the current approach is <img src='https://s0.wp.com/latex.php?latex=%5C%23+%3D+2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\# = 2^m' title='\# = 2^m' class='latex' />, as compared to <img src='https://s0.wp.com/latex.php?latex=%5C%23+%3D+%7CS%7Cm&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\# = |S|m' title='\# = |S|m' class='latex' /> for the earlier scheme.  Using <img src='https://s0.wp.com/latex.php?latex=%5C%23+%3D+2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\# = 2^m' title='\# = 2^m' class='latex' /> and substituting for <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' /> in Equation [*], we have:</p>
<img src='https://s0.wp.com/latex.php?latex=+++p+%3D+1-%281-1%2F%5C%23%29%5E%7B%7CS%7C%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = 1-(1-1/\#)^{|S|}. ' title='   p = 1-(1-1/\#)^{|S|}. ' class='latex' />
<p>Rearranging this to express <img src='https://s0.wp.com/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\#' title='\#' class='latex' /> in term of <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> we obtain:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%3D+%5Cfrac%7B1%7D%7B1-%281-p%29%5E%7B1%2F%7CS%7C%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# = \frac{1}{1-(1-p)^{1/|S|}}. ' title='   \# = \frac{1}{1-(1-p)^{1/|S|}}. ' class='latex' />
<p>When <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> is small this can be approximated by</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%5Capprox+%5Cfrac%7B%7CS%7C%7D%7Bp%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# \approx \frac{|S|}{p}. ' title='   \# \approx \frac{|S|}{p}. ' class='latex' />
<p>This isn&#8217;t very memory efficient!  We&#8217;d like the probability of failure <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> to be small, and that makes the <img src='https://s0.wp.com/latex.php?latex=1%2Fp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/p' title='1/p' class='latex' /> dependence bad news when compared to the <img src='https://s0.wp.com/latex.php?latex=%5Clog%28%7CS%7C%2Fp%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\log(|S|/p)' title='\log(|S|/p)' class='latex' /> dependence of the basic hashing scheme described earlier.  The only time the current approach is better is when <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> is very, very large.  To get some idea for just how large, if we want <img src='https://s0.wp.com/latex.php?latex=p+%3D+0.01&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p = 0.01' title='p = 0.01' class='latex' />, then <img src='https://s0.wp.com/latex.php?latex=1%2Fp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/p' title='1/p' class='latex' /> is only better than <img src='https://s0.wp.com/latex.php?latex=%5Clog%28%7CS%7C%2Fp%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\log(|S|/p)' title='\log(|S|/p)' class='latex' /> when <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> gets to be more than about <img src='https://s0.wp.com/latex.php?latex=1.27+%2A+10%5E%7B28%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1.27 * 10^{28}' title='1.27 * 10^{28}' class='latex' />. That&#8217;s quite a set!  In practice, the basic hashing scheme will be much more memory efficient.</p>
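<p>We can check that crossover figure directly: the per-element costs 1/p and log(|S|/p) are equal when log(|S|/p) = 1/p, i.e. when |S| = p * 2^(1/p) (base-2 logarithms, as elsewhere in the text).  A quick calculation, with variable names of our own:</p>

```python
# Crossover set size at which the single-filter cost |S|/p catches up with
# the basic-hashing cost |S| * log2(|S|/p): requires log2(|S|/p) = 1/p.
p = 0.01
crossover = p * 2 ** (1 / p)  # |S| = p * 2^(1/p)
print(f"{crossover:.3g}")     # about 1.27e28, matching the figure in the text
```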
<p>Intuitively, it&#8217;s not hard to see why this approach is so memory inefficient compared to the basic hashing scheme.  The problem is that with an <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' />-bit hash function, the basic hashing scheme used <img src='https://s0.wp.com/latex.php?latex=m%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m|S|' title='m|S|' class='latex' /> bits of memory, while hashing into a bit array uses <img src='https://s0.wp.com/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' /> bits, but doesn&#8217;t change the probability of failure.  That&#8217;s exponentially more memory!</p>
<p>At this point, hashing into bit arrays looks like a bad idea.  But it turns out that by tweaking the idea just a little we can improve it a lot.  To carry out this tweaking, it helps to name the data structure we&#8217;ve just described (where we hash into a bit array).  We&#8217;ll call it a <em>filter</em>, anticipating the fact that it&#8217;s a precursor to the Bloom filter.  I don&#8217;t know whether &#8220;filter&#8221; is a standard name, but in any case it&#8217;ll be a useful working name.</p>
<p><strong>Idea: use multiple filters:</strong> How can we make the basic filter just described more memory efficient?  One possibility is to try using multiple filters, based on independent hash functions. More precisely, the idea is to use <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> filters, each based on an (independent) <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' />-bit hash function, <img src='https://s0.wp.com/latex.php?latex=h_0%2C+h_1%2C+%5Cldots%2C+h_%7Bk-1%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0, h_1, \ldots, h_{k-1}' title='h_0, h_1, \ldots, h_{k-1}' class='latex' />.  So our data structure will consist of <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> separate bit arrays, each containing <img src='https://s0.wp.com/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' /> bits, for a grand total of <img src='https://s0.wp.com/latex.php?latex=%5C%23+%3D+k+2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\# = k 2^m' title='\# = k 2^m' class='latex' /> bits.  We can <tt>add</tt> an element <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> by setting the <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x)' title='h_0(x)' class='latex' />th bit in the first bit array (i.e., the first filter), the <img src='https://s0.wp.com/latex.php?latex=h_1%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_1(x)' title='h_1(x)' class='latex' />th bit in the second filter, and so on.  
We can <tt>test</tt> whether a candidate element <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is in the set by simply checking whether all the appropriate bits are set in each filter.  For this to fail, each individual filter must fail.  Because the hash functions are independent of one another, the probability of this is the <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' />th power of the probability that a single filter fails:</p>
<img src='https://s0.wp.com/latex.php?latex=+++p+%3D+%5Cleft%281-%281-1%2F2%5Em%29%5E%7B%7CS%7C%7D%5Cright%29%5Ek.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = \left(1-(1-1/2^m)^{|S|}\right)^k. ' title='   p = \left(1-(1-1/2^m)^{|S|}\right)^k. ' class='latex' />
<p>The number of bits of memory used by this data structure is <img src='https://s0.wp.com/latex.php?latex=%5C%23+%3D+k+2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\# = k 2^m' title='\# = k 2^m' class='latex' /> and so we can substitute <img src='https://s0.wp.com/latex.php?latex=2%5Em+%3D+%5C%23%2Fk&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m = \#/k' title='2^m = \#/k' class='latex' /> and rearrange to get</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5B%2A%2A%5D+%5C%2C%5C%2C%5C%2C%5C%2C+%5C%23+%3D+%5Cfrac%7Bk%7D%7B1-%281-p%5E%7B1%2Fk%7D%29%5E%7B1%2F%7CS%7C%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   [**] \,\,\,\, \# = \frac{k}{1-(1-p^{1/k})^{1/|S|}}. ' title='   [**] \,\,\,\, \# = \frac{k}{1-(1-p^{1/k})^{1/|S|}}. ' class='latex' />
<p>Provided <img src='https://s0.wp.com/latex.php?latex=p%5E%7B1%2Fk%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{1/k}' title='p^{1/k}' class='latex' /> is much smaller than <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' />, this expression can be simplified to give</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%5Capprox+%5Cfrac%7Bk%7CS%7C%7D%7Bp%5E%7B1%2Fk%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# \approx \frac{k|S|}{p^{1/k}}. ' title='   \# \approx \frac{k|S|}{p^{1/k}}. ' class='latex' />
<p>Good news!  This repetition strategy is much more memory efficient than a single filter, at least for small values of <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' />.  For instance, moving from <img src='https://s0.wp.com/latex.php?latex=k+%3D+1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k = 1' title='k = 1' class='latex' /> repetitions to <img src='https://s0.wp.com/latex.php?latex=k+%3D+2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k = 2' title='k = 2' class='latex' /> repetitions changes the denominator from <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> to <img src='https://s0.wp.com/latex.php?latex=%5Csqrt%7Bp%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sqrt{p}' title='\sqrt{p}' class='latex' /> &#8211; typically, a huge improvement, since <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> is very small.  And the only price paid is doubling the numerator.  So this is a big win.</p>
<p>Intuitively, and in retrospect, this result is not so surprising. Putting multiple filters in a row, the probability of error drops exponentially with the number of filters.  By contrast, in the single filter scheme, the probability of error drops only inversely with the number of bits.  (This follows from considering Equation [*] in the limit where <img src='https://s0.wp.com/latex.php?latex=1%2F2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/2^m' title='1/2^m' class='latex' /> is small.)  So using multiple filters is a good strategy.</p>
<p>Of course, a caveat to the last paragraph is that this analysis requires that <img src='https://s0.wp.com/latex.php?latex=p%5E%7B1%2Fk%7D+%5Cll+1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{1/k} \ll 1' title='p^{1/k} \ll 1' class='latex' />, which means that <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> can&#8217;t be too large before the analysis breaks down.  For larger values of <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> the analysis is somewhat more complicated.  In order to find the optimal value of <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> we&#8217;d need to figure out what value of <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> minimizes the exact expression [**] for <img src='https://s0.wp.com/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\#' title='\#' class='latex' />.  We won&#8217;t bother &#8211; at best it&#8217;d be tedious, and, as we&#8217;ll see shortly, there is in any case a better approach.</p>
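<p>The repeating-filter idea can be sketched in Python as follows.  This is our own illustration: the <tt>k</tt> &#8220;independent&#8221; hash functions are simulated by salting SHA-256 with the filter index, and the names are hypothetical:</p>

```python
import hashlib

class RepeatedFilters:
    """k independent single-hash filters in a row; `test` passes only
    if every one of the k filters reports membership."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        # k separate bit arrays of 2^m bits each, for k * 2^m bits total.
        self.arrays = [bytearray(2 ** m // 8 or 1) for _ in range(k)]

    def _h(self, i, x):
        # Salt the hash with the filter index i to simulate independence.
        digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return int.from_bytes(digest, "big") % (2 ** self.m)

    def add(self, x):
        for i, arr in enumerate(self.arrays):
            j = self._h(i, x)
            arr[j // 8] |= 1 << (j % 8)

    def test(self, x):
        for i, arr in enumerate(self.arrays):
            j = self._h(i, x)
            if not arr[j // 8] & (1 << (j % 8)):
                return False
        return True
```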
<p><strong>Overlapping filters:</strong> This is a variation on the idea of repeating filters.  Instead of having <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> separate bit arrays, we use just a single array of <img src='https://s0.wp.com/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' /> bits.  When <tt>add</tt>ing an object <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' />, we simply set all the bits <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C%5Cldots%2C+h_%7Bk-1%7D%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x),\ldots, h_{k-1}(x)' title='h_0(x), h_1(x),\ldots, h_{k-1}(x)' class='latex' /> in the same bit array.  To <tt>test</tt> whether an element <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is in the set, we simply check whether all the bits <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C%5Cldots%2C+h_%7Bk-1%7D%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x),\ldots, h_{k-1}(x)' title='h_0(x), h_1(x),\ldots, h_{k-1}(x)' class='latex' /> are set or not.</p>
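<p>In code, overlapping filters need only a small change from the repeated-filter sketch: one shared bit array instead of <tt>k</tt> separate ones.  Again a hypothetical sketch, with salted SHA-256 standing in for the independent hash functions:</p>

```python
import hashlib

class OverlappingFilters:
    """One 2^m-bit array shared by k hash functions: `add` sets bits
    h_0(x),...,h_{k-1}(x); `test` checks that they are all set."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(2 ** m // 8 or 1)  # a single 2^m-bit array

    def _hashes(self, x):
        # Yield h_0(x), ..., h_{k-1}(x), salting the hash with the index.
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(d, "big") % (2 ** self.m)

    def add(self, x):
        for j in self._hashes(x):
            self.bits[j // 8] |= 1 << (j % 8)

    def test(self, x):
        return all(self.bits[j // 8] & (1 << (j % 8)) for j in self._hashes(x))
```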
<p>What&#8217;s the probability of the <tt>test</tt> failing?  Suppose <img src='https://s0.wp.com/latex.php?latex=x+%5Cnot+%5Cin+S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x \not \in S' title='x \not \in S' class='latex' />.  Failure occurs when <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29+%3D+h_%7Bi_0%7D%28x_%7Bj_0%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x) = h_{i_0}(x_{j_0})' title='h_0(x) = h_{i_0}(x_{j_0})' class='latex' /> for some <img src='https://s0.wp.com/latex.php?latex=i_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_0' title='i_0' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=j_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_0' title='j_0' class='latex' />, and also <img src='https://s0.wp.com/latex.php?latex=h_1%28x%29+%3D+h_%7Bi_1%7D%28x_%7Bj_1%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_1(x) = h_{i_1}(x_{j_1})' title='h_1(x) = h_{i_1}(x_{j_1})' class='latex' /> for some <img src='https://s0.wp.com/latex.php?latex=i_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_1' title='i_1' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=j_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_1' title='j_1' class='latex' />, and so on for all the remaining hash functions, <img src='https://s0.wp.com/latex.php?latex=h_2%2C+h_3%2C%5Cldots%2C+h_%7Bk-1%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_2, h_3,\ldots, h_{k-1}' title='h_2, h_3,\ldots, h_{k-1}' class='latex' />.  These are independent events, and so the probability they all occur is just the product of the probabilities of the individual events.  
A little thought should convince you that each individual event will have the same probability, and so we can just focus on computing the probability that <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29+%3D+h_%7Bi_0%7D%28x_%7Bj_0%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x) = h_{i_0}(x_{j_0})' title='h_0(x) = h_{i_0}(x_{j_0})' class='latex' /> for some <img src='https://s0.wp.com/latex.php?latex=i_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_0' title='i_0' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=j_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_0' title='j_0' class='latex' />.  The overall probability <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> of failure will then be the <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' />th power of that probability, i.e.,</p>
<img src='https://s0.wp.com/latex.php?latex=+++p+%3D+p%28h_0%28x%29+%3D+h_%7Bi_0%7D%28x_%7Bj_0%7D%29+%5Cmbox%7B+for+some+%7D+i_0%2Cj_0%29%5Ek+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = p(h_0(x) = h_{i_0}(x_{j_0}) \mbox{ for some } i_0,j_0)^k ' title='   p = p(h_0(x) = h_{i_0}(x_{j_0}) \mbox{ for some } i_0,j_0)^k ' class='latex' />
<p>The probability that <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29+%3D+h_%7Bi_0%7D%28x_%7Bj_0%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x) = h_{i_0}(x_{j_0})' title='h_0(x) = h_{i_0}(x_{j_0})' class='latex' /> for some <img src='https://s0.wp.com/latex.php?latex=i_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_0' title='i_0' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=j_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_0' title='j_0' class='latex' /> is one minus the probability that <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29+%5Cneq+h_%7Bi_0%7D%28x_%7Bj_0%7D%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x) \neq h_{i_0}(x_{j_0})' title='h_0(x) \neq h_{i_0}(x_{j_0})' class='latex' /> for all <img src='https://s0.wp.com/latex.php?latex=i_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_0' title='i_0' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=j_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_0' title='j_0' class='latex' />.  These are independent events for the different possible values of <img src='https://s0.wp.com/latex.php?latex=i_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='i_0' title='i_0' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=j_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j_0' title='j_0' class='latex' />, each with probability <img src='https://s0.wp.com/latex.php?latex=1-1%2F2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1-1/2^m' title='1-1/2^m' class='latex' />, and so</p>
<img src='https://s0.wp.com/latex.php?latex=+++p%28h_0%28x%29+%3D+h_%7Bi_0%7D%28x_%7Bj_0%7D%29+%5Cmbox%7B+for+some+%7D+i_0%2Cj_0%29+%3D+1-%281-1%2F2%5Em+%29%5E%7Bk%7CS%7C%7D%2C+++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p(h_0(x) = h_{i_0}(x_{j_0}) \mbox{ for some } i_0,j_0) = 1-(1-1/2^m )^{k|S|},   ' title='   p(h_0(x) = h_{i_0}(x_{j_0}) \mbox{ for some } i_0,j_0) = 1-(1-1/2^m )^{k|S|},   ' class='latex' />
<p>since there are <img src='https://s0.wp.com/latex.php?latex=k%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k|S|' title='k|S|' class='latex' /> different pairs of possible values <img src='https://s0.wp.com/latex.php?latex=%28i_0%2C+j_0%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(i_0, j_0)' title='(i_0, j_0)' class='latex' />.  It follows that</p>
<img src='https://s0.wp.com/latex.php?latex=+++p+%3D+%5Cleft%281-%281-1%2F2%5Em+%29%5E%7Bk%7CS%7C%7D%5Cright%29%5Ek.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = \left(1-(1-1/2^m )^{k|S|}\right)^k. ' title='   p = \left(1-(1-1/2^m )^{k|S|}\right)^k. ' class='latex' />
<p>Substituting <img src='https://s0.wp.com/latex.php?latex=2%5Em+%3D+%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m = \#' title='2^m = \#' class='latex' /> we obtain</p>
<img src='https://s0.wp.com/latex.php?latex=+++p+%3D+%5Cleft%281-%281-1%2F%5C%23+%29%5E%7Bk%7CS%7C%7D%5Cright%29%5Ek+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = \left(1-(1-1/\# )^{k|S|}\right)^k ' title='   p = \left(1-(1-1/\# )^{k|S|}\right)^k ' class='latex' />
<p>which can be rearranged to obtain</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%3D+%5Cfrac%7B1%7D%7B1-%281-p%5E%7B1%2Fk%7D%29%5E%7B1%2Fk%7CS%7C%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# = \frac{1}{1-(1-p^{1/k})^{1/k|S|}}. ' title='   \# = \frac{1}{1-(1-p^{1/k})^{1/k|S|}}. ' class='latex' />
<p>This is remarkably similar to the expression [**] derived above for repeating filters.  In fact, provided <img src='https://s0.wp.com/latex.php?latex=p%5E%7B1%2Fk%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{1/k}' title='p^{1/k}' class='latex' /> is much smaller than <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' />, we get</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%5Capprox+%5Cfrac%7Bk%7CS%7C%7D%7Bp%5E%7B1%2Fk%7D%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# \approx \frac{k|S|}{p^{1/k}}, ' title='   \# \approx \frac{k|S|}{p^{1/k}}, ' class='latex' />
<p>which is exactly the same as [**] when <img src='https://s0.wp.com/latex.php?latex=p%5E%7B1%2Fk%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p^{1/k}' title='p^{1/k}' class='latex' /> is small.  So this approach gives quite similar outcomes to the repeating filter strategy.</p>
<p>Which approach is better, repeating or overlapping filters?  In fact, it can be shown that</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5Cfrac%7B1%7D%7B1-%281-p%5E%7B1%2Fk%7D%29%5E%7B1%2Fk%7CS%7C%7D%7D+%5Cleq+%5Cfrac%7Bk%7D%7B1-%281-p%5E%7B1%2Fk%7D%29%5E%7B1%2F%7CS%7C%7D%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \frac{1}{1-(1-p^{1/k})^{1/k|S|}} \leq \frac{k}{1-(1-p^{1/k})^{1/|S|}}, ' title='   \frac{1}{1-(1-p^{1/k})^{1/k|S|}} \leq \frac{k}{1-(1-p^{1/k})^{1/|S|}}, ' class='latex' />
<p>and so the overlapping filter strategy is more memory efficient than the repeating filter strategy.  I won&#8217;t prove the inequality here &#8211; it&#8217;s a straightforward (albeit tedious) exercise in calculus.  The important takeaway is that overlapping filters are the more memory-efficient approach.</p>
<p>How do overlapping filters compare to our first approach, the basic hashing strategy?  I&#8217;ll defer a full answer until later, but we can get some insight by choosing <img src='https://s0.wp.com/latex.php?latex=p+%3D+0.0001&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p = 0.0001' title='p = 0.0001' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=k%3D4&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k=4' title='k=4' class='latex' />.  Then for the overlapping filter we get <img src='https://s0.wp.com/latex.php?latex=%5C%23+%5Capprox+40%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\# \approx 40|S|' title='\# \approx 40|S|' class='latex' />, while the basic hashing strategy gives <img src='https://s0.wp.com/latex.php?latex=%5C%23+%3D+%7CS%7C+%5Clog%28+10000+%7CS%7C%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\# = |S| \log( 10000 |S|)' title='\# = |S| \log( 10000 |S|)' class='latex' />. Basic hashing is worse whenever <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> is more than about 100 million &#8211; a big number, but also a big improvement over the <img src='https://s0.wp.com/latex.php?latex=1.27+%2A+10%5E%7B28%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1.27 * 10^{28}' title='1.27 * 10^{28}' class='latex' /> required by a single filter.  Given that we haven&#8217;t yet made any attempt to optimize <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' />, this ought to encourage us that we&#8217;re onto something.</p>
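<p>A quick numeric check of this comparison (the function and variable names are our own):</p>

```python
import math

# Memory cost in bits of each scheme for a set of size s, using the
# parameters from the text: p = 0.0001 and k = 4.
P, K = 1e-4, 4

def overlapping_bits(s):
    return K * s / P ** (1 / K)   # k|S| / p^(1/k), which is 40 * s here

def basic_hashing_bits(s):
    return s * math.log2(s / P)   # |S| * log2(10000 |S|)

# Basic hashing wins below the ~1e8 crossover; overlapping filters win above.
print(basic_hashing_bits(1e7) < overlapping_bits(1e7))  # True
print(basic_hashing_bits(1e9) > overlapping_bits(1e9))  # True
```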
<h3>Problems for the author</h3>
<ul>
<li>I suspect that there&#8217;s a simple intuitive argument that would let us see upfront that overlapping filters will be more memory efficient than repeating filters.  Can I find such an argument?</li>
</ul>
<p><strong>Bloom filters:</strong> We&#8217;re finally ready for Bloom filters.  In fact, Bloom filters involve only a few small changes to overlapping filters.  In describing overlapping filters we hashed into a bit array containing <img src='https://s0.wp.com/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' /> bits.  We could, instead, have used hash functions with a range <img src='https://s0.wp.com/latex.php?latex=0%2C%5Cldots%2CM-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0,\ldots,M-1' title='0,\ldots,M-1' class='latex' /> and hashed into a bit array of <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> (instead of <img src='https://s0.wp.com/latex.php?latex=2%5Em&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2^m' title='2^m' class='latex' />) bits. The analysis goes through unchanged if we do this, and we end up with</p>
<img src='https://s0.wp.com/latex.php?latex=+++p+%3D+%5Cleft%281-%281-1%2F%5C%23+%29%5E%7Bk%7CS%7C%7D%5Cright%29%5Ek+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   p = \left(1-(1-1/\# )^{k|S|}\right)^k ' title='   p = \left(1-(1-1/\# )^{k|S|}\right)^k ' class='latex' />
<p>and</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%3D+%5Cfrac%7B1%7D%7B1-%281-p%5E%7B1%2Fk%7D%29%5E%7B1%2Fk%7CS%7C%7D%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# = \frac{1}{1-(1-p^{1/k})^{1/k|S|}}, ' title='   \# = \frac{1}{1-(1-p^{1/k})^{1/k|S|}}, ' class='latex' />
<p>exactly as before. The only reason I didn&#8217;t do this earlier is because in deriving Equation [*] above it was convenient to re-use the reasoning from the basic hashing scheme, where <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' /> (not <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' />) was the convenient parameter to use.  But the exact same reasoning works.</p>
<p>What&#8217;s the best value of <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> to choose?  Put another way, what value of <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> should we choose in order to minimize the number of bits, <img src='https://s0.wp.com/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\#' title='\#' class='latex' />, given a particular value for the probability of error, <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />, and a particular size <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' />?  Equivalently, what value of <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> will minimize <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />, given <img src='https://s0.wp.com/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\#' title='\#' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' />?  I won&#8217;t go through the full analysis here, but with calculus and some algebra you can show that choosing</p>
<img src='https://s0.wp.com/latex.php?latex=+++k+%5Capprox+%5Cfrac%7B%5C%23%7D%7B%7CS%7C%7D+%5Cln+2+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   k \approx \frac{\#}{|S|} \ln 2 ' title='   k \approx \frac{\#}{|S|} \ln 2 ' class='latex' />
<p>minimizes the probability <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />.  (Note that <img src='https://s0.wp.com/latex.php?latex=%5Cln&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\ln' title='\ln' class='latex' /> denotes the natural logarithm, not logarithms to base 2.)  By choosing <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> in this way we get:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5B%2A%2A%2A%5D+%5C%2C%5C%2C%5C%2C%5C%2C+%5C%23+%3D+%5Cfrac%7B%7CS%7C%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   [***] \,\,\,\, \# = \frac{|S|}{\ln 2} \log \frac{1}{p}. ' title='   [***] \,\,\,\, \# = \frac{|S|}{\ln 2} \log \frac{1}{p}. ' class='latex' />
<p>This really is good news!  Not only is it better than a bit array, it&#8217;s actually (usually) much better than the basic hashing scheme we began with.  In particular, it will be better whenever</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5Cfrac%7B1%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D+%5Cleq+%5Clog+%5Cfrac%7B%7CS%7C%7D%7Bp%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \frac{1}{\ln 2} \log \frac{1}{p} \leq \log \frac{|S|}{p}, ' title='   \frac{1}{\ln 2} \log \frac{1}{p} \leq \log \frac{|S|}{p}, ' class='latex' />
<p>which is equivalent to requiring</p>
<img src='https://s0.wp.com/latex.php?latex=+++%7CS%7C+%5Cgeq+p%5E%7B1-1%2F%5Cln+2%7D+%5Capprox+%5Cfrac%7B1%7D%7Bp%5E%7B0.44%7D%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   |S| \geq p^{1-1/\ln 2} \approx \frac{1}{p^{0.44}}. ' title='   |S| \geq p^{1-1/\ln 2} \approx \frac{1}{p^{0.44}}. ' class='latex' />
<p>If we want (say) <img src='https://s0.wp.com/latex.php?latex=p%3D0.01&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p=0.01' title='p=0.01' class='latex' />, this means that a Bloom filter will be better whenever <img src='https://s0.wp.com/latex.php?latex=%7CS%7C+%5Cgeq+8&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S| \geq 8' title='|S| \geq 8' class='latex' />, which is obviously an extremely modest set size.</p>
<p>Another way of interpreting [***] is that a Bloom filter requires <img src='https://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D+%5Capprox+1.44+%5Clog+%5Cfrac%7B1%7D%7Bp%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\frac{1}{\ln 2} \log \frac{1}{p} \approx 1.44 \log \frac{1}{p}' title='\frac{1}{\ln 2} \log \frac{1}{p} \approx 1.44 \log \frac{1}{p}' class='latex' /> bits per element of the set being represented.  In fact, it&#8217;s possible to prove that any data structure supporting the <tt>add</tt> and <tt>test</tt> operations will require at least <img src='https://s0.wp.com/latex.php?latex=%5Clog+%5Cfrac%7B1%7D%7Bp%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\log \frac{1}{p}' title='\log \frac{1}{p}' class='latex' /> bits per element in the set.  This means that Bloom filters are near-optimal.  Further work has been done finding even more memory-efficient data structures that actually meet the <img src='https://s0.wp.com/latex.php?latex=%5Clog+%5Cfrac%7B1%7D%7Bp%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\log \frac{1}{p}' title='\log \frac{1}{p}' class='latex' /> bound.  See, for example, the paper by <a href="http://scholar.google.ca/scholar?cluster=13031359803369786500&#038;hl=en&#038;as_sdt=0,5">Anna Pagh, Rasmus Pagh, and S. Srinivasa Rao</a>.</p>
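<p>As a sanity check on these expressions, for p = 0.01 the Bloom filter&#8217;s bits-per-element works out to about 9.6, against a lower bound of about 6.6 (assuming base-2 logarithms, as in the text; variable names are our own):</p>

```python
import math

# Bits per element: a Bloom filter uses (1/ln 2) * log2(1/p), versus the
# log2(1/p) lower bound -- a factor of 1/ln 2, about 1.44, from optimal.
p = 0.01
bloom_bits_per_element = math.log2(1 / p) / math.log(2)
lower_bound = math.log2(1 / p)
print(round(bloom_bits_per_element, 2), round(lower_bound, 2))  # roughly 9.59 and 6.64
```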
<h3>Problems for the author</h3>
<ul>
<li>Are the more memory-efficient algorithms practical?  Should we be using them?</li>
</ul>
<p>In actual applications of Bloom filters, we won&#8217;t know <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> in advance, nor <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' />. So the way we usually specify a Bloom filter is to specify the <em>maximum</em> size <img src='https://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' /> of set that we&#8217;d like to be able to represent, and the maximal probability of error, <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />, that we&#8217;re willing to tolerate.  Then we choose</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%3D+%5Cfrac%7Bn%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# = \frac{n}{\ln 2} \log \frac{1}{p} ' title='   \# = \frac{n}{\ln 2} \log \frac{1}{p} ' class='latex' />
<p>and</p>
<img src='https://s0.wp.com/latex.php?latex=+++k+%3D+%5Cln+%5Cfrac%7B1%7D%7Bp%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   k = \ln \frac{1}{p}. ' title='   k = \ln \frac{1}{p}. ' class='latex' />
<p>This gives us a Bloom filter capable of representing any set up to size <img src='https://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' />, with probability of error guaranteed to be at most <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />.  The size <img src='https://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' /> is called the <em>capacity</em> of the Bloom filter.  Actually, these expressions are slight simplifications, since the terms on the right may not be integers &#8211; to be a little more pedantic, we choose</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%3D+%5Clceil+%5Cfrac%7Bn%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D+%5Crceil+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# = \lceil \frac{n}{\ln 2} \log \frac{1}{p} \rceil ' title='   \# = \lceil \frac{n}{\ln 2} \log \frac{1}{p} \rceil ' class='latex' />
<p>and</p>
<img src='https://s0.wp.com/latex.php?latex=+++k+%3D+%5Clceil+%5Cln+%5Cfrac%7B1%7D%7Bp%7D+%5Crceil.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   k = \lceil \ln \frac{1}{p} \rceil. ' title='   k = \lceil \ln \frac{1}{p} \rceil. ' class='latex' />
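<p>To make these expressions concrete, here is a short Python sketch of the parameter choice (reading <tt>log</tt> as base 2 and <tt>ln</tt> as the natural log, and with a function name of my own invention):</p>

```python
import math

def bloom_parameters(n, p):
    """Ceilinged versions of the expressions above: the number of bits (#)
    and the number of hash functions (k) for a Bloom filter of capacity n
    and false-positive probability at most p.  log is base 2, ln natural."""
    num_bits = math.ceil(n / math.log(2) * math.log2(1 / p))
    k = math.ceil(math.log(1 / p))
    return num_bits, k
```

<p>For example, <tt>bloom_parameters(10**6, 0.01)</tt> gives roughly 9.6 million bits &#8211; about 9.6 bits per element, consistent with the 1.44 log(1/p) figure above &#8211; and <tt>k = 5</tt>.</p>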
<p>One thing that still bugs me about Bloom filters is the expression for the optimal value for <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' />.  I don&#8217;t have a good intuition for it &#8211; why is it logarithmic in <img src='https://s0.wp.com/latex.php?latex=1%2Fp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/p' title='1/p' class='latex' />, and why does it not depend on <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' />? There&#8217;s a tradeoff going on here that&#8217;s quite strange when you think about it: bit arrays on their own aren&#8217;t very good, but if you repeat or overlap them just the right number of times, then performance improves a lot.  And so you can think of Bloom filters as a kind of compromise between an overlap strategy and a bit array strategy.  But it&#8217;s really not at all obvious (a) why choosing a compromise strategy is the best; or (b) why the right point at which to compromise is where it is, i.e., why <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> has the form it does.  I can&#8217;t quite answer these questions at this point &#8211; I can&#8217;t see that far through Bloom filters.  I suspect that understanding the <img src='https://s0.wp.com/latex.php?latex=k+%3D+2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k = 2' title='k = 2' class='latex' /> case really well would help, but haven&#8217;t put in the work.  Anyone with more insight is welcome to speak up!</p>
<p><strong>Summing up Bloom filters:</strong> Let&#8217;s collect everything together. Suppose we want a Bloom filter with capacity <img src='https://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' />, i.e., capable of representing any set <img src='https://s0.wp.com/latex.php?latex=S&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S' title='S' class='latex' /> containing up to <img src='https://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' /> elements, and such that <tt>test</tt> produces a false positive with probability at most <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />. Then we choose</p>
<img src='https://s0.wp.com/latex.php?latex=+++k+%3D+%5Clceil+%5Cln+%5Cfrac%7B1%7D%7Bp%7D+%5Crceil+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   k = \lceil \ln \frac{1}{p} \rceil ' title='   k = \lceil \ln \frac{1}{p} \rceil ' class='latex' />
<p>independent hash functions, <img src='https://s0.wp.com/latex.php?latex=h_0%2C+h_1%2C+%5Cldots%2C+h_%7Bk-1%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0, h_1, \ldots, h_{k-1}' title='h_0, h_1, \ldots, h_{k-1}' class='latex' />.  Each hash function has a range <img src='https://s0.wp.com/latex.php?latex=0%2C%5Cldots%2C%5C%23-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0,\ldots,\#-1' title='0,\ldots,\#-1' class='latex' />, where <img src='https://s0.wp.com/latex.php?latex=%5C%23&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\#' title='\#' class='latex' /> is the number of bits of memory our Bloom filter requires,</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%23+%3D+%5Clceil+%5Cfrac%7Bn%7D%7B%5Cln+2%7D+%5Clog+%5Cfrac%7B1%7D%7Bp%7D+%5Crceil.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \# = \lceil \frac{n}{\ln 2} \log \frac{1}{p} \rceil. ' title='   \# = \lceil \frac{n}{\ln 2} \log \frac{1}{p} \rceil. ' class='latex' />
<p>We number the bits in our Bloom filter from <img src='https://s0.wp.com/latex.php?latex=0%2C%5Cldots%2C%5C%23-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0,\ldots,\#-1' title='0,\ldots,\#-1' class='latex' />.  To <tt>add</tt> an element <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> to our set we set the bits <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C+%5Cldots%2C+h_%7Bk-1%7D%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x), \ldots, h_{k-1}(x)' title='h_0(x), h_1(x), \ldots, h_{k-1}(x)' class='latex' /> in the filter.  And to <tt>test</tt> whether a given element <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is in the set we simply check whether bits <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C+%5Cldots%2C+h_%7Bk-1%7D%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x), \ldots, h_{k-1}(x)' title='h_0(x), h_1(x), \ldots, h_{k-1}(x)' class='latex' /> in the bit array are all set.</p>
<p>That&#8217;s all there is to the mechanics of how Bloom filters work!  I won&#8217;t give any sample code &#8211; I usually provide code samples in Python, but the Python standard library lacks bit arrays, so nearly all of the code would be concerned with defining a bit array class. That didn&#8217;t seem like it&#8217;d be terribly illuminating.  Of course, it&#8217;s not difficult to find libraries implementing Bloom filters.  For example, <a href="http://www.jasondavies.com/">Jason Davies</a> has written a javascript Bloom filter which has a fun and informative <a href="http://www.jasondavies.com/bloomfilter/">online interactive   visualisation</a>.  I recommend checking it out.  And I&#8217;ve personally used <a href="http://mike.axiak.net/">Mike Axiak</a>&#8216;s fast C-based Python library <a href="https://github.com/axiak/pybloomfiltermmap">pybloomfiltermmap</a> &#8211; the documentation is clear, it took just a few minutes to get up and running, and I&#8217;ve generally had no problems.</p>
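<p>For readers who would nonetheless like to see the mechanics in code, here is a minimal Python sketch.  The hash scheme &#8211; salting SHA-256 with the index of the hash function &#8211; and the <tt>bytearray</tt>-backed bit array are illustrative choices of mine, not details fixed by the discussion above:</p>

```python
import hashlib

class BloomFilter:
    """Minimal sketch of a Bloom filter.  The hash scheme (salted SHA-256)
    is illustrative; any k independent hash functions would do."""

    def __init__(self, num_bits, k):
        self.num_bits = num_bits
        self.k = k
        self.bits = bytearray((num_bits + 7) // 8)  # bit array, all zeros

    def _hashes(self, x):
        # Derive k hash values by salting SHA-256 with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(b"%d:" % i + x.encode()).digest()
            yield int.from_bytes(h, "big") % self.num_bits

    def add(self, x):
        for pos in self._hashes(x):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def test(self, x):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._hashes(x))
```

<p>Usage is just <tt>f = BloomFilter(num_bits, k)</tt>, then <tt>f.add(x)</tt> and <tt>f.test(x)</tt>, exactly as in the summary above.</p>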
<h3>Problems</h3>
<ul>
<li> Suppose we have two Bloom filters, corresponding to sets <img src='https://s0.wp.com/latex.php?latex=S_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S_1' title='S_1' class='latex' />   and <img src='https://s0.wp.com/latex.php?latex=S_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S_2' title='S_2' class='latex' />.  How can we construct the Bloom filters corresponding to   the sets <img src='https://s0.wp.com/latex.php?latex=S_1+%5Ccup+S_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S_1 \cup S_2' title='S_1 \cup S_2' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=S_1+%5Ccap+S_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='S_1 \cap S_2' title='S_1 \cap S_2' class='latex' />?
</ul>
<p><strong>Applications of Bloom filters:</strong> Bloom filters have been used to solve many different problems.  Here are just a few examples to give the flavour of how they can be used.  An early idea was Manber and Wu&#8217;s 1994 <a href="http://scholar.google.ca/scholar?cluster=2662095141699171069">proposal</a> to use Bloom filters to store lists of weak passwords.  Google&#8217;s <a href="http://research.google.com/archive/bigtable.html">BigTable</a> storage system uses Bloom filters to speed up queries, by avoiding disk accesses for rows or columns that don&#8217;t exist.  Google Chrome uses Bloom filters to do <a href="http://src.chromium.org/viewvc/chrome/trunk/src/chrome/browser/safe_browsing/bloom_filter.h?view=markup">safe web browsing</a> &#8211; the opening example in this post was quite real! More generally, it&#8217;s useful to consider using Bloom filters whenever a large collection of objects needs to be stored.  They&#8217;re not appropriate for all purposes, but at the least it&#8217;s worth thinking about whether or not a Bloom filter can be applied.</p>
<p><strong>Extensions of Bloom filters:</strong> There are many clever ways of extending Bloom filters.  I&#8217;ll briefly describe one, just to give you the flavour, and provide links to several more.</p>
<p><strong>A delete operation:</strong> It&#8217;s possible to modify Bloom filters so they support a <tt>delete</tt> operation that lets you remove an element from the set.  You can&#8217;t do this with a standard Bloom filter: it would require unsetting one or more of the bits <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C+%5Cldots&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x), \ldots' title='h_0(x), h_1(x), \ldots' class='latex' /> in the bit array.  This could easily lead us to accidentally <tt>delete</tt> <em>other</em> elements in the set as well.</p>
<p>Instead, the <tt>delete</tt> operation is implemented using an idea known as a <em>counting Bloom filter</em>.  The basic idea is to take a standard Bloom filter, and replace each bit in the bit array by a bucket containing several bits (usually 3 or 4 bits).  We&#8217;re going to treat those buckets as counters, initially set to <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' />.  We <tt>add</tt> an element <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> to the counting Bloom filter by <em>incrementing</em> each of the buckets numbered <img src='https://s0.wp.com/latex.php?latex=h_0%28x%29%2C+h_1%28x%29%2C+%5Cldots&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_0(x), h_1(x), \ldots' title='h_0(x), h_1(x), \ldots' class='latex' />.  We <tt>test</tt> whether <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is in the counting Bloom filter by looking to see whether each of the corresponding buckets are non-zero.  And we <tt>delete</tt> <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> by decrementing each bucket.</p>
<p>This strategy avoids the accidental deletion problem, because when two elements of the set <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' /> hash into the same bucket, the count in that bucket will be at least <img src='https://s0.wp.com/latex.php?latex=2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2' title='2' class='latex' />.  <tt>delete</tt>ing one of the elements, say <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' />, will still leave the count for the bucket at least <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' />, so <img src='https://s0.wp.com/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' /> won&#8217;t be accidentally deleted.  Of course, you could worry that this will lead us to erroneously conclude that <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is still in the set after it&#8217;s been deleted.  But that can only happen if other elements in the set hash into every single bucket that <img src='https://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> hashes into.  That will only happen if <img src='https://s0.wp.com/latex.php?latex=%7CS%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|S|' title='|S|' class='latex' /> is very large.</p>
<p>Of course, that&#8217;s just the basic idea behind counting Bloom filters. A full analysis requires us to understand issues such as bucket overflow (when a counter gets incremented too many times), the optimal size for buckets, the probability of errors, and so on. I won&#8217;t get into that here, but there are details in the further reading, below.</p>
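<p>As a rough illustration of the basic mechanics only &#8211; here overflow is simply handled by saturating the 4-bit buckets at 15, and the hash scheme is again an illustrative stand-in of mine:</p>

```python
import hashlib

class CountingBloomFilter:
    """Sketch of a counting Bloom filter: each bucket is a small counter
    (here capped at 15, i.e. 4 bits) instead of a single bit."""

    def __init__(self, num_buckets, k, cap=15):
        self.num_buckets, self.k, self.cap = num_buckets, k, cap
        self.counts = [0] * num_buckets

    def _hashes(self, x):
        for i in range(self.k):
            h = hashlib.sha256(b"%d:" % i + x.encode()).digest()
            yield int.from_bytes(h, "big") % self.num_buckets

    def add(self, x):
        for pos in self._hashes(x):
            if self.counts[pos] < self.cap:  # saturate rather than overflow
                self.counts[pos] += 1

    def test(self, x):
        return all(self.counts[pos] > 0 for pos in self._hashes(x))

    def delete(self, x):
        for pos in self._hashes(x):
            if self.counts[pos] > 0:
                self.counts[pos] -= 1
```

<p>Note that once a bucket saturates, a later <tt>delete</tt> can under-count &#8211; one of the overflow subtleties a full analysis has to deal with.</p>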
<p><strong>Other variations and further reading:</strong> There are many more variations on Bloom filters.  Just to give you the flavour of a few applications: (1) they can be modified to be used as lookup dictionaries, associating a value with each element <tt>add</tt>ed to the filter; (2) they can be modified so that the capacity scales up dynamically; and (3) they can be used to quickly approximate the number of elements in a set.  There are many more variations as well: Bloom filters have turned out to be a very generative idea!  This is part of why it&#8217;s useful to understand them deeply, since even if a standard Bloom filter can&#8217;t solve the particular problem you&#8217;re considering, it may be possible to come up with a variation which does.  You can get some idea of the scope of the known variations by looking at the <a href="http://en.wikipedia.org/wiki/Bloom_filter">Wikipedia article</a>. I also like the <a href="http://scholar.google.ca/scholar?cluster=7837240630058449829&#038;hl=en&#038;as_sdt=0,5">survey   article</a> by <a href="http://en.wikipedia.org/wiki/Andrei_Broder">Andrei Broder</a> and <a href="http://mybiasedcoin.blogspot.ca/">Michael Mitzenmacher</a>.  It&#8217;s a little more dated (2004) than the Wikipedia article, but nicely written and a good introduction.  For a shorter introduction to some variations, there&#8217;s also a recent <a href="http://matthias.vallentin.net/blog/2011/06/a-garden-variety-of-bloom-filters/">blog   post</a> by <a href="http://matthias.vallentin.net/">Matthias Vallentin</a>. You can get the flavour of current research by looking at some of the papers citing Bloom filters <a href="http://academic.research.microsoft.com/Publication/772630/space-time-trade-offs-in-hash-coding-with-allowable-errors">here</a>. 
Finally, you may enjoy reading the <a href="http://scholar.google.ca/scholar?cluster=11454588508174765009&#038;hl=en&#038;as_sdt=0,5">original   paper on Bloom filters</a>, as well as the <a href="http://scholar.google.ca/scholar?cluster=18066790496670563714&#038;hl=en&#038;as_sdt=0,5">original   paper on counting Bloom filters</a>.</p>
<p><strong>Understanding data structures:</strong> I wrote this post because I recently realized that I didn&#8217;t understand any complex data structure in any sort of depth.  There are, of course, a huge number of striking data structures in computer science &#8211; just look at <a href="http://en.wikipedia.org/wiki/List_of_data_structures">Wikipedia&#8217;s   amazing list</a>!  And while I&#8217;m familiar with many of the simpler data structures, I&#8217;m ignorant of most complex data structures.  There&#8217;s nothing wrong with that &#8211; unless one is a specialist in data structures there&#8217;s no need to master a long laundry list.  But what bothered me is that I hadn&#8217;t <em>thoroughly</em> mastered even a single complex data structure.  In some sense, I didn&#8217;t know what it means to understand a complex data structure, at least beyond surface mechanics.  By trying to reinvent Bloom filters, I&#8217;ve found that I&#8217;ve deepened my own understanding and, I hope, written something of interest to others.</p>
<p>  <em>Interested in more?  Please <a href="https://michaelnielsen.org/ddi/feed/">subscribe to this blog</a>, or <a href="http://twitter.com/#!/michael_nielsen">follow me on Twitter</a>.  You may also enjoy reading my new book about open science, <a href="http://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/product-description/0691148902">Reinventing Discovery</a>.  </em> </p>
]]></content:encoded>
					
					<wfw:commentRss>https://michaelnielsen.org/ddi/why-bloom-filters-work-the-way-they-do/feed/</wfw:commentRss>
			<slash:comments>20</slash:comments>
		
		
			</item>
		<item>
		<title>How to crawl a quarter billion webpages in 40 hours</title>
		<link>https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/</link>
					<comments>https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/#comments</comments>
		
		<dc:creator><![CDATA[Michael Nielsen]]></dc:creator>
		<pubDate>Fri, 10 Aug 2012 17:33:15 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://michaelnielsen.org/ddi/?p=70</guid>

					<description><![CDATA[More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances. I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web. In this post I&#8230; <a class="more-link" href="https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/">Continue reading <span class="screen-reader-text">How to crawl a quarter billion webpages in 40 hours</span></a>]]></description>
										<content:encoded><![CDATA[<p>More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances.  </p>
<p>I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web.  Of course, there&#8217;s nothing especially new here: I wrote a vanilla (distributed) crawler, mostly to teach myself something about crawling and distributed computing.  Still, I learned some lessons that may be of interest to a few others, and so in this post I describe what I did.  The post also mixes in some personal working notes, for my own future reference.</p>
<p>What does it mean to crawl a non-trivial fraction of the web?  In fact, the notion of a &#8220;non-trivial fraction of the web&#8221; isn&#8217;t well defined.  Many websites generate pages dynamically, in response to user input &#8211; for example, Google&#8217;s search results pages are dynamically generated in response to the user&#8217;s search query.  Because of this it doesn&#8217;t make much sense to say there are so-and-so many billion or trillion pages on the web.  This, in turn, makes it difficult to say precisely what is meant by &#8220;a non-trivial fraction of the web&#8221;.  However, as a reasonable proxy for the size of the web we can use the number of webpages indexed by large search engines. According to this <a href="http://www.youtube.com/watch?v=modXC5IWTJI&#038;feature=player_detailpage#t=175s">presentation</a> by Googler <a href="http://research.google.com/people/jeff/">Jeff Dean</a>, as of November 2010 Google was indexing &#8220;tens of billions of pages&#8221;. (Note that the <a href="http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html">number   of urls</a> is in the trillions, apparently because of duplicated page content, and multiple urls pointing to the same content.)  The now-defunct search engine <a href="http://en.wikipedia.org/wiki/Cuil">Cuil</a> claimed to index <a href="http://searchengineland.com/cuil-launches-can-this-search-start-up-really-best-google-14459">120   billion pages</a>.  By comparison, a quarter billion is, obviously, very small.  Still, it seemed to me like an encouraging start.</p>
<p><strong>Code:</strong> Originally I intended to make the crawler code available under an open source license at GitHub.  However, as I better understood the cost that crawlers impose on websites, I began to have reservations.  My crawler is designed to be polite and impose relatively little burden on any single website, but could (like many crawlers) easily be modified by thoughtless or malicious people to impose a heavy burden on sites.  Because of this I&#8217;ve decided to postpone (possibly indefinitely) releasing the code.  </p>
<p>There&#8217;s a more general issue here, which is this: who gets to crawl the web?  Relatively few sites exclude crawlers from companies such as Google and Microsoft.  But there are a <em>lot</em> of crawlers out there, many of them without much respect for the needs of individual siteowners.  Quite reasonably, many siteowners take an aggressive approach to shutting down activity from less well-known crawlers.  A possible side effect is that if this becomes too common at some point in the future, then it may impede the development of useful new services, which need to crawl the web.  A possible long-term solution may be services like <a href="http://commoncrawl.org/">Common Crawl</a>, which provide access to a common corpus of crawl data.  </p>
<p>I&#8217;d be interested to hear other people&#8217;s thoughts on this issue.</p>
<p>(<em>Later update:</em> I get regular email asking me to send people my code.  Let me pre-emptively say: I decline these requests.)</p>
<p><strong>Architecture:</strong> Here&#8217;s the basic architecture:</p>
<p><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2012/08/quarter_billion_page_crawl_big_picture.png" width="440px"></p>
<p>The master machine (my laptop) begins by downloading <a href="http://www.alexa.com">Alexa&#8217;s</a> list of the <a href="http://s3.amazonaws.com/alexa-static/top-1m.csv.zip">top million   domains</a>.  These were used both as a domain whitelist for the crawler, and to generate a starting list of seed urls.</p>
<p>The domain whitelist was partitioned across the 20 EC2 machine instances in the crawler.  This was done by numbering the instances <img src='https://s0.wp.com/latex.php?latex=0%2C+1%2C+2%2C+%5Cldots%2C+19&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0, 1, 2, \ldots, 19' title='0, 1, 2, \ldots, 19' class='latex' /> and then allocating the domain <tt>domain</tt> to instance number <tt>hash(domain) % 20</tt>, where <tt>hash</tt> is the standard Python hash function.</p>
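<p>In outline, the partitioning looks something like the following sketch.  (Python 3 randomizes <tt>hash</tt> for strings on each run, so a stable hash such as <tt>zlib.crc32</tt> stands in here for the standard Python hash function the crawler used; the whitelist entries are placeholders.)</p>

```python
import zlib

NUM_INSTANCES = 20  # one bucket per EC2 instance

def instance_for(domain):
    """Allocate a domain to an instance by hashing, as described above."""
    return zlib.crc32(domain.encode()) % NUM_INSTANCES

# Partition a (toy) whitelist across the instances.
whitelist = ["google.com", "wikipedia.org", "techcrunch.com"]
partition = {i: [] for i in range(NUM_INSTANCES)}
for d in whitelist:
    partition[instance_for(d)].append(d)
```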
<p>Deployment and management of the cluster was handled using <a href="http://www.fabfile.org/">Fabric</a>, a well-documented and nicely designed Python library which streamlines the use of ssh over clusters of machines.  I managed the connection to Amazon EC2 using a <a href="https://github.com/mnielsen/ec2_tools">set of Python scripts</a> I wrote, which wrap the <a href="https://github.com/boto/boto">boto</a> library.</p>
<p>I used 20 Amazon EC2 <a href="http://aws.amazon.com/ec2/instance-types/">extra large</a> instances, running Ubuntu 11.04 (Natty Narwhal) under the <a href="http://thecloudmarket.com/image/ami-68ad5201--ubuntu-images-ubuntu-natty-11-04-amd64-server-20110426">ami-68ad5201   Amazon machine image</a> provided by Canonical.  I used the extra large instance after testing on several instance types; the extra large instances provided (marginally) more pages downloaded per dollar spent.  I used the US East (North Virginia) region, because it&#8217;s the least expensive of Amazon&#8217;s regions (along with the US West, Oregon region).</p>
<p><strong>Single instance architecture:</strong> Each instance further partitioned its domain whitelist into 141 separate blocks of domains, and launched 141 Python threads, with each thread responsible for crawling the domains in one block.  Here&#8217;s how it worked (details below):</p>
<p><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2012/08/quarter_billion_page_crawl_single_instance.png" width="440px"></p>
<p>The reason for using threads is that the Python standard library uses blocking I/O to handle http network connections.  This means that a single-threaded crawler would spend most of its time idling, usually waiting on the network connection of the remote machine being crawled. It&#8217;s much better to use a multi-threaded crawler, which can make fuller use of the resources available on an EC2 instance.  I chose the number of crawler threads (141) empirically: I kept increasing the number of threads until the speed of the crawler started to saturate. With this number of threads the crawler was using a considerable fraction of the CPU capacity available on the EC2 instance.  My informal testing suggested that it was CPU which was the limiting factor, but that I was not so far away from the network and disk speed becoming bottlenecks; in this sense, the EC2 extra large instance was a good compromise.  Memory usage was never an issue.  It&#8217;s possible that for this reason EC2&#8217;s high-CPU extra large instance type would have been a better choice; I only experimented with this instance type with early versions of the crawler, which were more memory-limited.</p>
<p><strong>How domains were allocated across threads:</strong> The threads were numbered <img src='https://s0.wp.com/latex.php?latex=0%2C+1%2C+%5Cldots%2C+140&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0, 1, \ldots, 140' title='0, 1, \ldots, 140' class='latex' />, and domains were allocated on the basis of the Python hash function, to thread number <tt>hash(domain) % 141</tt> (similar to the allocation across machines in the cluster).  Once the whitelisted domains / seed urls were allocated to threads, the crawl was done in a simple breadth-first fashion, i.e., for each seed url we download the corresponding web page, extract the linked urls, and check each url to see: (a) whether the extracted url is a fresh url which has not already been seen and added to the url frontier; and (b) whether the extracted url is in the same seed domain as the page which has just been crawled.  If both these conditions are met, the url is added to the url frontier for the current thread, otherwise the url is discarded.  With this architecture we are essentially carrying out a very large number of independent crawls of the whitelisted domains obtained from Alexa.</p>
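<p>The per-url checks (a) and (b) can be sketched roughly as follows; here <tt>seen</tt> is a stand-in for whatever record of already-seen urls the crawler keeps, and the function name is mine:</p>

```python
from urllib.parse import urlparse

def should_enqueue(url, seed_domain, seen):
    """Checks (a) and (b) from above: the url must be fresh, and must
    stay within the seed domain of the page just crawled."""
    if url in seen:                           # (a) already seen?
        return False
    if urlparse(url).netloc != seed_domain:   # (b) leaves the seed domain?
        return False
    seen.add(url)
    return True
```

<p>Any url failing either check is simply discarded rather than added to the url frontier.</p>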
<p>Note that this architecture also ensures that if, for example, we are crawling a page from TechCrunch, and extract from that page a link to the Huffington Post, then the latter link will be discarded, even though the Huffington Post is in our domain whitelist.  The only links added to the url frontier will be those that point back to TechCrunch itself. The reason we avoid dealing with (whitelisted) external links is because: (a) it may require communication between different EC2 instances, which would substantially complicate the crawler; and, more importantly, (b) in practice, most sites have lots of internal links, and so it&#8217;s unlikely that this policy means the crawler is missing much.</p>
<p>One advantage of allocating all urls from the same domain to the same crawler thread is that it makes it much easier to crawl politely, since no more than one connection to a site will be open at any given time.  In particular, this ensures that we won&#8217;t be hammering any given domain with many simultaneous connections from different threads (or different machines).</p>
<h3>Problems for the author</h3>
<ul>
<li> For some very large and rapidly changing websites it may be   necessary to open multiple simultaneous connections in order for the   crawl to keep up with the changes on the site.  How can we decide   when that is appropriate? </ul>
<p><strong>How the url frontiers work:</strong> A <em>separate</em> url frontier file was maintained for each domain.  This was simply a text file, with each line containing a single url to be crawled; initially, the file contains just a single line, with the seed url for the domain.  I spoke above of the url frontier for a thread; that frontier can be thought of as the combination of all the url frontier files for domains being crawled by that thread.  </p>
<p>Each thread maintained a connection to a <a href="http://redis.io/">redis</a> server.  For each domain being crawled by the thread a redis key-value pair was used to keep track of the current position in the url frontier file for that domain.  I used redis (and the <a href="https://github.com/andymccurdy/redis-py/">Python   bindings</a>) to store this information in a fashion that was both persistent and fast to look up.  The persistence was important because it meant that the crawler could be stopped and started at will, without losing track of where it was in the url frontier.</p>
<p>Each thread also maintained a dictionary whose keys were the (hashed) domains for that thread.  The corresponding values were the next time it would be polite to crawl that domain.  This value was set to be 70 seconds after the last time the domain was crawled, to ensure that domains weren&#8217;t getting hit too often.  The crawler thread simply iterated over the keys in this dictionary, looking for the next domain it was polite to crawl.  Once it found such a domain it then extracted the next url from the url frontier for that domain, and went about downloading that page.  If the url frontier was exhausted (some domains run out of pages to crawl) then the domain key was removed from the dictionary.  One limitation of this design was that when restarting the crawler each thread had to identify again which domains had already been exhausted and should be deleted from the dictionary. This slowed down the restart a little, and is something I&#8217;d modify if I were to do further work with the crawler.</p>
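<p>Ignoring the redis bookkeeping, the politeness logic for a single thread might be sketched like this.  The 70-second delay is from the description above; the rest of the structure is my guess at a minimal version:</p>

```python
import time

CRAWL_DELAY = 70  # seconds between hits to the same domain

class PolitenessScheduler:
    """Maps each domain to the next time it may politely be crawled."""

    def __init__(self, domains):
        self.next_ok = {d: 0.0 for d in domains}  # 0.0 = crawlable now

    def next_domain(self, now=None):
        """Return a domain that is polite to crawl, or None if none is."""
        now = time.time() if now is None else now
        for domain, t in self.next_ok.items():
            if t <= now:
                self.next_ok[domain] = now + CRAWL_DELAY
                return domain
        return None

    def exhausted(self, domain):
        # Frontier for this domain is empty; stop considering it.
        self.next_ok.pop(domain, None)
```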
<p><strong>Use of a Bloom filter:</strong> I used a <a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a> to keep track of which urls had already been seen and added to the url frontier.  This enabled a very fast check of whether or not a new candidate url should be added to the url frontier, with only a low probability of erroneously adding a url that had already been added. This was done using <a href="http://mike.axiak.net/">Mike Axiak</a>&#8216;s very nice C-based <a href="https://github.com/axiak/pybloomfiltermmap">pybloomfiltermmap</a>.</p>
<p><em>Update:</em> Jeremy McLain <a href="https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/#comment-6379">points out in comments</a> that I&#8217;ve got this backward, and that with a Bloom filter there is a low probability &#8220;that you will never crawl certain URLs because your bloom filter is telling you they have already been crawled when in fact they have not.&#8221;  A better (albeit slightly slower) solution would be to simply store all the URLs, and check directly.</p>
<p><strong>Anticipated versus unanticipated errors:</strong> Because the crawler ingests input from external sources, it needs to deal with many potential errors.  By design, there are two broad classes of error: <em>anticipated errors</em> and <em>unanticipated errors</em>.</p>
<p>Anticipated errors are things like a page failing to download, or timing out, or containing unparseable input, or a <tt>robots.txt</tt> file disallowing crawling of a page.  When anticipated errors arise, the crawler writes the error to a (per-thread) informational log (the &#8220;info log&#8221; in the diagram above), and continues in whatever way is appropriate.  For example, if the <tt>robots.txt</tt> file disallows crawling then we simply continue to the next url in the url frontier.</p>
<p>Unanticipated errors are errors which haven&#8217;t been anticipated and designed for.  Rather than the crawler falling over, the crawler simply logs the error (to the &#8220;critical log&#8221; in the diagram above), and moves on to the next url in the url frontier.  At the same time, the crawler tracks how many unanticipated errors have occurred in close succession.  If many unanticipated errors occur in close succession it usually indicates that some key piece of infrastructure has failed.  Because of this, if there are too many unanticipated errors in close succession, the crawler shuts down entirely.</p>
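<p>Schematically, the two classes of error might be handled like this (the threshold, time window, and logger names below are illustrative guesses, not the crawler's actual values):</p>

```python
import logging
import time

MAX_CRITICAL_ERRORS = 20   # assumed shutdown threshold
WINDOW_SECONDS = 60        # assumed "close succession" window

info_log = logging.getLogger("crawler.info")
critical_log = logging.getLogger("crawler.critical")

class AnticipatedError(Exception):
    """Known failure modes: timeouts, unparseable html, robots.txt disallow."""

recent_critical = []  # timestamps of recent unanticipated errors

def crawl_one(url, download):
    try:
        return download(url)
    except AnticipatedError as e:
        info_log.info("skipping %s: %s", url, e)   # log and move on
        return None
    except Exception as e:
        critical_log.error("unanticipated error on %s: %s", url, e)
        now = time.time()
        recent_critical.append(now)
        # keep only errors inside the recent window
        recent_critical[:] = [t for t in recent_critical
                              if now - t < WINDOW_SECONDS]
        if len(recent_critical) >= MAX_CRITICAL_ERRORS:
            # many failures in close succession: infrastructure is broken
            raise SystemExit("too many unanticipated errors; shutting down")
        return None
```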
<p>As I was developing and testing the crawler, I closely followed the unanticipated errors logged in the critical log.  This enabled me to understand many of the problems faced by the crawler.  For example, early on in development I found that sometimes the html for a page would be so badly formed that the html parser would have little choice but to raise an exception.  As I came to understand such errors I would rewrite the crawler code so such errors become anticipated errors that were handled as gracefully as possible.  Thus, the natural tendency during development was for unanticipated errors to become anticipated errors.</p>
<p><strong>Domain and subdomain handling:</strong> As mentioned above, the crawler works by doing lots of parallel intra-domain crawls.  This works well, but a problem arises because of the widespread use of subdomains.  For example, if we start at the seed url <tt>http://barclays.com</tt> and crawl only urls within the <tt>barclays.com</tt> domain, then we quickly run out of urls to crawl. The reason is that most of the internal links on the <tt>barclays.com</tt> site are actually to <tt>group.barclays.com</tt>, not <tt>barclays.com</tt>.  Our crawler should also add urls from the latter domain to the url frontier for <tt>barclays.com</tt>.</p>
<p>We resolve this by stripping out all subdomains, and working with the stripped domains when deciding whether to add a url to the url frontier.  Removing subdomains turns out to be a surprisingly hard problem, because of variations in the way domain names are formed. Fortunately, the problem seems to be well solved using <a href="https://twitter.com/Bluu">John Kurkowski&#8217;s</a> <a href="https://github.com/john-kurkowski/tldextract">tldextract   library</a>.</p>
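<p>To see why this is harder than it looks, here is a naive registered-domain extractor using a tiny hard-coded suffix list. tldextract does the same kind of longest-suffix matching, but against the full Public Suffix List; this sketch only illustrates the idea:</p>

```python
from urllib.parse import urlparse

# A tiny sample of public suffixes; tldextract consults the full list.
KNOWN_SUFFIXES = {"com", "org", "net", "uk", "co.uk", "com.au"}

def registered_domain(url):
    """Strip subdomains: group.barclays.com -> barclays.com."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # try the longest candidate suffix first
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in KNOWN_SUFFIXES and i > 0:
            return ".".join(labels[i - 1:])
    return host
```

<p>Multi-label suffixes like <tt>co.uk</tt> are exactly why a naive "keep the last two labels" rule fails, and why a maintained suffix list is needed.</p>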
<p><strong>On the representation of the url frontier:</strong> I noted above that a separate url frontier file was maintained for each domain.  In an early version of the code, each crawler thread had a url frontier maintained as a <em>single</em> flat text file.  As a crawler thread read out lines in the file, it would crawl those urls, and append any new urls found to the end of the file.</p>
<p>This approach seemed natural to me, but organizing the url frontier files on a per-thread (rather than per-domain) basis caused a surprising number of problems.  As the crawler thread moved through the file to find the next url to crawl, the crawler thread would encounter urls belonging to domains that were not yet polite to crawl because they&#8217;d been crawled too recently.  My initial strategy was simply to append such urls to the end of the file, so they would be found again later.  Unfortunately, there were often a <em>lot</em> of such urls in a row &#8211; consecutive urls often came from the same domain (since they&#8217;d been extracted from the same page).  And so this strategy caused the file for the url frontier to grow very rapidly, eventually consuming most of the disk space.</p>
<p>Exacerbating this problem, this approach to the url frontier caused an unforeseen &#8220;domain clumping problem&#8221;.  To understand this problem, imagine that the crawler thread encountered (say) 20 consecutive urls from a single domain.  It might crawl the first of these, extracting (say) 20 extra urls to append to the end of the url frontier.  But the next 19 urls would all be skipped over, since it wouldn&#8217;t yet be polite to crawl them, and they&#8217;d also be appended to the end of the url frontier.  Now we have 39 urls from the same domain at the end of the url frontier.  But when the crawler thread gets to those, we may well have the same process repeat &#8211; leading to a clump of 58 urls from the same domain at the end of the file.  And so on, leading to very long runs of urls from the same domain.  This consumes lots of disk space, and also slows down the crawler, since the crawler thread may need to examine a large number of urls before it finds a new url it&#8217;s okay to crawl.</p>
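<p>The clump arithmetic is easy to reproduce: each pass over a clump of n same-domain urls crawls one url (appending, say, 20 newly extracted urls) and re-appends the other n - 1, so the clump grows by 19 each pass:</p>

```python
def clump_sizes(initial=20, extracted_per_page=20, passes=2):
    """Size of the same-domain clump after each pass over the frontier."""
    n = initial
    sizes = [n]
    for _ in range(passes):
        # n - 1 urls re-appended (not yet polite) + 20 newly extracted
        n = (n - 1) + extracted_per_page
        sizes.append(n)
    return sizes
```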
<p>These problems could have been solved in various ways; moving to the per-domain url frontier file was how I chose to address the problems, and it seemed to work well.</p>
<p><strong>Choice of number of threads:</strong> I mentioned above that the number of crawler threads (141) was chosen empirically.  However, there is an important constraint on that number, and in particular its relationship to the number (20) of EC2 instances being used.  Suppose that instead of 141 threads I&#8217;d used (say) 60 threads.  This would create a problem.  To see why, note that any domain allocated to instance number 7 (say) would necessarily satisfy <tt>hash(domain) % 20 = 7</tt>.  This would imply that <tt>hash(domain) % 60 = 7</tt> or 27 or 47, and as a consequence all the domains would be allocated to just one of three crawler threads (thread numbers 7, 27 and 47), while the other 57 crawler threads would lie idle, defeating the purpose of using multiple threads.</p>
<p>One way to solve this problem would be to use two <a href="http://en.wikipedia.org/wiki/K-independent_hashing">independent</a> hash functions to allocate domains to EC2 instances and crawler threads.  However, an even simpler way of solving the problem is to choose the number of crawler threads to be coprime to the number of EC2 instances.  This coprimality ensures that domains will be allocated reasonably evenly across both instances and threads.  (I won&#8217;t prove this here, but it can be proved with a little effort.)  It is easily checked that 141 and 20 are coprime.</p>
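<p>It's easy to check this numerically.  The sketch below models the allocation by computing <tt>h % num_instances</tt> and <tt>h % num_threads</tt> over a spread of hash values, and counts how many threads on one instance ever receive a domain:</p>

```python
from math import gcd

def threads_used(num_instances, num_threads, instance, num_domains=100000):
    """How many threads on one instance ever receive a domain, modelling
    hash values as uniformly spread integers."""
    used = set()
    for h in range(num_domains):
        if h % num_instances == instance:
            used.add(h % num_threads)
    return len(used)

# 20 instances and 60 threads share a factor of 20, so instance 7 only
# ever feeds threads 7, 27 and 47.  141 is coprime to 20, so all 141
# threads on each instance get work.
```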
<p>Note, incidentally, that Python&#8217;s <tt>hash</tt> is not a true hash function, in the sense that it doesn&#8217;t guarantee that the domains will be spread evenly across EC2 instances.  It turns out that Python&#8217;s <tt>hash</tt> maps similar key strings to similar hash values.  I talk more about this point (with examples) in the fifth paragraph of <a href="https://michaelnielsen.org/blog/consistent-hashing/">this post</a>. However, I found empirically that <tt>hash</tt> seems to spread domains evenly enough across instances, and so I didn&#8217;t worry about using a better (but slower) hash function, like those available through Python&#8217;s <tt>hashlib</tt> library.</p>
<p><strong>Use of Python:</strong> All my code was written in Python.  Initially, I wondered if Python might be too slow, and create bottlenecks in the crawling.  However, profiling the crawler showed that most time was spent either (a) managing network connections and downloading data; or (b) parsing the resulting webpages.  The parsing of the webpages was being done using <a href="http://lxml.de/">lxml</a>, a Python binding to fast underlying C libraries.  It didn&#8217;t seem likely to be easy to speed that up, and so I concluded that Python was likely not a particular bottleneck in the crawling.  </p>
<p><strong>Politeness:</strong> The crawler used Python&#8217;s <a href="http://docs.python.org/library/robotparser.html">robotparser   library</a> in order to observe the <a href="http://www.robotstxt.org/">robots exclusion protocol</a>.  As noted above, I also imposed an absolute 70-second minimum time interval between accesses to any given domain.  In practice, the mean time between accesses was more like 3-4 minutes.</p>
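<p>The robots.txt check looks roughly like this (in Python 3 the module has moved to <tt>urllib.robotparser</tt>; the user-agent string below is illustrative, not the crawler's real one, and here I parse a canned file so the sketch is self-contained rather than fetching one per domain):</p>

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ddi-crawler"  # illustrative name

rp = RobotFileParser()
# In the crawler, robots.txt is fetched once per domain via
# rp.set_url(...) followed by rp.read().
rp.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

def allowed(url):
    return rp.can_fetch(USER_AGENT, url)
```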
<p>In initial test runs of the crawler I got occasional emails from webmasters asking for an explanation of why I was crawling their site. Because of this, in the crawler&#8217;s <a href="http://en.wikipedia.org/wiki/User_agent">User-agent</a> I included a link to a <a href="https://michaelnielsen.org/blog/oss-bot/">webpage</a> explaining the purpose of my crawler, how to exclude it from a site, and what steps I was taking to crawl politely.  This was (I presume) both helpful to webmasters and also helpful to me, for it reduced the number of inquiries.  A handful of people asked me to exclude their sites from the crawl, and I complied quickly.</p>
<h3>Problems for the author</h3>
<ul>
<li> Because my crawl didn&#8217;t take too long, the <tt>robots.txt</tt>   file was downloaded just once for each domain, at the beginning of   the crawl.  In a longer crawl, how should we decide how long to wait   between downloads of <tt>robots.txt</tt>? </ul>
<p><strong>Truncation:</strong> The crawler truncates large webpages rather than downloading the full page. It does this in part because it&#8217;s necessary &#8211; it really wouldn&#8217;t surprise me if someone has a terabyte html file sitting on a server somewhere &#8211; and in part because for many applications it will be of more interest to focus on earlier parts of the page.</p>
<p>What&#8217;s a reasonable threshold for truncation?  According to <a href="http://code.google.com/speed/articles/web-metrics.html">this   report</a> from Google, as of May 2010 the average network size of a webpage from a top site is 312.04 kb.  However, that includes images, scripts and stylesheets, which the crawler ignores.  If you ignore the images and so on, then the average network size drops to just 33.66 kb.</p>
<p>However, that number of 33.66 kb is for content which may be served compressed over the network. Our truncation will be based on the uncompressed size.  Unfortunately, the Google report doesn&#8217;t tell us what the average size of the uncompressed content is.  However, we can get an estimate of this, since Google reports that the average uncompressed size of the <em>total</em> page (including images and so on) is 477.26 kb, while the average network size is 312.04 kb.</p>
<p>Assuming that this compression ratio is typical, we estimate that the average uncompressed size of the content the crawler downloads is 51 kb.  In the event, I experimented with several truncation settings, and found that a truncation threshold of 200 kilobytes enabled me to download the great majority of webpages in their entirety, while addressing the problem of very large html files mentioned above. (Unfortunately, I didn&#8217;t think to check what the <em>actual</em> average uncompressed size was, my mistake.)</p>
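<p>The truncation itself is simple: read at most the threshold from the response body rather than trusting any declared content length.  A minimal sketch (the helper name is mine; the 200 kilobyte threshold is the one settled on above):</p>

```python
MAX_BYTES = 200 * 1024  # 200 kilobyte truncation threshold

def read_truncated(response, max_bytes=MAX_BYTES):
    """Read up to max_bytes from a file-like HTTP response.

    Returns (data, truncated): truncated is True if the body was larger
    than max_bytes and the tail was discarded.
    """
    data = response.read(max_bytes)
    truncated = bool(response.read(1))  # anything left means we truncated
    return data, truncated
```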
<p><strong>Storage:</strong> I stored all the data using EC2&#8217;s built-in <a href="http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/InstanceStorage.html">instance   storage</a> &#8211; 1.69 Terabytes for the extra-large instances I was using.  This storage is non-persistent, and so any data stored on an instance will vanish when that instance is terminated.  Now, for many kinds of streaming or short-term analysis of data this would be adequate &#8211; indeed, it might not even be necessary to store the data at all.  But, of course, for many applications of a crawl this approach is not appropriate, and the instance storage should be supplemented with something more permanent, such as S3.  For my purposes using the instance storage seemed fine.</p>
<p><strong>Price:</strong> The price broke down into two components: (1) 512 dollars for the use of the 20 extra-large EC2 instances for 40 hours; and (2) about 65 dollars for a little over 500 gigabytes of outgoing bandwidth, used to make http requests.  Note that Amazon does not charge for incoming bandwidth (a good thing, too!)  It would be interesting to compare these costs to the (appropriately amortized) costs of using other cloud providers, or self-hosting.  </p>
<p>Something I didn&#8217;t experiment with is the use of Amazon&#8217;s <a href="http://aws.amazon.com/ec2/spot-instances/">spot instances</a>, where you can bid to use Amazon&#8217;s unused EC2 capacity.  I didn&#8217;t think of doing this until just as I was about to launch the crawl.  When I went to look at the spot instance pricing history, I discovered to my surprise that the spot instance prices are often a factor of 10 or so lower than the prices for on-demand instances!  Factoring in the charges for outgoing bandwidth, this means it may be possible to use spot instances to do a similar crawl for 120 dollars or so, a factor of five savings.  I considered switching, but ultimately decided against it, thinking that it might take 2 or 3 days work to properly understand the implications of switching, and to get things working exactly as I wanted.  Admittedly, it&#8217;s possible that it would have taken much less time, in which case I missed an opportunity to trade some money for just a little extra time.</p>
<p><strong>Improvements to the crawler architecture:</strong> Let me finish by noting a few ways it&#8217;d be interesting to improve the current crawler: </p>
<ul>
<li> For many long-running applications the crawler would need a   smart crawl policy so that it knows when and how to re-crawl a page.   According to a   <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//people/jeff/WSDM09-keynote.pdf">presentation</a>   from <a href="http://research.google.com/people/jeff/">Jeff Dean</a>,   Google&#8217;s mean time to index a new page is now just minutes.  I don&#8217;t   know how that works, but I imagine that notification protocols such as   <a href="http://code.google.com/p/pubsubhubbub/">pubsubhubbub</a> play an   important role.  It&#8217;d be good to change the crawler so that it&#8217;s   pubsubhubbub aware.
<li> The crawler currently uses a threaded architecture.  Another   quite different approach is to use an   <a href="http://www.google.com/#sclient=psy-ab&#038;hl=en&#038;site=&#038;source=hp&#038;q=event-driven+crawler&#038;fp=1">evented     architecture</a>.  What are the pros and cons of a multi-threaded   versus an evented architecture?
<li> The instances in the cluster are configured using fabric and   shell scripts to install programs such as redis, pybloomfilter, and   so on.  This is slow and not completely reliable.  Is there a better   way of doing this?  Creating my own EC2 AMI?  Configuration   management software such as   <a href="http://www.opscode.com/chef/">Chef</a> and   <a href="http://puppetlabs.com/">Puppet</a>?  I considered using one of   the latter, but deferred it because of the upfront cost of learning   the systems.
<li> Logging is currently done using Python&#8217;s <tt>logging</tt> module.   Unfortunately, I&#8217;m finding this is not well-adapted to Python&#8217;s   threading.  Is there a better solution?
<li> The crawler was initially designed for crawling in a batch   environment, where it is run and then terminates.  I&#8217;ve since   modified it so that it can be stopped, modifications made, and   restarted.  It&#8217;d be good to add instrumentation so it can be   modified more dynamically, in real time.
<li> Many interesting research papers have been published about   crawling.  I read or skimmed quite a few while writing my crawler,   but ultimately used only a few of the ideas; just getting the basics   right proved challenging enough.  In future iterations it&#8217;d be   useful to look at this work again and to incorporate the best ideas.   Good starting points include a   <a href="http://nlp.stanford.edu/IR-book/html/htmledition/web-crawling-and-indexes-1.html">chapter</a>   in the book by Manning, Raghavan and Sch&#252;tze, and a   <a href="http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf">survey     paper</a> by Olston and Najork.  Existing open source crawlers such   as   <a href="https://webarchive.jira.com/wiki/display/Heritrix/Heritrix">Heritrix</a>   and <a href="http://nutch.apache.org/">Nutch</a> would also be interesting   to look at in more depth. </ul>
]]></content:encoded>
					
					<wfw:commentRss>https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/feed/</wfw:commentRss>
			<slash:comments>67</slash:comments>
		
		
			</item>
		<item>
		<title>Using evaluation to improve our question-answering system</title>
		<link>https://michaelnielsen.org/ddi/using-evaluation-to-improve-our-question-answering-system/</link>
					<comments>https://michaelnielsen.org/ddi/using-evaluation-to-improve-our-question-answering-system/#comments</comments>
		
		<dc:creator><![CDATA[Michael Nielsen]]></dc:creator>
		<pubDate>Sat, 30 Jun 2012 20:20:15 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://michaelnielsen.org/ddi/?p=68</guid>

					<description><![CDATA[It&#8217;s tempting to think IBM&#8217;s Jeopardy-playing machine Watson must have relied on some huge algorithmic advance or silver bullet idea in order to beat top human Jeopardy players. But the researchers behind Watson have written a very interesting paper about how Watson works, and a different picture emerges. It&#8217;s not that they found any super-algorithm&#8230; <a class="more-link" href="https://michaelnielsen.org/ddi/using-evaluation-to-improve-our-question-answering-system/">Continue reading <span class="screen-reader-text">Using evaluation to improve our question-answering system</span></a>]]></description>
										<content:encoded><![CDATA[<p>It&#8217;s tempting to think IBM&#8217;s <em>Jeopardy</em>-playing machine Watson must have relied on some huge algorithmic advance or silver bullet idea in order to beat top human <em>Jeopardy</em> players.  But the researchers behind Watson have written a very <a href="http://www.aaai.org/ojs/index.php/aimagazine/article/view/2303/2165">interesting   paper</a> about how Watson works, and a different picture emerges. It&#8217;s not that they found any super-algorithm for answering questions. Instead, Watson combined a large number of different algorithms, most of them variations on standard algorithms from natural language processing and machine learning.  Individually, none of these algorithms was particularly good at solving Jeopardy puzzles, or the sub-problems that needed to be solved along the way.  But by integrating a suite of not-very-good algorithms in just the right way, the Watson team got superior performance.  As they write in the paper: </p>
<blockquote><p> rapid integration and evaluation of new ideas and new   components against end-to-end metrics [were] essential to our   progress&#8230; [Question answering benefits from] a single extensible   architecture that [allows] component results to be consistently   evaluated in a common technical context against a growing variety   of&#8230; &#8220;Challenge Problems.&#8221;&#8230; Our commitment to regularly   evaluate the effects of specific techniques on   end-to-end-performance, and to let that shape our research   investment, was necessary for our rapid progress. </p></blockquote>
<p> In other words, they built an extensive evaluation environment that gave detailed and clear metrics that let them see how well their system was doing.  In turn, they used these metrics to determine where to invest time and money in improving their system and, equally important, to determine what ideas to abandon.  We will call this style of development <em>evaluation-driven development</em>.  It&#8217;s not a new idea, of course &#8211; anyone who&#8217;s ever run an A/B test is doing something similar &#8211; but the paper implies that Watson took the approach to a remarkable extreme, and that it was this approach which was responsible for much of Watson&#8217;s success.</p>
<p>In the <a href="https://michaelnielsen.org/ddi/how-to-answer-a-question-a-simple-system/">last   post</a> I developed a very simple question-answering system based on the AskMSR system developed at Microsoft Research in 2001.  I&#8217;ve since been experimenting &#8211; in a very small-scale way! &#8211; with the use of evaluation-driven development to improve the performance of that system.  In this post I describe some of my experiments.  Of course, my experiments don&#8217;t compare in sophistication to what the Watson team did, nor with what is done in many other machine learning systems. Still, I hope that the details are of interest to at least a few readers.  The code for the experiments may be found on GitHub <a href="https://github.com/mnielsen/mini_qa/tree/blog-evaluation">here</a>. Note that you don&#8217;t need to have read the last post to understand this one, although of course it wouldn&#8217;t hurt!</p>
<p><strong>First evaluation:</strong> The system described in my last post took questions like &#8220;Who is the world&#8217;s richest person?&#8221; and rewrote those questions as search queries for Google.  E.g., for the last question it might use Google to find webpages matching &#8220;the world&#8217;s richest person is *&#8221;, and then look for name-like strings in the wildcard position.  It would extract possible strings, and score them based on a number of factors, such as whether the strings were capitalized, how likely the search query was to give the correct results, and so on.  Finally, it returned the highest-scoring strings as candidate answers.</p>
<p>To evaluate this system, I constructed a list of 100 questions, with each question having a list of one or more acceptable answers. Example questions and acceptable answers included:</p>
<p>(1) Who married the Prince of Wales in 2005?: Camilla Parker Bowles</p>
<p>(2) Who was the first woman to climb the mountain K2?:  Wanda Rutkiewicz</p>
<p>(3) Who was the bass guitarist for the band Nirvana?: Krist Novoselic</p>
<p>(4) Who wrote &#8216;The Waste Land&#8217;?: T. S. Eliot or Thomas Stearns Eliot</p>
<p>(5) Who won the 1994 Formula One Grand Prix championship?: Michael Schumacher</p>
<p>The system described in my last post got 41 of the 100 questions exactly right, i.e., it returned an acceptable answer as its top-scoring answer.  Furthermore, the system returned a correct answer as one of its top 20 scored responses for 75 of the 100 questions. Among these, the average rank was 3.48: i.e., when Google could find the answer at all, it tended to rank it very highly.</p>
<p>(Actually, to be precise, I used a slightly modified version of the system in the last post: I fixed a bug in how the program parsed strings.  The bug is conceptually unimportant, but fixing it increased performance slightly.)</p>
<p><strong>Comparison to Wolfram Alpha:</strong> As a comparison, I tried submitting the evaluation questions also to Wolfram Alpha, and determining if the answers returned by Wolfram were in the list of acceptable answers.  Now, Wolfram doesn&#8217;t always return an answer. But for 27 of those questions it did return an answer.  And 20 of those answers were correct.</p>
<p><strong>Hybrid system:</strong> Two things are notable about Wolfram&#8217;s performance versus the Google-based system: (1) Wolfram gets answers correct much less often than the Google-based system; and (2) When Wolfram returns an answer, it is much more likely to be correct (20 / 27, or 74 percent) than the Google-based system.  This suggests using a hybrid system which applies the following procedure: </p>
<pre>
if Wolfram Alpha returns an answer:
    return Wolfram Alpha's answer
else:
    return the Google-based answer
</pre>
<p> Let&#8217;s see if we can guess how well this system will work.  One possible assumption is that the correctness of the Google-based system and the Wolfram-based system are <em>uncorrelated</em>.  If that&#8217;s the case, then we&#8217;d expect that the hybrid system would get 20 / 27 questions correct, for those questions answered by Wolfram, and the Google-based system would get 41 percent of the remaining 73 questions correct.  That gives an expected total of roughly 50 questions correct.</p>
<p>Now, in practice, when I ran the hybrid procedure it only gave 45 correct answers.  That&#8217;s a considerable improvement (4 questions) over the Google-based system alone.  On the other hand, it&#8217;s less than <em>half</em> the improvement we&#8217;d expect if the Google-based system and the Wolfram-based system were uncorrelated.  What&#8217;s going on is that the evaluation questions which Wolfram Alpha is good at answering are, for the most part, questions which Google is also good at answering. In other words, what we learn from the evaluation is that there is considerable overlap (or correlation) in the type of questions these two systems are good at answering, and so we get less improvement than we might have thought.</p>
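<p>The back-of-envelope estimate is worth making explicit, since the gap between it and the observed score is the whole point:</p>

```python
# Estimate under the (wrong, as it turns out) assumption that the two
# systems' correctness is uncorrelated.
wolfram_answered, wolfram_correct = 27, 20
google_accuracy = 0.41          # 41 of 100 questions correct
total_questions = 100

expected = wolfram_correct + google_accuracy * (total_questions - wolfram_answered)
# expected is about 49.9, i.e. roughly 50 correct if uncorrelated
observed = 45                   # what the hybrid actually scored
```

<p>The hybrid gains only 4 questions over the Google-based system's 41, less than half the 9 the uncorrelated estimate predicts, which is how the correlation between the two systems shows up in the numbers.</p>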
<h3>Problems for the author</h3>
<ul>
<li> In developing the hybrid algorithm we relied on the fact that   when Wolfram Alpha returned an answer it was highly likely to be   correct &#8211; much more likely than the Google-based system.  But   suppose the Google-based system says that there is a very big gap in   score between the top-ranked and second-ranked answer.  Might that   signal that we can have high confidence in the Google-based answer?   Determine where it is possible to gain better performance by   returning the Google-based answer preferentially when the ratio of   the top-ranked and second-ranked score is sufficiently large.
<li> Are there other problem features that would enable us to   determine whether a question is more or less likely to be answered   correctly by the Google-based system or by the Wolfram-based system?   Maybe, for example, long questions are more likely to be answered   correctly by Google.  Modify the code to determine whether this is   the case.
</ul>
<p><strong>Improving the scoring of proper nouns in the Google-based   system:</strong> In the Google-based system, the scoring mechanism increases the scores of candidate answers by a factor for each word in the answer which is capitalized.  The idea is to give higher scores to candidate answers which are likely to be proper nouns. </p>
<p>Originally, I chose the capitalization factor to be 3.0, based on nothing more than an intuitive guess.  Let&#8217;s use the evaluation system to see if we can improve this guess.  The results are below; you may wish to skip straight to the list of lessons learned, however, as an entree into the results.</p>
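<p>The scoring form being tuned might look like the following (this is an assumed reconstruction from the description above, with a per-word multiplicative factor; the actual scoring code may differ):</p>

```python
def capitalization_score(base_score, answer, factor=3.0):
    """Multiply a candidate answer's score by `factor` once for each
    capitalized word, boosting likely proper nouns."""
    num_capitalized = sum(1 for word in answer.split() if word[:1].isupper())
    return base_score * factor ** num_capitalized
```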
<pre>
Factor: 4.0
Top-ranking answer is correct: 33
A correct answer in the top 20: 74
Average rank for correct answers in the top 20: 4.46

Factor: 3.0 (original parameter choice)
Top-ranking answer is correct: 41
A correct answer in the top 20: 75
Average rank for correct answers in the top 20: 3.48

Factor: 2.5
Top-ranking answer is correct: 42
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.41

Factor: 2.4
Top-ranking answer is correct:  47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.38

Factor: 2.3
Top-ranking answer is correct: 47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.25

Factor: 2.2
Top-ranking answer is correct: 47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.21

Factor: 2.1
Top-ranking answer is correct: 47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.24

Factor: 2.0
Top-ranking answer is correct: 42
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.13

Factor: 1.9
Top-ranking answer is correct: 44
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.13

Factor: 1.8
Top-ranking answer is correct: 43
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.12

Factor: 1.7
Top-ranking answer is correct: 38
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.39

Factor: 1.6
Top-ranking answer is correct: 34
A correct answer in the top 20: 74
Average rank for correct answers in the top 20: 3.22

Factor: 1.5
Top-ranking answer is correct: 29
A correct answer in the top 20: 73
Average rank for correct answers in the top 20: 3.38

Factor: 1.0
Top-ranking answer is correct: 1
A correct answer in the top 20: 65
Average rank for correct answers in the top 20: 7.74

Factor: 0.8
Top-ranking answer is correct: 1
A correct answer in the top 20: 52
Average rank for correct answers in the top 20: 10.25
</pre>
<p>Looking at the above, we learn a great deal:</p>
<p>(1) Increasing the capitalization factor to much above my original guess at a value (3.0) starts to significantly degrade the system.</p>
<p>(2) Removing the capitalization factor (i.e., setting it to 1.0) makes the system much, much worse.  It still does okay at getting things approximately correct &#8212; for 65 of the questions a correct answer is in the top 20 returned responses.  But for only 1 of the 100 questions is the top-ranked answer actually correct.  In other words, we learn that <em>paying attention to capitalization makes a huge difference   to how likely our algorithm is to return a correct answer as the top   result</em>.  It&#8217;s also interesting to note that <em>even ignoring   capitalization the algorithm does pretty well at returning a   high-ranked answer that is correct</em>. </p>
<p>(3) Actually penalizing capitalized answers (by choosing a capitalization factor of 0.8) makes things even worse.  This is not surprising, but it&#8217;s useful to see explicitly, and further confirmation that paying attention to capitalization matters.</p>
<p>(4) The optimal parameter range for the capitalization factor depends on whether you want to maximize the number of perfect answers, in which case the capitalization factor should be 2.1-2.4, or to get the best possible rank, in which case the capitalization factor should be 1.8-2.0.</p>
<p>(5) Performance only varies a small amount over the range of capitalization factors 1.8-2.3.  Even taking the capitalization factor out to its original value (3.0) decreases performance only a small amount.</p>
<p><strong>Improving the scoring of proper nouns in the hybrid system:</strong> We&#8217;ve just used our evaluation suite to improve the value of the capitalization factor for the Google-based question-answering system. What happens when we go through the same procedure for the hybrid question-answering system?  <em>A priori</em> the optimal value for the capitalization factor may be somewhat different.  I ran the evaluation algorithm for many different values of the capitalization factor.  Here are the results:</p>
<pre>
Factor: 2.6
Top-ranking answer is correct: 47

Factor: 2.5
Top-ranking answer is correct: 46

Factor: 2.4
Top-ranking answer is correct: 51

Factor: 2.3
Top-ranking answer is correct: 51

Factor: 2.2
Top-ranking answer is correct: 51

Factor: 2.1
Top-ranking answer is correct: 51

Factor: 2.0
Top-ranking answer is correct: 47

Factor: 1.9
Top-ranking answer is correct: 50

Factor: 1.8
Top-ranking answer is correct: 49

Factor: 1.7
Top-ranking answer is correct: 47
</pre>
<p>The pattern is quite similar to the Google-based system, with the highest scores being obtained for capitalization factors in the range 2.1-2.4.  As a result, I&#8217;ve changed the value for the capitalization factor from its original value of 3.0 to a somewhat lower value, 2.2.</p>
<p><strong>Varying the weights of the rewrite rules:</strong> Recall from the <a href="https://michaelnielsen.org/ddi/how-to-answer-a-question-a-simple-system/">last   post</a> that questions of the form &#8220;Who VERB w1 w2 w3&#8230;&#8221; got rewritten as Google queries: </p>
<pre>
"VERB w1   w2   w3 ... "
"w1   VERB w2   w3 ... "
"w1   w2   VERB w3 ... "
... 
</pre>
<p> The candidate answers extracted from the search result for these rewritten queries were given a score of 5, following the scoring used in the original AskMSR research paper.  Another rewrite rule formulated a single Google query which omitted both the quotes and VERB, i.e., it was just: </p>
<pre>
w1   w2   w3 ... 
</pre>
<p> The candidate answers extracted from the search results for these rewritten queries were given a score of 2.</p>
<p>The scores 5 and 2 seem pretty arbitrary.  Can we improve our system by changing the scores?  Let&#8217;s experiment, and find out.  To do the experiment, let me first mention that all that matters to the final ranking is the <em>ratio</em> of the two scores (initially, 5/2 = 2.5). So I&#8217;ll give the results of running our evaluation suite for different choices of that ratio:</p>
<pre>
Ratio: infinite
Top-ranking answer is correct: 50

Ratio: 10.0
Top-ranking answer is correct: 50

Ratio: 6.0
Top-ranking answer is correct: 50

Ratio: 5.0
Top-ranking answer is correct: 50

Ratio: 4.0
Top-ranking answer is correct: 50

Ratio: 3.5
Top-ranking answer is correct: 51

Ratio: 3.0
Top-ranking answer is correct: 51

Ratio: 2.5
Top-ranking answer is correct: 51

Ratio: 2.0
Top-ranking answer is correct: 51

Ratio: 1.5
Top-ranking answer is correct: 51

Ratio: 1.0
Top-ranking answer is correct: 50

Ratio: 0.75
Top-ranking answer is correct: 49

Ratio: 0.0
Top-ranking answer is correct: 37
</pre>
<p>Again, we learn some interesting things:</p>
<p>(1) The performance is surprisingly insensitive to changes in this ratio.</p>
<p>(2) Even removing the quoted rewritten queries entirely (ratio = 0.0) only drops the performance from 51 to 37 questions answered correctly.</p>
<p>(3) Dropping the unquoted rewritten queries has almost no effect at all, dropping the number of correct answers from 51 to 50.</p>
<p>(4) Choosing the ratio to be 2.5, as was done in the original AskMSR paper, seems to provide good performance.</p>
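<p>Incidentally, the claim above that only the <em>ratio</em> of the two scores matters is easy to verify directly: each candidate&#8217;s total score has the form (quoted hits) &times; s1 + (unquoted hits) &times; s2, so multiplying both scores by the same constant rescales every total and leaves the ranking unchanged.  A quick sanity check, with made-up hit counts:</p>

```python
# Each candidate answer is represented by (quoted hits, unquoted hits),
# i.e., how often it was extracted from the two kinds of query results.
# The counts are invented, purely for illustration.
counts = {"john wilkes booth": (4, 7),
          "abraham lincoln":   (1, 12),
          "the assassination": (2, 3)}

def ranking(quoted_score, unquoted_score):
    """Rank candidates by total score under the given rule scores."""
    total = {answer: q * quoted_score + u * unquoted_score
             for answer, (q, u) in counts.items()}
    return sorted(total, key=total.get, reverse=True)

# (5, 2) and (10, 4) have the same ratio, so the rankings agree...
assert ranking(5, 2) == ranking(10, 4)
# ...while changing the ratio can change which answer comes out on top.
assert ranking(5, 2)[0] != ranking(1, 2)[0]
```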
<p><strong>Two problems in the way we&#8217;ve used evaluation questions to   improve the system</strong></p>
<p>(1) It&#8217;s not clear that the evaluation questions are truly a random sample of well-posed &#8220;Who&#8230;?&#8221; questions.  They&#8217;re simply a set of questions that happened to occur to me while I was working on the program, and they naturally reflect my biases and interests.  In fact, the problem is worse than that: it&#8217;s not even clear what it would <em>mean</em> for the evaluation questions to be a random sample.  A random sample from what natural population?</p>
<p>(2) There&#8217;s something fishy about using the same data repeatedly to improve the parameters in our system.  It can lead to the problem of <a href="http://en.wikipedia.org/wiki/Overfitting">overfitting</a>, where the parameters we end up choosing reflect accidental features of the evaluation set.  A standard solution to this problem is <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)">cross-validation</a>, dividing evaluation questions up into <em>training data</em> (which are used to estimate parameters) and <em>validation data</em> (which are used to check that overfitting hasn&#8217;t occurred).</p>
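<p>A minimal sketch of such a split, assuming the evaluation suite is just a list of (question, answer) pairs (the function name is mine, not from any particular library):</p>

```python
import random

def split_evaluation_suite(qa_pairs, validation_fraction=0.3, seed=0):
    """Shuffle the evaluation questions and split them into a training
    set (used to tune parameters such as the capitalization factor)
    and a held-out validation set (used to detect overfitting)."""
    pairs = list(qa_pairs)
    random.Random(seed).shuffle(pairs)
    n_validation = int(len(pairs) * validation_fraction)
    return pairs[n_validation:], pairs[:n_validation]

suite = [("Who killed Abraham Lincoln?", "john wilkes booth"),
         ("Who wrote the Iliad?", "homer"),
         ("Who invented the C programming language?", "dennis ritchie"),
         ("Who ran the first four-minute mile?", "roger bannister"),
         ("Who discovered relativity?", "albert einstein")]
train, validation = split_evaluation_suite(suite)
```

<p>Parameters would then be chosen to maximize performance on <tt>train</tt>, with <tt>validation</tt> used only for a final check.</p>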
<p>Problem (2) could be solved with a much larger set of evaluation questions, and that&#8217;s a good goal for the future.  If anyone knows of a good source of evaluation questions, I&#8217;d like to hear about it! It&#8217;s not at all clear to me how to solve problem (1), and I&#8217;d be interested to hear people&#8217;s ideas.</p>
<p><strong>Automating:</strong> What we&#8217;ve done here is very similar to what people routinely do with statistical models or machine learning, where you use training data to find the best possible parameters for your model.  What&#8217;s interesting, though, is that we&#8217;re not just varying parameters, we&#8217;re changing the algorithm.  It&#8217;s interesting to learn that Wolfram Alpha and Google overlap more than you might <em>a   priori</em> expect in the types of questions they&#8217;re good (and bad) at answering.  And it tells you that effort invested in improving one or the other approach might have less impact than you expect, unless you&#8217;re careful to aim your improvements at questions not already well-answered by the other system.  </p>
<h3>Problems</h3>
<ul>
<li> Find three ways of modifying the hybrid algorithm, and use the   evaluation algorithm to see how well the modifications work.  Can   you improve the algorithm so it gets more than 51 of the evaluation   questions exactly right?   </ul>
<p><strong>Acknowledgements:</strong> Thanks to <a href="https://twitter.com/np_hoffman">Nathan Hoffman</a> for gently suggesting that it might be a good idea to cache results from my screen-scraping!</p>
<p><em>Interested in more?  Please <a href="https://michaelnielsen.org/ddi/feed/">subscribe to this blog</a>, or <a href="http://twitter.com/#!/michael_nielsen">follow me on Twitter</a>.  You may also enjoy reading my new book about  open science, <a href="http://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/product-description/0691148902">Reinventing Discovery</a>. </em></p>
]]></content:encoded>
					
					<wfw:commentRss>https://michaelnielsen.org/ddi/using-evaluation-to-improve-our-question-answering-system/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>How to answer a question: a simple system</title>
		<link>https://michaelnielsen.org/ddi/how-to-answer-a-question-a-simple-system/</link>
					<comments>https://michaelnielsen.org/ddi/how-to-answer-a-question-a-simple-system/#comments</comments>
		
		<dc:creator><![CDATA[Michael Nielsen]]></dc:creator>
		<pubDate>Wed, 13 Jun 2012 17:30:35 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://michaelnielsen.org/ddi/?p=65</guid>

					<description><![CDATA[In 2011, IBM got a lot of publicity when it demonstrated a computer named Watson, which was designed to answer questions on the game show Jeopardy. Watson was good enough to be competitive with (and ultimately better than) some of the best ever Jeopardy players. Playing Jeopardy perhaps seems like a frivolous task, and it&#8217;s&#8230; <a class="more-link" href="https://michaelnielsen.org/ddi/how-to-answer-a-question-a-simple-system/">Continue reading <span class="screen-reader-text">How to answer a question: a simple system</span></a>]]></description>
										<content:encoded><![CDATA[<p>In 2011, IBM got a lot of publicity when it demonstrated a computer named <a href="http://en.wikipedia.org/wiki/Watson_(computer)">Watson</a>, which was designed to answer questions on the game show <em>Jeopardy</em>.  Watson was good enough to be competitive with (and ultimately better than) some of the best ever Jeopardy players.</p>
<p>Playing Jeopardy perhaps seems like a frivolous task, and it&#8217;s tempting to think that Watson was primarily a public relations coup for IBM.  In fact, Watson is a remarkable technological achievement. Jeopardy questions use all sorts of subtle hints, humour and obscure references.  For Watson to compete with the best humans it needed to integrate an enormous range of knowledge, as well as many of the best existing ideas in natural language processing and artificial intelligence.</p>
<p>In this post I describe one of Watson&#8217;s ancestors, a question-answering system called AskMSR built in 2001 by researchers at Microsoft Research.  AskMSR is much simpler than Watson, and doesn&#8217;t work nearly as well, but does teach some striking lessons about question-answering systems.  My account of AskMSR is based on a paper by <a href="http://trec.nist.gov/pubs/trec10/papers/Trec2001Notebook.AskMSRFinal.pdf">Brill,   Lin, Banko, Dumais and Ng</a>, and a followup paper by <a href="http://acl.ldc.upenn.edu/W/W02/W02-1033.pdf">Brill, Dumais and   Banko</a>.</p>
<p>The AskMSR system was developed for a US-Government-run workshop called <a href="http://en.wikipedia.org/wiki/Text_Retrieval_Conference">TREC</a>. TREC is an annual multi-track workshop, with each track concentrating on a different information retrieval task.  For each TREC track the organizers pose a <em>challenge</em> to participants in the run-up to the workshop.  Participants build systems to solve those challenges, the systems are evaluated by the workshop organizers, and the results of the evaluation are discussed at the workshop.</p>
<p>For many years one of the TREC tracks was question answering (QA), and the AskMSR system was developed for the QA track at the 2002 TREC workshop.  At the time, many of the systems submitted to TREC&#8217;s QA track relied on complex linguistic analysis to understand as much as possible about the meaning of the questions being asked. The researchers behind AskMSR had a different idea.  Instead of doing sophisticated linguistic analysis of questions they decided to do a much simpler analysis, but to draw on a rich database of knowledge to find answers &#8211; the web.  As they put it:</p>
<blockquote><p>   In contrast to many question answering systems that begin with rich   linguistic resources (e.g., parsers, dictionaries, WordNet), we   begin with data and use that to drive the design of our system.  To   do this, we first use simple techniques to look for answers to   questions on the Web. </p></blockquote>
<p>As I describe in detail below, their approach was to take the question asked, to rewrite it in the form of a search engine query, or perhaps several queries, and then extract the answer by analysing the Google results for those queries.</p>
<p>The insight they had was that a large and diverse document collection (such as the web) would contain clear, easily extractable answers to a very large number of questions.  Suppose, for example, that your QA system is trying to answer the question &#8220;Who killed Abraham Lincoln?&#8221;  Suppose also that the QA system only received limited training data, including a text which read &#8220;Abraham Lincoln&#8217;s life was ended by John Wilkes Booth&#8221;.  It requires a sophisticated analysis to understand that this means Booth killed Lincoln.  If that were the only training text related to Lincoln and Booth, the system might not make the correct inference.  On the other hand, with a much large document collection it would be likely that somewhere in the documents it would plainly say &#8220;John Wilkes Booth killed Abraham Lincoln&#8221;.  Indeed, if the document collection were large enough this phrase (and close variants) might well be repeated many times.  At that point it takes much less sophisticated analysis to figure out that &#8220;John Wilkes Booth&#8221; is a good candidate answer.  As Brill <em>et al</em> put it:</p>
<blockquote><p> [T]he greater the answer redundancy in the source, the   more likely it is that we can find an answer that occurs in a simple   relation to the question, and therefore, the less likely it is that   we will need to resort to solving the aforementioned difficulties   facing natural language processing systems. </p></blockquote>
<p>To sum it up even more succinctly, the idea behind AskMSR was that: <strong>unsophisticated linguistic algorithms + large amounts of data   <img src='https://s0.wp.com/latex.php?latex=%5Cgeq&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\geq' title='\geq' class='latex' /> sophisticated linguistic algorithms + only a small amount of   data.</strong></p>
<p>I&#8217;ll now describe how AskMSR worked.  I limit my discussion to just a single type of question, questions of the form &#8220;Who&#8230;?&#8221;  The AskMSR system could deal with several different question types (&#8220;When&#8230;?&#8221;, <em>etc</em>), but the ideas used for the other question types were similar.  More generally, my description is an abridgement and simplification of the original AskMSR system, and you should refer to the original paper for more details.  The benefit of my simplified approach is that we&#8217;ll be able to create a short Python implementation of the ideas.</p>
<h3>How AskMSR worked</h3>
<p><strong>Rewriting:</strong> Suppose we&#8217;re asked a question like &#8220;Who is the richest person in the world?&#8221;  If we&#8217;re trying to use Google ourselves to answer the question, then we might simply type the question straight in.  But Google isn&#8217;t (so far as I know) designed with question-answering in mind.  So if we&#8217;re a bit more sophisticated we&#8217;ll rewrite the text of the question to turn it into a query more likely to turn up answers, queries like <tt>richest person world</tt> or <tt>world's richest person</tt>.  What makes these queries better is that they omit terms like &#8220;Who&#8221; which are unlikely to appear in <em>answers</em> to the question we&#8217;re interested in.</p>
<p>AskMSR used a similar idea of rewriting, based on very simple rules. Consider &#8220;Who is the richest person in the world?&#8221;  Chances are pretty good that we&#8217;re looking for a web page with text of the form &#8220;the richest person in the world is *&#8221;, where * denotes the (unknown) name of that person.  So a good strategy is to search for pages matching the text &#8220;the richest person in the world is&#8221; exactly, and then to extract whatever name comes to the right.</p>
<p>There are a few things to note about how we rewrote the question to get the search query.</p>
<p>Most obviously, we eliminated &#8220;Who&#8221; and the question mark.</p>
<p>A little less obviously, we moved the verb &#8220;is&#8221; to the end of the phrase.  Moving the verb in this way is common when answering questions.  Unfortunately, we don&#8217;t always move the verb in the same way.  Consider a question such as &#8220;Who is the world&#8217;s richest person married to?&#8221;  The best way to move the verb is not to the end of the sentence, but rather to move it between &#8220;the world&#8217;s richest person&#8221; and &#8220;married to&#8221;, so we are searching for pages matching the text &#8220;the world&#8217;s richest person is married to *&#8221;.</p>
<p>More generally, we&#8217;ll rewrite questions of the form <tt>Who w0 w1 w2 w3 ...</tt> (where <tt>w0 w1 w2 ...</tt> are words) as the following search queries (note that the quotes matter, and are part of the query): </p>
<pre>
"w0 w1 w2 w3 ... "
"w1 w0 w2 w3 ... "
"w1 w2 w0 w3 ... "
... 
</pre>
<p> It&#8217;s a carpet-bombing strategy, trying all possible options for the correct place for the verb in the text.  Of course, we end up with many ridiculous options, like <tt>"the world's is richest person married to"</tt>.  However, according to Brill <em>et al</em>, &#8220;While such an approach results in many nonsensical rewrites&#8230;  these very rarely result in the retrieval of bad pages, and the proper movement position is guaranteed to be found via exhaustive search.&#8221;</p>
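<p>These verb-movement rewrites are only a few lines of code to generate.  Here&#8217;s a minimal sketch (the function name is mine; the full program later in the post does the same thing with scores and an unquoted rewrite attached):</p>

```python
def verb_movement_rewrites(question):
    """Generate the quoted queries described above: drop "Who", then
    try every possible position for the verb among the remaining
    words.  Returns the queries with Google's exact-match quotes."""
    words = question.rstrip("?").lower().split()
    verb, rest = words[1], words[2:]
    return ['"%s"' % " ".join(rest[:j] + [verb] + rest[j:])
            for j in range(len(rest) + 1)]

queries = verb_movement_rewrites(
    "Who is the world's richest person married to?")
```

<p>For this question the exhaustive search produces seven queries, including the nonsensical ones, but also the one we actually want: <tt>"the world's richest person is married to"</tt>.</p>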
<p>Why does this very rarely result in the retrieval of bad pages? Ultimately, the justification must be empirical.  However, a plausible story is that rewrites such as <tt>"the world's is richest person married to"</tt> don&#8217;t make much sense, and so are likely to match only a very small collection of webpages compared to (say) <tt>"world's richest person is married to"</tt>.  Because of this, we will see few repetitions in any erroneous answers extracted from the search results for the nonsensical query, and so the nonsensical query is unlikely to result in incorrect answers.</p>
<h3>Problems for the author</h3>
<ul>
<li> Contrary to the last paragraph, I think it&#8217;s likely that   sometimes the rewritten questions <em>do</em> make sense as search   queries, but not as ways of searching for answers to the original   question.  What are some examples of such questions? </ul>
<p>Another feature of the queries above is that they are all quoted. Google uses this syntax to specify that an exact phrase match is required.  Below we&#8217;ll introduce an unquoted rewrite rule, meaning that an exact phrase match isn&#8217;t required.</p>
<p>With the rules I&#8217;ve described above we rewrite &#8220;Who is the world&#8217;s richest person?&#8221; as both the query &#8220;is the world&#8217;s richest person&#8221; and the query &#8220;the world&#8217;s richest person is&#8221;.  There are several other variations, but for now we&#8217;ll focus on just these two.  For the first query, &#8220;is the world&#8217;s richest person&#8221;, notice that if we find this text in a page then it&#8217;s likely to read something like &#8220;Bill Gates is the world&#8217;s richest person&#8221; (or perhaps &#8220;<a href="http://en.wikipedia.org/wiki/Carlos_Slim">Carlos Slim</a> is&#8230;&#8221;, since Slim has recently overtaken Gates in wealth).  If we search on the second query, then returned pages are more likely to read &#8220;the world&#8217;s richest person is Bill Gates&#8221;.</p>
<p>What this suggests is that accompanying the rewritten query, we should also specify whether we expect the answer to appear on the left (L) or the right (R).  Brill <em>et al</em> adopted the rule to expect the answer on the left when the verb starts the query, and on the right when the verb appears anywhere else.  With this change the specification of the rewrite rules becomes: </p>
<pre>
["w0 w1 w2 w3 ... ", L]
["w1 w0 w2 w3 ... ", R]
["w1 w2 w0 w3 ... ", R]
... 
</pre>
<p> Brill <em>et al</em> don&#8217;t justify this choice of locations.  I haven&#8217;t thought hard about it, but the few examples I&#8217;ve tried suggest that it&#8217;s not a bad convention, although I&#8217;ll bet you could come up with counterexamples.</p>
<p>For some rewritten queries Brill <em>et al</em> allow the answer to appear on either (E) side.  In particular, they append an extra rewrite rule to the above list which is: </p>
<pre>
[w1 w2 w3..., E]
</pre>
<p> This differs in two ways from the earlier rewrite rules.  First, the search query is unquoted, so it doesn&#8217;t require an exact match, or for the order of the words to be exactly as specified.  Second, the verb is entirely omitted.  For these reasons it makes sense to look for answers on either side of the search text.</p>
<p>Of course, if we extract a candidate answer from this less precise search then we won&#8217;t be as confident in the answer.  For that reason we also assign a score to each rewrite rule.  Here&#8217;s the complete set of rewrite rules we use, including scores.  (I&#8217;ll explain how we use the scores a little later: for now all that matters is that higher scores are better.) </p>
<pre>
["w0 w1 w2 w3 ... ", L, 5]
["w1 w0 w2 w3 ... ", R, 5]
["w1 w2 w0 w3 ... ", R, 5]
... 
[w1 w2 w3..., E, 2]
</pre>
<h3>Problems for the author</h3>
<ul>
<li> At the moment we&#8217;ve implicitly assumed that the verb is a single   word after &#8220;Who&#8221;.  However, sometimes the verb will be more   complex.  For example, in the sentence &#8220;Who shot and killed   President Kennedy?&#8221; the verb is &#8220;shot and killed&#8221;.  How can we   identify such complex compound verbs? </ul>
<p><strong>Extracting candidate answers:</strong> We submit each rewritten query to Google, and extract Google&#8217;s document summaries for the top 10 search results.  We then split the summaries into sentences, and each sentence is assigned a score, which is just the score of the rewrite rule the sentence originated from.</p>
<p>How should we extract candidate answers from these sentences?  First, we remove text which overlaps the query itself &#8211; we&#8217;re looking for terms <em>near</em> the query, not the same as the query!  This ensures that we don&#8217;t answer questions such as &#8220;Who killed John Fitzgerald Kennedy?&#8221;  with &#8220;John Fitzgerald Kennedy&#8221;.  We&#8217;ll call the text that remains after deletion a truncated sentence.</p>
<p>With the sentence truncated, the idea then is to look for 1-, 2-, and 3-word strings (n-grams) which recur often in the truncated sentences. To do this we start by listing every n-gram that appears in at least one truncated sentence.  This is our list of candidate answers.  For each such n-gram we&#8217;ll assign a total score.  Roughly speaking, the total score is the sum of the scores of all the truncated sentences that the n-gram appears in.</p>
<p>In fact, we can improve the performance of the system by modifying this scoring procedure.  We do this by boosting the score of a sentence if one or more words in the n-gram are capitalized.  We do this because capitalization likely indicates a proper noun, which means the word is an especially good candidate to be part of the correct answer.  The system outputs the n-gram which has the highest total score as its preferred answer.</p>
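<p>As a concrete illustration of this scoring step, here&#8217;s a stripped-down sketch using a capitalization boost of 3 (the value used in the toy program below); the sentences and their scores are made up:</p>

```python
from collections import defaultdict

def ngrams(words, n):
    """All n-grams in `words`, as tuples."""
    return [tuple(words[j:j+n]) for j in range(len(words) - n + 1)]

def score_candidates(scored_sentences, boost=3):
    """Total up scores for every 1-, 2- and 3-gram across the
    truncated sentences.  Each occurrence contributes the sentence's
    score times boost**(number of capitalized words in the n-gram)."""
    totals = defaultdict(int)
    for sentence, score in scored_sentences:
        words = sentence.split()
        for n in (1, 2, 3):
            for ngram in ngrams(words, n):
                caps = sum(1 for w in ngram if w == w.capitalize())
                totals[ngram] += score * boost**caps
    return totals

sentences = [("John Wilkes Booth shot the president", 5),
             ("some say John Wilkes Booth acted alone", 2)]
totals = score_candidates(sentences)
best = max(totals, key=totals.get)
```

<p>The fully capitalized 3-gram <tt>("John", "Wilkes", "Booth")</tt> picks up a factor of 27 from each occurrence and easily wins, which is exactly the behaviour we want from the boost.</p>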
<p><strong>Toy program:</strong> Below is a short Python 2.7 program which implements the algorithm described above.  (It omits the left-versus-right-versus-either distinction, though.)  The code is also available on <a href="https://github.com/mnielsen/mini_qa/tree/blog-v1">GitHub</a>, together with a small third-party library (via <a href="http://breakingcode.wordpress.com/2010/06/29/google-search-python/">Mario   Vilas</a>) that&#8217;s used to access Google search results.  Here&#8217;s the code:</p>
<pre>
#### mini_qa.py
#
# A toy question-answering system, which uses Google to attempt to
# answer questions of the form "Who... ?"  An example is: "Who
# discovered relativity?"
#
# The design is a simplified version of the AskMSR system developed by
# researchers at Microsoft Research.  The original paper is:
#
# Brill, Lin, Banko, Dumais and Ng, "Data-Intensive Question
# Answering" (2001).
#
# I've described background to this program here:
#
# https://michaelnielsen.org/ddi/how-to-answer-a-question-v1/

#### Copyright and licensing
#
# MIT License - see GitHub repo for details:
#
# https://github.com/mnielsen/mini_qa/blob/blog-v1/mini_qa.py
#
# Copyright (c) 2012 Michael Nielsen

#### Library imports

# standard library
from collections import defaultdict
import re

# third-party libraries
from google import search

def pretty_qa(question, num=10):
    """
    Wrapper for the `qa` function.  `pretty_qa` prints the `num`
    highest scoring answers to `question`, with the scores in
    parentheses.
    """
    print "\nQ: "+question
    for (j, (answer, score)) in enumerate(qa(question)[:num]):
        print "%s. %s (%s)" % (j+1, " ".join(answer), score)

def qa(question):
    """
    Return a list of tuples whose first entry is a candidate answer to
    `question`, and whose second entry is the score for that answer.
    The tuples are ordered in decreasing order of score.  Note that
    the answers themselves are tuples, with each entry being a word.
    """
    answer_scores = defaultdict(int)
    for query in rewritten_queries(question):
        for summary in get_google_summaries(query.query):
            for sentence in sentences(summary):
                for ngram in candidate_answers(sentence, query.query):
                    answer_scores[ngram] += ngram_score(ngram, 
                                                        query.score)
    return sorted(answer_scores.iteritems(), 
                  key=lambda x: x[1], 
                  reverse=True)

def rewritten_queries(question):
    """
    Return a list of RewrittenQuery objects, containing the search
    queries (and corresponding weighting score) generated from
    `question`.  
    """
    rewrites = [] 
    tq = tokenize(question)
    verb = tq[1] # the simplest assumption, something to improve
    rewrites.append(
        RewrittenQuery("\"%s %s\"" % (verb, " ".join(tq[2:])), 5))
    for j in range(2, len(tq)):
        rewrites.append(
            RewrittenQuery(
                "\"%s %s %s\"" % (
                    " ".join(tq[2:j+1]), verb, " ".join(tq[j+1:])),
                5))
    rewrites.append(RewrittenQuery(" ".join(tq[2:]), 2))
    return rewrites

def tokenize(question):
    """
    Return a list containing a tokenized form of `question`.  Works by
    lowercasing, splitting around whitespace, and stripping all
    non-alphanumeric characters.  
    """
    return [re.sub(r"\W", "", x) for x in question.lower().split()]

class RewrittenQuery():
    """
    Given a question we rewrite it as a query to send to Google.
    Instances of the RewrittenQuery class are used to store these
    rewritten queries.  Instances have two attributes: the text of the
    rewritten query, which is sent to Google; and a score, indicating
    how much weight to give to the answers.  The score is used because
    some queries are much more likely to give highly relevant answers
    than others.
    """

    def __init__(self, query, score):
        self.query = query
        self.score = score

def get_google_summaries(query):
    """
    Return a list of the top 10 summaries associated to the Google
    results for `query`.  Returns all available summaries if there are
    fewer than 10 summaries available.  Note that these summaries are
    returned as BeautifulSoup.BeautifulSoup objects, and may need to
    be manipulated further to extract text, links, etc.
    """
    return search(query)

def sentences(summary):
    """
    Return a list whose entries are the sentences in the
    BeautifulSoup.BeautifulSoup object `summary` returned from Google.
    Note that the sentences contain alphabetical and space characters
    only, and all punctuation, numbers and other special characters
    have been removed.
    """
    text = remove_spurious_words(text_of(summary))
    sentences = [sentence for sentence in text.split(".") if sentence]
    return [re.sub(r"[^a-zA-Z ]", "", sentence) for sentence in sentences]

def text_of(soup):
    """
    Return the text associated to the BeautifulSoup.BeautifulSoup
    object `soup`.
    """
    return "".join(map(str, soup.findAll(text=True)))

def remove_spurious_words(text):
    """
    Return `text` with spurious words stripped.  For example, Google
    includes the word "Cached" in many search summaries, and this word
    should therefore mostly be ignored.
    """
    spurious_words = ["Cached", "Similar"]
    for word in spurious_words:
        text = text.replace(word, "")
    return text

def candidate_answers(sentence, query):
    """
    Return all the 1-, 2-, and 3-grams in `sentence`.  Terms appearing
    in `query` are filtered out.  Note that the n-grams are returned
    as a list of tuples.  So a 1-gram is a tuple with 1 element, a
    2-gram is a tuple with 2 elements, and so on.
    """
    filtered_sentence = [word for word in sentence.split() 
                         if word.lower() not in query]
    return sum([ngrams(filtered_sentence, j) for j in range(1,4)], [])

def ngrams(words, n=1):
    """
    Return all the `n`-grams in the list `words`.  The n-grams are
    returned as a list of tuples, each tuple containing an n-gram, as
    per the description in `candidate_answers`.
    """
    return [tuple(words[j:j+n]) for j in xrange(len(words)-n+1)]

def ngram_score(ngram, score):
    """
    Return the score associated to `ngram`.  The base score is
    `score`, but it's modified by a factor which is 3 to the power of
    the number of capitalized words.  This biases answers toward
    proper nouns.
    """
    num_capitalized_words = sum(
        1 for word in ngram if is_capitalized(word)) 
    return score * (3**num_capitalized_words)

def is_capitalized(word):
    """
    Return True or False according to whether `word` is capitalized.
    """
    return word == word.capitalize()

if __name__ == "__main__":
    pretty_qa("Who ran the first four-minute mile?")
    pretty_qa("Who makes the best pizza in New York?")
    pretty_qa("Who invented the C programming language?")
    pretty_qa("Who wrote the Iliad?")
    pretty_qa("Who caused the financial crash of 2008?")
    pretty_qa("Who caused the Great Depression?")
    pretty_qa("Who is the most evil person in the world?")
    pretty_qa("Who wrote the plays of William Shakespeare?")
    pretty_qa("Who is the world's best tennis player?")
    pretty_qa("Who is the richest person in the world?")
</pre>
<p>If you run the program you&#8217;ll see that the results are a mixed bag. When I tested it, it knew that Roger Bannister ran the first four-minute mile, that Dennis Ritchie invented the C programming language, and that Homer wrote the Iliad.  On the other hand, the program sometimes gives answers that are either wrong or downright nonsensical.  For example, it thinks that Occupy Wall Street caused the financial crash of 2008 (Ben Bernanke also scores highly).  And it replies &#8220;Laureus Sports Awards&#8221; when asked who is the world&#8217;s best tennis player.  So it&#8217;s quite a mix of good and bad results.</p>
<p>While developing the program I used some questions repeatedly to figure out how to improve performance.  For example, I often asked it the questions &#8220;Who invented relativity?&#8221; and &#8220;Who killed Abraham Lincoln?&#8221;  Unsurprisingly, the program now answers both questions correctly!  So to make things fairer the questions used as examples in the code aren&#8217;t ones I tested while developing the program.  They&#8217;re still far from a random sample, but at least the most obvious form of bias has been removed.</p>
<p>Much can be done to improve this program.  Here are a few ideas:</p>
<h3>Problems</h3>
<ul>
<li> To solve many of the problems I describe below it would help to   have a systematic procedure to evaluate the performance of the   system, and, in particular, to compare the performance of different   versions of the system.  How can we build such a systematic   evaluation procedure?
<li> The program does no sanity-checking of questions.  For example,   it simply drops the first word of the question.  As a result, the   question &#8220;Foo killed Abraham Lincoln?&#8221; is treated identically to   &#8220;Who killed Abraham Lincoln?&#8221;  Add some basic sanity checks to   ensure the question satisfies standard constraints on formatting,   etc.
<li> More generally, the program makes no distinction between   sensible and nonsensical questions.  You can ask it &#8220;Who coloured   green ideas sleeping furiously?&#8221; and it will happily answer.   (Perhaps   <a href="http://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously">appropriately</a>,   &#8220;Chomsky&#8221; is one of the top answers.)  How might the program   figure out that the question doesn&#8217;t make sense?  (This is a problem   shared by many other systems.  Here is   <a href="http://www.wolframalpha.com/input/?i=who+coloured+green+ideas+sleeping+furiously">Wolfram Alpha&#8217;s response</a> to the question, and also   <a href="https://www.google.com/search?q=who+coloured+green+ideas+while+sleeping+furiously?">Google&#8217;s     response</a>.)
<li> The program boosts the score of n-grams containing capitalized   words.  This makes sense most of the time, since a capitalized word   is likely to be a proper noun, and is thus a good candidate to be   the answer to the question.  But it makes less sense when the word   is the first word of the sentence.  What should be done then?
<li> In the sentences extracted from Google we removed terms which   appeared in the original query.  This means that, for example, we   won&#8217;t answer &#8220;Who killed John Fitzgerald Kennedy?&#8221; with &#8220;John   Fitzgerald Kennedy&#8221;.  It&#8217;d be good to take this further by also   eliminating synonyms, so that we don&#8217;t answer &#8220;JFK&#8221;.  How can this   be done?
<li> There are occasional exceptions to the rule that answers don&#8217;t   repeat terms in the question.  For example, the correct answer to   the question &#8220;Who wrote the plays of William Shakespeare?&#8221; is,   most likely, &#8220;William Shakespeare&#8221;.  How can we identify questions   where this is likely to be the case?
<li> How does it change the results to use more than 10 search   summaries?  Do the results get better or worse?
<li> An alternative approach to using Google would be to use Bing,   Yahoo! BOSS, Wolfram Alpha, or Cyc as sources.  Or we could build   our own source using tools such as   <a href="http://nutch.apache.org/">Nutch</a> and   <a href="http://lucene.apache.org/solr/">Solr</a>. How do the resulting   systems compare to one another?  Is it possible to combine the   sources to do better than any single source alone?
<li> Some n-grams are much more common than others: the phrase &#8220;he   was going&#8221; occurs   <a href="http://books.google.com/ngrams/graph?content=he+was+going%2CLee+Harvey+Oswald&#038;year_start=1800&#038;year_end=2000&#038;corpus=0&#038;smoothing=3">far more often in English text</a> than the phrase &#8220;Lee Harvey Oswald&#8221;,   for example.  Because of this, chances are that repetitions of the   phrase &#8220;Lee Harvey Oswald&#8221; in search results are more meaningful   than repetitions of more common phrases, such as &#8220;he was going&#8221;.   It&#8217;d be natural to modify the program to give less common phrases a   higher score.  What&#8217;s the right way of doing this?  Does using   <a href="https://michaelnielsen.org/ddi/documents-as-geometric-objects-how-to-rank-documents-for-full-text-search/">inverse     document frequency</a> give better performance, for example?
<li> Question-answering is really three problems: (1) understanding   the question; (2) figuring out the answer; and (3) explaining the   answer.  In this post I&#8217;ve concentrated on (2) and (to some extent)   (1), but not (3).  How might we go about explaining the answers   generated by our system?
</ul>
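<p>One of the problems above asks whether weighting by inverse document frequency improves performance.  Here's a minimal sketch of the idea in Python; all the counts below are made up purely for illustration:</p>

```python
import math

# Hypothetical counts, for illustration only: how often each candidate
# phrase is repeated in the search summaries, and in how many of (say)
# 1000 background documents the phrase appears.
candidates = {"Lee Harvey Oswald": 4, "he was going": 6}
doc_freq = {"Lee Harvey Oswald": 3, "he was going": 400}
num_docs = 1000

def idf_score(phrase, repetitions):
    """Weight raw repetitions by inverse document frequency,
    so rare phrases score higher than common ones."""
    idf = math.log(num_docs / doc_freq[phrase])
    return repetitions * idf

scores = {p: idf_score(p, n) for p, n in candidates.items()}
# "Lee Harvey Oswald" now outscores "he was going", despite
# being repeated less often in the summaries.
```

Whether this particular weighting beats raw repetition counts is exactly the kind of question a systematic evaluation procedure (the first problem above) would let us answer.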
<h3>Discussion</h3>
<p>Let&#8217;s come back to the idea I described in the opening: <strong>unsophisticated algorithms + large amounts of data <img src='https://s0.wp.com/latex.php?latex=%5Cgeq&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\geq' title='\geq' class='latex' />   sophisticated linguistic algorithms + only a small amount of data</strong>.</p>
<p>Variations on this idea have become common recently.  In an influential 2009 paper, <a href="http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf">Halevy, Norvig, and Pereira</a> wrote that:</p>
<blockquote><p> [I]nvariably, simple models and a lot of data trump more   elaborate models based on less data. </p></blockquote>
<p>They were writing in the context of machine translation and speech recognition, but this point of view has become commonplace well beyond machine translation and speech recognition.  For example, at the 2012 O&#8217;Reilly Strata California conference on big data, there was a debate on the idea that &#8220;[i]n data science, domain expertise is more important than machine learning skill.&#8221;  The people favouring the machine learning side <a href="http://radar.oreilly.com/2012/03/subject-matter-experts-data-stories-analysis.html">won   the debate</a>, at least in the eyes of the audience.  Admittedly, <em>a priori</em> you&#8217;d expect this audience to strongly favour machine learning.  Still, I expect that just a few years earlier even a highly sympathetic audience would have tilted the other way.</p>
<p>An interesting way of thinking about this idea that data trumps the quality of your models and algorithms is in terms of a tradeoff curve:</p>
<p><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2012/06/algorithms_vs_data.png" width="280px"></p>
<p>The curve shown is one of constant performance: as you increase the amount of training data, you decrease the quality of the algorithm required to get the same performance.  The tradeoff between the two is determined by the slope of the curve.  </p>
<p>A difficulty with the &#8220;more data is better&#8221; point of view is that it&#8217;s not clear how to determine what the tradeoffs are in practice: is the slope of the curve very shallow (more data helps more than better algorithms), or very steep (better algorithms help more than more data)?  To put it another way, it&#8217;s not obvious whether to focus on acquiring more data, or on improving your algorithms.  Perhaps the correct moral to draw is that this is a <em>key</em> tradeoff to think about when deciding how to allocate effort.  At least in the case of the AskMSR system, taking the more data idea seriously enabled the team to very quickly build a system that was competitive with other systems which had taken much longer to develop.</p>
<p>A second difficulty with the &#8220;more data is better&#8221; point of view is, of course, that sometimes more data is extremely expensive to generate. IBM has said that it intends to use Watson to <a href="http://www.wired.com/wiredenterprise/2012/03/ibm-watson/">help   doctors answer medical questions</a>.  Yet for such highly specialized questions it&#8217;s not so clear that it will be easy to find data sources which meet the criteria outlined by Brill <em>et al</em>:</p>
<blockquote><p> [T]he greater the answer redundancy in the source, the   more likely it is that we can find an answer that occurs in a simple   relation to the question, and therefore, the less likely it is that   we will need to resort to solving the aforementioned difficulties   facing natural language processing systems. </p></blockquote>
<p><strong>Data versus understanding:</strong> I&#8217;ll finish by briefly discussing one possible criticism of the data-driven approach to question-answering.  That criticism is that it rejects detailed understanding in favour of a simplistic data-driven associative model. It&#8217;s tempting to think that this must therefore be the wrong track, a blind alley that may pay off in the short term, but which will ultimately be unfruitful. I think that point of view is shortsighted. We don&#8217;t yet understand how human cognition works, but we do know that we often use after-the-fact rationalization and <a href="http://en.wikipedia.org/wiki/Confabulation">confabulation</a>, suggesting that the basis for many of our decisions isn&#8217;t a detailed understanding, but something more primitive.  This isn&#8217;t to say that rationalization and confabulation are desirable.  But it does mean that it&#8217;s worth pushing beyond narrow conceptions of understanding, and trying to determine how much of our intelligence can be mimicked using simple data-driven associative techniques, and by combining such techniques with other ideas.</p>
<p><strong>Acknowledgements:</strong> Thanks to <a href="https://github.com/thomasballinger">Thomas Ballinger</a>, <a href="https://github.com/doda">Dominik Dabrowski</a>, <a href="https://twitter.com/#!/@np_hoffman">Nathan Hoffman</a>, and <a href="https://github.com/happy4crazy">Allan O&#8217;Donnell</a> for comments that improved my code. </p>
<p><em>Interested in more?  Please <a href="https://michaelnielsen.org/ddi/feed/">subscribe to this blog</a>, or <a href="http://twitter.com/#!/michael_nielsen">follow me on Twitter</a>.  You may also enjoy reading my new book about  open science, <a href="http://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/product-description/0691148902">Reinventing Discovery</a>. </em> </p>
]]></content:encoded>
					
					<wfw:commentRss>https://michaelnielsen.org/ddi/how-to-answer-a-question-a-simple-system/feed/</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
			</item>
		<item>
		<title>Lisp as the Maxwell&#8217;s equations of software</title>
		<link>https://michaelnielsen.org/ddi/lisp-as-the-maxwells-equations-of-software/</link>
					<comments>https://michaelnielsen.org/ddi/lisp-as-the-maxwells-equations-of-software/#comments</comments>
		
		<dc:creator><![CDATA[Michael Nielsen]]></dc:creator>
		<pubDate>Wed, 11 Apr 2012 23:52:19 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://michaelnielsen.org/ddi/?p=62</guid>

					<description><![CDATA[On my first day of physics graduate school, the professor in my class on electromagnetism began by stepping to the board, and wordlessly writing four equations: He stepped back, turned around, and said something like [1]: &#8220;These are Maxwell&#8217;s equations. Just four compact equations. With a little work it&#8217;s easy to understand the basic elements&#8230; <a class="more-link" href="https://michaelnielsen.org/ddi/lisp-as-the-maxwells-equations-of-software/">Continue reading <span class="screen-reader-text">Lisp as the Maxwell&#8217;s equations of software</span></a>]]></description>
										<content:encoded><![CDATA[<p>On my first day of physics graduate school, the professor in my class on electromagnetism began by stepping to the board, and wordlessly writing four equations:</p>
<img src='https://s0.wp.com/latex.php?latex=+%5Cnabla+%5Ccdot+E+%3D+%5Cfrac%7B%5Crho%7D%7B%5Cepsilon_0%7D+%5Chspace%7B2cm%7D+%5Cnabla+%5Ctimes+E%2B%5Cfrac%7B%5Cpartial+B%7D%7B%5Cpartial+t%7D+%3D+0++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt=' \nabla \cdot E = \frac{\rho}{\epsilon_0} \hspace{2cm} \nabla \times E+\frac{\partial B}{\partial t} = 0  ' title=' \nabla \cdot E = \frac{\rho}{\epsilon_0} \hspace{2cm} \nabla \times E+\frac{\partial B}{\partial t} = 0  ' class='latex' />
<img src='https://s0.wp.com/latex.php?latex=+%5Cnabla+%5Ccdot+B+%3D+0+%5Chspace%7B1.1cm%7D+%5Cnabla+%5Ctimes+B-%5Cmu_0+%5Cepsilon_0+%5Cfrac%7B%5Cpartial+E%7D%7B%5Cpartial+t%7D+%3D+%5Cmu_0+j+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt=' \nabla \cdot B = 0 \hspace{1.1cm} \nabla \times B-\mu_0 \epsilon_0 \frac{\partial E}{\partial t} = \mu_0 j ' title=' \nabla \cdot B = 0 \hspace{1.1cm} \nabla \times B-\mu_0 \epsilon_0 \frac{\partial E}{\partial t} = \mu_0 j ' class='latex' />
<p>He stepped back, turned around, and said something like [1]: &#8220;These are Maxwell&#8217;s equations.  Just four compact equations.  With a little work it&#8217;s easy to understand the basic elements of the equations &#8211; what all the symbols mean, how we can compute all the relevant quantities, and so on.  But while it&#8217;s easy to understand the elements of the equations, understanding all their consequences is another matter.  Inside these equations is all of electromagnetism &#8211; everything from antennas to motors to circuits.  If you think you understand the consequences of these four equations, then you may leave the room now, and you can come back and ace the exam at the end of semester.&#8221;</p>
<p><a href="http://en.wikipedia.org/wiki/Alan_Kay">Alan Kay</a> has <a href="http://queue.acm.org/detail.cfm?id=1039523">famously described</a> Lisp as the &#8220;Maxwell&#8217;s equations of software&#8221;.  He describes the revelation he experienced when, as a graduate student, he was studying the <a href="http://www.softwarepreservation.org/projects/LISP/book/LISP%201.5%20Programmers%20Manual.pdf">LISP 1.5 Programmer&#8217;s Manual</a> and realized that &#8220;the half page of code on the bottom of page 13&#8230; was Lisp in itself. These were “Maxwell’s Equations of Software!” This is the whole world of programming in a few lines that I can put my hand over.&#8221;</p>
<p>Here&#8217;s the half page of code that Kay saw in that manual:</p>
<p><a href="https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations.png"><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations.png" alt="" title="Lisp_Maxwells_Equations" width="480" class="alignnone size-full wp-image-63" srcset="https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations.png 555w, https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations-291x300.png 291w" sizes="(max-width: 555px) 100vw, 555px" /></a></p>
<p>What we&#8217;re going to do in this essay is understand what that half page of code means, and what it means that Lisp is the Maxwell&#8217;s equations of software.  However, we won&#8217;t literally work through the half page of code above.  Instead, we&#8217;ll do something much more informative: we&#8217;ll create a modern, <em>fully executable</em> equivalent of the code above.  Furthermore, to make this essay accessible, I won&#8217;t assume that you know Lisp.  Instead, I&#8217;ll teach you the basic elements of Lisp.</p>
<p>That perhaps sounds over-ambitious, but the good news is that it&#8217;s easy to learn the basic elements of Lisp.  Provided you have a little facility with computer programming and comfort with mathematics, you can learn how Lisp works in just a few minutes.  Frankly, it&#8217;s much easier than understanding the elements of Maxwell&#8217;s equations!  And so I&#8217;ll start by explaining a subset of the Lisp programming language, and getting you to write some Lisp code.</p>
<p>But I won&#8217;t stop with just showing you how to write some Lisp.  Once we&#8217;ve done that we&#8217;re going to write an <em>interpreter</em> for Lisp code.  In particular, we&#8217;ll create a interpreter based on a <a href="http://norvig.com/lispy.html">beautiful Lisp interpreter</a> written by <a href="http://norvig.com/">Peter Norvig</a>, which contains just 90 lines of Python code.  Our interpreter will be a little more complex, due mostly to the addition of a few conveniences absent from Norvig&#8217;s interpreter.  The code is still simple and easy to understand, provided you&#8217;re comfortable reading Python code.  As we&#8217;ll see, the benefit of writing the interpreter is not just that it gives us a running interpreter (although that&#8217;s no small thing).  It&#8217;s that writing an interpreter also deepens our understanding of Lisp.  It does that by taking what would otherwise be some rather abstract concepts in our description of Lisp, and giving them concrete, tangible representations in terms of Python code and data structures. By making concrete what was formerly abstract, the code for our Lisp interpreter gives us a new way of understanding how Lisp works.</p>
<p>With our Python Lisp interpreter up and running, we&#8217;ll then write a modern equivalent to the code on the bottom of page 13 of the LISP 1.5 Programmer&#8217;s Manual.  But while our code will be essentially the same as the code from page 13, it will have the considerable advantage that it&#8217;s also executable.  We can, if we wish, play with the code, modify it, and improve it.  In other words, it&#8217;s a living version of the Maxwell&#8217;s equations of software!  Furthermore, with our new understanding it becomes an easy and fun exercise to understand all the details on page 13 of the LISP Manual.  </p>
<p>This second part of the essay is based primarily on two sources: the first chapter of the LISP 1.5 Manual, of course, but also an <a href="http://lib.store.yahoo.net/lib/paulgraham/jmc.ps">essay by Paul   Graham</a> (postscript) in which he explains some of the early ideas behind Lisp.  Incidentally, &#8220;LISP&#8221; is the capitalization used in the LISP Manual, but otherwise I&#8217;ll use the modern capitalization convention, and write &#8220;Lisp&#8221;.</p>
<p>The great Norwegian mathematician <a href="http://en.wikipedia.org/wiki/Niels_Henrik_Abel">Niels Henrik   Abel</a> was once asked how he had become so good at mathematics.  He replied that it was <a href="http://scienceworld.wolfram.com/biography/Abel.html">&#8220;by   studying the masters, not their pupils&#8221;</a>.  The current essay is motivated by Abel&#8217;s admonishment.  As a programmer, I&#8217;m a beginner (and almost completely new to Lisp), and so this essay is a way for me to work in detail through ideas from masters such as Alan Kay, Peter Norvig, and Paul Graham.  Of course, if one takes Abel at his word, then you should stop reading this essay, and instead go study the works of Kay, Norvig, Graham, and the like!  I certainly recommend taking the time to study their work, and at the end of the essay I make some recommendations for further reading.  However, I hope that this essay has a distinct enough point of view to be of interest in its own right.  Above all, I hope the essay makes thinking about Lisp (and programming) fun, and that it raises some interesting fundamental questions.  Of course, as a beginner, the essay may contain some misunderstandings or errors (perhaps significant ones), and I&#8217;d welcome corrections, pointers, and discussion.</p>
<h3>Some elements of Lisp</h3>
<p>In this and the next two sections we&#8217;ll learn the basic elements of Lisp &#8211; enough to develop our own executable version of what Alan Kay saw on page 13 of the LISP Manual.  We&#8217;ll call the dialect of Lisp we develop &#8220;tiddly Lisp&#8221;, or just tiddlylisp for short.  Tiddlylisp is based on a subset of the programming language Scheme, which is one of the most popular modern dialects of Lisp. </p>
<p>While I can show you a bunch of examples of Lisp, you&#8217;ll learn much more if you type in the examples yourself, and then play around with them, modifying them and trying your own ideas.  So I want you to download the file <a href="https://raw.github.com/mnielsen/tiddlylisp/master/tiddlylisp.py">tiddlylisp.py</a> to your local machine.  Alternately, if you&#8217;re using <tt>git</tt>, you can just clone the entire <a href="https://github.com/mnielsen/tiddlylisp">code repository</a> associated to this essay.  The file tiddlylisp.py is the Lisp interpreter whose design and code we&#8217;ll work through later in the essay.  On Linux and Mac you can start the tiddlylisp interpreter by typing <tt>python tiddlylisp.py</tt> from the command line.  You should see a prompt: </p>
<pre>
tiddlylisp>
</pre>
<p> That&#8217;s where you&#8217;re going to type in the code from the examples &#8211; it&#8217;s an interactive Lisp interpreter.  You can exit the interpreter at any time by hitting <tt>Ctrl-C</tt>.  Note that the interpreter isn&#8217;t terribly complete &#8211; as we&#8217;ll see, it&#8217;s only 153 lines! &#8211; and one way it&#8217;s incomplete is that the error messages aren&#8217;t very informative.  Don&#8217;t spend a lot of time worrying about the errors; just try again.</p>
<p>If you&#8217;re on Windows, and don&#8217;t already have Python installed, you&#8217;ll need to <a href="http://www.python.org/download/releases/2.7.2/">download   it</a> (I suggest Python 2.7 for this code), before starting the interpreter by running <tt>python tiddlylisp.py</tt>.  Note that I&#8217;ve only tested the interpreter on Ubuntu Linux, not on Windows or the Mac.</p>
<p>As our first example, type the following code into the tiddlylisp interpreter: </p>
<pre>
tiddlylisp> (+ 2 3)
5
</pre>
<p> In the <em>expression</em> you typed in, <tt>+</tt> is a built-in <em>procedure</em>, which represents the addition operation.  It&#8217;s being applied to two <em>arguments</em>, in this case the numbers <tt>2</tt> and <tt>3</tt>.  The interpreter <em>evaluates</em> the result of applying <tt>+</tt> to <tt>2</tt> and <tt>3</tt>, i.e., of adding <tt>2</tt> and <tt>3</tt> together, <em>returning the value</em> <tt>5</tt>, which it then <em>prints</em>.</p>
<p>That first example was extremely simple, but it contains many of the concepts we need to understand Lisp: expressions, procedures, arguments, evaluation, returning a value, and printing.  We&#8217;ll see many more illustrations of these ideas below.</p>
<p>Here&#8217;s a second example, illustrating another built-in procedure, this time the multiplication procedure, <tt>*</tt>: </p>
<pre>
tiddlylisp> (* 3 4)
12
</pre>
<p> It&#8217;s the same basic story: <tt>*</tt> is a built-in procedure, this time representing multiplication, and is here being applied to the numbers <tt>3</tt> and <tt>4</tt>.  The interpreter evaluates the expression, and prints the value returned, which is <tt>12</tt>.</p>
<p>One potentially confusing thing about the first two examples is that I&#8217;ve called <tt>+</tt> and <tt>*</tt> &#8220;procedures&#8221;, yet in many programming languages procedures don&#8217;t return a value, only functions do.  I&#8217;m using this terminology because it&#8217;s the standard terminology in the programming language Scheme, which is the dialect of Lisp that tiddlylisp is based on.  In fact, in some modern dialects of Lisp &#8211; such as Common Lisp &#8211; operations such as <tt>+</tt> and <tt>*</tt> would be called &#8220;functions&#8221;.  But we&#8217;ll stick with Scheme&#8217;s usage, and talk only of procedures.</p>
<p>Another example, illustrating a slightly different type of procedure: </p>
<pre>
tiddlylisp> (< 10 20)
True
</pre>
<p> Here, the built-in procedure <tt><</tt> represents the comparison operator.  And so tiddlylisp prints the result of evaluating an expression which compares the constants <tt>10</tt> and <tt>20</tt>.  The result is <tt>True</tt>, since <tt>10</tt> is less than <tt>20</tt>.  By contrast, we have </p>
<pre>
tiddlylisp> (< 17 12)
False
</pre>
<p> since <tt>17</tt> is not less than <tt>12</tt>.</p>
<p>Many Lisp beginners are initially confused by this way of writing basic numerical operations.  We're so familiar with expressions such as <tt>2 + 3</tt> that the changed order in <tt>(+ 2 3)</tt> appears strange and unfamiliar. And yet our preference for the infix notation <tt>2 + 3</tt> over the prefix notation <tt>(+ 2 3)</tt> is more a historical accident than a consequence of anything fundamental about arithmetic.  Unfortunately, this has the consequence that some people back away from Lisp, simply because they dislike thinking in this unfamiliar way.</p>
<p>In this essay we won't get deep enough into Lisp to see in a lot of concrete detail why the prefix notation is a good idea.  However, we will get one hint: our tiddlylisp interpreter will be much simpler because it uses the same (prefix) style for all procedures.  If you really strongly dislike the prefix notation, then I challenge you to rewrite tiddlylisp so that it uses infix notation for some operations, and prefix notation for others.  What you'll find is that the interpreter becomes quite a bit more complex.  And so there is a sense in which using the same prefix style everywhere makes Lisp simpler.</p>
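<p>To make that hint concrete, here's a tiny sketch in Python (the language we'll later use for tiddlylisp itself &#8211; this is <em>not</em> tiddlylisp's actual code) of why uniform prefix notation keeps an evaluator simple: every expression is handled by one rule, &#8220;apply the first element to the evaluated rest&#8221;:</p>

```python
import operator

# A few built-in procedures, mapped to their Python equivalents.
ops = {"+": operator.add, "*": operator.mul, "<": operator.lt}

def eval_prefix(expr):
    """Evaluate a prefix expression represented as a nested Python list."""
    if isinstance(expr, (int, float)):  # a constant evaluates to itself
        return expr
    op, *args = expr                    # otherwise: procedure, then arguments
    return ops[op](*[eval_prefix(a) for a in args])

print(eval_prefix(["+", 2, 3]))  # 5
print(eval_prefix(["*", 3, 4]))  # 12
```

With infix notation the evaluator would also need precedence and associativity rules; here a single recursive case suffices.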
<p>Another useful thing Lisp allows us to do is to nest expressions: </p>
<pre>
tiddlylisp> (< (* 11 19) (* 9 23))
False
</pre>
<p> To compute the value of this expression, tiddlylisp evaluates the nested expressions, returning <tt>209</tt> from <tt>(* 11 19)</tt>, and <tt>207</tt> from <tt>(* 9 23)</tt>.  The output from the outer expression is then just the result of evaluating <tt>(< 209 207)</tt>, which of course is <tt>False</tt>.</p>
<p>We can use tiddlylisp to define variables: </p>
<pre>
tiddlylisp> (define x 3)
tiddlylisp> (* x 2)
6
</pre>
<p> <tt>define</tt> is used to define <em>new</em> variables; we can also assign a value to a variable which has previously been defined: </p>
<pre>
tiddlylisp> (set! x 4)
tiddlylisp> x
4
</pre>
<p> One question you may have here is about the slightly unusual syntax: <tt>set!</tt>.  You might wonder whether the exclamation mark means something special - maybe <tt>set!</tt> is some type of hybrid procedure or something.  Actually, there's no such complexity: <tt>set!</tt> is just a single keyword, like <tt>define</tt> - it's simply that tiddlylisp allows exclamation marks in keyword names.</p>
<p>A deeper question is why we don't simply eliminate <tt>define</tt>, and make it so <tt>set!</tt> checks whether a variable has been <tt>define</tt>d already, and if not, does so.  I'll explain a little bit later why we don't do this; for now, you should just note the distinction.</p>
<p>We can use <tt>define</tt> to define new procedures in a way similar to how we use it to define variables.  Here's an example showing how to define a procedure named <tt>square</tt>.  I'll unpack what's going on below, but first, here's the code, </p>
<pre>
tiddlylisp> (define square (lambda (x) (* x x)))
tiddlylisp> (square 7)
49
</pre>
<p> Ignore the first of the lines above for now.  You can see from the second line what the procedure <tt>square</tt> does: it takes a single number as input, and returns the square of the number.</p>
<p>What about the first line of code above?  We already know enough to guess quite a bit about how this line works: a procedure named <tt>square</tt> is being defined, and is assigned the value of the expression <tt>(lambda (x) (* x x))</tt>, whatever that value might be. What's new, and what we need to understand, is what the value of the expression <tt>(lambda (x) (* x x))</tt> is.  To understand this, let's break the expression into three pieces: <tt>lambda</tt>, <tt>(x)</tt>, and <tt>(* x x)</tt>.  We'll understand the three pieces separately, and then put them back together.</p>
<p>The first piece of the expression - the <tt>lambda</tt> - simply tells the tiddlylisp interpreter that this expression is defining a procedure.  I must admit that when I first encountered the <tt>lambda</tt> notation I found this pretty confusing [2] - I thought that <tt>lambda</tt> must be a variable, or an argument, or something like that.  But no, it's just a big fat red flag to the tiddlylisp interpreter saying "Hey, this is a procedure definition".  That's all <tt>lambda</tt> is.</p>
<p>The second part of the expression - the <tt>(x)</tt> - tells tiddlylisp that this is a procedure with a single argument, and that we're going to use the temporary name <tt>x</tt> for that argument, for the purposes of defining the procedure.  If the procedure definition had started instead with <tt>(lambda (x y) ...)</tt> that would have meant that the procedure had two arguments, temporarily labelled <tt>x</tt> and <tt>y</tt> for the purposes of defining the procedure.</p>
<p>The third part of the expression - the <tt>(* x x)</tt> - is the meat of the procedure definition.  It's what we evaluate and return when the procedure is called, with the actual values for the arguments of the procedure substituted in place of <tt>x</tt>.</p>
<p>Taking it all together, then, the value of the expression <tt>(lambda (x) (* x x))</tt> is just a procedure which takes a single input and returns the square of that input.  This procedure is <em>anonymous</em>, i.e., it doesn't have a name.  But we can give it a name by using <tt>define</tt>, and so the line </p>
<pre>
tiddlylisp> (define square (lambda (x) (* x x)))
</pre>
<p> tells tiddlylisp to define something called <tt>square</tt>, whose value is a procedure (because of the <tt>lambda</tt>) with a single argument (because of the <tt>(x)</tt>), and what that procedure returns is the square of its argument (because of the <tt>(* x x)</tt>).</p>
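<p>If you know Python, the language we'll write our interpreter in, there's a close analogue: Python's own <tt>lambda</tt> also creates an anonymous function, which we can then bind to a name:</p>

```python
# The analogous definition in Python: lambda creates an anonymous
# function, which we then bind to the name square.
square = lambda x: x * x
print(square(7))  # 49
```

The parallel isn't a coincidence: Python's <tt>lambda</tt> was borrowed from Lisp.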
<p>An important point about the variables used in defining procedures is that they're dummy variables.  Suppose we wanted to define a procedure <tt>area</tt> which would return the area of a triangle, with arguments the base of the triangle and the height.  We could do this using the following procedure definition: </p>
<pre>
tiddlylisp> (define area (lambda (b h) (* 0.5 (* b h))))
tiddlylisp> (area 3 5)
7.5
</pre>
<p> But what would have happened if we'd earlier defined a variable <tt>h</tt>, e.g.: </p>
<pre>
tiddlylisp> (define h 11)
tiddlylisp> (define area (lambda (b h) (* 0.5 (* b h))))
</pre>
<p> It probably won't surprise you to learn that inside the procedure definition, i.e., immediately after the <tt>lambda (b h)</tt>, the <tt>h</tt> is treated as a dummy variable, and is entirely different from the <tt>h</tt> outside the procedure definition.  It has what's called a different <em>scope</em>.  And so we can continue the above with the following </p>
<pre>
tiddlylisp> (area 3 h)
16.5
</pre>
<p> that is, the area is just 0.5 times 3 times the value of <tt>h</tt> set earlier, outside the procedure definition.  At that point the value of <tt>h</tt> was 11, and so <tt>(area 3 h)</tt> returns 16.5.</p>
<p>There's a variation on the above that you might wonder about, which is what happens when you use variables defined <em>outside</em> a procedure <em>inside</em> that procedure definition?  For instance, suppose we have: </p>
<pre>
tiddlylisp> (define x 3)
tiddlylisp> (define foo (lambda (y) (* x y)))
</pre>
<p> What happens now if we evaluate <tt>foo</tt>?  Well, tiddlylisp does the sensible thing, and interprets <tt>x</tt> as it was defined outside the procedure definition, so we have: </p>
<pre>
tiddlylisp> (foo 4)
12
</pre>
<p> What happens to our procedure if we next change the value of <tt>x</tt>? In fact, this changes <tt>foo</tt>: </p>
<pre>
tiddlylisp> (set! x 5)
tiddlylisp> (foo 4)
20
</pre>
<p> In other words, in the procedure definition <tt>lambda (y) (* x y)</tt> the <tt>x</tt> really does refer to the variable <tt>x</tt>, and not to the particular value <tt>x</tt> might have at any given point in time.</p>
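<p>The same behaviour appears in Python, the language our interpreter will be written in: a function body refers to the <em>variable</em> <tt>x</tt>, not to the value <tt>x</tt> happened to have when the function was defined:</p>

```python
# A Python analogue of the tiddlylisp session above: foo refers to the
# variable x, so changing x later changes what foo returns.
x = 3
def foo(y):
    return x * y

print(foo(4))  # 12
x = 5
print(foo(4))  # 20
```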
<p>Let's dig down a bit more into how tiddlylisp handles scope and dummy variables.  Let's look at what happens when we define a variable: </p>
<pre>
tiddlylisp> (define x 5)
tiddlylisp> x
5
</pre>
<p> The way the interpreter handles this internally is that it maintains what's called an <em>environment</em>: a dictionary whose keys are the variable names, and whose values are the corresponding variable values.  So what the interpeter does when it sees the first line above is add a new key to the environment, "<tt>x</tt>", with value <tt>5</tt>.  When the interpreter sees the second line, it consults the environment, looks up the key <tt>x</tt>, and returns the corresponding value.  You can, if you like, think of the environment as the interpreter's memory or data store, in which it stores all the details of the variables defined to date.</p>
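<p>In Python terms, the environment really can be sketched as an ordinary dictionary (the code below is an illustration, not tiddlylisp's actual implementation):</p>

```python
# The environment as a plain dictionary from variable names to values.
env = {}

# (define x 5) adds a new key to the environment...
env["x"] = 5

# ...and evaluating the expression x just looks that key up.
print(env["x"])  # 5
```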
<p>All this is pretty simple.  We can go along defining and changing variables, and the interpreter just keeps consulting and modifying the environment as necessary.  But when you define a new procedure using <tt>lambda</tt> the interpreter treats the variables used in the definition slightly differently.  Let's look at an example: </p>
<pre>
tiddlylisp> (define h 5)
tiddlylisp> (define area (lambda (b) (* b h)))
tiddlylisp> (area 2)
10
</pre>
<p> What I want to concentrate on here is the procedure definition <tt>(lambda (b) (* b h))</tt>.  Up to this point the interpreter had been chugging along, modifying its environment as appropriate.  What happens when it sees the <tt>(lambda ...)</tt>, though, is that the <em>interpreter creates a new environment</em>, an environment called an <em>inner</em> environment, as distinguished from the <em>outer</em> environment, which is what the interpreter has been operating in until it reached the <tt>(lambda ...)</tt> statement.  The inner environment is a new dictionary, whose keys initially are just the arguments to the procedure - in this case a single key, <tt>b</tt> - and whose values will be supplied when the procedure is called.</p>
<p>To recap, what the interpreter does when it sees <tt>(lambda (b) (* b h))</tt> is create a new inner environment, with a key <tt>b</tt> whose value will be set when the procedure is called. When evaluating the expression <tt>(* b h)</tt> (which defines the result returned from the procedure) what the interpreter does is first consult the inner environment, where it finds the key <tt>b</tt>, but not the key <tt>h</tt>.  When it fails to find <tt>h</tt>, it looks instead for <tt>h</tt> in the outer environment, where the key <tt>h</tt> has indeed been defined, and retrieves the appropriate value.</p>
<p>I've described a simple example showing how environments work, but tiddlylisp also allows us to have procedure definitions nested inside procedure definitions, nested inside procedure definitions (and so on).  To deal with this, the top-level tiddlylisp interpreter operates inside a <em>global environment</em>, and each procedure definition creates a new inner environment, perhaps nested inside a previously created inner environment, if that's appropriate.</p>
<p>If all this talk about inner and outer environments has left you confused, fear not.  At this stage it really <em>is</em> important to have gotten the gist of how environments work, but you shouldn't worry if the details still seem elusive.  In my opinion, the best way to understand those details is not through abstract discussion, but instead by looking at the working code for tiddlylisp.  We'll get to that shortly, but for now we can move onwards armed with a general impression of how environments work.</p>
<p>The procedure definitions I've described so far evaluate and return the value from just a single expression.  We can use such expressions to achieve surprisingly complex things, because of the ability to nest expressions.  Still, it would be convenient to have a way of chaining together expressions that doesn't involve nesting.  A way of doing this is to use the <tt>begin</tt> keyword.  <tt>begin</tt> is especially helpful in defining complex procedures, and so I'll give an example in that context: </p>
<pre>
tiddlylisp> (define area (lambda (r) (begin (define pi 3.14) (* pi (* r r)))))
</pre>
<p> This line is defining a procedure called <tt>area</tt>, with a single argument <tt>r</tt>, and where the value of <tt>(area r)</tt> is just the value of the expression </p>
<pre>
(begin (define pi 3.14) (* pi (* r r)))
</pre>
<p> with the appropriate value for <tt>r</tt> substituted.  The way tiddlylisp evaluates the <tt>begin</tt> expression above is that it evaluates all the separate sub-expressions, consecutively in the order they appear, and then returns the value of the <em>final</em> sub-expression, in this case the sub-expression <tt>(* pi (* r r))</tt>. So, for example, we get </p>
<pre>
tiddlylisp> (area 3)
28.26
</pre>
<p> which is just the area of a circle with radius <tt>3</tt>.  </p>
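<p>In Python terms, the evaluation rule for <tt>begin</tt> can be sketched like this - evaluate each sub-expression in order, and return the value of the last.  The helper names here are hypothetical, not tiddlylisp's actual code: </p>

```python
# Sketch of the begin rule: evaluate each sub-expression consecutively,
# return the value of the final one.  eval_expr stands in for whatever
# expression evaluator the interpreter provides.
def eval_begin(subexprs, eval_expr):
    result = None
    for expr in subexprs:
        result = eval_expr(expr)
    return result

# Toy demonstration, with already-evaluated values standing in for expressions:
print(eval_begin([1, 2, 3], lambda e: e))  # -> 3, the final sub-expression's value
```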
<p>Now, in a simple example such as this you might argue that it makes more sense to avoid defining <tt>pi</tt>, and just put <tt>3.14</tt> straight into the later expression.  However, doing so will make the intent of the code less clear, and I trust you will find it easy to imagine more complex cases where it makes even more sense to use multi-part <tt>begin</tt> expressions.  This is especially the case since it is permissible to split Lisp expressions over multiple lines, so the above procedure definition could have been written: </p>
<pre>
(define area (lambda (r)
  (begin (define pi 3.14)
         (* pi (* r r)))))
</pre>
<p> Now, I should mention that if you enter the above into the tiddlylisp interpreter, you'll get errors.  This is because the tiddlylisp interpreter is so stripped down that it doesn't allow multi-line inputs.  However, tiddlylisp will allow you to load multi-line expressions from a file - try saving the above three lines in a file named <tt>area.tl</tt>, and then running that file with <tt>python tiddlylisp.py area.tl</tt>.  Tiddlylisp will execute the code, defining the <tt>area</tt> procedure, and then leave you in the interpreter, where you can type things like: </p>
<pre>
tiddlylisp> (area 3)
28.26
</pre>
<p>A second advantage of using <tt>begin</tt> in the above code is that the variable <tt>pi</tt> is only defined in the inner environment associated to the <tt>lambda</tt> expression.  If, for some reason, you want to define <tt>pi</tt> differently outside that scope, you can do so, and it won't be affected by the definition of <tt>area</tt>.  Consider, for example, the following sequence of expressions: </p>
<pre>
tiddlylisp> (define pi 3)
tiddlylisp> pi
3
tiddlylisp> (define area (lambda (r) (begin (define pi 3.14) (* pi (* r r)))))
tiddlylisp> (area 1)
3.14
tiddlylisp> pi
3
</pre>
<p> In other words, the value of the final <tt>pi</tt> is returned from the outer (global) environment, not the inner environment created during the definition of the procedure <tt>area</tt>.</p>
<p>Earlier, we discussed <tt>define</tt> and <tt>set!</tt>, and wondered whether there was really any essential difference.  Consider now the following example, where we modify <tt>area</tt> by using <tt>set!</tt> instead of <tt>define</tt>: </p>
<pre>
tiddlylisp> (define pi 3)
tiddlylisp> pi
3
tiddlylisp> (define area (lambda (r) (begin (set! pi 3.14) (* pi (* r r)))))
tiddlylisp> (area 1)
3.14
tiddlylisp> pi
3.14
</pre>
<p> As you can see by comparing the final line of this example to our earlier example, there really is a significant difference between <tt>define</tt> and <tt>set!</tt>.  In particular, when we define <tt>area</tt> in this example, what <tt>set!</tt> does is check to see whether the inner environment contains a variable named <tt>pi</tt>. Since it doesn't, it checks the outer environment, where it does find such a variable, and that's what <tt>set!</tt> updates. If we'd used <tt>define</tt> instead it would have created a completely new variable named <tt>pi</tt> in the inner environment, while leaving <tt>pi</tt> in the outer environment untouched, so the final line would have returned <tt>3</tt> instead.  So having both <tt>define</tt> and <tt>set!</tt> gives us quite a bit of flexibility and control over which environment is being used, at the expense of a complication in syntax.</p>
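<p>The distinction can be sketched in Python, modelling the inner and outer environments as plain dictionaries.  The helper names are mine, for illustration only - not tiddlylisp's actual implementation: </p>

```python
# Sketch of define versus set!, with dicts standing in for environments.
outer = {"pi": 3}   # the global environment, after (define pi 3)
inner = {}          # inner environment created for a lambda's body

def do_define(env, name, value):
    # define always creates (or updates) the binding in the *current* environment
    env[name] = value

def do_set(inner, outer, name, value):
    # set! updates the binding in whichever environment already holds the name,
    # checking the inner environment first
    if name in inner:
        inner[name] = value
    elif name in outer:
        outer[name] = value
    else:
        raise NameError(name)

do_define(inner, "pi", 3.14)  # like (define pi 3.14) inside the lambda
print(outer["pi"])            # -> 3: the outer pi is untouched

inner2 = {}                   # a fresh inner environment
do_set(inner2, outer, "pi", 3.14)  # like (set! pi 3.14) inside the lambda
print(outer["pi"])            # -> 3.14: the outer pi has been updated
```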
<p>As our final piece of Lisp before putting what we've learnt together to construct a nontrivial example, tiddlylisp includes a keyword <tt>if</tt> that can be used to test conditions, and conditionally return the value of different expressions.  Here's an example showing the use of <tt>if</tt> to evaluate the absolute value of a variable: </p>
<pre>
tiddlylisp> (define x -2)
tiddlylisp> (if (> x 0) x (- 0 x))
2
</pre>
<p> We can use the same idea to define a procedure <tt>abs</tt> which returns the absolute value of a number: </p>
<pre>
tiddlylisp> (define abs (lambda (x) (if (> x 0) x (- 0 x))))
tiddlylisp> (abs -2)
2
</pre>
<p> The general form for <tt>if</tt> expressions is <tt>(if </tt><em>cond result alt</em><tt>)</tt>, where we evaluate the expression <em>cond</em>, and if the value is <tt>True</tt>, return the value of <em>result</em>, otherwise return the value of <em>alt</em>.  Note that in the expression <tt>(if </tt><em>cond result alt</em><tt>)</tt>, the <em>cond</em>, <em>result</em> and <em>alt</em> of course shouldn't be read literally. Rather, they are placeholders for other expressions, as we saw in the absolute value example above.  Through the remainder of this essay I will use italics to indicate such placeholder expressions.</p>
<p>Let me conclude this section by briefly introducing two important Lisp concepts.  The first is the concept of a <em>special form</em>.  In this section we discussed several built-in procedures, such as <tt>+</tt> and <tt>*</tt>, and also some user-defined procedures.  Calls to such procedures always have the form <tt>(</tt><em>proc exp1 exp2...</em><tt>)</tt>, where <em>proc</em> is the procedure name, and the other arguments are expressions.  Such expressions are evaluated by evaluating each individual expression <em>exp1, exp2...</em>, and then passing those values to the procedure.  However, not all Lisp expressions are procedure calls.  Consider the expression <tt>(define pi 3.14)</tt>.  This isn't evaluated by calling some procedure <tt>define</tt> with arguments given by the value of <tt>pi</tt> and the value of <tt>3.14</tt>.  It can't be evaluated this way because <tt>pi</tt> doesn't have a value yet!  So <tt>define</tt> isn't a procedure.  Instead, <tt>define</tt> is an example of what is known as a <em>special form</em>.  Other examples of special forms include <tt>lambda</tt>, <tt>begin</tt>, and <tt>if</tt>, and a few more which we'll meet later.  Like <tt>define</tt>, none of these is a procedure; instead, each special form has its own special rule for evaluation.</p>
<p>The second concept I want to briefly introduce is that of a list.  The list is one of the basic data structures used in Lisp, and even gives the language its name - Lisp is short for "list processing".  In fact, most of the Lisp expressions we've seen up to now are lists: an expression such as <tt>(abs 2)</tt> is a two-element list, with elements <tt>abs</tt> and <tt>2</tt>, delimited by spaces and parentheses.  A more complex expression such as <tt>(define abs (lambda (x) (if (> x 0) x (- 0 x))))</tt> is also a list, in this case with the first two elements being <tt>define</tt> and <tt>abs</tt>.  The third element, <tt>(lambda (x) (if (> x 0) x (- 0 x)))</tt>, is a sublist which in turn has sublists (and subsublists) of its own.  Later we'll see how to use Lisp to do manipulations with such lists.</p>
<h3>A nontrivial example: square roots using only elementary arithmetic</h3>
<p>Let's put together the ideas above to do something nontrivial.  We'll write a short tiddlylisp program to compute square roots, using only the elementary arithmetical operations (addition, subtraction, multiplication, and division).  The idea behind the program is to use <a href="http://en.wikipedia.org/wiki/Newton's_method">Newton's method</a>. Although Newton's method is interesting in its own right, I'm not including this example because of its algorithmic elegance.  Instead, I'm including it as a simple and beautiful example of a Lisp program. The example comes from Abelson and Sussman's book on the <a href="http://www.amazon.com/Structure-Interpretation-Computer-Programs-Engineering/dp/0262011530">Structure and Interpretation of Computer Programs</a>.</p>
<p>Here's how Newton's method for computing square roots works.  Suppose we have a (positive) number <tt>x</tt> whose square root we wish to compute.  Then we start by making a <em>guess</em> at the square root, <tt>guess</tt>.  We're going to start by arbitrarily choosing <tt>1.0</tt> as our initial value for <tt>guess</tt>; in principle, any positive number will do.  According to Newton's method, we'll get an improved guess by computing <tt>(guess+x/guess)/2</tt>, i.e., by taking the average of <tt>guess</tt> and <tt>x/guess</tt>.  If we repeat this averaging process enough times, we'll converge to a good estimate of the square root.</p>
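<p>Before turning to the Lisp, here's the same iteration sketched in Python, just to pin down the arithmetic.  The <tt>0.00001</tt> tolerance anticipates the <tt>good-enough?</tt> test defined below; the function name is mine, for illustration: </p>

```python
# Newton's method for square roots: repeatedly replace guess by the
# average of guess and x/guess, until guess*guess is close enough to x.
def newton_sqrt(x, guess=1.0, tolerance=0.00001):
    while abs(x - guess * guess) >= tolerance:
        guess = (guess + x / guess) / 2.0
    return guess

print(newton_sqrt(2.0))  # close to 1.41421568627, matching the tiddlylisp run below
```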
<p>Let's express these ideas in Lisp code.  We'll do it from the top down, starting at a high level, and gradually filling in the details of all the procedures we need.  We start at the absolute top level, </p>
<pre>
(define sqrt (lambda (x) (sqrt-iter 1.0 x)))
</pre>
<p> Here, <tt>(sqrt-iter guess x)</tt> is the value of a to-be-defined procedure <tt>sqrt-iter</tt> that takes a guess <tt>guess</tt> at the square root of <tt>x</tt>, and keeps improving that guess over and over until it's a good enough estimate of the square root.  As I mentioned above, we start by arbitrarily choosing <tt>guess = 1.0</tt>.</p>
<p>The core of the algorithm is the definition of <tt>sqrt-iter</tt>: </p>
<pre>
(define sqrt-iter 
    (lambda (guess x) 
      (if (good-enough? guess x) guess (sqrt-iter (improve guess x) x))))
</pre>
<p> What <tt>sqrt-iter</tt> does is take the <tt>guess</tt> and check whether it's yet a <tt>good-enough?</tt> approximation to the square root of <tt>x</tt>, in a sense to be made precise below.  If it is, then <tt>sqrt-iter</tt> is done, and returns <tt>guess</tt>; otherwise it applies <tt>sqrt-iter</tt> to the improved guess <tt>(improve guess x)</tt> supplied by applying Newton's method.</p>
<p>Incidentally, the way I've written <tt>sqrt-iter</tt> above it's another example of a multi-line tiddlylisp expression, and so it's not possible to enter in the tiddlylisp interpreter.  However, you can enter it as part of a file, <tt>sqrt.tl</tt>, along with the definition of <tt>sqrt</tt>, and other procedures which we'll define below.  We'll later use tiddlylisp to execute the file <tt>sqrt.tl</tt>.</p>
<p>To fill out the details of the above, we need to understand how <tt>good-enough?</tt> and <tt>improve</tt> work.  Let's start with <tt>good-enough?</tt>, which simply checks whether the square of <tt>guess</tt> is within <tt>0.00001</tt> of <tt>x</tt>: </p>
<pre> 
(define good-enough? 
    (lambda (guess x) (< (abs (- x (square guess))) 0.00001)))
</pre>
<p> Here, we have defined the absolute value procedure <tt>abs</tt> and the squaring procedure <tt>square</tt> as: </p>
<pre> 
(define abs (lambda (x) (if (< 0 x) x (- 0 x))))
(define square (lambda (x) (* x x)))
</pre>
<p> That's all the code we need for <tt>good-enough?</tt>.  For <tt>improve</tt> we simply implement Newton's method, </p>
<pre>
(define improve (lambda (guess x) (average guess (/ x guess))))
</pre>
<p> where <tt>average</tt> is defined in the obvious way: </p>
<pre>
(define average (lambda (x y) (* 0.5 (+ x y))))
</pre>
<p> We can write out the whole program as follows: </p>
<pre>
(define average (lambda (x y) (* 0.5 (+ x y))))
(define improve (lambda (guess x) (average guess (/ x guess))))
(define square (lambda (x) (* x x)))
(define abs (lambda (x) (if (< 0 x) x (- 0 x))))
(define good-enough? 
    (lambda (guess x) (< (abs (- x (square guess))) 0.00001)))
(define sqrt-iter 
    (lambda (guess x) 
      (if (good-enough? guess x) guess (sqrt-iter (improve guess x) x))))
(define sqrt (lambda (x) (sqrt-iter 1.0 x)))
</pre>
<p> Save these lines to the file <tt>sqrt.tl</tt>, and then run it using <tt>python tiddlylisp.py sqrt.tl</tt>.  Tiddlylisp will execute the code, defining all the above procedures, and then leave you in an interpreter, where you can type: </p>
<pre>
tiddlylisp> (sqrt 2.0)
1.41421568627
</pre>
<p> This is, indeed, a pretty good approximation to the true square root: <tt>1.4142135...</tt>.</p>
<p>In writing out the overall program <tt>sqrt.tl</tt>, I've reversed the order of the lines, compared to my initial explanation.  The reason I've done this is worth discussing.  You see, the program <tt>sqrt.tl</tt> was actually the first piece of Lisp code I ever worked through in detail. The natural way to think through the problem of computing square roots with Newton's method is in the order I explained, working in a top-down fashion, starting from the broad problem, and gradually breaking it down into parts.  But the first time I wrote the code out I assumed that I needed to explain these ideas to Lisp in a bottom-up fashion, defining procedures like <tt>average</tt> before <tt>improve</tt> and so on, so that procedure definitions only include previously defined procedures. That meant reversing the order of the code, as I've shown above.</p>
<p>Taking this approach bugged me. It doesn't seem like the most natural way to think about implementing Newton's method.  At least for this problem a top-down approach seems more natural, and if you were just doing exploratory programming you'd start by defining <tt>sqrt</tt>, then <tt>sqrt-iter</tt>, and so on.  As an experiment, I decided to reverse the order of the program, so it reflects the natural "thinking order": </p>
<pre>
(define sqrt (lambda (x) (sqrt-iter 1.0 x)))
(define sqrt-iter 
    (lambda (guess x) 
      (if (good-enough? guess x) guess (sqrt-iter (improve guess x) x))))
(define good-enough? 
    (lambda (guess x) (< (abs (- x (square guess))) 0.00001)))
(define abs (lambda (x) (if (< 0 x) x (- 0 x))))
(define square (lambda (x) (* x x)))
(define improve (lambda (guess x) (average guess (/ x guess))))
(define average (lambda (x y) (* 0.5 (+ x y))))
</pre>
<p> Somewhat to my surprise, this ran just fine!  (Actually, my code was slightly different, since I was using Common Lisp at that point, not tiddlylisp, but tiddlylisp works as well.)  It's apparently okay to define a procedure in terms of some other procedure which isn't defined until later.  Now, upon close examination of the code for tiddlylisp it turns out that this makes perfect sense.  But it came as a surprise to me.  Later, when we examine the code for tiddlylisp, I'll set a problem asking you to explain why it's okay to re-order the code above.</p>
<p>I think this program for the square root is a very beautiful program. Of course, in most programming languages the square root is built in, and so it's not exactly notable to have an elegant program for it! Indeed, the square root is built in to most versions of Lisp, and it's trivially possible to add the square root to tiddlylisp.  But what I like about the above is that it's such a simple and natural expression of Newton's method.  As <a href="http://en.wikipedia.org/wiki/L._Peter_Deutsch">Peter Deutsch</a> has <a href="http://en.wikiquote.org/wiki/Lisp_programming_language">said</a>, Lisp programs come close to being "executable mathematics".</p>
<h3>Problems for the author</h3>
<ul>
<li> As a general point about programming language design, it seems like it would often be helpful to be able to define procedures in terms of other procedures which have not yet been defined.  Which languages make this possible, and which do not?  What advantages does it bring for a programming language to be able to do this?  Are there any disadvantages? </li>
</ul>
<h3>A few more elements of Lisp</h3>
<p>I began my introduction to Lisp by focusing on elementary arithmetical operations, and how to put them together.  I did that so I could give you a concrete example of Lisp in action - the square root example in the last section - which is a good way of getting a feel for how Lisp works.  But if we're going to understand Lisp as "the Maxwell's equations of software" then we also need to understand more about how Lisp deals with expressions and with lists. I'll describe those operations in this section.</p>
<p>Before I describe those operations, I want to discuss a distinction that I haven't drawn much attention to until now.  That's the distinction between an <em>expression</em> such as <tt>(+ 1 2)</tt> and the <em>value</em> of that expression, which in this case is <tt>3</tt>. When we feed a (valid) tiddlylisp expression to the tiddlylisp interpreter, it evaluates the expression, and returns its value.  </p>
<p>When I was first learning Lisp I'd often get myself in trouble because I failed to keep myself clear on the distinction between an expression and its value.  Let me give you an example of the kind of confusion I'd get myself into.  There is a built-in Lisp procedure called <tt>atom?</tt>, defined so that <tt>(atom? </tt><em>exp</em><tt>)</tt> returns <tt>True</tt> if the value of the expression <em>exp</em> is atomic - a number, for example, or anything which isn't a list - and returns <tt>False</tt> otherwise.  (Note that <em>exp</em> is a placeholder expression, as we discussed earlier, and shouldn't be read literally as the identifier <tt>exp</tt>.)  Now, I'd look at an example like <tt>(atom? 5)</tt> and have no trouble figuring out that it evaluated to <tt>True</tt>.  But where I'd get into trouble is with an expression like <tt>(atom? (+ 1 2))</tt>.  I'd look at it and think it must return <tt>False</tt>, because the expression <tt>(+ 1 2)</tt> is not atomic, it's a list.  Unfortunately, while it's true that the expression <tt>(+ 1 2)</tt> is not atomic, that's irrelevant, because the value of <tt>(atom? </tt><em>exp</em><tt>)</tt> is determined by whether the <em>value</em> of <em>exp</em> is atomic - which it is, since the value of <tt>(+ 1 2)</tt> is <tt>3</tt>.  You'll be a much happier Lisper if you always keep very clear on the distinction between expressions and the value of expressions.</p>
<p>Now, of course, this distinction between expressions and their values appears in most other programming languages.  For this reason, I felt foolish when I understood what was causing my confusion.  But what makes it so easy to make this kind of mistake is that in Lisp both expressions and the values of those expressions can be lists.  And that's why when I write something like <tt>(atom? (+ 1 2))</tt> there's a real question: is <tt>atom?</tt> checking whether the expression <tt>(+ 1 2)</tt> is atomic (no, it's not, it's a list), or whether the value of the expression - <tt>3</tt> - is atomic (yes, it is)?  So you need to be pretty careful to keep clear on which is meant.  To help with this, I'll be pretty pedantic about the distinction in what follows, writing out "the value of <em>exp</em>" explicitly, to distinguish the value from the expression itself.  Note, however, that it's quite common for people to be less pedantic and more informal, so that something like <tt>(+ x y)</tt> may well be described as "adding the variable <tt>x</tt> to the variable <tt>y</tt>", not the more explicit "adding the value of the variable <tt>x</tt> to the value of the variable <tt>y</tt>."</p>
<p>Alright, with that admonishment out of the way, let's turn our attention to defining the final set of elementary operations we'll need in tiddlylisp.</p>
<p><strong>Returning an expression as a value:</strong> <tt>(quote </tt><em>exp</em><tt>)</tt> returns the expression <em>exp</em> as its value, <em>without</em> evaluating <em>exp</em>.  This is most clearly demonstrated by an example, </p>
<pre>
tiddlylisp> (quote (+ 1 2))
(+ 1 2)
</pre>
<p> as opposed to the value of <tt>(+ 1 2)</tt>, which of course is <tt>3</tt>.  Another example, just to reinforce the point that <tt>quote</tt> returns an expression literally, without evaluating it, </p>
<pre>
tiddlylisp> (define x 3)
tiddlylisp> (quote x)
x
</pre>
<p> not the value of <tt>x</tt>, which of course is <tt>3</tt>.  Note also that quoting doesn't evaluate nested subexpressions, e.g., </p>
<pre>
tiddlylisp> (quote ((+ 1 2) 3))
((+ 1 2) 3)
</pre>
<p> <tt>quote</tt> can also be used to return lists which aren't valid Lisp expressions at all, e.g., </p>
<pre>
tiddlylisp> (quote (1 2 3))
(1 2 3)
</pre>
<p> It's worth emphasizing that <tt>(1 2 3)</tt> isn't a valid Lisp expression, since <tt>1</tt> is neither a procedure nor a special form - if you enter <tt>(1 2 3)</tt> into the tiddlylisp interpreter, it will give an error. But what <tt>quote</tt> lets us do is use the list <tt>(1 2 3)</tt> as data.  For example, we could use it to store the list <tt>(1 2 3)</tt> in a variable: </p>
<pre>
tiddlylisp> (define x (quote (1 2 3)))
tiddlylisp> x
(1 2 3)
</pre>
<p> In this way, <tt>quote</tt> lets us work with lists as data structures.</p>
<p>Why does Lisp have <tt>quote</tt>?  Most other computer programming languages don't have anything like it.  The reason Lisp has it is because Lisp allows you to treat code as data, and data as code.  So, for example, an object such as <tt>(+ 1 2)</tt> can be treated as code, i.e., as a Lisp expression to be evaluated, which is what we've been doing up until the current discussion.  But <tt>(+ 1 2)</tt> can also potentially be treated as data, i.e., as a list of three objects. This ability to treat code and data on the same footing is a wonderful thing, because it means you can write programs to manipulate programs. But it also creates a problem, which is that we need to be able to distinguish when an expression should be treated as data, and when it should be treated as code.  <tt>quote</tt> is a way of distinguishing between the two.  It's similar to the way most languages use escape characters to deal with special characters inside strings, such as <tt>"</tt>.  <tt>quote</tt> is a way of saying "Hey, what follows should be treated as data, not evaluated as Lisp code".  And so <tt>quote</tt> lets us escape Lisp code, so we can take an expression such as <tt>(+ 1 2)</tt> and turn it into data: <tt>(quote (+ 1 2))</tt>.  In this way, <tt>quote</tt> allows us to define Lisp expressions whose values are arbitrary Lisp expressions.</p>
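<p>The code-as-data idea is easy to see from the Python side of the interpreter: once parsed, a Lisp expression is just a nested Python list, and <tt>quote</tt> simply hands that list back unevaluated.  Here's a toy sketch - not tiddlylisp's actual evaluator - showing the distinction: </p>

```python
# After parsing, the Lisp expression (+ 1 2) is just the Python list ['+', 1, 2].
# Evaluating it applies the procedure; quoting it returns the list as data.
parsed = ['+', 1, 2]

def eval_sketch(expr):
    # Toy evaluator handling only + and quote, to illustrate the distinction.
    if isinstance(expr, list) and expr[0] == 'quote':
        return expr[1]            # return the expression itself, unevaluated
    if isinstance(expr, list) and expr[0] == '+':
        return sum(eval_sketch(e) for e in expr[1:])
    return expr                   # numbers evaluate to themselves

print(eval_sketch(parsed))             # -> 3: treated as code
print(eval_sketch(['quote', parsed]))  # -> ['+', 1, 2]: treated as data
```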
<p>Up until now we've been focused on using Lisp to do simple arithmetic operations, and as a result we haven't needed <tt>quote</tt>.  But when we get to Lisp-as-Maxwell's-equations we're going to be increasingly focused on using Lisp to manipulate Lisp code, and as a result we'll be making frequent use of <tt>quote</tt>.  For this reason it's helpful to introduce shorthand for <tt>quote</tt>.  In most Lisps, the conventional shorthand is to use <tt>'</tt><em>exp</em> to denote <tt>(quote </tt><em>exp</em><tt>)</tt>, i.e., if <em>exp</em> is some Lisp expression, then the value of <tt>'</tt><em>exp</em> is just the expression <em>exp</em> itself.  Unfortunately, supporting the shorthand <tt>'</tt><em>exp</em> would have complicated the parsing in the tiddlylisp interpreter more than I wanted.  So I decided to instead use a different (and, I emphasize, unconventional) shorthand for <tt>(quote </tt><em>exp</em><tt>)</tt>, namely <tt>(q </tt><em>exp</em><tt>)</tt>. So our examples of <tt>quote</tt> above can be shortened to: </p>
<pre>
tiddlylisp> (q (+ 1 2))
(+ 1 2)
tiddlylisp> (define x 3)
tiddlylisp> (q x)
x
tiddlylisp> (q ((+ 1 2) 3))
((+ 1 2) 3)
</pre>
<p><strong>Testing whether the value of an expression is atomic:</strong> As I noted above, <tt>(atom? </tt><em>exp</em><tt>)</tt> returns <tt>True</tt> if the value of the expression <em>exp</em> is atomic, and otherwise returns <tt>False</tt>.  We've already discussed the following example, </p>
<pre>
tiddlylisp> (atom? (+ 1 2))
True
</pre>
<p> but it's illuminating to see it in tandem with the following example, which illustrates also the use of <tt>quote</tt>, </p>
<pre>
tiddlylisp> (atom? (q (+ 1 2)))
False
</pre>
<p> As above, the first example returns <tt>True</tt> because while <tt>(+ 1 2)</tt> is not atomic, its value, <tt>3</tt>, is.  But the second example returns <tt>False</tt> because the value of <tt>(q (+ 1 2))</tt> is just <tt>(+ 1 2)</tt>, which is a list, and thus not atomic.</p>
<p>By the way, I should mention that <tt>atom?</tt> is not a built-in procedure in the dialect of Lisp that tiddlylisp is based on, Scheme. I've built <tt>atom?</tt> into tiddlylisp because, as we'll see, the analogous operation is used many times in the code on page 13 of the LISP 1.5 Programmer's Manual.  Of course, an <tt>atom?</tt> procedure is easily defined in Scheme, but for the purposes of understanding the code on page 13 it seemed most direct to simply include <tt>atom?</tt> as a built-in in tiddlylisp.</p>
<p><strong>Testing whether two expressions both evaluate to the same atom (or the empty list):</strong> <tt>(eq? </tt><em>exp1 exp2</em><tt>)</tt> returns <tt>True</tt> if the values of <em>exp1</em> and <em>exp2</em> are both the same atom, or both the empty list, <tt>()</tt>.  <tt>(eq? </tt><em>exp1 exp2</em><tt>)</tt> returns <tt>False</tt> otherwise.  Note that if <em>exp1</em> and <em>exp2</em> have the same value, but are not atomic or the empty list, then <tt>(eq? </tt><em>exp1 exp2</em><tt>)</tt> returns <tt>False</tt>.  For example, </p>
<pre>
tiddlylisp> (eq? 2 (+ 1 1))
True
tiddlylisp> (eq? 3 (+ 1 1))
False
tiddlylisp> (eq? (q (1 2)) (q (1 2)))
False
</pre>
<p> As with <tt>atom?</tt>, my explanation of <tt>eq?</tt> is not quite the same as in standard Scheme; instead, it more closely matches the function <tt>eq</tt> defined in the LISP 1.5 Programmer's Manual.</p>
<p><strong>Getting the first item of a list:</strong> <tt>(car </tt><em>exp</em><tt>)</tt> returns the first element of the value of <em>exp</em>, provided the value of <em>exp</em> is a list.  Otherwise <tt>(car </tt><em>exp</em><tt>)</tt> is undefined.  For example, </p>
<pre>
tiddlylisp> (car (q (+ 2 3)))
+
tiddlylisp> (car (+ 2 3))
<Error message, which I've elided>
</pre>
<p> The first of these two behaves as expected: the value of <tt>(q (+ 2 3))</tt> is just the list <tt>(+ 2 3)</tt>, and so <tt>car</tt> returns the first element, which is <tt>+</tt>.  In the second, though, <tt>car</tt> is not defined, and tiddlylisp returns an error message, which I've elided.  The reason it returns an error message is that the value of <tt>(+ 2 3)</tt> is <tt>5</tt>, which is not a list, and so <tt>car</tt> is undefined.  In a similar way, suppose we try </p>
<pre>
tiddlylisp> (car (1 2 3))
<Error message>
</pre>
<p> Again, we get an error message.  The reason is that <tt>car</tt> is being applied to the <em>value</em> of <tt>(1 2 3)</tt>, considered as a Lisp expression.  And, of course, that value is undefined, since <tt>1</tt> is neither a procedure nor a special form.  The right way to do the above is to <tt>quote</tt> the list first, </p>
<pre>
tiddlylisp> (car (q (1 2 3)))
1
</pre>
<p> Once again, we see how <tt>quote</tt> is used to make it clear to tiddlylisp that we're dealing with data, not code to be evaluated.</p>
<p><strong>Getting the rest of a list:</strong> <tt>(cdr </tt><em>exp</em><tt>)</tt> returns a list containing all but the first element of the value of <em>exp</em>.  Of course, the value of <em>exp</em> must be a list, otherwise <tt>(cdr </tt><em>exp</em><tt>)</tt> is undefined.  For example, </p>
<pre>
tiddlylisp> (cdr (q (1 2 3)))
(2 3)
</pre>
<p> According to <a href="http://en.wikipedia.org/wiki/CAR_and_CDR#Etymology">Wikipedia</a> the names <tt>car</tt> and <tt>cdr</tt> have their origin in some pretty esoteric facts about the early implementations of Lisp.  The details don't much matter here - <tt>car</tt> stands for "contents of address part of register", while <tt>cdr</tt> stands for "contents of decrement part of register" - I just wanted to make the point that you could reasonably be wondering "Where on Earth did those names come from?!"  A mnemonic I find useful in distinguishing the two is to focus on the difference between the names of the two procedures, which of course is just the middle letter - <tt>a</tt> or <tt>d</tt> - and to keep in mind that <tt>a</tt> comes <em>before</em> <tt>d</tt> in the alphabet, just as <tt>car</tt> extracts the element of a list that comes <em>before</em> the remainder of the list, as given to us by <tt>cdr</tt>. Your taste in mnemonics may vary - if you don't like mine, it's still worth taking a minute or two to come up with some trick for remembering and distinguishing <tt>car</tt> and <tt>cdr</tt>.  Of course, after a little practice you get used to them, and you won't need the mnemonic any more, but at first it's helpful.</p>
<p><strong>Appending an item at the start of a list:</strong> Provided the value of <em>exp2</em> is a list, then <tt>(cons </tt><em>exp1 exp2</em><tt>)</tt> returns a list containing the value of <em>exp1</em> as its first element, followed by all the elements of the value of <em>exp2</em>. For example, </p>
<pre>
tiddlylisp> (cons 1 (q (2 3)))
(1 2 3)
</pre>
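<p>Since tiddlylisp represents Lisp lists as Python lists, <tt>car</tt>, <tt>cdr</tt> and <tt>cons</tt> have very direct Python counterparts.  The following is a sketch of the idea, not tiddlylisp's actual definitions: </p>

```python
# Python counterparts of car, cdr and cons, acting on Python lists.
def car(lst):
    return lst[0]      # the first element of the list

def cdr(lst):
    return lst[1:]     # everything after the first element

def cons(x, lst):
    return [x] + lst   # a new list with x prepended

print(car([1, 2, 3]))   # -> 1
print(cdr([1, 2, 3]))   # -> [2, 3]
print(cons(1, [2, 3]))  # -> [1, 2, 3]
```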
<p><strong>Testing whether an expression evaluates to the empty list:</strong> <tt>(null? </tt><em>exp</em><tt>)</tt> returns <tt>True</tt> if <em>exp</em> evaluates to the empty list, <tt>()</tt>, and otherwise returns <tt>False</tt>.  For example, </p>
<pre>
tiddlylisp> (null? (cdr (q (1))))
True
</pre>
<p> since <tt>(cdr (q (1)))</tt> returns the empty list.</p>
<p><strong>Conditionals:</strong> <tt>(cond (</tt><em>p1 e1</em><tt>)...(</tt><em>pn en</em><tt>))</tt>: This starts by evaluating the expression <em>p1</em>. If <em>p1</em> evaluates to <tt>True</tt>, then evaluate the expression <em>e1</em>, and return that value.  If not, evaluate <em>p2</em>, and if it is <tt>True</tt>, return the value of <em>e2</em>, and so on.  If none of the <em>pj</em> evaluates to <tt>True</tt>, then the <tt>(cond ...)</tt> expression is undefined.</p>
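<p>The <tt>cond</tt> rule can be sketched in Python in the same spirit as before (illustrative helper names, not tiddlylisp's actual code): </p>

```python
# Sketch of cond: try each (predicate, expression) clause in order,
# and evaluate the expression of the first predicate that comes out true.
def eval_cond(clauses, eval_expr):
    for predicate, expression in clauses:
        if eval_expr(predicate):
            return eval_expr(expression)
    raise ValueError("cond: no predicate evaluated to True")

# Toy demonstration, with values standing in for expressions:
print(eval_cond([(False, 'a'), (True, 'b')], lambda e: e))  # -> b
```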
<p>That's all the Lisp we're going to need to write our version of Lisp-as-Maxwell's-equations, i.e., our version of the code on page 13 of the LISP Manual!  In fact, as we'll see shortly, it's more than we need - I included a few extra features so that we could work through examples like <tt>sqrt</tt>, and also simply for fun, to make tiddlylisp a richer and more interesting language to play with.  Of course, what I've described is merely a subset (with some variations) of our chosen dialect of Lisp (Scheme), and there are important missing ideas.  To learn more about Lisp or Scheme, please consult the suggestions for further reading at the end of this essay.</p>
<h3>An interpreter for Lisp</h3>
<p>Now that we've worked through the basic elements of Lisp, let's write a simple Lisp interpreter, using Python.  The interpreter we'll write is based on <a href="http://norvig.com/">Peter Norvig</a>'s lispy interpreter, and I highly recommend you read through <a href="http://norvig.com/lispy.html">his explanation</a>.  I've given the program we'll discuss a separate name - tiddlylisp - so as to make it easy to refer to separately from Norvig's interpreter, but please keep in mind that most of the code we're discussing is Norvig's. However, we're going to examine the code from a different angle than Norvig: we're going to take a bit more of a bottom-up computer's-eye view, looking at how the code executes.</p>
<p>We'll look at the code piece by piece, before putting it all together. Let's start with the interactive interpreter, which is run when you start up.  We implement this with a Python procedure called <tt>repl</tt>, meaning to <em>r</em>ead some input, <em>e</em>valuate the expression, <em>p</em>rint the result of the evaluation, and then <em>l</em>oop back to the beginning.  This is also known as the read-eval-print loop, or REPL.  Here's the <tt>repl</tt> procedure, together with a procedure for handling errors when they occur: </p>
<pre>
import traceback

def repl(prompt='tiddlylisp> '):
    "A prompt-read-eval-print loop."
    while True:
        try:
            val = eval(parse(raw_input(prompt)))
            if val is not None: print to_string(val)
        except KeyboardInterrupt:
            print "\nExiting tiddlylisp\n"
            sys.exit()
        except:
            handle_error()

def handle_error():
    """
    Simple error handling for both the repl and load.
    """
    print "An error occurred.  Here's the Python stack trace:\n"
    traceback.print_exc()
</pre>
<p> The core of <tt>repl</tt> is in the <tt>try</tt> clause, and we'll get to how that works shortly.  Before we look at that, note that if the user presses <tt>Ctrl-C</tt>, Python raises the <tt>KeyboardInterrupt</tt> exception, which causes the program to exit.  If an error occurs during the <tt>try</tt> block - say, due to a syntax error in the Lisp expression being parsed, or due to a bug in tiddlylisp itself - then some other exception will be raised.  Tiddlylisp doesn't deal very well with errors - it simply announces that an error has occurred, and prints the Python stack trace, which may give you a few hints about what's gone wrong, but which is obviously a long way short of truly informative error handling!  After printing the stack trace, tiddlylisp simply returns to the prompt.  This type of error handling could easily be improved, but we're not going to invest any effort in this direction.</p>
<p>Let's look at the <tt>try</tt> clause.  It begins by taking input at the <tt>prompt</tt>, and then passing it to the function <tt>parse</tt>, which converts the string entered by the user into an <em>internal   representation</em>, i.e., a data structure that's more convenient for our Python program to work with than a string.  Here's an example which shows how it works: </p>
<pre>
parse("(* (+ 7 12) (- 8 6))") = ["*", ["+", 7, 12], ["-", 8, 6]]
</pre>
<p> In other words, nested Lisp lists become Python sublists, and things like procedures and numbers become elements in a Python list.</p>
<p>We'll look shortly at how <tt>parse</tt> is implemented.  For now, let's finish understanding how <tt>repl</tt> works.  The output of <tt>parse</tt> is passed to the function <tt>eval</tt>, which evaluates the internal representation of the expression entered by the user. Provided no error occurs, the result is returned in <tt>val</tt>. However, <tt>val</tt> is in the format of the internal representation, and so we need to convert it from that internal representation back into a printable Lisp expression, using <tt>to_string</tt>.</p>
<p>At this point, we've got three functions to understand the details of: <tt>parse</tt>, <tt>eval</tt> and <tt>to_string</tt>.  I'll explain them out of order, starting with <tt>parse</tt> and <tt>to_string</tt>, since they're extremely similar.  Then we'll get to <tt>eval</tt>.</p>
<p>Alright, let's understand how <tt>parse</tt> works.  Without further ado, here's the code; for the explanation, see below: </p>
<pre>
Symbol = str

def parse(s):
    "Parse a Lisp expression from a string."
    return read_from(tokenize(s))

def tokenize(s):
    "Convert a string into a list of tokens."
    return s.replace('(',' ( ').replace(')',' ) ').split()

def read_from(tokens):
    "Read an expression from a sequence of tokens."
    if len(tokens) == 0:
        raise SyntaxError('unexpected EOF while reading')
    token = tokens.pop(0)
    if '(' == token:
        L = []
        while tokens[0] != ')':
            L.append(read_from(tokens))
        tokens.pop(0) # pop off ')'
        return L
    elif ')' == token:
        raise SyntaxError('unexpected )')
    else:
        return atom(token)

def atom(token):
    "Numbers become numbers; every other token is a symbol."
    try: return int(token)
    except ValueError:
        try: return float(token)
        except ValueError:
            return Symbol(token)
</pre>
<p> The parsing is easy to understand.  <tt>tokenize</tt> first inserts spaces on either side of any parentheses, and then splits the string around spaces, returning a list of tokens, i.e., the non-space substrings.  <tt>read_from</tt> takes that list and removes the parentheses, instead nesting sublists as indicated by the original parenthesis structure.  And, finally, <tt>atom</tt> turns tokens into Python <tt>int</tt>s, <tt>float</tt>s or <tt>Symbol</tt>s (strings, by definition), as appropriate.</p>
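<p>To make those three stages concrete, here's the pipeline applied step by step.  The definitions are repeated from above, so the snippet is self-contained: </p>

```python
Symbol = str

def tokenize(s):
    "Convert a string into a list of tokens."
    return s.replace('(', ' ( ').replace(')', ' ) ').split()

def read_from(tokens):
    "Read an expression from a sequence of tokens."
    if len(tokens) == 0:
        raise SyntaxError('unexpected EOF while reading')
    token = tokens.pop(0)
    if '(' == token:
        L = []
        while tokens[0] != ')':
            L.append(read_from(tokens))
        tokens.pop(0)  # pop off ')'
        return L
    elif ')' == token:
        raise SyntaxError('unexpected )')
    else:
        return atom(token)

def atom(token):
    "Numbers become numbers; every other token is a symbol."
    try: return int(token)
    except ValueError:
        try: return float(token)
        except ValueError:
            return Symbol(token)

tokens = tokenize("(* (+ 7 12) (- 8 6))")
# tokens is ['(', '*', '(', '+', '7', '12', ')', '(', '-', '8', '6', ')', ')']

tree = read_from(tokens)
# tree is ['*', ['+', 7, 12], ['-', 8, 6]]
```

<p>Note that <tt>read_from</tt> consumes the token list as it goes, popping tokens off the front, and calling itself recursively each time it meets an opening parenthesis.</p>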
<p>That's all there is to <tt>parse</tt>.  If tiddlylisp were a little more powerful then <tt>parse</tt> would need to be more complex.  For example, if we allowed strings as first-class objects in the language, then it would not work to tokenize by splitting around spaces, since that would risk splitting a string into separate tokens.  This is the kind of thing that'd be fun to include in an extended version of tiddlylisp (and I've included it as a problem later in this section), but which we don't need in a first version.</p>
<p>Let's look now at how <tt>to_string</tt> works.  It's much simpler, quickly undoing the steps taken in parsing: </p>
<pre>
isa = isinstance

def to_string(exp):
    "Convert a Python object back into a Lisp-readable string."
    if not isa(exp, list):
        return str(exp)
    else:
        return '('+' '.join(map(to_string, exp))+')'
</pre>
<p> In other words, if the internal representation of the expression is not a list, then return an appropriate stringified version (this takes care of the fact that we may have <tt>int</tt>s, <tt>float</tt>s and, as we shall see, <tt>Boolean</tt>s in the internal representation).  If it is a list, then return whatever we get by applying <tt>to_string</tt> to all the elements of that list, with appropriate delimiting by whitespace and parentheses.</p>
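<p>For instance, applying <tt>to_string</tt> (repeated here so the snippet runs on its own) to the internal representation we obtained from <tt>parse</tt> earlier reconstructs the original expression: </p>

```python
isa = isinstance

def to_string(exp):
    "Convert a Python object back into a Lisp-readable string."
    if not isa(exp, list):
        return str(exp)
    else:
        return '(' + ' '.join(map(to_string, exp)) + ')'

s = to_string(['*', ['+', 7, 12], ['-', 8, 6]])
# s is "(* (+ 7 12) (- 8 6))"
```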
<p>At this point, the main thing we need to complete tiddlylisp is the <tt>eval</tt> function.  Actually, that's not quite true: as we discussed earlier, tiddlylisp also keeps track of a global environment (and possibly one or more inner environments), to store variable and procedure names, and their values.  <tt>eval</tt> is going to make heavy use of the environments, and so it helps to look first at how environments are defined.  They're pretty simple: an environment has a bunch of keys, representing the names of the variables and procedures in that environment, and corresponding values for those keys, which are just the values for the variables or procedures.  An environment also keeps track of its outer environment, with the caveat that the global environment has an outer environment set to Python's <tt>None</tt>.  Such environments are easily implemented as a subclass of Python dictionaries: </p>
<pre>
class Env(dict):
    "An environment: a dict of {'var':val} pairs, with an outer Env."

    def __init__(self, params=(), args=(), outer=None):
        self.update(zip(params, args))
        self.outer = outer

    def find(self, var):
        "Find the innermost Env where var appears."
        return self if var in self else self.outer.find(var)
</pre>
<p> As you can see, the only modifications of the dictionary class are that: (1) an environment also keeps track of its own outer environment; and (2) an environment can determine whether a variable or procedure name appears in its list of keys, and if it doesn't, then it looks to see if it's in its outer environment.  As a result, the <tt>find</tt> method returns the innermost environment where the variable or procedure name appears.  </p>
<p>Note, incidentally, that the environment doesn't distinguish between variable and procedure names.  Indeed, as we'll see, tiddlylisp treats user-defined procedures and variables in the same way: a procedure is a variable which just happens to take the value of a <tt>lambda</tt> expression as its value.</p>
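<p>Here's a small self-contained demonstration of the lookup chain.  The <tt>Env</tt> class is exactly the one above; the variable names and values are just for illustration: </p>

```python
class Env(dict):
    "An environment: a dict of {'var':val} pairs, with an outer Env."

    def __init__(self, params=(), args=(), outer=None):
        self.update(zip(params, args))
        self.outer = outer

    def find(self, var):
        "Find the innermost Env where var appears."
        return self if var in self else self.outer.find(var)

global_env = Env()
global_env['x'] = 10

# An inner environment, of the kind created when a procedure is applied:
inner = Env(params=('x', 'y'), args=(99, 20), outer=global_env)

inner.find('x')['x']       # 99: the innermost binding shadows the global one
inner.find('y')['y']       # 20: found directly in the inner environment
global_env.find('x')['x']  # 10: the global binding is untouched
```

<p>One rough edge worth noticing: looking up a name bound in no environment at all falls off the end of the chain, since the global environment's <tt>outer</tt> is <tt>None</tt>, and so produces a Python error rather than a graceful Lisp-level one.</p>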
<p>Tiddlylisp starts off operating in a particular global environment, and this too must be defined by our program.  We'll do this by creating an instance of the class <tt>Env</tt>, and calling a function to add some built-in procedure definitions and variables: </p>
<pre>
def add_globals(env):
    "Add some built-in procedures and variables to the environment."
    import operator
    env.update(
        {'+': operator.add,
         '-': operator.sub, 
         '*': operator.mul, 
         '/': operator.div, 
         '>': operator.gt, 
         '<': operator.lt, 
         '>=': operator.ge, 
         '<=': operator.le, 
         '=': operator.eq
         })
    env.update({'True': True, 'False': False})
    return env

global_env = add_globals(Env())
</pre>
<p> Incidentally, in tiddlylisp's version of <tt>add_globals</tt> I decided to strip out many of the built-in procedures which Norvig includes in lispy's global environment - it's instructive to look at <a href="http://norvig.com/lis.py">Norvig's code</a> for <tt>add_globals</tt> to see just how easy it is to add more built-in procedures to tiddlylisp.  If you want to do some exploratory programming with tiddlylisp then you should probably copy some of Norvig's additional built-in procedures (and perhaps add some of your own).  For us, though, the above procedures are enough.</p>
<p>One notable feature of the global environment is the variables named <tt>True</tt> and <tt>False</tt>, which evaluate to Python's Boolean <tt>True</tt> and <tt>False</tt>, respectively.  This isn't standard in Scheme (or most other Lisps), but I've done it because it ensures that we can use the strings <tt>True</tt> and <tt>False</tt>, and get the appropriate internal representation.</p>
<p>With the global environment set up, we can now understand how <tt>eval</tt> works.  The code is straightforward, simply enumerating the different types of expressions we might be evaluating, and reading from or modifying the environment, as appropriate.  It's worth reading (and rereading) the code in detail, until you understand exactly how <tt>eval</tt> works.  I also have a few comments at the end.  Here's the code: </p>
<pre>
isa = isinstance

def eval(x, env=global_env):
    "Evaluate an expression in an environment."
    if isa(x, Symbol):              # variable reference
        return env.find(x)[x]
    elif not isa(x, list):          # constant literal
        return x                
    elif x[0] == 'quote' or x[0] == 'q': # (quote exp), or (q exp)
        (_, exp) = x
        return exp
    elif x[0] == 'atom?':           # (atom? exp)
        (_, exp) = x
        return not isa(eval(exp, env), list)
    elif x[0] == 'eq?':             # (eq? exp1 exp2)
        (_, exp1, exp2) = x
        v1, v2 = eval(exp1, env), eval(exp2, env)
        return (not isa(v1, list)) and (v1 == v2)
    elif x[0] == 'car':             # (car exp)
        (_, exp) = x
        return eval(exp, env)[0]
    elif x[0] == 'cdr':             # (cdr exp)
        (_, exp) = x
        return eval(exp, env)[1:]
    elif x[0] == 'cons':            # (cons exp1 exp2)
        (_, exp1, exp2) = x
        return [eval(exp1, env)]+eval(exp2,env)
    elif x[0] == 'cond':            # (cond (p1 e1) ... (pn en))
        for (p, e) in x[1:]:
            if eval(p, env): 
                return eval(e, env)
    elif x[0] == 'null?':           # (null? exp)
        (_, exp) = x
        return eval(exp,env) == []
    elif x[0] == 'if':              # (if test conseq alt)
        (_, test, conseq, alt) = x
        return eval((conseq if eval(test, env) else alt), env)
    elif x[0] == 'set!':            # (set! var exp)
        (_, var, exp) = x
        env.find(var)[var] = eval(exp, env)
    elif x[0] == 'define':          # (define var exp)
        (_, var, exp) = x
        env[var] = eval(exp, env)
    elif x[0] == 'lambda':          # (lambda (var*) exp)
        (_, vars, exp) = x
        return lambda *args: eval(exp, Env(vars, args, env))
    elif x[0] == 'begin':           # (begin exp*)
        for exp in x[1:]:
            val = eval(exp, env)
        return val
    else:                           # (proc exp*)
        exps = [eval(exp, env) for exp in x]
        proc = exps.pop(0)
        return proc(*exps)
</pre>
<p> Mostly this is self-explanatory.  But allow me to draw your attention to how Norvig deals with anonymous procedure definitions using <tt>lambda</tt>.  When I first examined his code I wondered how he'd cope with this, and expected it would be quite complex.  But as you can see, it is extremely simple: <tt>lambda</tt> expressions evaluate to the appropriate anonymous Python function, with a new environment modified by the addition of the appropriate variable keys, and their values.  Beautiful!</p>
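<p>To see the trick in isolation, here's a stripped-down sketch.  It contains only the pieces of <tt>eval</tt> that this example needs (symbol lookup, constants, and procedure application), and the names <tt>eval_sketch</tt>, <tt>body</tt> and <tt>outer_env</tt> are mine, for illustration only.  The final line builds the same kind of closure that the <tt>lambda</tt> clause of <tt>eval</tt> returns: </p>

```python
import operator

class Env(dict):
    "An environment: a dict of {'var':val} pairs, with an outer Env."

    def __init__(self, params=(), args=(), outer=None):
        self.update(zip(params, args))
        self.outer = outer

    def find(self, var):
        "Find the innermost Env where var appears."
        return self if var in self else self.outer.find(var)

def eval_sketch(x, env):
    # Just enough of eval for this example: symbols, constants,
    # and procedure application.
    if isinstance(x, str):                 # variable reference
        return env.find(x)[x]
    elif not isinstance(x, list):          # constant literal
        return x
    else:                                  # (proc exp*)
        exps = [eval_sketch(exp, env) for exp in x]
        proc = exps.pop(0)
        return proc(*exps)

outer_env = Env()
outer_env['*'] = operator.mul

# Internal representation of the body of (lambda (n) (* n n)):
body = ['*', 'n', 'n']

# The closure the lambda clause returns: the body is evaluated later,
# in a fresh Env binding the parameter names to the argument values.
square = lambda *args: eval_sketch(body, Env(('n',), args, outer_env))

square(7)   # 49
```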
<p>Tiddlylisp is essentially complete at this point.  It's convenient to finish off the program by providing two ways of running tiddlylisp: either in an interactive interpreter mode, i.e., the REPL, or by loading a tiddlylisp program stored in a separate file.  To start the REPL, we'll simply run <tt>python tiddlylisp.py</tt>.  To load and execute a file, we'll run <tt>python tiddlylisp.py </tt><em>filename</em>. After execution, we'd like to be dropped into the REPL so we can inspect results and do further experiments.  The main complication in doing this is the need to load tiddlylisp code which is split over multiple lines.  We do this by merging lines until the number of opening and closing parentheses match.  Here's the code - it's best to start at the bottom, with the code immediately after <tt>if __name__ == "__main__"</tt>: </p>
<pre>
import sys

def load(filename):
    """
    Load the tiddlylisp program in filename, execute it, and start the
    repl.  If an error occurs, execution stops, and we are left in the
    repl.  Note that load copes with multi-line tiddlylisp code by
    merging lines until the number of opening and closing parentheses
    match.
    """
    print "Loading and executing %s\n" % filename
    f = open(filename, "r")
    program = f.readlines()
    f.close()
    rps = running_paren_sums(program)
    full_line = ""
    for (paren_sum, program_line) in zip(rps, program):
        program_line = program_line.strip()
        full_line += program_line+" "
        if paren_sum == 0 and full_line.strip() != "":
            try:
                val = eval(parse(full_line))
                if val is not None: print to_string(val)
            except:
                handle_error()
                print "\nThe line in which the error occurred:\n%s" % full_line
                break
            full_line = ""
    repl()

def running_paren_sums(program):
    """
    Map the lines in the list program to a list whose entries contain
    a running sum of the per-line difference between the number of '('
    parentheses and the number of ')' parentheses.
    """
    count_open_parens = lambda line: line.count("(")-line.count(")")
    paren_counts = map(count_open_parens, program)
    rps = []
    total = 0
    for paren_count in paren_counts:
        total += paren_count
        rps.append(total)
    return rps

if __name__ == "__main__":
    if len(sys.argv) > 1: 
        load(sys.argv[1])
    else: 
        repl()
</pre>
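<p>To see how the line-merging works, here's <tt>running_paren_sums</tt> (copied from above) applied to a short program whose first expression spans two lines.  The running sum returns to zero exactly when an expression is complete, which is when <tt>load</tt> evaluates the merged line: </p>

```python
def running_paren_sums(program):
    "Running sums of the per-line difference in '(' and ')' counts."
    count_open_parens = lambda line: line.count("(") - line.count(")")
    paren_counts = map(count_open_parens, program)
    rps = []
    total = 0
    for paren_count in paren_counts:
        total += paren_count
        rps.append(total)
    return rps

program = ["(define square",
           "  (lambda (x) (* x x)))",
           "(square 4)"]

running_paren_sums(program)   # [1, 0, 0]
```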
<p>That completes the code for tiddlylisp!  A grand total of 153 lines of non-comment, non-whitespace code.  Here it all is, in one big block (commented and slightly reordered), so you can see how the pieces fit together: </p>
<pre>
#### tiddlylisp.py
#
# Based on Peter Norvig's lispy (http://norvig.com/lispy.html),
# copyright by Peter Norvig, 2010.
#
# Adaptations by Michael Nielsen.  See
# https://michaelnielsen.org/ddi/lisp-as-the-maxwells-equations-of-software/

import sys
import traceback

#### Symbol, Env classes

Symbol = str

class Env(dict):
    "An environment: a dict of {'var':val} pairs, with an outer Env."

    def __init__(self, params=(), args=(), outer=None):
        self.update(zip(params, args))
        self.outer = outer

    def find(self, var):
        "Find the innermost Env where var appears."
        return self if var in self else self.outer.find(var)

def add_globals(env):
    "Add some built-in procedures and variables to the environment."
    import operator
    env.update(
        {'+': operator.add,
         '-': operator.sub, 
         '*': operator.mul, 
         '/': operator.div, 
         '>': operator.gt, 
         '<': operator.lt, 
         '>=': operator.ge, 
         '<=': operator.le, 
         '=': operator.eq
         })
    env.update({'True': True, 'False': False})
    return env

global_env = add_globals(Env())

isa = isinstance

#### eval

def eval(x, env=global_env):
    "Evaluate an expression in an environment."
    if isa(x, Symbol):              # variable reference
        return env.find(x)[x]
    elif not isa(x, list):          # constant literal
        return x                
    elif x[0] == 'quote' or x[0] == 'q': # (quote exp), or (q exp)
        (_, exp) = x
        return exp
    elif x[0] == 'atom?':           # (atom? exp)
        (_, exp) = x
        return not isa(eval(exp, env), list)
    elif x[0] == 'eq?':             # (eq? exp1 exp2)
        (_, exp1, exp2) = x
        v1, v2 = eval(exp1, env), eval(exp2, env)
        return (not isa(v1, list)) and (v1 == v2)
    elif x[0] == 'car':             # (car exp)
        (_, exp) = x
        return eval(exp, env)[0]
    elif x[0] == 'cdr':             # (cdr exp)
        (_, exp) = x
        return eval(exp, env)[1:]
    elif x[0] == 'cons':            # (cons exp1 exp2)
        (_, exp1, exp2) = x
        return [eval(exp1, env)]+eval(exp2,env)
    elif x[0] == 'cond':            # (cond (p1 e1) ... (pn en))
        for (p, e) in x[1:]:
            if eval(p, env): 
                return eval(e, env)
    elif x[0] == 'null?':           # (null? exp)
        (_, exp) = x
        return eval(exp,env) == []
    elif x[0] == 'if':              # (if test conseq alt)
        (_, test, conseq, alt) = x
        return eval((conseq if eval(test, env) else alt), env)
    elif x[0] == 'set!':            # (set! var exp)
        (_, var, exp) = x
        env.find(var)[var] = eval(exp, env)
    elif x[0] == 'define':          # (define var exp)
        (_, var, exp) = x
        env[var] = eval(exp, env)
    elif x[0] == 'lambda':          # (lambda (var*) exp)
        (_, vars, exp) = x
        return lambda *args: eval(exp, Env(vars, args, env))
    elif x[0] == 'begin':           # (begin exp*)
        for exp in x[1:]:
            val = eval(exp, env)
        return val
    else:                           # (proc exp*)
        exps = [eval(exp, env) for exp in x]
        proc = exps.pop(0)
        return proc(*exps)

#### parsing

def parse(s):
    "Parse a Lisp expression from a string."
    return read_from(tokenize(s))

def tokenize(s):
    "Convert a string into a list of tokens."
    return s.replace('(',' ( ').replace(')',' ) ').split()

def read_from(tokens):
    "Read an expression from a sequence of tokens."
    if len(tokens) == 0:
        raise SyntaxError('unexpected EOF while reading')
    token = tokens.pop(0)
    if '(' == token:
        L = []
        while tokens[0] != ')':
            L.append(read_from(tokens))
        tokens.pop(0) # pop off ')'
        return L
    elif ')' == token:
        raise SyntaxError('unexpected )')
    else:
        return atom(token)

def atom(token):
    "Numbers become numbers; every other token is a symbol."
    try: return int(token)
    except ValueError:
        try: return float(token)
        except ValueError:
            return Symbol(token)

def to_string(exp):
    "Convert a Python object back into a Lisp-readable string."
    if not isa(exp, list):
        return str(exp)
    else:
        return '('+' '.join(map(to_string, exp))+')'         

#### Load from a file and run

def load(filename):
    """
    Load the tiddlylisp program in filename, execute it, and start the
    repl.  If an error occurs, execution stops, and we are left in the
    repl.  Note that load copes with multi-line tiddlylisp code by
    merging lines until the number of opening and closing parentheses
    match.
    """
    print "Loading and executing %s\n" % filename
    f = open(filename, "r")
    program = f.readlines()
    f.close()
    rps = running_paren_sums(program)
    full_line = ""
    for (paren_sum, program_line) in zip(rps, program):
        program_line = program_line.strip()
        full_line += program_line+" "
        if paren_sum == 0 and full_line.strip() != "":
            try:
                val = eval(parse(full_line))
                if val is not None: print to_string(val)
            except:
                handle_error()
                print "\nThe line in which the error occurred:\n%s" % full_line
                break
            full_line = ""
    repl()

def running_paren_sums(program):
    """
    Map the lines in the list program to a list whose entries contain
    a running sum of the per-line difference between the number of '('
    parentheses and the number of ')' parentheses.
    """
    count_open_parens = lambda line: line.count("(")-line.count(")")
    paren_counts = map(count_open_parens, program)
    rps = []
    total = 0
    for paren_count in paren_counts:
        total += paren_count
        rps.append(total)
    return rps

#### repl

def repl(prompt='tiddlylisp> '):
    "A prompt-read-eval-print loop."
    while True:
        try:
            val = eval(parse(raw_input(prompt)))
            if val is not None: print to_string(val)
        except KeyboardInterrupt:
            print "\nExiting tiddlylisp\n"
            sys.exit()
        except:
            handle_error()

#### error handling

def handle_error():
    """
    Simple error handling for both the repl and load.
    """
    print "An error occurred.  Here's the Python stack trace:\n"
    traceback.print_exc()

#### on startup from the command line

if __name__ == "__main__":
    if len(sys.argv) > 1: 
        load(sys.argv[1])
    else: 
        repl()
</pre>
<h3>Problems</h3>
<ul>
<li> Modify tiddlylisp so that the <tt>+</tt> procedure can be applied to   any number of arguments, e.g., so that <tt>(+ 1 2 3)</tt> evaluates to   <tt>6</tt>.
<li> Earlier we implemented a square root procedure in tiddlylisp.   Can you add it directly to tiddlylisp, using the Python <tt>math</tt>   module's <tt>sqrt</tt> function?
<li> In our earlier implementation of the <tt>sqrt</tt> procedure we   discussed the ordering of the lines of code, and whether it's okay   to define a procedure in terms of some other yet-to-be-defined   procedure.  Examine the code for tiddlylisp, and explain why it's   okay for a procedure such as <tt>sqrt</tt> to be defined in terms of a   procedure such as <tt>sqrt-iter</tt> which isn't defined until later.   Try doing the same thing with variables, e.g., try running   <tt>(define x y)</tt> followed by <tt>(define y 1)</tt>.  Does this   work?  If so, why?  If not, why not?
<li> Modify tiddlylisp so that when applied to one argument the   <tt>-</tt> procedure simply negates it, e.g., <tt>(- 2)</tt> returns   <tt>-2</tt>, while <tt>-</tt> still computes differences when applied to   two arguments.
<li> Is it possible to write a pure tiddlylisp procedure <tt>minus</tt> so   that <tt>(minus x)</tt> returns <tt>-x</tt>, while <tt>(minus x y)</tt>   returns <tt>x-y</tt>?
<li> In the discussion where we introduced <tt>cond</tt> I stated that <tt>(cond (</tt><em>p1 e1</em><tt>)...(</tt><em>pn en</em><tt>))</tt> is undefined when none of the expressions <em>p1...pn</em> evaluate to <tt>True</tt>.  What does tiddlylisp return in this case?  Can you think of a better way of dealing with this situation?
<li> Can you add support for strings to tiddlylisp?
</ul>
<p>When I first examined Norvig's code for lispy, I was surprised by just how much I learned from his code.  Of course, I expected to learn quite a bit - I am just a beginner at Lisp - but what I learned greatly exceeded my expectations.  Why might writing an interpreter deepen our understanding of a programming language?  I think the answer has to do with how we understand abstractions.  Consider the way I first explained the concept of Lisp environments, early in this essay: I gave a general discussion of the concept, and then related it to several of the examples we were working through.  This is the usual way we cope with abstractions when learning (or teaching) a language: we make those abstractions concrete by working through code examples that illustrate the consequences of those abstractions.  The problem is that although I can show you examples, the abstraction itself remains ephemeral.</p>
<p>Writing an interpreter is a way of making a programming language's abstractions concrete.  I can show you a million examples illustrating consequences of the Lisp environment, but none will have quite the same concrete flavour as the code for our Python Lisp interpreter. That code shows explicitly how the environment can be represented as a data structure, how it is manipulated by commands such as <tt>define</tt>, and so on.  And so writing an interpreter is a way of reifying abstractions in the programming language being interpreted.</p>
<h3>Problems</h3>
<ul>
<li> I gave the example of the environment as a Lisp abstraction   which is made more concrete when you understand the code for the   interpreter.  Another example of an abstraction is errors in code.   Can you improve tiddlylisp's error handling so that we get something   more informative than a Python stack trace when something goes   wrong?  One suggestion for how to do this is to identify two (or   more) classes of error that may occur in tiddlylisp programs, and to   modify the interpreter so it catches and gracefully handles those   error classes, printing informative error messages. </ul>
<p>On Peter Norvig's webpage describing his interpreter, a few commenters take him to task for writing his interpreter in Python.  Here's an <a href="http://norvig.com/lispy.html#comment-359244302">example</a> to give you the flavour of these comments: </p>
<blockquote><p>   This code looks very nice, but i think that implementing a Lisp   Interpreter in Python is some kind of cheating. Python is a   high-level language, so you get very much for free. </p></blockquote>
<p> Norvig <a href="http://norvig.com/lispy.html#comment-359339548">replies</a>: </p>
<blockquote><p>   You are right -- we are relying on many features of Python: call   stack, data types, garbage collection, etc. The next step would be   to show a compiler to some sort of assembly language.  I think   either the Java JVM or the Python byte code would be good targets.   We'd also need a runtime system with GC. I show the compiler in my   <a href="http://www.amazon.com/Paradigms-Artificial-Intelligence-Programming-Studies/dp/1558601910">PAIP     [Paradigms of Artificial Intelligence] book</a>. </p></blockquote>
<p> The commenter and Norvig are right, in some sense.  But there's also a sense in which the Python interpreter achieves something that would not be achieved by a program that compiled Lisp to the Java JVM, or to assembler, or some other target closer to the bare metal.  That's because of all programming languages, Python is one of the closest to an ordinary human language.  And so writing a Lisp interpreter in Python is an exceptionally clear way of explaining how Lisp works to a human who doesn't yet understand the core concepts of Lisp.  Insofar as I can guess at Norvig's intention, I believe the code for his interpreter is primarily intended to be read by humans, and the fact that it can also be read by a computer is a design constraint, not the fundamental purpose [3].</p>
<p>It seems to me that the kind of comment above arises because there are really three natural variations on the question "How to explain Lisp?".  All three variations are interesting, and worth answering; all three have different answers.  The first variation is how to explain Lisp to a person who doesn't yet know Lisp.  As I've just argued, a good answer to this question is to work through some examples, and then to write a simple Python interpreter.  The second variation is how to explain Lisp to a machine.  That's the question the commenter on Norvig's blog is asking, and to answer that question nothing beats writing a Lisp interpreter (or compiler) that works close to the bare metal, say in assembler, requiring you to deal explicitly with memory allocation, garbage collection, and so on.</p>
<p>But there's also a third variation on the question.  And that's how best to explain Lisp to someone who <em>already</em> understands the core concepts of Lisp.  That sounds paradoxical: doesn't such a person, by definition, already understand Lisp?  But it's not paradoxical at all.  Consider the following experience which many people have when learning (or teaching) mathematics.  The best way to explain a mathematical idea to someone new to the idea is using their old language and their old way of looking at the world.  This is like explaining Lisp by writing a Python interpreter.  But once the person has grasped a transformative new mathematical idea, they can often deepen their understanding by re-examining that idea within their new way of looking at the world.  That re-examination can help crystallize a deeper understanding.  In a similar way, while writing a Lisp interpreter in Python may be a good way of explaining Lisp to a person who doesn't yet understand Lisp, someone who grasps the core ideas of Lisp may find the Python interpreter a little clunky.  How should we explain Lisp within the framework of Lisp itself?  One answer to that question is to use Lisp to write a Lisp interpreter.  It's to that task that we now turn.</p>
<h3>Lisp in Lisp</h3>
<p>How should we write a Lisp interpreter in Lisp?  Let's think back to what Alan Kay saw at the bottom of page 13 of the LISP Manual:</p>
<p><a href="https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations.png"><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations.png" alt="" title="Lisp_Maxwells_Equations" width="480" class="alignnone size-full wp-image-63" srcset="https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations.png 555w, https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations-291x300.png 291w" sizes="(max-width: 555px) 100vw, 555px" /></a></p>
<p>Although it's written in a different notation than we've used, this is Lisp code.  In fact, it's the core of a Lisp interpreter written in Lisp: the procedure <tt>evalquote</tt> takes a Lisp expression as input, and then returns the value of that expression.  In this section we're going to use tiddlylisp to write an analogue to <tt>evalquote</tt> (we'll change the name to <tt>eval</tt>).  Of course, such a procedure is not really a full interpreter - we won't have a read-eval-print loop, for one thing - but it's not difficult to extend our code to a full interpreter (it requires a few additions to tiddlylisp, too). For this reason, in what follows I'll refer to our <tt>eval</tt> procedure as an "interpreter", even though it's more accurate to say that it's the core of an interpreter.  I haven't made the extension to a full interpreter here, partly because I don't want to lengthen an already long essay, but mostly because I want to stick to the theme of the "Maxwell's equations of software".  For the same reasons, I've also limited our <tt>eval</tt> to interpreting only a subset of tiddlylisp, omitting the arithmetical operations and concentrating instead on procedures for manipulating lists.</p>
<p>My treatment in this section is based on a beautiful <a href="http://lib.store.yahoo.net/lib/paulgraham/jmc.ps">essay</a> (postscript) by Paul Graham, in which he explains what the original designer of Lisp, John McCarthy, was up to in <a href="http://www-formal.stanford.edu/jmc/recursive.html">the paper</a> where he introduced Lisp.  In his essay, Graham writes a fully executable Lisp interpreter in one of the modern dialects of Lisp, Common Lisp, and I've based much of my code on Graham's.  Perhaps the main difference in my treatment is that while Graham's <tt>eval</tt> is written to be run under Common Lisp, our <tt>eval</tt> is executable in tiddlylisp, an interpreter for Lisp that we've written ourselves (with lots of help from Peter Norvig!).  So even though the code is very similar, the perspective is quite different, and I think we gain something from this different perspective.</p>
<p>The code we'll write is longer than what you see on page 13 of the LISP Manual.  The reason is that the code on page 13 was not actually self-contained, but made use of several procedures defined earlier in the LISP Manual, and we need to include those procedures.  The final result is still only a little over a page of code.  Let's start by defining a few of those helper procedures.</p>
<p><tt>(not </tt><em>exp</em><tt>)</tt> returns <tt>True</tt> if the expression <em>exp</em> evaluates to <tt>False</tt>, and otherwise returns <tt>False</tt>.  For example, </p>
<pre>
tiddlylisp> (not (atom? (q (1 2))))
True
tiddlylisp> (not (eq? 1 (- 2 1)))
False
</pre>
<p> Here's the tiddlylisp code for <tt>not</tt>: </p>
<pre>
(define not (lambda (x) (if x False True)))
</pre>
<p><tt>(append </tt><em>exp1 exp2</em><tt>)</tt> takes expressions <em>exp1</em> and <em>exp2</em> both of whose values are lists, and returns the list formed by concatenating those lists.  For example, </p>
<pre>
tiddlylisp> (append (q (1 2 3)) (q (4 5)))
(1 2 3 4 5)
</pre>
<p> Here's the tiddlylisp code for <tt>append</tt>: </p>
<pre>
(define append (lambda (x y)
		 (if (null? x) y (cons (car x) (append (cdr x) y)))))
</pre>
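<p>If it helps to see the same recursion in a more familiar notation, here's a Python sketch of <tt>append</tt> (my own illustrative translation, not part of tiddlylisp, with Python lists standing in for Lisp lists):</p>

```python
def append(x, y):
    # Mirror of the tiddlylisp definition: if x is empty return y,
    # otherwise cons the head of x onto (append (cdr x) y).
    if not x:                      # (null? x)
        return y
    return [x[0]] + append(x[1:], y)   # (cons (car x) (append (cdr x) y))
```
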
<p><tt>(pair </tt><em>exp1 exp2</em><tt>)</tt> returns a two-element list whose elements are the value of <em>exp1</em> and the value of <em>exp2</em>: </p>
<pre>
tiddlylisp> (pair 1 2)
(1 2)
tiddlylisp> (pair (+ 1 2) 1)
(3 1)
</pre>
<p> Here's the tiddlylisp code for <tt>pair</tt>: </p>
<pre>
(define pair (lambda (x y) (cons x (cons y (q ()) ))))
</pre>
<p> Note that my use of <tt>pair</tt> is somewhat unconventional - the more usual approach in Lisp is to use <tt>(list </tt><em>exp1 exp2   exp3...</em><tt>)</tt> to construct a list whose values are just the values of the respective expressions.  The reason I haven't done this is that tiddlylisp doesn't allow us to define Lisp procedures with a variable number of arguments.  Note also that the procedure <tt>pair</tt> that I've defined should not be confused with one of Scheme's standard procedures, <tt>pair?</tt>, which has a different purpose, and which we won't use in the current essay.</p>
<h3>Problems</h3>
<ul>
<li> Can you modify tiddlylisp so that <tt>(list </tt><em>exp1 exp2     exp3...</em><tt>)</tt> does indeed return a list whose values are just   the values of the respective expressions? </ul>
<p>I'll now introduce a class of helper procedures which are concatenations of two or more applications of <tt>car</tt> or <tt>cdr</tt>.  An example is the procedure <tt>cdar</tt>, which applies <tt>car</tt> first, followed by <tt>cdr</tt>, that is, <tt>(cdar </tt><em>exp</em><tt>)</tt> has the same value as <tt>(cdr (car </tt><em>exp</em><tt>))</tt>.  The notation <tt>cdar</tt> is a mnemonic, whose key elements are the middle two letters, <tt>d</tt> and <tt>a</tt>, indicating that <tt>cdar</tt> is what you get when you apply (in reverse order) <tt>cdr</tt> and <tt>car</tt>.  You might wonder why it's reverse order - the answer is that reverse order corresponds to the visual syntactic order, that is, the order from left-to-right that the procedures appear in the expression <tt>(cdr (car </tt><em>exp</em><tt>))</tt>.</p>
<p>As another example, the procedure <tt>caar</tt> is defined so that <tt>(caar </tt><em>exp</em><tt>)</tt> has the same value as <tt>(car (car </tt><em>exp</em><tt>))</tt>.  In our <tt>eval</tt> it'll be helpful to use several such procedures: </p>
<pre>
(define caar (lambda (x) (car (car x))))
(define cadr (lambda (x) (car (cdr x))))
(define cadar (lambda (x) (cadr (car x))))
(define caddr (lambda (x) (cadr (cdr x))))
(define caddar (lambda (x) (caddr (car x))))
</pre>
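<p>The same compositions can be mirrored in Python, which may help in keeping the mnemonics straight (an illustrative sketch of my own, using Python lists to stand in for Lisp lists):</p>

```python
# Python analogues of car, cdr, and their compositions.
car = lambda x: x[0]     # first element
cdr = lambda x: x[1:]    # everything after the first element

caar   = lambda x: car(car(x))
cadr   = lambda x: car(cdr(x))    # second element
cadar  = lambda x: cadr(car(x))
caddr  = lambda x: cadr(cdr(x))   # third element
caddar = lambda x: caddr(car(x))
```
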
<p>Our next helper procedure is called <tt>pairlis</tt>. <tt>(pairlis</tt><em> exp1 exp2</em><tt>)</tt> takes expressions <em>exp1</em> and <em>exp2</em> whose values are lists of the same length, and returns a list which is formed by pairing the values of corresponding elements.  For example, </p>
<pre>
tiddlylisp> (pairlis (q (1 2 3)) (q (4 5 6)))
((1 4) (2 5) (3 6))
</pre>
<p> Here's the tiddlylisp code for <tt>pairlis</tt>: </p>
<pre>
(define pairlis 
    (lambda (x y)
      (if (null? x)
	  (q ())
	  (cons (pair (car x) (car y)) (pairlis (cdr x) (cdr y))))))
</pre>
<p>We'll call a list of pairs such as that produced by <tt>pairlis</tt> an <em>association list</em>.  It gets this name from our final helper procedure, the <tt>assoc</tt> procedure, which takes an association list and treats it as a lookup dictionary.  The easiest way to explain what this means is through an example, </p>
<pre>
tiddlylisp> (define a (pairlis (q (1 2 3)) (q (4 5 6))))
tiddlylisp> a
((1 4) (2 5) (3 6))
tiddlylisp> (assoc 2 a)
5
</pre>
<p> In other words, <tt>assoc</tt> looks for the key <tt>2</tt> as the first entry in one of the pairs in the list which is the value of <tt>a</tt>. Once it finds such a pair, it returns the second element in the pair.</p>
<p>Stated more abstractly, suppose the expression <em>exp1</em> has a value which appears as the first entry in one of the pairs in the association list which is the value of <em>exp2</em>.  Then <tt>(assoc </tt><em>exp1 exp2</em><tt>)</tt> returns the second entry of that pair.</p>
<p>After all that explanation, the code for <tt>assoc</tt> is extremely simple, simpler even than <tt>pairlis</tt>: </p>
<pre>
(define assoc (lambda (x y)
		(if (eq? (caar y) x) (cadar y) (assoc x (cdr y)))))
</pre>
<p> I won't explain how <tt>assoc</tt> works, but if you're looking for a good exercise in applying <tt>caar</tt> and similar procedures, then it's worth spending some time to carefully understand how <tt>assoc</tt> works.</p>
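<p>As an aside, readers who find the recursions easier to follow in Python may like the following sketch of <tt>pair</tt>, <tt>pairlis</tt> and <tt>assoc</tt> (again, my own illustrative translation, not part of tiddlylisp):</p>

```python
def pair(x, y):
    # two-element list, as in the tiddlylisp pair
    return [x, y]

def pairlis(x, y):
    # pair up corresponding elements of two equal-length lists
    if not x:                      # (null? x)
        return []
    return [pair(x[0], y[0])] + pairlis(x[1:], y[1:])

def assoc(key, alist):
    # recursive lookup, mirroring the tiddlylisp assoc: return the second
    # element of the first pair whose first element is key
    if alist[0][0] == key:         # (eq? (caar y) x)
        return alist[0][1]         # (cadar y)
    return assoc(key, alist[1:])   # (assoc x (cdr y))
```
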
<p>With all these helper procedures in place, we can now write our equivalent to the code on page 13 of the LISP Manual.  This includes both the core procedure, <tt>eval</tt>, together with a couple of extra helper procedures, <tt>evcon</tt> and <tt>evlis</tt>.  Here's the code: </p>
<pre>
(define eval 
    (lambda (e a)
      (cond
	((atom? e) (assoc e a))
	((atom? (car e))
	 (cond
	   ((eq? (car e) (q car))   (car (eval (cadr e) a)))
	   ((eq? (car e) (q cdr))   (cdr (eval (cadr e) a)))
	   ((eq? (car e) (q cons))  (cons (eval (cadr e) a) (eval (caddr e) a)))
	   ((eq? (car e) (q atom?)) (atom? (eval (cadr e) a)))
	   ((eq? (car e) (q eq?))   (eq? (eval (cadr e) a) (eval (caddr e) a)))
	   ((eq? (car e) (q quote)) (cadr e))
	   ((eq? (car e) (q q))     (cadr e))
	   ((eq? (car e) (q cond))  (evcon (cdr e) a))
	   (True                   (eval (cons (assoc (car e) a) (cdr e)) a))))
	((eq? (caar e) (q lambda))
	 (eval (caddar e) (append (pairlis (cadar e) (evlis (cdr e) a)) a))))))

(define evcon 
    (lambda (c a)
      (cond ((eval (caar c) a) (eval (cadar c) a))
	    (True              (evcon (cdr c) a)))))

(define evlis 
    (lambda (m a)
      (cond ((null? m) (q ()))
	    (True     (cons (eval (car m) a) (evlis (cdr m) a))))))
</pre>
<p> Before we examine how <tt>eval</tt> works, I want to give you some examples of <tt>eval</tt> in action.  If you want, you can follow along with the examples by first loading the program defining <tt>eval</tt> into tiddlylisp (the full source is below), and then typing the examples into the interpreter.</p>
<p>To understand how to use <tt>eval</tt> in examples, we need to be clear about the meaning of its arguments.  <tt>e</tt> is a Lisp expression whose value is the Lisp expression that we want to evaluate with <tt>eval</tt>.  And <tt>a</tt> is a Lisp expression whose value is an association list, representing the environment.  In particular, the first element of each pair in <tt>a</tt> is the name of a variable or procedure, and the second element is the value of that variable or procedure.  I'll often refer to <tt>a</tt> just as the environment.</p>
<p>Suppose, for example, that we wanted to use <tt>eval</tt> to evaluate the expression <tt>(car (q (1 2)))</tt>.  We'll assume that we're evaluating it in the empty environment, that is, no variables or extra procedures have been defined.  Then we'd need to pass <tt>eval</tt> expressions with values <tt>(car (q (1 2)))</tt> and <tt>()</tt>.  We can do this by quoting those values: </p>
<pre>
tiddlylisp> (eval (q (car (q (1 2)))) (q ()))
1
</pre>
<p> As you can see, we get the right result: <tt>1</tt>.</p>
<p>I explained in detail how to build up the expression <tt>(eval (q (car...)</tt> evaluated above.  But if we hadn't gone through that explanation, then the expression would have appeared quite complicated, with lots of quoting going on.  The reason is that <tt>eval</tt> is evaluating an expression which is itself the value of another expression.  With so much evaluation going on it's no wonder there are many <tt>q</tt>'s floating around!  But after working carefully through a few examples it all becomes transparent.</p>
<p>Here's an example showing how to use variables in the environment: </p>
<pre>
tiddlylisp> (eval (q (cdr x)) (q ((x (1 2 3)))))
(2 3)
</pre>
<p> Unpacking the quoting, we see that it's evaluating the expression <tt>(cdr x)</tt> in an environment with a variable <tt>x</tt> whose value is <tt>(1 2 3)</tt>.  The result is, of course, <tt>(2 3)</tt>.</p>
<p>Here's an example showing how to use a procedure which has been defined in the environment: </p>
<pre>
tiddlylisp> (eval (q (cddr (q (1 2 3 4 5)))) (q ((cddr (lambda (x) (cdr (cdr x)))))))
(3 4 5)
</pre>
<p> In other words, the environment stores a procedure <tt>cddr</tt> whose value is <tt>(lambda (x) (cdr (cdr x)))</tt>, and <tt>eval</tt> returns the result of applying <tt>cddr</tt> to an expression whose value is <tt>(1 2 3 4 5)</tt>.  Of course, this is just <tt>(3 4 5)</tt>.</p>
<p>We can also use <tt>eval</tt> to define and evaluate an anonymous procedure, in this case one that has the same effect as <tt>cadr</tt>: </p>
<pre>
tiddlylisp> (eval (q ((lambda (x) (car (cdr x))) (q (1 2 3 4)))) (q ()))
2
</pre>
<p>A significant drawback of <tt>eval</tt> is that it has a pretty limited Lisp vocabulary.  You can see this by running: </p>
<pre>
tiddlylisp> (eval (q (eq? 1 1)) (q (())))
&lt;Error message&gt;
</pre>
<p> The first line looks like perfectly valid Lisp - in fact, it is perfectly valid Lisp.  The problem is that <tt>eval</tt> doesn't recognize <tt>1</tt> - at the level of sophistication at which we're working, it really only understands lists, variables, and procedures.  So what it tries to do is treat <tt>1</tt> as a variable or procedure to look up in the environment, <tt>a</tt>.  But <tt>1</tt> isn't in the environment, which is why there's an error message.</p>
<p>Fixing this problem by modifying <tt>eval</tt> isn't terribly difficult [4].  However, to stay close to the LISP Manual, I'll leave this as is.  A kludge to get around this issue is to add <tt>1</tt> as a key in the environment.  For example, we can use: </p>
<pre>
tiddlylisp> (eval (q (eq? 1 1)) (q ((1 1))))
True
tiddlylisp> (eval (q (eq? 1 2)) (q ((1 1) (2 2))))
False
</pre>
<p> This is exactly as expected.  We didn't see this problem in our earlier examples of <tt>eval</tt>, simply because they involved list manipulations which didn't require us to evaluate numbers such as <tt>1</tt>.  Incidentally, here's an amusing variation on the above kludge: </p>
<pre>
tiddlylisp> (eval (q (eq? 1 2)) (q ((1 1) (2 1))))
True
</pre>
<p> In other words, if we tell our interpreter emphatically enough that <tt>1 = 2</tt> then it will start to believe it!  </p>
<p>Just to put <tt>eval</tt> through its paces, let's add a bundle of tests of basic functionality.  It's not an exhaustive test suite, but at least checks that the basic procedures are working as we expect.  You don't need to read through the following test code in exhaustive detail, although you should read at least the first few lines, to get a feeling for what's going on.  Note that in a few of the lines we need to add something like <tt>1</tt> or <tt>2</tt> to the environment, in order that <tt>eval</tt> be able to evaluate it, as occurred in the example just above. </p>
<pre>
(define assert-equal (lambda (x y) (= x y)))

(define assert-not-equal (lambda (x y) (not (assert-equal x y))))

(assert-equal (eval (q x) (q ((x test-value))))
	      (q test-value))
(assert-equal (eval (q y) (q ((y (1 2 3)))))
	      (q (1 2 3)))
(assert-not-equal (eval (q z) (q ((z ((1) 2 3)))))
		  (q (1 2 3)))
(assert-equal (eval (q (quote 7)) (q ()))
	      (q 7))
(assert-equal (eval (q (atom? (q (1 2)))) (q ()))
	      False)
(assert-equal (eval (q (eq? 1 1)) (q ((1 1))))
	      True)
(assert-equal (eval (q (eq? 1 2)) (q ((1 1) (2 2))))
	      False)
(assert-equal (eval (q (eq? 1 1)) (q ((1 1))))
	      True)
(assert-equal (eval (q (car (q (3 2)))) (q ()))
	      (q 3))
(assert-equal (eval (q (cdr (q (1 2 3)))) (q ()))
	      (q (2 3)))
(assert-not-equal (eval (q (cdr (q (1 (2 3) 4)))) (q ()))
		  (q (2 3 4)))
(assert-equal (eval (q (cons 1 (q (2 3)))) (q ((1 1)(2 2)(3 3))))
	      (q (1 2 3)))
(assert-equal (eval (q (cond ((atom? x) (q x-atomic)) 
			     ((atom? y) (q y-atomic)) 
			     ((q True) (q nonatomic)))) 
		    (q ((x 1)(y (3 4)))))
	      (q x-atomic))
(assert-equal (eval (q (cond ((atom? x) (q x-atomic)) 
			     ((atom? y) (q y-atomic)) 
			     ((q True) (q nonatomic)))) 
		    (q ((x (1 2))(y 3))))
	      (q y-atomic))
(assert-equal (eval (q (cond ((atom? x) (q x-atomic)) 
			     ((atom? y) (q y-atomic)) 
			     ((q True) (q nonatomic)))) 
		    (q ((x (1 2))(y (3 4)))))
	      (q nonatomic))
(assert-equal (eval (q ((lambda (x) (car (cdr x))) (q (1 2 3 4)))) (q ()))
	      2)
</pre>
<p> In tiddlylisp, perhaps the easiest way to use this test code is to append it at the bottom of the file where we define <tt>eval</tt>. Then, when we load that file into memory, the tests run automatically. If everything is working properly, then all the tests should evaluate to <tt>True</tt>.</p>
<p>How does <tt>eval</tt> work?  Looking back at the code, we see that it's just a big <tt>cond</tt> statement, whose value is determined by which of various conditions evaluate to <tt>True</tt>.  The <tt>cond</tt> statement starts off: </p>
<pre>
      (cond
	((atom? e) (assoc e a))
        ...
</pre>
<p> To understand what this accomplishes, it is helpful to remember that what we're most interested in is <em>the value</em> of <tt>e</tt>, not <tt>e</tt> itself.  Let's use <tt>e'</tt> to denote the value of <tt>e</tt>, i.e., <tt>e'</tt> is the Lisp expression that we actually want to evaluate using <tt>eval</tt>.  Then what the condition above does is check whether <tt>e'</tt> is atomic, and if so it returns the value of the corresponding variable or procedure in the environment, exactly as we'd expect.</p>
<p>Let's look at the next line in the big outer conditional statement: </p>
<pre>
	((atom? (car e))
</pre>
<p> At this stage, we know that <tt>e'</tt> isn't atomic, since we already checked for that, and so <tt>e'</tt> must be a list.  This line checks to see whether the first element of <tt>e'</tt> is itself an atom.  If it is, then there are multiple possibilities: it could be a special form, such as <tt>quote</tt>, or a built-in procedure, such as <tt>car</tt>, or else a procedure that's defined in the environment.  To check which of these possibilities is the case, we evaluate another (nested) conditional statement.  This just checks off the different cases, for instance the first line of the nested conditional checks to see if we're applying the procedure <tt>car</tt>, and if so proceeds appropriately, </p>
<pre>
	   ((eq? (car e) (q car))   (car (eval (cadr e) a)))
</pre>
<p> In other words, if the first symbol in <tt>e'</tt> is <tt>car</tt>, then extract whatever expression is being passed to <tt>car</tt>, using <tt>(cadr e)</tt>, then evaluate that expression using <tt>(eval (cadr e) a)</tt>, and finally extract the first element, using <tt>(car (eval...))</tt>.  That's exactly what we'd expect <tt>car</tt> to do.  Most of the rest of this nested conditional statement works along similar lines, as you can check yourself.  The final line is interesting, and deserves comment: </p>
<pre>
	   (True                   (eval (cons (assoc (car e) a) (cdr e)) a))))
</pre>
<p> This line is evaluated when the expression <tt>e'</tt> does not start with a special form or built-in procedure, but instead starts with the name of a procedure defined in the environment.  To understand what is returned, note that <tt>(car e)</tt> retrieves the name of the procedure, so <tt>(assoc (car e) a)</tt> can retrieve the procedure from the environment, and then <tt>(cons (assoc (car e) a) (cdr e))</tt> appends the arguments to the procedure.  The whole thing is then evaluated.  It's all quite simple and elegant!</p>
<p>Moving back into the outer <tt>cond</tt> statement, the final condition is as follows: </p>
<pre>
	((eq? (caar e) (q lambda))
	 (eval (caddar e) (append (pairlis (cadar e) (evlis (cdr e) a)) a))))))
</pre>
<p> This occurs when evaluating a quoted expression of the form <tt>((lambda (</tt><em>x...</em><tt>) </tt><em>exp</em><tt>) ...)</tt>.  The first line simply checks that we are, indeed, seeing a <tt>lambda</tt> expression.  The <tt>caddar e</tt> extracts the expression <em>exp</em> from the body of the <tt>lambda</tt> expression.  We evaluate this in the context of an environment which has been modified by <tt>append</tt>ing some new variable names (extracted with <tt>cadar e</tt>), using <tt>pairlis</tt> to pair them with their values, which are evaluated using <tt>evlis</tt> (which you can work through yourself).  Once again, it's all quite simple and neat - a fact which speaks to the marvellous elegance of the design presented in the LISP Manual (and, ultimately, due to John McCarthy).</p>
<p>It won't have escaped your attention that our Lisp <tt>eval</tt> is very similar to the <tt>eval</tt> we wrote earlier in Python.  Tiddlylisp is somewhat different to the dialect of Lisp our <tt>eval</tt> interprets, but the implementation is recognizably similar.  It is a matter of taste, but I think the Lisp implementation is more elegant.  It's true that the Lisp code is superficially a little more complex - it relies more on concepts outside our everyday experience, such as the procedures <tt>caar</tt>, <tt>cadar</tt>, and so on.  But it makes up for that by possessing a greater conceptual economy, in that we are using concepts such as <tt>car</tt>, <tt>cdr</tt> and <tt>cond</tt> to write an interpreter which understands those very same concepts.</p>
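<p>To make the comparison concrete, here's a rough Python sketch of this same miniature <tt>eval</tt>, with Lisp expressions represented as nested Python lists and symbols as strings.  The names and representation are my own illustrative choices, not code from tiddlylisp or the LISP Manual, but the structure mirrors the tiddlylisp code line for line:</p>

```python
def assoc(x, a):
    # look up x in the association list a (a list of [key, value] pairs)
    return next(v for (k, v) in a if k == x)

def lisp_eval(e, a):
    if not isinstance(e, list):                       # (atom? e)
        return assoc(e, a)
    op = e[0]
    if not isinstance(op, list):                      # (atom? (car e))
        if op == 'car':   return lisp_eval(e[1], a)[0]
        if op == 'cdr':   return lisp_eval(e[1], a)[1:]
        if op == 'cons':  return [lisp_eval(e[1], a)] + lisp_eval(e[2], a)
        if op == 'atom?': return not isinstance(lisp_eval(e[1], a), list)
        if op == 'eq?':   return lisp_eval(e[1], a) == lisp_eval(e[2], a)
        if op in ('quote', 'q'):
            return e[1]
        if op == 'cond':                              # the role of evcon
            for (test, branch) in e[1:]:
                if lisp_eval(test, a):
                    return lisp_eval(branch, a)
        # otherwise: a procedure named in the environment - look it up
        # and cons it onto its arguments, then evaluate the result
        return lisp_eval([assoc(op, a)] + e[1:], a)
    if op[0] == 'lambda':                             # ((lambda (params) body) args...)
        params, body = op[1], op[2]
        args = [lisp_eval(arg, a) for arg in e[1:]]   # the role of evlis
        return lisp_eval(body, list(zip(params, args)) + a)
```

<p>As in the tiddlylisp version, numbers aren't understood: they must be added to the environment as their own values, just as in the kludge above.</p>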
<p>Here's the full code for our Lisp interpreter in tiddlylisp.  You should append the test code given above, and save it all as a single file, <tt>eval.tl</tt>.   </p>
<pre>
(define caar (lambda (x) (car (car x))))
(define cadr (lambda (x) (car (cdr x))))
(define cadar (lambda (x) (cadr (car x))))
(define caddr (lambda (x) (cadr (cdr x))))
(define caddar (lambda (x) (caddr (car x))))

(define not (lambda (x) (if x False True)))

(define append (lambda (x y)
		 (if (null? x) y (cons (car x) (append (cdr x) y)))))

(define pair (lambda (x y) (cons x (cons y (q ()) ))))

(define pairlis 
    (lambda (x y)
      (if (null? x)
	  (q ())
	  (cons (pair (car x) (car y)) (pairlis (cdr x) (cdr y))))))

(define assoc (lambda (x y)
		(if (eq? (caar y) x) (cadar y) (assoc x (cdr y)))))

(define eval 
    (lambda (e a)
      (cond
	((atom? e) (assoc e a))
	((atom? (car e))
	 (cond
	   ((eq? (car e) (q car))   (car (eval (cadr e) a)))
	   ((eq? (car e) (q cdr))   (cdr (eval (cadr e) a)))
	   ((eq? (car e) (q cons))  (cons (eval (cadr e) a) (eval (caddr e) a)))
	   ((eq? (car e) (q atom?)) (atom? (eval (cadr e) a)))
	   ((eq? (car e) (q eq?))   (eq? (eval (cadr e) a) (eval (caddr e) a)))
	   ((eq? (car e) (q quote)) (cadr e))
	   ((eq? (car e) (q q))     (cadr e))
	   ((eq? (car e) (q cond))  (evcon (cdr e) a))
	   (True                   (eval (cons (assoc (car e) a) (cdr e)) a))))
	((eq? (caar e) (q lambda))
	 (eval (caddar e) (append (pairlis (cadar e) (evlis (cdr e) a)) a))))))

(define evcon 
    (lambda (c a)
      (cond ((eval (caar c) a) (eval (cadar c) a))
	    (True              (evcon (cdr c) a)))))

(define evlis 
    (lambda (m a)
      (cond ((null? m) (q ()))
	    (True     (cons (eval (car m) a) (evlis (cdr m) a))))))
</pre>
<p>It's instructive to compare our <tt>eval</tt> to what Kay saw on page 13 of the LISP 1.5 Programmer's Manual:</p>
<p><a href="https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations.png"><img decoding="async" src="https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations.png" alt="" title="Lisp_Maxwells_Equations" width="480" class="alignnone size-full wp-image-63" srcset="https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations.png 555w, https://michaelnielsen.org/ddi/wp-content/uploads/2012/04/Lisp_Maxwells_Equations-291x300.png 291w" sizes="(max-width: 555px) 100vw, 555px" /></a></p>
<p>Obviously, what we've written is longer than that half-page!  However, as I mentioned earlier, that half-page omitted the code for helper procedures such as <tt>caar</tt>, <tt>append</tt>, and so on, which were defined earlier in the LISP Manual. A more direct comparison is to our code for the <tt>eval, evcon</tt> and <tt>evlis</tt> procedures.</p>
<p>If you compare our code to the LISP Manual, a few differences jump out.  The most obvious is that the LISP Manual's <tt>evalquote, apply</tt> and <tt>eval</tt> have all been combined into one procedure.  This is a form of organization I adopted from Paul Graham's <tt>eval</tt>, and it makes it much easier to see what is going on, in the outer <tt>cond</tt>.  In particular, the outer <tt>cond</tt> has a very simple structure: (1) if the expression we're trying to evaluate is an atom, return its value; otherwise (2) the expression must be a list, so check to see if the first element is an atom, in which case it must be a special form or procedure, and should be evaluated appropriately (this is the inner <tt>cond</tt>); and otherwise (3) we must be dealing with a <tt>lambda</tt> expression.</p>
<p>Condition (3) is interesting.  With the syntax we're using, the condition in step (3) could simply be expressed as <tt>True</tt>, not <tt>(eq? (caar e) (q lambda))</tt>, since it's the only remaining possibility.  This would, in some sense, simplify (and speed up) the code.  However, it would also make it harder to understand the intent of the code.  </p>
<p>One thing present in the LISP Manual but missing from our <tt>eval</tt> is the special form <tt>label</tt>.  In the LISP Manual, <tt>label</tt> was used to give names to procedures, so that procedure definitions could refer recursively to themselves.  It's only a couple of lines to add back in, but I haven't done so here.  If you'd like, it's a fun challenge to restore this functionality, and so I've given it as a problem below.</p>
<h3>Problems</h3>
<ul>
<li> How could you add a facility to <tt>eval</tt> so that procedure   definitions can refer to themselves?  If you're having trouble with   this problem, you can get a hint by looking at the code from page 13   of the LISP Manual. A complete solution to the problem may be found   in Paul Graham's   <a href="http://lib.store.yahoo.net/lib/paulgraham/jmc.ps">essay about     the roots of Lisp</a>. </ul>
<p>This is a nice little interpreter.  However, it has many limitations, even when compared to tiddlylisp.  It can't do basic arithmetic; it doesn't cope with integers, much less more complicated data types; and it doesn't even have a way of <tt>define</tt>ing variables (after all, it doesn't return a modified environment).  Still, it already contains many of the core concepts of Lisp, and it really is an executable counterpart to what Alan Kay saw on page 13 of the LISP 1.5 Manual.</p>
<p>What can we learn from this interpreter for Lisp in Lisp?  As an intellectual exercise, it's cute, but beyond that, so what?  Let's think about the analogous question for Python, i.e., writing a Python function that can return the value of Python expressions.  In some sense, solving this problem is a trivial one-liner, since Python has a built-in function called <tt>eval</tt> that is capable of evaluating Python expressions.</p>
<p>What if, however, we eliminated <tt>eval</tt> (and similar functions) from Python?  What then?  Well, a Python version of <tt>eval</tt> would be much more complicated than our Lisp <tt>eval</tt>.  Python is a much less regular language than Lisp, and that makes it much harder for Python code to manipulate Python code.  By contrast, Lisp has an extremely simple syntax, and is designed to manipulate its own code as data.  This is all reflected in the simplicity of the interpreter above. </p>
<p>Beyond this, the code for <tt>eval</tt> is a beautiful expression of the core ideas of Lisp, written in Lisp.  It's true that our <tt>eval</tt> implements a very incomplete version of Lisp, but with just a little elaboration we can add support for arithmetic, more advanced control structures, and so on - everything needed to make this an essentially complete basic Lisp.  And so we need only a little poetic license to say that, just as with Maxwell's equations and electromagnetism, there is a sense in which if you can look at this compact little program and understand all its consequences, then you understand all that Lisp can do.  And because Lisp is universal, that means that inside these few lines of code is all a computer can do - everything from Space Invaders to computer models of climate to the Google search engine.  In that sense this elegant little program really is the Maxwell's equations of software.</p>
<h3>Problems</h3>
<ul>
<li> Outline a proof that <tt>(eval (q (</tt><em>exp</em><tt>)) a)</tt>   returns the value of <em>exp</em> in the environment <tt>a</tt> for all   expressions <em>exp</em> and environments <tt>a</tt> if and only if the   underlying Lisp interpreter is correct.  This little theorem can be   considered a formal way of stating that <tt>eval</tt> contains all of   Lisp.  The reason I ask for an outline proof only is that various   elements in the statement aren't defined as well as they need to be   to make this a rigorous result; still, a compelling outline proof is   possible.
<li> Extend the code given for <tt>eval</tt> so that you can implement   a full read-eval-print loop.  This will require you to extend   tiddlylisp so that it can cope with input and output, and (perhaps)   some sort of looping.
<li> Having worked through <tt>eval.tl</tt>, it should now be easy to   work through the first chapter of the LISP 1.5 Programmer's Manual.   <a href="http://www.softwarepreservation.org/projects/LISP/book/LISP%201.5%20Programmers%20Manual.pdf">Download the LISP manual</a> and work through the first chapter, including the   code on page 13. </ul>
<h3>Problems for the author</h3>
<ul>
<li> Is it possible to modify the above Lisp-in-Lisp so that it   interprets all of tiddlylisp?  Note that this will require   modification of tiddlylisp. </ul>
<h3>Acknowledgements</h3>
<p>Thanks to <a href="http://jendodd.com">Jen Dodd</a> for many helpful discussions.</p>
<h3>Footnote</h3>
<p>[1] I'm paraphrasing, since this was 17 years ago, but I believe I've reported the essence of the comments correctly.  I've taken one liberty, which is in supplying my own set of examples (antennas, motors, and circuits), since I don't recall the examples he gave. Incidentally, his comments contain a common error that took me several years to sort out in my own thinking: Maxwell's equations actually don't completely specify electromagnetism.  For that, you need to augment them with one extra equation, the <a href="http://en.wikipedia.org/wiki/Lorentz_force">Lorentz force law</a>. It is, perhaps, unfair to characterize this as an error, since it's common usage to equate Maxwell's equations with electromagnetism, and I've no doubt my professor was aware of the nuance.  Nonetheless, while the usage is common, it's not correct, and you really do need the Lorentz force law as well.  This nuance creates a slight quandary for me in this essay.  As the title of the essay suggests, it explores a famous remark made by Alan Kay about what he called "the Maxwell's equations of software", and I presume that in making this remark Kay was following the common usage of equating the consequences of Maxwell's equations with the whole of electromagnetism.  My quandary is this: on the one hand I don't wish to perpetuate this usage; on the other hand I think Kay's formulation is stimulating and elegant. So I'll adopt the same usage, but with the admonishment that you should read "Maxwell's equations" as synonymous with "Maxwell's equations plus the Lorentz force law".  That set of five equations really does specify the whole of (classical) electromagnetism.</p>
<p>[2] I first encountered <tt>lambda</tt> in Python, not Lisp, but I believe I would have been perplexed for the same reason even if I'd first encountered it in Lisp.</p>
<p>[3] With a hat tip to Abelson and Sussman, who <a href="http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-7.html#%_chap_Temp_4">famously wrote</a> "programs must be written for people to read, and only incidentally for machines to execute".</p>
<p>[4] The simplest solutions I can think of are: (1) to give <tt>eval</tt> the ability to determine when some key is not in the environment; or (2) to give <tt>eval</tt> the ability to recognize numbers.  Both approaches seem to also require making some modifications to tiddlylisp.</p>
<h3>Further reading</h3>
<p>Much of Alan Kay's writing may be found at the website of the <a href="http://vpri.org/html/writings.php">Viewpoints Research   Institute</a>.  I also recommend browsing his <a href="http://www.squeakland.org/resources/books/readingList.jsp">list   of recommended reading</a>.</p>
<p>Lisp enjoys a plethora of insightful and beautifully written books and essays, many of them freely available online.  This essay is, of course, based principally on Peter Norvig's essay on his <a href="http://norvig.com/lispy.html">basic Lisp interpreter, lispy</a>. I've also drawn a few ideas from a followup essay of Norvig's which describes a <a href="http://norvig.com/lispy2.html">more sophisticated   Lisp interpreter</a>.  Both essays (and the accompanying code) are marvellously elegant, and well worth working through.  Norvig's <a href="http://norvig.com/">other works</a> are also worth your time.  The first three chapters of Norvig's book <a href="http://www.amazon.com/exec/obidos/ASIN/1558601910">Paradigms of   Artificial Intelligence Programming</a> are an excellent introduction to Common Lisp.</p>
<p>The other principal inspiration for the current essay is Paul Graham's essay <a href="http://lib.store.yahoo.net/lib/paulgraham/jmc.ps">The   Roots of Lisp</a> (postscript file), where he explains John McCarthy's early ideas about Lisp.  My essay may be viewed as an attempt to remix the ideas in Norvig's and Graham's essays, in order to better understand Alan Kay's remark about Lisp-as-Maxwell's-Equations.  I also recommend Graham's book "On Lisp", which contains an excellent discussion of Lisp macros and many other subjects.  The book seems to be out of print, but thanks to Graham and the publisher Prentice Hall the text of the entire book is <a href="http://www.paulgraham.com/onlisptext.html">freely available   online</a>.  Note that I am still working through "On Lisp"; I have not yet read it to completion.  The same is true of the books I mention below, by Seibel, and by Abelson and Sussman.</p>
<p>Although "On Lisp" is a marvellous book, it's not written for people new to Lisp.  To gain familiarity, I suggest working through the first three chapters of Norvig's book, mentioned above.  If that's not available, then you should take a look at <a href="http://www.gigamonkeys.com/">Peter Seibel</a>'s book <a href="http://www.gigamonkeys.com/book/">Practical Common Lisp</a>.  It's freely available at the link, and gives an easily readable introduction to Common Lisp.</p>
<p>Finally, I must recommend the wonderful book by Abelson and Sussman on the <a href="http://www.amazon.com/Structure-Interpretation-Computer-Programs-Engineering/dp/0262011530">Structure   and Interpretation of Computer Programs</a>.  Among other things, it's an introduction to the Scheme dialect of Lisp, but it's about much more than that; it's about how to think about programming.  It's a famous book, but for a long time I avoided looking at it, because I'd somehow picked up the impression that it was a little dry.  I started reading, and found this impression was completely false: I was utterly gripped.  Much of "On Lisp" and "Paradigms of Artificial Intelligence Programming" also have this quality.  Abelson and Sussman's book is <a href="http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-4.html">freely   available online</a>. </p>
<p>  <em>Interested in more?  Please <a href="https://michaelnielsen.org/ddi/feed/">subscribe to this blog</a>, or <a href="http://twitter.com/#!/michael_nielsen">follow me on Twitter</a>.  You may also enjoy reading my new book about  open science, <a href="http://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/product-description/0691148902">Reinventing Discovery</a>. </em> </p>
]]></content:encoded>
					
					<wfw:commentRss>https://michaelnielsen.org/ddi/lisp-as-the-maxwells-equations-of-software/feed/</wfw:commentRss>
			<slash:comments>37</slash:comments>
		
		
			</item>
		<item>
		<title>How changing the structure of the web changes PageRank</title>
		<link>https://michaelnielsen.org/ddi/how-changing-the-structure-of-the-web-changes-pagerank/</link>
		
		<dc:creator><![CDATA[Michael Nielsen]]></dc:creator>
		<pubDate>Tue, 06 Mar 2012 01:15:54 +0000</pubDate>
				<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">https://michaelnielsen.org/ddi/?p=56</guid>

					<description><![CDATA[Suppose I add a hyperlink from a webpage to a webpage . In principle, adding that single hyperlink changes the PageRank not just of those two pages, but potentially of nearly every other page on the web. For instance, if the CNN homepage http://cnn.com adds a link to the homepage of my site, https://michaelnielsen.org, then&#8230; <a class="more-link" href="https://michaelnielsen.org/ddi/how-changing-the-structure-of-the-web-changes-pagerank/">Continue reading <span class="screen-reader-text">How changing the structure of the web changes PageRank</span></a>]]></description>
										<content:encoded><![CDATA[<p>Suppose I add a hyperlink from a webpage <img src='https://s0.wp.com/latex.php?latex=w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w' title='w' class='latex' /> to a webpage <img src='https://s0.wp.com/latex.php?latex=w%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w&#039;' title='w&#039;' class='latex' />.  In principle, adding that single hyperlink changes the PageRank not just of those two pages, but potentially of nearly every other page on the web.  For instance, if the CNN homepage http://cnn.com adds a link to the homepage of my site, https://michaelnielsen.org, then not only will that increase the PageRank of my homepage, it will also increase (though in a smaller way) the PageRank of pages my homepage links to, and then the pages they link to, and so on, a cascading network of ever-smaller changes to PageRank.</p>
<p>In this post I investigate how PageRank changes when the structure of the web changes.  We&#8217;ll look at the impact of adding and removing multiple links from the web, and also at what happens when entirely new webpages are created.  In each case we&#8217;ll derive quantitative bounds on the way in which PageRank can change.</p>
<p>I&#8217;ve no doubt that most or all of the results I describe here are already known &#8211; PageRank has been studied in <a href="http://scholar.google.com/scholar?cites=12735303212700583171&#038;as_sdt=2005&#038;sciodt=0,5&#038;hl=en">great   detail</a> in the academic literature, and presumably the bounds I describe (or better) have been obtained by others.  This post is for my own pleasure (and to improve my own understanding) in attacking these questions from first principles; it is, essentially, a cleaned up version of my notes on the problem, with the hope that the notes may also be of interest to others.  For an entree into the academic literature on these questions, see <a href="http://scholar.google.com/scholar?q=changes+in+pagerank">here</a>.</p>
<p>The post requires a little undergraduate-level linear algebra to follow, as well as some basic familiarity with how PageRank is defined and calculated.  I&#8217;ve written an introduction to PageRank <a href="https://michaelnielsen.org/blog/lectures-on-the-google-technology-stack-1-introduction-to-pagerank/">here</a> (see also <a href="https://michaelnielsen.org/blog/?p=511">here</a>, <a href="https://michaelnielsen.org/blog/?p=516">here</a>, <a href="https://michaelnielsen.org/blog/?p=523">here</a> and <a href="https://michaelnielsen.org/blog/?p=534">here</a>).  However, I&#8217;ve included a short summary below that should be enough to follow this post.</p>
<p>One final note before we get into the analysis: I understand that there are some people reading my blog who are interested in search engine optimization (SEO).  If you&#8217;re from that community, then I should warn you in advance that this post doesn&#8217;t address questions like how to link in such a way as to boost the PageRank of a particular page.  Rather, the results I obtain are general bounds on <em>global</em> changes in PageRank as a result of local changes to the structure of the web.  Understanding such global changes is of great interest if you&#8217;re trying to understand how a search engine using PageRank might work (which is my concern), but probably of less interest if your goal is to boost the PageRank of individual pages (SEO&#8217;s concern).</p>
<h3>PageRank summary</h3>
<p>PageRank is defined by imagining a crazy websurfer, who surfs the web by randomly following links on any given page.  They interrupt this random websurfing occasionally to &#8220;teleport&#8221; to a page chosen completely at random from the web.  The probability of teleporting <img src='https://s0.wp.com/latex.php?latex=t&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t' title='t' class='latex' /> in any given step is assumed to be around <img src='https://s0.wp.com/latex.php?latex=t+%3D+0.15&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t = 0.15' title='t = 0.15' class='latex' />, in line with the <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.5427">original   PageRank paper</a>.  The PageRank of a given webpage <img src='https://s0.wp.com/latex.php?latex=w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w' title='w' class='latex' /> is defined to be the long-run probability <img src='https://s0.wp.com/latex.php?latex=p_w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_w' title='p_w' class='latex' /> of the surfer being on the page.  A high PageRank means the page is important; a low PageRank means the page is not important.</p>
<p>One minor wrinkle in this definition of PageRank is what our websurfer should do when they arrive at a <em>dangling</em> webpage, i.e., a webpage with no outgoing links.  Obviously, they can&#8217;t just select a link to follow at random &#8211; there&#8217;s nothing to select!  So, instead, in this situation what they do is <em>always</em> teleport to a randomly chosen webpage.  Another, equivalent way of thinking about this is to imagine that we add outgoing links from the dangling page to <em>every</em> single other page.  In this imagined world, our crazy websurfer simply chooses a webpage at random, regardless of whether they are teleporting or not.</p>
<p>With the above definition of the PageRank of a page in mind, the <em>PageRank vector</em> is defined to be the vector <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> whose components are the PageRanks of all the different webpages.  If we number the pages <img src='https://s0.wp.com/latex.php?latex=0%2C+1%2C+2%2C+%5Cldots%2C+N-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0, 1, 2, \ldots, N-1' title='0, 1, 2, \ldots, N-1' class='latex' /> then the component <img src='https://s0.wp.com/latex.php?latex=p_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_0' title='p_0' class='latex' /> is the PageRank of page number 0, <img src='https://s0.wp.com/latex.php?latex=p_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_1' title='p_1' class='latex' /> is the PageRank of page number 1, and so on.  The PageRank vector satisfies the <em>PageRank   equation</em>, <img src='https://s0.wp.com/latex.php?latex=p+%3D+Mp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p = Mp' title='p = Mp' class='latex' />, where <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> is an <img src='https://s0.wp.com/latex.php?latex=N+%5Ctimes+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N \times N' title='N \times N' class='latex' /> matrix which we&#8217;ll call the <em>PageRank matrix</em>.  
The element <img src='https://s0.wp.com/latex.php?latex=M_%7Bjk%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M_{jk}' title='M_{jk}' class='latex' /> of <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> is just the probability that a crazy websurfer on page <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> will go to page <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />, so that column <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> of <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> describes where a surfer on page <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> goes next.  
This probability is <img src='https://s0.wp.com/latex.php?latex=t%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t/N' title='t/N' class='latex' /> if there is no link from page <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> to page <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />, i.e., it&#8217;s just the teleportation probability.  The probability is <img src='https://s0.wp.com/latex.php?latex=t%2FN+%2B+%281-t%29%2Fl&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t/N + (1-t)/l' title='t/N + (1-t)/l' class='latex' /> if there is a link from page <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> to page <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />, where <img src='https://s0.wp.com/latex.php?latex=l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l' title='l' class='latex' /> is the number of outbound links from page <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' />.  
Note that in writing these formulae I&#8217;ve assumed that <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> is not a dangling page.  If it is, then we will use the convention that <img src='https://s0.wp.com/latex.php?latex=l+%3D+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l = N' title='l = N' class='latex' />, i.e., the page is linked to every other page, and so <img src='https://s0.wp.com/latex.php?latex=M_%7Bjk%7D+%3D+1%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M_{jk} = 1/N' title='M_{jk} = 1/N' class='latex' />.</p>
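<p>These rules are straightforward to express in code.  Here&#8217;s a minimal sketch (in Python with NumPy; the helper name <tt>pagerank_matrix</tt> and the toy web are my own, not part of the original analysis) that builds the PageRank matrix column by column, with column <tt>k</tt> holding the probabilities for a surfer currently on page <tt>k</tt>:</p>

```python
import numpy as np

def pagerank_matrix(links, N, t=0.15):
    """Build the N x N PageRank matrix M.

    links[k] lists the pages that page k links to.  Every entry of
    column k gets the teleportation contribution t/N, each page that
    k links to gets an extra (1-t)/l, and a dangling page's column
    is uniform, 1/N, following the convention described above.
    """
    M = np.full((N, N), t / N)
    for k in range(N):
        out = links.get(k, [])
        if out:
            for j in out:
                M[j, k] += (1 - t) / len(out)
        else:
            M[:, k] = 1.0 / N  # dangling page: treated as linking everywhere
    return M

# A toy 4-page web; page 3 is dangling.
M = pagerank_matrix({0: [1, 2], 1: [2], 2: [0]}, N=4)
print(M.sum(axis=0))  # every column is a probability distribution
```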
<p>In this post I&#8217;ll consider several scenarios where we imagine altering the structure of the web in some way &#8211; by adding a link, deleting a link, adding a page, and so on.  I&#8217;ll use <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> to denote the PageRank vector <em>before</em> the alteration, and <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' /> to denote the PageRank vector <em>after</em> the alteration.  What we&#8217;re interested in is understanding the change from <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> to <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' />.  The main quantity we&#8217;ll study is the total change <img src='https://s0.wp.com/latex.php?latex=%5Csum_j+%7Cq_j-p_j%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sum_j |q_j-p_j|' title='\sum_j |q_j-p_j|' class='latex' /> in PageRank across all webpages.  This quantity is derived from the <img src='https://s0.wp.com/latex.php?latex=l_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l_1' title='l_1' class='latex' /> norm for vectors, <img src='https://s0.wp.com/latex.php?latex=%5C%7Cv%5C%7C_1+%5Cequiv+%5Csum_j+%7Cv_j%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|v\|_1 \equiv \sum_j |v_j|' title='\|v\|_1 \equiv \sum_j |v_j|' class='latex' />.  
We&#8217;ll drop the subscript <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' /> in the norm notation <img src='https://s0.wp.com/latex.php?latex=%5C%7C%5Ccdot%5C%7C_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|\cdot\|_1' title='\|\cdot\|_1' class='latex' />, since we&#8217;ll only be using the <img src='https://s0.wp.com/latex.php?latex=l_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l_1' title='l_1' class='latex' /> norm, never the Euclidean norm, which is more conventionally denoted <img src='https://s0.wp.com/latex.php?latex=%5C%7C%5Ccdot%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|\cdot\|' title='\|\cdot\|' class='latex' />.</p>
<p>A key fact about the <img src='https://s0.wp.com/latex.php?latex=l_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l_1' title='l_1' class='latex' /> norm and the PageRank matrix is that when we apply the PageRank matrix to any two probability vectors, <img src='https://s0.wp.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=s&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='s' title='s' class='latex' />, they are guaranteed to get closer by a factor of <img src='https://s0.wp.com/latex.php?latex=%281-t%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(1-t)' title='(1-t)' class='latex' />:</p>
<img src='https://s0.wp.com/latex.php?latex=+%5C%7CM%28r-s%29%5C%7C+%5Cleq+%281-t%29+%5C%7Cr-s%5C%7C.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt=' \|M(r-s)\| \leq (1-t) \|r-s\|. ' title=' \|M(r-s)\| \leq (1-t) \|r-s\|. ' class='latex' />
<p>Intuitively, applying the PageRank matrix is like our crazy websurfer doing a single step.  And so if we think of <img src='https://s0.wp.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=s&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='s' title='s' class='latex' /> as possible starting distributions for the crazy websurfer, then this inequality shows that these distributions gradually get closer and closer, before finally converging to PageRank.  We&#8217;ll call this inequality the <em>contractivity property</em> for PageRank.  I discuss the contractivity property and its consequences in much more detail in <a href="https://michaelnielsen.org/blog/lectures-on-the-google-technology-stack-1-introduction-to-pagerank/">this   post</a>.</p>
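<p>The contractivity property is easy to check numerically.  The following sketch (mine, not from the original post) builds a small hand-made PageRank matrix, confirms that repeatedly applying <tt>M</tt> converges to a solution of the PageRank equation, and verifies the contraction inequality for a random pair of distributions:</p>

```python
import numpy as np

t, N = 0.15, 3
# Hand-built PageRank matrix: page 0 links to pages 1 and 2,
# page 1 links to page 2, and page 2 is dangling (uniform column).
M = np.full((N, N), t / N)
M[1, 0] += (1 - t) / 2
M[2, 0] += (1 - t) / 2
M[2, 1] += (1 - t)
M[:, 2] = 1.0 / N

# Iterating p <- Mp from any starting distribution converges to PageRank.
p = np.full(N, 1.0 / N)
for _ in range(200):
    p = M @ p
print(np.allclose(M @ p, p))  # p satisfies the PageRank equation p = Mp

# Contractivity: ||M(r-s)||_1 <= (1-t) ||r-s||_1 for probability vectors.
rng = np.random.default_rng(0)
r = rng.random(N); r /= r.sum()
s = rng.random(N); s /= s.sum()
print(np.abs(M @ (r - s)).sum() <= (1 - t) * np.abs(r - s).sum() + 1e-12)
```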
<h3>What happens when we add a single link to the web?</h3>
<p>Suppose we add a single new link to the web, from a page <img src='https://s0.wp.com/latex.php?latex=w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w' title='w' class='latex' /> to a page <img src='https://s0.wp.com/latex.php?latex=w%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w&#039;' title='w&#039;' class='latex' />.  How does the PageRank vector change?  In this section we&#8217;ll obtain a bound on the total change, <img src='https://s0.wp.com/latex.php?latex=%5C%7Cq-p%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|q-p\|' title='\|q-p\|' class='latex' />.  We&#8217;ll obtain this bound under the assumption that <img src='https://s0.wp.com/latex.php?latex=w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w' title='w' class='latex' /> isn&#8217;t a dangling page, i.e., a page with no outgoing links.  As we&#8217;ll see, dangling pages create some complications, which we&#8217;ll deal with in a later section.</p>
<p>To obtain the bound on <img src='https://s0.wp.com/latex.php?latex=%5C%7Cp-q%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|p-q\|' title='\|p-q\|' class='latex' />, let <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> be the PageRank matrix before the new link is inserted, so the PageRank equation is <img src='https://s0.wp.com/latex.php?latex=p+%3D+Mp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p = Mp' title='p = Mp' class='latex' />, and let <img src='https://s0.wp.com/latex.php?latex=M%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M&#039;' title='M&#039;' class='latex' /> be the PageRank matrix after the new link is inserted, when the PageRank equation becomes <img src='https://s0.wp.com/latex.php?latex=q+%3D+M%27q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q = M&#039;q' title='q = M&#039;q' class='latex' />.  Note that we have <img src='https://s0.wp.com/latex.php?latex=q-p+%3D+M%27q+-Mp&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q-p = M&#039;q -Mp' title='q-p = M&#039;q -Mp' class='latex' />.  If we introduce a new matrix <img src='https://s0.wp.com/latex.php?latex=%5CDelta+%5Cequiv+M%27-M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta \equiv M&#039;-M' title='\Delta \equiv M&#039;-M' class='latex' />, then this equation may be rewritten:</p>
<img src='https://s0.wp.com/latex.php?latex=+q-p+%3D+M%27q-%28M%27-%5CDelta%29p+%3D+M%27%28q-p%29+%2B+%5CDelta+p.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt=' q-p = M&#039;q-(M&#039;-\Delta)p = M&#039;(q-p) + \Delta p. ' title=' q-p = M&#039;q-(M&#039;-\Delta)p = M&#039;(q-p) + \Delta p. ' class='latex' />
<p>Taking norms on both sides, and using the triangle inequality:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5C%7CM%27%28q-p%29%5C%7C+%2B+%5C%7C+%5CDelta+p%5C%7C.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \|M&#039;(q-p)\| + \| \Delta p\|. ' title='   \|q-p\| \leq \|M&#039;(q-p)\| + \| \Delta p\|. ' class='latex' />
<p>Using the contractivity property, <img src='https://s0.wp.com/latex.php?latex=%5C%7CM%27%28q-p%29%5C%7C+%5Cleq+%281-t%29+%5C%7Cq-p%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|M&#039;(q-p)\| \leq (1-t) \|q-p\|' title='\|M&#039;(q-p)\| \leq (1-t) \|q-p\|' class='latex' />, and so:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%281-t%29+%5C%7Cq-p%5C%7C+%2B+%5C%7C+%5CDelta+p%5C%7C.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq (1-t) \|q-p\| + \| \Delta p\|. ' title='   \|q-p\| \leq (1-t) \|q-p\| + \| \Delta p\|. ' class='latex' />
<p>Rearranging, we obtain:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5B%2A%5D+%5C%2C%5C%2C%5C%2C%5C%2C+%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%5C%7C+%5CDelta+p+%5C%7C%7D%7Bt%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   [*] \,\,\,\, \|q-p\| \leq \frac{\| \Delta p \|}{t}. ' title='   [*] \,\,\,\, \|q-p\| \leq \frac{\| \Delta p \|}{t}. ' class='latex' />
<p>Up to this point we haven&#8217;t used in any way the fact that we&#8217;re adding merely a single link to the web.  The inequality [*] holds no matter how we change the structure of the web, and so in later sections we&#8217;ll use [*] (or an analogous inequality) to analyse more complex situations.</p>
<p>To analyse what happens in the specific case when we add a single new link, we will compute <img src='https://s0.wp.com/latex.php?latex=%5C%7C+%5CDelta+p%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\| \Delta p\|' title='\| \Delta p\|' class='latex' />.  Recall that <img src='https://s0.wp.com/latex.php?latex=%5CDelta+%3D+M%27-M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta = M&#039;-M' title='\Delta = M&#039;-M' class='latex' /> is the change between the two PageRank matrices.  Suppose that <img src='https://s0.wp.com/latex.php?latex=w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w' title='w' class='latex' /> starts with <img src='https://s0.wp.com/latex.php?latex=l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l' title='l' class='latex' /> outgoing hyperlinks (where <img src='https://s0.wp.com/latex.php?latex=l+%3E+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l &gt; 0' title='l &gt; 0' class='latex' />, reflecting the fact that <img src='https://s0.wp.com/latex.php?latex=w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w' title='w' class='latex' /> is not a dangling page).  
By numbering pages appropriately, we can assume that <img src='https://s0.wp.com/latex.php?latex=w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w' title='w' class='latex' /> is page number <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' />, that the new link is from page <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> to page <img src='https://s0.wp.com/latex.php?latex=l%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l+1' title='l+1' class='latex' />, and that before the link was inserted, the page <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> was linked to pages <img src='https://s0.wp.com/latex.php?latex=1%2C2%2C%5Cldots%2Cl&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1,2,\ldots,l' title='1,2,\ldots,l' class='latex' />.  Under this numbering of pages, the only column of <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> which changes in <img src='https://s0.wp.com/latex.php?latex=M%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M&#039;' title='M&#039;' class='latex' /> is the first column, corresponding to the new link from page <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> to page <img src='https://s0.wp.com/latex.php?latex=l%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l+1' title='l+1' class='latex' />.  
Before the link was added, this column had entries <img src='https://s0.wp.com/latex.php?latex=t%2FN+%2B+%281-t%29%2Fl&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t/N + (1-t)/l' title='t/N + (1-t)/l' class='latex' /> for the rows corresponding to pages <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' /> through <img src='https://s0.wp.com/latex.php?latex=l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l' title='l' class='latex' />, i.e., for the outgoing links, and <img src='https://s0.wp.com/latex.php?latex=t%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t/N' title='t/N' class='latex' /> everywhere else.  After the link is added, this column has entries <img src='https://s0.wp.com/latex.php?latex=t%2FN%2B%281-t%29%2F%28l%2B1%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t/N+(1-t)/(l+1)' title='t/N+(1-t)/(l+1)' class='latex' /> for the rows corresponding to pages <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' /> through <img src='https://s0.wp.com/latex.php?latex=l%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l+1' title='l+1' class='latex' />, and <img src='https://s0.wp.com/latex.php?latex=t%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t/N' title='t/N' class='latex' /> everywhere else.  Combining these facts and doing a little algebra we find that the entries in the first column of <img src='https://s0.wp.com/latex.php?latex=%5CDelta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta' title='\Delta' class='latex' /> are:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%281-t%29+%5Cleft%5B+%5Cbegin%7Barray%7D%7Bc%7D+++++++0+%5C%5C+++++++%5Cfrac%7B-1%7D%7Bl%28l%2B1%29%7D+%5C%5C+++++++%5Cvdots+%5C%5C++++++++%5Cfrac%7B-1%7D%7Bl%28l%2B1%29%7D+%5C%5C+++++++%5Cfrac%7B1%7D%7Bl%2B1%7D+%5C%5C+++++++0+%5C%5C+++++++%5Cvdots+%5C%5C+++++++0+++++++%5Cend%7Barray%7D+%5Cright%5D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   (1-t) \left[ \begin{array}{c}       0 \\       \frac{-1}{l(l+1)} \\       \vdots \\        \frac{-1}{l(l+1)} \\       \frac{1}{l+1} \\       0 \\       \vdots \\       0       \end{array} \right]. ' title='   (1-t) \left[ \begin{array}{c}       0 \\       \frac{-1}{l(l+1)} \\       \vdots \\        \frac{-1}{l(l+1)} \\       \frac{1}{l+1} \\       0 \\       \vdots \\       0       \end{array} \right]. ' class='latex' />
<p>The other columns of <img src='https://s0.wp.com/latex.php?latex=%5CDelta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta' title='\Delta' class='latex' /> are all zero, because <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=M%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M&#039;' title='M&#039;' class='latex' /> don&#8217;t change in those other columns.  Substituting for <img src='https://s0.wp.com/latex.php?latex=%5CDelta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta' title='\Delta' class='latex' /> in [*] and simplifying, we see that [*] becomes:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5B%2A%2A%5D+%5C%2C%5C%2C%5C%2C%5C%2C+%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%281-t%29%7D%7Bt%7D+%5Cfrac%7B2p_0%7D%7Bl%7D+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   [**] \,\,\,\, \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0}{l} ' title='   [**] \,\,\,\, \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0}{l} ' class='latex' />
<p>This is a strong result.  The inequality [**] tells us that the <em>total</em> PageRank vector changes at most in proportion to the PageRank <img src='https://s0.wp.com/latex.php?latex=p_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_0' title='p_0' class='latex' /> of the page to which the outbound link is being added. Ordinarily, <img src='https://s0.wp.com/latex.php?latex=p_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_0' title='p_0' class='latex' /> will be tiny &#8211; it&#8217;s a probability in an <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' />-element probability distribution, and typically such probabilities are of size <img src='https://s0.wp.com/latex.php?latex=1%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/N' title='1/N' class='latex' />.  As a result, we learn that the total variation in PageRank will be minuscule in the typical case.  [**] also tells us that the total variation in PageRank scales inversely with the number of outbound links.  So adding an extra outbound link to a page which already has a large number of links will have little effect on the overall PageRank.</p>
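<p>As a numerical sanity check on [**], we can compare the actual change in PageRank against the bound on a small invented web.  In this sketch (again mine; the six-page link structure is made up for illustration) page 0 starts with <tt>l = 2</tt> outgoing links, and we add a third:</p>

```python
import numpy as np

def pagerank(links, N, t=0.15):
    """PageRank by power iteration; column k of M describes a surfer
    on page k.  Assumes no dangling pages, as in the analysis above."""
    M = np.full((N, N), t / N)
    for k, out in links.items():
        for j in out:
            M[j, k] += (1 - t) / len(out)
    p = np.full(N, 1.0 / N)
    for _ in range(500):
        p = M @ p
    return p

t, N, l = 0.15, 6, 2
before = {0: [1, 2], 1: [3], 2: [0, 3], 3: [4], 4: [5], 5: [0]}
after = {**before, 0: [1, 2, 3]}   # add a single link from page 0 to page 3

p = pagerank(before, N, t)
q = pagerank(after, N, t)
bound = (1 - t) / t * 2 * p[0] / l
print(np.abs(q - p).sum(), "<=", bound)  # total change respects [**]
```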
<p>Incidentally, going carefully over the derivation of [**] we can see why we needed to assume that the page <img src='https://s0.wp.com/latex.php?latex=w&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='w' title='w' class='latex' /> was not a dangling page.  We get a hint that something must be wrong from the final result: the right-hand side of [**] diverges for a dangling page, since <img src='https://s0.wp.com/latex.php?latex=l+%3D+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l = 0' title='l = 0' class='latex' />. In fact, during the derivation of [**] we assumed that <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' />&#8216;s first column had entries <img src='https://s0.wp.com/latex.php?latex=t%2FN+%2B+%281-t%29%2Fl&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t/N + (1-t)/l' title='t/N + (1-t)/l' class='latex' /> for the rows corresponding to pages <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' /> through <img src='https://s0.wp.com/latex.php?latex=l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l' title='l' class='latex' />, and <img src='https://s0.wp.com/latex.php?latex=t%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='t/N' title='t/N' class='latex' /> everywhere else.  But in the case of a dangling page, <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' />&#8216;s first column has entries <img src='https://s0.wp.com/latex.php?latex=1%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/N' title='1/N' class='latex' /> everywhere. We could fix this problem right now by redoing the analysis for the <img src='https://s0.wp.com/latex.php?latex=l%3D0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l=0' title='l=0' class='latex' /> case, but I&#8217;ll skip it, because we&#8217;ll get the result for free as part of a later analysis we need to do anyway.</p>
<p>One final note: in the derivation of [**] I assumed that the existing links from page <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> are to pages <img src='https://s0.wp.com/latex.php?latex=1%2C%5Cldots%2Cl&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1,\ldots,l' title='1,\ldots,l' class='latex' />, and that the new link is to page <img src='https://s0.wp.com/latex.php?latex=l%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l+1' title='l+1' class='latex' />.  This presumes that all the links are to pages <em>other</em> than the original page, i.e., it&#8217;s not self-linking.  Of course, in practice many pages are self-linking, and there&#8217;s no reason that couldn&#8217;t be the case here.  However, it&#8217;s easy to redo the analysis for the case when the page is self-linking, and if we do so it turns out that we arrive at the same result, [**].</p>
<h3>Problems</h3>
<ul>
<li> Verify the assertion in the last paragraph.
<li> The inequality [**] bounds the change <img src='https://s0.wp.com/latex.php?latex=%5C%7Cq-p%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|q-p\|' title='\|q-p\|' class='latex' /> in terms of the   initial PageRank of the linking page, <img src='https://s0.wp.com/latex.php?latex=p_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_0' title='p_0' class='latex' />.  A very similar   derivation can be used to bound it in terms of the final PageRank,   <img src='https://s0.wp.com/latex.php?latex=q_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q_0' title='q_0' class='latex' />.  Prove that <img src='https://s0.wp.com/latex.php?latex=%5C%7Cp-q%5C%7C+%5Cleq+%5Cfrac%7B%281-t%29%7D%7Bt%7D+%5Cfrac%7B2q_0%7D%7Bl%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|p-q\| \leq \frac{(1-t)}{t} \frac{2q_0}{l}' title='\|p-q\| \leq \frac{(1-t)}{t} \frac{2q_0}{l}' class='latex' />. </ul>
<h3>Problems for the author</h3>
<ul>
<li> Is it possible to saturate the bound [**]?  My guess is yes (or close to it, perhaps within a constant factor), but I haven&#8217;t explicitly proved this.  What is the best possible bound of this form? </ul>
<p>We&#8217;ve obtained a bound on the variation in <img src='https://s0.wp.com/latex.php?latex=l_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l_1' title='l_1' class='latex' /> norm produced by adding a link to the web.  We might also wonder whether we can say anything about the change in the PageRank of individual pages, i.e., whether we can bound <img src='https://s0.wp.com/latex.php?latex=%7Cq_j-p_j%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|q_j-p_j|' title='|q_j-p_j|' class='latex' />?  This is the type of question that is likely to be of interest to people who run webpages, or to people interested in search engine optimization.  Note that it&#8217;s really three separate questions: we&#8217;d like bounds on <img src='https://s0.wp.com/latex.php?latex=%7Cq_j-p_j%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|q_j-p_j|' title='|q_j-p_j|' class='latex' /> when (1) <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' /> is the source page for the new link; (2) <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' /> is the target page for the new link; and (3) <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' /> is neither the source nor the target. Unfortunately, I don&#8217;t have any such bounds to report, beyond the obvious observation that in all three cases, <img src='https://s0.wp.com/latex.php?latex=%7Cq_j-p_j%7C+%5Cleq+%5C%7Cq-p%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|q_j-p_j| \leq \|q-p\|' title='|q_j-p_j| \leq \|q-p\|' class='latex' />, and so at least the bound on the right-hand side of [**] always applies.  But it&#8217;d obviously be nice to have a stronger result!</p>
<h3>Problems for the author</h3>
<ul>
<li> Can I find bounds on <img src='https://s0.wp.com/latex.php?latex=%7Cq_j-p_j%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='|q_j-p_j|' title='|q_j-p_j|' class='latex' /> for the three scenarios described above?  (Or perhaps for some different way of specifying the properties of <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />, e.g., when <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' /> is another page linked to by the source page.) </ul>
<h3>What happens when we add and remove multiple links from the   same page?</h3>
<p>Suppose now that instead of adding a single link from a page, we <em>remove</em> <img src='https://s0.wp.com/latex.php?latex=m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m' title='m' class='latex' /> existing links outbound from that page, and add <img src='https://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' /> new links outbound from the page, for a total of <img src='https://s0.wp.com/latex.php?latex=l%27+%3D+l-m%2Bn&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l&#039; = l-m+n' title='l&#039; = l-m+n' class='latex' /> links. How does the PageRank change? Exactly as in our earlier analysis, we can show:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%5C%7C+%5CDelta+p+%5C%7C%7D%7Bt%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{\| \Delta p \|}{t}, ' title='   \|q-p\| \leq \frac{\| \Delta p \|}{t}, ' class='latex' />
<p>where <img src='https://s0.wp.com/latex.php?latex=%5CDelta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta' title='\Delta' class='latex' /> is the change in the PageRank matrix caused by the modified link structure.  Assuming as before that the page is not a dangling page, we number the pages so that: (1) the links are outbound from page <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' />; (2) the links to pages <img src='https://s0.wp.com/latex.php?latex=1%2C%5Cldots%2C+l-m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1,\ldots, l-m' title='1,\ldots, l-m' class='latex' /> are preserved, before and after; (3) the links to pages <img src='https://s0.wp.com/latex.php?latex=l-m%2B1%2C%5Cldots%2Cl&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l-m+1,\ldots,l' title='l-m+1,\ldots,l' class='latex' /> are removed; and (4) the links to pages <img src='https://s0.wp.com/latex.php?latex=l%2B1%2C%5Cldots%2Cl%2Bn&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l+1,\ldots,l+n' title='l+1,\ldots,l+n' class='latex' /> are new links.  Then all the columns of <img src='https://s0.wp.com/latex.php?latex=%5CDelta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta' title='\Delta' class='latex' /> are zero, except the first column, which has entries:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%281-t%29+%5Cleft%5B+%5Cbegin%7Barray%7D%7Bc%7D+++++++0+%5C%5C+++++++%5Cfrac%7Bl-l%27%7D%7Bll%27%7D+%5C%5C+++++++%5Cvdots+%5C%5C+++++++%5Cfrac%7Bl-l%27%7D%7Bll%27%7D+%5C%5C+++++++%5Cfrac%7B-1%7D%7Bl%7D+%5C%5C+++++++%5Cvdots+%5C%5C+++++++%5Cfrac%7B-1%7D%7Bl%7D+%5C%5C+++++++%5Cfrac%7B1%7D%7Bl%27%7D+%5C%5C+++++++%5Cvdots+%5C%5C+++++++%5Cfrac%7B1%7D%7Bl%27%7D+%5C%5C+++++++0+%5C%5C+++++++%5Cvdots+%5C%5C+++++++0+++++++%5Cend%7Barray%7D+%5Cright%5D+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   (1-t) \left[ \begin{array}{c}       0 \\       \frac{l-l&#039;}{ll&#039;} \\       \vdots \\       \frac{l-l&#039;}{ll&#039;} \\       \frac{-1}{l} \\       \vdots \\       \frac{-1}{l} \\       \frac{1}{l&#039;} \\       \vdots \\       \frac{1}{l&#039;} \\       0 \\       \vdots \\       0       \end{array} \right] ' title='   (1-t) \left[ \begin{array}{c}       0 \\       \frac{l-l&#039;}{ll&#039;} \\       \vdots \\       \frac{l-l&#039;}{ll&#039;} \\       \frac{-1}{l} \\       \vdots \\       \frac{-1}{l} \\       \frac{1}{l&#039;} \\       \vdots \\       \frac{1}{l&#039;} \\       0 \\       \vdots \\       0       \end{array} \right] ' class='latex' />
<p>As a result, we have</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%281-t%29p_0%7D%7Bt%7D+%5Cleft%5B+%5Cfrac%7B%7Cl-l%27%7C%28l-m%29%7D%7Bll%27%7D%2B+%5Cfrac%7Bm%7D%7Bl%7D+++++%2B+%5Cfrac%7Bn%7D%7Bl%27%7D+%5Cright%5D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{(1-t)p_0}{t} \left[ \frac{|l-l&#039;|(l-m)}{ll&#039;}+ \frac{m}{l}     + \frac{n}{l&#039;} \right]. ' title='   \|q-p\| \leq \frac{(1-t)p_0}{t} \left[ \frac{|l-l&#039;|(l-m)}{ll&#039;}+ \frac{m}{l}     + \frac{n}{l&#039;} \right]. ' class='latex' />
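<p>This derivation is easy to check numerically.  The following sketch (the 20-page web, the teleportation probability t = 0.15, and the particular choice l = 4, m = 2, n = 3 are all illustrative assumptions, not values from the text) constructs the matrices directly, confirms the claimed first column of the change matrix, and verifies the bound:</p>

```python
import numpy as np

# Hypothetical small web: page 0 starts with l = 4 outbound links, to
# pages 1..4; every other page links to its successor and back to 0.
t, N = 0.15, 20
links = {0: {1, 2, 3, 4}}
for i in range(1, N):
    links[i] = {(i + 1) % N, 0}

def pagerank_matrix(links, N, t):
    """Column-stochastic matrix with entries t/N + (1-t)/l_i for links."""
    M = np.full((N, N), t / N)
    for i, targets in links.items():
        for j in targets:
            M[j, i] += (1 - t) / len(targets)
    return M

def pagerank(M, iters=500):
    """PageRank vector by power iteration."""
    p = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        p = M @ p
    return p

# Remove m = 2 links (to pages 3, 4) and add n = 3 (to 5, 6, 7): l' = 5.
l, m, n = 4, 2, 3
lp = l - m + n
new_links = dict(links)
new_links[0] = {1, 2, 5, 6, 7}

M = pagerank_matrix(links, N, t)
Mp = pagerank_matrix(new_links, N, t)
Delta = Mp - M

# Only the first column of Delta is nonzero, with the stated entries:
assert np.allclose(Delta[:, 1:], 0)
assert np.allclose(Delta[1:3, 0], (1 - t) * (l - lp) / (l * lp))  # preserved
assert np.allclose(Delta[3:5, 0], -(1 - t) / l)                   # removed
assert np.allclose(Delta[5:8, 0], (1 - t) / lp)                   # added

p, q = pagerank(M), pagerank(Mp)
lhs = np.sum(np.abs(q - p))
bound = (1 - t) * p[0] / t * (abs(l - lp) * (l - m) / (l * lp) + m / l + n / lp)
assert lhs <= bound
```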
<p>To simplify the quantity on the right we analyse two cases separately: what happens when <img src='https://s0.wp.com/latex.php?latex=l%27+%5Cgeq+l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l&#039; \geq l' title='l&#039; \geq l' class='latex' />, and what happens when <img src='https://s0.wp.com/latex.php?latex=l%27+%3C+l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l&#039; &lt; l' title='l&#039; &lt; l' class='latex' />.  In the case <img src='https://s0.wp.com/latex.php?latex=l%27+%5Cgeq+l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l&#039; \geq l' title='l&#039; \geq l' class='latex' /> we get after some algebra</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%281-t%29%7D%7Bt%7D+%5Cfrac%7B2p_0n%7D%7Bl%27%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0n}{l&#039;}, ' title='   \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0n}{l&#039;}, ' class='latex' />
<p>which generalizes our earlier inequality [**].  In the case <img src='https://s0.wp.com/latex.php?latex=l%27+%3C+l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l&#039; &lt; l' title='l&#039; &lt; l' class='latex' /> we get</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%281-t%29%7D%7Bt%7D+%5Cfrac%7B2+p_0m%7D%7Bl%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{(1-t)}{t} \frac{2 p_0m}{l}. ' title='   \|q-p\| \leq \frac{(1-t)}{t} \frac{2 p_0m}{l}. ' class='latex' />
<p>Observing that <img src='https://s0.wp.com/latex.php?latex=l%27+%5Cgeq+l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l&#039; \geq l' title='l&#039; \geq l' class='latex' /> is equivalent to <img src='https://s0.wp.com/latex.php?latex=n+%5Cgeq+m&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n \geq m' title='n \geq m' class='latex' />, we may combine these two inequalities into a single unified inequality, generalizing our earlier result [**]:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5B%2A%2A%2A%5D+%5C%2C%5C%2C%5C%2C%5C%2C+%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%281-t%29%7D%7Bt%7D+%5Cfrac%7B2p_0+%5Cmax%28m%2Cn%29%7D%7B%5Cmax%28l%2Cl%27%29%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   [***] \,\,\,\, \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0 \max(m,n)}{\max(l,l&#039;)}. ' title='   [***] \,\,\,\, \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0 \max(m,n)}{\max(l,l&#039;)}. ' class='latex' />
<h3>What about adding links to a dangling page?</h3>
<p>Let&#8217;s come back to the question we ducked earlier, namely, how PageRank changes when we add a single link to a dangling page.  Recall that in this case the crazy websurfer model of PageRank is modified so it&#8217;s as though there are really <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' /> outgoing links from the dangling page.  Because of this, the addition of a single link to a dangling page is equivalent to the deletion of <img src='https://s0.wp.com/latex.php?latex=m+%3D+N-1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m = N-1' title='m = N-1' class='latex' /> links from a page that started with <img src='https://s0.wp.com/latex.php?latex=l+%3D+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l = N' title='l = N' class='latex' /> outgoing links.  From the inequality [***], we see that:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%281-t%29%7D%7Bt%7D+%5Cfrac%7B2p_0+%28N-1%29%7D%7BN%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0 (N-1)}{N}. ' title='   \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0 (N-1)}{N}. ' class='latex' />
<p>To generalize further, suppose instead that we&#8217;d added <img src='https://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' /> links to a dangling page.  Then the same method of analysis shows:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%281-t%29%7D%7Bt%7D+%5Cfrac%7B2p_0+%28N-n%29%7D%7BN%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0 (N-n)}{N}. ' title='   \|q-p\| \leq \frac{(1-t)}{t} \frac{2p_0 (N-n)}{N}. ' class='latex' />
<p>In general, we can always deal with the addition of <img src='https://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' /> links to a dangling page as being equivalent to the deletion of <img src='https://s0.wp.com/latex.php?latex=m+%3D+N-n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m = N-n' title='m = N-n' class='latex' /> links from a page with <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' /> outgoing links.  Because of this, in the remainder of this post we will use the convention of always treating dangling pages as though they have <img src='https://s0.wp.com/latex.php?latex=l%3DN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l=N' title='l=N' class='latex' /> outgoing links.</p>
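<p>The convention just described is easy to exercise in code.  The sketch below (the 20-page web and t = 0.15 are illustrative assumptions, not from the text) adds a single link from a dangling page and checks the corresponding bound from above:</p>

```python
import numpy as np

# Page 0 is dangling; pages 1..N-1 form a ring that also links back to 0.
t, N = 0.15, 20
links = {i: {(i + 1) % N, 0} for i in range(1, N)}

def pagerank_matrix(links, N, t):
    """Crazy-websurfer matrix; a dangling page is treated as linking
    uniformly to all N pages."""
    M = np.full((N, N), t / N)
    for i in range(N):
        targets = links.get(i, set())
        if not targets:
            M[:, i] += (1 - t) / N
        else:
            for j in targets:
                M[j, i] += (1 - t) / len(targets)
    return M

def pagerank(M, iters=500):
    p = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        p = M @ p
    return p

p = pagerank(pagerank_matrix(links, N, t))

# Adding a single link from dangling page 0 to page 1 is, by the
# convention above, the deletion of m = N-1 links from a page with l = N.
new_links = dict(links)
new_links[0] = {1}
q = pagerank(pagerank_matrix(new_links, N, t))

bound = (1 - t) / t * 2 * p[0] * (N - 1) / N
assert np.sum(np.abs(q - p)) <= bound
```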
<h3>What happens when we add and remove links from different   pages?</h3>
<p>In the last section we assumed that the links were all being added or removed from the same page.  In this section we generalize these results to the case where the links aren&#8217;t all from the same page. To do this, we&#8217;ll start by considering the case where just two pages are involved, which we label <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' />.  We suppose <img src='https://s0.wp.com/latex.php?latex=m_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m_j' title='m_j' class='latex' /> outbound links are removed from page <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' /> (<img src='https://s0.wp.com/latex.php?latex=j+%3D+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j = 0' title='j = 0' class='latex' /> or <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' />), and <img src='https://s0.wp.com/latex.php?latex=n_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n_j' title='n_j' class='latex' /> outbound links are added.
Observe that <img src='https://s0.wp.com/latex.php?latex=M%27+%3D+M+%2B+%5CDelta_0%2B%5CDelta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M&#039; = M + \Delta_0+\Delta_1' title='M&#039; = M + \Delta_0+\Delta_1' class='latex' />, where <img src='https://s0.wp.com/latex.php?latex=%5CDelta_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta_0' title='\Delta_0' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=%5CDelta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta_1' title='\Delta_1' class='latex' /> are the changes in the PageRank matrix due to the modifications to page <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> and page <img src='https://s0.wp.com/latex.php?latex=1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1' title='1' class='latex' />, respectively.  Then a similar argument to before shows that:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%5C%7C%28%5CDelta_0%2B%5CDelta_1%29p%5C%7C%7D%7Bt%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{\|(\Delta_0+\Delta_1)p\|}{t}. ' title='   \|q-p\| \leq \frac{\|(\Delta_0+\Delta_1)p\|}{t}. ' class='latex' />
<p>Applying the triangle inequality we get:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B%5C%7C%5CDelta_0p%5C%7C%7D%7Bt%7D+%2B+%5Cfrac%7B%5C%7C%5CDelta_1p%5C%7C%7D%7Bt%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{\|\Delta_0p\|}{t} + \frac{\|\Delta_1p\|}{t}. ' title='   \|q-p\| \leq \frac{\|\Delta_0p\|}{t} + \frac{\|\Delta_1p\|}{t}. ' class='latex' />
<p>And so we can reuse our earlier analysis to conclude that:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B2%281-t%29%7D%7Bt%7D+%5Cleft%5B+%5Cfrac%7B%5Cmax%28m_0%2Cn_0%29p_0%7D%7B%5Cmax%28l_0%2Cl_0%26%238242%3B%29%7D+++++%2B+%5Cfrac%7B%5Cmax%28m_1%2Cn_1%29p_1%7D%7B%5Cmax%28l_1%2Cl_1%26%238242%3B%29%7D+%5Cright%5D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{2(1-t)}{t} \left[ \frac{\max(m_0,n_0)p_0}{\max(l_0,l_0&#8242;)}     + \frac{\max(m_1,n_1)p_1}{\max(l_1,l_1&#8242;)} \right], ' title='   \|q-p\| \leq \frac{2(1-t)}{t} \left[ \frac{\max(m_0,n_0)p_0}{\max(l_0,l_0&#8242;)}     + \frac{\max(m_1,n_1)p_1}{\max(l_1,l_1&#8242;)} \right], ' class='latex' />
<p>where <img src='https://s0.wp.com/latex.php?latex=l_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l_j' title='l_j' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=l_j%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l_j&#039;' title='l_j&#039;' class='latex' /> are just the number of links outbound from page <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />, before and after the modifications to the link structure, respectively.  This expression is easily generalized to the case where we are modifying more than two pages, giving:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B2%281-t%29%7D%7Bt%7D+%5Csum_j+%5Cfrac%7B%5Cmax%28m_j%2Cn_j%29p_j%7D%7B%5Cmax%28l_j%2Cl_j%27%29%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{2(1-t)}{t} \sum_j \frac{\max(m_j,n_j)p_j}{\max(l_j,l_j&#039;)}. ' title='   \|q-p\| \leq \frac{2(1-t)}{t} \sum_j \frac{\max(m_j,n_j)p_j}{\max(l_j,l_j&#039;)}. ' class='latex' />
<p>This inequality is a quite general bound which can be applied even when very extensive modifications are made to the structure of the web.  Note that, as discussed in the last section, if any of the pages are dangling pages then we treat the addition of <img src='https://s0.wp.com/latex.php?latex=n_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n_j' title='n_j' class='latex' /> links to that page as really being deleting <img src='https://s0.wp.com/latex.php?latex=m+%3D+N-n_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m = N-n_j' title='m = N-n_j' class='latex' /> links from a page with <img src='https://s0.wp.com/latex.php?latex=l_j+%3D+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l_j = N' title='l_j = N' class='latex' /> outgoing links, and so the corresponding term in the sum is <img src='https://s0.wp.com/latex.php?latex=%28N-n_j%29p_j%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(N-n_j)p_j/N' title='(N-n_j)p_j/N' class='latex' />.  Indeed, we can rewrite the last inequality, splitting up the right-hand side into a sum <img src='https://s0.wp.com/latex.php?latex=%5Csum_%7Bj%3A%7B%5Crm+nd%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sum_{j:{\rm nd}}' title='\sum_{j:{\rm nd}}' class='latex' /> over pages <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' /> which are not dangling, and a sum <img src='https://s0.wp.com/latex.php?latex=%5Csum_%7Bj%3A%7B%5Crm+d%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sum_{j:{\rm d}}' title='\sum_{j:{\rm d}}' class='latex' /> over dangling pages <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' /> to obtain:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p%5C%7C+%5Cleq+%5Cfrac%7B2%281-t%29%7D%7Bt%7D+%5Cleft%28+%5Csum_%7Bj%3A%7B%5Crm+nd%7D%7D+%5Cfrac%7B%5Cmax%28m_j%2Cn_j%29p_j%7D%7B%5Cmax%28l_j%2Cl_j%27%29%7D+++++%2B+%5Csum_%7Bj%3A%7B%5Crm+d%7D%7D+%5Cfrac%7B%28N-n_j%29p_j%7D%7BN%7D+%5Cright%29.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p\| \leq \frac{2(1-t)}{t} \left( \sum_{j:{\rm nd}} \frac{\max(m_j,n_j)p_j}{\max(l_j,l_j&#039;)}     + \sum_{j:{\rm d}} \frac{(N-n_j)p_j}{N} \right). ' title='   \|q-p\| \leq \frac{2(1-t)}{t} \left( \sum_{j:{\rm nd}} \frac{\max(m_j,n_j)p_j}{\max(l_j,l_j&#039;)}     + \sum_{j:{\rm d}} \frac{(N-n_j)p_j}{N} \right). ' class='latex' />
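<p>As a numerical spot-check of this general bound, the sketch below modifies two pages of a small random web at once and compares the resulting change in PageRank against the summed bound (the web, t = 0.15, and the particular modifications are all illustrative assumptions):</p>

```python
import numpy as np

t, N = 0.15, 30
rng = np.random.default_rng(0)
# Each page gets 5 random outbound links, so there are no dangling pages.
links = {i: set(int(j) for j in rng.choice([k for k in range(N) if k != i],
                                           size=5, replace=False))
         for i in range(N)}

def pagerank_matrix(links, N, t):
    M = np.full((N, N), t / N)
    for i, targets in links.items():
        for j in targets:
            M[j, i] += (1 - t) / len(targets)
    return M

def pagerank(M, iters=500):
    p = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        p = M @ p
    return p

p = pagerank(pagerank_matrix(links, N, t))

# Modify two pages: page j loses m_j links and gains n_j new ones.
mods = {0: (2, 1), 1: (1, 3)}                    # j -> (m_j, n_j)
new_links = {i: set(s) for i, s in links.items()}
bound = 0.0
for j, (m_j, n_j) in mods.items():
    l_j = len(new_links[j])
    for target in sorted(new_links[j])[:m_j]:    # remove m_j links
        new_links[j].discard(target)
    fresh = [k for k in range(N) if k != j and k not in links[j]]
    new_links[j].update(fresh[:n_j])             # add n_j new links
    lp_j = l_j - m_j + n_j
    bound += max(m_j, n_j) * p[j] / max(l_j, lp_j)
bound *= 2 * (1 - t) / t

q = pagerank(pagerank_matrix(new_links, N, t))
assert np.sum(np.abs(q - p)) <= bound
```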
<h3>Modified teleportation step</h3>
<p>The <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.5427">original PageRank paper</a> describes a way of modifying PageRank to use a teleportation step which doesn&#8217;t take us to a page chosen uniformly at random.  Instead, some non-uniform probability distribution is used for teleportation.  I don&#8217;t know for sure whether Google uses this idea, but I wouldn&#8217;t be surprised if they use it as a way of decreasing the PageRank of suspected spam and link-farm pages, by making them less likely to be the target of teleportation.  It&#8217;s possible to redo all the analyses to date, and to prove that the bounds we&#8217;ve obtained hold even with this modified teleportation step.</p>
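<p>Here is a quick numerical illustration consistent with that claim.  One natural way to implement non-uniform teleportation is to replace the matrix entries t/N by t v_j, for a teleportation distribution v; the sketch below (v and the toy web are arbitrary assumptions) does this and checks that the change in PageRank is still bounded by the change in the matrix, applied to the old PageRank and divided by t:</p>

```python
import numpy as np

t, N = 0.15, 20
rng = np.random.default_rng(1)
v = rng.random(N)
v /= v.sum()                              # non-uniform teleportation dist.
links = {i: {(i + 1) % N, (i + 2) % N} for i in range(N)}

def pagerank_matrix(links, t, v):
    """Teleportation now lands on page j with probability t*v[j]."""
    N = len(v)
    M = np.tile((t * v)[:, None], (1, N))
    for i, targets in links.items():
        for j in targets:
            M[j, i] += (1 - t) / len(targets)
    return M

def pagerank(M, iters=500):
    p = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        p = M @ p
    return p

M = pagerank_matrix(links, t, v)
p = pagerank(M)

new_links = dict(links)
new_links[0] = {1, 2, 3}                  # add one outbound link to page 0
Mp = pagerank_matrix(new_links, t, v)
q = pagerank(Mp)
Delta = Mp - M

# The basic inequality from earlier still holds with non-uniform v.
assert np.sum(np.abs(q - p)) <= np.sum(np.abs(Delta @ p)) / t
```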
<h3>Problems</h3>
<ul>
<li> Prove the assertion in the last paragraph. </ul>
<h3>What happens when we add a new page to the web?</h3>
<p>So far, our analysis has involved only links added between <em>existing</em> pages.  How about when a new page is added to the web? How does the PageRank change then?  To analyse this question we&#8217;ll begin by focusing on understanding how PageRank changes when a single page is created, with no incoming or outgoing links.  Obviously, this is rather unrealistic, but by understanding this simple case first we&#8217;ll be in a better position to understand more realistic cases.</p>
<p>It&#8217;s perhaps worth commenting on why we&#8217;d expect the PageRank to change at all when we create a new page with no incoming or outgoing links.  After all, it seems as though the random walk taken by our crazy websurfer won&#8217;t be changed by the addition of this new and completely isolated page.  In fact, things do change.  The reason is that the teleportation step is modified.  Instead of going to a page chosen uniformly at random from <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' /> pages, each with probability <img src='https://s0.wp.com/latex.php?latex=1%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/N' title='1/N' class='latex' />, teleportation now takes us to a page chosen uniformly at random from the full set of <img src='https://s0.wp.com/latex.php?latex=N%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+1' title='N+1' class='latex' /> pages, each with probability <img src='https://s0.wp.com/latex.php?latex=1%2F%28N%2B1%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/(N+1)' title='1/(N+1)' class='latex' />.  It is this change which causes a change in the PageRank.</p>
<p>A slight quirk in analysing the change in the PageRank vector is that while the initial PageRank vector <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> is an <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' />-dimensional vector, the new PageRank vector <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' /> is an <img src='https://s0.wp.com/latex.php?latex=N%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+1' title='N+1' class='latex' />-dimensional vector.  Because of the different dimensionalities, the quantity <img src='https://s0.wp.com/latex.php?latex=%5C%7Cq-p%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|q-p\|' title='\|q-p\|' class='latex' /> is not even defined!  We resolve this problem by extending <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> into the <img src='https://s0.wp.com/latex.php?latex=N%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+1' title='N+1' class='latex' />-dimensional space, adding a PageRank of <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> for the new page, page number <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' />.  We&#8217;ll denote this extended version of <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> by <img src='https://s0.wp.com/latex.php?latex=p_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_e' title='p_e' class='latex' />. 
We also need to extend <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> so that it becomes an <img src='https://s0.wp.com/latex.php?latex=N%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+1' title='N+1' class='latex' /> by <img src='https://s0.wp.com/latex.php?latex=N%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+1' title='N+1' class='latex' /> matrix <img src='https://s0.wp.com/latex.php?latex=M_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M_e' title='M_e' class='latex' /> satisfying the PageRank equation <img src='https://s0.wp.com/latex.php?latex=M_ep_e+%3D+p_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M_ep_e = p_e' title='M_ep_e = p_e' class='latex' /> for the extended <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />.  One way of doing this is to extend <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> by adding a row of zeros at the bottom, and <img src='https://s0.wp.com/latex.php?latex=1%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/N' title='1/N' class='latex' /> everywhere in the final column, except the final entry:</p>
<img src='https://s0.wp.com/latex.php?latex=+++M_e+%3D+++%5Cleft%5B+%5Cbegin%7Barray%7D%7Bcccc%7D++++++++++%26++++++++%26+++%26+%5Cfrac%7B1%7D%7BN%7D+%5C%5C+++++++++%26+M++++++%26+++%26+%5Cvdots+%5C%5C+++++++++%26++++++++%26+++%26+%5Cfrac%7B1%7D%7BN%7D+%5C%5C+++++++0+%26+%5Cldots+%26+0+%26+0+++++++%5Cend%7Barray%7D+%5Cright%5D+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   M_e =   \left[ \begin{array}{cccc}          &amp;        &amp;   &amp; \frac{1}{N} \\         &amp; M      &amp;   &amp; \vdots \\         &amp;        &amp;   &amp; \frac{1}{N} \\       0 &amp; \ldots &amp; 0 &amp; 0       \end{array} \right] ' title='   M_e =   \left[ \begin{array}{cccc}          &amp;        &amp;   &amp; \frac{1}{N} \\         &amp; M      &amp;   &amp; \vdots \\         &amp;        &amp;   &amp; \frac{1}{N} \\       0 &amp; \ldots &amp; 0 &amp; 0       \end{array} \right] ' class='latex' />
<p>With the extended <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' /> it is easy to verify that the PageRank equation <img src='https://s0.wp.com/latex.php?latex=M_ep_e+%3D+p_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M_ep_e = p_e' title='M_ep_e = p_e' class='latex' /> is satisfied.  As in our earlier discussions, we have the inequality</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p_e%5C%7C+%5Cleq+%5Cfrac%7B%5C%7C%5CDelta+p_e+%5C%7C%7D%7Bt%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p_e\| \leq \frac{\|\Delta p_e \|}{t}, ' title='   \|q-p_e\| \leq \frac{\|\Delta p_e \|}{t}, ' class='latex' />
<p>where now <img src='https://s0.wp.com/latex.php?latex=%5CDelta+%3D+M%27-M_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta = M&#039;-M_e' title='\Delta = M&#039;-M_e' class='latex' />.  Computing <img src='https://s0.wp.com/latex.php?latex=%5CDelta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta' title='\Delta' class='latex' />, we have:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5CDelta+%3D+%5Cleft%5B+%5Cbegin%7Barray%7D%7Bcccc%7D+++++++%5Cfrac%7Bt%7D%7BN%2B1%7D-%5Cfrac%7Bt%7D%7BN%7D+%26+%5Cldots+%26+%5Cfrac%7Bt%7D%7BN%2B1%7D-%5Cfrac%7Bt%7D%7BN%7D+%26+%5Cfrac%7B1%7D%7BN%2B1%7D-%5Cfrac%7B1%7D%7BN%7D+%5C%5C+++++++%5Cvdots++++++++++++++++++++%26+%5Cddots+%26+%5Cvdots++++++++++++++++++++%26+%5Cvdots+%5C%5C+++++++%5Cfrac%7Bt%7D%7BN%2B1%7D-%5Cfrac%7Bt%7D%7BN%7D+%26+%5Cldots+%26+%5Cfrac%7Bt%7D%7BN%2B1%7D-%5Cfrac%7Bt%7D%7BN%7D+%26+%5Cfrac%7B1%7D%7BN%2B1%7D-%5Cfrac%7B1%7D%7BN%7D+%5C%5C+++++++%5Cfrac%7Bt%7D%7BN%2B1%7D+++++++++++++%26+%5Cldots+%26+%5Cfrac%7Bt%7D%7BN%2B1%7D+++++++++++++%26+%5Cfrac%7B1%7D%7BN%2B1%7D+++++%5Cend%7Barray%7D+%5Cright%5D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \Delta = \left[ \begin{array}{cccc}       \frac{t}{N+1}-\frac{t}{N} &amp; \ldots &amp; \frac{t}{N+1}-\frac{t}{N} &amp; \frac{1}{N+1}-\frac{1}{N} \\       \vdots                    &amp; \ddots &amp; \vdots                    &amp; \vdots \\       \frac{t}{N+1}-\frac{t}{N} &amp; \ldots &amp; \frac{t}{N+1}-\frac{t}{N} &amp; \frac{1}{N+1}-\frac{1}{N} \\       \frac{t}{N+1}             &amp; \ldots &amp; \frac{t}{N+1}             &amp; \frac{1}{N+1}     \end{array} \right]. ' title='   \Delta = \left[ \begin{array}{cccc}       \frac{t}{N+1}-\frac{t}{N} &amp; \ldots &amp; \frac{t}{N+1}-\frac{t}{N} &amp; \frac{1}{N+1}-\frac{1}{N} \\       \vdots                    &amp; \ddots &amp; \vdots                    &amp; \vdots \\       \frac{t}{N+1}-\frac{t}{N} &amp; \ldots &amp; \frac{t}{N+1}-\frac{t}{N} &amp; \frac{1}{N+1}-\frac{1}{N} \\       \frac{t}{N+1}             &amp; \ldots &amp; \frac{t}{N+1}             &amp; \frac{1}{N+1}     \end{array} \right]. ' class='latex' />
<p>Substituting and simplifying, we obtain the bound:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5B%2A%2A%2A%2A%5D+%5C%2C%5C%2C%5C%2C%5C%2C+%5C%7Cq-p_e%5C%7C+%5Cleq+%5Cfrac%7B2%7D%7BN%2B1%7D.++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   [****] \,\,\,\, \|q-p_e\| \leq \frac{2}{N+1}.  ' title='   [****] \,\,\,\, \|q-p_e\| \leq \frac{2}{N+1}.  ' class='latex' />
<p>Not surprisingly, this bound shows that creating a new and completely isolated page makes only a very small difference to PageRank.  Still, it&#8217;s good to have this intuition confirmed and precisely quantified.</p>
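<p>The bound [****] can also be confirmed numerically, by recomputing PageRank after appending a completely isolated page to a toy web (the web and t = 0.15 below are illustrative assumptions, not from the text):</p>

```python
import numpy as np

t, N = 0.15, 20
links = {i: {(i + 1) % N, (i + 3) % N} for i in range(N)}

def pagerank_matrix(links, N, t):
    """Crazy-websurfer matrix; a page with no outbound links (here, the
    new isolated page) is treated as linking uniformly to all N pages."""
    M = np.full((N, N), t / N)
    for i in range(N):
        targets = links.get(i, set())
        if not targets:
            M[:, i] += (1 - t) / N
        else:
            for j in targets:
                M[j, i] += (1 - t) / len(targets)
    return M

def pagerank(M, iters=500):
    p = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        p = M @ p
    return p

p = pagerank(pagerank_matrix(links, N, t))
p_e = np.append(p, 0.0)                   # extended PageRank vector

# Recompute over N+1 pages; page N is new, with no incoming or outgoing
# links, so it is a dangling page in the websurfer model.
q = pagerank(pagerank_matrix(links, N + 1, t))

assert np.sum(np.abs(q - p_e)) <= 2 / (N + 1) + 1e-12
```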
<h3>Exercises</h3>
<ul>
<li> Suppose we turn the question asked in this section around: we have a web of <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' /> pages, and one of those pages is an isolated page which is deleted.  Can you prove an analogous bound to [****] in this situation? </ul>
<h3>Problems for the author</h3>
<ul>
<li> How does the bound [****] change if we use a non-uniform   probability distribution to teleport?  The next problem may assist   in addressing this question.
<li> An alternative approach to analysing the problem in this section   is to use the   <a href="https://michaelnielsen.org/blog/lectures-on-the-google-technology-stack-1-introduction-to-pagerank/">formula</a>   <img src='https://s0.wp.com/latex.php?latex=p+%3D+t+%28I-%281-t%29G%29%5E%7B-1%7D+e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p = t (I-(1-t)G)^{-1} e' title='p = t (I-(1-t)G)^{-1} e' class='latex' />, where <img src='https://s0.wp.com/latex.php?latex=e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='e' title='e' class='latex' /> is the vector whose entries are   the uniform probability distribution (or, more generally, the vector   of teleportation probabilities), and <img src='https://s0.wp.com/latex.php?latex=G&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='G' title='G' class='latex' /> is the matrix whose <img src='https://s0.wp.com/latex.php?latex=jk&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='jk' title='jk' class='latex' />   entry is <img src='https://s0.wp.com/latex.php?latex=1%2Fl_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/l_j' title='1/l_j' class='latex' /> if <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' /> links to <img src='https://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' />, and <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> otherwise.  This   formula is well-suited to analysing the problem considered in the   current section; without going into details, the essential reason is   that the modified matrix <img src='https://s0.wp.com/latex.php?latex=I-%281-t%29G%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='I-(1-t)G&#039;' title='I-(1-t)G&#039;' class='latex' /> has a natural block-triangular   structure, and such block-triangular matrices are easy to invert.   What is the outcome of doing such an analysis? </ul>
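<p>A quick numerical sanity check of that formula: rather than inverting the matrix, we can verify that the power-iteration PageRank satisfies the equivalent linear system (I-(1-t)G)p = te.  The sketch below uses a hypothetical 3-page web with no dangling pages, and takes G in the column-stochastic orientation, so that the recursion reads p = te + (1-t)Gp:</p>

```python
t = 0.15
links = {0: [1, 2], 1: [2], 2: [0]}   # hypothetical web, no dangling pages
N = 3
e = [1.0 / N] * N                     # uniform teleportation vector

# G[k][j] = 1/l_j if page j links to page k (column-stochastic orientation)
G = [[0.0] * N for _ in range(N)]
for j, out in links.items():
    for k in out:
        G[k][j] = 1.0 / len(out)

# PageRank by power iteration: p <- t*e + (1-t)*G*p
p = e[:]
for _ in range(500):
    Gp = [sum(G[k][j] * p[j] for j in range(N)) for k in range(N)]
    p = [t * e[k] + (1 - t) * Gp[k] for k in range(N)]

# check (I - (1-t)G) p = t e, i.e. p = t (I - (1-t)G)^{-1} e
Gp = [sum(G[k][j] * p[j] for j in range(N)) for k in range(N)]
residual = [p[k] - (1 - t) * Gp[k] - t * e[k] for k in range(N)]
assert max(abs(r) for r in residual) < 1e-12
```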
<h3>Adding multiple pages to the web</h3>
<p>Suppose now that instead of adding a single page we add multiple pages to the web.  How does the PageRank change?  Again, we&#8217;ll analyse the case where no new links are created, on the understanding that by analysing this simple case first we&#8217;ll be in a better position to understand more realistic cases.</p>
<p>We&#8217;ll consider first the case of just two new pages.  Suppose <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> is the original <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' />-dimensional PageRank vector.  Then we&#8217;ll use <img src='https://s0.wp.com/latex.php?latex=p_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_e' title='p_e' class='latex' /> as before to denote the extension of <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> by an extra <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> entry, and <img src='https://s0.wp.com/latex.php?latex=p_%7Bee%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{ee}' title='p_{ee}' class='latex' /> to denote the extension of <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> by two <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> entries.  
We&#8217;ll use <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' /> to denote the <img src='https://s0.wp.com/latex.php?latex=N%2B1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+1' title='N+1' class='latex' />-dimensional PageRank vector after adding a single webpage, and use <img src='https://s0.wp.com/latex.php?latex=q_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q_e' title='q_e' class='latex' /> to denote the extension of <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' /> by appending an extra <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> entry.  Finally, we&#8217;ll use <img src='https://s0.wp.com/latex.php?latex=r&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='r' title='r' class='latex' /> to denote the <img src='https://s0.wp.com/latex.php?latex=N%2B2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+2' title='N+2' class='latex' />-dimensional PageRank vector after adding two new webpages.</p>
<p>Our interest is in analysing <img src='https://s0.wp.com/latex.php?latex=%5C%7Cr-p_%7Bee%7D%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|r-p_{ee}\|' title='\|r-p_{ee}\|' class='latex' />.  To do this, note that</p>
<img src='https://s0.wp.com/latex.php?latex=+%5C%7Cr-p_%7Bee%7D%5C%7C+%3D+%5C%7Cr-q_%7Be%7D%2Bq_e-p_%7Bee%7D%5C%7C.+++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt=' \|r-p_{ee}\| = \|r-q_{e}+q_e-p_{ee}\|.   ' title=' \|r-p_{ee}\| = \|r-q_{e}+q_e-p_{ee}\|.   ' class='latex' />
<p>Applying the triangle inequality gives</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cr-p_%7Bee%7D+%5C%7C+%5Cleq+%5C%7Cr-q_%7Be%7D%5C%7C+%2B+%5C%7Cq_e-p_%7Bee%7D%5C%7C.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|r-p_{ee} \| \leq \|r-q_{e}\| + \|q_e-p_{ee}\|. ' title='   \|r-p_{ee} \| \leq \|r-q_{e}\| + \|q_e-p_{ee}\|. ' class='latex' />
<p>Observe that <img src='https://s0.wp.com/latex.php?latex=%5C%7Cq_e-p_%7Bee%7D%5C%7C+%3D+%5C%7Cq-p_e%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|q_e-p_{ee}\| = \|q-p_e\|' title='\|q_e-p_{ee}\| = \|q-p_e\|' class='latex' />, since the extra <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> appended to the end of both vectors makes no difference to the <img src='https://s0.wp.com/latex.php?latex=l_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l_1' title='l_1' class='latex' /> norm.  And so the previous inequality may be rewritten</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cr-p_%7Bee%7D+%5C%7C+%5Cleq+%5C%7Cr-q_%7Be%7D%5C%7C+%2B+%5C%7Cq-p_e%5C%7C.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|r-p_{ee} \| \leq \|r-q_{e}\| + \|q-p_e\|. ' title='   \|r-p_{ee} \| \leq \|r-q_{e}\| + \|q-p_e\|. ' class='latex' />
<p>Now we can apply the results of last section twice on the right-hand side to obtain</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cr-p_%7Bee%7D+%5C%7C+%5Cleq+%5Cfrac%7B2%7D%7BN%2B2%7D+%2B+%5Cfrac%7B2%7D%7BN%2B1%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|r-p_{ee} \| \leq \frac{2}{N+2} + \frac{2}{N+1}. ' title='   \|r-p_{ee} \| \leq \frac{2}{N+2} + \frac{2}{N+1}. ' class='latex' />
<p>More generally, suppose we add <img src='https://s0.wp.com/latex.php?latex=%5CDelta+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta N' title='\Delta N' class='latex' /> new pages to the web. Denote the final <img src='https://s0.wp.com/latex.php?latex=N%2B%5CDelta+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+\Delta N' title='N+\Delta N' class='latex' />-dimensional PageRank vector by <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' />.  Let <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> denote the initial <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' />-dimensional PageRank vector, and let <img src='https://s0.wp.com/latex.php?latex=p_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_e' title='p_e' class='latex' /> denote the <img src='https://s0.wp.com/latex.php?latex=N%2B%5CDelta+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+\Delta N' title='N+\Delta N' class='latex' />-dimensional extension obtained by appending <img src='https://s0.wp.com/latex.php?latex=%5CDelta+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta N' title='\Delta N' class='latex' /> <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> entries to <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />.  Then applying an argument similar to that above, we obtain:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p_e+%5C%7C+%5Cleq+%5Csum_%7Bj%3D1%7D%5E%7B%5CDelta+N%7D+%5Cfrac%7B2%7D%7BN%2Bj%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p_e \| \leq \sum_{j=1}^{\Delta N} \frac{2}{N+j}. ' title='   \|q-p_e \| \leq \sum_{j=1}^{\Delta N} \frac{2}{N+j}. ' class='latex' />
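<p>Since the sum on the right is a slice of the harmonic series, the bound is approximately 2 ln(1 + &#916;N/N), which remains small as long as &#916;N is small compared to N.  A quick check (the specific numbers are purely illustrative):</p>

```python
import math

def multi_page_bound(N, dN):
    """Cumulative l1 bound after adding dN isolated pages:
    sum over j = 1..dN of 2/(N+j)."""
    return sum(2.0 / (N + j) for j in range(1, dN + 1))

# even adding a thousand pages to a million-page web barely moves PageRank
N, dN = 10**6, 10**3
bound = multi_page_bound(N, dN)

# the harmonic sum is essentially the logarithmic approximation
approx = 2 * math.log((N + dN) / N)
assert abs(bound - approx) < 1e-6
```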
<h3>Adding multiple pages, adding and removing multiple links</h3>
<p>Suppose now that we add many new pages to the web, containing links to existing webpages.  How does the PageRank change?  Not surprisingly, this kind of question can be answered in a straightforward way using the techniques already described in this post.  To see this, suppose we add a single new webpage, which contains <img src='https://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n' title='n' class='latex' /> new links back to existing webpages.  As before, we use <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' /> to denote the new PageRank vector, <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> to denote the old PageRank vector, and <img src='https://s0.wp.com/latex.php?latex=p_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_e' title='p_e' class='latex' /> to denote the extension of <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> obtained by appending a <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> entry.  We have</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p_e%5C%7C+%3D+%5C%7CM%27q-M_e+p_e%5C%7C%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p_e\| = \|M&#039;q-M_e p_e\|, ' title='   \|q-p_e\| = \|M&#039;q-M_e p_e\|, ' class='latex' />
<p>where <img src='https://s0.wp.com/latex.php?latex=M_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M_e' title='M_e' class='latex' /> is the <img src='https://s0.wp.com/latex.php?latex=%28N%2B1%29+%5Ctimes+%28N%2B1%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(N+1) \times (N+1)' title='(N+1) \times (N+1)' class='latex' /> matrix obtained by adding a row of <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' />s to the bottom of <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' />, and a column (except the bottom right entry) of <img src='https://s0.wp.com/latex.php?latex=1%2FN&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='1/N' title='1/N' class='latex' /> values to the right of <img src='https://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M' title='M' class='latex' />.  We can write <img src='https://s0.wp.com/latex.php?latex=M%27+%3D+M_e%2B%5CDelta_1%2B%5CDelta_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M&#039; = M_e+\Delta_1+\Delta_2' title='M&#039; = M_e+\Delta_1+\Delta_2' class='latex' />, where <img src='https://s0.wp.com/latex.php?latex=%5CDelta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta_1' title='\Delta_1' class='latex' /> is the change in <img src='https://s0.wp.com/latex.php?latex=M_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M_e' title='M_e' class='latex' /> due to the added page, and <img src='https://s0.wp.com/latex.php?latex=%5CDelta_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta_2' title='\Delta_2' class='latex' /> is the change in <img src='https://s0.wp.com/latex.php?latex=M_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M_e' title='M_e' class='latex' /> due to the added links.  Applying a similar analysis to before we obtain</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p_e%5C%7C+%5Cleq+%5Cfrac%7B%5C%7C%5CDelta_1+p_e%5C%7C+%2B+%5C%7C+%5CDelta_2+p_e+%5C%7C%7D%7Bt%7D.+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p_e\| \leq \frac{\|\Delta_1 p_e\| + \| \Delta_2 p_e \|}{t}. ' title='   \|q-p_e\| \leq \frac{\|\Delta_1 p_e\| + \| \Delta_2 p_e \|}{t}. ' class='latex' />
<p>We can bound the first term on the right-hand side by <img src='https://s0.wp.com/latex.php?latex=2%2F%28N%2B1%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='2/(N+1)' title='2/(N+1)' class='latex' />, by our earlier argument.  The second term vanishes since <img src='https://s0.wp.com/latex.php?latex=%5CDelta_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta_2' title='\Delta_2' class='latex' /> is zero everywhere except in the last column, and <img src='https://s0.wp.com/latex.php?latex=p_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_e' title='p_e' class='latex' />&#8216;s final entry is zero.  As a result, we have</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p_e%5C%7C+%5Cleq+%5Cfrac%7B2%7D%7BN%2B1%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p_e\| \leq \frac{2}{N+1}, ' title='   \|q-p_e\| \leq \frac{2}{N+1}, ' class='latex' />
<p>i.e., the bound is exactly as if we had added a new page with no links.</p>
<h3>Exercises</h3>
<ul>
<li> There is some ambiguity in the specification of <img src='https://s0.wp.com/latex.php?latex=%5CDelta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta_1' title='\Delta_1' class='latex' /> and     <img src='https://s0.wp.com/latex.php?latex=%5CDelta_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta_2' title='\Delta_2' class='latex' /> in the above analysis.  Resolve this ambiguity by     writing out explicit forms for <img src='https://s0.wp.com/latex.php?latex=%5CDelta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta_1' title='\Delta_1' class='latex' /> and <img src='https://s0.wp.com/latex.php?latex=%5CDelta_2&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta_2' title='\Delta_2' class='latex' />, and then     verifying that the remainder of the proof of the bound goes through. </ul>
<p>Suppose, instead, that the new link had been added from an existing page.  Then the term <img src='https://s0.wp.com/latex.php?latex=%5C%7C+%5CDelta_2+p_e%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\| \Delta_2 p_e\|' title='\| \Delta_2 p_e\|' class='latex' /> above would no longer vanish, and the bound would become</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p_e%5C%7C+%5Cleq+%5Cfrac%7B2%7D%7BN%2B1%7D+%2B+%5Cfrac%7B2%281-t%29%7D%7Bt%7D+%5Cfrac%7Bp%7D%7B%28l%2B1%29%7D%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p_e\| \leq \frac{2}{N+1} + \frac{2(1-t)}{t} \frac{p}{(l+1)}, ' title='   \|q-p_e\| \leq \frac{2}{N+1} + \frac{2(1-t)}{t} \frac{p}{(l+1)}, ' class='latex' />
<p>where <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> was the initial PageRank for the page to which the link was added, and <img src='https://s0.wp.com/latex.php?latex=l&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='l' title='l' class='latex' /> was the initial number of links from that page.</p>
<p>Generalizing this analysis, it&#8217;s possible to write a single inequality that unites all our results to date.  In particular, suppose we add <img src='https://s0.wp.com/latex.php?latex=%5CDelta+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta N' title='\Delta N' class='latex' /> new pages to the web.  Denote the final <img src='https://s0.wp.com/latex.php?latex=N%2B%5CDelta+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+\Delta N' title='N+\Delta N' class='latex' />-dimensional PageRank vector by <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' />.  Let <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> denote the initial <img src='https://s0.wp.com/latex.php?latex=N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N' title='N' class='latex' />-dimensional PageRank vector, and let <img src='https://s0.wp.com/latex.php?latex=p_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_e' title='p_e' class='latex' /> denote the <img src='https://s0.wp.com/latex.php?latex=N%2B%5CDelta+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='N+\Delta N' title='N+\Delta N' class='latex' />-dimensional extension obtained by appending <img src='https://s0.wp.com/latex.php?latex=%5CDelta+N&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\Delta N' title='\Delta N' class='latex' /> <img src='https://s0.wp.com/latex.php?latex=0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='0' title='0' class='latex' /> entries to <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' />.  
Suppose we add <img src='https://s0.wp.com/latex.php?latex=m_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='m_j' title='m_j' class='latex' /> outbound links to page <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />, and remove <img src='https://s0.wp.com/latex.php?latex=n_j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='n_j' title='n_j' class='latex' /> outbound links from page <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />.  Then an analysis similar to that above shows that:</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-p_e+%5C%7C+%5Cleq+%5Csum_%7Bj%3D1%7D%5E%7B%5CDelta+N%7D+%5Cfrac%7B2%7D%7BN%2Bj%7D%2B+%5Cfrac%7B2%281-t%29%7D%7Bt%7D+%5Cleft%28+%5Csum_%7Bj%3A%7B%5Crm+nd%7D%7D+%5Cfrac%7B%5Cmax%28m_j%2Cn_j%29p_j%7D%7B%5Cmax%28l_j%2Cl_j%27%29%7D+++++%2B+%5Csum_%7Bj%3A%7B%5Crm+d%7D%7D+%5Cfrac%7B%28N-n_j%29p_j%7D%7BN%7D+%5Cright%29%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-p_e \| \leq \sum_{j=1}^{\Delta N} \frac{2}{N+j}+ \frac{2(1-t)}{t} \left( \sum_{j:{\rm nd}} \frac{\max(m_j,n_j)p_j}{\max(l_j,l_j&#039;)}     + \sum_{j:{\rm d}} \frac{(N-n_j)p_j}{N} \right), ' title='   \|q-p_e \| \leq \sum_{j=1}^{\Delta N} \frac{2}{N+j}+ \frac{2(1-t)}{t} \left( \sum_{j:{\rm nd}} \frac{\max(m_j,n_j)p_j}{\max(l_j,l_j&#039;)}     + \sum_{j:{\rm d}} \frac{(N-n_j)p_j}{N} \right), ' class='latex' />
<p>where, as before, the sum <img src='https://s0.wp.com/latex.php?latex=%5Csum_%7Bj%3A%7B%5Crm+nd%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sum_{j:{\rm nd}}' title='\sum_{j:{\rm nd}}' class='latex' /> is over those pages <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' /> which are non-dangling, and the sum <img src='https://s0.wp.com/latex.php?latex=%5Csum_%7Bj%3A%7B%5Crm+d%7D%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\sum_{j:{\rm d}}' title='\sum_{j:{\rm d}}' class='latex' /> is over dangling pages <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' />.  Note that we can omit all the new pages <img src='https://s0.wp.com/latex.php?latex=j&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='j' title='j' class='latex' /> from this latter sum, for the same reason the quantity <img src='https://s0.wp.com/latex.php?latex=%5C%7C+%5CDelta_2+p_e+%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\| \Delta_2 p_e \|' title='\| \Delta_2 p_e \|' class='latex' /> vanished earlier in this section.  This inequality generalizes all our earlier results.</p>
<h3>Problems</h3>
<ul>
<li> Fill in the details of the above proof.  To do so it helps to   consider the change in PageRank in two steps: (1) the change due to   the addition of new links to existing pages; and (2) the change due   to the addition of new webpages, and links on those webpages.
</ul>
<h3>Recomputing PageRank on the fly</h3>
<p>Part of my reason for being interested in the questions discussed in this post is that I&#8217;m interested in how to quickly update PageRank during a web crawl.  The most obvious way to compute the PageRank of a set of crawled webpages is as a batch job, doing the computation maybe once a week or once a month.  But if you&#8217;re operating a web crawler, and constantly adding new pages to an index, you don&#8217;t want to have to wait a week or a month before computing the PageRank (or similar measure) of a newly crawled page.  You&#8217;d like to compute &#8211; or at least estimate &#8211; the PageRank immediately upon adding the page to the index, without having to do an enormous matrix computation.  Is there a way this can be done?</p>
<p>Unfortunately, I don&#8217;t know how to quickly update the PageRank. However, the bounds in this post do help at least a little in figuring out how to update the PageRank as a batch job.  Suppose we had an extant multi-billion page crawl, and knew the corresponding values for PageRank.  Suppose then that we did a mini-crawl to update the index, perhaps with a few million new pages and links.  Suppose <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> is the (known) PageRank vector before the mini-crawl, and <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' /> is the PageRank vector after the mini-crawl, which we are now trying to compute. Suppose also that <img src='https://s0.wp.com/latex.php?latex=M%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M&#039;' title='M&#039;' class='latex' /> is the new PageRank matrix.  One natural way of updating our estimate for PageRank would be to apply <img src='https://s0.wp.com/latex.php?latex=M%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M&#039;' title='M&#039;' class='latex' /> repeatedly to <img src='https://s0.wp.com/latex.php?latex=p&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p' title='p' class='latex' /> (more precisely, its extension, <img src='https://s0.wp.com/latex.php?latex=p_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_e' title='p_e' class='latex' />), in the hopes that it will quickly converge to <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' />.  The bounds in this post can help establish how quickly this convergence will happen.  To see this, observe that</p>
<img src='https://s0.wp.com/latex.php?latex=+++%5C%7Cq-%28M%27%29%5Enp_e%5C%7C+%3D+%5C%7C+%28M%27%29%5En%28q-p_e%29+%5C%7C+%5Cleq+%281-t%29%5En+%5C%7Cq-p_e%5C%7C%2C+&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='   \|q-(M&#039;)^np_e\| = \| (M&#039;)^n(q-p_e) \| \leq (1-t)^n \|q-p_e\|, ' title='   \|q-(M&#039;)^np_e\| = \| (M&#039;)^n(q-p_e) \| \leq (1-t)^n \|q-p_e\|, ' class='latex' />
<p>where we used the PageRank equation <img src='https://s0.wp.com/latex.php?latex=M%27q+%3D+q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M&#039;q = q' title='M&#039;q = q' class='latex' /> to get the first equality, and the contractivity of PageRank to get the second inequality.  This equation tells us that we can compute an estimate for <img src='https://s0.wp.com/latex.php?latex=q&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='q' title='q' class='latex' /> by applying <img src='https://s0.wp.com/latex.php?latex=M%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M&#039;' title='M&#039;' class='latex' /> repeatedly to <img src='https://s0.wp.com/latex.php?latex=p_e&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_e' title='p_e' class='latex' />, and it will converge exponentially quickly.  This is all standard stuff about PageRank. However, the good news is that the results in this post tell us something about how to bound <img src='https://s0.wp.com/latex.php?latex=%5C%7Cq-p_e%5C%7C&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\|q-p_e\|' title='\|q-p_e\|' class='latex' />, and frequently this quantity will be very small to start with, and so the convergence will occur much more quickly than we would <em>a priori</em> expect.</p>
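<p>Here&#8217;s a toy illustration of this warm-start idea (the link structure is hypothetical, and dangling pages are assumed to spread their probability uniformly).  It computes the old PageRank p, does a &#8220;mini-crawl&#8221; that adds one page, and checks that repeatedly applying the new matrix to p_e closes the gap to q at least as fast as (1-t)^n:</p>

```python
t = 0.15

def step(p, links):
    """One application of M = (t/N)J + (1-t)S to a probability
    vector p; dangling pages spread their probability uniformly."""
    N = len(p)
    new = [t / N] * N
    for j in range(N):
        out = links.get(j, [])
        for k in out:
            new[k] += (1 - t) * p[j] / len(out)
        if not out:
            for k in range(N):
                new[k] += (1 - t) * p[j] / N
    return new

# old crawl: a hypothetical 3-page web with known PageRank p
links = {0: [1], 1: [2], 2: [0]}
p = [1.0 / 3] * 3
for _ in range(1000):
    p = step(p, links)

# mini-crawl adds page 3 with a link to page 0; q is the new PageRank
links_new = dict(links)
links_new[3] = [0]
q = [0.25] * 4
for _ in range(2000):
    q = step(q, links_new)

# warm-start from p_e and check exponential convergence at every step
p_e = p + [0.0]
d0 = sum(abs(a - b) for a, b in zip(q, p_e))
x = p_e
for n in range(1, 21):
    x = step(x, links_new)
    gap = sum(abs(a - b) for a, b in zip(q, x))
    assert gap <= (1 - t) ** n * d0 + 1e-9
```

<p>With t = 0.15 the gap shrinks by a factor of at least 0.85 per iteration, and because the starting gap is already bounded as above, only a handful of iterations are needed in practice.</p>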
<h3>Problems for the author</h3>
<ul>
<li> Is there a better way of updating PageRank on the fly, rather   than applying powers of <img src='https://s0.wp.com/latex.php?latex=M%27&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='M&#039;' title='M&#039;' class='latex' />? </ul>
<p><em>Interested in more?  Please <a href="http://www.michaelnielsen.org/ddi/feed/">subscribe to this blog</a>, or <a href="http://twitter.com/#!/michael_nielsen">follow me on Twitter</a>.  You may also enjoy reading my new book about open science, <a href="http://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/product-description/0691148902">Reinventing Discovery</a>.</em> </p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
