<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Generation 5</title>
	
	<link>http://gen5.info/q</link>
	<description>Towards Intelligent Systems</description>
	<pubDate>Tue, 23 Jun 2009 15:28:56 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/Generation5" /><feedburner:info uri="generation5" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:emailServiceId>Generation5</feedburner:emailServiceId><feedburner:feedburnerHostname>http://feedburner.google.com</feedburner:feedburnerHostname><item>
		<title>Closures, Javascript And The Arrow Of Time</title>
		<link>http://feedproxy.google.com/~r/Generation5/~3/2yEuaBC7wpg/</link>
		<comments>http://gen5.info/q/2009/06/23/closures-javascript-and-the-arrow-of-time/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 15:24:40 +0000</pubDate>
		<dc:creator>Paul Houle</dc:creator>
		
		<category><![CDATA[Asynchronous Communications]]></category>

		<guid isPermaLink="false">http://gen5.info/q/?p=280</guid>
		<description><![CDATA[
Introduction
In most written media,  time progresses as you move down a page:  mainstream computing languages are no different.   Anonymous Closures are a language mechanism that,  effectively,  lets programmers create new control structures.  Although people associate this power with exotic dynamic languages such as FORTH,  Scheme and TCL,   closures are becoming a feature of  mainstream languages [...]]]></description>
			<content:encoded><![CDATA[<p><img align=right title="time-machine" src="http://gen5.info/q/wp-content/uploads/2009/06/time-machine.jpg" alt="time-machine" width="175" height="285" /></p>
<h2>Introduction</h2>
<p>In most written media,  time progresses as you move down a page:  mainstream computing languages are no different.   <em>Anonymous Closures</em> are a language mechanism that,  effectively,  lets programmers create new control structures.  Although people associate this power with exotic dynamic languages such as FORTH,  Scheme and TCL,   closures are becoming a feature of  mainstream languages such as Javascript and PHP (and even static languages such as C#.)</p>
<p>Although this article talks about issues that you&#8217;ll encounter in languages such as C# and Scheme,  I&#8217;m going to focus on Javascript written on top of the popular JQuery library:  I do that because JQuery is a great toolkit that lets programmers and designers of all skill levels do a lot by writing very little code.  Because JQuery smooths away low-level details,  it lets us clearly illustrate little weirdnesses of its programming model. Although things often work &#8220;just right&#8221; on the small scale,  little strange things snowball in larger programs &#8212; a careful look atthe  little problems is a key step towards the avoidance and mitigation of problems in big RIA projects.</p>
<p><span id="more-280"></span></p>
<h2>Time In Programming Languages</h2>
<p>Absent control structures,  execution proceeds from top to bottom in a computer program,  consider:</p>
<pre>01 document.write('Hello');
02 document.write(' ');
03 document.write('World!');</pre>
<p>Today&#8217;s languages are based on structured programming,  which encourages to use a limited number of control structures,  such as conditional expressions</p>
<pre>04 if(happy=true) {
05    document.write('Hello');
06 } else {
07    document.write('Goodbye Cruel');
08 }
09
10 document.write('  ');
11 document.write('World!');</pre>
<p>and loops</p>
<pre>12 var sum=0;
13 for(i=0;i&lt;x.length;i++) {
14   sum += x[i];
15 }</pre>
<p>With GOTO expunged from the programming canon,   execution proceeds from top to down,   except for a limited number of  control structures that let us return to the beginning or jump to the end of a block.  It&#8217;s a simple and intuitive model that we don&#8217;t often think about,  but it&#8217;s easy to write code that breaks this model when programming with closures.</p>
<h2>Emulating Foreach() in Javascript with Closures</h2>
<p>Popular languages such as PHP and C# have a <em>foreach()</em> statement that loops over the elements of a collection without the need to keep track of a counter variable.  Popular Javascript frameworks such as JQuery and Dojo implement <em>foreach</em> functions;  with the magic of closures,  we can get an effect similar to a built-in <em>foreach</em> construct.  For instance,  in JQuery,  we can write</p>
<pre>16 var sum=0;
17 $.each(x,function(i,val) {
18    sum += arg;
19 });</pre>
<p>In this case,  the second argument of the <em>$.each</em> (<a href="http://docs.jquery.com/Utilities/jQuery.each">JQuery.each</a>) function is a anonymous function.  The special property it has,  as a closure,  is that the anonymous function has access to variables in the enclosing scope,  namely<em> sum</em>.  This allows the code in 16-19 to look and work a lot like the code in 12-15,  despite the fact that line 18 is in a different function than line 16.</p>
<p>As an aside,  this kind of higher-order function is often simple to implement;  although the real<em> $.each(</em>) function is capable of iterating over a wide variety of types (thus more complex),  a function that can iterate over arrays is as simple as</p>
<pre>20 function foreachArrayElement(arr,loopBody) {
21    for(i=0;i&lt;x.length;i++) {
22       loopBody(i,x[i])
23    }
24 }</pre>
<h2>Breaking The Conventions of Time</h2>
<p>Although you&#8217;ve got other choices,  closures are  a popular and concise way to write callbacks for aysnchronous communication.  Suppose that we want to GET a packet of JSON data from a server and update a user inteface.  In JQuery we could write:</p>
<pre>25 $.get("JsonService", {}, function(data, textStatus) {
26     var obj=JSON.parse(data);
27     someUIElement.update(obj);
28  });
29
30 ... do more stuff ...</pre>
<p>Note that time isn&#8217;t progressing downward in this script anymore.  We got from line 25 directly to line 30.  The code between 26-28 only executes when data gets back from the server.  In this case,  this behavior is caused by the asynchronous nature of <em>XMLHttpRequest</em>,  for which <em>$.get()</em> is  wrapper.  Similar behavior can be had by attaching an event handler to a DOM object,  or even by adding a function reference to an object or the global scope,  causing the anonymous function to execute when some other function,  defined elsewhere,  executes.</p>
<h2>&#8220;Race Conditions&#8221; In Asynchronous Communication</h2>
<p>Although it&#8217;s not wrong to break the conventions of time,  doing so risks the introduction of tricky and subtle errors.  Suppose we&#8217;re building an AJAX application where we&#8217;d like to cache an object,  retrieving the object only if we don&#8217;t have a copy in cache.  One obvious approach is to write</p>
<pre>31 if (cachedCopy==undefined) {
32  $.get("DataService",{},function(data,textStatus) {
33       cachedCopy=data;
34       updateUIOne(data);
35   }
36 } else {
37   updateUIOne(cachedCopy);
38 }
39
40 updateUITwo();</pre>
<p><em>UpdateUIOne(data)</em> populates the user interface with the data.  <em>UpdateUITwo()</em> makes some other change to the user interface.</p>
<p>Unfortunately,  this code has a potential bug that&#8217;s hidden by the conventions of time.  When data is in the cache,  line 37 executes before line 40,  so that <em>updateUIOne(data</em>) is called before <em>updateUITwo()</em>.   When data is not in the cache,  line 40 executes before 33 (communication callbacks don&#8217;t run in Javascript until the code being run returns to the browser.)  It&#8217;s all fine if the order in which you run <em>updateUIOne</em> and <em>updateUITwo</em> doesn&#8217;t matter &#8212; and it&#8217;s a disaster if it does&#8230;  This kind of code does one thing sometimes and does another thing other times:  code of this type is difficult to test,  and leads to the kind of bug that drives people to drink.</p>
<p>The real answer to these problems is to take a systematic approach to asynchronous communication:  any strategy based on band-aids that work here or these is going will eventally break down under the weight of growing requirements.  That said,  I&#8217;ll offer a few strategies for patching this kind of problem:</p>
<ol>
<li>If we could move <em>updateUITwo() </em>from line 40 to before line 31,  <em>updateUITwo()</em> and <em>updateUIOne(</em>) could be run in a consistent order.</li>
<li>Modifying <em>updateUIOne(data)</em> to call <em>updateUITwo()</em> after it completes also results in a consistent order</li>
<li>We can schedule <em>UpdateUIOne</em> to run AFTER the executing code returns to the browser,  by replacing line 34 with</li>
</ol>
<pre>41    setTimeout(function() { updateUIOne(data) },1);</pre>
<h2>Structural Instability</h2>
<p>Let&#8217;s consider another example where the conventions of time are treacherous:  suppose we need to retrieve three chunks of data from the server</p>
<pre>42 $.get("Chunk1Server",{},function(data1,textStatus1) {
43   UpdateUIOne(data1);
44   $.get("Chunk2Server",{},function(data2,textStatus2) {
45       UpdateUITwo(data2);
46        $.get("Chunk3Server",{},function(data3,textStatus3) {
47            UpdateUIThree(data3);
48       });
49    });
50 });</pre>
<p>Note that this specific chunk of code executes exactly the way that it looks.  Although there are some gaps,  execution proceeds progressively from line 42 to 47.  The &#8220;nested closure&#8221; idiom is common in Javascript code,   probably because it looks a lot the synchronous code that does the same job,  but it&#8217;s treacherous:  a tiny change in the code can break the conventions of time,  causing strange things to happen.  This property is called structural instability.</p>
<p>For example,  the above code might return directly to the browser,   eliminating the possibility of the &#8220;race conditions&#8221; seen in the last section. If we add the following line:</p>
<pre>51 UpdateUIFour();</pre>
<p>we&#8217;ll find that <em>UpdateUIFour()</em> runs ~before~ the other other functions.  This isn&#8217;t bad behavior if we expect it,  but could have spooky effects if we don&#8217;t.  This example is trivial,  but similar mistakes can be made in any of the closures,  resulting in errors that can be quite baffling.</p>
<p>The structural instability of nested closures pushes complex AJAX applications towards other styles of coding,  such as a state-machine organization.</p>
<h2>Order Of Arguments</h2>
<p>This is small compared to the other issues,  but the order of arguments of callback functions can add to the the confusions around the time:  The <em>$.get()</em> function provides a good example,  since it support four parameters:</p>
<pre>51 $.get(url,parameters,callback,type);</pre>
<p>all of the parameters,  except for the url,  are optional.   It&#8217;s generally good to put less used optional parameters towards the right,  but placing the type declaration after the callback disturbs the flow of time and hurts readability.  If you choose to have the return data parsed as JSON format,</p>
<pre>52 $.get(targetUrl,{},function(data,textStatus) {
53    ... handler code ...
54 },"json");</pre>
<p>you&#8217;ll see that the type specification occurs after the callback.  This,  by itself,  adds a small amount of confusion,  but when you&#8217;ve got multiple nested closures or if you were computing the values defined after the callback,  code becomes immediately difficult to understand.</p>
<h2>Event Handlers And &#8220;COMEFROM&#8221;</h2>
<p>In 1973 R. Lawrence Clark kiddingly introduced a COMEFROM statement in response to Edgser Dijkstra&#8217;s famous &#8220;GO TO Considered Harmful&#8221;,  from <a href="http://en.wikipedia.org/wiki/COMEFROM">the associated Wikipedia Article</a>,  I take an example of a program in a hypothetical BASIC dialect that uses COMEFROM:</p>
<pre class="source-qbasic"><span class="nu0">55</span> COMEFROM <span class="nu0">58</span>
<span class="nu0">56 </span><span class="kw3">INPUT</span> <span class="st0">"WHAT IS YOUR NAME? "</span>; A$
<span class="nu0">57 </span><span class="kw3">PRINT</span> <span class="st0">"HELLO, "</span>; A$
<span class="nu0">58 </span><span class="co2">REM
</span></pre>
<p><span class="co2">COMEFROM is a bit disturbing because it involves an &#8220;action at a distance:&#8221;  line 55 modifies the behavior of line 58.  A person looking a line 58 in isolation would have a hard time understanding what happens in the program when execution reaches line 58.</span></p>
<p><span class="co2">The use of event handlers,  both built in and custom,  has an effect similar to COMEFROM.  Consder the common example<br />
</span></p>
<pre><span class="co2">59 $("#activeButton").click(function() {
60    ... handle button click...
61 }</span></pre>
<p><span class="co2">Once more,  the code at line 60 will execute after the code that follows line 61.  Line 59 modifies the behavior of a built-in UI element,  which might be defined like</span></p>
<pre><span class="co2">62 &lt;input type="button" id="activeButton" value="Button X"&gt;
</span></pre>
<p><span class="co2">Looking at line 62 alone,  it&#8217;s impossible to tell what what happens when you click on the button;  an &#8220;onClick&#8221; handler in line 62 would be more immediately obvious in function.  That said,  events are a proven model for GUI programming that introduces a useful decoupling between the parts of a system &#8212; a looser coupling that lets us build bigger systems.  In the context of JQuery,  it&#8217;s great to be able to write something like</span></p>
<pre><span class="co2">63 &lt;div class="SpecialArea"&gt;... content ...&lt;/div&gt;</span></pre>
<p><span class="co2">Javascript code can then use the $() function to attach styles,  event handlers and to manipulate DOM elements inside the .SpecialArea to give the area specific behaviors and appearance.  This lets a developer provide a powerful but easy-to-use abstraction to a designer:   adding a single class specification lets us reuse tens of CSS specifications and several event handlers.   This is a big win,  but we can&#8217;t forget the hidden COMEFROM that&#8217;s implied.<br />
</span></p>
<p><span class="co2">Events are dynamically added and removed,  even in static languages such as Java and C#.  Although this dynamism adds flexibility,  it comes at a cost.  IDEs such as Eclipse and Visual Studio can do a great job of helping programmers understand static method calls:  for instance,  you can click on the name of a method and see a list of places where this method is called.  Because events aren&#8217;t amenable to static analysis,  inappropriate use (or even appropriate use) of events impair some of the tools we have for understanding programs.<br />
</span></p>
<p><span class="co2">Events are a great tool,  but programmers need to understand the weirdness that underlies them.  Used deliberately,  events can provide a concise and scalable way to move information and direct control in an application.  Poorly used events can cause software to degenerate to &#8220;spaghetti code&#8221;.</span></p>
<h2><span class="co2">Conclusion</span></h2>
<p><span class="co2">Closures are a powerful and concise way to express your intentions to a computer:  however,  closures break some of the intuitive assumptions that people use to understand software &#8212; specifically,  the idea that time moves downward through the execution of a procedure.  This leads to a kind of <em>structural instability</em>,   where software that&#8217;s perfectly simple and clear at a simple level can suddenly get much harder to understand when several complications converge.  This article uses JQuery-Javascript as an example not because JQuery is bad,  but because it&#8217;s good:   it&#8217;s a tool that helps beginning (and advanced) programmers accomplish a lot with little code.  Yet,  as we build increasingly complex applications,  we need a clear understand of the big and little weirdnesses of RIA programming.<br />
</span></p>
<p><span class="co2"><br />
</span></p>
<img src="http://feeds.feedburner.com/~r/Generation5/~4/2yEuaBC7wpg" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://gen5.info/q/2009/06/23/closures-javascript-and-the-arrow-of-time/feed/</wfw:commentRss>
		<feedburner:origLink>http://gen5.info/q/2009/06/23/closures-javascript-and-the-arrow-of-time/</feedburner:origLink></item>
		<item>
		<title>Getting a Good Track From a Garmin eTrex</title>
		<link>http://feedproxy.google.com/~r/Generation5/~3/1l2hUCntdDM/</link>
		<comments>http://gen5.info/q/2009/06/03/getting-a-good-track-from-a-garmin-etrex/#comments</comments>
		<pubDate>Wed, 03 Jun 2009 16:15:45 +0000</pubDate>
		<dc:creator>Paul Houle</dc:creator>
		
		<category><![CDATA[GIS]]></category>

		<guid isPermaLink="false">http://gen5.info/q/?p=272</guid>
		<description><![CDATA[Introduction
The Garmin eTrex series are popular handheld GPS units.  The eTrex H is a inexpensive unit with excellent sensitivity and accuracy,  and the eTrex Vista HCx is a favorite of OpenStreetMap contributors because it accepts a microSD card which can hold OSM maps.  I intended to use my Vista HCx to contribute to OSM and [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>The Garmin eTrex series are popular handheld GPS units.  The <a href="http://www.amazon.com/gp/product/B000PDV0CE?ie=UTF8&amp;tag=honeymediasys-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B000PDV0CE">eTrex H</a> is a inexpensive unit with excellent sensitivity and accuracy,  and the <a href="http://www.amazon.com/gp/product/B000PDR1LS?ie=UTF8&amp;tag=honeymediasys-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B000PDR1LS">eTrex Vista HCx</a> is a favorite of <a href="http://www.openstreetmap.org/">OpenStreetMap</a> contributors because it accepts a microSD card which can<a href="http://wiki.openstreetmap.org/wiki/OSM_Map_On_Garmin"> hold OSM maps</a>.  I intended to use my Vista HCx to contribute to OSM and to georeference photographs,  but I was shocked to discover that <a href="http://tchester.org/sgm/analysis/gps/etrex_saved_track_accuracy.html">eTrex units remove time information from saved tracks.</a> This means that saved tracks aren&#8217;t useful if you want to georeference photographs with an application like <a href="http://code.google.com/p/gpicsync/">GPicSync</a>.  There&#8217;s a simple solution to this problem:  avoid using saved tracks,  and download the &#8220;Active Track&#8221; instead.</p>
<h2>Saved Tracks</h2>
<p>The eTrex units &#8220;dumb down&#8221; saved tracks by (i) reducing the number of points, (ii) removing time information,  and (iii) applying spatial filtering.  A saved track looks like this in Garmin MapSource:</p>
<p><img class="aligncenter size-full wp-image-273" title="badtrack" src="http://gen5.info/q/wp-content/uploads/2009/06/badtrack.png" alt="badtrack" width="578" height="453" /></p>
<p>This track could be useful for mapping,  but lacking time information,  it can&#8217;t be used to reference events that occur at a particular moment in time.  Although the eTrex has a number of menu items for configuring the active track (to,  for instance,  increase the sample rate at which points are taken) there are no options that influence the information stored in saved tracks.</p>
<p><span id="more-272"></span></p>
<h2>What&#8217;s in your eTrex?</h2>
<p>Once you&#8217;ve loaded tracks from your eTrex, you&#8217;ll see that there are both saved tracks and a collection of &#8220;active logs&#8221;:</p>
<p><img class="aligncenter size-full wp-image-274" title="trackmenu" src="http://gen5.info/q/wp-content/uploads/2009/06/trackmenu.png" alt="trackmenu" width="316" height="437" /></p>
<p>You&#8217;ll usually find more than one active log:  new ones are created in order when the unit is power cycled or the track is otherwise reset.  Active logs respect the sampling rate you&#8217;ve specified,  keep timestamps,  and are less filtered than saved tracks:</p>
<p><img class="aligncenter size-full wp-image-275" title="goodtrack" src="http://gen5.info/q/wp-content/uploads/2009/06/goodtrack.png" alt="goodtrack" width="575" height="463" /></p>
<h2>Conclusion</h2>
<p>Avoid saved tracks if timestamps matter to you.  You&#8217;ll lose the advantage of using saved tracks to organize your tracks,  but this should matter little if you&#8217;re georeferencing photographs:  GPicSync looks at all of the tracks in the .gpx file and uses timestamps to match everything up.</p>
<img src="http://feeds.feedburner.com/~r/Generation5/~4/1l2hUCntdDM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://gen5.info/q/2009/06/03/getting-a-good-track-from-a-garmin-etrex/feed/</wfw:commentRss>
		<feedburner:origLink>http://gen5.info/q/2009/06/03/getting-a-good-track-from-a-garmin-etrex/</feedburner:origLink></item>
		<item>
		<title>Carpictures.cc,  My First Web 3.0 Site</title>
		<link>http://feedproxy.google.com/~r/Generation5/~3/6I2d8UnrvOk/</link>
		<comments>http://gen5.info/q/2009/04/01/carpicturescc-my-first-web-30-site/#comments</comments>
		<pubDate>Thu, 02 Apr 2009 02:23:08 +0000</pubDate>
		<dc:creator>Paul Houle</dc:creator>
		
		<category><![CDATA[Asynchronous Communications]]></category>

		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://gen5.info/q/?p=268</guid>
		<description><![CDATA[I haven&#8217;t blogged much in the last month because I&#8217;ve been busy.  I&#8217;ve got a day job,  and I&#8217;m a parent,  and I&#8217;ve also been working on a new site:  Creative Commons Car Pictures.  What you see on the surface of &#8220;Car Pictures&#8221; might be familiar,  but what&#8217;s under the surface [...]]]></description>
			<content:encoded><![CDATA[<p>I haven&#8217;t blogged much in the last month because I&#8217;ve been busy.  I&#8217;ve got a day job,  and I&#8217;m a parent,  and I&#8217;ve also been working on a new site:  <a href="http://carpictures.cc/cars/photo/">Creative Commons Car Pictures</a>.  What you see on the surface of &#8220;Car Pictures&#8221; might be familiar,  but what&#8217;s under the surface is something you&#8217;ve probably never seen before:</p>
<p><img class="alignnone size-full wp-image-269" title="carpicturesfrontpage" src="http://gen5.info/q/wp-content/uploads/2009/04/carpicturesfrontpage.png" alt="carpicturesfrontpage" width="500" height="300" /></p>
<p>A collection of car pictures under a Creative Commons license,  it was built from a taxonomy that was constructed from <a href="http://dbpedia.org/About">Dbpedia</a> and <a href="http://freebase.com/">Freebase</a>.  My editor and I then used a scalable process to clean up the taxonomy,  find matching images and enrich image metadata.  As you can imagine,  this is a process that can be applied to other problem domains:  we&#8217;ve tested some of the methodology on our site <a href="http://animalphotos.info/a/">Animal Photos!</a> but this is the first site designed from start to finish based on the new technology.</p>
<p><a href="http://carpictures.cc/cars/photo/">Car Pictures</a> was made possible by data sets available on the semantic web,  and it&#8217;s soon about to export data via the semantic web.  We&#8217;re currently talking with people about exporting our content in a machine-readable <a href="http://linkeddata.org/">linked data</a> format &#8212; in our future plans,  this will enable a form of semantic search that will revolutionize multimedia search:  rather than searching for inprecise keywords,  it will become possible to look up pictures,  video and documents about named entities found in systems such as Freebase,  Wikipedia,  WordNet,  Cyc and OpenCalais.</p>
<p>In the next few days we&#8217;re going to watch carefully how Car Pictures interacts with users and with web crawlers.  Then we&#8217;ll be adding content,  community features,  and a linked data interface.   In the meantime,  we&#8217;re planning to build something at least a hundred times bigger.</p>
<p>Quite literally, thousands of people have contributed to Car Pictures,  but I&#8217;d like to particularly thank <a href="http://danbri.org/">Dan Brickley</a>,  who&#8217;s been helpful in the process of interpreting Dbpedia with his <a href="http://arc.semsol.org/">ARC2</a> RDF tools,  and <a href="http://twitter.com/kidehen">Kingsley Idehen</a>,  who&#8217;s really helped me sharpen my vision of what the semantic web can be.</p>
<p>Anyway,  I&#8217;d love any feedback that you can give me about Car Pictures and any help I can get in spreading the word about it.  If you&#8217;re interested in making  similar site about some other subject,  please contact me:  it&#8217;s quite possible that we can help.</p>
<img src="http://feeds.feedburner.com/~r/Generation5/~4/6I2d8UnrvOk" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://gen5.info/q/2009/04/01/carpicturescc-my-first-web-30-site/feed/</wfw:commentRss>
		<feedburner:origLink>http://gen5.info/q/2009/04/01/carpicturescc-my-first-web-30-site/</feedburner:origLink></item>
		<item>
		<title>First-Class Functions and Logical Negation in C#</title>
		<link>http://feedproxy.google.com/~r/Generation5/~3/XelDUbyRlxs/</link>
		<comments>http://gen5.info/q/2009/03/09/first-class-functions-and-logical-negation-in-c/#comments</comments>
		<pubDate>Mon, 09 Mar 2009 13:50:39 +0000</pubDate>
		<dc:creator>Paul Houle</dc:creator>
		
		<category><![CDATA[Dot Net]]></category>

		<category><![CDATA[Functional Programming]]></category>

		<category><![CDATA[Linq]]></category>

		<guid isPermaLink="false">http://gen5.info/q/?p=253</guid>
		<description><![CDATA[
Introduction
Languages such as LISP,  ML,  oCaml F# and Scala have supported first-class functions for a long time.  Functional programming features are gradually diffusing into mainstream languages such as C#,  Javascript and PHP.   In particular,  Lambda expressions,  implicit typing,  and delegate autoboxing have made  C# 3.0 an much more expressive language than it&#8217;s predecssors.
In this article,  [...]]]></description>
			<content:encoded><![CDATA[<p><img style="margin-right:20px; margin-bottom:20px;" title="negation" src="http://gen5.info/q/wp-content/uploads/2009/03/negation.png" alt="negation" width="99" height="90" align="right" /></p>
<h2>Introduction</h2>
<p>Languages such as LISP,  ML,  oCaml F# and Scala have supported first-class functions for a long time.  Functional programming features are gradually diffusing into mainstream languages such as C#,  Javascript and PHP.   In particular,  Lambda expressions,  implicit typing,  and delegate autoboxing have made  C# 3.0 an much more expressive language than it&#8217;s predecssors.</p>
<p>In this article,  I develop a simple function that acts on functions:  given a boolean function<em> f</em>,  <em>F.Not(f)</em> returns a new boolean function which is the logical negation of <em>f</em>.  (That is,  <em>F.Not(f(x)) == !f(x)</em>).   Although the end function is simple to use,  I had to learn a bit about the details of C# to ge tthe behavior  I wanted &#8212; this article records that experience.<br />
<span id="more-253"></span></p>
<h2>Why?</h2>
<p>I was debugging an application that uses Linq-to-Objects the other day,  and ran into a bit of a pickle.  I had written an expression that would return true if all of the values in a List&lt;String&gt; were valid phone numbers:</p>
<pre>[01] dataColumn.All(Validator.IsValidPhoneNumber);</pre>
<p>I was expecting the list to validate successfully,  but the function was returning false.  Obviously at least one of the values in the function was not validating &#8212; but which one?  I tried using a lambda expression in the immediate window (which lets you type in expression while debugging) to reverse the order of matching:</p>
<pre>[02] dataColumn.Where(s =&gt; !ValidatorIsValidPhoneNumber(x));</pre>
<p>but that gave me the following error message:</p>
<pre>[03] Expression cannot contain lambda expressions</pre>
<p>(The expression compiler used in the immediate window doesn&#8217;t support all of the language features supported by the full C# compiler.)  I was able to answer find the non matching elements using the set difference operator</p>
<pre>[04] dataColumn.Except(dataColumn.Where(Validator.IsValidPhoneNumber).ToList()</pre>
<p>but I wanted an easier and more efficient way &#8212; with a logical negation function,  I could just write</p>
<pre>[05] dataColumn.Except(F.Not(Validator.IsValidPhoneNumber).ToList();</pre>
<h2>Try 1: Extension Method</h2>
<p>I wanted to make the syntax for negation as simple as possible.  I didn&#8217;t want to even have to name a class eliminate the need to name a class,  so I tried the following extension method:</p>
<pre>[06] static class FuncExtensions {
[07]    public static Func&lt;Boolean&gt; Not(this Func&lt;Boolean&gt; innerFunction) {
[08]       return ()  =&gt; innerFunction();
[09]    }
[10]    ...
[11]   }</pre>
<p>I was hoping that,  given a function like</p>
<pre>[12] public static boolean AlwaysTrue() { return true };</pre>
<p>that I could negate the function by writing</p>
<pre>[13] AlwaysTrue.Not()</pre>
<p>Unfortunately,  it doesn&#8217;t work that way:  the extension method can only be called on a delegate of type Func&lt;Boolean&gt;.  Although the compiler will &#8220;autobox&#8221; function references to delegates in many situations,  it doesn&#8217;t do it when you reference an extension method.  I could write</p>
<pre>[14] (Func&lt;Boolean&gt; AlwaysTrue).Not()</pre>
<p>but that&#8217;s not a pretty syntax.   At that point,  I tried another tack.</p>
<h2>Try 2: Static Method</h2>
<p>Next,  I defined a set of negation functions as static methods on a static class:</p>
<pre>[15] public static class F {
[16]    public static Func&lt;Boolean&gt; Not(Func&lt;Boolean&gt; innerFunction) {
[17]       return () =&gt; !innerFunction();
[18]    }
[19]
[20]    public static Func&lt;T1,Boolean&gt; Not&lt;T1&gt;(
[21]       Func&lt;T1,Boolean&gt; innerFunction) {
[22]          return x =&gt;!innerFunction(x);
[23]    }
[24]
[25]    public static Func&lt;T1, T2,Boolean&gt; Not&lt;T1,T2&gt;(
[26]       Func&lt;T1, T2,Boolean&gt; innerFunction) {
[27]          return (x,y) =&gt; !innerFunction(x,y);
[28]    }
[29]
[30]    public static Func&lt;T1, T2, T3, Boolean&gt; Not&lt;T1,T2,T3&gt;(
[31]       Func&lt;T1, T2, T3, Boolean&gt; innerFunction) {
[32]           return (x, y, z) =&gt; !innerFunction(x, y, z);
[33]    }
[34]
[35]    public static Func&lt;T1, T2, T3, T4, Boolean&gt; Not&lt;T1, T2, T3, T4&gt;(
[36]       Func&lt;T1, T2, T3, T4,Boolean&gt; innerFunction) {
[37]          return (x, y, z, a) =&gt; !innerFunction(x, y, z, a);
[38]    }
[39] }</pre>
<p>Now I can write</p>
<pre>[40] F.Not(AlwaysTrue)() // always == false</pre>
<p>or</p>
<pre>[41] Func&lt;int,int,int,int,Boolean&gt; testFunction = (a,b,c,d) =&gt; (a+b)&gt;(c+d)
[42] F.Not(testFunction)(1,2,3,4)</pre>
<p>which is sweet &#8212; the C# compiler now automatically autoboxes the argument to <em>F.Not</em> on line [40].  Note two details of how type inference works here:</p>
<ol>
<li>The compiler automatically infers the type parameters of <em>F.Not()</em> by looking at the arguments of the <em>innerFunction</em>.  If it didn&#8217;t do that,  you&#8217;d need to write
<pre>F.Not&lt;int,int,int,int&gt;(testFunction)(1,2,3,4)</pre>
<p>which would be a lot less fun.</li>
<li>On line [40],  note that the compiler derives the types of the parameters <em>a</em>,<em>b</em>,<em>c</em>, and <em>d</em> using the type declaration on the right hand side (RHS)
<pre>Func&lt;int,int,int,int,Boolean&gt;</pre>
<p>you can&#8217;t write <em>var</em> on the RHS in this situation because that doesn&#8217;t give the compiler information about the parameters and return values of the lambda.</li>
</ol>
<p>Although good things are happening behind the scenes,  there are also two bits of ugliness:</p>
<ol>
<li>I need to implement <em>F.Not()</em> 5 times to support functions with 0 to 4 parameters:  once I&#8217;ve done that,  however,  the compiler automatically resolves the overloading and picks the right version of the function.</li>
<li>The generic <em>Func&lt;&gt;</em> and <em>Action&lt;&gt;</em> delegates support at most 4 parameters.  Although it&#8217;s certainly true that functions with a large number of parameters can be difficult to use and maintain (and should be discouraged),  this is a real limitation.</li>
</ol>
<h2>Extending IEnumerable&lt;T&gt;</h2>
<p>One of the nice things about Linq is that you can extend it by adding new extension methods to IEnumerable.  Not everbody agrees,  but I&#8217;ve always liked the <em>unless()</em> satement in Perl,  which is equivalent to if(!test):</p>
<pre>[42] unless(everything_is_ok()) {
[43]   abort_operation;
[44] }</pre>
<p>A replacement for the <em>Where(predicate)</em> method that negates the <em>predicate</em> function would be convenient my debugging problem:</p>
<p>I built an<em> Unless()</em> extension method that combines <em>Where()</em> with<em> F.Not()</em>:</p>
<pre>[45] public static IEnumerable&lt;T&gt; Unless&lt;T&gt;(
[46]    this IEnumerable&lt;T&gt; input, Func&lt;T, Boolean&gt; fn) {
[47]       return input.Where(F.Not(fn));
[48]     }</pre>
<p>Now I can write</p>
<pre>[49] var list = new List&lt;int&gt;() { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
[50] var filtered = list.Unless(i =&gt; i &gt; 3).ToList();</pre>
<p>and get the result</p>
<pre>[51] filtered = { 1,2,3 }</pre>
<h2>Conclusion</h2>
<p>New features in C# 3.0 make it an expressive language for functional programming.  Although the syntax of C# isn&#8217;t quite as sweet as F# or Scala,  a programmer who works with the implicit typing and autoboxing rules of the compiler can create functions that act on functions that are easy to use &#8212; in this article we develop a set of functions that negate boolean functions and apply this to add a new restriction method to IEnumerable&lt;T&gt;.</p>
<p>
<a href="http://www.dotnetkicks.com/kick/?url=http%3a%2f%2fgen5.info%2fq%2f2009%2f03%2f09%2ffirst-class-functions-and-logical-negati"><img src="http://www.dotnetkicks.com/Services/Images/KickItImageGenerator.ashx?url=http%3a%2f%2fgen5.info%2fq%2f2009%2f03%2f09%2ffirst-class-functions-and-logical-negati" border="0" alt="kick it on DotNetKicks.com" /></a></p>
<img src="http://feeds.feedburner.com/~r/Generation5/~4/XelDUbyRlxs" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://gen5.info/q/2009/03/09/first-class-functions-and-logical-negation-in-c/feed/</wfw:commentRss>
		<feedburner:origLink>http://gen5.info/q/2009/03/09/first-class-functions-and-logical-negation-in-c/</feedburner:origLink></item>
		<item>
		<title>Putting Freebase in a Star Schema</title>
		<link>http://feedproxy.google.com/~r/Generation5/~3/C1-sDQp5LH8/</link>
		<comments>http://gen5.info/q/2009/02/25/putting-freebase-in-a-star-schema/#comments</comments>
		<pubDate>Wed, 25 Feb 2009 14:33:36 +0000</pubDate>
		<dc:creator>Paul Houle</dc:creator>
		
		<category><![CDATA[Freebase]]></category>

		<category><![CDATA[SQL]]></category>

		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://gen5.info/q/?p=223</guid>
		<description><![CDATA[What&#8217;s Freebase?

Freebase is a open database of things that exist in the world:  things like people,  places,  songs and television shows.   As of the January 2009 dump,  Freebase contained about 241 million facts,  and it&#8217;s growing all the time.  You can browse it via the web and even edit it,  much like Wikipedia.  Freebase [...]]]></description>
			<content:encoded><![CDATA[<h2>What&#8217;s Freebase?</h2>
<p><img style="margin-left: 10px" title="cyclopedia" src="http://gen5.info/q/wp-content/uploads/2009/02/cyclopedia.jpg" alt="cyclopedia" width="240" height="161" align="right" /><br />
<a href="http://www.freebase.com/">Freebase</a> is a open database of things that exist in the world:  things like people,  places,  songs and television shows.   As of the January 2009 dump,  Freebase contained about 241 million facts,  and it&#8217;s growing all the time.  You can browse it via the web and even edit it,  much like Wikipedia.  Freebase also has an API that lets programs add data and make queries using a language called <a href="http://mql.freebaseapps.com/">MQL</a>.  Freebase is complementary to <a href="http://dbpedia.org/About">DBpedia</a> and other sources of information.  Although it takes a different approach to the semantic web than systems based on RDF standards,  it interoperates with them via <a href="http://www.semanticfocus.com/blog/entry/title/freebase-officially-linked-data-with-release-of-rdf-service/"> linked data</a>.</p>
<p>The January 2009 Freebase dump is about 500 MB in size.  Inside a bzip-compressed files,  you&#8217;ll find something that&#8217;s similar in spirit to a Turtle RDF file,  but is in a simpler format and represents facts as a collection of four values rather than just three.</p>
<h2>Your Own Personal Freebase</h2>
<p>To start exploring and extracting from Freebase,  I wanted to load the database into a star schema in a mysql database &#8212; an architecture similar to some RDF stores,  such as <a href="http://arc.semsol.org/">ARC</a>.  The project took about a week of time on a modern x86 server with 4 cores and 4 GB of RAM and resulted in a 18 GB collection of database files and indexes.</p>
<p>This is sufficient for my immediate purposes,  but future versions of Freebase promise to be much larger:  this article examines the means that could be used to improve performance and scalability using parallelism as well as improved data structures and algorithms.<span id="more-223"></span></p>
<p>I&#8217;m interested in using generic databases such as Freebase and Dbpedia as a data source for building web sites.  It&#8217;s possible to access generic databases through  APIs,  but there are advantages to having your own copy:  you don&#8217;t need to worry about API limits and network latency,  and you can ask questions that cover the entire universe of discourse.</p>
<p>Many RDF stores use variations of a format known as a Star Schema for representing RDF triples;  the Star Schema is commonly used in data warehousing application because it can efficiently represent repetitive data.   Freebase is similar to,  but not quite an RDF system.  Although there are multiple ways of thinking about Freebase,  the quarterly dump files provided by Metaweb are presented as quads:  groups of four related terms in tab-delimited terms.  To have a platform for exploring freebase,  I began a project of loading Freebase into a Star Schema in a relational database.</p>
<h2>A Disclaimer</h2>
<p>Timings reported in this article are approximate.  This work was done on a server that was doing other things; little effort was made to control sources of variation such as foreign workload,  software configuration and hardware characteristics.  I think it&#8217;s orders of magnitude that matter here:  with much larger data sets becoming available,  we need tools that can handle databases 10-100 times as big,  and quibbling about 20% here or there isn&#8217;t so important.  I&#8217;ve gotten similar results with the ARC triple store.  Some products do about an order of magnitude better:  the <a href="http://virtuoso.openlinksw.com/wiki/main/Main/VOSRDF">Virtuoso</a> server can load DBpedia,  a larger database,  <a href="http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData"> in about 16 to 22 hours on a 16 GB computer</a>:  several papers on RDF store performance are available [<a href="http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/">1</a>] [<a href="http://www4.wiwiss.fu-berlin.de/benchmarks-200801/">2</a>] [<a href="http://esw.w3.org/topic/RdfStoreBenchmarking">3</a>].  Although the system described in this paper isn&#8217;t quite an RDF store,  it&#8217;s performance is comprable to a relatively untuned RDF store.</p>
<p>It took about a week of calendar time to load the 241 million quads in the January 2009 Freebase into a Star Schema using a modern 4-core web server with 4GB of RAM;  this time could certainly be improved with microoptimizations,  but it&#8217;s in the same range that people are observing that it takes to load 10^8 triples into other RDF stores.  (One product is claimed to load DBPedia,  which contains about 100 million triples,  in about <a href="http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData">16 hours with &#8220;heavy-duty hardware&#8221;</a>.)   Data sets exceeding 10^9 triples are becoming rapidly available &#8212; these will soon exceed what can be handled with simple hardware and software and will require new techniques:  both the use of parallel computers and optimized data structures.</p>
<h2>The Star Schema</h2>
<p>In a star schema,  data is represented in separate fact and dimension tables,</p>
<p><img class="aligncenter size-full wp-image-229" title="300px-star-schema" src="http://gen5.info/q/wp-content/uploads/2009/02/300px-star-schema.png" alt="300px-star-schema" width="300" height="188" /></p>
<p>all of the rows in the fact table (<em>quad</em>) contain integer keys &#8212; the values associated with the keys are defined in dimension tables (<em>cN_value</em>).  This efficiently compresses the data and indexes for the fact table,  particularly when the values are highly repetitive.</p>
<p>I loaded data into the following schema:</p>
<pre>create table c1_value (
   id             integer primary key auto_increment,
   value          text,
   key(value(255))
) type=myisam;

... identical c2_value, c3_value and c4_value tables ...

create table quad (
   id             integer primary key auto_increment,
   c1             integer not null,
   c2             integer not null,
   c3             integer not null,
   c4             integer not null
) type=myisam;</pre>
<p>Although I later created indexes on<em> c1, c2, c3,</em> and <em>c4</em> in the quad table,  I left unnecessary indexes off of the tables during the loading process because it&#8217;s more efficient to create indexes after loading data in a table.  The keys on the value fields of the dimension tables are important,  because the loading process does frequent queries to see if values already exist in the dimension table.  The sequentially assigned <em>id</em> in the <em>quad</em> field isn&#8217;t necessary for many applications,  but it gives each a fact a unique identity and makes the system aware of the order of facts in the dump file.</p>
<h2>The Loading Process</h2>
<p>The loading script was written in PHP and used a naive method to build the index incrementally.  In pseudo code it looked something like this:</p>
<pre>function insert_quad($q1,$q2,$q3,$q4) {
    $c1=get_dimension_key(1,$q1);
    $c2=get_dimension_key(2,$q2);
    $c3=get_dimension_key(3,$q3);
    $c4=get_dimension_key(4,$q4);
    $conn-&gt;insert_row("quad",null,$c1,$c2,$c3,$c4)
}

function get_dimension_key($index,$value) {
    $cached_value=check_cache($value);
    if ($cached_value)
        return $cached_value;

    $table="$c{$index}_value";
    $row=$conn-&gt;fetch_row_by_value($table,$value);
    if ($row)
        return $row-&gt;id;
    $conn-&gt;insert_row($table,$value);
    return $conn-&gt;last_insert_id
};</pre>
<p>Caching frequently used dimension values improves performance by a factor of five or so,  at least in the early stages of loading.  A simple cache management algorithm,  clearing the cache every 500,000 facts,  controls memory use.  Timing data shows that a larger cache or better replacement algorithm would make at most an increment improvement in performance.  (Unless a complete dimension table index can be held in RAM,  in which case all read queries can be eliminated.)</p>
<p>I performed two steps after the initial load:</p>
<ol>
<li>Created indexes on<em> quad(c1), quad(c2), quad(c3) </em>and <em>quad(c4)</em></li>
<li>Used myisam table compression to reduce database size and improve performance</li>
</ol>
<h2>Loading Performance</h2>
<p>It took about 140 hours (nearly 6 days) to do the initial load.  Here&#8217;s a graph of facts loaded vs elapsed time:</p>
<p><img class="alignnone size-full wp-image-225" title="quad_time" src="http://gen5.info/q/wp-content/uploads/2009/02/quad_time.png" alt="quad_time" width="604" height="427" /></p>
<p>The important thing Iabout this graph is that it&#8217;s convex upward:  the loading process slows down as the number of facts increases.  The first 50 quads are loaded at a rate of about 6 million per hour;  the last 50 are loaded at a rate of about 1 million per hour.  An explanation of the details of the curve would be complex,  but<em> log N </em>search performance of B-tree indexes and the ability of the database to answer queries out of the computer&#8217;s RAM cache would be significant.  Generically,  all databases will perform the same way,  becoming progressively slower as the size of the database increases:  you&#8217;ll eventually reach a database size where the time to load the database becomes unacceptable.</p>
<p>The process of constructing b-tree indexes on the mysql tables took most of a day.  On average it took about four hours to construct a b-tree index on one column of <em>quad</em>:</p>
<pre>mysql&gt; create index quad_c4 on quad(c4);
Query OK, 243098077 rows affected (3 hours 40 min 50.03 sec)
Records: 243098077  Duplicates: 0  Warnings: 0</pre>
<p>It took about an hour to compress the tables and rebuild indexes,  at which point the data directory looks like:</p>
<pre>-rw-r----- 1 mysql root        8588 Feb 22 18:42 c1_value.frm
-rw-r----- 1 mysql root   713598307 Feb 22 18:48 c1_value.MYD
-rw-r----- 1 mysql root   557990912 Feb 24 10:48 c1_value.MYI
-rw-r----- 1 mysql root        8588 Feb 22 18:56 c2_value.frm
-rw-r----- 1 mysql root      485254 Feb 22 18:46 c2_value.MYD
-rw-r----- 1 mysql root      961536 Feb 24 10:48 c2_value.MYI
-rw-r----- 1 mysql root        8588 Feb 22 18:56 c3_value.frm
-rw-r----- 1 mysql root   472636380 Feb 22 18:51 c3_value.MYD
-rw-r----- 1 mysql root   370497536 Feb 24 10:51 c3_value.MYI
-rw-r----- 1 mysql root        8588 Feb 22 18:56 c4_value.frm
-rw-r----- 1 mysql root  1365899624 Feb 22 18:44 c4_value.MYD
-rw-r----- 1 mysql root  1849223168 Feb 24 11:01 c4_value.MYI
-rw-r----- 1 mysql root          65 Feb 22 18:42 db.opt
-rw-rw---- 1 mysql mysql       8660 Feb 23 17:16 quad.frm
-rw-rw---- 1 mysql mysql 3378855902 Feb 23 20:08 quad.MYD
-rw-rw---- 1 mysql mysql 9927788544 Feb 24 11:42 quad.MYI</pre>
<p>At this point it&#8217;s clear that the indexes are larger than the actual databases:  note that <em>c2_value</em> is much smaller than the other tables because it holds a relatively small number of predicate types:</p>
<pre>mysql&gt; select count(*) from c2_value;
+----------+
| count(*) |
+----------+
|    14771 |
+----------+
1 row in set (0.04 sec)

mysql&gt; select * from c2_value limit 10;
+----+-------------------------------------------------------+
| id | value                                                 |
+----+-------------------------------------------------------+
|  1 | /type/type/expected_by                                |
|  2 | reverse_of:/community/discussion_thread/topic         |
|  3 | reverse_of:/freebase/user_profile/watched_discussions |
|  4 | reverse_of:/freebase/type_hints/included_types        |
|  5 | /type/object/name                                     |
|  6 | /freebase/documented_object/tip                       |
|  7 | /type/type/default_property                           |
|  8 | /type/type/extends                                    |
|  9 | /type/type/domain                                     |
| 10 | /type/object/type                                     |
+----+-------------------------------------------------------+
10 rows in set (0.00 sec)</pre>
<p>The total size of the mysql tablespace comes to about 18GB,  anexpansion of about 40 times relative to the bzip2 compressed dump file.</p>
<h2>Query Performance</h2>
<p>After all of this trouble,  how does it perform?  Not too bad if we&#8217;re asking a simple question,  such as pulling up the facts associated with a particular object</p>
<pre>mysql&gt; select * from quad where c1=34493;
+---------+-------+------+---------+--------+
| id      | c1    | c2   | c3      | c4     |
+---------+-------+------+---------+--------+
| 2125876 | 34493 |   11 |      69 | 148106 |
| 2125877 | 34493 |   12 | 1821399 |      1 |
| 2125878 | 34493 |   13 | 1176303 | 148107 |
| 2125879 | 34493 | 1577 |      69 | 148108 |
| 2125880 | 34493 |   13 | 1176301 | 148109 |
| 2125881 | 34493 |   10 | 1713782 |      1 |
| 2125882 | 34493 |    5 | 1174826 | 148110 |
| 2125883 | 34493 | 1369 | 1826183 |      1 |
| 2125884 | 34493 | 1578 | 1826184 |      1 |
| 2125885 | 34493 |    5 |      66 | 148110 |
| 2125886 | 34493 | 1579 | 1826185 |      1 |
+---------+-------+------+---------+--------+
11 rows in set (0.05 sec)</pre>
<p>Certain sorts of aggregate queries are reasonably efficient,  if you don&#8217;t need to do them too often:  we can look up the most common 20 predicates in about a minute:</p>
<pre>select
   (select value from c2_value as v where v.id=q.c2) as predicate,count(*)
   from quad as q
     group by c2
     order by count(*) desc
     limit 20;</pre>
<pre>+-----------------------------------------+----------+
| predicate                               | count(*) |
+-----------------------------------------+----------+
| /type/object/type                       | 27911090 |
| /type/type/instance                     | 27911090 |
| /type/object/key                        | 23540311 |
| /type/object/timestamp                  | 19462011 |
| /type/object/creator                    | 19462011 |
| /type/permission/controls               | 19462010 |
| /type/object/name                       | 14200072 |
| master:9202a8c04000641f800000000000012e |  5541319 |
| master:9202a8c04000641f800000000000012b |  4732113 |
| /music/release/track                    |  4260825 |
| reverse_of:/music/release/track         |  4260825 |
| /music/track/length                     |  4104120 |
| /music/album/track                      |  4056938 |
| /music/track/album                      |  4056938 |
| /common/document/source_uri             |  3411482 |
| /common/topic/article                   |  3369110 |
| reverse_of:/common/topic/article        |  3369110 |
| /type/content/blob_id                   |  1174046 |
| /type/content/media_type                |  1174044 |
| reverse_of:/type/content/media_type     |  1174044 |
+-----------------------------------------+----------+
20 rows in set (43.47 sec)</pre>
<p>You&#8217;ve got to be careful how you write your queries:  the above query with the subselect is efficient,  but I found it took 5 hours to run when I joined <em>c2_value</em> with quad and grouped on <em>value</em>.  A person who wishes to do frequent aggregate queries would find it most efficient to create a materialized views of the aggregates.</p>
<h2>Faster And Large</h2>
<p>It&#8217;s obvious that the Jan 2009 Freebase is pretty big to handle with the techniques I&#8217;m using.  One thing I&#8217;m sure of is that that Freebase will be much bigger next quarter &#8212; I&#8217;m not going to do it the same way again.  What can I do to speed the process up?</p>
<h3>Don&#8217;t Screw Up</h3>
<p>This kind of process involves a number of lengthy steps.  Mistakes,  particularly if repeated,  can waste days or weeks.  Although services such as EC2 are a good way to provision servers to do this kind of work,  the use of automation and careful procedures is key to saving time and money.</p>
<h3>Partition it</h3>
<p>Remember how the loading rate of a data set decreases as the size of the set increase?  If I could split the data set into 5 partitions of 50 M quads each,  I could increase the loading rate by a factor of 3 or so.  If I can build those 5 partitions in parallel (which is trivial),  I can reduce wallclock time by a factor of 15.</p>
<h3>Eliminate Random Access I/O</h3>
<p>This loading process is slow because of the involvement of random access disk I/O.  All of Freebase canbe loaded into mysql with the following statement,</p>
<p>LOAD DATA INFILE &#8216;/tmp/freebase.dat&#8217; INTO TABLE q FIELDS TERMINATED  BY &#8216;\t&#8217;;</p>
<p>which took me about 40 minutes to run.   Processes that do a &#8220;full table scan&#8221; on the raw Freebase table with a <em>grep</em> or <em>awk</em>-type pipeline take about 20-30 minutes to complete.  Dimension tables can be built quickly if they can be indexed by a RAM  hasthable.   The process that builds the dimension table can emit a list of key values for the associated quads:  this output can be sequentially merged to produce the fact table.</p>
<h3>Bottle It</h3>
<p>Once a data source has been loaded into a database,  a physical copy of the database can be made and copied to another machine.  Copies can be made in the fraction of the time that it takes to construct the database.  A good example is the<a href="http://virtuoso.openlinksw.com/wiki/main/Main/VirtInstallationEC2"> Amazon EC2 AMI</a> that contains a preinstalled and preloaded <a href="http://virtuoso.openlinksw.com/">Virtuoso database</a> loaded with billions of triples from DBPedia,  MusicBrainz,  NeuroCommons and a number of other databases.  Although the process of creating the image is complex,  a new instance can be provisioned in 1.5 hours at the click of a button.</p>
<h3>Compress Data Values</h3>
<p>Unique object identifiers in freebase are coded in an inefficient ASCII representation:</p>
<pre>mysql&gt; select * from c1_value limit 10;
+----+----------------------------------------+
| id | value                                  |
+----+----------------------------------------+
|  1 | /guid/9202a8c04000641f800000000000003b |
|  2 | /guid/9202a8c04000641f80000000000000ba |
|  3 | /guid/9202a8c04000641f8000000000000528 |
|  4 | /guid/9202a8c04000641f8000000000000836 |
|  5 | /guid/9202a8c04000641f8000000000000df3 |
|  6 | /guid/9202a8c04000641f800000000000116f |
|  7 | /guid/9202a8c04000641f8000000000001207 |
|  8 | /guid/9202a8c04000641f80000000000015f0 |
|  9 | /guid/9202a8c04000641f80000000000017dc |
| 10 | /guid/9202a8c04000641f80000000000018a2 |
+----+----------------------------------------+
10 rows in set (0.00 sec)</pre>
<p>These are 38 bytes apiece.  The hexadecimal part of the guid could be represented in 16 bytes in a binary format,  and it appears that about half of the guid is a constant prefix that could be further excised.</p>
<p>A similar efficiency can be gained in the construction of in-memory dimension tables: md5 or sha1 hashes could be used as proxies for values.</p>
<p>The freebase dump is littered with &#8220;reverse_of:&#8221; properties which are superfluous if the correct index structures exist to do forward and backward searches.</p>
<h3>Parallelize it</h3>
<p>Loading can be parallelized in many ways:  for instance,  the four dimension tables can be built in parallel.  Dimension tables can also be built by a sorting process that can be performed on a computer cluster using map/reduce techniques.  A cluster of computers can also store a knowledge base in RAM,  trading sequential disk I/O for communication costs.  Since the availability of data is going to grow faster than the speed of storage systems,  parallelism is going to become essential for handling large knowledge bases &#8212; an issue identified by Japanese AI workers in the early 1980&#8217;s.</p>
<h3>Cube it?</h3>
<p>Some queries  benefit from indexes built on combinations of tables,  such as</p>
<p>CREATE INDEX quad_c1_c2 ON quad(c1,c2);</p>
<p>there are 40 combinations of columns on which an index could be useful &#8212; however,  the cost in time and storage involved in creating those indexes would be excessively expensive.  If such indexes were indeed necessary, a <a href="http://www.design-ireland.net/index.php?http%3A//www.design-ireland.net/alpha/controller/view_article.php%3Foid%3D00000000036">Multidimensional database</a> can create a cube index that is less expensive than a complete set of B-tree indexes.</p>
<h3>Break it up into separate tables?</h3>
<p>It might be anathema to many semweb enthusiasts,  but I think that Freebase (and parts of Freebase) could be efficiently mapped to conventional relational tables.  That&#8217;s because facts in Freebase are associated with types,  see,  for instance,  <a href="http://www.freebase.com/type/schema/music/composer">Composer</a> from the <a href="http://www.freebase.com/view/music">Music Commons</a>.  It seems reasonable to map types to relational tables and to create satellite tables to represent many-to-many relationships between types.  This scheme would automatically partition Freebase in a reasonable way and provide an efficient representation where many obvious questions (ex. &#8220;Find Female Composers Born In 1963 Who Are More Than 65 inches tall&#8221;) can be answered with a minimum number of joins.</p>
<h2>Conclusion</h2>
<p>Large knowledge bases are becoming available that cover large areas of human concern:  we&#8217;re finding many applications for them.  It&#8217;s possible to to handle databases such as Freebase and DBpedia on a single computer of moderate size,  however,  the size of generic databases and the hardware to store them on are going to grow larger than the ability of a singler computer to process them.  Fact stores that (i) use efficient data structures,  (ii) take advantage of parallelism,  and (iii) can be tuned to the requirements of particular applications,  are going to be essential for further progress in the Semantic Web.</p>
<h2>Credits</h2>
<ul>
<li>Metaweb Technologies, <a href="http://download.freebase.com/datadumps/">Freebase Data Dumps</a>, January 13, 2009</li>
<li><a href="http://www.openlinksw.com/blog/~kidehen">Kingsley Idehen</a>,  for several links about RDF store performance.</li>
<li><a href="http://www.flickr.com/photos/stewart/">Stewart Butterfield</a> for<a href="http://www.flickr.com/photos/stewart/461099066/"> encyclopedia photo</a>.</li>
</ul>
<img src="http://feeds.feedburner.com/~r/Generation5/~4/C1-sDQp5LH8" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://gen5.info/q/2009/02/25/putting-freebase-in-a-star-schema/feed/</wfw:commentRss>
		<feedburner:origLink>http://gen5.info/q/2009/02/25/putting-freebase-in-a-star-schema/</feedburner:origLink></item>
		<item>
		<title>Using Linq To Tell if the Elements of an IEnumerable Are Distinct</title>
		<link>http://feedproxy.google.com/~r/Generation5/~3/-Konbk-yTmw/</link>
		<comments>http://gen5.info/q/2009/02/13/using-linq-to-tell-if-the-elements-of-an-ienumerable-are-distinct/#comments</comments>
		<pubDate>Fri, 13 Feb 2009 21:15:38 +0000</pubDate>
		<dc:creator>Paul Houle</dc:creator>
		
		<category><![CDATA[Dot Net]]></category>

		<category><![CDATA[Linq]]></category>

		<guid isPermaLink="false">http://gen5.info/q/?p=213</guid>
		<description><![CDATA[The Problem
I&#8217;ve got an IEnumerable&#60;T&#62; that contains a list of values:  I want to know if all of the values in that field are distinct.  The function should be easy to use a LINQ extension method and,  for bonus points,  simply expressed in LINQ itself
One Solution
First,  define an extension method
01   public static class IEnumerableExtensions {
02        [...]]]></description>
			<content:encoded><![CDATA[<h2><a href="http://gen5.info/q/wp-content/uploads/2009/02/ducks.jpg"><img class="title=&quot;ducks&quot;" src="http://gen5.info/q/wp-content/uploads/2009/02/ducks-300x195.jpg" alt="" width="300" height="195" align="right" /></a>The Problem</h2>
<p>I&#8217;ve got an <em>IEnumerable&lt;T&gt;</em> that contains a list of values:  I want to know if all of the values in that field are distinct.  The function should be easy to use a LINQ extension method and,  for bonus points,  simply expressed in LINQ itself</p>
<h2>One Solution</h2>
<p>First,  define an extension method</p>
<pre>01   public static class IEnumerableExtensions {
02        public static bool AllDistinct&lt;T&gt;(this IEnumerable&lt;T&gt; input) {
03            var count = input.Count();
04            return count == input.Distinct().Count();
05        }
06    }</pre>
<p>When you want to test an <em>IEnumerable&lt;T&gt;</em>,  just write</p>
<pre>07 var isAPotentialPrimaryKey=CandidateColumn.AllDistinct();</pre>
<h2><span id="more-213"></span>Analysis</h2>
<p>This solution is simple and probably scales as well as any solution in the worst case.  However,  it enumerates the <em>IEnumerable&lt;T&gt;</em> twice and does a full scan of the IEnumerable even if non-distinct elements are discovered early in the enumeration.  I could certainly make an implementation that aborts early using a <em>Dictionary&lt;T,bool&gt;</em> to store elements we&#8217;ve seen and a foreach loop,  but I wonder if anyone out there can think of a better pure-Linq solution.</p>
<img src="http://feeds.feedburner.com/~r/Generation5/~4/-Konbk-yTmw" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://gen5.info/q/2009/02/13/using-linq-to-tell-if-the-elements-of-an-ienumerable-are-distinct/feed/</wfw:commentRss>
		<feedburner:origLink>http://gen5.info/q/2009/02/13/using-linq-to-tell-if-the-elements-of-an-ienumerable-are-distinct/</feedburner:origLink></item>
		<item>
		<title>Subverting XAML: How To Inherit From Silverlight User Controls</title>
		<link>http://feedproxy.google.com/~r/Generation5/~3/BFodYIYDC7w/</link>
		<comments>http://gen5.info/q/2009/02/10/subverting-xaml-how-to-inherit-from-silverlight-user-controls/#comments</comments>
		<pubDate>Tue, 10 Feb 2009 20:31:21 +0000</pubDate>
		<dc:creator>Paul Houle</dc:creator>
		
		<category><![CDATA[Dot Net]]></category>

		<category><![CDATA[Silverlight]]></category>

		<guid isPermaLink="false">http://gen5.info/q/?p=188</guid>
		<description><![CDATA[The Problem
Many Silverlighters use XAML to design the visual appearance of their applications.  A UserControl defined with XAML is a DependencyObject that has a complex lifecycle:  there&#8217;s typically a .xaml file,  a .xaml.cs file,  and a .xaml.g.cs file that is generated by visual studio. The .xaml.g.cs file is generated by Visual Studio,  and ensures that [...]]]></description>
			<content:encoded><![CDATA[<h2><a href="http://gen5.info/q/wp-content/uploads/2009/02/noxamlinheritance.png"><img style="float: right;" title="noxamlinheritance" src="http://gen5.info/q/wp-content/uploads/2009/02/noxamlinheritance.png" alt="" width="343" height="254" /></a>The Problem</h2>
<p>Many Silverlighters use XAML to design the visual appearance of their applications.  A UserControl defined with XAML is a <em>DependencyObject</em> that has a complex lifecycle:  there&#8217;s typically a .xaml file,  a .xaml.cs file,  and a .xaml.g.cs file that is generated by visual studio. The .xaml.g.cs file is generated by Visual Studio,  and ensures that objects defined in the XAML file correspond to fields in the object (so they are seen in intellisense and available to your c# code.)  The XAML file is re-read at runtime,  and drives a process that instantiates the actual objects defined in the XAML file &#8212; a program can compile just fine,  but fail during initialization if the XAML file is invalid or if you break any of the assumptions of the system.</p>
<p>XAML is a pretty neat system because it&#8217;s not tied to WPF or WPF/E.  It can be used to initialize any kind of object:  for instance,  it can be used to design workflows in asynchronous server applications based on Windows Workflow Foundation.</p>
<p>One problem with XAML,  however,  is that you cannot write controls that inherit from a UserControl that defined in XAML.  Visual Studio might compile the classes for you,  but they will fail to initialize at at runtime.  This is serious because it makes it impossible to create subclasses that let you make small changes to the appearance or behavior of a control.</p>
<p><span id="more-188"></span></p>
<h2>Should We Just Give Up?</h2>
<p>One approach is to give up on XAML.  Although Visual Studio encourages you to create UserControls with XAML,  there&#8217;s nothing to stop you from creating a new class file and writing something like</p>
<pre>class MyUserControl:UserControl {
      public MyUserControl {
      var innerPanel=new StackPanel();
      Content=innerPanel;
      innerPanel.Children.Add(new MyFirstVisualElement());
      innerPanel.Children.Add(new MySecondVisualElement());
      ...
}</pre>
<p>UserControls defined like this have no (fundamental) problems with inheritance,  since you&#8217;ve got complete control of the initialization process.  If it were up to me,  I&#8217;d write a lot of controls like this,  but I work on a team with people who design controls in XAML,  so I needed a better solution.</p>
<h2>Or Should We Just Cheat?</h2>
<p>If we still want to use XAML to define the appearance of a control,  we&#8217;ve got another option.  We can move the XAML into a control that is outside the inheritance hierarchy:  the XAML control is then contained inside another control which doesn&#8217;t have restrictions as to how it is used:</p>
<p><a href="http://gen5.info/q/wp-content/uploads/2009/02/cheatingxaml.png"><img class="aligncenter size-full wp-image-192" title="cheatingxaml" src="http://gen5.info/q/wp-content/uploads/2009/02/cheatingxaml.png" alt="" /></a></p>
<p>The <em>XamlPanelInnards.xaml.cs</em> code-behind is almost completely empty:  it contains only the constructor created by Visual Studio&#8217;s XAML designer.  <em>XamlPanel.cs</em> contains a protected member called <em>InnerControl</em> which is of type <em>XamlPanelInnards</em>.  <em>InnerControl</em> is initialized in the constructor of <em>XamlPanel</em>,  and is also assigned to the <em>Content</em> property of the <em>XamlPanel</em>,  to make it visible.  Everything that you&#8217;d ordinarily put in the code-behind <em>XamlPanelInnards.xaml.cs</em> goes into XamlPanel.cs,  which uses the <em>InnerControl</em> member to get access to the <em>internal</em> members defined by the XAML designer.</p>
<h2>Step-by-Step Refactoring</h2>
<p>Let&#8217;s imagine that we&#8217;ve got an existing <em>UserControl</em> implemented in XAML called the <em>XamlPanel</em>.  We&#8217;d like to subclass the <em>XamlPanel </em>so we can use it for multiple purposes.  We can do this by:</p>
<ol>
<li>Renaming <em>XamlPanel</em> to <em>OldXamlPanel</em>;  this renames both the *.xaml and *.xaml.cs files so we can have them to look at</li>
<li>Use Visual Studio to create a new &#8220;Silverlight User Control&#8221; called <em>XamlPanelInnards</em>.  This will have both a *.xaml and *.xaml.xs file</li>
<li>Copy the contents of OldXamlPanel.xaml to XamlPanelInnards.cs.  Edit the x:Class attribute of the &lt;UserControl&gt; element to reflect the new class name,  &#8220;XamlPanelInnards&#8221;</li>
<li>Use Visual Studio to create a new <strong>class</strong>,  called XamlPanel.cs.  Do not create a XamlPanel.xaml.cs file!</li>
<li>Copy the constructor,  methods,  fields and properties from the OldXamlPanel.xaml.cs file to the XamlPanel.cs file.</li>
<li>Create a new private field in XamlPanel.cs like
<pre>private XamlPanelInnards InnerControl;</pre>
</li>
<li>Now we modify the constructor,  so that it does something like this:
<pre>public XamlPanel() {
   InnerControl = new XamlPanelInnards();
   Content = InnerControl;
   ... remainder of the constructor ...
}</pre>
</li>
<li>Most likely you&#8217;ll have compilation errors in the XamlPanel.cs file because there are a lot of references to public fields that are now in the InnerControl.  You need to track these down,  and replace code that looks like
<pre>TitleTextBlock.Text="Some Title";</pre>
<p>with</p>
<pre>InnerControl.TitleTextBlock.Text="SomeTitle";</pre>
</li>
<li>Any event handler attachments done from the XAML file will fail to work (they&#8217;ll trigger an error when you load the application.)  You&#8217;ll need to convert
<pre>&lt;Button x:PressMe ... Click="PressMe_OnClick"&gt;</pre>
<p>in the XAML file to</p>
<pre>InnerControl.PressMe.Click += PressMe_OnClick</pre>
<p>in the constructor of XamlPanel.</li>
<li style="text-align: left;">UI Elements that are defined in XAML are public,  so there&#8217;s a good chance that other classes might expect UI Elements inside the XamlPanel to be accessable.  You&#8217;ve got some choices:  (a) make the InnerControl public and point those references to the InnerControl (yuck!),  (b) selectively add properties to XamlPanel to let outside classes access the elements that they need to access,  or (c) rethink the encapsulation so that other class don&#8217;t need direct access to the members of XamlPanelInnards.</li>
<li>There are a few special methods that are defined in DependencyObject and FrameworkElement that you&#8217;ll need to pay attention to.  For instance,  if your class uses FindName to look up elements dynamically,  you need to replace
<pre>FindName("Control"+controlNumber);</pre>
<p>with</p>
<pre>InnerControl.FindName("Control"+controlNumber);</pre>
</li>
</ol>
<p>So far this is a refactoring operation:  we&#8217;re left with a program that does what it already did,  but is organized differently.  Where can we go from here?</p>
<h2>Extending The XamlPanel</h2>
<p>At this point,  the XamlPanel is an ordinary UserControl class.  It&#8217;s initialization logic is self-sufficient,  so it can inherit (it doesn&#8217;t necessarily have to derive from UserControl) and be inherited from quite freely.</p>
<p>If we want to change the behavior of the XamlPanel,  for instance,  we could declare it abstract and leave certain methods (such as event handlers) abstract.  Alternately,  methods could be declared as virtual.</p>
<p>A number of methods exist to customize the appearance of XamlPanel:  since XamlPanel can see the objects inside XamlPanelInnards,  it can change colors,  image sources and text contents.  If you&#8217;re interested in adding additional graphical element to the Innards,  the innards can contain an empty StackPanel &#8212; children of XamlPanel can Add() something to the StackPanel in their constructors.</p>
<p>Note that you can still include a child XamlPanel inside a control defined in XAML by writing something like</p>
<pre>&lt;OURNAMESPACE:SecondXamlPanel x:Name="MyInstance"&gt;</pre>
<p>you&#8217;re free to make XamlPanel and it&#8217;s children configurable via the DependencyObject mechanisms.  The one thing that you lose is public access to the graphical element inside the StackPanelInnards:  many developers would think that this increase in encapsulation is a good thing,  but it may involve a change in the way you do things.</p>
<h2>Related articles</h2>
<p>The community at <a href="http://silverlight.net/">silverlight.net</a> has pointed me to a few other articles about XAML,  inheritance and configuring Custom XAML controls.</p>
<p>In a lavishly illustrated blog entry,  Amyo Kabir explains <a href="http://amyotech.spaces.live.com/blog/cns!4B03AA8222DC3C5E!226.entry">how to make a XAML-defined control inherit from a user-defined base class</a>.  It&#8217;s the converse of what I&#8217;m doing in this article,  but it&#8217;s also a powerful technique:  I&#8217;m using it right now to implement a number of Steps in a Wizard.</p>
<p>Note that Silverlight has mechanisms for creating <a href="http://www.silverlightshow.net/items/Creating-a-Silverlight-Custom-Control-The-Basics.aspx">Custom Controls</a>:  these are controls that use the DependencyObject mechanism to be configurable via XAML files that include them.  If you&#8217;re interested in customizing individual controls rather than customizing subclasses,  this is an option worth exploring.</p>
<h2>Conclusion</h2>
<p>XAML is a great system for configuring complex objects,  but you can&#8217;t inherit from a Silverlight class defined in XAML.  By using a containment relation instead of an inheritance relation,  we can push the XAML-configured class outside of our inheritance hierarchy,  allowing the container class to participate as we desire.  This way we can have both visual UI generation and the software engineering benefits of inheritance.</p>
<div class="zemanta-pixie" style="margin-top: 10px; height: 15px;"><a class="zemanta-pixie-a" title="Zemified by Zemanta" href="http://reblog.zemanta.com/zemified/1309efe8-1c65-4657-8a29-aea199856afb/"><img class="zemanta-pixie-img" style="border: medium none ; float: right;" src="http://img.zemanta.com/reblog_e.png?x-id=1309efe8-1c65-4657-8a29-aea199856afb" alt="Reblog this post [with Zemanta]" /></a></div>
<img src="http://feeds.feedburner.com/~r/Generation5/~4/BFodYIYDC7w" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://gen5.info/q/2009/02/10/subverting-xaml-how-to-inherit-from-silverlight-user-controls/feed/</wfw:commentRss>
		<feedburner:origLink>http://gen5.info/q/2009/02/10/subverting-xaml-how-to-inherit-from-silverlight-user-controls/</feedburner:origLink></item>
		<item>
		<title>Twitter Joins Me</title>
		<link>http://feedproxy.google.com/~r/Generation5/~3/DCrlDypWzsA/</link>
		<comments>http://gen5.info/q/2009/01/27/twitter-joins-me/#comments</comments>
		<pubDate>Wed, 28 Jan 2009 01:24:50 +0000</pubDate>
		<dc:creator>Paul Houle</dc:creator>
		
		<category><![CDATA[Media]]></category>

		<guid isPermaLink="false">http://gen5.info/q/?p=180</guid>
		<description><![CDATA[I&#8217;ve watched Twitter from a distance for the past year or so,  sometimes making fun of it in blog comments,  but I never actually joined.
Last week I was looking at my web server log with the good old tail -f,  and found that several other bloggers had hotlinked the copy of the twitter fail whale [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve watched Twitter from a distance for the past year or so,  sometimes making fun of it in blog comments,  but I never actually joined.</p>
<p>Last week I was looking at my web server log with the good old <em>tail -f</em>,  and found that several other bloggers had hotlinked the copy of the <a href="http://gen5.info/q/wp-content/uploads/2008/08/twitter-whale.png">twitter fail whale</a> that was in my old &#8220;<a href="http://gen5.info/q/2008/08/27/what-do-you-do-when-youve-caught-an-exception/">What do you do if you catch an exception?</a>&#8221; post.  It turns out that my copy of the whale currently ranks #1 in Google Image Search.  It&#8217;s not bringing in a vast amount of traffic,  but it seems to be really engaging people,  because Google blog search is finding a new reference to the image just about every day.</p>
<p>I&#8217;ve been spending a lot of time developing sites where that&#8217;s the whole idea:  to build virtuous circles where people <a href="http://animalphotos.info/a/">find images</a>,  put them on their sites,  link back to my site,  which attracts more visitors.  Sometimes you can succeed at this without trying,  but making a business out of it is a matter of being lucky consistently.</p>
<p>After that I broke down and joined twitter.  My username on twitter is <a href="http://twitter.com/paul_houle">paul_houle</a>;  Right now it seems like a strange and lonely place,  but I can see some discipline in expressing oneself in 140 characters.  I&#8217;ve noticed quite a few characters already:  everything from the very corporate people who tweet in calculated sound bites to people that tweet like <em>/dev/random</em>.  Perhaps I can&#8217;t do anything about the &#8220;strange&#8221; bit,  but perhaps I can about the &#8220;lonely&#8221; part.  If you like the things that I blog about,  you&#8217;re <a href="http://twitter.com/paul_houle">certainly invited to follow me</a>,  and I&#8217;m interested in following like minded people.</p>
<img src="http://feeds.feedburner.com/~r/Generation5/~4/DCrlDypWzsA" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://gen5.info/q/2009/01/27/twitter-joins-me/feed/</wfw:commentRss>
		<feedburner:origLink>http://gen5.info/q/2009/01/27/twitter-joins-me/</feedburner:origLink></item>
		<item>
		<title>Manipulate HTML Forms With Silverlight 2</title>
		<link>http://feedproxy.google.com/~r/Generation5/~3/hrBRdWAL7MY/</link>
		<comments>http://gen5.info/q/2009/01/21/manipulate-html-forms-with-silverlight-2/#comments</comments>
		<pubDate>Wed, 21 Jan 2009 22:17:30 +0000</pubDate>
		<dc:creator>Paul Houle</dc:creator>
		
		<category><![CDATA[Silverlight]]></category>

		<guid isPermaLink="false">http://gen5.info/q/?p=164</guid>
		<description><![CDATA[
Introduction
Lately I&#8217;ve been working on a web application based on Silverlight 2.  The application uses a traditional web login system based on a cryptographically signed cookie.  In early development,  users logged in on an HTML page,  which would load a Silverlight application on successful login.  Users who didn&#8217;t have Silverlight installed would be asked to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.flickr.com/photos/skippy/11865024/"><img class="size-full wp-image-161" style="float: right; margin: 5px" title="11865024_af61ae2358_m" src="http://gen5.info/q/wp-content/uploads/2009/01/11865024_af61ae2358_m.jpg" border="0" alt="Remote Control" width="180" height="240" /></a></p>
<h2>Introduction</h2>
<p>Lately I&#8217;ve been working on a web application based on Silverlight 2.  The application uses a traditional web login system based on a <a href="http://www.google.com/url?sa=t&amp;source=web&amp;ct=res&amp;cd=1&amp;url=http%3A%2F%2Fpdos.csail.mit.edu%2Fpapers%2Fwebauth%3Asec10.pdf&amp;ei=a493SdXUCuPetgfhhPmgBg&amp;usg=AFQjCNFt9lqlfx_ALls4BBI9IPax51wBpw&amp;sig2=QsCk70fmt9nJASYWb4Dp0g">cryptographically signed cookie</a>.  In early development,  users logged in on an HTML page,  which would load a Silverlight application on successful login.  Users who didn&#8217;t have Silverlight installed would be asked to install it after logging in,  rather than before.</p>
<p>Although it&#8217;s (sometimes) possible to determine what plug-ins a user has installed using Javascript,  the methods are dependent on the specific browser and the plug-ins.  We went for a simple and effective method:  make the login form a Silverlight application,  so that users would be prompted to install Silverlight before logging in.</p>
<p>Our solutionn  was to make the Silverlight application a drop-in replacement for the original HTML form.  The Silverlight application controls a hidden HTML form:  when a user hits the &#8220;Log In&#8221; buttonin the Silverlight application,  the application inserts the appropriate information into the HTML form and submits it.  This article describes the technique in detail.<span id="more-164"></span></p>
<h2>The HTML Form</h2>
<p>The HTML form is straightforward &#8212; the main thing that&#8217;s funky about it is that all of the controls are invisible,  since we don&#8217;t want users to see them or interact with them directly:</p>
<pre>[01] &lt;form id="loginForm" method="post" action=""&gt;
[02]     &lt;div class="loginError" id="loginError"&gt;
[03]        &lt;%= Server.HtmlEncode(errorCondition) %&gt;
[04]     &lt;/div&gt;
[05]     &lt;input type="hidden" name="username" id="usernameField"
[06]        value="&lt;%= Server.HtmlEncode(username) %&gt;" /&gt;
[07]     &lt;input type="hidden" name="password" id="passwordField"  /&gt;
[08] &lt;/form&gt;</pre>
<p>The CSS file for the form contains the following directive:</p>
<pre>[09] .loginError { display:none }</pre>
<p>to prevent the error message from being visible on the HTML page.</p>
<h2>Executing Javascript In Silverlight 2</h2>
<p>A Silverlight 2 application can execute Javascript using the <em>HtmlPage.Window.Eval() </em>method;  HtmlPage.Window is static,  so you can use it anywhere,  so long as you&#8217;re running in the UI thread.  Our simple application doesn&#8217;t do communication and doesn&#8217;t launch new threads,  so we don&#8217;t need to worry about threads.  We add a simple wrapper method to help the rest of the code flow off the tips of our fingers</p>
<pre>[10] using System.Windows.Browser
[11] ...
[12] namespace MyApplication {
[13] public partial class Page : UserControl {
[14]      ...
[15]        Object JsEval(string jsCode) {
[16]            return HtmlPage.Window.Eval(jsCode);
[17]        }</pre>
<p>Note that Visual Studio doesn&#8217;t put the <em>using</em> in by default,  so you&#8217;ll need to add it.  The application is really simple,   with just two event handlers and a little XAML to define the interface,  so it&#8217;s all in a single class,  the Page.xaml and Page.xaml.cs files created when I made the project in VIsual Studio.</p>
<p>Note that JsEval returns an Object,  so you can look at the return value of a javascript evaluation.  Numbers are returned as doubles and strings are returned as strings,  which can be quite useful.  References to HTML elements are returned as instances of the HtmlElement class.  HtmlElement has some useful methods,  such as GetAttribute() and SetAttribute(),  but I&#8217;ve found that I get more reliable results by writing snippets of Javascript code.</p>
<h2>Finding HTML Elements</h2>
<p>One practical is problem is how to find the HTML Elements on the page that you&#8217;d like to work with.  I&#8217;ve been spoiled by the $() function in Prototype and JQuery,  so I like to access HTML elements by id.  There isn&#8217;t a standard method to do this in all browser,  so I slipped the following snippet of Javascript into the document:</p>
<pre>[18]    function returnObjById(id) {
[19]        if (document.getElementById)
[20]            var returnVar = document.getElementById(id);
[21]        else if (document.all)
[22]            var returnVar = document.all[id];
[23]        else if (document.layers)
[24]            var returnVar = document.layers[id];
[25]        return returnVar;
[26]    }</pre>
<p>(code snippet courtesy of <a href="http://www.netlobo.com/javascript_get_element_id.html">NetLobo</a>)</p>
<p>I wrote a few convenience methods in C# inside my Page class:</p>
<pre>[27]   String IdFetchScript(string elementId) {
[28]        return String.Format("returnObjById('{0}')", elementId);
[29]   }
[30]
[31]   HtmlElement FetchHtmlElement(string elementId) {
[32]        return (HtmlElement)JsEval(IdFetchScript(elementId));
[33]   }</pre>
<h2>Manipulating HTML Elements</h2>
<p>At this point you can do anything that can be done in Javascript.  That said,  all I need is a few methods to get information in and out of the form:</p>
<pre>[34]       string GetTextInElement(string elementId) {
[35]           HtmlElement e = FetchHtmlElement(elementId);
[36]           if (e == null)
[37]               return "";
[38]
[39]           return (string)JsEval(IdFetchScript(elementId) + ".innerHTML");
[40]       }
[41]
[42]       void Set(string fieldId, string value) {
[43]            HtmlElement e = FetchHtmlElement(fieldId);
[44]            e.SetAttribute("value", value);
[45]       }
[46]
[47]       string GetFormFieldValue(string fieldId) {
[48]            return (string) JsEval(IdFetchScript(fieldId)+".value");
[49]       }
[50]
[51]       void SubmitForm(string formId) {
[52]            JsEval(IdFetchScript(formId) + ".submit()");
[53]       }</pre>
<p>Isolating the Javascript into a set of helper methods helps make the code maintainable:  these methods are usable by a C#er who isn&#8217;t a Javascript expert &#8212; if we discover problems with cross-browser compatibility,  we can fix them in a single place.</p>
<h2>Putting it all together</h2>
<p>A little bit of code in the constructor loads form information into the application:</p>
<pre>[54]        public Page() {
[55]            ...
[56]            LoginButton.Click += LoginButton_Click;
[57]            UsernameInput.Text = GetFormFieldValue("usernameField");
[58]            ErrorMessage.Text = GetTextInElement("loginError");
[59]         }</pre>
<p>The LoginButton_Click event handler populates the invisible HTML form and submits it:</p>
<pre>[60]      void LoginButton_Click(object sender, RoutedEventArgs e) {
[61]           SetFormFieldValue("usernameField", UsernameInput.Text);
[62]           SetFormFieldValue("passwordField", PasswordInput.Password);
[63]           SubmitForm("loginForm");</pre>
<pre>[64]       }</pre>
<h2>Conclusion</h2>
<p>Although Silverlight 2 is capable of direct http communication with a web server,  sometimes it&#8217;s convenient for a Silverlight application to directly manipulate HTML forms.   This article presents sample source code that simplifies that task.</p>
<p><strong>P.S.</strong> A commenter on the Silverlight.net forums pointed me to <a href="http://timheuer.com/blog/archive/2008/03/09/calling-javascript-functions-from-silverlight-2.aspx">another article by Tim Heuer</a> that describes alternate methods for manipulating the DOM and calling Javascript functions.</p>
<p><small>(Thanks <a href="http://www.flickr.com/photos/skippy/">skippy</a> for the <a href="http://www.flickr.com/photos/skippy/11865024/">remote control image.</a>)</small></p>
<p><a href="http://www.dotnetkicks.com/kick/?url=http%3a%2f%2fgen5.info%2fq%2f2009%2f01%2f21%2fmanipulate-html-forms-with-silverlight-2%2f"><img src="http://www.dotnetkicks.com/Services/Images/KickItImageGenerator.ashx?url=http%3a%2f%2fgen5.info%2fq%2f2009%2f01%2f21%2fmanipulate-html-forms-with-silverlight-2%2f" border="0" alt="kick it on DotNetKicks.com" /></a></p>
<img src="http://feeds.feedburner.com/~r/Generation5/~4/hrBRdWAL7MY" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://gen5.info/q/2009/01/21/manipulate-html-forms-with-silverlight-2/feed/</wfw:commentRss>
		<feedburner:origLink>http://gen5.info/q/2009/01/21/manipulate-html-forms-with-silverlight-2/</feedburner:origLink></item>
		<item>
		<title>require(),  require_once() and Dynamic Autoloading in PHP</title>
		<link>http://feedproxy.google.com/~r/Generation5/~3/CdfTYwypcco/</link>
		<comments>http://gen5.info/q/2009/01/09/an-awesome-autoloader-for-php/#comments</comments>
		<pubDate>Fri, 09 Jan 2009 21:19:16 +0000</pubDate>
		<dc:creator>Paul Houle</dc:creator>
		
		<category><![CDATA[PHP]]></category>

		<category><![CDATA[r]]></category>

		<guid isPermaLink="false">http://gen5.info/q/?p=137</guid>
		<description><![CDATA[Introduction
I program in PHP a lot,  but I&#8217;ve avoided using autoloaders,  except when I&#8217;ve been working in frameworks,  such as symfony,   that include an autoloader.  Last month I started working on a system that&#8217;s designed to be part of a software product line:  many scripts,  for instance,  are going to need to deserialize objects that [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>
<p>I program in PHP a lot,  but I&#8217;ve avoided using autoloaders,  except when I&#8217;ve been working in frameworks,  such as symfony,   that include an autoloader.  Last month I started working on a system that&#8217;s designed to be part of a software product line:  many scripts,  for instance,  are going to need to deserialize objects that didn&#8217;t exist when the script was written:  autoloading went from a convenience to a necessity.</p>
<p>The majority of autoloaders use a fixed mapping between class names and PHP file names.  Although that&#8217;s fine if you obey a strict &#8220;one class,  one file&#8221; policy,  that&#8217;s a policy that I don&#8217;t follow 100% of the time.  An additional problem is that today&#8217;s PHP applications often reuse code from multiple frameworks and libraries that use different naming conventions:  often applications end up registering multiple autoloaders.  I was looking for an autoloader that &#8220;just works&#8221; with a minimum of convention and configuration &#8212; and I found that in a <a href="http://ajbrown.org/blog/2008/12/02/an-auto-loader-using-php-tokenizer.html">recent autoloader developed by A.J. Brown</a>.<span id="more-137"></span></p>
<p>After presenting the way that I integrated Brown&#8217;s autoloader into my in-house frameowrk,  this article considering the growing controversy over require(),  require_once() and autoloading performance:  to make a long story short,  people are experiencing very different results in different environments,  and the growing popularity of autoloading is going to lead to changes in PHP and the ecosystem around it.</p>
<h2>History:  PHP 4 And The Bad Old Days</h2>
<p>In the past,  PHP programmers have included class definitions in their programs with the following four built-in functions:</p>
<ul>
<li><em>include</em></li>
<li><em>require</em></li>
<li><em>include_once</em></li>
<li><em>require_once</em></li>
</ul>
<p>The difference between the <em>include</em> and <em>require</em> functions is that execution of a program will continue if a call to <em>include</em> fails and will result in an error if a call to <em>require</em> fails.  <em>require_</em> and <em>require_once</em> are reccomended for general use,  since you&#8217;d probably rather have an application fail if libraries are missing rather than barrel on with unpredictable results.  (Particularly if the missing library is responsible for authentication.)</p>
<p>If you write</p>
<pre>01 require "filename.php";</pre>
<p>PHP will scan the PHP include_path for a directory that contains a file called &#8220;filename.php&#8221;;  it then executes the content of &#8220;filename.php&#8221; right where the function is called.  You can do some pretty funky things this way,  for instance,  you can write</p>
<pre>02 for (int i=0;i&lt;10;$i++) {
03    require "template-$i.php";
04 }</pre>
<p>to cause the sequential execution of a  namset of PHP files named &#8220;template-0.php&#8221; through &#8220;template-9.php.&#8221;  A <em>required</em> file has access to local variables in the scope that require is called,  so require is particularly useful for web templating and situations that require dynamic dispatch (when you compute the filename.)  A cool,  but slightly obscure feature,  is that an included file can return a value.  If an source file,  say &#8220;compute-value.php&#8221; uses the return statement,</p>
<pre>05 return $some_value;</pre>
<p>the value $some_value will be return by require:</p>
<pre>06 $value=require "compute-value.php";</pre>
<p>These features can be used to create an MVC-like framework where views and controllers are implemented as source files rather than objects.</p>
<p>require isn&#8217;t so appropriate,  however,  when you&#8217;re requiring a file that contains class definitions.  Imagine we have a hiearchy of classes like Entity -&gt; Picture -&gt; PictureOfACar,PictureOfAnAnimal.  It&#8217;s quite tempting for PictureofACar.php and PictureofAnAnimal.php to both</p>
<pre>07 require "Picture.php";</pre>
<p>this works fine if an application only uses PictureOfACar.php and requires it once by writing</p>
<pre>08 require "PictureOfACar.php";</pre>
<p>It fails,  however,  if an application requires both PictureOfACar and PictureOfAnAnimal since PHP only allows a class to be defined once.</p>
<p><em>require_once</em> neatly solves this problem by keeping a list of files that have been required and doing nothing if a file has already been required.  You can use require_once in any place where you&#8217;d like to guarantee that a class is available,  and expect that things &#8220;just work&#8221;</p>
<h2>__autoload</h2>
<p>Well,  things don&#8217;t always &#8220;just work&#8221;.  Although large PHP applications can be maintained with require_once,  it becomes an increasing burden to keep track of require files as applications get larger.  require_once also breaks down once frameworks start to dynamically instantiate classes that are specified in configuration files.  The developers of PHP found a creative solution in the __autoload function,  a &#8220;magic&#8221; function that you can define.  __autoload($class_name) gets called whenever a PHP references an undefined class.  A very simple __autoload implementation can work well:  for instance,  the PHP manual page for __autoload has the following code snippet:</p>
<pre>09 function _autoload($class_name) {
10    require_once $class_name . '.php';
11 }</pre>
<p>If you write</p>
<pre>12 $instance=new MyUndefinedClass();</pre>
<p>this autoloader will search the PHP include path for &#8220;MyUndefinedClass.php.&#8221;  (A real autoloader should be a little more complex than this:  the above autoloader could be persuaded to load from an unsafe filename if user input is used to instantiate a class dynamically,  i.e.</p>
<pre>13 $instance=new $derived_class_name();</pre>
<h2>Static autoloading and autoloader proliferation</h2>
<p>Unlike Java,  PHP does not have a standard to relate source file names with class names.  Some PHP developers imitate the Java convention to define one file per class and name their files something like</p>
<p>ClassName.php,  or<br />
ClassName.class.php</p>
<p>A typical project that uses code from several sources will probably have sections that are written with different conventions  For instance,  the Zend framework turns &#8220;_&#8221; into &#8220;/&#8221; when it creates paths,  so the definition of &#8220;Zend_Loader&#8221; would be found underneath &#8220;Zend/Loader.php.&#8221;</p>
<p>A single autoloader could try a few different conventions,  but the answer that&#8217;s become most widespread is for each PHP framework or library to contain it&#8217;s own autoloader.  PHP 5.2 introduced the spl_register_autoload() function to replace __autoload().  spl_register_autoload() allows us to register multiple autoloaders,  instead of just one.  This is ugly,  but it works.</p>
<h2>One Class Per File?</h2>
<p>A final critique of static autoloading is that it&#8217;s not universally held that &#8220;one class one file&#8221; is the best practice for PHP development.  One of the advantages of OO scripting languages such as PHP and Python is that you can start with a simple procedural script and gradually evolve it into an OO program by gradual refactoring.  A convention that requires to developers to create a new file for each class tends to:</p>
<ol>
<li>Discourage developers from creating classes</li>
<li>Discourage developers from renaming classes</li>
<li>Discourage developers from deleting classes</li>
</ol>
<p>These can cumulatively lead programmers to make decisions based on what&#8217;s convenient to do with their development tools,  not based on what&#8217;s good for the software in the long term.  These considerations need to be balanced against:</p>
<ol>
<li>The ease of finding classes when they are organized &#8220;once class per file&#8221;,</li>
<li>The difficulty of navigating huge source code files that contain a large number of classes,  and</li>
<li>Toolset simplification and support.</li>
</ol>
<p>The last of these is particularly important when we compare PHP with Java.  Since the Java compiler enforces a particular convention,  that convention is supported by Java IDE&#8217;s.  The problems that I mention above are greatly reduced if you use an IDE such as Eclipse,  which is smart enough to rename files when you rename a class.  PHP developers don&#8217;t benefit from IDEs that are so advanced &#8212; it&#8217;s much more difficult for IDE&#8217;s to understand a dynamic language.  Java also supports inner classes,  which allow captive classes (that are only accessed from within an enclosing class) to be defined inside the same file as the enclosing class.  Forcing captive classes to be defined in separate files can cause a bewildering number of files to appear,  which,  in turn,  can discourage developers from using captive classes &#8212; and that can lead to big mistakes.</p>
<h2>Dynamic Autoloading</h2>
<p><a href="http://ajbrown.org/blog/">A. J. Brown</a> has developed an <a href="http://ajbrown.org/blog/2008/12/02/an-auto-loader-using-php-tokenizer.html">autoloader</a> that uses PHP&#8217;s tokenizer() to search a directory full of PHP files,  search the files for classes,  and create a mapping from class names to php source files.  <em>tokenizer() </em>is a remarkable metaprogramming facility that makes it easy to write PHP programs that interpret PHP source.  In 298 lines of code,  Brown defines three classes.  To make his autoloader fit into my in-house framework,  I copied his classes into two files:</p>
<ul>
<li>lib/nails_core/autoloader.php: ClassFileMap, ClassFileMapAutoloader</li>
<li>lib/nails_core/autoloader_initialize.php: ClassFileMapFactory</li>
</ul>
<p>I&#8217;m concerned about the overhead of repeatedly traversing PHP library directories and parsing the files,  so I run the following program to create the <em>ClassFileMap</em>,  serialize it,  and store it in a file:</p>
<pre><span style="text-decoration: underline;"><strong>bin/create_class_map.php:</strong></span>
14 &lt;?php
15
16 $SUPRESS_AUTOLOAD=true;
17 require_once(dirname(__FILE__)."/../_config.php");
18 require_once "nails_core/autoloader_initialize.php";
19 $lib_class_map = ClassFileMapFactory::generate($APP_BASE."/lib");
20 $_autoloader = new ClassFileMapAutoloader();
21 $_autoloader-&gt;addClassFileMap($lib_class_map);
22 $data=serialize($_autoloader);
23 file_put_contents("$APP_BASE/var/classmap.ser",$data);</pre>
<p>Note that I&#8217;m serializing the <em>ClassFileMapAutoloader </em>rather than the <em>ClassFileMap</em>,  since I&#8217;d like to have the option of specifying more than one search directory.  To follow the &#8220;convention over configuration&#8221; philosophy,  a future version will probable traverse all of the directories in the <em>php_include_path</em>.</p>
<p>All of the PHP pages,  controllers and command-line scripts in my framework have the line</p>
<pre>24 require_once(dirname(__FILE__)."/../_config.php");</pre>
<p>which includes a file that is responsible for configuring the application and the PHP environment.   I added a bit of code to the _config.php to support the autoloader:</p>
<pre><span style="text-decoration: underline;"><strong>_config.php:</strong></span>
25 &lt;?php
26 $APP_BASE = "/where/app/is/in/the/filesystem";
   ...
27 if (!isset($SUPPRESS_AUTOLOADER)) {
28    require_once "nails_core/autoloader.php";
29    $_autoloader=unserialize(file_get_contents($APP_BASE."/var/classmap.ser"));
30    $_autoloader-&gt;registerAutoload();
31 };</pre>
<p>Pretty simple.</p>
<h2>Autoloading And Performance</h2>
<p>Although there&#8217;s plenty of controversy about issues of software maintainability,  I&#8217;ve learned the hard way that it&#8217;s hard to make blanket statements about performance &#8212; results can differ based on your workload and the exact environment you&#8217;re working in.  Although Brown makes the statement that &#8220;We do add a slight overhead to the application,&#8221;  many programmers are discovering that autoloading improves performance over require_once:</p>
<p><a href="http://framework.zend.com/wiki/display/ZFDEV/Performance+-+Requiring+the+Autoloader">Zend_Loader Performance Analysis</a><br />
<a href="http://www.mikebrittain.com/blog/2008/03/27/autoloading-php-classes-to-reduce-cpu-usage/">Autoloading Classes To Reduce CPU Usage</a></p>
<p>There seem to be two issues here:  first of all,  most systems that use require_once are going to err on the side of including more files than they need rather than fewer &#8212; it&#8217;s better to make a system slower and bloated than to make it incorrect.  A system that uses autoloading will spend less time loading classes,  and,  just as important,  less memory storing them.  Second,  PHP programmers appear to be experience variable results with require() and require_once():</p>
<p><a href="http://www.techyouruniverse.com/software/php-performance-tip-require-versus-require_once">Wikia Developer Finds require_once() Slower Than require()</a><br />
<a href="http://arin.me/php/php-require-vs-include-vs-require_once-vs-include_once-performance-test">Another Developer Finds Little Difference</a><br />
<a href="http://article.gmane.org/gmane.comp.php.pear.devel/43593">Yet Another Developer Finds It Depends On His Cache Configuration</a><br />
<a href="http://www.nabble.com/require_once-performance-issue-in-ZF-td19218102.html">Rumor has it,  PHP 5.3 improves require_once() performance</a></p>
<p>One major issues is that require_once() calls the realpath() C function,  which in turn calls the lstat() system call.  The cost of system calls can vary quite radically on different operating systems and even different filesystems.  The use of an opcode cache such as XCache or APC can also change the situation.</p>
<p>It appears that current opcode caches (as of Jan 2008) don&#8217;t efficiently support autoloading:<br />
<a href="http://blog.digitalstruct.com/2007/12/23/zend-framework-performance-zend_loader/"><br />
Mike Willbanks Experience Slowdown With Zend_Loader</a><br />
<a href="http://pooteeweet.org/blog/538/">APC Developer States That Autoloading is Incompatible With Cacheing</a><br />
<a href="http://forum.lighttpd.net/topic/17500">Rambling Discussion of the state of autoloading with XCache<br />
</a></p>
<p>the issue is that they don&#8217;t,  at compile time,  know what files are going be required by the application.  Opcode caches also reduce the overhead of loading superfluous classes,  so they don&#8217;t get the benefits experienced with plain PHP.</p>
<p>It all reminds me of the situation with <em>synchronized</em> in Java.  In early implementations of Java,  <em>synchronized</em> method calls had an execution time nearly ten times longer than ordinary message calls.  Many developers designed systems (such as the Swing windowing toolkit) around this performance problem.  Modern VM&#8217;s have greatly accelerated the synchronization mechanism and can often optimize superfluous synchronizations away &#8212; so the performance advice of a decade ago is bunk.</p>
<p>Language such as Java are able to treat individual classes as compilation units:  and I&#8217;d imagine that,  with certain restrictions,  a PHP bytecode cache should be able to do just that.  This may involve some changes in the implementation of PHP.</p>
<h2>Conclusion</h2>
<p>Autoloading is an increasingly popular practice among PHP developers.  Autoloading improves development productivity in two ways:</p>
<ol>
<li>It frees developers from thinking about loading the source files needed by an application,  and</li>
<li>It enables dynamic dispatch,  situations where a script doesn&#8217;t know about all the classes it will interact with when it&#8217;s written</li>
</ol>
<p>Since PHP allows developers to create their own autoloaders,  a number of autoloaders exist.  Many frameworks,  such as the Zend Framework,  symfony,  and CodeIgniter,  come with autoloaders &#8212; as a result,  some PHP applications might contain more than one autoloader.  Most autoloaders require that classes be stored in files with specific names,  but Brown&#8217;s autoloader can scan directory trees to automatically  locate PHP classes and map them to filenames.  Eliminating the need for both convention and configuration,  I think it&#8217;s a significant advance:  in many cases I think it could replace the proliferation of autoloaders that we&#8217;re seeing today.</p>
<p>You&#8217;ll hear very different stories about the performance of autoload,  require_once() and other class loading mechanisms from different people.  The precise workload,  operating system,  PHP version,  and the use of an opcode cache appear to be important factors.  Widespread use of autoloading will probably result in optimization of autoloading throughout the PHP ecosystem.</p>
<img src="http://feeds.feedburner.com/~r/Generation5/~4/CdfTYwypcco" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://gen5.info/q/2009/01/09/an-awesome-autoloader-for-php/feed/</wfw:commentRss>
		<feedburner:origLink>http://gen5.info/q/2009/01/09/an-awesome-autoloader-for-php/</feedburner:origLink></item>
	</channel>
</rss><!-- Dynamic Page Served (once) in 0.471 seconds --><!-- Cached page served by WP-Cache -->
