<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">

 <title>al3xandr3</title>
 
 <link type="text/html" rel="alternate" href="http://domain/" />
 <updated>2012-02-16T06:27:08-08:00</updated>
 <id>http://al3xandr3.github.com/</id>
 <author>
   <name>al3xandr3</name>
   <email>al3xandr3@gmail.com</email>
 </author>

 
 <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/al3xandr3" /><feedburner:info uri="al3xandr3" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><entry>
   <title>How to get into the Semantic Web</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/NSJzhHHIrvE/semanticweb.html" />
   <updated>2012-01-11T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2012/01/11/semanticweb</id>
   <category term="data" label="data" />
   <category term="semanticweb" label="semanticweb" />
   <category term="javascript" label="javascript" />
   <category term="statistics" label="statistics" />
   <category term="visualization" label="visualization" />
   <category term="SPARQL" label="SPARQL" />
   
   <content type="html">&lt;p&gt;Practical examples on how to get onto the semantic web and on using it.&lt;/p&gt;

&lt;h3&gt;Getting in There&lt;/h3&gt;

&lt;p&gt;Start by creating a personal online profile for the semantic web using the Friend-Of-A-Friend (FOAF) vocabulary. FOAF has become the standard for personal profiles on the semantic web and, as the name implies, it also lets you link to people you know.&lt;/p&gt;

&lt;p&gt;I used the &lt;a href="http://www.ldodds.com/foaf/foaf-a-matic"&gt;foaf-a-matic online tool&lt;/a&gt; and then uploaded the results to my site's &lt;a href="http://al3xandr3.github.com/foaf.rdf"&gt;foaf.rdf&lt;/a&gt;. Easy.&lt;/p&gt;

&lt;p&gt;With this in place, I can already use SPARQL, the semantic web query language, to ask what the semantic web knows about me:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX foaf: &amp;lt;http://xmlns.com/foaf/0.1/&amp;gt;
SELECT ?property ?value
FROM &amp;lt;http://al3xandr3.github.com/foaf.rdf&amp;gt;
WHERE { 
  ?me foaf:name "Alexandre Matos Martins" .
  ?me ?property ?value .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="http://www.sparql.org/sparql?query=++++PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0D%0A++++SELECT+%3Fproperty+%3Fvalue%0D%0A++++FROM+%3Chttp%3A%2F%2Fal3xandr3.github.com%2Ffoaf.rdf%3E%0D%0A++++where+%7B+%0D%0A++++++%3Fme+foaf%3Aname+%22Alexandre+Matos+Martins%22+.%0D%0A++++++%3Fme+%3Fproperty+%3Fvalue+.%0D%0A++++%7D&amp;default-graph-uri=&amp;output=xml&amp;stylesheet=%2Fxml-to-html.xsl" target="_blank"&gt;run on sparql.org &amp;rarr;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Another Online Profile?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How many times have you filled in your personal profile information on web sites? Google+, Facebook, YouTube, Yahoo!, MSN, Blogspot, Amazon, Twitter, LinkedIn, Flickr, Tumblr, Ebay, mySpace, hi5!, etc... How many more times will we need to do it again?&lt;/p&gt;

&lt;p&gt;The semantic web is about sharing data in an agreed-upon format, so that the data can be easily linked to and (re)used. Thus, once my profile is on the semantic web, any new site that I sign up for can just read this data instead of asking me to fill it in.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Sharing data, in an agreed upon format, is an incentive for re-use and disincentive for wasteful duplication. Choose #semanticweb.&lt;/p&gt;&lt;/blockquote&gt;

&lt;h3&gt;Adding the Site&lt;/h3&gt;

&lt;p&gt;The next step is to add the web site to the semantic web.&lt;/p&gt;

&lt;p&gt;Augmenting a web site's content with semantic data facilitates data sharing: web pages essentially become little standalone data repositories that semantic web tools can understand.&lt;/p&gt;

&lt;p&gt;The way to do it is simple enough: add (invisible) HTML markup to the existing web pages that specifies the meaning (the semantics) of the HTML elements.&lt;/p&gt;

&lt;p&gt;This extra markup, which adds meaning to web pages, is defined by a W3C standard called RDFa. Quoting &lt;a href="http://en.wikipedia.org/wiki/RDFa"&gt;Wikipedia&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;RDFa defines how to embed RDF subject-predicate-object expressions within XHTML documents, it also enables the extraction of RDF model triples by compliant user agents.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;For example, in the &lt;strong&gt;About&lt;/strong&gt; section of this site I added some RDFa markup saying that the whole section is about me, that "Alexandre Matos Martins" is my name, that I am of type Person, that the Twitter link is one of my OnlineAccounts, etc...&lt;/p&gt;

&lt;p&gt;With the RDFa markup added to the site, it is now possible to use semantic web tools to query the website's data. For example, to find the topics (subjects) of the site (the &lt;em&gt;Tags&lt;/em&gt; cloud):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX dcterms: &amp;lt;http://purl.org/dc/terms/&amp;gt;
SELECT ?subject 
FROM &amp;lt;http://www.w3.org/2007/08/pyRdfa/extract?uri=http://al3xandr3.github.com/&amp;gt;
WHERE {
 &amp;lt;http://al3xandr3.github.com&amp;gt; ?predicate ?subject . 
 ?s dcterms:subject ?subject .
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="http://www.sparql.org/sparql?query=++++PREFIX++dcterms%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0A++++SELECT+%3Fsubject+%0D%0A++++FROM+%3Chttp%3A%2F%2Fwww.w3.org%2F2007%2F08%2FpyRdfa%2Fextract%3Furi%3Dhttp%3A%2F%2Fal3xandr3.github.com%2F%3E%0D%0A++++WHERE+%7B%0D%0A+++++%3Chttp%3A%2F%2Fal3xandr3.github.com%3E+%3Fpredicate+%3Fsubject+.+%0D%0A+++++%3Fs+dcterms%3Asubject+%3Fsubject+.%0D%0A++++%7D&amp;default-graph-uri=&amp;output=xml&amp;stylesheet=%2Fxml-to-html.xsl" target="_blank"&gt;run on sparql.org &amp;rarr;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that this is a live search, i.e. whenever I add a new topic (subject) to the site's tag cloud, re-running the query above will show the new topic as well.&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;RDFa augments web pages as standalone data repositories that #semanticweb can understand, doubling as normal web pages, imagine web scraping done right.&lt;/p&gt;&lt;/blockquote&gt;

&lt;h3&gt;Using It&lt;/h3&gt;

&lt;p&gt;Why is all this data useful? For a more futuristic use case, check the &lt;a href="http://al3xandr3.github.com/2011/12/18/data.html"&gt;Data, Data, Data! semantic web use case on Xmas&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the meantime, we can already play around with more mundane things, for example predicting how likely it is that I will write a Twitter quote on any given day.&lt;/p&gt;

&lt;p&gt;I collected a few of my Twitter quotes on the &lt;a href="http://al3xandr3.github.com/pages/quotes.html"&gt;quotes page&lt;/a&gt;, and each quote has an RDFa date on it.&lt;/p&gt;

&lt;p&gt;So we can use the following SPARQL query to fetch, directly from the quotes page, the dates and how many quotes I wrote on each date:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;PREFIX  dcterms: &amp;lt;http://purl.org/dc/terms/&amp;gt;
SELECT ?date (count(?subject) AS ?total)
FROM &amp;lt;http://www.w3.org/2007/08/pyRdfa/extract?uri=http://al3xandr3.github.com/pages/quotes.html&amp;gt;
WHERE { 
  ?subject dcterms:date ?date .
}
GROUP BY ?date
ORDER BY ?date
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="http://www.sparql.org/sparql?query=++++PREFIX++dcterms%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0D%0A++++SELECT+%3Fdate+%28count%28%3Fsubject%29+AS+%3Ftotal%29%0D%0A++++FROM+%3Chttp%3A%2F%2Fwww.w3.org%2F2007%2F08%2FpyRdfa%2Fextract%3Furi%3Dhttp%3A%2F%2Fal3xandr3.github.com%2Fpages%2Fquotes.html%3E%0D%0A++++WHERE+%7B+%0D%0A++++++%3Fsubject+dcterms%3Adate+%3Fdate+.%0D%0A++++%7D%0D%0A++++GROUP+BY+%3Fdate%0D%0A++++ORDER+BY+%3Fdate&amp;default-graph-uri=&amp;output=xml&amp;stylesheet=%2Fxml-to-html.xsl" target="_blank"&gt;run on sparql.org &amp;rarr;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I look up the day of the week for each of those dates and sum the number of quotes per day of the week.&lt;/p&gt;

&lt;p&gt;With this, I can estimate the probability for each day of the week (that day's share of all quotes) and then simply look up the probability for any given day.&lt;/p&gt;
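&lt;p&gt;The arithmetic can be sketched offline in a few lines of Ruby. This is only an illustration: the rows below are made-up sample values shaped like the query result (a date and a quote count), and the variable names are hypothetical, not taken from the page's script.&lt;/p&gt;

```ruby
require "date"

# Hypothetical sample rows (date, quote count); the values are made up
# and only mimic the shape of the SPARQL result described above.
rows = [
  ["2012-01-05", 2], # a Thursday
  ["2012-01-07", 1], # a Saturday
  ["2012-01-12", 1]  # a Thursday
]

# Sum the quote counts per day of the week ("Mon", "Tue", ...).
per_day = Hash.new(0)
rows.each do |date, total|
  per_day[Date.parse(date).strftime("%a")] += total
end

# Each weekday's share of all quotes, as a rounded percentage.
grand_total = per_day.values.sum
share = per_day.transform_values { |n| (100.0 * n / grand_total).round }

share.each { |day, pct| puts "#{day}: #{pct}%" }
```

&lt;p&gt;With this sample, Thursday holds 3 of the 4 quotes (75%) and Saturday 1 of 4 (25%). Note that this is a share of quotes per weekday, used here as a rough stand-in for a probability.&lt;/p&gt;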

&lt;p&gt;For a full (a)live data experience, this is implemented in JavaScript that fetches the data when this page is opened.&lt;/p&gt;

&lt;p&gt;I use jQuery's .ajax to fetch the results of the SPARQL query defined above, do some data manipulation, plot it using &lt;a href="http://mbostock.github.com/d3/"&gt;d3.js&lt;/a&gt;, and finally output the prediction.&lt;/p&gt;

&lt;p&gt;Look at the source code of this page, to see how it works.&lt;/p&gt;

&lt;h4&gt;Quotes per day of the week:&lt;/h4&gt;

&lt;br /&gt;


&lt;br /&gt;


&lt;script type="text/javascript" src="http://www.datejs.com/build/date.js"&gt;&lt;/script&gt;


&lt;script type="text/javascript" src="http://mbostock.github.com/d3/d3.js"&gt;&lt;/script&gt;




&lt;div id="chart"&gt;&lt;img alt='SemanticWebQuotes' id="backup" src='http://al3xandr3.github.com/img/semanticweb_quotes.png'/&gt;&lt;/div&gt;


&lt;script type="text/javascript"&gt;
var count = "                                  \
  PREFIX  dcterms: &lt;http://purl.org/dc/terms/&gt; \
  SELECT ?date (count(?subject) AS ?total)     \
  FROM &lt;http://www.w3.org/2007/08/pyRdfa/extract?uri=http://al3xandr3.github.com/pages/quotes.html&gt; \
  WHERE {                         \
  ?subject dcterms:date ?date .   \
  }                               \
  GROUP BY ?date                  \
  ORDER BY ?date                  \
  ", 
  day, per_day={"Mon":0,"Tue":0,"Wed":0,"Thu":0,"Fri":0,"Sat":0,"Sun":0},
  data=[],
  total_tw=0, today_tw=0;

$.ajax({
  url: "http://sparql.org/sparql?query=" + encodeURIComponent(count) + "&amp;output=json",
  dataType: "jsonp",
  success: function(d) {

    // group per day-of-week
    $.each(d.results.bindings, function(i, v) { 
      day = Date.parse(v.date.value).toString("ddd"); // date.js returns a Date
      per_day[day] += parseInt(v.total.value, 10);
    });

    // plot &amp; stats
    $.each(per_day, function(k, v) { 
      // setup for plot
      data.push({day: k, value: v});

      // stats
      total_tw += v;
      if (k === Date.today().toString("ddd")) {
        today_tw = v;
      }
    });
    
    // live data arrived, so drop the static backup image
    if (total_tw !== 0) {
      $('#backup').remove();
    }

    var w = 420,
    h = 200,
    x = d3.scale.linear().domain([0, data.length]).range([0, w]),
    y = d3.scale.linear().domain([0, d3.max(data, function(d) {return d.value;})]).range([0,h]);

    var chart = d3.select("#chart")
      .append("svg")
        .attr("width",  w+30)
        .attr("height", h+30); 
  
    chart.selectAll("rect")
        .data(data)
      .enter().append("rect")
        .attr("x", function(d,i) { return x(i); })
        .attr("y", function(d) { return h - y(d.value); })
        .attr("height", function(d) { return y(d.value); })
        .attr("width", 40)
        .attr("fill", "#2d578b");

    chart.selectAll("text.bars")
        .data(data)
      .enter().append("text")
        .attr("x", function(d, i) { return x(i) + 20; })
        .attr("y", function(d) { return h - y(d.value) + 3; })
        .attr("dy", "1.2em")
        .attr("text-anchor", "middle")
        .text(function(d) { return d.value;})
        .attr("fill", "white");

    chart.selectAll("text.xAxis")
        .data(data)
      .enter().append("text")
        .attr("x", function(d,i) { return x(i) + 20; })
        .attr("y", h)
        .attr("text-anchor", "middle")
        .attr("style", "font-size: 12px !important; font-family: Helvetica, sans-serif")
        .text(function(d) { return d.day;})
        .attr("transform", "translate(0, 18)")
        .attr("class", "xAxis");

    // Prediction Text
    $('#prediction').html("Today is &lt;strong&gt;" + 
                        Date.today().toString("dddd") + 
                        "&lt;/strong&gt;, so is &lt;strong&gt;" +
                        Math.round(today_tw/total_tw*100) +
                        "%&lt;/strong&gt; likely that I'll tweet.");
  }
});
&lt;/script&gt;


&lt;p&gt;&lt;span id="prediction"&gt;For example on &lt;strong&gt;Thursday&lt;/strong&gt; is &lt;strong&gt;21%&lt;/strong&gt; likelly that I'll tweet.&lt;/span&gt;&lt;/p&gt;

&lt;br /&gt;


&lt;h4&gt;References&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://al3xandr3.github.com/2011/12/18/data.html"&gt;Data, Data, Data! semantic web use case on Xmas&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.ldodds.com/foaf/foaf-a-matic"&gt;foaf-a-matic online tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://t.co/4bhVAHfV"&gt;rdfa cheat sheet pdf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://rdfadev.sourceforge.net/"&gt;Firefox rdfa debug&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://dailyjs.com/2010/11/26/linked-data-and-javascript/"&gt;linked data javascript tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/NSJzhHHIrvE" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2012/01/11/semanticweb.html</feedburner:origLink></entry>
 
 <entry>
   <title>Data, Data, Data!</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/lAnH-reBdwU/data.html" />
   <updated>2011-12-18T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2011/12/18/data</id>
   <category term="data" label="data" />
   <category term="semanticweb" label="semanticweb" />
   <category term="idea" label="idea" />
   
   <content type="html">&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/data.jpg" alt="SemanticWeb" /&gt;&lt;/p&gt;

&lt;h3&gt;Once Upon a Time...&lt;/h3&gt;

&lt;h4&gt;After a big meal at Xmas&lt;/h4&gt;

&lt;p&gt;computer: Given your heart condition and age, your risk of heart attack has now increased from 1% to 5% and, at this rate, is estimated to reach 10% in 2 weeks. Your family doctor has been notified.&lt;/p&gt;

&lt;p&gt;person: f*k, why, how?&lt;/p&gt;

&lt;p&gt;computer: Excessive ingestion of foods high in fat and cholesterol leads to increased risk. Your meals over the past weeks have contributed to that. It was about the same during last year's Xmas season, and it is quite common for most people across the world.&lt;/p&gt;

&lt;p&gt;person: what now?&lt;/p&gt;

&lt;p&gt;computer: I recommend better nutrition. I estimate that an ideal diet with exercise can bring the risk back close to 1% in about 2-3 weeks. Should I create a recovery plan?&lt;/p&gt;

&lt;p&gt;person: yes&lt;/p&gt;

&lt;p&gt;computer: and order food from local shop?&lt;/p&gt;

&lt;p&gt;person: yes&lt;/p&gt;

&lt;p&gt;computer: and book gym time?&lt;/p&gt;

&lt;p&gt;person: oh god, I have to go to the gym? I hate the gym&lt;/p&gt;

&lt;p&gt;computer: Physical activity is recommended for a faster recovery. Without it, I estimate recovery will take longer. But if you prefer, in the next few weeks there will be a wilderness day and a walking activity in your area.&lt;/p&gt;

&lt;p&gt;person: Yes, let's do that instead.&lt;/p&gt;

&lt;p&gt;computer: OK, you are signed up for them. I will also be guiding your meals and sports activity for the coming weeks.&lt;/p&gt;

&lt;p&gt;person: thanks&lt;/p&gt;

&lt;h4&gt;A few weeks after at the workplace&lt;/h4&gt;

&lt;p&gt;computer: Your heart attack risk has decreased to close to 1%. The recovery plan has been successful; you are now back to normal, healthy levels.&lt;/p&gt;

&lt;p&gt;person: Woohoo!&lt;/p&gt;

&lt;p&gt;computer: Do note that your sugar levels are presently low, so your intellectual productivity is estimated to be below 40%. I recommend stopping for the day and getting some food.&lt;/p&gt;

&lt;p&gt;person: OK, thanks. But one last thing: you mentioned that the increased health risk around the Xmas season is a common problem?&lt;/p&gt;

&lt;p&gt;computer: Yes, a worldwide problem in fact. Do you want to get the full report?&lt;/p&gt;

&lt;p&gt;person: No, just wondering if we can do something about it... Can you run some simulations?&lt;/p&gt;

&lt;p&gt;computer: yes&lt;/p&gt;

&lt;p&gt;person: For example, what would be the impact of a TV campaign about the health risks just before the Xmas season...&lt;/p&gt;

&lt;p&gt;computer: The simulation estimates that it would improve global health by 2% and reduce global hospital costs by about 3% for the next year.&lt;/p&gt;

&lt;p&gt;person: How about a Santa visit to schools that includes a nutrition lecture? Or nutrition training for school teachers? Or a new TV cartoon themed around healthy food? Or a lower tax on healthy foods for this season? Or a higher tax on unhealthy foods for this season? Or any mix of these... Run the simulations on the impacts.&lt;/p&gt;

&lt;p&gt;computer: done.&lt;/p&gt;

&lt;p&gt;person: ok, please generate an action plan and forward it to all members of the assembly for voting.&lt;/p&gt;

&lt;p&gt;computer: will do, Governor&lt;/p&gt;

&lt;p&gt;&lt;em&gt;all lived happily ever after... the End&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;Data, Data, Data!&lt;/h3&gt;

&lt;p&gt;See where this is going? &lt;strong&gt;Data, data, data!&lt;/strong&gt; that helps improve our lives in health, government, business, education, etc...&lt;/p&gt;

&lt;p&gt;In the truly data-driven future:&lt;/p&gt;

&lt;p&gt;All data will be collected automatically by the devices around us; all new data will be automatically linked with the existing data and made available to be used by everyone, everywhere.&lt;/p&gt;

&lt;p&gt;Many data-aware systems will exist and constantly improve themselves: correlating, mixing and matching data, inferring and generating new knowledge from the newly generated data that keeps springing up all around us.&lt;/p&gt;

&lt;p&gt;There will be data available about each person, each government, each household, each company, each car, each food, each animal, each school, each computer... essentially everything, and it will be collected all the time (almost in real time) by all kinds of gadgets.&lt;/p&gt;

&lt;p&gt;With this information we will be able to have computers that can quickly see where a specific trend is going and suggest real time actions to ensure optimal effects.&lt;/p&gt;

&lt;h4&gt;Privacy&lt;/h4&gt;

&lt;p&gt;All this personal data floating around might sound scary: what if it gets used in the wrong way? Guess what... data-aware systems everywhere will themselves prevent fraud and misuse of information: any kind of bias or discrimination is immediately picked up and identified.&lt;/p&gt;

&lt;p&gt;I'd argue further that this won't actually happen, because in a very data-driven education any wrong influences are quickly identified and corrected early in the education process. Thus the tendency is that, among grown-ups, the number of wrongdoers will be very small.&lt;/p&gt;

&lt;h3&gt;Data Sharing &amp;amp; Semantic Web&lt;/h3&gt;

&lt;p&gt;How do we go about it? To start with, the data sharing part is key and is currently under-developed (or just not widespread yet). The usual approach nowadays is to collect everything into 1 gigantic private silo and then use that.&lt;/p&gt;

&lt;p&gt;We need to get to the point where all newly generated data is available, linked and automatically understood by all other data systems, so we need a lingua franca for data formats and data sharing that everybody speaks and understands.&lt;/p&gt;

&lt;p&gt;So, as a practical 1st step, introducing the "Semantic Web" by Tim Berners-Lee:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;A web of data that can be processed directly and indirectly by machines. &lt;a href="http://en.wikipedia.org/wiki/Semantic_Web"&gt;@wikipedia&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Semantic Web will make this (data sharing) possible, by providing an open format for the representation and exchange of knowledge and expertise. &lt;a href="http://lifeboat.com/ex/minding.the.planet"&gt;@lifeboat&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Semantic Web changes the economics of processing knowledge. &lt;a href="http://lifeboat.com/ex/minding.the.planet"&gt;@lifeboat&lt;/a&gt;&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;It's a start. And in fact, the image at the top of this post is a diagram of the semantic web data sets that already exist. &lt;a href="http://richard.cyganiak.de/2007/10/lod/"&gt;@lod_cloud&lt;/a&gt;&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/lAnH-reBdwU" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/12/18/data.html</feedburner:origLink></entry>
 
 <entry>
   <title>Table.query</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/Jh0HIdNaJBQ/table-query.html" />
   <updated>2011-10-14T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2011/10/14/table-query</id>
   <category term="data" label="data" />
   <category term="ruby" label="ruby" />
   <category term="statistics" label="statistics" />
   
   <content type="html">&lt;p&gt;A small ruby class inspired by R's &lt;a href="http://code.google.com/p/sqldf/"&gt;&lt;strong&gt;sqldf&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; You parse data from a log file or a web service and then need to do some data manipulation and summaries, like joining with other data, filtering, pivoting (group by), augmenting the data with calculated columns, calculating sums, counts, averages, etc...&lt;/p&gt;

&lt;p&gt;It leverages the power of SQL for data analysis inside Ruby with a minimal API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; require 'table'
&amp;gt; Table.new ["user", "value"], [["bob", 3], ["eve", 1]], "tbl"
&amp;gt; Table.query "select sum(value) from tbl"
[[4]]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;automatically infers the data type (numeric vs text)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;shortcut to get a column:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; &amp;gt; tbl = Table.new ["user", "value"], [["bob", 3], ["eve", 1]], "tbl"
 &amp;gt; tbl.user
 ["bob", "eve"]
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;augment your data by adding columns:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; &amp;gt; tbl = Table.new ["user", "value"], [["bob", 3], ["eve", 1]], "tbl"
 &amp;gt; tbl.add "sex", ["male", "female"]
 &amp;gt; Table.query "select user, sex from tbl"
 [["bob", "male"], ["eve", "female"]]
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;name the query (second argument) to store its result as a new table:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; &amp;gt; Table.new ["user", "value"], [["bob", 3], ["eve", 1]], "tbl"
 &amp;gt; tbl2 = Table.query "select sum(value) as total from tbl", "tbl2"
 &amp;gt; tbl2.total
 [4]
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;direct access to db driver when needed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt; &amp;gt; Table.new ["user", "value"], [["bob", 3], ["eve", 1]], "tbl"
 &amp;gt; Table.with_db {|db| db.execute("update tbl set value=5 where user='eve'") }
 &amp;gt; Table.query "select sum(value) from age"
 [[8]]
&lt;/code&gt;&lt;/pre&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;persists to the file table.db, which means the data is also accessible from other tools: R, Excel, Java, etc...&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;small :)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
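&lt;p&gt;The type inference from the feature list can be sketched on its own, without a database. The helper name &lt;code&gt;infer_sql_type&lt;/code&gt; is hypothetical and not part of the class. It assumes the rule is "look at the first non-empty value in the column", and it tests for Integer rather than Fixnum because Fixnum is gone from current Ruby versions.&lt;/p&gt;

```ruby
# Hypothetical helper (not part of the Table class): choose a SQLite
# column type from the first non-empty value found in a column.
def infer_sql_type(values)
  # Skip nil and empty strings; sample stays nil if nothing is left.
  sample = values.find { |v| ![nil, ""].include?(v) }
  sample.is_a?(Integer) || sample.is_a?(Float) ? "NUMERIC" : "TEXT"
end

puts infer_sql_type([nil, "", 3])     # NUMERIC
puts infer_sql_type(["bob", "eve"])   # TEXT
puts infer_sql_type([nil, nil])       # TEXT
```

&lt;p&gt;A column with no usable values falls back to TEXT, which is the safe default since SQLite will store any value in a TEXT column.&lt;/p&gt;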


&lt;h3&gt;Code&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;require 'sqlite3'

class Table
  attr_accessor :name

  def self.with_db(&amp;amp;block)
    db = SQLite3::Database.new("./table.db")
    yield db if block_given?
  ensure
    db.close if db # always close, even when the block returns early
  end

  def self.query q, new_table_name=nil
    if new_table_name
      val = []
      Table.with_db {|db| val = db.execute2(q) }
      Table.new val.shift, val, new_table_name
    else 
      Table.with_db {|db| return db.execute(q) }
    end
  end  

  # new
  def initialize header, data, name
    @table = name

    # sql to create new table
    sql = "create table #{@table} ("
    header.each_with_index do |h, i|

      # iterates each row, looking for a non empty value
      val = ""
      data.each do |row| 
        if row[i] != nil and row[i] != ""
          val = row[i]
          break
        end
      end

      if val.is_a?(Integer) or val.is_a?(Float) # Fixnum was removed in Ruby 3.2
        sql &amp;lt;&amp;lt; "#{h} NUMERIC,"
      else 
        sql &amp;lt;&amp;lt; "#{h} TEXT,"
      end
    end
    sql &amp;lt;&amp;lt; ");"

    Table.with_db do |db|
      db.execute( "drop table if exists #{@table};" ) # remove if exists
      db.execute(sql.gsub(",);", ");"))               # create new table  
      data.each do |row|                              # insert data
        db.execute( "insert into #{@table} values ( '#{row.join("','")}' );" )        
      end
    end
  end

  # column  
  def method_missing(m, *args, &amp;amp;block) 
    Table.with_db do |db|
      return db.execute( "select #{m} from #{@table}" ).flatten
    end
  end

  # add  
  def add col, data
    Table.with_db do |db|
      if data[0].is_a?(Integer) or data[0].is_a?(Float)
        db.execute( "alter table #{@table} add #{col} NUMERIC;" )
      else
        db.execute( "alter table #{@table} add #{col} TEXT;" )
      end

      # add specified value to each row
      data.each_with_index do |val, i|
        db.execute( "update #{@table} set #{col} = '#{val}' where ROWID = #{i+1};" )
      end
    end
  end

  # list tables
  def list
    Table.with_db do |db|
      return db
        .execute("select name from sqlite_master where type='table' ORDER BY name;")
        .flatten    
    end
  end
end




if __FILE__ == $0
  require "test/unit"
  class TestTable &amp;lt; Test::Unit::TestCase

    def setup
      require 'date' # Date is used by test_date
      File.delete "./table.db" if File.exist? "./table.db"
    end    

    def test_insert
      Table.new ["id", "v1", "v2"], [[1, 23, "a"], [2, 34, "b"]], "test"
      assert_equal [[23.0, "a"], [34.0, "b"]], Table.query("select v1,v2 from test")
    end

    def test_method_missing
      tbl = Table.new ["id", "col1", "col2"], [[1, 23, "a"], [2, 34, "b"]], "test"
      assert_equal [23.0, 34.0], tbl.col1
    end

    def test_add
      tbl = Table.new ["id", "col1"], [[1, 23], [2, 34]], "test"
      tbl.add("col2", tbl.col1.map{|v|v+1} ) # col2 = col1 + 1
      assert_equal [24.0, 35.0], tbl.col2
      tbl.add("col3", ["random", "stuff"])
      assert_equal ["random", "stuff"], tbl.col3
    end

    def test_join
      Table.new ["id", "v1", "v2"], [[1, 23, "a"], [2, 34, "b"]], "tbl1"
      Table.new ["id", "v3", "v4"], [[1, 24, "c"], [2, 36, "d"]], "tbl2"
      sql = "select tbl1.id,v1,v4 from tbl1 join tbl2 on tbl1.id = tbl2.id"
      assert_equal [[1, 23, "c"], [2, 34, "d"]], Table.query(sql)
    end

    def test_alias
      Table.new ["g", "v"], [["a",11], ["a",9], ["b",2], ["b",2]], "tbl"
      tbl2 = Table.query("select g, sum(v) as value from tbl group by g", "tbl2")
      assert_equal ["a", "b"], tbl2.g
      assert_equal [20.0, 4.0], tbl2.value
    end

    def test_join_with_new_table
      Table.new ["id", "v1", "v2"], [[1, 11, "a"], [2, 12, "b"]], "tbl1"
      Table.new ["id", "v3", "v4"], [[1, 21, "c"], [2, 22, "d"]], "tbl2"
      sql = "select tbl1.id,v1,v4 from tbl1 join tbl2 on tbl1.id = tbl2.id"
      Table.query(sql, "tbl3")
      assert_equal [[1, 11, "c"], [2, 12, "d"]], Table.query("select * from tbl3")
    end

    def test_ad_hoc
      Table.new ["id", "v1", "v2"], [[1, 23, "a"], [2, 34, "b"]], "test"
      val = nil
      Table.with_db {|db| val = db.execute("select v1 from test limit 1") }
      assert_equal [23.0], val.flatten
    end

    def test_date
      Table.new ["ts", "v"], [[Date.today, 23], [Date.today+10, 34]], "test"
      sql = "select v from test where ts &amp;lt; '#{Date.today+2}'"
      assert_equal [23.0], Table.query(sql).flatten
    end

    def test_time
      Table.new ["ts", "v"], [[Time.now, 10], [Time.now+10, 20]], "test"
      sql = "select v from test where ts &amp;lt; '#{Time.now+2}'"
      assert_equal [10.0], Table.query(sql).flatten
    end

  end
end 
&lt;/code&gt;&lt;/pre&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/Jh0HIdNaJBQ" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/10/14/table-query.html</feedburner:origLink></entry>
 
 <entry>
   <title>Monitoring Productivity II - the Others</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/qw59Y0LGQIo/productivity-others.html" />
   <updated>2011-09-30T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2011/09/30/productivity-others</id>
   <category term="data" label="data" />
   <category term="productivity" label="productivity" />
   <category term="statistics" label="statistics" />
   <category term="visualization" label="visualization" />
   <category term="ruby" label="ruby" />
   <category term="R" label="R" />
   
   <content type="html">&lt;p&gt;In previous Monitoring Productivity Experiment &lt;a href="http://al3xandr3.github.com/2010/10/20/monitoring-productivity-experiment.html"&gt;post&lt;/a&gt; I looked into the hours I spent in computer, now will look into the hours &lt;strong&gt;Others&lt;/strong&gt; spend in computer, which is far more interesting :) To find things like what day people spend more time on computer, how many hours they work, and general activity patterns.&lt;/p&gt;

&lt;h3&gt;Collecting data&lt;/h3&gt;

&lt;p&gt;On OS X, it is possible to use Growl to display a message when a Skype user signs in. So I configured Growl to log the sign-ins and sign-outs of my Skype contacts.&lt;/p&gt;

&lt;p&gt;Like so:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;gt; touch ~/Desktop/growl.log
&amp;gt; defaults write com.Growl.GrowlHelperApp GrowlLoggingEnabled -bool YES
&amp;gt; defaults write com.Growl.GrowlHelperApp GrowlLogType 1
&amp;gt; defaults write com.Growl.GrowlHelperApp "Custom log 1" ~/Desktop/growl.log
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Instructions &lt;a href="http://gthing.net/enable-growl-log-and-show-it-in-geektool"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And then I left Skype signed in for a few weeks while I was on vacation.&lt;/p&gt;

&lt;h3&gt;Parsing data&lt;/h3&gt;

&lt;p&gt;Read the log file and create a semicolon separated file:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/usr/bin/env ruby
puts "timestamp;user;status"
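# Keep only sign-in/sign-out lines and emit one "timestamp;user;status" row each.
File.foreach(ARGV[0]) do |l|
  next unless l.include?("online") || l.include?("offline")
  date   = l.split('Skype')[0].strip             # timestamp precedes the first "Skype"
  user   = l.scan(/Skype:([^\(]*)/)[0][0].strip  # name between "Skype:" and "("
  status = l.include?("online") ? "online" : "offline"
  puts "#{date};#{user};#{status}"
end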
&lt;/code&gt;&lt;/pre&gt;
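&lt;p&gt;To see what the parsing rules do, here they are applied to a single line. The log line is hypothetical: its layout is an assumption inferred from the script above (a timestamp, then "Skype:", the contact name, and a parenthesised remark), not copied from a real Growl log.&lt;/p&gt;

```ruby
# Hypothetical log line; the layout is an assumption inferred from the
# parsing rules, not a verbatim sample of Growl output.
line = "Aug 24, 2011 3:58:01 PM Skype: bob (is now online)"

date   = line.split("Skype")[0].strip        # text before the first "Skype"
user   = line[/Skype:([^\(]*)/, 1].strip     # text between "Skype:" and "("
status = line.include?("online") ? "online" : "offline"

row = [date, user, status].join(";")
puts row
```

&lt;p&gt;For this sample the row is "Aug 24, 2011 3:58:01 PM;bob;online", which is the shape the R code in the next section reads.&lt;/p&gt;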

&lt;h3&gt;Load it into R&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;library(sqldf)   # SQL queries over data frames
library(ggplot2) # plots used further down

data = read.csv("/my/proj/skype-growl/log.csv", sep=";", header=TRUE)

# parse dates "Aug 24, 2011 3:58:01 PM"
data$date = as.POSIXct(strptime(data$timestamp,"%b %d, %Y %I:%M:%S %p")) # DateTime
data$hour = format(data$date, format="%H:%M:%S")       # string
data$time = as.POSIXct(data$hour, format = "%H:%M:%S") # DateTime
data$day  = format(data$date, format="%m/%d/%y")       # string
data$weekday = format(data$date, format="%A")          # string

# filter for complete days of data
data = sqldf("select * from data where day &amp;gt;= '08/25/2011' and day &amp;lt;= '09/21/2011'")
sqldf("select count(distinct(day)) from data") 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;27 days of data.&lt;/p&gt;

&lt;h3&gt;The sign-ins and sign-outs of a random person&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;randomperson = sqldf("select user from data group by random() limit 1")

d = sqldf(sprintf("select * 
                   from data 
                   where user = '%s' and day  &amp;gt;= '09/04/2011' and 
                   day  &amp;lt;= '09/12/2011'", randomperson[1,1]))

ggplot(data=d, aes(y=time, x=date)) + geom_point(aes(color=status), alpha=0.6) +  scale_x_datetime(major = "1 days") + scale_y_datetime(major = "1 hours")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/others_random_person.png" alt="Random Person" /&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10-Sep is a Saturday and 11-Sep a Sunday, so Skype was off over the weekend&lt;/li&gt;
&lt;li&gt;the workday starts between 9h and 11h&lt;/li&gt;
&lt;li&gt;the workday ends between 18h and 19h&lt;/li&gt;
&lt;li&gt;Skype is offline after working hours, except on the night of Monday 05-Sep&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;Online Activity Patterns&lt;/h3&gt;

&lt;p&gt;Plotting all sign-ins and sign-outs for each weekday gives a feeling for the overall online activity:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ggplot(data, aes(x=time,..density..)) + geom_histogram() + facet_grid(weekday ~ .)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/others_daily_activity.png" alt="Daily Activity" /&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Night time has less activity, and it decreases progressively as the night goes by&lt;/li&gt;
&lt;li&gt;Around 9h activity spikes (people start work?)&lt;/li&gt;
&lt;li&gt;Around 17h/18h activity spikes (people ending work?)&lt;/li&gt;
&lt;li&gt;The 9h &amp;amp; 17h spikes are much less pronounced on the weekend, so they are very likely connected to work&lt;/li&gt;
&lt;li&gt;On Sundays, after dinner time (or so), people seem to get online again before going to sleep&lt;/li&gt;
&lt;li&gt;On weekends the computer gets more use later in the day&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;How many hours do people work?&lt;/h3&gt;

&lt;p&gt;This is trickier to measure accurately, but we can make a guess by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;assuming that people are working during the working hours of workdays&lt;/li&gt;
&lt;li&gt;assuming that nobody starts work before 6h, and nobody ends work after 21h&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then the first status change after 6h is the start of work, and the last status change before 21h is the end of work.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;d = sqldf("select user, 
                  day, 
                  weekday,
                  min(hour) as start, 
                  max(hour) as end
           from data
           where hour &amp;gt;= '06:00:00' and hour &amp;lt;= '21:00:00' and
                 weekday &amp;lt;&amp;gt; 'Saturday' and weekday &amp;lt;&amp;gt; 'Sunday'
           group by user, day")
d$totalhours = difftime(as.POSIXct(d$end, format = "%H:%M:%S"), as.POSIXct(d$start, format = "%H:%M:%S"))
d$totalhours = as.numeric(d$totalhours, units="hours")

# exclude less than 2 hours/day: bots, vacations, etc...
dt = sqldf("select * from d where totalhours &amp;gt; 2")

al3x.load() # my own collection of R functions
al3x.hist(dt, "totalhours")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/others_totalhours.png" alt="Total Hours" /&gt;&lt;/p&gt;

&lt;p&gt;Workday totals are mostly between 6 and 12 hours, the most common being about 8.5 hours/day.&lt;/p&gt;
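
&lt;p&gt;The first/last status change rule can also be sketched outside of SQL. The Ruby below is an illustration on a handful of made-up events (the users, dates and times are hypothetical), using the same 6h to 21h window assumed above.&lt;/p&gt;

```ruby
# Convert "HH:MM:SS" to a fractional number of hours
def to_hours(clock)
  h, m, s = clock.split(":").map { |part| part.to_i }
  h + m / 60.0 + s / 3600.0
end

# Hypothetical status changes: [user, day, time of day]
events = [
  ["ann", "09/05/11", "05:10:00"],  # before 06:00, ignored
  ["ann", "09/05/11", "09:00:00"],
  ["ann", "09/05/11", "13:00:00"],
  ["ann", "09/05/11", "17:30:00"],
  ["bob", "09/05/11", "10:15:00"],
  ["bob", "09/05/11", "18:45:00"]
]

# Keep only changes inside the assumed 06:00-21:00 window.
# Zero-padded "HH:MM:SS" strings compare in the same order as the times.
in_window = events.select do |_user, _day, clock|
  clock >= "06:00:00" and "21:00:00" >= clock
end

# First and last change per user and day, and the hours between them
totals = {}
in_window.group_by { |user, day, _clock| [user, day] }.each do |key, rows|
  clocks = rows.map { |row| row[2] }
  totals[key] = to_hours(clocks.max) - to_hours(clocks.min)
end

totals.each { |(user, day), hours| puts "#{user} #{day}: #{hours} hours" }
```

&lt;p&gt;Note that this measures the span between the first and last change, not the time actually spent online in between.&lt;/p&gt;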

&lt;h3&gt;On which day do people spend the most time at the computer?&lt;/h3&gt;

&lt;p&gt;We can try counting the number of sign-in/sign-out changes per weekday: the more status changes, the more likely people are at the computer.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;d = sqldf("select weekday, count(status) as amount
           from data
           group by weekday
           order by count(status) DESC")
ggplot(d, aes(x=weekday,y=amount)) + geom_bar(stat="identity")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/others_day_activity.png" alt="What day most active?" /&gt;&lt;/p&gt;

&lt;p&gt;As the above could be biased in a number of ways, let's measure it another way; if the results match, the original estimate should be ok.&lt;/p&gt;

&lt;p&gt;For example, one way to go about it is to sum up the total working hours for each weekday:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;d = sqldf("select user, 
                  day, 
                  weekday,
                  min(hour) as start, 
                  max(hour) as end
           from data
           where hour &amp;gt;= '06:00:00' and hour &amp;lt;= '21:00:00'
           group by user, day")
d$totalhours = difftime(as.POSIXct(d$end, format = "%H:%M:%S"), as.POSIXct(d$start, format = "%H:%M:%S"))
d$totalhours = as.numeric(d$totalhours, units="hours")

sqldf("select weekday, sum(totalhours) as amount
           from d
           group by weekday
           order by sum(totalhours) DESC")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Getting:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;    weekday    amount
1   Tuesday 15404.471
2 Wednesday 15191.946
3    Monday 14298.472
4    Friday 12426.091
5  Thursday 11638.443
6  Saturday  5222.874
7    Sunday  5198.367
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Almost the same results, great.&lt;/p&gt;

&lt;p&gt;Thus &lt;strong&gt;Tuesday&lt;/strong&gt; is the day people spend the most time at the computer, and in decreasing order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tuesday &gt; Wednesday &gt; Monday &gt; Friday &gt; Thursday &gt; (Saturday or Sunday)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Taking time at the computer as a rough proxy for productivity, this means that &lt;strong&gt;Tuesday&lt;/strong&gt; is the most productive day, while &lt;strong&gt;Thursday&lt;/strong&gt; is the least productive workday.&lt;/p&gt;
</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/09/30/productivity-others.html</feedburner:origLink></entry>
 
 <entry>
   <title>Dashboarding</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/TrVNcHt8NOc/dashboards.html" />
   <updated>2011-05-24T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2011/05/24/dashboards</id>
   <category term="data" label="data" />
   <category term="visualization" label="visualization" />
   <category term="ruby" label="ruby" />
   <category term="automation" label="automation" />
   <category term="dashboard" label="dashboard" />
   <category term="javascript" label="javascript" />
   <category term="statistics" label="statistics" />
   
   <content type="html">&lt;p&gt;An important part of being data driven is getting daily feedback on the data.
Here are a couple of &lt;strong&gt;automated dashboards&lt;/strong&gt; I've built recently:
&lt;img src="http://al3xandr3.github.com/img/dash2.png" alt="http://al3xandr3.github.com/img/dash2.png" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/dash1.png" alt="http://al3xandr3.github.com/img/dash1.png" /&gt;&lt;/p&gt;

&lt;p&gt;This is the first iteration, where almost all data is displayed as is. The next
iteration could enrich the data further with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data from a year or 6 months ago, for direct comparison.&lt;/li&gt;
&lt;li&gt;Fits to the data, like a regression line that shows the overall tendency and allows predictions of the next day/week/month values.&lt;/li&gt;
&lt;li&gt;More relative change plots, like the &lt;a href="http://vis.stanford.edu/protovis/ex/index-chart.html"&gt;protovis index-chart&lt;/a&gt;, which are very useful.&lt;/li&gt;
&lt;li&gt;Confidence intervals, to point out which changes are unlikely to be due to chance.&lt;/li&gt;
&lt;li&gt;etc…&lt;/li&gt;
&lt;/ul&gt;
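
&lt;p&gt;As an illustration of two of these ideas, here is a rough Ruby sketch of a least-squares trend line with a next-day estimate, plus the relative change against the first day as used in an index chart. The metric values and the function name are made up for the example; this is not part of the dashboard code.&lt;/p&gt;

```ruby
# Ordinary least-squares line through (0, v0), (1, v1), ...
# Returns [intercept, slope].
def trend_line(values)
  n = values.size.to_f
  xs = (0...values.size).to_a
  mean_x = xs.sum / n
  mean_y = values.sum / n
  sxy = xs.zip(values).sum { |x, y| (x - mean_x) * (y - mean_y) }
  sxx = xs.sum { |x| (x - mean_x)**2 }
  slope = sxy / sxx
  [mean_y - slope * mean_x, slope]
end

# Hypothetical daily metric
values = [10.0, 12.0, 14.0, 16.0]
intercept, slope = trend_line(values)

# Extrapolate the line one day past the last observation
next_day = intercept + slope * values.size
puts "trend: #{slope} per day, next day estimate: #{next_day}"

# Percentage change against the first day, as in an index chart
index = values.map { |v| (v / values.first - 1.0) * 100 }
puts index.inspect
```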


&lt;h3&gt;Tools &amp;amp; Code&lt;/h3&gt;

&lt;p&gt;Coded in Ruby, it aggregates data from different sources and, based on an html
template, generates the html for the full dashboard. It's fully automated, and I
run it daily using a cron job.&lt;/p&gt;

&lt;p&gt;It uses Highcharts as the JavaScript charting engine, which I can only say good
things about: the charts look very nice and allow user interaction.&lt;/p&gt;

&lt;p&gt;I placed the code I use as the base to build the dashboards on GitHub, find it
here: &lt;a href="https://github.com/al3xandr3/Dashboard"&gt;https://github.com/al3xandr3/Dashboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few bits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Getting RescueTime Data&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require 'open-uri'
require 'date'

key = "yourownkey"

res = Hash.new(0) # missing counters start at 0
# URI.open is the open-uri entry point on current Rubies
URI.open("https://www.rescuetime.com/anapi/data?key=#{key}&amp;amp;perspective=interval&amp;amp;format=csv&amp;amp;resolution_time=day&amp;amp;restrict_kind=activity") do |f|
  f.each_line.with_index do |l, i|
    next if i == 0 # skip the csv header
    t, sec, some, app, cat, prod = l.split(",")
    res[:week] += sec.to_i
    res[:day]  += sec.to_i if Date.parse(t) == Date.today
  end
end

print res
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Getting Google Spreadsheets Data&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require 'gdata/client'  
require 'gdata/http'  
require 'gdata/auth'
require 'open-uri'
require 'date'

client = GData::Client::Spreadsheets.new
client.clientlogin('yourmail@gmail.com', "yourpass")
key = "yourspreadsheetkey"
test = client.get("http://spreadsheets.google.com/feeds/download/spreadsheets/Export?key=#{key}&amp;amp;fmcmd&amp;amp;exportFormat=csv")
values = []
i=0
test.body.each_line do |l|
  t,w,co,wa,h = l.gsub("\n","").split(',')
  unless i==0
    values &amp;lt;&amp;lt; [Date.parse(t), w.to_f, wa.to_f, h.to_f] 
  end
  i+=1
end

print values
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Getting imap mail attachments&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require 'net/imap'
require 'date'

opts = {} # set any of the keys below to override the defaults
opts[:inbox]   ||= "Inbox"
opts[:search]  ||= ["SINCE", "8-Aug-2007"]
opts[:attach]  ||= ["CSV"]
opts[:savedir] ||= "."

imap = Net::IMAP.new('mail.server.com', :port =&amp;gt; 993, :ssl =&amp;gt; true)
imap.login('yourmail@server.com', 'yourpassw')    
imap.select(opts[:inbox])
imap.search(opts[:search]).each do |uid|
  msg = imap.fetch(uid, ["ENVELOPE","UID","BODY"])[0]
  body = msg.attr["BODY"]
  date = Date.parse(msg.attr["ENVELOPE"].date)
  i = 1
  while body.parts[i] != nil
    type = body.parts[i].subtype
    encoding = body.parts[i].encoding
    name = body.parts[i].param["NAME"] || date.to_s
    i+=1
    attachment = imap.fetch(uid, "BODY[#{i}]")[0].attr["BODY[#{i}]"]
    p "#{name}, #{type}, #{encoding}"
    if opts[:attach].include? type and not attachment.nil?
      File.open(File.join(opts[:savedir], name),'wb+') do |f|
        if encoding == "BASE64"
          f.write(attachment.unpack('m')[0])
        else
          f.write(attachment)
        end          
      end
    end
  end  
end 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Posting html to a confluence wiki&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require 'xmlrpc/client'

user = "username"
pass = "password"
area = "area"
page_name="page"
content = "&amp;lt;h1&amp;gt;Big Header&amp;lt;/h1&amp;gt;"
confluence = XMLRPC::Client
      .new2("https://#{user}:#{pass}@confluence.server.com/rpc/xmlrpc")
      .proxy("confluence1")
page = confluence.getPage("", area, page_name)
page["content"] = "{html}#{content}{html}"
confluence.storePage("", page)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;Creating a highcharts JS chart&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;require 'erb'
require 'date'

def line arg={}

  arg[:height] = arg[:height] || ""
  arg[:width] = arg[:width] || ""

  line_chart = %{
    &amp;lt;div id="&amp;lt;%= arg[:name] %&amp;gt;" style="height:&amp;lt;%= arg[:height] %&amp;gt;px;width:&amp;lt;%= arg[:width] %&amp;gt;px;"&amp;gt;&amp;lt;/div&amp;gt;
    &amp;lt;script type="text/javascript"&amp;gt;
     var month = new Array("Jan","Feb","Mar","Apr","May","Jun",
                           "Jul","Aug","Sept","Oct","Nov","Dec");
     var chart;
     $(document).ready(function() {
     chart = new Highcharts.Chart({
        chart: {
           renderTo: '&amp;lt;%= arg[:name] %&amp;gt;',
           defaultSeriesType: 'line',
           marginRight: 40,
           marginBottom: 40
        },
        credits:{
          enabled:false
        },
        plotOptions: {
           line: {
              dataLabels: {
                 enabled: &amp;lt;%= arg[:data_labels] || false %&amp;gt;
              }
           }
        },
        title: {
           text: '&amp;lt;%= arg[:name] %&amp;gt;',
           x: -20 //center
        },
        subtitle: {
           text: '&amp;lt;%= arg[:subtitle] %&amp;gt;',
           x: -20
        },
        xAxis: {
           type: "datetime",
           title: {
              text: '&amp;lt;%= arg[:xlabel] %&amp;gt;'
           },
        },
        yAxis: {
           min: &amp;lt;%= arg[:ymin] || 0 %&amp;gt;,
           title: {
              text: '&amp;lt;%= arg[:ylabel] %&amp;gt;'
           },
        },
        tooltip: {
           formatter: function() {
             return (new Date(this.x)).getDate() + ' ' +   
                    month[(new Date(this.x)).getMonth()] + 
                     ': '+ this.y;
           }
        },
        legend: {
           layout: 'vertical',
           align: 'right',
           verticalAlign: 'top',
           x: 0,
           y: 0,
           borderWidth: 2
        },
        series: [{
           pointInterval: 24 * 3600 * 1000,
           pointStart: &amp;lt;%= arg[:start_time] %&amp;gt;,
           name: '&amp;lt;%= arg[:name] %&amp;gt;',
           data: &amp;lt;%= arg[:values] %&amp;gt;
        }]
       });
      });
    &amp;lt;/script&amp;gt;
    }
  ERB.new(line_chart).result(binding)
end

c = line(:name =&amp;gt; "My Fancy Chart",
     :subtitle =&amp;gt; "subtitle",
     :xlabel =&amp;gt; "x label",
     :ylabel =&amp;gt; "y label",
     :start_time =&amp;gt; (Date.today-7).to_time.to_i * 1000, 
     :values =&amp;gt; [12.2, 13.3, 11.1, 15.5])
print c
&lt;/code&gt;&lt;/pre&gt;
</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/05/24/dashboards.html</feedburner:origLink></entry>
 
 <entry>
   <title>Who Chats the Most?</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/g6K8d7uH9Zk/skype-chats.html" />
   <updated>2011-04-28T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2011/04/28/skype-chats</id>
   <category term="data" label="data" />
   <category term="visualization" label="visualization" />
   <category term="statistics" label="statistics" />
   <category term="ruby" label="ruby" />
   
   <content type="html">&lt;p&gt;From my Skype chat history, a visualization of the counts of chats by
(anonymised) author.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/contacts.png" alt="http://al3xandr3.github.com/img/contacts.png" /&gt;&lt;/p&gt;

&lt;h3&gt;Code&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;require 'sqlite3'
require 'rubyvis'

contacts={}

# count
db = SQLite3::Database.new("[skype-folder]/main.db")
db.execute("SELECT author, count(author) FROM Messages GROUP BY author ORDER BY count(author) DESC" ) do |author, count|
  #contacts[author]=count # real ones
  contacts[author.split('').sample(3).join]=count if count&amp;gt;60 # Anonymized
end

cs=pv.Colors.category20()
format=Rubyvis::Format.number
color = pv.Colors.category20
nodes = pv.dom(contacts).root("rubyvis").nodes

vis = pv.Panel.new()
    .width(600)
    .height(1000)

treemap = vis.add(Rubyvis::Layout::Treemap).
  nodes(nodes).
  mode("squarify").
  round(true)

treemap.leaf.add(Rubyvis::Panel).
  fill_style(lambda{ |d| cs.scale(d) }).
  stroke_style("#fff").
  line_width(1).
  antialias(true).
  title(lambda {|d| d.node_name + " " + format.format(d.node_value)})

treemap.node_label.add(Rubyvis::Label).
  text_style(lambda {|d| pv.rgb(0, 0, 0, 1)}).
  font(lambda{|d| v=d.node_value/90; (v&amp;lt;=8)? "8px sans-serif" : "#{v}px sans-serif"})
vis.render

# saves an svg (the block form closes the file when done)
File.open("contacts.svg", "w") { |f| f.write vis.to_svg }
&lt;/code&gt;&lt;/pre&gt;
</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/04/28/skype-chats.html</feedburner:origLink></entry>
 
 <entry>
   <title>Machine Learning Ex5.2 - Regularized Logistic Regression</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/0yVxgP2XtwI/ml-ex52.html" />
   <updated>2011-03-20T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2011/03/20/ml-ex52</id>
   <category term="machinelearning" label="machinelearning" />
   <category term="R" label="R" />
   
   <content type="html">&lt;script type="text/javascript" src="http://cdn.mathjax.org/mathjax/1.1-latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML-full"&gt;
    MathJax.Hub.Config({
            jax: ["input/TeX", "output/HTML-CSS"],
        extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js",
                     "TeX/noUndefined.js"],
        tex2jax: {
            inlineMath: [ ["\\(","\\)"] ],
            displayMath: [ ['$$','$$'], ["\\[","\\]"], ["\\begin{displaymath}","\\end{displaymath}"] ],
            skipTags: ["script","noscript","style","textarea","pre","code"],
            ignoreClass: "tex2jax_ignore",
            processEscapes: false,
            processEnvironments: true,
            preview: "TeX"
        },
        showProcessingMessages: true,
        displayAlign: "left",
        displayIndent: "2em",
 
        "HTML-CSS": {
             scale: 100,
             availableFonts: ["STIX","TeX"],
             preferredFont: "TeX",
             webFont: "TeX",
             imageFont: "TeX",
             showMathMenu: true,
        },
        MMLorHTML: {
             prefer: {
                 MSIE:    "MML",
                 Firefox: "MML",
                 Opera:   "HTML",
                 other:   "HTML"
             }
        }
    });
&lt;/script&gt;


&lt;p&gt;&lt;a href="http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&amp;amp;doc=exercises/ex5/ex5.html"&gt;Exercise 5.2&lt;/a&gt; improves the Logistic Regression implementation done in
&lt;a href="http://al3xandr3.github.com/2011/03/16/ml-ex4.html"&gt;Exercise 4&lt;/a&gt; by adding a regularization parameter that reduces the problem
of over-fitting. We will be using Newton's Method.&lt;/p&gt;

&lt;p&gt;With implementation in R.&lt;/p&gt;

&lt;h2&gt;Data&lt;/h2&gt;

&lt;p&gt;Here's what the data we want to fit looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;google.spreadsheet &amp;lt;- function (key) {
  library(RCurl)
  # ssl validation off (passed to curl through .opts)
  tt &amp;lt;- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE,
                             ssl.verifypeer = FALSE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydHZPN2pFbkZGd1RKeU81OFY3ZHJldWc")

# plot the data
plot(mydata$u[mydata$y == 0], mydata$v[mydata$y == 0], xlab="u", ylab="v")
points(mydata$u[mydata$y == 1], mydata$v[mydata$y == 1], col="blue", pch=3)
legend("topright", c("y=0","y=1"), pch=c(1, 3), col=c("black", "blue"), bty="n")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex52-data.png" alt="http://al3xandr3.github.com/img/ml-ex52-data.png" /&gt;&lt;/p&gt;

&lt;p&gt;The idea of "fitting" is to create a mathematical model that separates
the circles from the crosses in the plot above by learning from the existing
data. The model then lets us predict, for new u and v values, the
probability of being a cross.&lt;/p&gt;

&lt;h2&gt;Theory&lt;/h2&gt;

&lt;p&gt;Hypothesis is:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
h_\theta(x) = g(\theta^T x) = \frac{1}{ 1 + e ^{- \theta^T x} }
&lt;/script&gt;


&lt;p&gt;Regularization is all about loosening up a tight fit: it avoids over-fitting and
gives a more generalized fit, which is more likely to work well on new
data (for making predictions).&lt;/p&gt;

&lt;p&gt;For that we define the cost function with an added regularization term
that blunts the fit, like so:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
J(\theta) = \frac{1}{m} \sum_{i=1}^m [-y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2
&lt;/script&gt;


&lt;p&gt;lambda is called the regularization parameter.&lt;/p&gt;
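
&lt;p&gt;To make the role of the penalty concrete, here is a small Ruby sketch of the cost function (the implementation used in this post is the R code further below). The two samples are made up, and the intercept theta[0] is left out of the penalty, as in the sum over j starting at 1.&lt;/p&gt;

```ruby
# Sigmoid g(z)
def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Regularized logistic cost J(theta).
# xs: rows of features with a leading 1, ys: labels 0 or 1.
# theta[0] (the intercept) is left out of the penalty.
def cost(xs, ys, theta, lam)
  m = xs.size.to_f
  fit = xs.zip(ys).sum do |x, y|
    h = sigmoid(x.zip(theta).sum { |xj, tj| xj * tj })
    -y * Math.log(h) - (1 - y) * Math.log(1 - h)
  end
  penalty = theta.drop(1).sum { |t| t**2 }
  fit / m + lam / (2.0 * m) * penalty
end

xs = [[1.0, 0.5], [1.0, -1.5]]   # hypothetical samples
ys = [1, 0]
puts cost(xs, ys, [0.0, 0.0], 1.0)   # theta = 0 gives log(2), no penalty
puts cost(xs, ys, [0.0, 2.0], 1.0)   # the penalty adds 1 * 4 / (2 * 2) = 1
```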

&lt;p&gt;The iterative theta update using Newton's method is defined as:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_{\theta}J 
&lt;/script&gt;


&lt;p&gt;And the gradient and Hessian are defined like so (the intercept \(\theta_0\) is not regularized):&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\nabla_{\theta}J  = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)} + \frac{\lambda}{m} \begin{bmatrix} 0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}
&lt;/script&gt;




&lt;script type="math/tex; mode=display"&gt;
H = \frac{1}{m} \sum_{i=1}^m [h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) x^{(i)} (x^{(i)})^T] + \frac{\lambda}{m} \begin{bmatrix} 
0 &amp; &amp; &amp; \\ &amp; 1 &amp; &amp; \\ &amp; &amp; \ddots &amp; \\ &amp; &amp; &amp; 1 
\end{bmatrix}
&lt;/script&gt;
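
&lt;p&gt;As a cross-check of these formulas, here is a rough Ruby sketch of one Newton update using the Matrix class from Ruby's matrix library. The toy data and the function name are made up for illustration; the implementation used in this post is the R code below.&lt;/p&gt;

```ruby
require 'matrix'

def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# One Newton update, theta - H^-1 * grad, for regularized logistic regression.
# x: m-by-n Matrix with a leading column of ones, y: Vector of 0/1 labels.
def newton_step(x, y, theta, lam)
  m = x.row_count.to_f
  n = x.column_count
  h = (x * theta).map { |z| sigmoid(z) }
  # lambda/m on the diagonal, except for the intercept entry
  reg = Matrix.build(n, n) { |i, j| (i == j and i > 0) ? lam / m : 0.0 }
  grad = (x.transpose * (h - y)) / m + reg * theta
  # diagonal matrix of the per-sample weights h * (1 - h)
  w = Matrix.diagonal(*h.map { |p| p * (1.0 - p) }.to_a)
  hess = (x.transpose * w * x) / m + reg
  theta - hess.inverse * grad
end

# Hypothetical toy data that no line separates perfectly
x = Matrix[[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = Vector[0.0, 1.0, 0.0, 1.0]
theta = Vector[0.0, 0.0]
8.times { theta = newton_step(x, y, theta, 1.0) }
puts theta
```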


&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;Let's first define the functions above, with the added regularization
parameter:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# sigmoid function
g = function (z) {
  return (1 / (1 + exp(-z)))
} # plot(g(c(1,2,3,4,5,6)), type="l")

# build high order feature vector
# for 2 features, for a given degree
hi.features = function (f1,f2,deg) {
  ma = matrix(rep(1,length(f1)))
  for (i in 1:deg) {
    for (j in 0:i)    
      ma = cbind(ma, f1^(i-j) * f2^j)
  }
  return(ma)
} # hi.features(c(1,2), c(3,4),2)
# creates: 1 u v u^2 uv v^2 ...

# hypothesis
h = function (x,th) {
  return(g(x %*% th))
} # h(x,th)

# derivative of J 
grad = function (x,y,th,m,la) {
  G = (la/m * th)
  G[1,] = 0
  return((1/m * t(x) %*% (h(x,th) - y)) +  G)
} # grad(x,y,th,m,la)

# hessian
H = function (x,y,th,m,la) {
  n = length(th)
  L = la/m * diag(n)
  L[1,] = 0
  # weight each sample (row of x) by h * (1 - h) before forming t(x) %*% x
  w = as.vector(h(x,th) * (1 - h(x,th)))
  return((1/m * t(x) %*% (x * w)) + L)
} # H(x,y,th,m,la)

# cost function
J = function (x,y,th,m,la) {
  pt = th
  pt[1] = 0
  A = (la/(2*m))* t(pt) %*% pt
  return((1/m * sum(-y * log(h(x,th)) - (1 - y) * log(1 - h(x,th)))) + A)
} # J(x,y,th,m,la)
&lt;/code&gt;&lt;/pre&gt;
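
&lt;p&gt;The feature mapping is the least obvious of these helpers, so here is the same idea for a single (u, v) pair as a short Ruby sketch (the name hi_features simply mirrors the R function above).&lt;/p&gt;

```ruby
# Map two features to all monomials u^(i-j) * v^j up to a given degree,
# in the order 1, u, v, u^2, uv, v^2, ...
def hi_features(u, v, deg)
  terms = [1.0]
  (1..deg).each do |i|
    (0..i).each { |j| terms.push(u**(i - j) * v**j) }
  end
  terms
end

p hi_features(2.0, 3.0, 2)           # 1, u, v, u^2, uv, v^2
puts hi_features(0.5, 0.5, 6).size   # 28 features for degree 6
```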

&lt;p&gt;Now we can make it iterate until convergence, first for lambda=1:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# setup variables
m = length(mydata$u) # samples
x = hi.features(mydata$u, mydata$v,6)
n = ncol(x) # features
y = matrix(mydata$y, ncol=1)

# lambda = 1
# use the cost function to check it works
th1 = matrix(0,n)
la = 1
jiter = array(0,c(15,1))
for (i in 1:15) {
  jiter[i] = J(x,y,th1,m,la)
  th1 = th1 - solve(H(x,y,th1,m,la)) %*% grad(x,y,th1,m,la) 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Validate that it is converging properly by plotting the cost function J against
the number of iterations.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# check that it is converging correctly
plot(jiter, xlab="iterations", ylab="cost J")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex52-j.png" alt="http://al3xandr3.github.com/img/ml-ex52-j.png" /&gt;&lt;/p&gt;

&lt;p&gt;Converging well and fast, as is typical of Newton's method.&lt;/p&gt;

&lt;p&gt;And now we make it iterate for lambda=0 and lambda=10, to compare the
fits later:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# lambda = 0
th0 = matrix(0,n)
la = 0
for (i in 1:15) {
  th0 = th0 - solve(H(x,y,th0,m,la)) %*% grad(x,y,th0,m,la) 
}

# lambda = 10
th10 = matrix(0,n)
la = 10
for (i in 1:15) {
  th10 = th10 - solve(H(x,y,th10,m,la)) %*% grad(x,y,th10,m,la) 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Finally calculate the decision boundary line and visualize it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# calculate the decision boundary line
# by creating many points
u = seq(-1, 1.2, len=200);
v = seq(-1, 1.2, len=200);
z0 = matrix(0, length(u), length(v))
z1 = matrix(0, length(u), length(v))
z10 = matrix(0, length(u), length(v))
for (i in 1:length(u)) {
  for (j in 1:length(v)) {
    # contour() expects z[i,j] to be the value at (u[i], v[j])
    z0[i,j] =  hi.features(u[i],v[j],6) %*% th0
    z1[i,j] =  hi.features(u[i],v[j],6) %*% th1
    z10[i,j] =  hi.features(u[i],v[j],6) %*% th10
  }
}

# plots: draw only the level z = 0, which is the decision boundary
contour(u,v,z0, levels=0, xlab="u", ylab="v", col="black",lty=2)
contour(u,v,z1, levels=0, col="red",lty=2, add=TRUE)
contour(u,v,z10, levels=0, col="green3",lty=2, add=TRUE)
points(mydata$u[mydata$y == 0], mydata$v[mydata$y == 0])
points(mydata$u[mydata$y == 1], mydata$v[mydata$y == 1], col="blue", pch=3)
legend("topright",  c(expression(lambda==0), expression(lambda==1),expression(lambda==10)), lty=2, col=c("black", "red","green3"),bty="n" )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex52-fit.png" alt="http://al3xandr3.github.com/img/ml-ex52-fit.png" /&gt;&lt;/p&gt;

&lt;p&gt;See that the black line (lambda=0) is the one most tightly fit to the
crosses. As we increase lambda the boundary becomes looser (more
generalized), which tends to predict new unseen data better, as long as lambda is not so large that the model under-fits.&lt;/p&gt;

&lt;h4&gt;References&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Thanks to Andrew Ng and &lt;a href="http://openclassroom.stanford.edu/MainFolder/HomePage.php"&gt;OpenClassRoom&lt;/a&gt; for the great lessons.&lt;/li&gt;
&lt;/ul&gt;

</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/03/20/ml-ex52.html</feedburner:origLink></entry>
 
 <entry>
   <title>Machine Learning Ex5.1 - Regularized Linear Regression</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/I37XTX2zApg/ml-ex51.html" />
   <updated>2011-03-18T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2011/03/18/ml-ex51</id>
   <category term="machinelearning" label="machinelearning" />
   <category term="R" label="R" />
   
   <content type="html">&lt;script type="text/javascript" src="http://cdn.mathjax.org/mathjax/1.1-latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML-full"&gt;
    MathJax.Hub.Config({
            jax: ["input/TeX", "output/HTML-CSS"],
        extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js",
                     "TeX/noUndefined.js"],
        tex2jax: {
            inlineMath: [ ["\\(","\\)"] ],
            displayMath: [ ['$$','$$'], ["\\[","\\]"], ["\\begin{displaymath}","\\end{displaymath}"] ],
            skipTags: ["script","noscript","style","textarea","pre","code"],
            ignoreClass: "tex2jax_ignore",
            processEscapes: false,
            processEnvironments: true,
            preview: "TeX"
        },
        showProcessingMessages: true,
        displayAlign: "left",
        displayIndent: "2em",
 
        "HTML-CSS": {
             scale: 100,
             availableFonts: ["STIX","TeX"],
             preferredFont: "TeX",
             webFont: "TeX",
             imageFont: "TeX",
             showMathMenu: true,
        },
        MMLorHTML: {
             prefer: {
                 MSIE:    "MML",
                 Firefox: "MML",
                 Opera:   "HTML",
                 other:   "HTML"
             }
        }
    });
&lt;/script&gt;


&lt;p&gt;&lt;a href="http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&amp;amp;doc=exercises/ex5/ex5.html"&gt;Exercise 5.1&lt;/a&gt; improves the Linear Regression implementation done in
&lt;a href="http://al3xandr3.github.com/2011/03/08/ml-ex3.html"&gt;Exercise 3&lt;/a&gt; by adding a regularization parameter that reduces the problem
of over-fitting.&lt;/p&gt;

&lt;p&gt;Over-fitting occurs especially when fitting a high-order polynomial, which is what we
will try to do here.&lt;/p&gt;

&lt;p&gt;With implementation in R.&lt;/p&gt;

&lt;h2&gt;Data&lt;/h2&gt;

&lt;p&gt;Here are the points we will build a model from:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;google.spreadsheet &amp;lt;- function (key) {
  library(RCurl)
  # ssl validation off (passed to curl through .opts)
  tt &amp;lt;- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE,
                             ssl.verifypeer = FALSE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydGhtbUlZekVUQTc0dm5QaXp1YWpSY3c")

# view data
plot(mydata)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex51-data.png" alt="http://al3xandr3.github.com/img/ml-ex51-data.png" /&gt;&lt;/p&gt;

&lt;h2&gt;Theory&lt;/h2&gt;

&lt;p&gt;We will fit a 5th order polynomial, so the hypothesis is:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
h_\theta(x) = \theta_0 x_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4 + \theta_5 x^5
&lt;/script&gt;


&lt;p&gt;With x_0 = 1&lt;/p&gt;

&lt;p&gt;The idea of the regularization is to blunt the fit a bit, i.e. loosen up the
tight fit.&lt;/p&gt;

&lt;p&gt;For that we define the cost function like so:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
J(\theta) = \frac{1}{2m} [\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2]
&lt;/script&gt;


&lt;p&gt;Lambda is called the regularization parameter.&lt;/p&gt;

&lt;p&gt;The regularization term added at the end increases the cost of large
parameter values. This is reflected in the search for the \(\theta\)
parameters and consequently loosens up the tight fit.&lt;/p&gt;

&lt;p&gt;After some math that is not shown here, the &lt;strong&gt;normal equations&lt;/strong&gt; with the
regularization parameter added, become:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\theta = (X^T X + \lambda \begin{bmatrix} 0 &amp; &amp; &amp; \\ &amp; 1 &amp; &amp; \\ &amp; &amp; ... &amp; \\ &amp; &amp; &amp; 1 \end{bmatrix} )^{-1} (X^T y)
&lt;/script&gt;
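
&lt;p&gt;To see what the extra matrix does, here is a small Ruby sketch of the regularized normal equations using Ruby's matrix library, on made-up points that lie exactly on a line (the function name is hypothetical; the fit for this exercise is the R code below). Rationals keep the arithmetic exact.&lt;/p&gt;

```ruby
require 'matrix'

# theta = (X'X + lambda * D)^-1 X'y, with D the identity except D[0,0] = 0
def ridge_fit(x, y, lam)
  n = x.column_count
  d = Matrix.build(n, n) { |i, j| (i == j and i > 0) ? 1 : 0 }
  (x.transpose * x + d * lam).inverse * (x.transpose * y)
end

# Hypothetical points lying exactly on y = 1 + 2x
x = Matrix[[1, 0], [1, 1], [1, 2]].map { |v| v.to_r }
y = Vector[1, 3, 5].map { |v| v.to_r }

p ridge_fit(x, y, 0)   # no penalty: recovers intercept 1 and slope 2
p ridge_fit(x, y, 1)   # the penalty shrinks the slope to 4/3
```

&lt;p&gt;The zero in the top-left corner of the matrix is what keeps the intercept out of the penalty.&lt;/p&gt;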


&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;We will try 3 different lambda values to see how it influences the fit.
Starting with lambda=0 where we can see the fit without the
regularization parameter.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# setup variables
m = length(mydata$x) # samples
x = matrix(c(rep(1,m), mydata$x, mydata$x^2, mydata$x^3, mydata$x^4, mydata$x^5), ncol=6)
n = ncol(x) # features
y = matrix(mydata$y, ncol=1)
lambda = c(0,1,10)
d = diag(1,n,n)
d[1,1] = 0
th = array(0,c(n,length(lambda)))

# apply normal equations for each of the lambda's
for (i in 1:length(lambda)) {
  th[,i] = solve(t(x) %*% x + (lambda[i] * d)) %*% (t(x) %*% y)
}

# plot
plot(mydata)

# lets create many points
nwx = seq(-1, 1, len=50);
x = matrix(c(rep(1,length(nwx)), nwx, nwx^2, nwx^3, nwx^4, nwx^5), ncol=6)
lines(nwx, x %*% th[,1], col="blue", lty=2)
lines(nwx, x %*% th[,2], col="red", lty=2)
lines(nwx, x %*% th[,3], col="green3", lty=2)
legend("topright", c(expression(lambda==0), expression(lambda==1),expression(lambda==10)), lty=2,col=c("blue", "red", "green3"), bty="n")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex51-regularized.png" alt="http://al3xandr3.github.com/img/ml-ex51-regularized.png" /&gt;&lt;/p&gt;

&lt;p&gt;With lambda=0 the fit is very tight to the original points (the blue
line), but as we increase lambda the model gets less tight (more generalized),
which helps avoid over-fitting.&lt;/p&gt;
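
&lt;p&gt;To see the shrinking effect in numbers, here is a minimal sketch on made-up data (the sample points, the seed and the names &lt;code&gt;xs&lt;/code&gt;, &lt;code&gt;ys&lt;/code&gt; and &lt;code&gt;penalty&lt;/code&gt; are assumptions for illustration, not the exercise data). It solves the regularized normal equations for the same three lambda values and prints the sum of the squared penalized parameters, which should not grow as lambda increases:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# hypothetical sample: 7 noisy points on [-1, 1]
set.seed(42)
xs = seq(-1, 1, len=7)
ys = xs^2 + rnorm(7, sd=0.2)

# 5th order polynomial design matrix, first column is x_0 = 1
X = cbind(1, xs, xs^2, xs^3, xs^4, xs^5)
D = diag(1, 6, 6)
D[1,1] = 0 # the intercept is not penalized

lambdas = c(0, 1, 10)
penalty = numeric(length(lambdas))
for (i in 1:length(lambdas)) {
  # solve(A, b) solves A theta = b without forming the inverse
  th = solve(t(X) %*% X + lambdas[i] * D, t(X) %*% ys)
  penalty[i] = sum(th[-1]^2) # sum of squares of theta_1..theta_5
}
print(penalty)
&lt;/code&gt;&lt;/pre&gt;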

&lt;h3&gt;References:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://al3xandr3.github.com/2011/03/08/ml-ex3.html"&gt;Exercise 3, original Linear Regression implementation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Thanks to Andrew Ng and &lt;a href="http://openclassroom.stanford.edu/MainFolder/HomePage.php"&gt;OpenClassRoom&lt;/a&gt; for the great lessons.&lt;/li&gt;
&lt;/ul&gt;

</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/03/18/ml-ex51.html</feedburner:origLink></entry>
 
 <entry>
   <title>Machine Learning Ex4 - Logistic Regression</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/rQSSKYg_ufg/ml-ex4.html" />
   <updated>2011-03-16T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2011/03/16/ml-ex4</id>
   <category term="machinelearning" label="machinelearning" />
   <category term="R" label="R" />
   
   <content type="html">&lt;script type="text/javascript" src="http://www.mathjax.org/mathjax/MathJax.js"&gt;
    MathJax.Hub.Config({
            jax: ["input/TeX", "output/HTML-CSS"],
        extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js",
                     "TeX/noUndefined.js"],
        tex2jax: {
            inlineMath: [ ["\\(","\\)"] ],
            displayMath: [ ['$$','$$'], ["\\[","\\]"], ["\\begin{displaymath}","\\end{displaymath}"] ],
            skipTags: ["script","noscript","style","textarea","pre","code"],
            ignoreClass: "tex2jax_ignore",
            processEscapes: false,
            processEnvironments: true,
            preview: "TeX"
        },
        showProcessingMessages: true,
        displayAlign: "left",
        displayIndent: "2em",
 
        "HTML-CSS": {
             scale: 100,
             availableFonts: ["STIX","TeX"],
             preferredFont: "TeX",
             webFont: "TeX",
             imageFont: "TeX",
             showMathMenu: true,
        },
        MMLorHTML: {
             prefer: {
                 MSIE:    "MML",
                 Firefox: "MML",
                 Opera:   "HTML",
                 other:   "HTML"
             }
        }
    });
&lt;/script&gt;


&lt;p&gt;&lt;a href="http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&amp;amp;doc=exercises/ex4/ex4.html"&gt;Exercise 4&lt;/a&gt; is all about implementing Logistic Regression using Newton's
Method, on a classification problem.&lt;/p&gt;

&lt;p&gt;For all this to make sense I suggest having a look at &lt;a href="http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning"&gt;Andrew Ng's machine
learning lectures&lt;/a&gt; on &lt;a href="http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning"&gt;openclassroom&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We start with a dataset representing 40 students who were admitted to college
and 40 students who were not admitted, and their corresponding grades for 2
exams. &lt;em&gt;Your mission, should you decide to accept it,&lt;/em&gt; is to build a binary
classification model that estimates college admission chances based on a
student's scores on two exams (test1 and test2).&lt;/p&gt;

&lt;p&gt;The implementation is in R.&lt;/p&gt;

&lt;h2&gt;Plot the Data&lt;/h2&gt;

&lt;p&gt;We start by looking at the data.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;google.spreadsheet &amp;lt;- function (key) {
  library(RCurl)
  # ssl validation off
  ssl.verifypeer &amp;lt;- FALSE

  tt &amp;lt;- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydC1vRVEzM1VJQnNneFo5dWNzR1F5Umc")

# plots
plot(mydata$test1[mydata$admitted == 0], mydata$test2[mydata$admitted == 0], xlab="test1", ylab="test2", col="red")
points(mydata$test1[mydata$admitted == 1], mydata$test2[mydata$admitted == 1], col="blue", pch=2)
legend("bottomright", c("not admitted", "admitted"), pch=c(1, 2), col=c("red", "blue") )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex4-plotdata.png" alt="http://al3xandr3.github.com/img/ml-ex4-plotdata.png" /&gt;&lt;/p&gt;

&lt;h2&gt;A Bit of Theory&lt;/h2&gt;

&lt;p&gt;Most of the ideas explored in linear regression apply in the same way. First we
define what the hypothesis equation looks like (the mathematical representation
of this knowledge). It originates from the line equation, but is now wrapped in
a function that returns values between 0 and 1, which suits binary
classification. That is, given a test1 value and a
test2 value, the equation returns the probability that the student will be
admitted (y=1) into college:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
h_\theta(x) = g(\theta^T x) = \frac{1}{ 1 + e ^{- \theta^T x} }
&lt;/script&gt;


&lt;p&gt;g is the sigmoid function. And this returns:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
h_\theta(x) = P (y=1 | x; \theta)
&lt;/script&gt;


&lt;p&gt;Now we need to find the \(\theta\) parameters that give a working hypothesis
equation. To help with that search we define a cost equation that, for a given
\(\theta\), returns how far off we are compared to the sample data.&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
J(\theta) = \frac{1}{m} \sum_{i=1}^m [ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) ]
&lt;/script&gt;


&lt;p&gt;The lower the cost the better (the closer we get to the real data). Thus, the goal
becomes to minimize the cost.&lt;/p&gt;

&lt;p&gt;We can use &lt;a href="http://en.wikipedia.org/wiki/File:NewtonIteration_Ani.gif"&gt;Newton's method&lt;/a&gt; for that. Newton's method, like
gradient descent, is an iterative search: it looks for the point where the derivative of
the cost function is 0, which is where the cost is at its minimum. After doing some math, the iterative \(\theta\) update
using Newton's method is defined as:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\theta^{(t+1)} = \theta^{(t)} -  H^{-1} \nabla_{\theta}J
&lt;/script&gt;


&lt;p&gt;And the gradient and Hessian are defined like so (with each \(x^{(i)}\) a column vector):&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\nabla_{\theta}J  = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}
&lt;/script&gt;


&lt;script type="math/tex; mode=display"&gt;
H = \frac{1}{m} \sum_{i=1}^m [h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) x^{(i)} (x^{(i)})^T]
&lt;/script&gt;


&lt;h2&gt;Implementation&lt;/h2&gt;

&lt;p&gt;First we implement the above equations:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# sigmoid
g = function (z) {
  return (1 / (1 + exp(-z) ))
} # plot(g(c(1,2,3,4,5,6)))

# hypothesis 
h = function (x,th) {
  return( g(x %*% th) )
} # h(x,th)

# cost
J = function (x,y,th,m) {
  return( 1/m * sum(-y * log(h(x,th)) - (1 - y) * log(1 - h(x,th))) )
} # J(x,y,th,m)

# derivative of J (gradient)
grad = function (x,y,th,m) {
  return( 1/m * t(x) %*% (h(x,th) - y))
} # grad(x,y,th,m)

# Hessian
H = function (x,y,th,m) {
  return (1/m * t(x) %*% x * diag(h(x,th)) * diag(1 - h(x,th)))
} # H(x,y,th,m)
&lt;/code&gt;&lt;/pre&gt;
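
&lt;p&gt;One detail worth checking in &lt;code&gt;H&lt;/code&gt;: in R, &lt;code&gt;diag()&lt;/code&gt; applied to an m x 1 matrix returns only its first element, so the expression above scales &lt;code&gt;t(x) %*% x&lt;/code&gt; by the weight of the first sample alone. A minimal sketch of a variant that weights every sample by its own \(h_\theta(x)(1 - h_\theta(x))\), as in the Hessian formula, could look like this (the name &lt;code&gt;H.weighted&lt;/code&gt; and the tiny data set are made up for illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# sigmoid and hypothesis, same definitions as above
g = function (z) 1 / (1 + exp(-z))
h = function (x,th) g(x %*% th)

# Hessian with one weight per sample: 1/m * t(x) %*% W %*% x,
# where W is diagonal with entries h * (1 - h)
H.weighted = function (x,th,m) {
  w = as.vector(h(x,th) * (1 - h(x,th)))
  return( 1/m * t(x) %*% (x * w) ) # x * w scales row i of x by w[i]
}

# hypothetical data: 4 samples, intercept plus 2 features
x = matrix(c(rep(1,4), 0.5,1.5,2.0,3.0, 1.0,0.2,2.5,1.1), ncol=3)
th = matrix(c(-1, 0.5, 0.2), ncol=1)
m = nrow(x)

# compare with the explicit sum over samples from the formula
Hsum = matrix(0, 3, 3)
for (i in 1:m) {
  xi = matrix(x[i,], ncol=1)
  hi = as.numeric(g(t(xi) %*% th))
  Hsum = Hsum + hi * (1 - hi) * (xi %*% t(xi))
}
Hsum = Hsum / m

# largest absolute difference between the two versions
print(max(abs(H.weighted(x,th,m) - Hsum)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the gradient is unchanged, both versions head towards the same minimum; the per-sample weights mainly affect how quickly and how precisely the iterations get there.&lt;/p&gt;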

&lt;p&gt;Make it go (iterate until convergence):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# setup variables
j = array(0,c(10,1))
m = length(mydata$test1)
x = matrix(c(rep(1,m), mydata$test1, mydata$test2), ncol=3)
y = matrix(mydata$admitted, ncol=1)
th = matrix(0,3)

# iterate 
# note that the newton's method converges fast, 10x is enough
for (i in 1:10) {
  j[i] = J(x,y,th,m) # stores each iteration Cost
  th = th - solve(H(x,y,th,m)) %*% grad(x,y,th,m) 
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Have a look at the cost function by iteration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;plot(j, xlab="iterations", ylab="cost J")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex4-j.png" alt="http://al3xandr3.github.com/img/ml-ex4-j.png" /&gt;&lt;/p&gt;

&lt;p&gt;Note that only 4-5 iterations are needed; Newton's method converges much faster
than gradient descent.&lt;/p&gt;
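
&lt;p&gt;As a sanity check, Newton's method should land on the same parameters as R's built-in &lt;code&gt;glm&lt;/code&gt; with &lt;code&gt;family=binomial&lt;/code&gt;, which fits the same model by iteratively reweighted least squares. A minimal sketch on simulated data (the data, the seed and the names &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, &lt;code&gt;lab&lt;/code&gt; and &lt;code&gt;th.newton&lt;/code&gt; are assumptions for illustration, and the Hessian here uses one weight per sample):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# hypothetical data: 200 samples, 2 features, labels drawn from a known model
set.seed(1)
n = 200
a = rnorm(n)
b = rnorm(n)
lab = rbinom(n, 1, plogis(0.5 + a - b))

X = cbind(1, a, b)
Y = matrix(lab, ncol=1)
th.newton = matrix(0, 3)

g = function (z) 1 / (1 + exp(-z))

# 10 Newton updates: theta = theta - H^-1 grad
for (i in 1:10) {
  p = g(X %*% th.newton)
  gradient = 1/n * t(X) %*% (p - Y)
  hessian = 1/n * t(X) %*% (X * as.vector(p * (1 - p)))
  th.newton = th.newton - solve(hessian, gradient)
}

# same model with the built-in fit
fit = glm(lab ~ a + b, family=binomial)
print(cbind(newton=as.vector(th.newton), glm=as.vector(coef(fit))))
&lt;/code&gt;&lt;/pre&gt;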

&lt;p&gt;Exercise questions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# 1. What values of  did you get? How many iterations were required for convergence?
print("1.")
print(th)

# 2. What is the probability that a student with a score of 20 on Exam 1
# and a score of 80 on Exam 2 will not be admitted?
print("2.")
print((1 - g(c(1, 20, 80) %*% th))* 100)


          1
[1] "1."
            [,1]
[1,] -16.4469479
[2,]   0.1457278
[3,]   0.1618285
[1] "2."
         [,1]
[1,] 64.24722
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To visualize the fit, note that the decision boundary is where \(P(y=1 | x ;\theta) =
0.5\), which happens when:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
 \theta^T x=0
&lt;/script&gt;


&lt;p&gt;Thus&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# when ax0 + bx2 + cx3 = 0 is the middle(decision boundary line),
# so given x1 from sample data, solving to x2, we get:
x2 = (-1/th[3,]) * ((th[2,] * x1) + th[1,])

# get 2 points (that will define a line)
x1 = c(min(x[,2]), max(x[,2]))

# plot
plot(x1,x2, type='l',  xlab="test1", ylab="test2")
points(mydata$test1[mydata$admitted == 0], mydata$test2[mydata$admitted == 0], col="red")
points(mydata$test1[mydata$admitted == 1], mydata$test2[mydata$admitted == 1], col="blue", pch=2)
legend("bottomright", c("not admitted", "admitted"), pch=c(1, 2), col=c("red", "blue") )
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex4-fit.png" alt="http://al3xandr3.github.com/img/ml-ex4-fit.png" /&gt;&lt;/p&gt;

&lt;p&gt;Beautiful&lt;/p&gt;
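
&lt;p&gt;The boundary can also be checked numerically. This small sketch plugs in the theta values printed earlier, picks an arbitrary test1 score (the value 40 and the names &lt;code&gt;t1&lt;/code&gt; and &lt;code&gt;t2&lt;/code&gt; are made up for illustration), solves for the matching test2 score on the line, and evaluates the hypothesis there, which should give 0.5 up to floating point rounding:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# theta as printed in the output above
th = c(-16.4469479, 0.1457278, 0.1618285)
g = function (z) 1 / (1 + exp(-z))

# pick a hypothetical test1 score and solve theta^T x = 0 for test2
t1 = 40
t2 = -(th[1] + th[2] * t1) / th[3]

# probability of admission for a point on the boundary
print(g(sum(th * c(1, t1, t2))))
&lt;/code&gt;&lt;/pre&gt;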

&lt;h4&gt;Notes &amp;amp; References:&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Thanks Tal Galili for adding my blog into the &lt;a href="http://www.r-bloggers.com/"&gt;R-bloggers.com&lt;/a&gt; list. &lt;a href="http://www.r-bloggers.com/"&gt;Go have a peek, R-bloggers&lt;/a&gt; is a great source of R information.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&amp;amp;doc=exercises/ex4/ex4.html"&gt;Exercise 4 here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning"&gt;Lectures here&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Thanks to Andrew Ng and &lt;a href="http://openclassroom.stanford.edu/MainFolder/HomePage.php"&gt;OpenClassRoom&lt;/a&gt; for the great lessons.&lt;/li&gt;
&lt;/ul&gt;

</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/03/16/ml-ex4.html</feedburner:origLink></entry>
 
 <entry>
   <title>Machine Learning Ex3 - Multivariate Linear Regression</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/titGxUFc2Lw/ml-ex3.html" />
   <updated>2011-03-08T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2011/03/08/ml-ex3</id>
   <category term="machinelearning" label="machinelearning" />
   <category term="R" label="R" />
   
   <content type="html">&lt;script type="text/javascript" src="http://www.mathjax.org/mathjax/MathJax.js"&gt;
    MathJax.Hub.Config({
            jax: ["input/TeX", "output/HTML-CSS"],
        extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js",
                     "TeX/noUndefined.js"],
        tex2jax: {
            inlineMath: [ ["\\(","\\)"] ],
            displayMath: [ ['$$','$$'], ["\\[","\\]"], ["\\begin{displaymath}","\\end{displaymath}"] ],
            skipTags: ["script","noscript","style","textarea","pre","code"],
            ignoreClass: "tex2jax_ignore",
            processEscapes: false,
            processEnvironments: true,
            preview: "TeX"
        },
        showProcessingMessages: true,
        displayAlign: "left",
        displayIndent: "2em",
 
        "HTML-CSS": {
             scale: 100,
             availableFonts: ["STIX","TeX"],
             preferredFont: "TeX",
             webFont: "TeX",
             imageFont: "TeX",
             showMathMenu: true,
        },
        MMLorHTML: {
             prefer: {
                 MSIE:    "MML",
                 Firefox: "MML",
                 Opera:   "HTML",
                 other:   "HTML"
             }
        }
    });
&lt;/script&gt;


&lt;p&gt;&lt;a href="http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&amp;amp;doc=exercises/ex3/ex3.html"&gt;Exercise 3&lt;/a&gt; is about multivariate linear regression. Start by finding a
good learning rate (alpha) and then implement linear regression using the
normal equations instead of the gradient descent algorithm.&lt;/p&gt;

&lt;p&gt;The implementation is in R.&lt;/p&gt;

&lt;h2&gt;Data&lt;/h2&gt;

&lt;p&gt;As usual, the data is hosted in Google Docs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;google.spreadsheet &amp;lt;- function (key) {
  library(RCurl)
  # ssl validation off
  ssl.verifypeer &amp;lt;- FALSE

  tt &amp;lt;- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydExfUzdtVXZuUWphM19vdVBidnFFSWc")
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Feature Scaling&lt;/h3&gt;

&lt;p&gt;When applying gradient descent, we need to make sure that the feature values
are of the same order of magnitude, otherwise it will not converge well. So
here's a helper function to scale features:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# given a data frame and the column names i want to scale
# creates new columns: feature.scale = (feature - mean)/std
feature.scale = function (dta, cols) {
  for (col in cols) {
    sigma = sd(dta[col])
    mu = mean(dta[col])
    dta[paste(names(dta[col]), ".scale", sep = "")] = (dta[col] - mu)/sigma
  }
  return(dta)
}

dta = feature.scale(mydata, c("area", "bedrooms"))
tail(dta, 5)



   area bedrooms  price area.scale bedrooms.scale
43 2567        4 314000  0.7126179      1.0904165
44 1200        3 299000 -1.0075229     -0.2236752
45  852        2 179900 -1.4454227     -1.5377669
46 1852        4 299900 -0.1870900      1.0904165
47 1203        3 239500 -1.0037479     -0.2236752
&lt;/code&gt;&lt;/pre&gt;
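
&lt;p&gt;Base R also ships &lt;code&gt;scale()&lt;/code&gt;, which by default centers on the mean and divides by the standard deviation, so it can serve as a cross-check for the helper above. A minimal sketch with made-up values (the numbers in &lt;code&gt;area&lt;/code&gt; are for illustration only):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# hypothetical values, for illustration only
area = c(2104, 1600, 2400, 1416, 3000)

# manual scaling, as in feature.scale
manual = (area - mean(area)) / sd(area)

# built-in scaling; as.vector drops the matrix shape and attributes
builtin = as.vector(scale(area))

print(all.equal(manual, builtin))
&lt;/code&gt;&lt;/pre&gt;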

&lt;h2&gt;Finding Alpha&lt;/h2&gt;

&lt;p&gt;Recall from &lt;a href="http://al3xandr3.github.com/2011/02/24/ml-ex2-linear-regression.html"&gt;ex2&lt;/a&gt; that the gradient descent equation for the updates of
theta is:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\theta := \theta - \alpha \frac{1}{m} x^T (x\theta^T - y)
&lt;/script&gt;


&lt;p&gt;For finding a good alpha (\(\alpha\)) we will use a trial and error approach.
The idea is to look at how the cost value \(J(\theta)\) drops with the number of
iterations: the faster the drop the better, but if the cost goes up then the alpha
value is already too large.&lt;/p&gt;

&lt;p&gt;The cost is given by (in vectorized form):&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)
&lt;/script&gt;


&lt;p&gt;See the lessons for details on how to reach that equation.&lt;/p&gt;
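
&lt;p&gt;A quick way to convince yourself that the vectorized form is the same quantity as the summation form is to compute both on a tiny example. In this sketch the data is made up and, unlike the code in this post, theta is kept as a column vector:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# hypothetical data: 3 samples, intercept plus one feature
X = matrix(c(1,1,1, 1,2,3), ncol=2)
y = matrix(c(1,2,2), ncol=1)
th = matrix(c(0.5, 0.5), ncol=1)
m = nrow(X)

# summation form: 1/(2m) * sum of squared errors
J.sum = 0
for (i in 1:m) {
  J.sum = J.sum + (sum(X[i,] * th) - y[i])^2
}
J.sum = J.sum / (2*m)

# vectorized form: 1/(2m) * (X theta - y)^T (X theta - y)
r = X %*% th - y
J.vec = as.numeric(t(r) %*% r) / (2*m)

print(c(J.sum, J.vec))
&lt;/code&gt;&lt;/pre&gt;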

&lt;p&gt;Implementing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# lets try out a few alpha's
alpha = c(0.03, 0.1, 0.3, 1, 1.3, 2)

# store the J values over the iterations
J = array(0,c(50,length(alpha)))
m = length(dta$price)
theta = matrix(c(0,0,0), nrow=1)
x = matrix(c(rep(1,m), dta$area.scale, dta$bedrooms.scale), ncol=3)
y = matrix(dta$price, ncol=1)

# the delta updates
delta = function(x,y,th) {
  delta = (t(x) %*% ((x %*% t(th)) - y))
  return(t(delta))
}

# the cost for a given theta: 1/(2m) * (x theta - y)^T (x theta - y)
cost = function(x,y,th,m) {
  prt = ((x %*% t(th)) - y)
  return(1/(2*m) * (t(prt) %*% prt))
}

# run J for 50x, on each alpha
for (j in 1:length(alpha)) {
  theta = matrix(c(0,0,0), nrow=1) # restart from 0 so every alpha is compared from the same point
  for (i in 1:50) {
    J[i,j] = cost(x,y,theta,m) # capture the Cost
    theta = theta - alpha[j] * 1/m * delta(x,y,theta)
  }
}

# lets have a look
par(mfrow=c(3,2))
for (j in 1:length(alpha)) {
  plot(J[,j], type="l", xlab=paste("alpha", alpha[j]), ylab=expression(J(theta)))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex3-alpha.png" alt="http://al3xandr3.github.com/img/ml-ex3-alpha.png" /&gt;&lt;/p&gt;

&lt;p&gt;alpha 1 seems to be the best.&lt;/p&gt;

&lt;p&gt;Setting \(\alpha=1\) and running until convergence:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# running until convergence
for (i in 1:50000) {
  theta = theta - 1 * 1/m * delta(x,y,theta)
  if (abs(delta(x,y,theta)[2]) &amp;lt; 0.0000001) {  
    break # to interrupt updates
  }
}

# 1. The final values of theta
print("Theta:")
print(theta)

# 2. The predicted price of a house with 1650 square feet and 3 bedrooms.
# Don't forget to scale your features when you make this prediction!
print("Prediction for a house with 1650 square feet and 3 bedrooms:")
print(theta %*% c(1, (1650 - mean(dta$area))/sd(dta$area), (3 - mean(dta$bedrooms))/sd(dta$bedrooms)))



[1] "Theta:"
         [,1]     [,2]      [,3]
[1,] 340412.7 110631.1 -6649.474
[1] "Prediction for a house with 1650 square feet and 3 bedrooms:"
         [,1]
[1,] 293081.5
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Normal Equations&lt;/h2&gt;

&lt;p&gt;Given the cost function:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
&lt;/script&gt;


&lt;p&gt;Recall that this function returns how big the error of our model is versus the
data, so our goal is to minimize it. To find its minimum there
is also a more direct approach than gradient descent: we can
calculate its derivative, set it to 0 and solve for theta:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\frac{\partial}{\partial \theta_j} J(\theta) = 0
&lt;/script&gt;


&lt;p&gt;That's for one \(\theta_j\); we of course need to account for every j.&lt;/p&gt;

&lt;p&gt;If we write it down in matrix notation, calculate the derivatives and set them
to 0, then the value of theta is obtained with:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\theta =  (X^T X)^{-1} (X^T y)
&lt;/script&gt;


&lt;p&gt;That can be easily implemented like so:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;x = matrix(c(rep(1,m), mydata$area, mydata$bedrooms), ncol=3)
y = matrix(mydata$price, ncol=1)
theta.normal = solve(t(x) %*% x) %*% (t(x) %*% y)

# 1. In your program, use the formula above to calculate theta. Remember that
# while you don't need to scale your features, you still need to add 
# an intercept term.
print("Theta:")
print(theta.normal)

# 2. Once you have found theta from this method, use it to make a price prediction 
# for a 1650-square-foot house with 3 bedrooms. Did you get the same price 
# that you found through gradient descent?
print("Price prediction for a 1650-square-foot house with 3 bedrooms")
t(theta.normal) %*%  c(1, 1650, 3)



[1] "Theta:"
           [,1]
[1,] 89597.9095
[2,]   139.2107
[3,] -8738.0191
[1] "Price prediction for a 1650-square-foot house with 3 bedrooms"
         [,1]
[1,] 293081.5
&lt;/code&gt;&lt;/pre&gt;

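&lt;p&gt;Normal equations are more direct, but solving them gets expensive as the number of
features grows (the cost of inverting \(X^T X\) grows roughly with the cube of the number of features),
while gradient descent needs a learning rate and many iterations. So depending on the situation you might need to choose one or the other.&lt;/p&gt;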
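
&lt;p&gt;The normal equations can be cross-checked against R's built-in &lt;code&gt;lm&lt;/code&gt;, which solves the same least squares problem. A minimal sketch on simulated data (the data, the seed and the coefficients used to generate it are assumptions for illustration, not the exercise data):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# hypothetical housing-like data, for illustration only
set.seed(7)
n = 30
area = runif(n, 800, 3000)
bedrooms = sample(1:5, n, replace=TRUE)
price = 50000 + 120 * area + 5000 * bedrooms + rnorm(n, sd=10000)

X = cbind(1, area, bedrooms)
y = matrix(price, ncol=1)

# normal equations; solve(A, b) avoids forming the inverse explicitly
theta.ne = solve(t(X) %*% X, t(X) %*% y)

# the same least squares fit with the built-in lm
theta.lm = coef(lm(price ~ area + bedrooms))

print(cbind(normal=as.vector(theta.ne), lm=as.vector(theta.lm)))
&lt;/code&gt;&lt;/pre&gt;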

&lt;h4&gt;References&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning"&gt;OpenClassroom Machine Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&amp;amp;doc=exercises/ex3/ex3.html"&gt;Exercise 3: Multivariate Linear Regression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Thanks to Andrew Ng and &lt;a href="http://openclassroom.stanford.edu/MainFolder/HomePage.php"&gt;OpenClassRoom&lt;/a&gt; for the great lessons.&lt;/li&gt;
&lt;/ul&gt;

</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/03/08/ml-ex3.html</feedburner:origLink></entry>
 
 <entry>
   <title>Machine Learning Ex2 - Linear Regression</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/wRRJ7vChOLk/ml-ex2-linear-regression.html" />
   <updated>2011-02-24T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2011/02/24/ml-ex2-linear-regression</id>
   <category term="machinelearning" label="machinelearning" />
   <category term="R" label="R" />
   
   <content type="html">&lt;script type="text/javascript" src="http://www.mathjax.org/mathjax/MathJax.js"&gt;
    MathJax.Hub.Config({
            jax: ["input/TeX", "output/HTML-CSS"],
        extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js",
                     "TeX/noUndefined.js"],
        tex2jax: {
            inlineMath: [ ["\\(","\\)"] ],
            displayMath: [ ['$$','$$'], ["\\[","\\]"], ["\\begin{displaymath}","\\end{displaymath}"] ],
            skipTags: ["script","noscript","style","textarea","pre","code"],
            ignoreClass: "tex2jax_ignore",
            processEscapes: false,
            processEnvironments: true,
            preview: "TeX"
        },
        showProcessingMessages: true,
        displayAlign: "left",
        displayIndent: "2em",
 
        "HTML-CSS": {
             scale: 100,
             availableFonts: ["STIX","TeX"],
             preferredFont: "TeX",
             webFont: "TeX",
             imageFont: "TeX",
             showMathMenu: true,
        },
        MMLorHTML: {
             prefer: {
                 MSIE:    "MML",
                 Firefox: "MML",
                 Opera:   "HTML",
                 other:   "HTML"
             }
        }
    });
&lt;/script&gt;


&lt;p&gt;Andrew Ng has posted introductory machine learning lessons on the
&lt;a href="http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning"&gt;OpenClassRoom&lt;/a&gt; site. I've watched the first set and will here solve
&lt;a href="http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&amp;amp;doc=exercises/ex2/ex2.html"&gt;Exercise 2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The exercise is to build a linear regression implementation; I'll use R.&lt;/p&gt;

&lt;p&gt;The point of linear regression is to come up with a mathematical
function (model) that represents the data as well as possible. That is done by
fitting a straight line to the observed data. This model will then allow us to
make predictions on new data.&lt;/p&gt;

&lt;p&gt;For example, the data we use here are boys' ages and their corresponding
heights, so when we get the mathematical model we will be able to guess a
boy's height from his age.&lt;/p&gt;

&lt;h2&gt;Data&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;google.spreadsheet &amp;lt;- function (key) {
  library(RCurl)
  # ssl validation off
  ssl.verifypeer &amp;lt;- FALSE

  tt &amp;lt;- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydDB4N3MxM0tENlk3UElnZ013cW1iM3c")

# include ggplot2
library(ggplot2)

ex2plot = ggplot(mydata, aes(x, y)) + geom_point() + 
       ylab('Height in meters') +
       xlab('Age in years')
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex2-data.png" alt="http://al3xandr3.github.com/img/ml-ex2-data.png" /&gt;&lt;/p&gt;

&lt;h2&gt;Theory&lt;/h2&gt;

&lt;p&gt;The model we will get at the end is a line that fits the data. Setting \(x_0 = 1\), it is defined like
so:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + ...
&lt;/script&gt;


&lt;p&gt;That can be summarized by (last is matrix notation):&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
h_\theta(x) = \sum_{i=0}^n \theta_i x_i = \theta^T x
&lt;/script&gt;


&lt;p&gt;Matrix representation is useful because it has good support in software tools.&lt;/p&gt;

&lt;p&gt;The goal is to get the line as close as possible to the observed data points, thus we
can define a cost function that returns the difference between the real data and
the model:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2
&lt;/script&gt;


&lt;p&gt;where i indexes each data example we have and m is their total.&lt;/p&gt;

&lt;p&gt;With J we now have a metric to check whether the hypothesis line is getting closer
to the data points or not.&lt;/p&gt;

&lt;p&gt;The next step is to find the smallest possible cost J, and in fact that's
exactly what the &lt;a href="http://mathworld.wolfram.com/MethodofSteepestDescent.html"&gt;gradient descent algorithm does&lt;/a&gt;: starting with an
initial guess, it iterates to smaller and smaller values of a given function by
stepping against the &lt;a href="http://www.wolframalpha.com/input/?i=Plot[{x^2,+2+x},+{x,+0,+2.2}]"&gt;direction of the derivative&lt;/a&gt;:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
x_i := x_{i-1} - \epsilon f'(x_{i-1})
&lt;/script&gt;


&lt;p&gt;Applying to our J:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)
&lt;/script&gt;


&lt;p&gt;And doing a bit of calculus on derivatives we get:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}
&lt;/script&gt;


&lt;p&gt;Where alpha defines the size of the steps taken while converging to \(\theta\).&lt;/p&gt;
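
&lt;p&gt;For reference, the calculus step is a single application of the chain rule. Since \(h_\theta(x) = \sum_k \theta_k x_k\), its partial derivative with respect to \(\theta_j\) is \(x_j\), so:&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{2m} \sum_{i=1}^m 2 (h_\theta(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}
&lt;/script&gt;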

&lt;p&gt;Now let's check if all this math really works.&lt;/p&gt;

&lt;h2&gt;Implementation - take 1&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;alpha = 0.07
m = length(mydata$x)
theta = c(0,0)
x = mydata$x
y = mydata$y 
delta = function(x,y,th,m) {
  sum = 0
  for (i in 1:m) {
    sum = sum + (((t(th) %*% c(1,x[i])) - y[i]) * c(1,x[i]))
  }
  return (sum)
}

# 1 iteration
theta - alpha * 1/m * delta(x,y,theta,m)

          1
[1] 0.07452802 0.38002167
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Implementation - take 2&lt;/h2&gt;

&lt;p&gt;After having a peek at the &lt;a href="http://openclassroom.stanford.edu/MainFolder/courses/MachineLearning/exercises/ex2materials/ex2.m"&gt;Matlab solution&lt;/a&gt;, I learned that it is possible to
replace the sum in the equation with a transposed matrix multiplication (like
done with the line equation):&lt;/p&gt;

&lt;script type="math/tex; mode=display"&gt;
\theta := \theta - \alpha \frac{1}{m} x^T (x\theta^T - y)
&lt;/script&gt;


&lt;p&gt;So we can get a full matrix implementation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;alpha = 0.07
m = length(mydata$x)
theta = matrix(c(0,0), nrow=1)
x = matrix(c(rep(1,m), mydata$x), ncol=2)
y = matrix(mydata$y, ncol=1)
delta = function(x,y,th) {
  delta = (t(x) %*% ((x %*% t(th)) - y))
  return(t(delta))
}

# 1 iteration
theta - alpha * 1/m * delta(x,y,theta)



           [,1]      [,2]
[1,] 0.07452802 0.3800217
&lt;/code&gt;&lt;/pre&gt;
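
&lt;p&gt;Another way to gain confidence in the vectorized gradient is to compare it with central differences of the cost J. A minimal sketch on made-up data (the data values, &lt;code&gt;eps&lt;/code&gt; and the name &lt;code&gt;numeric.grad&lt;/code&gt; are assumptions for illustration); the two rows it prints should agree to several decimal places:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# hypothetical data: 4 samples, intercept plus one feature
x = matrix(c(rep(1,4), 2, 3, 5, 7), ncol=2)
y = matrix(c(0.8, 0.9, 1.1, 1.2), ncol=1)
theta = matrix(c(0.1, 0.2), nrow=1)
m = nrow(x)

# cost J = 1/(2m) * sum of squared errors, theta as a row vector
cost = function(x,y,th,m) {
  prt = (x %*% t(th)) - y
  return(as.numeric(t(prt) %*% prt) / (2*m))
}

# the vectorized delta from take 2
delta = function(x,y,th) {
  return(t(t(x) %*% ((x %*% t(th)) - y)))
}

# central differences of J, one parameter at a time
eps = 1e-5
numeric.grad = matrix(0, nrow=1, ncol=2)
for (k in 1:2) {
  step = matrix(0, nrow=1, ncol=2)
  step[k] = eps
  numeric.grad[k] = (cost(x,y,theta+step,m) - cost(x,y,theta-step,m)) / (2*eps)
}

print(rbind(analytic = as.vector(1/m * delta(x,y,theta)), numeric = as.vector(numeric.grad)))
&lt;/code&gt;&lt;/pre&gt;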

&lt;h2&gt;The Model&lt;/h2&gt;

&lt;p&gt;First we run several iterations, until convergence:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;for (i in 1:1500) {
  theta = theta - alpha * 1/m * delta(x,y,theta)
}
theta



          [,1]       [,2]
[1,] 0.7501504 0.06388338
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And finally we see how well the line(model) fits the data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ex2plot + geom_abline(intercept=theta[1], slope=theta[2])
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/ml-ex2-fit.png" alt="http://al3xandr3.github.com/img/ml-ex2-fit.png" /&gt;&lt;/p&gt;

&lt;h4&gt;References&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.math.umaine.edu/~hiebeler/comp/matlabR.html"&gt;MATLAB / R Reference, by David Hiebeler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="ftp://ftp.ams.org/pub/tex/doc/amsmath/short-math-guide.pdf"&gt;Short Math Guide for LaTeX (.pdf)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://wims.unice.fr/wims/en_tool~linear~matmult.en.html"&gt;Matrix multiplier tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Thanks to Andrew Ng and &lt;a href="http://openclassroom.stanford.edu/MainFolder/HomePage.php"&gt;OpenClassRoom&lt;/a&gt; for the great lessons.&lt;/li&gt;
&lt;/ul&gt;

</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/02/24/ml-ex2-linear-regression.html</feedburner:origLink></entry>
 
 <entry>
   <title>30 is the new 20, right?</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/pjmVQRUF9RA/30-is-the-new-20.html" />
   <updated>2011-02-09T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2011/02/09/30-is-the-new-20</id>
   
   <content type="html">&lt;p&gt;I wish...&lt;/p&gt;

&lt;p&gt;Opened the computer (and phone) this morning and was welcomed with many great
wishes and kind words from family and friends.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/bday-saxeo.png" alt="http://al3xandr3.github.com/img/bday-saxeo.png" /&gt;&lt;/p&gt;

&lt;h2&gt;Phone&lt;/h2&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/bday-phone.jpg" alt="http://al3xandr3.github.com/img/bday-phone.jpg" /&gt;&lt;/p&gt;

&lt;h2&gt;Skype&lt;/h2&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/bday-skype.png" alt="http://al3xandr3.github.com/img/bday-skype.png" /&gt;&lt;/p&gt;

&lt;h2&gt;Facebook&lt;/h2&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/bday-fb.png" alt="http://al3xandr3.github.com/img/bday-fb.png" /&gt;&lt;/p&gt;

&lt;h2&gt;Email&lt;/h2&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/bday-superbock.png" alt="http://al3xandr3.github.com/img/bday-superbock.png" /&gt; (Portuguese Beer
Company, wait what???)&lt;/p&gt;

&lt;p&gt;And to finish in style, from my cousin Cristina, Popcorn:&lt;/p&gt;

&lt;p&gt;&lt;object width="560" height="315"&gt;&lt;param name="movie" value="http://www.youtube.com/v/AvDvTnTGjgQ?version=3&amp;amp;hl=en_US"&gt;&lt;/param&gt;&lt;param name="allowFullScreen" value="true"&gt;&lt;/param&gt;&lt;param name="allowscriptaccess" value="always"&gt;&lt;/param&gt;&lt;embed src="http://www.youtube.com/v/AvDvTnTGjgQ?version=3&amp;amp;hl=en_US" type="application/x-shockwave-flash" width="560" height="315" allowscriptaccess="always" allowfullscreen="true"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;

&lt;h2&gt;How cool is that? Thanks Everybody!&lt;/h2&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/pjmVQRUF9RA" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/02/09/30-is-the-new-20.html</feedburner:origLink></entry>
 
 <entry>
   <title>Weight Loss Predictor</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/zKoPJKNJ2n8/weight-loss-predictor.html" />
   <updated>2011-02-05T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2011/02/05/weight-loss-predictor</id>
   <category term="data" label="data" />
   <category term="R" label="R" />
   <category term="health" label="health" />
   <category term="montecarlo" label="montecarlo" />
   <category term="statistics" label="statistics" />
   <category term="visualization" label="visualization" />
   
   <content type="html">&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/diet.jpeg" alt="http://al3xandr3.github.com/img/diet.jpeg" /&gt; For Xmas 2010 I got a very cool
book called "The 4-Hour Body" (thanks Jose Santos), written by Tim Ferriss, who
wrote a previous favorite of mine about productivity, "The 4-Hour Workweek".&lt;/p&gt;

&lt;p&gt;It's an interesting book because it takes a scientific approach. It doesn't just
say do this, do that and you'll be healthy; it actually says: I (Tim Ferriss)
have tried this, with exactly these steps, over this period, this is how I
measured, these are the results I got, and judging by the most up-to-date
medical research this is the most likely explanation for these results… Notice
the similarity to the principles of A/B testing.&lt;/p&gt;

&lt;p&gt;This book couldn't have arrived at a better time, as I had just peaked at my heaviest
weight in a long time, blame it on [insert favorite reason]… so, long story
short, I am now in the 3rd week of the low-carb diet described in the book.&lt;/p&gt;

&lt;p&gt;But of course, like with all diets, I'm quickly growing impatient about when I'm
going to reach my goal (&lt;a href="http://www.wolframalpha.com/input/?i=body+mass+index&amp;amp;a=*C.body+mass+index-_*Formula.dflt-&amp;amp;a=*FS-_**BodyMassIndex.BMI-.*BodyMassIndex.H-.*BodyMassIndex.W--&amp;amp;f3=75+kg&amp;amp;x=11&amp;amp;y=4&amp;amp;f=BodyMassIndex.W_75+kg&amp;amp;f4=176+cm&amp;amp;f=BodyMassIndex.H_176+cm&amp;amp;a=*FVarOpt.1-_**-.***BodyMassIndex.S---.*--"&gt;of adequate BMI&lt;/a&gt;), so let's use R and Monte Carlo
simulations to generate predictions and understand better what to expect.&lt;/p&gt;

&lt;h2&gt;Data&lt;/h2&gt;

&lt;p&gt;I have been tracking my weight in Google Spreadsheets, so I can get the data
into R like so:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;google.spreadsheet &amp;lt;- function (key) {
  library(RCurl)

  # fetch the published sheet as CSV; ssl.verifypeer = FALSE turns SSL
  # certificate validation off (it has to be passed to curl via .opts)
  tt &amp;lt;- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE,
                             ssl.verifypeer = FALSE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydEstZnVOeHYycjVWWktjbHpvS1NRMUE")

# Convert the timestamp column to a proper Date
mydata$timestamp = as.Date(mydata$timestamp, format='%d/%m/%Y')
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The last 5 measurements:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;tail(mydata, 5)



     timestamp   kg
140 2011-03-04 77.3
141 2011-03-05 76.9
142 2011-03-06 77.2
143 2011-03-07 77.3
144 2011-03-08 76.9
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Past Years' Weight&lt;/h3&gt;

&lt;p&gt;Let's have a look at the weight fluctuations over the past 3.5 years (before
the diet).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# include ggplot2
library(ggplot2)

beforediet = subset(mydata, timestamp &amp;lt; "2011-01-18")
ggplot(beforediet, aes(x=timestamp, y=kg)) + geom_point() + geom_smooth() 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/w-loss-normal.png" alt="http://al3xandr3.github.com/img/w-loss-normal.png" /&gt;&lt;/p&gt;

&lt;p&gt;Weight has mostly been around 80.5kg on average, but in the 2nd half of 2010 we see a
big jump. Also note that the middle of the year (summer here) appears to be when jumps
in weight happen.&lt;/p&gt;

&lt;h2&gt;Predicting the Future&lt;/h2&gt;

&lt;p&gt;Now, using Monte Carlo methods, let's simulate the future based on the weight
changes that have happened since the start of the diet.&lt;/p&gt;

&lt;p&gt;Because the simulation uses weight changes per day, we need some trickery to
fill in the missing days and calculate the change for every single day. The
idea for filling in missing days is linear interpolation: if we only have day1=81 and day3=80, then
we calculate that day2=80.5, because over 2 days we see a difference of 1, so
the change per day is 0.5.&lt;/p&gt;
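
&lt;p&gt;As a quick sanity check of that arithmetic, here is a minimal sketch of the day1/day3 example. It assumes base R's approx() as a stand-in for the interpolation; the variable names are made up for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# two known measurements, on day 1 and day 3 (the example values above)
known_days = c(1, 3)
known_kg   = c(81, 80)

# approx() interpolates linearly; xout lists the days we want values for
filled = approx(x = known_days, y = known_kg, xout = 1:3)$y
print(filled)        # 81.0 80.5 80.0

# the change per day is then constant across the gap
print(diff(filled))  # -0.5 -0.5
&lt;/code&gt;&lt;/pre&gt;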

&lt;p&gt;We can check that this interpolation is a reasonable assumption by later
comparing the simulation on data with all days filled in against the same data
with some days removed from the middle, and confirming that the results are about the same.&lt;/p&gt;

&lt;p&gt;So let's get the weight (kg) change (delta) for every day:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;library(zoo) # for missing values interpolation

fill.all.days = function (mydata, timecolname, valuecolname) {
  dtrange = range(mydata[,timecolname])

  # create a data frame with every single day
  alldays = data.frame(tmp=seq(as.Date(dtrange[1]), as.Date(dtrange[2]), "days"))
  colnames(alldays) = c(timecolname) # rename tmp to proper timecolname

  # add the existing values
  alldays = merge(alldays, mydata, by=timecolname, all=TRUE)

  # fill in the missing ones
  alldays[,valuecolname] = na.approx(alldays[,valuecolname])
  return(alldays)
}

# from start of diet
dietdata = subset(mydata, timestamp &amp;gt;= "2011-01-17")
lastweight = tail(dietdata$kg, n=1)

# fill in missing days
dietalldays = fill.all.days(dietdata, "timestamp", "kg")

# get difference day by day into data frame
kgdelta = diff(dietalldays$kg)
dietalldays$delta = c(0, kgdelta)

# print only the last 5 values
tail(dietalldays, 5)



    timestamp   kg delta
47 2011-03-04 77.3   0.0
48 2011-03-05 76.9  -0.4
49 2011-03-06 77.2   0.3
50 2011-03-07 77.3   0.1
51 2011-03-08 76.9  -0.4
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;So what is going to be my weight in a week?&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;predict.weight.in.days = function(days, inicialweight, deltavector) {
  weight = inicialweight
  for (i in 1:days) {
    weight = weight + sample(deltavector, 1, replace=TRUE)
  }
  return(weight)
}

# simulate it 10k times
mcWeightWeek = replicate(10000, predict.weight.in.days(7, lastweight, kgdelta))

summary(mcWeightWeek)



   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   71.8    75.2    75.9    75.9    76.6    79.7
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Another good thing about Monte Carlo methods is that they give a distribution
of the prediction, so it's possible to get a feeling for how certain the average
is: either very certain, with a big central peak, or not that certain, when the
graph is flatter and all over the place:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;library(grid) # viewport() comes from the grid package

gghist = function(mydata, mycolname) {
  pl = ggplot(data = mydata)
  subvp = viewport(width=0.35, height=0.35, x=0.84, y=0.84)

  his = pl + 
        geom_histogram(aes_string(x=mycolname,y="..density.."),alpha=0.2) + 
        geom_density(aes_string(x=mycolname)) + 
        opts(title = names(mydata[mycolname]))

  qqp = pl + 
        geom_point(aes_string(sample=mycolname), stat="qq") + labs(x=NULL, y=NULL) + 
        opts(title = "QQ")

  print(his)
  print(qqp, vp = subvp)
}

gghist(data.frame(kg=mcWeightWeek), "kg")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/w-loss-week.png" alt="http://al3xandr3.github.com/img/w-loss-week.png" /&gt;&lt;/p&gt;

&lt;h3&gt;And when am I getting to 75kg?&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;days.to.weight = function(weight, inicialweight, deltavector) {
  target = inicialweight
  days = 0
  while (target &amp;gt; weight) {
    target = target + sample(deltavector, 1, replace=TRUE)
    days = days + 1
    if (days &amp;gt;= 1095) # if value too crazy (3 years) just interrupt the loop
      break
  }
  return(days)
}

# simulate it 10k times
mcDays75 = replicate(10000, days.to.weight(75, lastweight, kgdelta))

summary(mcDays75)



   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    8.00   12.00   15.36   19.25  102.00
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And the cumulative distribution:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# add dates to it, from today's date + #days
days75 = sort(tail(mydata$timestamp, 1) + mcDays75)

# get the ecdf values into a dataframe
days75.ecdf = summarize(data.frame(days=days75), days = unique(days), 
                        ecdf = ecdf(days)(unique(days)))

# date where its 80% sure i'll reach goal
prob80 = head(days75.ecdf[days75.ecdf$ecdf&amp;gt;0.80,],1)

# plot
ggplot(days75.ecdf, aes(days, ecdf)) + geom_step() +
       ylab("probability") + 
       geom_point(aes(x = prob80$days, y = prob80$ecdf)) +
       geom_text(aes(x = prob80$days, y = prob80$ecdf, 
                    label = paste("80% sure to be 75kg on",
                            format(prob80$days, "%a, %d %b %Y"))), 
                     hjust=-0.04)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/w-loss-75.png" alt="http://al3xandr3.github.com/img/w-loss-75.png" /&gt;&lt;/p&gt;
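
&lt;p&gt;For reading the 80% point off a cumulative distribution, here is a minimal sketch in base R. The simulated values are hypothetical stand-ins for mcDays75, and the start date is only the last measurement date shown earlier. Note that the plot code above takes the first date whose probability is strictly above 0.80, so it can land one step later than this.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;set.seed(1)  # only to make the sketch reproducible

# hypothetical "days until goal" results standing in for mcDays75
sim_days = sample(2:40, 1000, replace = TRUE)

# ecdf() returns a function: the share of simulated runs done by a given day
cdf = ecdf(sim_days)

# type = 1 inverts the empirical CDF: the smallest day whose share is at least 0.8
day80 = unname(quantile(sim_days, 0.8, type = 1))

print(cdf(day80))                      # at least 0.8 by construction
print(as.Date("2011-03-08") + day80)   # last measurement date + days
&lt;/code&gt;&lt;/pre&gt;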

&lt;p&gt;Also note that weight loss is faster at the beginning of a diet and tends to
slow down over time, so to keep the predictions valid we need to keep
recording the weight and re-run the predictions frequently.&lt;/p&gt;

&lt;p&gt;But as you can see, the slow-carb diet seems to work, even without exercise. Tim's
book is great, focusing on the smallest possible changes for the biggest
results (=efficiency).&lt;/p&gt;

&lt;h2&gt;&lt;em&gt;Update&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;Got to 75.0kg on 25 March!!!! That's 67 days (approx. 9.5 weeks) for a 9kg loss,
thus approx. 0.9kg per week, which is in line with the recommended rate of
weight loss (0.9kg per week). Thus I am now within normal &lt;a href="http://www.wolframalpha.com/input/?i=bmi+75kg+1.76m"&gt;BMI values&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Regarding the diet itself, I have to mention that a key food was lentils, which
replaced all pasta, potato and rice, and became the main food. Around
77kg, where my weight plateaued for a while, I relaxed the strict low-carb
diet and adopted some ideas from the &lt;a href="http://www.southbeachdiet.com/sbd/publicsite/index.aspx"&gt;South Beach Diet&lt;/a&gt;, which allows adding
other things in moderation and makes a distinction between good and bad carbs.
I also stopped the binging (over-eating) 1 day per week. This diet was done
without any gym or sports, it was all about the food. I will soon start to add
some sport into the equation and see.&lt;/p&gt;

&lt;p&gt;The predictor was surprisingly good, even with little data at the beginning of
the diet. With time there's a tendency for the weight loss to slow down, which is expected, so maybe
weighting the most recent measurements more heavily could
improve the accuracy of weight loss prediction with Monte Carlo methods.&lt;/p&gt;
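
&lt;p&gt;One possible way to try that idea, as a rough sketch only: pass recency weights to sample() through its prob argument. The deltas below are made-up numbers, and the exponential decay factor and the function name predict.weight.weighted are assumptions for illustration, not something I have validated on the real data.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;set.seed(42)  # only to make the sketch reproducible

# hypothetical daily deltas, oldest first (not the real measurements)
deltas = c(-0.6, -0.5, -0.4, -0.3, -0.1, 0.1, -0.2, 0.0)

# exponential recency weights: the newest delta gets weight 1 and each
# older one is multiplied by decay once more (decay is an assumed knob)
decay = 0.8
w = decay ^ rev(seq_along(deltas) - 1)

predict.weight.weighted = function(days, initialweight, deltavector, weights) {
  # draw all daily changes at once, favouring the recent deltas
  initialweight + sum(sample(deltavector, days, replace = TRUE, prob = weights))
}

# simulate one week ahead, 10k times, starting from 76.9kg
mc = replicate(10000, predict.weight.weighted(7, 76.9, deltas, w))
summary(mc)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A decay closer to 1 behaves like the unweighted predictor, while a smaller decay makes the simulation follow only the last few days.&lt;/p&gt;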

&lt;h3&gt;References&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hard drive occupation prediction with R: &lt;a href="http://lpenz.github.com/articles/df0pred-1/index.html"&gt;part 1&lt;/a&gt; and &lt;a href="http://lpenz.github.com/articles/df0pred-2/index.html"&gt;part 2&lt;/a&gt;, and thanks to Leandro Penz for the feedback.&lt;/li&gt;
&lt;li&gt;Big thanks to Jose and Tim for jumpstarting this experiment!&lt;/li&gt;
&lt;/ul&gt;

&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/zKoPJKNJ2n8" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/02/05/weight-loss-predictor.html</feedburner:origLink></entry>
 
 <entry>
   <title>On Tools. Featuring guitar pedals, cattle growing and math</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/2XFoWyNrx8g/on-tools.html" />
   <updated>2011-01-12T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2011/01/12/on-tools</id>
   <category term="productivity" label="productivity" />
   
   <content type="html">&lt;h2&gt;The Idea&lt;/h2&gt;

&lt;p&gt;Bear with me during all the technical and guitar lingo, and keep reading because
there is a point.&lt;/p&gt;

&lt;p&gt;After a few weeks of obsessively investigating guitar pedals, because of:
"what should I get this year for my birthday?", I decided to change some parts of my guitar setup and, long story short, I ended up with only 1 analog pedal in my setup, which, as everybody knows, is too few. So I came up with an idea: I could build my own guitar pedal that I can tweak and make custom and unique sounding. Let's go!&lt;/p&gt;

&lt;h2&gt;Do I need to Grow Cattle?&lt;/h2&gt;

&lt;p&gt;After some web browsing on DIY analog pedal building and schematics, I figured out that an
analog pedal is not really worth it, because without knowledge of analog electronics I will just end up building some schematic I find on the web without truly understanding it, thus limiting my power to improve it.&lt;/p&gt;

&lt;p&gt;So maybe a better idea is to build a software guitar pedal instead, where I can do all the programming and thus be in a better position to tweak it to create my own sounds. Right? Wrong!&lt;/p&gt;

&lt;p&gt;I had a look at the audio plugin programming world and it looked to me like: incompatible
platforms, buffers, streaming, real-time, a whole new set of programming environments to learn, etc etc... a big mess... I don't care about all that, I just want to play around with sound design ideas.&lt;/p&gt;

&lt;p&gt;It's like wanting to make my own pizza, but for the cheese I have to grow cattle, get the milk and make the cheese, when what I really want is just to play around with different cheeses and ingredients to create my own pizza.
It's a whole different set of skills and goals...&lt;/p&gt;

&lt;h2&gt;Finding Gold&lt;/h2&gt;

&lt;p&gt;Then at some point I landed on a page that described mathematical models of
"amplifier tubes" and realized: this is a model simulating real sound behavior, so it is focused on exactly the right thing. Also, a theoretical math model stands the test of time: it can be implemented in any computer simulation or even in a stand-alone digital chip. The math modeling work is truly platform independent and has no expiration date, this is real gold.&lt;/p&gt;

&lt;p&gt;And look at the implementation side of it. Implementing is actually monkey / mechanical work, i.e. there is nothing new about implementing a math formula in a programming language, and besides, most of the effort will unfortunately go into the audio plugin programming environment (the cattle growing part)... On top of that, even if I implement it on the most modern platform available today, it will most likely only run for a couple of years, because when a new audio plugin platform arrives it becomes incompatible and I will have to do it again... Also, I don't want to go through the whole cattle growing process just to find at the end that the model does not actually work as I thought...
There has to be a better way...&lt;/p&gt;

&lt;h2&gt;Making it Real&lt;/h2&gt;

&lt;p&gt;Then again, the math by itself will not make my new guitar pedal, so
we still need something that will run that theoretical model on the computer...
Finally I stumbled on guitar effects simulations made in Matlab, looked at the source code... and... Brilliant! Short and straight to the point, high-level algorithmic implementations of audio effects.
This is great, it will let me try out the math models I found before much more easily &amp;amp; quickly than with the full stack of audio plugin development.&lt;/p&gt;

&lt;p&gt;The caveat is that Matlab is not really suited for building a production-standard final product; it is targeted instead at creating a working prototype. That is perfectly fine, because if the models really work out well then, and only then, I could create a company, get investors / money and hire someone who knows more than me, and is more interested in the cattle growing part, to create that production-standard final product and maintain it.&lt;/p&gt;

&lt;p&gt;In the meantime, with math and a quick prototyping tool I can focus just on modeling and trying out guitar pedal ideas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The point:&lt;/strong&gt; Tools matter, because they can distract you from focusing on the right things. Some Cattle Growing is required, but try to keep it to a minimum and focus instead on the gold.&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/2XFoWyNrx8g" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2011/01/12/on-tools.html</feedburner:origLink></entry>
 
 <entry>
   <title>Homemade Auto-Updater</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/JLsM11uLNm0/homemade-autoupdater.html" />
   <updated>2010-12-01T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2010/12/01/homemade-autoupdater</id>
   <category term="automation" label="automation" />
   <category term="emacs" label="emacs" />
   
   <content type="html">&lt;p&gt;Here's a script that I use frequently to update an application to the latest
version. It automates the process of downloading and installing the app.&lt;/p&gt;

&lt;h3&gt;Aquamacs&lt;/h3&gt;

&lt;p&gt;Every day there's a new release of Aquamacs, called the nightly build,
made from the latest development code. It is always available from the same
url.&lt;/p&gt;

&lt;p&gt;So the script does the following: creates a temporary folder, downloads
the latest version, unpacks it, asks to close Aquamacs if it is running (so it
can be replaced), creates a backup version AquamacsOld.app (in case the new
one has problems I can use the previous one), copies the files to the
applications folder and cleans up the temporary downloaded files.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/bin/bash
mkdir -p /tmp/emacsdownload &amp;amp;&amp;amp; cd /tmp/emacsdownload
curl http://braeburn.aquamacs.org/~dr/Aquamacs/24/Aquamacs-nightly.tar.bz2 --silent -o /tmp/emacsdownload/aquamacs.tar.bz2
tar xjf /tmp/emacsdownload/aquamacs.tar.bz2
RUN=`ps -ef | grep Aquamacs | grep -v grep`
if [ -n "$RUN" ]; then
  x=`/usr/bin/osascript &amp;lt;&amp;lt;EOT
    tell application "Finder"
      activate
      set myReply to button returned of (display dialog "Please close Aquamacs to update" default button 2 buttons {"No", "Ok"})
    end tell
EOT`
  if [[ $x = "No" ]]; then exit; fi
fi
rm -rf /Applications/AquamacsOld.app
mv -f /Applications/Aquamacs.app /Applications/AquamacsOld.app
cp -R /tmp/emacsdownload/Aquamacs.app /Applications
rm -rf /tmp/emacsdownload
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Chromium&lt;/h3&gt;

&lt;p&gt;For Chromium the download url changes with each build, so we have to add extra
logic for this: first the script figures out the latest version and then uses that
information for the download url. The rest is similar to the Aquamacs script.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/bin/bash
CHROMEDIR="http://build.chromium.org/f/chromium/snapshots/Mac/"
mkdir -p /tmp/chromedownload &amp;amp;&amp;amp; cd /tmp/chromedownload
curl $CHROMEDIR/LATEST -o /tmp/chromedownload/LATEST --silent &amp;amp;&amp;amp; LATEST=`cat /tmp/chromedownload/LATEST`
curl $CHROMEDIR/$LATEST/chrome-mac.zip --silent -o /tmp/chromedownload/chrome-mac.zip
unzip -qq /tmp/chromedownload/chrome-mac.zip
RUN=`ps -ef | grep Chromium | grep -v grep`
if [ -n "$RUN" ]; then
  x=`/usr/bin/osascript &amp;lt;&amp;lt;EOT
    tell application "Finder"
      activate
      set myReply to button returned of (display dialog "Please close Chromium to update" default button 2 buttons {"No", "Ok"})
    end tell
EOT`
  if [[ $x = "No" ]]; then exit; fi
fi
rm -rf /Applications/Chromium.app
cp -R /tmp/chromedownload/chrome-mac/Chromium.app /Applications
rm -rf /tmp/chromedownload
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Automate it&lt;/h3&gt;

&lt;p&gt;To automate it, we can add it to a cron job like so:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;01      11      *       *       *       update-chromium
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That runs every day at 11:01.&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/JLsM11uLNm0" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2010/12/01/homemade-autoupdater.html</feedburner:origLink></entry>
 
 <entry>
   <title>Monitoring Productivity Experiment</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/-NCjIEovZ4A/monitoring-productivity-experiment.html" />
   <updated>2010-10-20T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2010/10/20/monitoring-productivity-experiment</id>
   <category term="data" label="data" />
   <category term="productivity" label="productivity" />
   <category term="statistics" label="statistics" />
   <category term="visualization" label="visualization" />
   <category term="emacs" label="emacs" />
   <category term="R" label="R" />
   
   <content type="html">&lt;p&gt;&lt;a href="http://thechive.com/2010/08/10/girl-quits-her-job-on-dry-erase-board-emails-entire-office-33-photos/"&gt;&lt;img src="http://al3xandr3.github.com/img/prod-intro.jpeg" alt="prod-intro.jpeg" /&gt;&lt;/a&gt; For over a year now, I've been tracking how much
time I spend at the computer and how much of it is actually used for
creative/productive activities.&lt;/p&gt;

&lt;p&gt;By productive activity I mean that time spent in a text editor (emacs),
terminal, excel or a database client is likely to be more creative/productive
than time spent on youtube, twitter, reading rss feeds, IM chatting or
replying to email. On average.&lt;/p&gt;

&lt;p&gt;This is overly simplified, but the tools I'm using work especially well for
this, including automatic data collection without the need for manual data
entry.&lt;/p&gt;

&lt;h2&gt;Tracking Time&lt;/h2&gt;

&lt;p&gt;I'm using the &lt;a href="https://www.rescuetime.com/"&gt;RescueTime&lt;/a&gt; application, which tracks when there's user
activity in a particular computer application. I then copy the data into a
Google Docs spreadsheet, keeping only a summary per week. RescueTime, like any
other app, can have its hiccups, and I've noticed a couple of rare occasions
when it was not tracking well, but overall it works well.&lt;/p&gt;

&lt;h2&gt;Data&lt;/h2&gt;

&lt;p&gt;Per week i collect the hours of &lt;em&gt;total&lt;/em&gt;, &lt;em&gt;productive&lt;/em&gt; and &lt;em&gt;distracting&lt;/em&gt; time.&lt;/p&gt;

&lt;p&gt;Besides productive and distracting, there's also &lt;strong&gt;neutral&lt;/strong&gt; time, which is
something in between: for example, moving files around (in Finder,
the OS X equivalent of Windows Explorer), a Google search or even a data gap
that I am not able to classify all go into the neutral time bucket.&lt;/p&gt;

&lt;p&gt;Thus, total = productive + distracting + neutral&lt;/p&gt;

&lt;p&gt;I'll look here at a full year (52 weeks' worth of data).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;google.spreadsheet &amp;lt;- function (key) {
  library(RCurl)

  # fetch the published sheet as CSV; ssl.verifypeer = FALSE turns SSL
  # certificate validation off (it has to be passed to curl via .opts)
  tt &amp;lt;- getForm("https://spreadsheets.google.com/spreadsheet/pub", 
                hl ="en_GB",
                key = key, 
                single = "true", gid ="0", 
                output = "csv", 
                .opts = list(followlocation = TRUE, verbose = TRUE,
                             ssl.verifypeer = FALSE)) 

  read.csv(textConnection(tt), header = TRUE)
}

# load the data
mydata = google.spreadsheet("0AnypY27pPCJydGNCcDhIVVRyZ1ZMWnBTbjBQbmJ0WVE")

# Convert the date column to a proper Date
mydata$date = as.Date(mydata$date, format='%d/%m/%Y')

# include ggplot2, plus grid for viewport()
library(ggplot2)
library(grid)
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;How is the data distributed? (Looking for normality)&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;pl &amp;lt;- ggplot(data = mydata)
#subplot viewport
subvp &amp;lt;- viewport(width=0.4, height=0.4, x=0.22, y=0.80)


his = pl + 
      geom_histogram(aes(x=total,y=..density..),alpha=0.2,binwidth=2) + 
      geom_density(aes(x=total)) + 
      opts(title = "Total")
qqp = pl + 
      geom_point(aes(sample=total), stat="qq") + labs(x=NULL, y=NULL) + 
      opts(title = "QQ")
print(his)
print(qqp, vp = subvp)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/prod-hist-total.png" alt="http://al3xandr3.github.com/img/prod-hist-total.png" /&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;his = pl + 
      geom_histogram(aes(x=productive,y=..density..),alpha=0.2,binwidth=2) + 
      geom_density(aes(x=productive)) + opts(title="Productive")
qqp = pl + 
      geom_point(aes(sample=productive), stat="qq") + labs(x=NULL, y=NULL) + 
      opts(title = "QQ")

print(his)
print(qqp, vp = subvp)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/prod-hist-prod.png" alt="http://al3xandr3.github.com/img/prod-hist-prod.png" /&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;his = pl + 
      geom_histogram(aes(x=distracting,y=..density..),alpha=0.2,binwidth=2) + 
      geom_density(aes(x=distracting)) + opts(title = "Distracting")
qqp = pl + 
      geom_point(aes(sample=distracting), stat="qq") + labs(x=NULL, y=NULL) + 
      opts(title = "QQ")

print(his)
print(qqp, vp = subvp)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/prod-hist-dist.png" alt="http://al3xandr3.github.com/img/prod-hist-dist.png" /&gt;&lt;/p&gt;

&lt;p&gt;With the exception of a couple of loose ends, we see that the data follows the
normal distribution quite well, which allows for a few assumptions when
analyzing it. We could even cut off those loose ends (by excluding data)
for an even closer match to the normal distribution.&lt;/p&gt;

&lt;h2&gt;How many hours spent at the computer?&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;(sum(mydata$total) / 24) / 365

[1] 0.2391553
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Almost &lt;strong&gt;1/4 (~24%) of the whole year&lt;/strong&gt; in front of the computer. Wow!&lt;/p&gt;

&lt;h2&gt;How many hours per day?&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;ttest &amp;lt;- t.test(mydata$total, conf.level = 0.95)
print(ttest)

        One Sample t-test

data:  mydata$total 
t = 27.3738, df = 51, p-value &amp;lt; 2.2e-16
alternative hypothesis: true mean is not equal to 0 
95 percent confidence interval:
 37.33372 43.24320 
sample estimates:
mean of x 
 40.28846
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The 95% confidence interval for the mean is [37.33372, 43.24320] hours per week, which means that
~40 is a good estimate of the average weekly time.&lt;/p&gt;

&lt;p&gt;So that's close to &lt;strong&gt;40 hours per week&lt;/strong&gt;, almost &lt;strong&gt;6 hours per day&lt;/strong&gt; at the
computer. And this is the average over the whole year, that is, it includes
weekends, vacations, holidays, etc...&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; during the (8h) work day we are not active at the computer 100% of the time.
From my own data, RescueTime says that for a full hour in front of the
computer without interruptions, it captures on average 45min of activity. So
an 8h working day already yields only 6h of active computer time, and if you
then add in meetings, breaks, occasional discussions, etc... that value goes
lower.&lt;/p&gt;
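
&lt;p&gt;The per-day figures follow from simple arithmetic on the t-test output. A small sketch, with the numbers copied by hand from the output above and variable names chosen just for illustration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;weekly_mean = 40.28846             # "mean of x" from the t-test output
weekly_ci   = c(37.33372, 43.24320) # 95 percent confidence interval

print(weekly_mean / 7)  # hours per day, averaged over all 7 days of the week
print(weekly_ci / 7)    # the same interval expressed per day

# RescueTime captures about 45 active minutes per uninterrupted hour
print(8 * 45 / 60)      # 6
&lt;/code&gt;&lt;/pre&gt;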

&lt;h2&gt;Searching for Correlations&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;plotmatrix(mydata[2:4]) + geom_smooth(method="lm")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/prod-corr.png" alt="http://al3xandr3.github.com/img/prod-corr.png" /&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cor(mydata[2:4])

                total productive distracting
total       1.0000000  0.8719531   0.6884407
productive  0.8719531  1.0000000   0.4027419
distracting 0.6884407  0.4027419   1.0000000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Total and productive time seem to be strongly correlated. What does that mean?
There are 2 ways to look at it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Increasing productive time makes the total go up.&lt;/li&gt;
&lt;li&gt;Increasing total time makes the productive time go up.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;So, 1. is obvious and not interesting, but could 2. be true?&lt;/p&gt;

&lt;p&gt;Well, if we compare productive vs distracting, we see that productive (0.872)
has a stronger correlation with total time than distracting (0.688). And because
increasing distracting time will always increase the total (in exactly the same
way as productive time does in 1.), this suggests that increasing the total is
more likely to increase productive time than distracting time.&lt;/p&gt;

&lt;h2&gt;Trends&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;ggplot(mydata, aes(x=date)) +  labs(x=NULL, y=NULL) + 
  opts(legend.position="bottom") +
  geom_line(aes(y = total, colour="total")) +
  geom_smooth(aes(y = total, colour = "total")) + 
  geom_line(aes(y = productive, colour="productive")) +
  geom_smooth(aes(y = productive, colour = "productive")) +
  geom_line(aes(y = distracting, colour="distracting")) +
  geom_smooth(aes(y = distracting, colour = "distracting"))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/prod-trend.png" alt="http://al3xandr3.github.com/img/prod-trend.png" /&gt;&lt;/p&gt;

&lt;p&gt;The big drop towards the end is a 2 week vacation, where I barely used the
computer.&lt;/p&gt;

&lt;p&gt;In the first half of the plot there is a drop in productive time, accompanied by
an increase in distracting time.&lt;/p&gt;

&lt;p&gt;It also shows that close to the end (last couple of months) there's a tendency
towards an increase in all categories.&lt;/p&gt;

&lt;h2&gt;The Gear&lt;/h2&gt;

&lt;p&gt;This post was also made to try out the &lt;a href="http://orgmode.org/worg/org-contrib/babel/"&gt;OrgMode Babel&lt;/a&gt; mode that I
discovered recently, which allows for literate programming (mixing live/executable
code and text in the same document).&lt;/p&gt;

&lt;p&gt;This doc was written in (Aqua)Emacs using Orgmode, with R as the statistics
toolbox, loaded with the nice ggplot2 graphics package. This makes for a very
smooth workflow for creating this type of document and it works very well :)&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/-NCjIEovZ4A" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2010/10/20/monitoring-productivity-experiment.html</feedburner:origLink></entry>
 
 <entry>
   <title>How to download videolectures.net videos with VLC</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/ocNgy4CRafc/download-videolectures-videos-vlc.html" />
   <updated>2010-08-08T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2010/08/08/download-videolectures-videos-vlc</id>
   <category term="automation" label="automation" />
   
   <content type="html">&lt;p&gt;&lt;a href="http://videolectures.net/"&gt;videolectures.net&lt;/a&gt; has very good content, but no good way to download the
videos (at least as of now).&lt;/p&gt;

&lt;p&gt;And oftentimes you want to watch them offline, so here's a way to download them
if you have the VLC video player installed.&lt;/p&gt;

&lt;p&gt;This of course works for all videos that are streamed from an mms:// address
and not only the videolectures ones.&lt;/p&gt;

&lt;p&gt;Btw, I found the same idea, but using mplayer instead, from &lt;a href="http://measuringmeasures.blogspot.com/2009/12/downloading-from-videolecturesnet.html"&gt;bradfordcross here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;How to&lt;/h2&gt;

&lt;p&gt;Find the mms:// address for the video (in the web page source) and run:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ /Applications/VLC.app/Contents/MacOS/VLC -I rc mms://velblod2.ijs.si:80/2009/pascal2/mlss09uk_cambridge/mackay_it/mlss09uk_mackay_it_01.wmv --sout ~/Desktop/information-theory.avi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you are not on a Mac, then you need to update the paths.&lt;/p&gt;
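
&lt;p&gt;A minimal Python sketch of what "updating the paths" amounts to: the same command, with the VLC binary as a parameter. Only the macOS path comes from the command above; the Linux and Windows entries are assumptions about typical installs and may need adjusting.&lt;/p&gt;

```python
import sys

# Default VLC locations per platform. The macOS path comes from the post;
# the other two entries are assumptions and may need adjusting.
DEFAULT_VLC = {
    "darwin": "/Applications/VLC.app/Contents/MacOS/VLC",
    "linux": "vlc",
    "win32": r"C:\Program Files\VideoLAN\VLC\vlc.exe",
}


def vlc_command(url, output_path, vlc_bin=None):
    """Return the argument list for dumping a stream to a file with VLC."""
    if vlc_bin is None:
        vlc_bin = DEFAULT_VLC.get(sys.platform, "vlc")
    # Same flags as the command above, only the binary path varies.
    return [vlc_bin, "-I", "rc", url, "--sout", output_path]
```

&lt;p&gt;The returned list can be handed to subprocess.run to actually start the download.&lt;/p&gt;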

&lt;h2&gt;Make a bash script from it&lt;/h2&gt;

&lt;p&gt;It's annoying to write all of the above every time you want to download a
video, so it's worth making a bash script from it.&lt;/p&gt;

&lt;p&gt;Create a new file, with content:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#!/bin/bash
# getvideo.sh: usage: getvideo MMS_URL OUTPUT_NAME
/Applications/VLC.app/Contents/MacOS/VLC -I rc "$1" --sout ~/Desktop/"$2".avi vlc://quit
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that I added &lt;em&gt;vlc://quit&lt;/em&gt; at the end; this makes VLC, and
with it the script, exit when the download is finished.&lt;/p&gt;

&lt;p&gt;Make it executable and reachable from everywhere:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ chmod 755 getvideo.sh
# a symlink target is resolved relative to the link, so use an absolute path
$ sudo ln -s "$PWD/getvideo.sh" /usr/bin/getvideo
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And use like so:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ getvideo mms://velblod2.ijs.si:80/2009/pascal2/mlss09uk_cambridge/mackay_it/mlss09uk_mackay_it_01.wmv information-theory
&lt;/code&gt;&lt;/pre&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/ocNgy4CRafc" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2010/08/08/download-videolectures-videos-vlc.html</feedburner:origLink></entry>
 
 <entry>
   <title>Clojure and Selenium cov3 - part ii</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/2E5KOi-Mv80/clojure-selenium2-crawler-cov3.html" />
   <updated>2010-05-22T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2010/05/22/clojure-selenium2-crawler-cov3</id>
   <category term="automation" label="automation" />
   <category term="clojure" label="clojure" />
   <category term="webcrawler" label="webcrawler" />
   
   <content type="html">&lt;p&gt;I have been playing around with Selenium and Clojure for a while, and now that
Selenium 2 is in beta I ended up making a little web crawler library called
cov3.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/crawler.png" alt="http://al3xandr3.github.com/img/crawler.png" /&gt;&lt;/p&gt;

&lt;p&gt;It has 3 flavors of crawling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the usual crawler: give it a URL and it keeps going until it has visited all of the linked pages (that point to the same domain).&lt;/li&gt;
&lt;li&gt;a sitemap crawler: give it a sitemap.xml and it visits the URLs it finds in the sitemap.&lt;/li&gt;
&lt;li&gt;a step crawler: give it a CSV file with the list of URLs (steps) to visit.&lt;/li&gt;
&lt;/ul&gt;
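
&lt;p&gt;The first flavor can be sketched as a breadth-first walk that only follows links on the starting host. This is a Python illustration of the idea, not cov3's code: get_links is a hypothetical callback, and the three-page site is made up.&lt;/p&gt;

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links):
    """Visit every page reachable from start_url on the same host.

    get_links is a caller-supplied function (hypothetical here) that
    returns the list of URLs linked from a page.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            # Skip external hosts and pages that are already queued.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A made-up three-page site with one external link.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://other.org/"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
}

print(crawl("http://example.com/", lambda url: SITE.get(url, [])))
```

&lt;p&gt;In a real crawler, get_links would load the page in the browser and read its anchors; the seen set is what keeps the walk from looping on pages that link back to each other.&lt;/p&gt;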


&lt;p&gt;On each page it visits, the crawler executes a bit of JavaScript code that we
can define as a validation. These validations are useful for testing your site in
an automated way. Say, for example, you want to check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if all pages contain meta tags&lt;/li&gt;
&lt;li&gt;if all pages contain a title&lt;/li&gt;
&lt;li&gt;if you have web analytics tracking in all pages&lt;/li&gt;
&lt;li&gt;find out what links are broken&lt;/li&gt;
&lt;li&gt;test your own javascript in an automated way in all pages.&lt;/li&gt;
&lt;li&gt;etc..&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;For the crawling you can use Firefox, Internet Explorer (when
on Windows), Chrome or HtmlUnit (a GUI-less browser).&lt;/p&gt;

&lt;h2&gt;Usage&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;(require '[cov3 :as cov3])

;; then (:ff is short for firefox, use :hu for HtmlUnit, 
;; :ch for Chrome, and :ie for Internet Explorer)
(cov3/crawl :ff "http://al3xandr3.github.com/" '("document.title"))

;; or (10 is the sample size to pick from sitemap.xml)
(cov3/sitemap-crawl :ff "http://al3xandr3.github.com/sitemap.xml" "" "" 10 '("document.title"))

;; or (assuming you have a csv file with the steps to take, see more on documentation)
;; for example the line: http://al3xandr3.github.com/,"document.title",,
(cov3/step-crawl :ff "data/steps.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It is available from GitHub: &lt;a href="http://github.com/al3xandr3/cov3"&gt;http://github.com/al3xandr3/cov3&lt;/a&gt;&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/2E5KOi-Mv80" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2010/05/22/clojure-selenium2-crawler-cov3.html</feedburner:origLink></entry>
 
 <entry>
   <title>jQuery Twitter 'mini' plugin</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/fnUcl0RcFLA/jquery-twitter-plugin.html" />
   <updated>2010-04-10T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2010/04/10/jquery-twitter-plugin</id>
   <category term="javascript" label="javascript" />
   
   <content type="html">&lt;p&gt;Here's a little jQuery plugin for displaying a twitter feed into a web page.&lt;/p&gt;

&lt;p&gt;The goal was to put my latest 'tweets' on my blog, and also learn jQuery.&lt;/p&gt;

&lt;p&gt;I ended up making a 'mini' jQuery plugin that can easily be added to any web
page.&lt;/p&gt;

&lt;h2&gt;Demo:&lt;/h2&gt;

&lt;p&gt;For the code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$(function() {
  $('#tw').click(function() {
    $('#tw').twitter({'user':'al3xandr3','count':1});
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;script type="text/javascript"&gt;
$(function() {
  $('#tw').click(function() {
    $('#tw').twitter({'user':'al3xandr3','count':1});
  });
});
&lt;/script&gt;


&lt;p id="tw"&gt;&lt;strong&gt;click me&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;How It Works&lt;/h2&gt;

&lt;p&gt;It makes an Ajax request to twitter that returns json data of the feed. That
data is then read and injected into the selected html element(s).&lt;/p&gt;
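
&lt;p&gt;The "read the JSON, build one entry per tweet" step can be sketched outside the browser. This is a Python illustration, not the plugin itself: the sample payload is invented and only carries the fields the plugin reads (id, text, created_at).&lt;/p&gt;

```python
import json


def timeline_items(payload, user):
    """Turn the JSON timeline into (text, status_url) pairs."""
    items = []
    for item in json.loads(payload):
        # Same link format the plugin builds for each tweet.
        url = "http://twitter.com/" + user + "/status/" + str(item["id"])
        items.append((item["text"], url))
    return items


# Invented sample payload; only the fields used by the plugin are included.
SAMPLE = json.dumps([
    {"id": 1, "text": "hello", "created_at": "Sat Apr 10 12:00:00 +0000 2010"},
    {"id": 2, "text": "world", "created_at": "Sat Apr 10 13:00:00 +0000 2010"},
])

for text, url in timeline_items(SAMPLE, "al3xandr3"):
    print(text, url)
```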

&lt;p&gt;See in:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$.ajax({
  url: "http://twitter.com/status/user_timeline/" + settings.user + 
       ".json?count="+ settings.count +"&amp;amp;callback=?",
  dataType: 'json',
  success: function (data) {
    $this.html(""); //clean previous html

    $.each(data, function (i, item) {

      //text
      $this.append("&amp;lt;p id=" + item.id + "&amp;gt;" + 
                   replaceURLWithHTMLLinks(item.text) + 
                   "&amp;lt;/p&amp;gt;");

      //date
      if (typeof prettyDate(item.created_at) !== "undefined") {       
        $("&amp;lt;div&amp;gt;&amp;lt;a style='font-size: 80%;' href='http://twitter.com/" +
          settings.user + "/status/" + item.id + "' target='_blank'&amp;gt;" +
          prettyDate(item.created_at) + "&amp;lt;/a&amp;gt;&amp;lt;/div&amp;gt;").appendTo("#" + item.id);
      }
    });}
});
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;jQuery is a very nicely designed lib, simple and powerful. Some say &lt;a href="http://importantshock.wordpress.com/2009/01/18/jquery-is-a-monad/"&gt;it's just
like a functional programming monad&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Full source code is available from GitHub: &lt;a href="http://github.com/al3xandr3/jquery-twitter-plugin"&gt;http://github.com/al3xandr3/jquery-twitter-plugin&lt;/a&gt;&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/fnUcl0RcFLA" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2010/04/10/jquery-twitter-plugin.html</feedburner:origLink></entry>
 
 <entry>
   <title>AB testing tools in the Future</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/15ty4d5e1bc/ab-testing-future.html" />
   <updated>2010-03-16T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2010/03/16/ab-testing-future</id>
   <category term="data" label="data" />
   <category term="abtesting" label="abtesting" />
   <category term="idea" label="idea" />
   
   <content type="html">&lt;p&gt;A view on AB testing tools of the future.&lt;/p&gt;

&lt;h2&gt;How it works:&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;You plug the AB testing tool into your application and say: optimize page A on the measurable goal X (for example, downloads).&lt;/li&gt;
&lt;li&gt;The tool, by itself: creates a new UI variation -&gt; tests it -&gt; analyses the results -&gt; makes it the default -&gt; creates a new UI variation -&gt; tests it -&gt; etc. This goes on ad infinitum. Much like natural evolution, it keeps experimenting/mutating until it finds the UI that works best for the defined goals.&lt;/li&gt;
&lt;/ol&gt;
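
&lt;p&gt;The loop in step 2 can be sketched as a simple hill climb. This is a toy Python illustration under loud assumptions: a "UI" is reduced to a single button width, and the goal function is a made-up stand-in for a metric that would really be measured on live traffic.&lt;/p&gt;

```python
import random


def optimize(start, mutate, goal, rounds, rng):
    """Repeat: create a variation, test it, keep whichever scores best."""
    best = start
    for _ in range(rounds):
        candidate = mutate(best, rng)
        # max() returns the first argument on ties, so the default only
        # changes when the variation measurably beats it.
        best = max(best, candidate, key=goal)
    return best


# Toy stand-ins (assumptions): a "UI" is just a button width in pixels and
# the goal peaks at 120. A real goal would be measured from live traffic.
def mutate_width(width, rng):
    return width + rng.choice([-10, 10])


def downloads(width):
    return -abs(width - 120)


rng = random.Random(42)
print(optimize(60, mutate_width, downloads, rounds=200, rng=rng))
```

&lt;p&gt;A purely greedy loop like this can get stuck on a local optimum, which is one reason to mix in the smarter variation techniques listed under Details.&lt;/p&gt;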


&lt;h2&gt;Details:&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;New UI variations&lt;/strong&gt; do not need to be (and shouldn't be) 100% random. They should use smarter techniques such as genetic (and other search/optimization) algorithms + tried-and-tested design heuristics + branding guidelines (avoid color A, use font B, etc.) + (sampled) user filtering + some amount of randomness + etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Knowledge Base&lt;/strong&gt;: Build a database of the test results that collects knowledge of what worked and what didn't (for a given context). Just as Pandora collects user input to build its recommendation system, this accumulated knowledge would serve as input for generating the new UI variations. &lt;em&gt;Note:&lt;/em&gt; The amount of data is key; the more test results there are, the more of the possible variations are covered, and thus the closer we get to the best optimizations. With a large body of tried-and-tested results we would quickly get to the perfect UI rules.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Page Flow&lt;/strong&gt;: The tool should optimize not only the page itself, but also navigation between pages, customizing content depending on the flow. For example, forward the user to a different page depending on the keyword used in a search engine when arriving at the website.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Personalized UI&lt;/strong&gt;: What works for user A might not work for user B. A 16-year-old likes different things than a 50-year-old. Even for a single user, tastes change over time: winter vs summer, weekdays vs weekend, working hours vs non-working hours, etc. So the perfect interface might need to keep changing over time (? Don't assume, experiment and see if it works...).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/15ty4d5e1bc" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2010/03/16/ab-testing-future.html</feedburner:origLink></entry>
 
 <entry>
   <title>Automating todo tasks reports with org-mode</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/CjGjtdGh838/automate-orgmode-todo-export.html" />
   <updated>2010-03-08T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2010/03/08/automate-orgmode-todo-export</id>
   <category term="automation" label="automation" />
   <category term="ruby" label="ruby" />
   <category term="emacs" label="emacs" />
   
   <content type="html">&lt;p&gt;Here's the geek automation of the week. It helps create reports from
my TODO task list when using the amazing Emacs org-mode (&lt;a href="http://orgmode.org/"&gt;see here what
org-mode is all about&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;(simplified) Work Flow&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;I get a request, add it into my todo list queue, marking it as a TODO item.&lt;/li&gt;
&lt;li&gt;Work, work, work, guided by the todo listed tasks, balancing priority and effort and (..add your own reason here..).&lt;/li&gt;
&lt;li&gt;When finished, mark an item DONE.&lt;/li&gt;
&lt;li&gt;Generate a report every week with the done tasks.&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;Generating the Report&lt;/h2&gt;

&lt;p&gt;(I use this setup on a Mac; with some adaptations it should also work on Linux and
Windows.)&lt;/p&gt;

&lt;p&gt;Every week I then generate a report of the DONE tasks by running:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# file: get-work-done.sh 
# run: sh get-work-done.sh

# Uses emacs to extract the DONE items from work.org, generating a work-done.csv
/Applications/Aquamacs.app/Contents/MacOS/Aquamacs -batch -l ~/.emacs -eval '(org-batch-agenda-csv "+TODO=\"DONE\"" org-agenda-files (quote ("/.../work.org")))' &amp;gt; work-done.csv

# Applies desired report formatting to the exported work-done.csv, generating work.csv
ruby format-report.rb work-done.csv

# Clean up the originally exported file
rm work-done.csv

# Opens the final file in the default .csv handler, typically Excel.
open work.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;See the comments "#" to understand what it does in each step.&lt;/p&gt;

&lt;p&gt;Then I use format-report.rb, which applies the final formatting to the
report: for example, adding my own header, adding/removing columns, dates, changing
names, calculating values, etc. See the example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# file: format-report.rb

flines = File.open(ARGV[0]).readlines

column_map = { 
  "from_name1"  =&amp;gt; "to_name1", 
  "from_name2"  =&amp;gt; "to_name2",  
}

File.open( "work.csv","w+") do |fl|  
  fl.puts "header1,header2,header3,header4"
  flines.each do |l|
    a = l.split(",")

    # Date, mapping-defined-in-column_map, original-column-2, original-column-3
    fl.puts Time.now.strftime("%m/%d/%Y") + "," + 
            (column_map[a[1]] || a[1]) + "," + a[2] + "," + a[3]
  end
end
&lt;/code&gt;&lt;/pre&gt;
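
&lt;p&gt;One caveat with splitting on commas is that a task title containing a comma gets cut in two. A CSV parser avoids that. Below is a minimal Python sketch of the same formatting step using the standard csv module; it is an alternative, not the script I actually run, and the sample row and column names are made up.&lt;/p&gt;

```python
import csv
import io
from datetime import date

# Hypothetical mapping, mirroring column_map in the Ruby script.
COLUMN_MAP = {"from_name1": "to_name1", "from_name2": "to_name2"}


def format_report(rows, today):
    """Build the report: a header line, then one dated line per input row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["header1", "header2", "header3", "header4"])
    for row in csv.reader(rows):
        # Date, mapped second column, then the next two columns unchanged.
        mapped = COLUMN_MAP.get(row[1], row[1])
        writer.writerow([today.strftime("%m/%d/%Y"), mapped, row[2], row[3]])
    return out.getvalue()


sample = ['x,from_name1,"fix a, b",DONE']
print(format_report(sample, date(2010, 3, 8)))
```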

&lt;p&gt;And voilà, I run this and an Excel (my default .csv app) sheet opens up with
the report of the week.&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/CjGjtdGh838" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2010/03/08/automate-orgmode-todo-export.html</feedburner:origLink></entry>
 
 <entry>
   <title>Probability simulation of basketball throws</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/DpVUlK6kX5I/clojure-basketball-probabilities.html" />
   <updated>2009-08-27T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2009/08/27/clojure-basketball-probabilities</id>
   <category term="montecarlo" label="montecarlo" />
   <category term="statistics" label="statistics" />
   <category term="clojure" label="clojure" />
   
   <content type="html">&lt;p&gt;A little probability simulation from the book &lt;em&gt;Resampling: The New
Statistics&lt;/em&gt;, using Clojure and Incanter.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example 7-3:&lt;/em&gt; What is the probability that a basketball player will score
three or more baskets in five shots from a spot 30 feet from the basket, if on
the average she succeeds with 25 percent of her shots from that spot?&lt;/p&gt;

&lt;p&gt;Three or More Successful Basketball Shots in Five Attempts, that is, a
Two-Outcome Sampling with Unequally-Likely Outcomes, with Replacement, Binomial
Experiment.&lt;/p&gt;

&lt;p&gt;Let's start with the Resampling book's code solution, and then implement it using
Clojure and Incanter.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;REPEAT 100000
  GENERATE 5 1,4 a
  COUNT a =1 b 
  SCORE b z
END
COUNT z &amp;gt;= 3 k
DIVIDE k 100000 kk
PRINT kk
&lt;/code&gt;&lt;/pre&gt;
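
&lt;p&gt;For readers who don't know the Resampling Stats language, here is a line-by-line transliteration into Python. It is only a sketch to make the program above readable; the function name and the seed are my own choices.&lt;/p&gt;

```python
import random


def estimate(trials, rng):
    """Estimate P(at least 3 hits in 5 shots) when each shot hits 1 time in 4."""
    successes = 0
    for _ in range(trials):
        # GENERATE 5 1,4 a: five random integers from 1 to 4.
        shots = [rng.randint(1, 4) for _ in range(5)]
        # COUNT a =1 b: a 1 stands for a hit.
        hits = shots.count(1)
        # SCORE and the final COUNT: tally the trials with 3 or more hits.
        if hits in (3, 4, 5):
            successes += 1
    # DIVIDE k by the number of trials.
    return successes / trials


print(estimate(100000, random.Random(1)))
```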

&lt;p&gt;Now onto the clojure solution.&lt;/p&gt;

&lt;h2&gt;Starting simple, 1 throw&lt;/h2&gt;

&lt;p&gt;Before going for the full thing, I first tried out a simulation of just 1 throw
and visualized it.&lt;/p&gt;

&lt;p&gt;It simulates a basketball throw 100 thousand times with the probabilities
described above.&lt;/p&gt;

&lt;p&gt;The category-count function is a helper to count the :miss and :hit values; it's
handy but not strictly needed.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(use '(incanter core stats charts io))

(defn category-count
  "counts the categories in a given sequence, ex:
   $ (category-count '(:hit :hit :miss)) =&amp;gt; {:miss 1, :hit 2}"
  [aseq]
  (into {} (let [dis (distinct aseq)]
    (for [item dis]
      {item (count (filter #(= item %) aseq))}))))

(def one-throw
  (for [_ (range 100000)]  
    ; does not matter if replacement is true or false
    (sample [:hit :miss :miss :miss] :size 1 :replacement false)))

(def throws-count (category-count one-throw))
(view (bar-chart (keys throws-count) (vals throws-count)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/basketball-1throw.png" alt="http://al3xandr3.github.com/img/basketball-1throw.png" /&gt;&lt;/p&gt;

&lt;p&gt;The results make sense: there's a 1/4 (25%) probability of making the basket, so
the simulation seems to be working correctly.&lt;/p&gt;

&lt;h2&gt;With 5 Throws&lt;/h2&gt;

&lt;p&gt;Same logic: it's the sample function from Incanter doing all the magic of picking
randomly whether each throw is a :miss or a :hit. Then we test if the number of hits is &gt;=
3, mark that run as :ok, and finally compute the share of :ok runs in the
total simulation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defn five-throws
  []
  (category-count 
    (for [_ (range 100000)]  
      (let [smp (sample [:hit :miss :miss :miss] :size 5 :replacement true)
            catg (category-count smp)
            hits (:hit catg)]
        (if (and (not (nil? hits)) (&amp;gt;= hits 3))
          :ok
          :nok)))))

(double (/ (:ok (five-throws)) 100000))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Result is 0.10283, so there's only a ~10% chance of that basketball player
making 3 or more baskets in a 5 throws sequence.&lt;/p&gt;
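
&lt;p&gt;As a cross-check on the simulation, the exact value follows from the binomial formula: add up the probabilities of exactly 3, 4 and 5 hits in 5 shots with p = 0.25. That gives 53/512, about 0.1035, which agrees with the simulated 0.10283 to within sampling noise. A short Python sketch of the calculation:&lt;/p&gt;

```python
from math import comb


def at_least(k_min, n, p):
    """Exact binomial probability of k_min or more successes in n tries."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


print(at_least(3, 5, 0.25))
```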
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/DpVUlK6kX5I" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2009/08/27/clojure-basketball-probabilities.html</feedburner:origLink></entry>
 
 <entry>
   <title>Clojure test-is results in twitter</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/XYBHtrxxWbg/clojure-test-is-twitter.html" />
   <updated>2009-05-15T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2009/05/15/clojure-test-is-twitter</id>
   <category term="automation" label="automation" />
   <category term="clojure" label="clojure" />
   
   <content type="html">&lt;p&gt;Here's how you can have your test-is results posted on Twitter
automatically as they run.&lt;/p&gt;

&lt;p&gt;This is useful, for example, if you have regression tests that run automatically
on a remote machine (say, once a week) and you want a good way to
see the results.&lt;/p&gt;

&lt;p&gt;Start by getting &lt;em&gt;jtwitter&lt;/em&gt;, put it in your classpath and create:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(import 'winterwell.jtwitter.Twitter)

(defn twitter-update [the_message]
  (let [twitter (new Twitter "username" "password")]
    (.updateStatus twitter the_message)))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Update "username" and "password" with your twitter login.&lt;/p&gt;

&lt;p&gt;Then hook it up to the &lt;em&gt;test-is&lt;/em&gt; library (the test library in Clojure): just
override the report summary method, which by default prints out the summary of
the executed tests.&lt;/p&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defmethod report :summary [m]
  (with-test-out
    (println "\nRan" (:test m) "tests containing"
    (+ (:pass m) (:fail m) (:error m)) "assertions.")
    (println (:fail m) "failures," (:error m) "errors.")))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(defmethod report :summary [m]
  (with-test-out
    (twitter-update (str "[coverager] " (:fail m) " failures, " (:error m) " errors."))
    (println "\nRan" (:test m) "tests containing"
    (+ (:pass m) (:fail m) (:error m)) "assertions.")
    (println (:fail m) "failures," (:error m) "errors.")))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that I just added a line calling the twitter-update function.&lt;/p&gt;
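
&lt;p&gt;The same hook pattern, sketched in Python for illustration: the summary reporter takes the notifier as a plain function, so posting to Twitter, appending to a list or sending a mail all plug in the same way. The names here are hypothetical and nothing is posted anywhere.&lt;/p&gt;

```python
def make_reporter(notify):
    """Return a summary reporter that also sends the result to notify."""
    def report_summary(counts):
        message = "[coverager] {fail} failures, {error} errors.".format(**counts)
        notify(message)  # e.g. post to Twitter; any callable works here
        return message
    return report_summary


sent = []
report = make_reporter(sent.append)
report({"test": 12, "pass": 30, "fail": 1, "error": 0})
print(sent)
```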

&lt;p&gt;And that's it: now every time you run your tests, you will have the failures
and errors tweeted:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/clj-twitter-alert.png" alt="http://al3xandr3.github.com/img/clj-twitter-alert.png" /&gt;&lt;/p&gt;

&lt;p&gt;I've created a second account on Twitter where I post these automated messages.
My Clojure regression tests run by themselves, and I just
check Twitter to see if all is good.&lt;/p&gt;

&lt;p&gt;This little twitter-update method is very easy to re-use for other alerts and
automations.&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/XYBHtrxxWbg" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2009/05/15/clojure-test-is-twitter.html</feedburner:origLink></entry>
 
 <entry>
   <title>Clojure and Selenium</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/Ffi8XMnMPoM/clojure-selenium-crawler.html" />
   <updated>2009-04-24T00:00:00-07:00</updated>
   <id>http://al3xandr3.github.com/2009/04/24/clojure-selenium-crawler</id>
   <category term="automation" label="automation" />
   <category term="clojure" label="clojure" />
   <category term="webcrawler" label="webcrawler" />
   
   <content type="html">&lt;p&gt;I needed a kind of crawler to go around a list of pages, invoke some javascript(tests) and collect the output.&lt;/p&gt;

&lt;p&gt;Curl or a regular HTTP lib doesn't do the trick because I need to run JavaScript on each requested page. For that I can use Selenium, a great framework for web testing that drives a real browser, and thus can run JavaScript.&lt;/p&gt;

&lt;p&gt;Selenium can be scripted from Java which matches very well with my wish to learn Clojure :)&lt;/p&gt;

&lt;h2&gt;Solution&lt;/h2&gt;

&lt;p&gt;What I implemented is not really a crawler, in the sense that it does not go around automatically following all the links it finds; it actually gets the list of links to check from the site's sitemap.xml. But it is not that hard to use this as a base for a crawler.&lt;/p&gt;

&lt;p&gt;As some sitemap.xml files are huge, I also added a little pick-a-sample function that randomly selects only a subset of the sitemap.&lt;/p&gt;
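
&lt;p&gt;The pick-a-sample in the code below keeps each URL with a fixed probability, so the size of the sample varies from run to run. A different technique, sketched here in Python, draws a fixed-size sample without replacement instead. This is a swapped-in alternative for comparison, not what coverager does.&lt;/p&gt;

```python
import random


def pick_a_sample(percentage, items, rng=random):
    """Pick a fixed-size random subset: percentage of items, at least one."""
    items = list(items)
    if not items:
        return []
    size = min(len(items), max(1, round(len(items) * percentage / 100)))
    return rng.sample(items, size)


urls = ["http://example.com/page%d" % n for n in range(200)]
print(len(pick_a_sample(1, urls)))
```

&lt;p&gt;With a fixed size, a test can assert the exact number of picked items instead of an upper bound that holds only most of the time.&lt;/p&gt;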

&lt;h2&gt;Code&lt;/h2&gt;

&lt;p&gt;I'm in the process of learning Clojure, so probably a lot of things could be improved.&lt;/p&gt;

&lt;p&gt;For Selenium, we first need to start the server, then the client, and then use the client to browse the pages. As it is not very elegant to have "start server" and "start client" calls at the top of the script and "stop client" and "stop server" calls at the end, I've wrapped those in a macro (one of the major strengths of Lisp-like languages).&lt;/p&gt;
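
&lt;p&gt;For comparison, the same start/stop wrapping can be sketched in Python with a context manager. FakeService is a made-up stand-in that only logs calls, so nothing here talks to Selenium. Note one difference from the macro below: this sketch uses try/finally, so both services are stopped even when the wrapped code raises.&lt;/p&gt;

```python
from contextlib import contextmanager


class FakeService:
    """Stand-in for the Selenium server or client; it only logs calls."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def start(self):
        self.log.append("start " + self.name)

    def stop(self):
        self.log.append("stop " + self.name)


@contextmanager
def with_selenium(log):
    """Start server then browser, and always stop them in reverse order."""
    server = FakeService("server", log)
    server.start()
    try:
        browser = FakeService("browser", log)
        browser.start()
        try:
            yield browser
        finally:
            browser.stop()
    finally:
        server.stop()


log = []
with with_selenium(log) as browser:
    log.append("visit pages")
print(log)
```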

&lt;p&gt;The whole thing goes like this:&lt;/p&gt;

&lt;p&gt;process-sitemap receives a sitemap, transforms it into a tree (with xml-to-zip), collects the URL links in it, then picks a sample from them (with pick-a-sample) and calls check-pages with them.&lt;/p&gt;

&lt;p&gt;check-pages gets a list of URLs. It starts by using the macro, obtains a-browser from it, then iterates over the list of URLs, calling check-a-page on each URL (a-url). Note that at this point the standard output is redirected to a file, so I can log the results from check-a-page.&lt;/p&gt;

&lt;p&gt;check-a-page gets a-browser and a-url, so you can guess what it will do :) It opens that URL in the browser, calls the JavaScript, and prints the return value of the JS call to standard output.&lt;/p&gt;

&lt;p&gt;I hope Google does not mind me using their site as an example. But do not run this on Google's site; it's just an example, use it on your own site!&lt;/p&gt;

&lt;p&gt;For this to run you will need a bunch of jar libs in your classpath. This is what my lib folder looks like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;lib/
  clojure-contrib.jar
  clojure.jar
  jline-0.9.94.jar
  selenium-java-client-driver.jar
  selenium-server.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I called this app &lt;em&gt;coverager&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;;;file: coverager.clj
(ns coverager
  (:import (com.thoughtworks.selenium DefaultSelenium)
    (org.openqa.selenium.server SeleniumServer)
      java.util.Date
      (java.io FileWriter)
      (java.text SimpleDateFormat))
  (:use clojure.contrib.zip-filter.xml)
  (:require [clojure.zip :as zip]
            [clojure.xml :as xml]))

(defmacro with-selenium
  [browser &amp;amp; body]
  `(let [server# (new SeleniumServer)]
    (.start server#)
    (let [~browser 
         (new DefaultSelenium "localhost", 4444, "*firefox", "http://www.google.com/")]
      (.start ~browser)
      (.setTimeout ~browser "100000")
      ~@body
      (.stop ~browser))
      (.stop server#)))

(def *js-eval* "this.browserbot.getCurrentWindow().document.title;")

(defn check-a-page [a-browser a-url]
  (try
    (.open a-browser a-url)
    (Thread/sleep 3000) ; make a little timeout, to avoid overloading server
    (println (str a-url "," (.getEval a-browser *js-eval*)))
    (catch Exception e
      (println (str a-url "," e)))))

(defn check-pages [url-list]
  (with-selenium browser
    (binding [*out* (FileWriter. 
         (str "output/log_" (.format (SimpleDateFormat. "yyyy-MM-dd") (Date.)) ".csv"))]
      (doseq [a-url url-list]
        (check-a-page browser a-url)))))

(defn xml-to-zip
  "read xml url into a tree"
  [url]
  (zip/xml-zip (xml/parse url)))

(defn pick-a-sample
  "picks a subset (a-)percentage of the total"
  [a-percentage a-list]
  ;; keep each item with probability a-percentage/100
  (filter (fn [_] (&amp;gt; (rand) (- 1 (/ a-percentage 100)))) a-list))

(defn process-sitemap [sitemap-url]
  (let [u-list (xml-&amp;gt; (xml-to-zip sitemap-url) :url :loc text)]
    (check-pages (pick-a-sample 1 u-list))))

(def *sitemap* "http://www.google.com/sitemap.xml")

;use: (process-sitemap *sitemap*)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And of course tests for it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;;;file: coverager_test.clj
(ns coverager_test
  (:use clojure.contrib.test-is)
  (:use coverager)
  (:use clojure.contrib.zip-filter.xml)
  (:require [clojure.zip :as zip]
            [clojure.xml :as xml]))

(deftest browse-page
  (with-selenium abrowser  
    (.open abrowser "http://www.google.com/a/")
    (is (.startsWith (.getTitle abrowser) "Google Apps"))))

(def abit "&amp;lt;?xml version='1.0' encoding='UTF-8'?&amp;gt;
&amp;lt;urlset xmlns='http://www.sitemaps.org/schemas/sitemap/0.9'&amp;gt;
 &amp;lt;url&amp;gt;
  &amp;lt;loc&amp;gt;http://www.google.com/&amp;lt;/loc&amp;gt;
  &amp;lt;lastmod&amp;gt;2009-04-03&amp;lt;/lastmod&amp;gt;
  &amp;lt;priority&amp;gt;0.5000&amp;lt;/priority&amp;gt;
 &amp;lt;/url&amp;gt;
 &amp;lt;url&amp;gt;
  &amp;lt;loc&amp;gt;http://www.google.com/a&amp;lt;/loc&amp;gt;
  &amp;lt;lastmod&amp;gt;2009-04-03&amp;lt;/lastmod&amp;gt;
  &amp;lt;priority&amp;gt;0.5000&amp;lt;/priority&amp;gt;
 &amp;lt;/url&amp;gt;
&amp;lt;/urlset&amp;gt;
")

(deftest xml-process
  (let [res (xml-to-zip (org.xml.sax.InputSource. (java.io.StringReader. abit)))]
    (let [lis (xml-&amp;gt; res :url :loc text)]
      (is (= (first lis) "http://www.google.com/"))
      (is (= (last lis) "http://www.google.com/a")))))

(deftest on-picking-sample
  (let [the-sample (pick-a-sample 10 '(0 1 2 3 4 5 6 7 8 9))]
    ;not completely guaranteed to take only 1 item;
    ;it should in most cases, but what matters is
    ;picking a small random subset of the list,
    ;so fewer than 3 items is a reasonable test
    (is (&amp;lt; (count the-sample) 3))))

(defn run-them []
  (run-tests 'coverager_test))
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Take away&lt;/h2&gt;

&lt;p&gt;Clojure is great! In my opinion, code in the Lisp family of languages is more elegant and visually cleaner than in the C family.&lt;/p&gt;

&lt;p&gt;I don't care much for working directly with the Java language, but working on the JVM with other languages like JRuby and Clojure, and harnessing the vast amount of Java libs and infrastructure out there, is a MAJOR advantage.&lt;/p&gt;

&lt;p&gt;I suspect I will be spending more time with Clojure in the future :)&lt;/p&gt;
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/Ffi8XMnMPoM" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2009/04/24/clojure-selenium-crawler.html</feedburner:origLink></entry>
 
 <entry>
   <title>Lem-E-Tweakit and Logic programming</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/GaV2FmO8mnY/sicp-prolog-sql.html" />
   <updated>2009-02-02T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2009/02/02/sicp-prolog-sql</id>
   <category term="SQL" label="SQL" />
   
   <content type="html">&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/tweakit.png" alt="http://al3xandr3.github.com/img/tweakit.png" /&gt; While watching the SICP
lectures 8a &amp;amp; 8b, one thing I realized is that the logic programming
they present seems very similar to the kind of things we use SQL for, just
better, i.e. a lot more flexible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So why do we use SQL after all?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After a bit of googling I found this good article: &lt;a href="http://search.cpan.org/dist/AI-Prolog/lib/AI/Prolog/Article.pod"&gt;http://search.cpan.org/dist/AI-Prolog/lib/AI/Prolog/Article.pod&lt;/a&gt;. One of its references says:&lt;/p&gt;

&lt;p&gt;"...So if Prolog(read AI) and SQL(DB) are so similar, Why is one so successful
commercially and the other deemed a complete failure in terms of
scalability?..."&lt;/p&gt;

&lt;p&gt;Relational Databases implement powerful techniques to improve performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Indexing&lt;/li&gt;
&lt;li&gt;Hashing&lt;/li&gt;
&lt;li&gt;Reordering goals to reduce backtracking&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Whereas Prolog-based systems have very few such techniques.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is SQL missing?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most powerful features of Prolog is recursion.&lt;/p&gt;

&lt;p&gt;SQL does not have recursion built into it. This is a severe handicap. However,
there is a way to overcome this problem by invoking multiple SQL queries from
a host language like C or Java. SQL3 has begun supporting recursion.&lt;/p&gt;
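
&lt;p&gt;To make the recursion point concrete: the classic Prolog example is the pair of rules ancestor(X, Z) :- parent(X, Z). and ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z). The same query is sketched below in Python as the host language, over a made-up table of (parent, child) pairs. SQL:1999 (SQL3) expresses this kind of query with WITH RECURSIVE.&lt;/p&gt;

```python
# Made-up facts: (parent, child) pairs, like rows in a two-column table.
PARENT = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}


def descendants(person, parent=PARENT):
    """All descendants of person, found by following parent links recursively."""
    found = set()
    for p, child in parent:
        if p == person:
            found.add(child)
            found |= descendants(child, parent)
    return found


print(sorted(descendants("ann")))
```

&lt;p&gt;The sketch assumes the data has no cycles; a real implementation would also track visited rows.&lt;/p&gt;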
&lt;img src="http://feeds.feedburner.com/~r/al3xandr3/~4/GaV2FmO8mnY" height="1" width="1"/&gt;</content>
 <feedburner:origLink>http://al3xandr3.github.com/2009/02/02/sicp-prolog-sql.html</feedburner:origLink></entry>
 
 <entry>
   <title>Why my keyboard has a QWERTY layout?</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/SQllySShQW0/keyboard-qwerty-layout.html" />
   <updated>2009-01-21T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2009/01/21/keyboard-qwerty-layout</id>
   <category term="abtesting" label="abtesting" />
   <category term="keyboard" label="keyboard" />
   <category term="idea" label="idea" />
   
   <content type="html">&lt;p&gt;The most common keyboard layout in use today is called QWERTY, it takes its
name from the first six characters seen in the far left of the keyboard's top
row of letters. &lt;img src="http://al3xandr3.github.com/img/keyb.png" alt="http://al3xandr3.github.com/img/keyb.png" /&gt;&lt;/p&gt;

&lt;h2&gt;Why QWERTY?&lt;/h2&gt;

&lt;p&gt;from Wikipedia:&lt;/p&gt;

&lt;p&gt;The QWERTY layout was introduced in the 1860s, being used on the first
commercially-successful typewriter, the machine invented by Christopher
Sholes. The QWERTY layout was designed so that successive keystrokes would
alternate between sides of the keyboard so as to avoid jams. Improvements in
typewriter design made key jams less of a problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So the mechanical issues of the original typewriters were the main reason for this layout design.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A second popular layout, called Dvorak, was designed in the 1930s and tries
to address another problem:&lt;/p&gt;

&lt;p&gt;"...the introduction of the electric typewriter in the 1930s made typist fatigue
more of a problem, leading to increased interest in the Dvorak layout..."&lt;/p&gt;

&lt;p&gt;But even the Dvorak layout is decades old, so is &lt;strong&gt;typist
fatigue&lt;/strong&gt; the same now as it was then? Do we write, for example, the same words as in
the 1930s? I think not...&lt;/p&gt;

&lt;h2&gt;Is there a better layout?&lt;/h2&gt;

&lt;p&gt;Take a look at the texts you write every day, and imagine whether changing the
keyboard layout would make typing them more practical for you.&lt;/p&gt;

&lt;p&gt;For example, for this text I had to write the word WHY a lot ;) so if I had
whyrtu instead of qwerty on my keyboard, it would be an improvement... right? Maybe
yes, but there are other things to take into account, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;counting ALL of the most used words (not only "why").&lt;/li&gt;
&lt;li&gt;which key combinations minimize vertical finger movement? Jumping from the 1st to the 3rd row is less efficient than jumping from the 1st to the 2nd row.&lt;/li&gt;
&lt;li&gt;which key combinations minimize horizontal finger movement?&lt;/li&gt;
&lt;li&gt;frequent use of the pinky should be minimized, which relates to the long horizontal and vertical movements above.&lt;/li&gt;
&lt;li&gt;etc... (many other heuristics)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;And then try out several keyboard layouts and variations.&lt;/p&gt;
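
&lt;p&gt;As a toy illustration of trying out layouts, here is a small Ruby sketch that scores only one of the heuristics above, vertical finger movement, by summing the row jumps between consecutive letters of every word. The WHYRTU layout is a made-up variation for this example, and the cost function is an assumption of mine, not the method used in the experiment linked further down.&lt;/p&gt;

```ruby
# Letter rows of each layout, top to bottom.
QWERTY = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
# Hypothetical layout for illustration: "why" moved onto the top row.
WHYRTU = ["whyrtuiop", "asdfgejkl", "zxcvbnmq"]

# Map each letter to the index of the row it sits on.
def row_of(layout)
  rows = {}
  layout.each_with_index { |keys, r| keys.each_char { |c| rows[c] = r } }
  rows
end

# Sum of row jumps between consecutive letters, over all words in the text.
def vertical_cost(text, layout)
  rows = row_of(layout)
  text.downcase.scan(/[a-z]+/).sum do |word|
    word.chars.each_cons(2).sum { |a, b| (rows[a] - rows[b]).abs }
  end
end

sample = "why why why"
puts "qwerty: #{vertical_cost(sample, QWERTY)}"
puts "whyrtu: #{vertical_cost(sample, WHYRTU)}"
```

&lt;p&gt;A real comparison would feed in the texts you actually write and add terms for horizontal movement and pinky use.&lt;/p&gt;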

&lt;p&gt;Doing this by hand, like Dvorak did for his layout and Sholes did for QWERTY, is a pain.
Luckily we have computers now, and as a matter of fact someone already played
around with this problem (which is what made me think about this whole subject in the first
place).&lt;/p&gt;

&lt;p&gt;See here for the experiment: &lt;a href="http://klausler.com/evolved.html"&gt;http://klausler.com/evolved.html&lt;/a&gt;, and also the
final keyboard layout: http://klausler.com/evolved.pdf&lt;/p&gt;

&lt;h2&gt;Your Own keyboard&lt;/h2&gt;

&lt;p&gt;I'm not sure that there is the ONE perfect keyboard for everyone, because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;different people use different words.&lt;/li&gt;
&lt;li&gt;different languages use different words.&lt;/li&gt;
&lt;li&gt;a 15-year-old's words are different from a 70-year-old's words.&lt;/li&gt;
&lt;li&gt;the way you use a computer during working hours differs from the way you use it outside of work...&lt;/li&gt;
&lt;li&gt;etc...&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So even for you, an optimal keyboard layout would probably change over time.&lt;/p&gt;

&lt;p&gt;But imagine a future intelligent computer that continuously analyzes
what you type on the keyboard and adapts itself to give you your very own
optimal layout. Not too frequently, of course; you don't want your keyboard
changing every day. But maybe every 3 years is not so impossible to imagine...&lt;/p&gt;

&lt;p&gt;Keyboards of the future should come with blank keys that can be personalized
with the character you want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How would this work if someone changes computers?&lt;/strong&gt; The keyboard profile could be fetched from the internet when you log in to the computer.&lt;/p&gt;

&lt;p&gt;Another idea is to have full words on keys, for example your top 5 most used
words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Touch screens:&lt;/strong&gt; with a touch screen there's a whole new level of layout possibilities. We could make frequently used keys bigger than others, for example, or play with different keyboard shapes that adapt to each person's real hand size, etc...&lt;/p&gt;
</content>
 <feedburner:origLink>http://al3xandr3.github.com/2009/01/21/keyboard-qwerty-layout.html</feedburner:origLink></entry>
 
 <entry>
   <title>Little 'ruby' photo organizer</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/9IW2ivGKhEM/ruby-foto-organizer.html" />
   <updated>2008-11-25T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2008/11/25/ruby-foto-organizer</id>
   <category term="automation" label="automation" />
   <category term="ruby" label="ruby" />
   
   <content type="html">&lt;p&gt;Place this script in a folder full of pictures, run it, and it will organize
the pictures into folders by the day they were taken.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ ruby photo_organizer.rb
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Code:&lt;/h2&gt;

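&lt;pre&gt;&lt;code&gt;#!/usr/bin/env ruby
# filename: photo_organizer.rb

require 'fileutils'
require 'exifr' # newer exifr releases may need 'exifr/jpeg' and 'exifr/tiff' instead

EXTENSIONS = [".JPG",".JPEG",".PNG",".AVI",".WAV",".NEF",".MOV",".TIFF"]

# Day the picture was taken, as "YYYYMMDD". Falls back to the file's
# modification time when no EXIF date can be read (PNG, AVI, ...).
def picDate(file)
  ex = begin
    EXIFR::TIFF.new(file)
  rescue
    EXIFR::JPEG.new(file)
  end
  (ex.date_time || File.mtime(file)).strftime "%Y%m%d"
rescue
  File.mtime(file).strftime "%Y%m%d"
end

# Only regular files with a known media extension count as pictures.
def isPic?(file)
  File.file?(file) and EXTENSIONS.include?(File.extname(file).upcase)
end

# Work out the target folder of every picture once, before moving anything.
dates = {}
Dir.entries(".").each { |f| dates[f] = picDate(f) if isPic? f }

print "Creating dirs"
dates.values.uniq.each do |dt|
  next if File.directory? dt
  Dir.mkdir dt
  print '.'
end

print "\nMoving pics"
dates.each do |f, dt|
  FileUtils.mv f, File.join(dt, f)
  print '.'
end
puts
&lt;/code&gt;&lt;/pre&gt;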
</content>
 <feedburner:origLink>http://al3xandr3.github.com/2008/11/25/ruby-foto-organizer.html</feedburner:origLink></entry>
 
 <entry>
   <title>Funds 'R' US</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/Q5jhghlT0Ys/funds-r-us.html" />
   <updated>2008-02-12T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2008/02/12/funds-r-us</id>
   <category term="idea" label="idea" />
   
   <content type="html">&lt;p&gt;Is it only me, or is it a bit annoying that every year around Christmas, birthdays,
etc., you never know exactly what to give and end up buying and giving useless stuff...&lt;/p&gt;

&lt;p&gt;It's still nice to give little things anyway; a typical Valentine's
Day gift should not be changed, I guess. But from a pragmatic point of view,
giving yet another pair of pajamas is just... waste...&lt;/p&gt;

&lt;p&gt;A common Deja Vu?: "It's a week to Christmas, I'm at the shopping mall
already, I need to buy Andrew a gift, what can I get him?? Let's check the
bookstore. What could he like... maybe this car collection book... or this
landscape photo album..."&lt;/p&gt;

&lt;p&gt;How about giving Andrew something that he really wants or needs? I know
that wishlists already exist (Amazon, Google, etc. have them), and they are
very nice: you can look at Andrew's wishlist, see a book there that he really
wants, and get that. But how about big things? Things that you
cannot give Andrew by yourself.&lt;/p&gt;

&lt;p&gt;Like, Andrew really wants a piano. I cannot give Andrew a piano, but maybe I
can contribute to Andrew's piano, kind of like buying a couple of piano keys...&lt;/p&gt;

&lt;p&gt;That's where Funds'R'US comes in. Andrew would go to Funds'R'US and create
a piano fund for himself. When it's Christmas time I can go to Funds'R'US, check
Andrew's funds, and find out that he really wants a piano. So I can, online and
quickly, without needing to go look for random stuff in shops, give him
something useful that will make Andrew really happy. I would deposit
some money into Andrew's piano fund, and then Funds'R'US could even send a
personalized postcard to Andrew with a picture of a piano, with a fraction of
the picture highlighted, saying: Merry Christmas, Alex. After enough Christmases and
birthdays have passed, Andrew will eventually have the money for a full piano,
instead of a pile of not so useful stuff.&lt;/p&gt;

&lt;p&gt;When I discussed this idea with a friend of mine, Safin Ahmed (the zen master of
(crazy?) ideas :) ), he suggested extending it a bit further:
Funds'R'US could support more than personal funds, it could also have
funds for organizations. For example, Unicef wants to give a health care center to a
city in need, or your own town's homeless center wants to support 5 more people
next winter, etc...&lt;/p&gt;
</content>
 <feedburner:origLink>http://al3xandr3.github.com/2008/02/12/funds-r-us.html</feedburner:origLink></entry>
 
 <entry>
   <title>Big Brother Google</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/_R79BOOEogU/google-big-brother.html" />
   <updated>2008-01-18T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2008/01/18/google-big-brother</id>
   <category term="data" label="data" />
   
   <content type="html">&lt;p&gt;&lt;strong&gt;(A)Typical Day?&lt;/strong&gt;: It's morning, you get to your computer and check Gmail. Ah, there are some new photos on Picasa Web Albums from a friend, let's check them out...&lt;/p&gt;

&lt;p&gt;Then, what's new today? Let's check some RSS feeds in Google Reader. And by
the way, how is the real world doing? Let's check Google News. Hmm, this is
interesting, let's do some browsing on it. Open Google Search...&lt;/p&gt;

&lt;p&gt;&lt;em&gt;search, search, search ...&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Lunch time. You check Gmail again, then Google Talk pops up inside the browser with
a friend asking how you are doing, and you chat a little.&lt;/p&gt;

&lt;p&gt;Let's also take a peek at how the stock options are doing on Google Finance.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;search, search, search ...&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's late afternoon, you have scheduled a dinner and you wonder: where exactly is
that restaurant's street? Let's check Google Maps (or Google Earth for an even
cooler experience). You find out exactly where the restaurant is and even add a
placemark...&lt;/p&gt;

&lt;p&gt;You get back home from dinner, download your digital camera pictures onto the computer,
upload them to Picasa Web Albums, and write a blog post (on Google's Blogger) showing a
picture from that night, adding some comments about where you were, what you
did, and who you were with.&lt;/p&gt;

&lt;p&gt;You check your mail again, and your friend has sent you a YouTube video. Let's check
that out, and maybe 1, 2, 3... 10 more videos...&lt;/p&gt;

&lt;p&gt;Besides all these, you might also use Google for making your home page, sharing
documents, sharing a calendar, Google Analytics for collecting your web site
stats, Google Desktop Search for fast searches on your computer, Google
Shopping, Google Images, etc...&lt;/p&gt;

&lt;p&gt;So overall, how many Google applications are you using? How much information
does Google have, or can potentially have, about you? By the way, note that being
able to handle this amount of data in a usable way is no easy task, but still...
we can only imagine...&lt;/p&gt;

&lt;p&gt;And actually, from a data mining perspective, this is almost a dream
situation. Imagine the knowledge collected from all the situations described
above: the searches you do, the photos you take, the places you look at in the
world, where you go, what images you search for, what you click on, what
your interests are, what you blog about, what news you subscribe to, maybe some
information on what you work on and some information about your school
(with Google Desktop indexing all your documents)... it's endless: your agenda, your
shopping habits, etc...&lt;/p&gt;

&lt;p&gt;Soon Google will know more about you than you do. Imagine the ultimate Google
application, where you type the following into the search box:&lt;/p&gt;

&lt;p&gt;Me: Show me what's new today!&lt;/p&gt;

&lt;p&gt;Google (all the news you are interested in):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Video of the new MacBook Air&lt;/li&gt;
&lt;li&gt;Ruby 1.9 released&lt;/li&gt;
&lt;li&gt;News on the new Skype version for Mac&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Me: Entertain me!&lt;/p&gt;

&lt;p&gt;Google: Do you want to?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Play a Joe Satriani CD&lt;/li&gt;
&lt;li&gt;Watch a Seinfeld episode&lt;/li&gt;
&lt;li&gt;Check out a chocolate cake recipe&lt;/li&gt;
&lt;li&gt;Buy a chocolate cake&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Me: What is my favorite color, movie, drink?&lt;/p&gt;

&lt;p&gt;Google: black, simpsons, coffee!&lt;/p&gt;

&lt;p&gt;Google: Hey!!&lt;/p&gt;

&lt;p&gt;Me: yes?&lt;/p&gt;

&lt;p&gt;Google: Haven't you forgotten to pay your water bill? And don't forget that your
grandma's birthday is in 3 days, go buy something nice!&lt;/p&gt;

&lt;p&gt;Me: hmm... oops... yeah, thanks&lt;/p&gt;

&lt;p&gt;Google: Maybe you want to check out this cookbook (link): 13 recipes
contain ingredients your grandma likes... and shipping is free to her
address area.&lt;/p&gt;
</content>
 <feedburner:origLink>http://al3xandr3.github.com/2008/01/18/google-big-brother.html</feedburner:origLink></entry>
 
 <entry>
   <title>Visualizing Data, with Processing and JRuby</title>
   <link href="http://feedproxy.google.com/~r/al3xandr3/~3/-P3DJfy4E98/visualizing-data-processing-ruby.html" />
   <updated>2008-01-14T00:00:00-08:00</updated>
   <id>http://al3xandr3.github.com/2008/01/14/visualizing-data-processing-ruby</id>
   <category term="data" label="data" />
   <category term="visualization" label="visualization" />
   <category term="statistics" label="statistics" />
   <category term="ruby" label="ruby" />
   
   <content type="html">&lt;p&gt;Here's a data visualization experiment, including a mini data warehouse, to
visualize the number of vegetarians around the world.
&lt;img src="http://al3xandr3.github.com/img/vis-visual.png" alt="http://al3xandr3.github.com/img/vis-visual.png" /&gt;&lt;/p&gt;

&lt;p&gt;I planned the following:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/vis-mmap.png" alt="http://al3xandr3.github.com/img/vis-mmap.png" /&gt;&lt;/p&gt;

&lt;p&gt;It goes like this: imagine looking at a world map, with each country showing its
number of vegetarians. You should be able to zoom in on Europe, for example, to
see which country leads in vegetarian eating, or zoom out and see the whole world
picture; click on a specific country and see the statistics for that country
for a full month; or choose a particular day of the month to visualize the whole
world on that day. Map-like navigation is also desired, where it is possible to
drag the map around and zoom in and out.&lt;/p&gt;

&lt;p&gt;On the technical side, as I was interested in using the Processing framework and
because I am a Ruby addict, this turned out to be a good excuse to play with
JRuby.&lt;/p&gt;

&lt;h2&gt;part i, Aggregating Data&lt;/h2&gt;

&lt;p&gt;Normally this process involves a lot of work, but I had a shortcut: I was able
to collect clean data from another database. I'm interested in the table with
vegetarian people. But what to collect? What to summarize/aggregate? What to
calculate?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;NOTE:&lt;/em&gt; Specify the goal of the visualization up front, as much as
possible; it will influence the whole design. I had to repeat the
initial steps a couple of times as the visualization ideas developed, for example
re-aggregating all the data and calculating new fields.&lt;/p&gt;

&lt;p&gt;So, in warehouse fashion, let's choose the facts and dimensions:&lt;/p&gt;

&lt;p&gt;Facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;number of vegetarians.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time.&lt;/li&gt;
&lt;li&gt;Location (country).&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;em&gt;facts:&lt;/em&gt; are generally numeric data that capture specific values.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;dimensions:&lt;/em&gt; contain the reference information that gives each transaction
its context. When dimensions are created they should be enriched with as much
information as possible (including calculated values).&lt;/p&gt;

&lt;p&gt;The next step is to build the warehouse. For this I used a plain database, where I
created 3 tables:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/vis-dw.png" alt="http://al3xandr3.github.com/img/vis-dw.png" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimension Country:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Initially I only had the 2-character ISO code identifying each country, but I enriched the
dimension with all the other values.&lt;/p&gt;

&lt;p&gt;I used the geonames.org web service to collect the other values. Especially important are
the geo coordinates of the country's bounding box, which were used to calculate
a central latitude and a central longitude for each country; these are going to be
used for the visualization.&lt;/p&gt;
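
&lt;p&gt;A minimal sketch of that calculation in Ruby: the center is simply the midpoint of the bounding box. The method name, the keyword names and the sample coordinates are assumptions for illustration, and the midpoint is only reasonable for boxes that do not cross the 180-degree meridian.&lt;/p&gt;

```ruby
# Midpoint of a bounding box given in decimal degrees.
# Assumes the box does not wrap around the 180-degree meridian.
def bbox_center(north:, south:, east:, west:)
  { lat: (north + south) / 2.0, lng: (east + west) / 2.0 }
end

# Rough, illustrative box around mainland Portugal (not real geonames output).
p bbox_center(north: 42.0, south: 37.0, east: -6.0, west: -9.5)
```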

&lt;p&gt;Things like continent, population and capital can be used later for
summarizing data by continent, for showing the ratio of vegetarians to
total population, the number of vegetarians per square meter, etc... think
of the possibilities... :)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimension Time:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I made a day the finest level of granularity. Then, from a day, we can
calculate day, month, year, day of week, weekday?, day in year, day in month,
quarter, weekday name, etc...&lt;/p&gt;

&lt;p&gt;What is this useful for? Well, imagine you want to see the number of vegetarians on
Wednesdays compared to Mondays, or the same for quarters or months. Maybe
getting close to the summer months, the number of veggies goes up a bit?&lt;/p&gt;
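
&lt;p&gt;To make the derived attributes concrete, here is a small Ruby sketch that builds one row of a date dimension from a single day, using only the standard Date class. The column names are assumptions and may not match the actual dimdate table.&lt;/p&gt;

```ruby
require 'date'

# One row of a hypothetical date dimension, derived from a single day.
def dim_date_row(date)
  {
    day:          date.day,
    month:        date.month,
    year:         date.year,
    day_of_week:  date.cwday,                 # 1 = Monday .. 7 = Sunday
    weekday:      date.cwday.between?(1, 5),  # false on Saturday and Sunday
    day_in_year:  date.yday,
    quarter:      (date.month - 1) / 3 + 1,
    weekday_name: date.strftime("%A")
  }
end

p dim_date_row(Date.new(2008, 1, 14))
```

&lt;p&gt;With these columns in place, comparing Mondays to Wednesdays or quarter to quarter becomes a plain filter or group-by on the dimension.&lt;/p&gt;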

&lt;h3&gt;Aggregating&lt;/h3&gt;

&lt;p&gt;With the basic schema laid out, it's time for data collection. I used the
ActiveRecord part of the Rails framework, running on JRuby. It's not the first time
I've used ActiveRecord standalone and I like it a lot... it simplifies data
access hugely, and because it's all inside Ruby, I get the added bonus of doing
some calculations that would be much harder in pure SQL. These collected and
calculated values are then inserted into a local MySQL database using the schema above:
factvegetarian, dimdate and dimcountry.&lt;/p&gt;
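
&lt;p&gt;The aggregation step itself can be sketched in plain Ruby, without ActiveRecord or a database. The source rows and field names below are made up; the idea is only to show raw records being counted per (date, country) combination, which gives one fact row each.&lt;/p&gt;

```ruby
require 'date'

# Made-up source records: one per vegetarian, with a date and an ISO country code.
SOURCE = [
  { date: Date.new(2008, 1, 1), country: "PT" },
  { date: Date.new(2008, 1, 1), country: "PT" },
  { date: Date.new(2008, 1, 1), country: "US" },
  { date: Date.new(2008, 1, 2), country: "PT" }
]

# Count records per (date, country): one fact row per combination.
def fact_rows(source)
  source.group_by { |r| [r[:date], r[:country]] }.map do |(date, country), rows|
    { date: date, country: country, vegetarians: rows.size }
  end
end

fact_rows(SOURCE).each { |row| p row }
```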

&lt;p&gt;I've collected values for a whole month.&lt;/p&gt;

&lt;p&gt;This resulted in 225 lines of code for the warehouse part, with some comments...
but no repeated code.&lt;/p&gt;

&lt;h2&gt;part ii, Building a Visualizer&lt;/h2&gt;

&lt;p&gt;The visualizer is a loop that refreshes the interface. On each cycle the
database is queried with a set of filters, like view, date and country. The
filters are updated when the user clicks on the interface. For example, clicking on US
will set the country filter to US; on the next refresh the data for the US is
fetched and the interface updated accordingly.&lt;/p&gt;

&lt;p&gt;The application was divided into different drawing components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Show World Data, the opening scenario, showing the whole world for a 1-month period.&lt;/li&gt;
&lt;li&gt;Show Country, used for showing a specific country's stats.&lt;/li&gt;
&lt;li&gt;Show Stats, a strip at the bottom showing a graph of the number of vegetarians per day over a 1-month period.&lt;/li&gt;
&lt;li&gt;Show Buttons, the buttons used to control zoom, reset, etc...&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;(A refactoring would probably merge Show World Data and Show Country
into a single drawing component, since they have a lot of repeated code.)&lt;/p&gt;

&lt;p&gt;I've created a different module for each one, which were then mixed into the main
class that inherits from Processing.Sketch.&lt;/p&gt;

&lt;p&gt;Made some stuff clickable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Country codes, displayed on top of the countries, so the user can filter by a single country and see its stats at the bottom. This is done by identifying which country's coordinates are closest to the mouse coordinates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Also at the bottom, the days of the month on the x axis of the stats strip are clickable, so the user can select a particular day, and that will update the world visualization to show the number of vegetarians on that day for the whole world.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
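
&lt;p&gt;The country hit test can be sketched as a nearest-point search. This is a standalone Ruby illustration, not the original sketch code: the label positions are hypothetical screen coordinates, and squared distance is used because only the ordering matters.&lt;/p&gt;

```ruby
# Hypothetical screen positions (in pixels) of the country code labels.
LABELS = {
  "PT" => [120, 310],
  "ES" => [160, 300],
  "FR" => [200, 240]
}

# Return the code of the label closest to the mouse position.
def nearest_country(mouse_x, mouse_y, labels = LABELS)
  code, _pos = labels.min_by do |_code, (x, y)|
    (x - mouse_x)**2 + (y - mouse_y)**2
  end
  code
end

puts nearest_country(125, 305)
puts nearest_country(190, 250)
```

&lt;p&gt;In a real sketch the label positions would have to be recomputed from the country coordinates whenever the map is dragged or zoomed.&lt;/p&gt;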


&lt;p&gt;And here's what it looks like:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/vis-visual.png" alt="http://al3xandr3.github.com/img/vis-visual.png" /&gt;&lt;/p&gt;

&lt;p&gt;When zoomed in, and showing Portugal's stats at the bottom:&lt;/p&gt;

&lt;p&gt;&lt;img src="http://al3xandr3.github.com/img/vis-visual-zoom.png" alt="http://al3xandr3.github.com/img/vis-visual-zoom.png" /&gt;&lt;/p&gt;

&lt;p&gt;I ended up with 584 lines of code for the visualization part, with a big chunk of
repeated code.&lt;/p&gt;

&lt;p&gt;Overall, making the visualization was a lot more work than the warehouse part,
because I had a lot of fighting with correct coordinate positioning,
getting a decent map, and keeping the map's country coordinates in sync with the zoom.&lt;/p&gt;

&lt;p&gt;Using JRuby was mostly a nice experience. There are a couple of things to
learn at first, for example how to include Java libraries. No biggie, but I
also had a type conversion issue when I tried to refactor the code at some
point. I guess it's because of the Java types that the JRuby guys hide and
convert automatically... but most likely it's because of my inexperience with
JRuby...&lt;/p&gt;

&lt;p&gt;I've used version 1.0 of JRuby. I think the JRuby guys have done great
work, making all the millions of Java libraries out there accessible to the
Ruby community. But of course don't expect to write 100% Ruby code like you do with
plain old Ruby; sometimes there's some Java lurking out of the JRuby box.&lt;/p&gt;

&lt;h2&gt;the Good&lt;/h2&gt;

&lt;p&gt;Well, it's very cool to be able to use Ruby for drawing. It gives power that
regular Ruby does not have. There is a huge number of libraries to use with it.
The connection to Java is indeed very powerful.&lt;/p&gt;

&lt;h2&gt;the Bad&lt;/h2&gt;

&lt;p&gt;Visualizations are hard to get right, and I ended up with code repeated
all over the place. Why? Well, partly because I'm a newbie in JRuby, but partly
because Processing seems to fit better for small sketch visualizations.&lt;/p&gt;

&lt;h2&gt;Ideas&lt;/h2&gt;

&lt;p&gt;Is it possible to build a little architecture around it, to make it a bit
better by isolating all the drawing stuff?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Processing is great and has huge potential. I had a couple of troubles with 1
or 2 plugins I tried, but I ended up using the base distribution, and that works and
feels 100%. I look forward to doing more stuff with it, it's fun!&lt;/p&gt;
</content>
 <feedburner:origLink>http://al3xandr3.github.com/2008/01/14/visualizing-data-processing-ruby.html</feedburner:origLink></entry>
 

</feed>

