<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Quantitative thoughts</title>
	
	<link>http://www.investuotojas.eu</link>
	<description>Quantitative investment strategies</description>
	<lastBuildDate>Thu, 10 Jan 2013 08:53:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.2</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/investuotojas" /><feedburner:info uri="investuotojas" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Machine learning for hackers</title>
		<link>http://feedproxy.google.com/~r/investuotojas/~3/DxyMs6D6p78/</link>
		<comments>http://www.investuotojas.eu/2012/10/23/machine-learning-for-hackers/#comments</comments>
		<pubDate>Tue, 23 Oct 2012 15:48:35 +0000</pubDate>
		<dc:creator>Dzidorius Martinaitis</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[data analysis]]></category>
		<category><![CDATA[EN]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[R-language]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[Lithuania]]></category>
		<category><![CDATA[mining]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[politics]]></category>
		<category><![CDATA[quantitative]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.investuotojas.eu/?p=1006</guid>
		<description><![CDATA[How do you prefer to learn new material: deep theoretical background first and practice later, or do you like to break things in order to fix them? If the latter is your way of learning, you will most likely enjoy Machine Learning for Hackers. The book has chapters on machine learning [...]]]></description>
			<content:encoded><![CDATA[<p>How do you prefer to learn new material: deep theoretical background first and practice later, or do you like to break things in order to fix them? If the latter is your way of learning, you will most likely enjoy <a href="http://www.amazon.com/gp/product/1449303714/ref=as_li_tf_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1449303714&amp;linkCode=as2&amp;tag=quantitativ0e-20">Machine Learning for Hackers</a>.</p>
<p>The book has chapters on machine learning techniques such as PCA, kNN and analysis of social graphs, hence even advanced R users might find something interesting. I want to show you my example of visualising similarity between parliamentarians in <a href="http://en.wikipedia.org/wiki/Lithuania">Lithuania</a>, an idea taken from chapter 9.</p>
<p>In most cases you should be able to get access to the voting results of the legislative body in your country. Nevertheless, the data can be buried in a &quot;wrong&quot; format such as HTML or PDF. I use the <a href="http://scrapy.org/">Scrapy</a> framework to parse HTML pages, however I faced a problem when my IP address was blocked due to too many requests (10 000) within 2 hours. In the cloud age the problem was quickly solved, and I added a delay to my crawler. <a href="https://github.com/kafka399/votingDistance/tree/master/getdata">Here are</a> examples of the data in CSV format.</p>
<p>With the data in hand it was easy to proceed. To find similarities between parliamentarians I took the voting results of approximately 4000 pieces of legislation and built a matrix where rows represent parliamentarians and columns represent pieces of legislation. &quot;Yes&quot; votes were encoded as 1, &quot;No&quot; as -1 and the rest as 0. R has a handy function <code>dist</code> to compute the distances between the rows (parliamentarians) of a data matrix. The result holds only the pairwise distances between parliamentarians, however to reveal the structure of the data set on a plot we need coordinates in two dimensions. Once again, R has a function <code>cmdscale</code> which does <a href="http://en.wikipedia.org/wiki/Multidimensional_scaling">Classical Multidimensional Scaling (MDS)</a>: it places the points in two dimensions so that the distances between them are preserved as closely as possible. I found <a href="http://www.bristol.ac.uk/cmm/publications/aimdss-2nd-ed/chapter3.pdf">this document</a> very useful in explaining multidimensional scaling. Here is the final result:</p>
<p><a href="http://dl.dropbox.com/u/6360678/blog/big_mds.png"><img src="http://dl.dropbox.com/u/6360678/blog/small_mds.png"></a></p>
<p>Click on the image to enlarge.</p>
<p>The plot above reveals that the right-wing party TSLKD has a majority in parliament, LSDP (socialists) are in opposition and the liberals (LSF, JF, MG) are in the center. You might argue that this is already known, however the plot is based on actual voting data, so the differences in voting confirm the outlooks of the parliamentarians (right, center, left).<br />The map shows which members of a party are outliers and which members of other parties could be invited while forming a new parliament (the second round of the election is on the way).<br />Members of the left wing are mixed up, and it would make sense for them to merge or form a coalition.</p>
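<p>For readers who want to try the <code>dist</code> and <code>cmdscale</code> steps without the real voting data, here is a minimal sketch on a toy matrix. The four parliamentarians (A to D) and the four bills are made up for illustration, and the encoding follows the description in this post; the actual script in the repository may differ.</p>
<pre><code># Toy vote matrix: rows are parliamentarians, columns are pieces of legislation.
# 1 = "Yes", -1 = "No", 0 = anything else (abstained, absent).
votes = matrix(c( 1,  1, -1,  1,
                  1,  1, -1,  0,
                 -1, -1,  1, -1,
                 -1,  0,  1, -1),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("A", "B", "C", "D"), paste0("bill", 1:4)))

# Pairwise Euclidean distances between the rows (parliamentarians)
d = dist(votes)

# Classical multidimensional scaling: two coordinates per parliamentarian
coords = cmdscale(d, k = 2)

# Draw the map with names instead of points
plot(coords, type = "n", xlab = "", ylab = "")
text(coords, labels = rownames(coords))</code></pre>
<p>Parliamentarians who vote alike (here A and B, C and D) should end up close to each other on the resulting map.</p>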
<p>Are you looking for source code? <a href="https://github.com/kafka399/votingDistance">Click here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investuotojas.eu/2012/10/23/machine-learning-for-hackers/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		<feedburner:origLink>http://www.investuotojas.eu/2012/10/23/machine-learning-for-hackers/</feedburner:origLink></item>
		<item>
		<title>Garmin data visualization</title>
		<link>http://feedproxy.google.com/~r/investuotojas/~3/GdUETWpxjgY/</link>
		<comments>http://www.investuotojas.eu/2012/10/04/garmin-data-visualization/#comments</comments>
		<pubDate>Thu, 04 Oct 2012 15:52:14 +0000</pubDate>
		<dc:creator>Dzidorius Martinaitis</dc:creator>
				<category><![CDATA[data analysis]]></category>
		<category><![CDATA[EN]]></category>
		<category><![CDATA[R-language]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[quantitative]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[runnig]]></category>

		<guid isPermaLink="false">http://www.investuotojas.eu/?p=969</guid>
		<description><![CDATA[People rage when governments initiate surveillance projects like CleanIT, yet they share very private data without a second thought. I have to admit that some data leaks are well buried in the process. Take for example Garmin, which produces GPS training devices for runners. In order to see your workouts you are forced to upload [...]]]></description>
			<content:encoded><![CDATA[<p>People rage when governments initiate surveillance projects like <a href="http://www.edri.org/cleanIT">CleanIT</a>, yet they share very private data without a second thought.</p>
<p>I have to admit that some data leaks are well buried in the process. Take for example Garmin, which produces GPS training devices for runners. In order to see your workouts you are forced to upload sensitive data to the internet. In return you are given a visualization tool and a storage facility. What are the alternatives? It seems that in the past there was a desktop version, however I was not able to find it. So we are left with the last option &#8211; hack it.</p>
<p>First of all you need to transfer the data from the Garmin device to a computer. I own a Forerunner 610, which relies on the <a href="http://en.wikipedia.org/wiki/ANT_(network)">ANT</a> network, and I found a Python <a href="https://github.com/kafka399/Garmin-Forerunner-610-Extractor">script</a> which takes care of the data transfer. Once the data is transferred there is another obstacle &#8211; Garmin uses a proprietary format, <a href="http://www.thisisant.com/pages/developer-zone/fit-sdk">FIT</a>. To tackle this problem I use another <a href="https://github.com/dtcooper/python-fitparse">Python script</a>, which I have adapted to output <a href="https://github.com/kafka399/fitparse/blob/master/run.py">CSV format</a>.</p>
<p>Once the data is in CSV format, R can be used to plot it.</p>
<div class="figure"><img src="http://dl.dropbox.com/u/6360678/blog/garmin_1.png" alt="" /></div>
<div class="figure"><img src="http://dl.dropbox.com/u/6360678/blog/garmin_2.png" alt="" /></div>
<p>I had a lot of fun trying to understand Garmin's longitude and latitude coordinates. Here is a short explanation by Hal Mueller:</p>
<blockquote><p>The mapping Garmin uses (180 degrees to 2^31 semicircles) allows them to use a standard 32 bit unsigned integer to represent the full 360 degrees of longitude. Thus you get the maximum precision that 32 bits allows you (about double what you'd get from a floating point value), and they still get to use integer arithmetic instead of floating point.</p></blockquote>
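<p>The mapping in the quote boils down to one multiplication. Below is a minimal sketch in R; the function name <code>semicircles_to_degrees</code> and the sample values are made up for illustration, and it assumes that western and southern positions arrive as negative numbers.</p>
<pre><code># 2^31 semicircles correspond to 180 degrees
semicircles_to_degrees = function(semicircles) {
  semicircles * (180 / 2^31)
}

# Hypothetical raw values, not taken from a real FIT file
semicircles_to_degrees(2^30)    # 90
semicircles_to_degrees(-2^29)   # -45</code></pre>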
<div class="figure"><img src="http://dl.dropbox.com/u/6360678/blog/garmin_map.png" alt="" />
<p class="caption"><a href="https://github.com/kafka399/fitparse">Source code</a></p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.investuotojas.eu/2012/10/04/garmin-data-visualization/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://www.investuotojas.eu/2012/10/04/garmin-data-visualization/</feedburner:origLink></item>
		<item>
		<title>RStudio server through ssh</title>
		<link>http://feedproxy.google.com/~r/investuotojas/~3/6UF1zzzvF_A/</link>
		<comments>http://www.investuotojas.eu/2012/08/10/rstudio-server-through-ssh/#comments</comments>
		<pubDate>Fri, 10 Aug 2012 15:14:06 +0000</pubDate>
		<dc:creator>Dzidorius Martinaitis</dc:creator>
				<category><![CDATA[EN]]></category>
		<category><![CDATA[R-language]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[rstudio]]></category>
		<category><![CDATA[ssh]]></category>

		<guid isPermaLink="false">http://www.investuotojas.eu/?p=960</guid>
		<description><![CDATA[R has numerous IDEs: RStudio, a Vim plugin, an Eclipse plugin. RStudio really shines for R; nevertheless, the Vim plugin might be well adapted for R if you are a Vim guru. Eclipse? Who needs such a behemoth? It turns out a student in Ljubljana badly needs it. Most of the time I use a remote server for R [...]]]></description>
			<content:encoded><![CDATA[<p>R has numerous IDEs: <a href="http://rstudio.org/">RStudio</a>, a <a href="http://www.vim.org/scripts/script.php?script_id=2628">Vim plugin</a>, an Eclipse plugin. RStudio really shines for R; nevertheless, the Vim plugin might be well adapted for R if you are a Vim guru. Eclipse? Who needs such a behemoth? It turns out a student in Ljubljana <a href="http://danganothererror.wordpress.com/2012/08/09/show-me-yours-and-ill-show-you-mine/">badly needs it</a>.</p>
<p>Most of the time I use a remote server for R-related tasks, which I access through ssh. To get a graphical UI and RStudio I use the <a href="http://www.nomachine.com/">NoMachine</a> remote desktop client over ssh. I could use Vim, however I never managed to set it up my way. The cool thing about RStudio is that it comes in two flavours: desktop and server. The server version allows you to access R through a web interface, meaning that you can use it remotely. Despite that, I was afraid to open any port other than ssh on the remote machine for security reasons. The traffic between the client and RStudio server is encrypted, however that does not eliminate vulnerabilities in RStudio itself. One way to get around this problem is to use a <a href="http://en.wikipedia.org/wiki/Virtual_private_network">Virtual Private Network</a>, but installing a VPN is not that trivial.</p>
<p>During the holidays, when my brain was half empty, I started to wonder whether I really need a VPN to access RStudio server in a secure way. Well, not at all! All you have to do is run a <a href="http://en.wikipedia.org/wiki/SOCKS">SOCKS proxy</a> on the local machine, which is very easy on Linux/Mac:</p>
<pre><code>ssh -D 2001 username@mysecureserver.com </code></pre>
<p>Keep the connection open, then open a web browser and set it up to send all traffic through the proxy on the local machine, port 2001. Here is an example for Firefox:</p>
<ul>
<li>open <a href="about:config">about:config</a></li>
<li>change the value of <code>network.proxy.type</code> to <code>1</code> (manual proxy configuration), otherwise the SOCKS settings below are ignored</li>
<li>change the value of <code>network.proxy.socks</code> to <code>127.0.0.1</code></li>
<li>change the value of <code>network.proxy.socks_port</code> to <code>2001</code></li>
<li>change the value of <code>network.proxy.socks_version</code> to <code>4</code></li>
<li>open the internal address of the remote server, something like <code>http://192.168.0.120:8787/</code></li>
<li>enjoy R and data mining</li>
</ul>
<p>If you need a user-friendly description of the SOCKS proxy setup, refer to this <a href="http://www.mikeash.com/ssh_socks.html">post</a>. More information about RStudio server can be found <a href="http://www.rstudio.org/docs/server/getting_started">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investuotojas.eu/2012/08/10/rstudio-server-through-ssh/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.investuotojas.eu/2012/08/10/rstudio-server-through-ssh/</feedburner:origLink></item>
		<item>
		<title>Building a presentation, report or paper in R</title>
		<link>http://feedproxy.google.com/~r/investuotojas/~3/qk4iMYMCGg4/</link>
		<comments>http://www.investuotojas.eu/2012/08/01/building-a-presentation-report-or-paper-in-r/#comments</comments>
		<pubDate>Wed, 01 Aug 2012 10:40:35 +0000</pubDate>
		<dc:creator>Dzidorius Martinaitis</dc:creator>
				<category><![CDATA[data analysis]]></category>
		<category><![CDATA[EN]]></category>
		<category><![CDATA[quant]]></category>
		<category><![CDATA[R-language]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[markdown]]></category>
		<category><![CDATA[mining]]></category>
		<category><![CDATA[presentation]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.investuotojas.eu/?p=943</guid>
		<description><![CDATA[If you need to build a presentation, you obviously have the following options: PowerPoint-like presentations, online engines, LaTeX. The first two are beloved by business people and the third is widely used in academia. The objective of the first group is a shiny presentation, in contrast to the second group, where asceticism and demand for automation are [...]]]></description>
			<content:encoded><![CDATA[<p>If you need to build a presentation, you obviously have the following options:</p>
<ul>
<li>PowerPoint-like presentations</li>
<li>Online engines</li>
<li><a href="http://www.latex-project.org/">LaTeX</a></li>
</ul>
<p>The first two are beloved by business people and the third is widely used in academia. The objective of the first group is a shiny presentation, in contrast to the second group, where asceticism and demand for automation are top priorities. However, if you are a data scientist or any other data specialist who needs to build an automated report, then you know that LaTeX is just wrong.<br />LaTeX allows you to build a shiny presentation or an outstanding paper, however for beginners it can take ages to build something useful. If you have never tried LaTeX, here is an example of the monster &#8211; you literally have to <strong>code</strong> a document or presentation:</p>
<pre><code>\documentclass{article}
\title {Investment strategy}
\author {Dzidorius Martinaitis}
\begin{document}
\maketitle
\end{document}</code></pre>
<p>So what do you do if you need only 1% of all LaTeX features and a report or document needs to be built automatically? It turns out that HTML's little brother <a href="http://en.wikipedia.org/wiki/Markdown">Markdown</a> is saving the world. Markdown (.md) source files are easy to read and easy to write, and you can convert them into .html, .pdf, .docx, .tex and many other formats. There are many ways to do the conversion, however I use the <a href="http://johnmacfarlane.net/pandoc/">Pandoc</a> utility. By the way, this post was written in Markdown in <a href="http://www.vim.org/about.php">Vim</a> and you can check the <a href="https://github.com/kafka399/haxogreen.lu">source file</a>.</p>
<p>However, the nicest thing about Markdown is its integration with R. You can build your report in one file, where R code is embedded in Markdown. The <a href="http://yihui.name/knitr/">knitr</a> package runs the embedded R code and writes its results into plain Markdown; you simply call this piece of code:</p>
<pre><code>require(knitr);
knit(&#39;workshop.Rmd&#39;, &#39;workshop.md&#39;);</code></pre>
<p>Below you will find an excerpt of an .Rmd file, which is a mix of R and Markdown:</p>
<pre><code>Get the data
===

Who is tweeting about #Haxogreen

```{r results=&#39;asis&#39;,comment=NA, message=FALSE}
require(twitteR)
load(&#39;tweets.Rdata&#39;)
names=sapply(tweets,function(x)x$screenName)
rez=(aggregate(names,list(factor(names)),length))
rez=rez[order(rez$x),]
colnames(rez)=c(&#39;name&#39;,&#39;count&#39;)
options(xtable.type = &#39;html&#39;)
require(xtable)
xtable(t(tail(rez,6)))
```

Plot top10 tweeters
===
```{r topspam, figure=TRUE,fig.cap=&#39;&#39;}
barplot(tail(rez$count,10),names.arg=as.character(tail(rez$name,10)),cex.names=.7,las=2)
```</code></pre>
<p><a href="http://dl.dropbox.com/u/6360678/workshop.html">Here is a workshop presentation</a> which contains the example above &#8211; I built it for the <a href="http://www.haxogreen.lu">Haxogreen</a> hackers camp, and the source code can be found on <a href="https://github.com/kafka399/haxogreen.lu/blob/master/workshop.Rmd">GitHub</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investuotojas.eu/2012/08/01/building-a-presentation-report-or-paper-in-r/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://www.investuotojas.eu/2012/08/01/building-a-presentation-report-or-paper-in-r/</feedburner:origLink></item>
		<item>
		<title>How to track Twitter unfollowers in R</title>
		<link>http://feedproxy.google.com/~r/investuotojas/~3/mE9aS1nE4fs/</link>
		<comments>http://www.investuotojas.eu/2012/07/18/how-to-track-twitter-unfollowers-in-r/#comments</comments>
		<pubDate>Wed, 18 Jul 2012 13:58:29 +0000</pubDate>
		<dc:creator>Dzidorius Martinaitis</dc:creator>
				<category><![CDATA[data analysis]]></category>
		<category><![CDATA[EN]]></category>
		<category><![CDATA[R-language]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[mining]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.investuotojas.eu/?p=931</guid>
		<description><![CDATA[I have a Twitter account and it is relatively easy to see new followers or subscribers. However, I was looking for a way to find out who the unfollowers are. I have noticed that some (un)subscriptions happen in bulk, which made me think that either I tweeted some bullshit and upset a bunch of people, or spam bots work [...]]]></description>
			<content:encoded><![CDATA[<p>I have a Twitter <a href="https://twitter.com/dzidorius">account</a> and it is relatively easy to see new followers or subscribers. However, I was looking for a way to find out who the unfollowers are. I have noticed that some (un)subscriptions happen in bulk, which made me think that either I tweeted some bullshit and upset a bunch of people, or spam bots work in sync. With that in mind I created a simple R script which produces Markdown and HTML reports about unfollowers. You can find an example of such a report below.</p>
<p>The advantage of this script is that it does not require you to sign in or share your data. You just have to install the twitteR package and you are ready to go. Nevertheless, if you want to create a Markdown report, you need to install the markdown and knitr packages as well.</p>
<p>The source code and build scripts can be found <a href="https://github.com/kafka399/twittrack">here</a>.</p>

<div class="wp_codebox"><table><tr id="p9312"><td class="code" id="p931code2"><pre class="splus" style="font-family:monospace;">```{r echo=FALSE,message=FALSE}
require(twitteR)
setwd('~/git/twitTracker/')
usr=getUser(&quot;dzidorius&quot;)
## screen names of the current followers
tmp=sapply(usr$getFollowers(),function(x)x$screenName)

if(!file.exists('users.csv'))
{
  ## when file doesn't exist - take users list and add some artificial user
  write.table(c(as.character(tmp),'dzidorius'),'users.csv')
}

## users saved on the previous run who are no longer among the followers
old_list=as.character(read.table('users.csv')$x)
users=lookupUsers(old_list[which(!(old_list %in% as.character(tmp)))])
if(length(users)==0)
{
  ## stop() doesn't work under knitr
  cat('no one left you')
}
```


```{r comment=NA,echo=FALSE,message=FALSE,results='asis'}
## seq_along() skips the loop when nobody left; 1:length(users) would run over 1 and 0
for(i in seq_along(users))
{
cat(paste(&quot;**&quot;,users[[i]]$name,&quot; @&quot;,users[[i]]$screenName,&quot;**&quot;, &quot;\n===\n&quot;,sep=&quot;&quot;))
cat(paste(&quot;![](https://api.twitter.com/1/users/profile_image?screen_name=&quot;,users[[i]]$screenName,
          &quot;&amp;size=bigger)&quot;,sep=''))
cat(paste(&quot;  \n**Created:** &quot;,users[[i]]$created,
          &quot;  \n**Spam rate:** &quot;,round(users[[i]]$followersCount/users[[i]]$friendsCount,digits=2),
          &quot;  \n**Activity:** &quot; , users[[i]]$statusesCount,
          &quot;  \n**Location:** &quot;, users[[i]]$location,&quot;  \n&quot;,users[[i]]$description,&quot;  \n&quot;,
          &quot;**Last status:** &quot;,(users[[i]]$lastStatus$text),&quot;\n\n&quot;,sep=&quot;&quot;))
}
```

```{r comment=NA,echo=FALSE,message=FALSE,results='asis'}
## save the current followers for the next run
write.table(as.character(tmp),'users.csv')
```</pre></td></tr></table></div>

<h1 id="dzidas-dzidorius"><strong>Dzidas @dzidorius</strong></h1>
<p><img src="https://api.twitter.com/1/users/profile_image?screen_name=dzidorius&amp;size=bigger" alt="" /><br />
<strong>Created:</strong> 2010-02-18 13:09:29<br />
<strong>Spam rate:</strong> 0.98<br />
<strong>Activity:</strong> 315<br />
<strong>Location:</strong> Luxembourg<br />
Java, C++ and R developer &amp; data junkie<br />
<strong>Last status:</strong> It is rainy summer in #Luxembourg but there is a party! http://t.co/FZ7eZq6u</p>
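<p>The core of the script is a set difference between the follower list saved on the previous run and the current one. Here is a minimal sketch of that step alone, with made-up screen names and no Twitter access; the variable names <code>current</code>, <code>unfollowers</code> and <code>new_followers</code> are only for illustration.</p>
<pre><code># Hypothetical follower lists from two consecutive runs
old_list = c("alice", "bob", "carol", "dave")
current  = c("alice", "carol", "erin")

# Present on the previous run but missing now: the unfollowers
unfollowers = setdiff(old_list, current)

# Present now but not before: new followers
new_followers = setdiff(current, old_list)

print(unfollowers)
print(new_followers)</code></pre>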
]]></content:encoded>
			<wfw:commentRss>http://www.investuotojas.eu/2012/07/18/how-to-track-twitter-unfollowers-in-r/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.investuotojas.eu/2012/07/18/how-to-track-twitter-unfollowers-in-r/</feedburner:origLink></item>
		<item>
		<title>Data mining for network security and intrusion detection</title>
		<link>http://feedproxy.google.com/~r/investuotojas/~3/rFROQhRzqko/</link>
		<comments>http://www.investuotojas.eu/2012/07/16/data-mining-for-network-security-and-intrusion-detection/#comments</comments>
		<pubDate>Mon, 16 Jul 2012 21:56:00 +0000</pubDate>
		<dc:creator>Dzidorius Martinaitis</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[data analysis]]></category>
		<category><![CDATA[EN]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[R-language]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[IDS]]></category>
		<category><![CDATA[mining]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[quantitative]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://www.investuotojas.eu/?p=909</guid>
		<description><![CDATA[In preparation for the &#8220;Haxogreen&#8221; hackers summer camp, which takes place in Luxembourg, I was exploring the world of network security. My motivation was to find out how data mining is applicable to network security and intrusion detection. Flame virus, Stuxnet, Duqu proved that static, signature based security systems are not able to detect very advanced, government sponsored [...]]]></description>
			<content:encoded><![CDATA[<p>In preparation for the <a href="http://www.haxogreen.lu">&#8220;Haxogreen&#8221; hackers summer camp</a>, which takes place in Luxembourg, I was exploring the world of network security. My motivation was to find out how data mining is applicable to network security and intrusion detection.</p>
<p><a href="http://en.wikipedia.org/wiki/Flame_(malware)">Flame virus</a>, <a href="http://en.wikipedia.org/wiki/Stuxnet">Stuxnet</a> and <a href="http://en.wikipedia.org/wiki/Duqu">Duqu</a> proved that static, signature-based security systems are not able to detect very advanced, government-sponsored threats. Nevertheless, signature-based defense systems are mainstream today &#8211; think of antivirus and intrusion detection systems. What do you do when the unknown is unknown? Data mining comes to mind as the answer.</p>
<p>Data mining is or can be employed in the following areas: misuse/signature detection, anomaly detection, scan detection, etc.</p>
<p>Misuse/signature detection systems are based on supervised learning. During the learning phase, labeled examples of network packets or system calls are provided, from which the algorithm can learn about the threats. This is a very efficient and fast way to find known threats. Nevertheless, there are some important drawbacks, namely false positives, novel attacks and the difficulty of obtaining initial data for training the system.<br />
False positives happen when normal network flow or system calls are marked as a threat. For example, a user can fail to provide the correct password three times in a row, or start using a service in a way that deviates from the standard profile. A novel attack can be defined as an attack not seen by the system before, meaning that the signature or pattern of such an attack has not been learned and the system will be penetrated without the knowledge of the administrator. The last obstacle (the training dataset) can be overcome by collecting data over time or relying on public data, such as the <a href="http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/index.html">DARPA Intrusion Detection Data Set</a>.<br />
Although misuse detection can be built with your own data mining techniques, I would suggest a well-known product like <a href="http://www.snort.org/">Snort</a>, which relies on crowd-sourcing.</p>
<p>Anomaly/outlier detection systems look for deviations from normal or established patterns within given data. In the case of network security, a threat is expected to show up as an anomaly. Below you can find a two-feature graph, where the number of logins is plotted on the x axis and the number of queries on the y axis. The color indicates the group to which points are assigned &#8211; blue ones are normal, red ones are anomalies.</p>
<p><a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=start_dec_anomalies.png"><img src="http://i176.photobucket.com/albums/w180/investuotojas/start_dec_anomalies.png" alt="alt text" /></a></p>
<p>Anomaly detection systems constantly evolve &#8211; what was the norm a year ago can be an anomaly today. The algorithm compares the network flow with the historical flow over a given period and looks for outliers which are far away from it. Such a dynamic approach makes it possible to detect novel attacks, nevertheless it generates false positive alerts (marks normal flow as suspicious). Moreover, hackers can mimic a normal profile if they know that such a system is deployed.</p>
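<p>One simple way to score how far an observation lies from the historical data is the mean distance to its k nearest neighbours. This is only a sketch of that idea, not necessarily what a deployed system would use; the simulated history, the single injected extreme point and the choice of k = 5 are all assumptions made for illustration.</p>
<pre><code>set.seed(1)

# Hypothetical history: 50 normal observations of (logins, queries)
# followed by one extreme observation in the last row
flows = rbind(cbind(logins  = rnorm(50, mean = 10, sd = 2),
                    queries = rnorm(50, mean = 100, sd = 10)),
              c(40, 400))

# Scale the features so that neither dominates the distance
scaled = scale(flows)
d = as.matrix(dist(scaled))

# Outlier score: mean distance to the k nearest neighbours
# (the first element of each sorted row is the point itself, distance 0)
k = 5
score = apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))

# Row index of the most anomalous observation
most_anomalous = which.max(score)
print(most_anomalous)</code></pre>
<p>In this toy data the injected last row gets the highest score. On real traffic a threshold on the score has to be chosen, and that choice drives the false positive rate mentioned above.</p>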
<p>The first task when implementing anomaly detection (AD) is data collection. If AD is going to be network based, there are two ways to collect aggregated data from the network. Some Cisco products provide aggregated data via the <a href="http://www.cisco.com/en/US/products/ps6601/products_ios_protocol_group_home.html">NetFlow</a> protocol. Alternatively, you can use <a href="http://www.wireshark.org">Wireshark or tshark</a> to collect network flow data from the computer. For example:</p>
<blockquote><p>tshark -T fields -E separator=, -E quote=d -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport -e udp.srcport -e udp.dstport -e tcp.len -e ip.len -e eth.type -e frame.time_epoch -e frame.len</p></blockquote>
<p>Once you have enough data for the mining process, you need to preprocess it. In the context of intrusion, anomalous actions happen in bursts rather than as single events. <a title="Data Mining for Cyber Security" href="http://minds.cs.umn.edu/publications/chapter.pdf">Varun Chandola et al.</a> proposed deriving the following features:</p>
<ul class="incremental">
<li>
<dl class="incremental">
<dt>Time window based:</dt>
<dd>Number of flows to unique destination IP addresses inside the network in the last T seconds from the same source</dd>
<dd>Number of flows from unique source IP addresses inside the network in the last T seconds to the same destination</dd>
<dd>Number of flows from the source IP to the same destination port in the last T seconds</dd>
<dd>Number of flows to the destination IP address using same source port in the last T seconds</dd>
</dl>
</li>
<li>
<dl class="incremental">
<dt>Connection based:</dt>
<dd>Number of flows to unique destination IP addresses inside the network in the last N flows from the same source</dd>
<dd>Number of flows from unique source IP addresses inside the network in the last N flows to the same destination</dd>
<dd>Number of flows from the source IP to the same destination port in the last N flows</dd>
<dd>Number of flows to the destination IP address using same source port in the last N flows</dd>
</dl>
</li>
</ul>
<p>Below you can find an example of feature creation in R. The dataset was created by calling the tshark command specified above.</p>
<blockquote><p>library(plyr) # provides ddply<br />
#load data<br />
tmp=read.csv('stats2.csv',colClasses=c(rep('character',11)),header=F)<br />
#truncate timestamps to whole minutes<br />
tmp[,10]=as.integer(as.POSIXct(format(as.POSIXct(as.integer(tmp[,10]),origin='1970-01-01'),'%Y-%m-%d %H:%M:00')))<br />
#fix some rows: drop overlong source addresses and rows without a TCP destination port<br />
bad=which(nchar(tmp[,1])&gt;15)<br />
if(length(bad)&gt;0) tmp=tmp[-bad,]<br />
tmp=tmp[which(!is.na(tmp[,4])),]</p>
<p>#aggregate data by 5 mins. it assumes, that flow is continuous<br />
tmp=tmp[1:min(5000,nrow(tmp)),]<br />
minute=as.factor(tmp[,10])<br />
lv=levels(minute)</p>
<p>#for every block of 5 consecutive minutes count the flows per source IP (V1) and destination port (V4)<br />
feature=do.call(rbind, lapply(seq(from=1,to=length(lv),by=5),function(x){ ddply( subset(tmp,minute %in% lv[x:min(x+4,length(lv))]),.(V1,V4),summarize,times=length(V11) ) }))</p></blockquote>
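<p>The first time-window feature from the list above (number of flows to unique destination IP addresses from the same source) can also be sketched in base R, without plyr. The flow records below are made up, the column names are assumptions, and fixed 60-second windows stand in for a sliding window over "the last T seconds".</p>
<pre><code># Hypothetical flow records: epoch time, source IP, destination IP
flows = data.frame(
  time = c(0, 10, 20, 70, 80, 90),
  src  = c("10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2"),
  dst  = c("10.0.0.5", "10.0.0.6", "10.0.0.5", "10.0.0.7", "10.0.0.5", "10.0.0.5"),
  stringsAsFactors = FALSE
)

# Window length in seconds (named window_s because T is shorthand for TRUE in R)
window_s = 60
flows$window = floor(flows$time / window_s) * window_s

# Number of unique destination IPs per source within each window
feature = aggregate(dst ~ src + window, data = flows,
                    FUN = function(d) length(unique(d)))
colnames(feature)[3] = "unique_dst"
print(feature)</code></pre>
<p>A source that suddenly contacts many unique destinations within one window, as a scanner would, stands out in the <code>unique_dst</code> column.</p>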
<p>After preprocessing the data we can apply local outlier detection, kNN, random forest and other algorithms. I will provide R code and a practical implementation of some algorithms in a following post.</p>
<p>While preparing this post I looked for books and found only a few covering both data mining and network security. To my surprise, the book <a href="http://www.amazon.com/gp/product/1439839425/ref=as_li_qf_sp_asin_tl?ie=UTF8&amp;tag=quantitativ0e-20&amp;linkCode=as2&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1439839425">Data Mining and Machine Learning in Cybersecurity</a> covers both topics and is well written. However, if you are a security specialist looking for data mining books, you can read my <a href="http://www.investuotojas.eu/2012/07/02/my-first-competition-at-kaggle/">summary</a> of <a href="http://www.amazon.com/gp/product/0123748569/ref=as_li_tf_tl?ie=UTF8&amp;tag=quantitativ0e-20&amp;linkCode=as2&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0123748569" title="Data Mining: Practical Machine Learning Tools and Techniques">&#8220;Data Mining: Practical Machine Learning Tools and Techniques&#8221;</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investuotojas.eu/2012/07/16/data-mining-for-network-security-and-intrusion-detection/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://www.investuotojas.eu/2012/07/16/data-mining-for-network-security-and-intrusion-detection/</feedburner:origLink></item>
		<item>
		<title>My first competition at Kaggle</title>
		<link>http://feedproxy.google.com/~r/investuotojas/~3/f1kH318aOzQ/</link>
		<comments>http://www.investuotojas.eu/2012/07/02/my-first-competition-at-kaggle/#comments</comments>
		<pubDate>Mon, 02 Jul 2012 15:41:08 +0000</pubDate>
		<dc:creator>Dzidorius Martinaitis</dc:creator>
				<category><![CDATA[books]]></category>
		<category><![CDATA[data analysis]]></category>
		<category><![CDATA[EN]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[quant]]></category>
		<category><![CDATA[R-language]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[kaggle]]></category>
		<category><![CDATA[mining]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.investuotojas.eu/?p=887</guid>
		<description><![CDATA[For me Kaggle has become a social network for data scientists, as stackoverflow.com or github.com is for programmers. If you are a data scientist, machine learner or statistician, you are better off having a profile there, otherwise you do not exist. Nevertheless, I won&#8217;t bet on a rosy future for data scientists, as journalists suggest (sexy job for the next [...]]]></description>
			<content:encoded><![CDATA[<p>For me <a href="http://www.kaggle.com/" target="_blank">Kaggle</a> has become a social network for data scientists, as <a href="http://stackoverflow.com/questions/tagged/r" target="_blank">stackoverflow.com</a> or <a href="https://github.com" target="_blank">github.com</a> is for programmers. If you are a data scientist, machine learner or statistician, you are better off having a profile there, otherwise you do not exist.</p>
<p>Nevertheless, I won&#8217;t bet on a rosy future for data scientists, as journalists suggest (<a href="http://www.guardian.co.uk/news/datablog/2012/mar/02/data-scientist" target="_blank">sexy job for next decade</a>). For sure, the demand for such specialists is on the rise. However, I see one big threat to data scientists &#8211; Kaggle and similar service providers. You see, such services allow companies to tap high-end data scientists (think of PhDs in hard sciences) at a minuscule fraction of the real price. Think of the Hollywood business model &#8211; top players get the majority of the pool and the rest starve. If you try the same service model on IT projects you will most likely get burned. My reasoning can be wrong, but I suspect that project timespan is the issue &#8211; IT projects can take a while to finish (1-10 years), but a mainstream ML project won&#8217;t take that long.</p>
<p>Notwithstanding these obstacles, knowledge of machine learning, information retrieval, data mining, etc. is a must, together with the ability to write code for production, deal with streaming big data and cope with the performance of an intelligent system. Then, in programmers&#8217; parlance, you will become a &#8220;data scientist ninja&#8221; and every company will die for you. There is a good post on the subject on the <a href="http://blog.mikiobraun.de/2012/06/is-machine-learning-losing-impact.html" target="_blank">mikiobraun</a> blog, but mind you, it is a bit controversial.</p>
<p>Although for the last 4 years I have often been working on financial models and time series, this competition gave me new experience and a hunger for knowledge. During the competition I found this book very practical and full of ideas about what to do next: <a href="http://www.amazon.com/gp/product/0123748569/ref=as_li_tf_tl?ie=UTF8&amp;tag=quantitativ0e-20&amp;linkCode=as2&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0123748569">Data Mining: Practical Machine Learning Tools and Techniques</a>. As a complementary book I used <a href="http://www.amazon.com/gp/product/0123814790/ref=as_li_tf_tl?ie=UTF8&amp;tag=quantitativ0e-20&amp;linkCode=as2&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0123814790">Data Mining: Concepts and Techniques</a><img style="border: none !important; margin: 0px !important;" src="http://www.assoc-amazon.com/e/ir?t=quantitativ0e-20&amp;l=as2&amp;o=1&amp;a=0123814790" alt="" width="1" height="1" border="0" />, though most of the information can be found in either of them. I will try to summarize some chapters in my own story.</p>
<p><strong>Understanding the data</strong>. The metadata (data about data) of the &#8220;<a href="http://www.kaggle.com/c/online-sales" target="_blank">Online Product Sales</a>&#8221; competition is miserly &#8211; there are three types of input fields (date, categorical and quantitative) plus the response data for the next 12 months. However, metadata is the most important element in all ML projects: understanding it can save you a lot of time, and it leads to a much better forecast if you have &#8220;domain knowledge&#8221;.</p>
<p><strong>Cleaning data.</strong> There is a famous phrase, &#8220;garbage in, garbage out&#8221;, meaning that before any further action you have to detect and fix incorrect, incomplete or missing data. You have many possibilities for dealing with missing data &#8211; remove all rows where data is missing, or replace the missing values with the mean, a regressed value, the nearest value, etc. If your data is plentiful and the missing values are random (meaning that NA values do not bear any information) &#8211; just get rid of them. Otherwise you need to impute new values based on the mean or another technique. Mean-based replacement worked best for me in this competition.<br />
Outliers are another type of trouble. Suppose that a variable is normally distributed, but a few values are far away from the center. The easiest solution would be to remove such values &#8211; as many do in finance by removing the &#8220;crisis period&#8221;. When the next crisis hits, journalists rush to learn a new buzzword &#8211; <a href="http://en.wikipedia.org/wiki/Black_swan_theory" target="_blank">black swan</a>. It turns out that outliers can&#8217;t be ignored, because their impact is huge. So be cautious when dealing with outliers.</p>
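<p>As a rough illustration of the two options for missing values, here is a minimal base R sketch. The data frame and its column names are made up for the example and are not taken from the competition data.</p>
<pre class="rsplus" style="font-family:monospace;"># toy data frame with missing values (made-up numbers)
sales = data.frame(price = c(10, NA, 30, 20), units = c(1, 2, NA, 5))

# replace every NA in a numeric column with the mean of the observed values
impute_mean = function(x) {
  x[is.na(x)] = mean(x, na.rm = TRUE)
  x
}
imputed = as.data.frame(lapply(sales, impute_mean))

# alternative for plentiful data with random gaps: drop incomplete rows
complete = sales[complete.cases(sales), ]</pre>
<p>Which of the two is better depends on how much data you can afford to lose and on whether the gaps carry information.</p>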
<p><strong>Feature selection</strong>. It was surprising to me that too many features or variables can pollute the forecast, therefore you need to do feature selection. Such a task can be done manually by checking the correlation matrix, covariance, etc. However, random forest or generalized boosted methods, which rank the variables by importance, can lead to a better selection. In R you just call randomForest() or gbm() and the job is done.</p>
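<p>The manual route can be sketched in a few lines of base R. This is only an illustration on synthetic data: the 0.9 cutoff and the feature names are my assumptions, and it is a plain correlation filter, not the random forest or gbm based selection mentioned above.</p>
<pre class="rsplus" style="font-family:monospace;">set.seed(1)
n = 200
x1 = rnorm(n)
features = data.frame(x1 = x1,
                      x2 = x1 + rnorm(n, sd = 0.01),  # near-duplicate of x1
                      x3 = rnorm(n))                   # independent feature

# absolute pairwise correlations; keep only the upper triangle
cm = abs(cor(features))
cm[lower.tri(cm, diag = TRUE)] = 0

# drop every column that is highly correlated with an earlier column
cutoff = 0.9   # arbitrary threshold chosen for this sketch
redundant = colnames(cm)[apply(cm, 2, max) >= cutoff]
selected = features[, setdiff(colnames(features), redundant), drop = FALSE]</pre>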
<p><strong>Variable transformation</strong> - a way to get superior performance. The &#8220;Online Product Sales&#8221; competition has two date fields, however these fields are encoded as integers. Transforming these variables into dates and extracting the year and month led to better performance of the model. In most cases taking the logarithm of numeric fields gives a performance boost. Scaling (from 0 to 1 or from -1 to 1) and standardizing (zero mean and unit variance) might be considered when linear models are in use. It is worth transforming categorical variables into indicator variables as well, where 1 means that an observation belongs to the group and 0 otherwise. Check the model.matrix function in R for the latter transformation and the preProcess function in the caret package for numerical variables.</p>
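<p>A minimal base R sketch of these transformations follows. The data is invented, and the assumption that the integer encodes days since 1970-01-01 is mine; check how the dates are really encoded in your own data.</p>
<pre class="rsplus" style="font-family:monospace;"># toy data; the integer date encoding (days since 1970-01-01) is an assumption
d = data.frame(launch = c(15000L, 15400L, 15800L),
               views  = c(10, 1000, 100000),
               cat    = factor(c('a', 'b', 'a')))

# integer to Date, then pull out year and month as separate features
launch_date = as.Date(d$launch, origin = '1970-01-01')
d$year  = as.integer(format(launch_date, '%Y'))
d$month = as.integer(format(launch_date, '%m'))

# log transform of a skewed numeric field; log1p() also copes with zeros
d$log_views = log1p(d$views)

# one 0/1 indicator column per category level (no intercept column)
dummies = model.matrix(~ cat - 1, data = d)</pre>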
<p><strong>Validation stage </strong>- helps you to measure the performance of the model. If you have a huge database to build a model, you can divide your set into two or three parts &#8211; for training, testing and cross validation &#8211; and you are ready to start. However, if you are not so lucky, then other methods come into play. The most popular method is to divide the set into 10 parts, train on 9 of them, test on the remaining one and rotate the roles 10 times. For example, if you have 100 rows, you take the first 90 for training and the last 10 for test and you measure the performance. In the next step you hold out a different block of 10 rows for test and train on the other 90. Once you have repeated the procedure 10 times, every row has been used for testing exactly once; you have 10 performance measures and you take their average.<br />
<a href="http://en.wikipedia.org/wiki/Stratified_sampling" target="_blank">Stratified sampling</a> is a term which you should know when you do sampling.<br />
Even keeping all this information in mind, I wasn&#8217;t able to implement accurate cross validation, and my results differed within a 0.05 range.</p>
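<p>For reference, a 10-fold cross validation loop can be sketched in base R as below. The data is synthetic and lm() with RMSE is used only as a stand-in model and metric; it is not the setup I used in the competition.</p>
<pre class="rsplus" style="font-family:monospace;">set.seed(42)
n = 100
d = data.frame(x = runif(n))
d$y = 2 * d$x + rnorm(n, sd = 0.1)   # synthetic data for illustration

k = 10
fold = sample(rep(1:k, length.out = n))   # random fold label for every row

# train on k-1 folds, measure RMSE on the held-out fold, repeat k times
rmse = sapply(1:k, function(i) {
  fit  = lm(y ~ x, data = d[fold != i, ])
  pred = predict(fit, newdata = d[fold == i, ])
  sqrt(mean((d$y[fold == i] - pred)^2))
})
cv_score = mean(rmse)   # average performance over the 10 folds</pre>
<p>Assigning the folds at random, instead of taking consecutive blocks of rows, avoids surprises when the rows are sorted in some way.</p>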
<p><strong>Model selection and ensemble. </strong>Intuitively you want to choose the best performing algorithm, however a mix of them can lead to superior performance. For the regression problem I trained four models (two random forest versions, gbm, svm), made the predictions and averaged the results, which led to a better prediction.</p>
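<p>The averaging step itself is very simple. The sketch below uses synthetic data and two base R learners (a cubic polynomial lm() and loess()) purely as stand-ins for the random forest, gbm and svm models; the point is only how the predictions are combined.</p>
<pre class="rsplus" style="font-family:monospace;">set.seed(7)
n = 300
d = data.frame(x = runif(n, 0, 3))
d$y = sin(2 * d$x) + rnorm(n, sd = 0.3)   # synthetic regression problem
train = d[1:200, ]
test  = d[201:300, ]

# two different base R learners stand in for the models named in the text
m1 = lm(y ~ poly(x, 3), data = train)
m2 = loess(y ~ x, data = train, control = loess.control(surface = 'direct'))

p1 = predict(m1, newdata = test)
p2 = predict(m2, newdata = test)
ensemble = (p1 + p2) / 2          # plain average of the predictions

rmse = function(p) sqrt(mean((test$y - p)^2))
scores = c(poly = rmse(p1), loess = rmse(p2), ensemble = rmse(ensemble))</pre>
<p>The averaged prediction is never worse than the worst single model in terms of RMSE, and with models that make different kinds of errors it is often better than each of them.</p>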
]]></content:encoded>
			<wfw:commentRss>http://www.investuotojas.eu/2012/07/02/my-first-competition-at-kaggle/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		<feedburner:origLink>http://www.investuotojas.eu/2012/07/02/my-first-competition-at-kaggle/</feedburner:origLink></item>
		<item>
		<title>GitHub data analysis</title>
		<link>http://feedproxy.google.com/~r/investuotojas/~3/uKTQpSKJeGM/</link>
		<comments>http://www.investuotojas.eu/2012/05/15/github-data-analysis/#comments</comments>
		<pubDate>Tue, 15 May 2012 10:48:49 +0000</pubDate>
		<dc:creator>Dzidorius Martinaitis</dc:creator>
				<category><![CDATA[data analysis]]></category>
		<category><![CDATA[EN]]></category>
		<category><![CDATA[R-language]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[bash]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[quantitative]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.investuotojas.eu/?p=846</guid>
		<description><![CDATA[A few weeks ago GitHub announced that its timeline data is available on bigquery for analysis. Moreover, it offers prizes for the best visualization of the data. Despite my limited art skills and minimal chances of winning a beauty contest, I decided to crunch the GitHub data and run a data analysis. After an initial trial of the bigquery service, I found it hard [...]]]></description>
			<content:encoded><![CDATA[<p>A few weeks ago GitHub <a href="https://github.com/blog/1112-data-at-github" target="_blank">announced</a> that its timeline data is available on <a href="https://bigquery.cloud.google.com" target="_blank">bigquery</a> for analysis. Moreover, it <a href="https://github.com/blog/1118-the-github-data-challenge" target="_blank">offers prizes</a> for the best visualization of the data. Despite my limited art skills and minimal chances of winning a beauty contest, I decided to crunch the GitHub data and run a data analysis.</p>
<p>After an initial trial of the bigquery service, I found it hard to know what price, if any, I was going to pay for the service. Hence, I pulled the data (6.5 GB) from bigquery onto my machine and used my machine for the rest of the analysis. Bash scripts were used to clean up and extract the necessary data, R for data analysis and visualization, and C++ for text extraction.</p>
<p>The GitHub dataset is one table, where each row consists of information about a repository (e.g. path, date of creation, name, description, programming language, number of forks/watchers, etc.) and an action performed by a user (e.g. username, location, timestamp, etc.).</p>
<p>As a result, we can check how GitHub users&#8217; actions are spread over the day. The X axis on the graph below is labeled with the hours of the day (GMT) and the Y axis represents the median number of actions for each hour. From it we can deduce that the highest load for GitHub can be expected between 15:00 and 17:00 GMT and the lowest between 05:00 and 07:00 GMT. The color of the line indicates how busy the day was, based on quantiles: green for calm days (20% quantile), blue for normal days (50% quantile) and red for busy days (80% quantile). I should mention that autocorrelation (serial correlation) is high (70% for the following hour), which means that busy hours tend to be followed by busy hours and calm hours tend to be followed by calm hours. Moreover, busy days tend to follow busy days.</p>
<p><a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=actions.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/actions.png" alt="Photobucket" border="0" /></a></p>
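<p>The two statistics used for this graph, the median per hour and the correlation with the following hour, can be computed in base R roughly as follows. The hourly counts here are simulated with a daily cycle; they are not the GitHub data, and the numbers are arbitrary.</p>
<pre class="rsplus" style="font-family:monospace;">set.seed(3)
# made-up hourly action counts for 30 days with a daily cycle (not GitHub data)
hours  = rep(0:23, times = 30)
counts = round(1000 + 400 * sin((hours - 9) * pi / 12) + rnorm(length(hours), sd = 50))

# median number of actions for each hour of the day
hourly_median = tapply(counts, hours, median)
busiest = as.integer(names(which.max(hourly_median)))

# lag-1 autocorrelation: how strongly one hour predicts the next
lag1 = acf(counts, lag.max = 1, plot = FALSE)$acf[2]</pre>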
<p>The second graph below shows the median number of actions by weekday. There is no big surprise &#8211; weekends are slower than weekdays; nevertheless, programmers are slightly less productive on Mondays and Fridays.</p>
<p><a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=actions_weekdays.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/actions_weekdays.png" alt="Photobucket" border="0" /></a></p>
<p>The analysis of the creation of new repositories shows that the pattern of busy and calm hours persists over the years. This can be attributed to the fact that the majority of users come from North America and Europe.<br />
Another hypothesis can be drawn from this information: the number of new repositories grows exponentially. However, mind you, the graph below is biased &#8211; most likely, GitHub users update recent projects, and consequently more recent projects appear on the timeline. Even so, the years 2009-2011 show exponential growth.<br />
The X axis of the graph below is labeled with the hour of the day, the Y axis &#8211; with the log of the median number of new repositories.</p>
<p><a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=new_repos.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/new_repos.png" alt="Photobucket" border="0" /></a></p>
<p>The following graph shows the number of forks per project (the X axis, log scale) versus the number of watchers (the Y axis, log scale). As expected, there is a linear correlation between forks and watchers. Even so, there is something interesting about the outliers below the bottom line &#8211; the projects where the number of watchers is low, but the number of forks is high. These are anomalies and are worth checking.</p>
<p><a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=fork_watch.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/fork_watch.png" alt="Photobucket" border="0" /></a></p>
<p>The next thing to do is to look at the repository descriptions. Let&#8217;s group the repositories by programming language and count the most dominant words in the descriptions. The graph below has the C++ word cloud on the left and the Java one on the right. C++ projects are about library, game, simple(?), engine, Arduino. Java is dominated by android, plugin, server, minecraft, spring, maven.</p>
<p><a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=cpp_java.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/cpp_java.png" alt="Photobucket" border="0" /></a><br />
Ruby (left) vs Python (right):<br />
<a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=ruby_python.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/ruby_python.png" alt="Photobucket" border="0" /></a></p>
<p>&#8220;Surprise&#8221;, &#8220;surprise&#8221; &#8211; R projects (left) are largely about data analysis, however the word &#8220;machine&#8221;, which corresponds to machine learning, is very tiny. Shell (right) is dominated by configuration, managing, git(?).</p>
<p><a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=r_bash.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/r_bash.png" alt="Photobucket" border="0" /></a></p>
<p>The GitHub dataset includes a location field. Unfortunately, users can enter whatever they want &#8211; a country, a city &#8211; or leave it empty. Nevertheless, I was able to extract a good chunk of actions where the location field has a meaningful value. The video below shows user activity by country, where dark red corresponds to high activity and light red to minor activity. Only the 30 most active countries are included, the rest are grey.<br />
The same pattern persists over the days &#8211; activity in Asia increases around midnight, Europe wakes up around 8:00 or 9:00, while America starts around 15:00. Who said that hackers and programmers work at night?<br />
<iframe src="http://player.vimeo.com/video/42186230" frameborder="0" width="500" height="369"></iframe></p>
<p>&nbsp;</p>
<p>What else can be done with the GitHub dataset? Most repositories have a description field, which can be used to find similar projects by implementing the <a href="http://en.wikipedia.org/wiki/Tf*idf">tf-idf</a> method. I tried that method and the results are satisfying.</p>
<p>Most of the graphs shown above are reproducible (except the word clouds) and the code can be found on <a href="https://github.com/kafka399/githubdata" target="_blank">GitHub</a>.</p>
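<p>To give an idea of what tf-idf does, here is a small base R sketch on three invented descriptions. It is a bare-bones version (whitespace tokenizer, no stop words, cosine similarity on top) and not the code I ran on the full dataset.</p>
<pre class="rsplus" style="font-family:monospace;"># toy repository descriptions (made up for this sketch)
docs = c(a = 'simple game engine',
         b = 'game engine library for arduino',
         c = 'data analysis library')

tokens = strsplit(tolower(docs), ' ', fixed = TRUE)
vocab  = sort(unique(unlist(tokens)))

# term frequency: share of each word inside a description
tf = t(sapply(tokens, function(w) as.numeric(table(factor(w, levels = vocab))) / length(w)))
dimnames(tf) = list(names(docs), vocab)

# inverse document frequency: words found in few descriptions weigh more
idf   = log(length(docs) / colSums(tf > 0))
tfidf = sweep(tf, 2, idf, '*')

# cosine similarity between every pair of descriptions
unit = tfidf / sqrt(rowSums(tfidf^2))
similarity = unit %*% t(unit)</pre>
<p>Descriptions that share rare words end up with a similarity close to 1, while descriptions without common words get 0.</p>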
]]></content:encoded>
			<wfw:commentRss>http://www.investuotojas.eu/2012/05/15/github-data-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.investuotojas.eu/2012/05/15/github-data-analysis/</feedburner:origLink></item>
		<item>
		<title>Machine learning for identification of cars</title>
		<link>http://feedproxy.google.com/~r/investuotojas/~3/szutkqOnfkI/</link>
		<comments>http://www.investuotojas.eu/2012/04/22/machine-learning-for-identification-of-cars/#comments</comments>
		<pubDate>Sun, 22 Apr 2012 14:53:33 +0000</pubDate>
		<dc:creator>Dzidorius Martinaitis</dc:creator>
				<category><![CDATA[data analysis]]></category>
		<category><![CDATA[EN]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[R-language]]></category>
		<category><![CDATA[bash]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[ML]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.investuotojas.eu/?p=814</guid>
		<description><![CDATA[There is plenty of data on the internet, however it is raw data. Think for a second about public surveillance cameras - useful to check the traffic on a route or a busy place, but anything else? What if you want to know how many cars are on the route? How many cars were there yesterday at the same time? [...]]]></description>
			<content:encoded><![CDATA[<p>There is plenty of data on the internet, however it is raw data. Think for a second about public surveillance cameras - useful to check the traffic on a route or a busy place, but anything else? What if you want to know how many cars are on the route? How many cars were there yesterday at the same time? Given so many cars on the route, how polluted is the air in the area?<br />
While working on the road map for a data dive event, I started to wonder how feasible it is to use data from public surveillance cameras. So I quickly built a pilot project and now I would like to share my experience.</p>
<p>First step &#8211; <strong>data acquisition</strong>. At the beginning I was thinking of plugging in my smartphone somewhere to collect data on a busy route. Nevertheless, I quickly found surveillance cameras in Vilnius and started to collect images. Run a search and I&#8217;m sure that you will find them in your city:</p>
<p><a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=example.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/example.png" alt="Photobucket" border="0" /></a></p>
<p>Here is the bash script which I use to collect images:</p>

<div class="wp_codebox_msgheader wp_codebox_hide"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p814code6'); return false;">View Code</a> BASH</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p8146"><td class="code" id="p814code6"><pre class="bash" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">#you need full path for crontab</span>
<span style="color: #7a0874; font-weight: bold;">cd</span> <span style="color: #000000; font-weight: bold;">/</span>home<span style="color: #000000; font-weight: bold;">/</span>git<span style="color: #000000; font-weight: bold;">/</span>carCount<span style="color: #000000; font-weight: bold;">/</span>img
<span style="color: #007800;">a</span>=<span style="color: #000000; font-weight: bold;">`</span><span style="color: #c20cb9; font-weight: bold;">date</span> +<span style="color: #000000; font-weight: bold;">%</span>s<span style="color: #000000; font-weight: bold;">`</span>
<span style="color: #007800;">b</span>=<span style="color: #800000;">${a}</span>_4.jpg
<span style="color: #c20cb9; font-weight: bold;">wget</span> <span style="color: #660033;">-O</span> <span style="color: #007800;">$b</span> <span style="color: #660033;">-q</span> <span style="color: #ff0000;">&quot;http://www.sviesoforai.lt/map/camera.aspx?size=full&amp;image=K7742-1.jpg&amp;rnd=0.15417794161476195&quot;</span></pre></td></tr></table></div>

<p><strong>Data preparation</strong>. After a while you will have enough data to train your machine (to begin with, more than 30 images should be O.K.).<br />
How do we train the algorithm? The goal is to identify the cars in a given image. That means that we have to provide examples of positive images (a clear image of a car) and negative images (no car, parts of a car, etc.). Important note &#8211; we don&#8217;t feed the whole image; instead we cut the chosen image with a sliding window (100&#215;100 in my case). 4 examples of positive images:</p>
<p><a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=4.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/4.png" alt="Photobucket" border="0" /></a></p>
<p>Meanwhile, it is worth converting each image to the <a href="http://en.wikipedia.org/wiki/Netpbm_format" target="_blank">portable graymap format (PGM)</a>. For this specific task, we can sacrifice information about the color of the car &#8211; it won&#8217;t improve the prediction. Besides, PGM images can be loaded into R and easily transformed into a matrix. Here is the bash script which converts jpg to pgm and slices each image:</p>

<div class="wp_codebox_msgheader wp_codebox_hide"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p814code7'); return false;">View Code</a> BASH</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p8147"><td class="code" id="p814code7"><pre class="bash" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">#remove image duplicates</span>
<span style="color: #c20cb9; font-weight: bold;">find</span> . <span style="color: #660033;">-maxdepth</span> <span style="color: #000000;">1</span> <span style="color: #660033;">-type</span> f <span style="color: #660033;">-exec</span> md5sum <span style="color: #7a0874; font-weight: bold;">&#123;</span><span style="color: #7a0874; font-weight: bold;">&#125;</span> \;  <span style="color: #000000; font-weight: bold;">&gt;</span>test.txt
<span style="color: #c20cb9; font-weight: bold;">awk</span> <span style="color: #ff0000;">'a[$1]++ {gsub(/^\*/,&quot;&quot;,$2); print &quot;rm &quot;, $2}'</span> test.txt <span style="color: #000000; font-weight: bold;">|</span><span style="color: #c20cb9; font-weight: bold;">sh</span>
<span style="color: #c20cb9; font-weight: bold;">rm</span> test.txt
&nbsp;
<span style="color: #666666; font-style: italic;">#convert jpg</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">if</span> <span style="color: #7a0874; font-weight: bold;">&#91;</span> <span style="color: #660033;">-d</span> <span style="color: #ff0000;">&quot;out&quot;</span> <span style="color: #7a0874; font-weight: bold;">&#93;</span>; <span style="color: #000000; font-weight: bold;">then</span>
        <span style="color: #c20cb9; font-weight: bold;">rm</span> <span style="color: #660033;">-r</span> out
<span style="color: #000000; font-weight: bold;">fi</span>
<span style="color: #c20cb9; font-weight: bold;">mkdir</span> out
<span style="color: #000000; font-weight: bold;">for</span> k <span style="color: #000000; font-weight: bold;">in</span> $<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #c20cb9; font-weight: bold;">ls</span> <span style="color: #000000; font-weight: bold;">*</span>.jpg<span style="color: #7a0874; font-weight: bold;">&#41;</span>; <span style="color: #000000; font-weight: bold;">do</span> convert <span style="color: #007800;">$k</span> out<span style="color: #000000; font-weight: bold;">/</span><span style="color: #007800;">$k</span>.pgm; <span style="color: #000000; font-weight: bold;">done</span>
&nbsp;
<span style="color: #7a0874; font-weight: bold;">cd</span> out
<span style="color: #c20cb9; font-weight: bold;">mkdir</span> slide
<span style="color: #000000; font-weight: bold;">for</span> filename <span style="color: #000000; font-weight: bold;">in</span> $<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #c20cb9; font-weight: bold;">ls</span> <span style="color: #000000; font-weight: bold;">*</span>.pgm<span style="color: #7a0874; font-weight: bold;">&#41;</span>;
 <span style="color: #000000; font-weight: bold;">do</span> 
&nbsp;
<span style="color: #007800;">w</span>=<span style="color: #000000; font-weight: bold;">`</span>convert <span style="color: #007800;">$filename</span> <span style="color: #660033;">-print</span> <span style="color: #ff0000;">&quot;%w&quot;</span> <span style="color: #000000; font-weight: bold;">/</span>dev<span style="color: #000000; font-weight: bold;">/</span>null<span style="color: #000000; font-weight: bold;">`</span>
<span style="color: #007800;">h</span>=<span style="color: #000000; font-weight: bold;">`</span>convert <span style="color: #007800;">$filename</span> <span style="color: #660033;">-print</span> <span style="color: #ff0000;">&quot;%h&quot;</span> <span style="color: #000000; font-weight: bold;">/</span>dev<span style="color: #000000; font-weight: bold;">/</span>null<span style="color: #000000; font-weight: bold;">`</span>
<span style="color: #7a0874; font-weight: bold;">let</span> <span style="color: #ff0000;">&quot;ww= <span style="color: #007800;">$w</span>/100&quot;</span>
<span style="color: #7a0874; font-weight: bold;">let</span> <span style="color: #ff0000;">&quot;hh= <span style="color: #007800;">$h</span>/100&quot;</span>
<span style="color: #000000; font-weight: bold;">for</span><span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #007800;">y</span>=<span style="color: #000000;">150</span>;y<span style="color: #000000; font-weight: bold;">&lt;</span>=<span style="color: #000000;">250</span>;y+=<span style="color: #000000;">50</span><span style="color: #7a0874; font-weight: bold;">&#41;</span><span style="color: #7a0874; font-weight: bold;">&#41;</span>
<span style="color: #000000; font-weight: bold;">do</span>
<span style="color: #000000; font-weight: bold;">for</span><span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #007800;">i</span>=<span style="color: #000000;">100</span>;i<span style="color: #000000; font-weight: bold;">&lt;</span>=<span style="color: #000000;">400</span>;i+=<span style="color: #000000;">50</span><span style="color: #7a0874; font-weight: bold;">&#41;</span><span style="color: #7a0874; font-weight: bold;">&#41;</span>
<span style="color: #000000; font-weight: bold;">do</span>
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;slide/<span style="color: #007800;">$i</span>.<span style="color: #007800;">$filename</span>&quot;</span>
<span style="color: #7a0874; font-weight: bold;">let</span> <span style="color: #ff0000;">&quot;h_slide=<span style="color: #007800;">$i</span>&quot;</span>
convert <span style="color: #007800;">$filename</span> <span style="color: #660033;">-crop</span> 100x100+<span style="color: #007800;">$i</span>+<span style="color: #007800;">$y</span> slide<span style="color: #000000; font-weight: bold;">/</span><span style="color: #007800;">$y</span>.<span style="color: #007800;">$i</span>.<span style="color: #007800;">$filename</span>
<span style="color: #000000; font-weight: bold;">done</span>
<span style="color: #000000; font-weight: bold;">done</span>
<span style="color: #000000; font-weight: bold;">done</span></pre></td></tr></table></div>

<p><strong>Training, predicting, cross validation</strong>. Now it is time to open R, load the 100&#215;100 images from the &#8220;train/out/slide&#8221; directory and train the algorithm. Important note &#8211; each image is a matrix, however you have to feed a single matrix of all images to the learning algorithm (a support vector machine in my case). What you have to do is to &#8220;unroll&#8221; each image matrix into a 1&#215;10000 vector and build a new matrix, where each row is an image.<br />
Once training is done, load unseen data from the &#8220;crossval/out/slide&#8221; directory and check the &#8220;result/&#8221; directory, where you will find images of the cars. Here is the R script which does all of the above:</p>

<div class="wp_codebox_msgheader wp_codebox_hide"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p814code8'); return false;">View Code</a> RSPLUS</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p8148"><td class="code" id="p814code8"><pre class="rsplus" style="font-family:monospace;"><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/setwd.html"><span style="color: #0000FF; font-weight: bold;">setwd</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'/home/git/carCount/'</span><span style="color: #080;">&#41;</span>
<span style="color: #228B22;">#read.pnm() used below comes from the pixmap package</span>
library<span style="color: #080;">&#40;</span>pixmap<span style="color: #080;">&#41;</span>
&nbsp;
<span style="color: #228B22;">######read positives############</span>
files<span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/list.files.html"><span style="color: #0000FF; font-weight: bold;">list.<span style="">files</span></span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'test/pos/'</span><span style="color: #080;">&#41;</span>
pos<span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/matrix.html"><span style="color: #0000FF; font-weight: bold;">matrix</span></a><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/nrow.html"><span style="color: #0000FF; font-weight: bold;">nrow</span></a><span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/NROW.html"><span style="color: #0000FF; font-weight: bold;">NROW</span></a><span style="color: #080;">&#40;</span>files<span style="color: #080;">&#41;</span>,<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/ncol.html"><span style="color: #0000FF; font-weight: bold;">ncol</span></a><span style="color: #080;">=</span><span style="color: #ff0000;">100</span><span style="color: #080;">*</span><span style="color: #ff0000;">100</span><span style="color: #080;">&#41;</span>
&nbsp;
<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/for.html"><span style="color: #0000FF; font-weight: bold;">for</span></a><span style="color: #080;">&#40;</span>i <span style="color: #0000FF; font-weight: bold;">in</span> <span style="color: #ff0000;">1</span><span style="color: #080;">:</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/NROW.html"><span style="color: #0000FF; font-weight: bold;">NROW</span></a><span style="color: #080;">&#40;</span>files<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
<span style="color: #080;">&#123;</span>
  gray_file<span style="color: #080;">=</span>read.<span style="">pnm</span><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/paste.html"><span style="color: #0000FF; font-weight: bold;">paste</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'test/pos/'</span>,files<span style="color: #080;">&#91;</span>i<span style="color: #080;">&#93;</span>,sep<span style="color: #080;">=</span><span style="color: #ff0000;">''</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
  pos<span style="color: #080;">&#91;</span>i,<span style="color: #080;">&#93;</span><span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/c.html"><span style="color: #0000FF; font-weight: bold;">c</span></a><span style="color: #080;">&#40;</span>gray_file@<a href="http://astrostatistics.psu.edu/su07/R/html/stats/html/summary.lm.html"><span style="color: #0000FF; font-weight: bold;">grey</span></a><span style="color: #080;">&#41;</span>
<span style="color: #080;">&#125;</span>
outcome<span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/vector.html"><span style="color: #0000FF; font-weight: bold;">vector</span></a><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/length.html"><span style="color: #0000FF; font-weight: bold;">length</span></a><span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/NROW.html"><span style="color: #0000FF; font-weight: bold;">NROW</span></a><span style="color: #080;">&#40;</span>files<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
outcome<span style="color: #080;">&#91;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/which.html"><span style="color: #0000FF; font-weight: bold;">which</span></a><span style="color: #080;">&#40;</span>outcome<span style="color: #080;">!=</span><span style="color: #ff0000;">1</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#93;</span><span style="color: #080;">=</span><span style="color: #ff0000;">1</span>
&nbsp;
<span style="color: #228B22;">########read negatives#############</span>
files<span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/list.files.html"><span style="color: #0000FF; font-weight: bold;">list.<span style="">files</span></span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'test/neg/'</span><span style="color: #080;">&#41;</span>
neg<span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/matrix.html"><span style="color: #0000FF; font-weight: bold;">matrix</span></a><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/nrow.html"><span style="color: #0000FF; font-weight: bold;">nrow</span></a><span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/NROW.html"><span style="color: #0000FF; font-weight: bold;">NROW</span></a><span style="color: #080;">&#40;</span>files<span style="color: #080;">&#41;</span>,<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/ncol.html"><span style="color: #0000FF; font-weight: bold;">ncol</span></a><span style="color: #080;">=</span><span style="color: #ff0000;">100</span><span style="color: #080;">*</span><span style="color: #ff0000;">100</span><span style="color: #080;">&#41;</span>
&nbsp;
<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/for.html"><span style="color: #0000FF; font-weight: bold;">for</span></a><span style="color: #080;">&#40;</span>i <span style="color: #0000FF; font-weight: bold;">in</span> <span style="color: #ff0000;">1</span><span style="color: #080;">:</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/NROW.html"><span style="color: #0000FF; font-weight: bold;">NROW</span></a><span style="color: #080;">&#40;</span>files<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
<span style="color: #080;">&#123;</span>
  gray_file<span style="color: #080;">=</span>read.<span style="">pnm</span><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/paste.html"><span style="color: #0000FF; font-weight: bold;">paste</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'test/neg/'</span>,files<span style="color: #080;">&#91;</span>i<span style="color: #080;">&#93;</span>,sep<span style="color: #080;">=</span><span style="color: #ff0000;">''</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
  neg<span style="color: #080;">&#91;</span>i,<span style="color: #080;">&#93;</span><span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/c.html"><span style="color: #0000FF; font-weight: bold;">c</span></a><span style="color: #080;">&#40;</span>gray_file@<a href="http://astrostatistics.psu.edu/su07/R/html/stats/html/summary.lm.html"><span style="color: #0000FF; font-weight: bold;">grey</span></a><span style="color: #080;">&#41;</span>
<span style="color: #080;">&#125;</span>
tmp<span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/vector.html"><span style="color: #0000FF; font-weight: bold;">vector</span></a><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/length.html"><span style="color: #0000FF; font-weight: bold;">length</span></a><span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/NROW.html"><span style="color: #0000FF; font-weight: bold;">NROW</span></a><span style="color: #080;">&#40;</span>files<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
tmp<span style="color: #080;">&#91;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/which.html"><span style="color: #0000FF; font-weight: bold;">which</span></a><span style="color: #080;">&#40;</span>tmp<span style="color: #080;">!=</span><span style="color: #ff0000;">0</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#93;</span><span style="color: #080;">=</span><span style="color: #ff0000;">0</span>
outcome<span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/c.html"><span style="color: #0000FF; font-weight: bold;">c</span></a><span style="color: #080;">&#40;</span>outcome,tmp<span style="color: #080;">&#41;</span>
forecast<span style="color: #080;">=</span>svm<span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/rbind.html"><span style="color: #0000FF; font-weight: bold;">rbind</span></a><span style="color: #080;">&#40;</span>pos,neg<span style="color: #080;">&#41;</span>,outcome<span style="color: #080;">&#41;</span>
cross_val<span style="color: #080;">=</span>pos<span style="color: #080;">&#91;</span><span style="color: #ff0000;">84</span><span style="color: #080;">:</span><span style="color: #ff0000;">90</span>,<span style="color: #080;">&#93;</span>
pred<span style="color: #080;">=</span><span style="color: #0000FF; font-weight: bold;">predict</span><span style="color: #080;">&#40;</span>forecast,cross_val,decision.<span style="">values</span><span style="color: #080;">=</span>TRUE<span style="color: #080;">&#41;</span>
&nbsp;
<span style="color: #228B22;">##########################unseen data######################</span>
files<span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/list.files.html"><span style="color: #0000FF; font-weight: bold;">list.<span style="">files</span></span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'crossval/out/slide/'</span><span style="color: #080;">&#41;</span>
cross<span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/matrix.html"><span style="color: #0000FF; font-weight: bold;">matrix</span></a><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/nrow.html"><span style="color: #0000FF; font-weight: bold;">nrow</span></a><span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/NROW.html"><span style="color: #0000FF; font-weight: bold;">NROW</span></a><span style="color: #080;">&#40;</span>files<span style="color: #080;">&#41;</span>,<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/ncol.html"><span style="color: #0000FF; font-weight: bold;">ncol</span></a><span style="color: #080;">=</span><span style="color: #ff0000;">100</span><span style="color: #080;">*</span><span style="color: #ff0000;">100</span><span style="color: #080;">&#41;</span>
&nbsp;
<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/for.html"><span style="color: #0000FF; font-weight: bold;">for</span></a><span style="color: #080;">&#40;</span>i <span style="color: #0000FF; font-weight: bold;">in</span> <span style="color: #ff0000;">1</span><span style="color: #080;">:</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/NROW.html"><span style="color: #0000FF; font-weight: bold;">NROW</span></a><span style="color: #080;">&#40;</span>files<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
<span style="color: #080;">&#123;</span>
  gray_file<span style="color: #080;">=</span>read.<span style="">pnm</span><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/paste.html"><span style="color: #0000FF; font-weight: bold;">paste</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'crossval/out/slide/'</span>,files<span style="color: #080;">&#91;</span>i<span style="color: #080;">&#93;</span>,sep<span style="color: #080;">=</span><span style="color: #ff0000;">''</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span>
  cross<span style="color: #080;">&#91;</span>i,<span style="color: #080;">&#93;</span><span style="color: #080;">=</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/c.html"><span style="color: #0000FF; font-weight: bold;">c</span></a><span style="color: #080;">&#40;</span>gray_file@<a href="http://astrostatistics.psu.edu/su07/R/html/stats/html/summary.lm.html"><span style="color: #0000FF; font-weight: bold;">grey</span></a><span style="color: #080;">&#41;</span>
<span style="color: #080;">&#125;</span>
pred<span style="color: #080;">=</span><span style="color: #0000FF; font-weight: bold;">predict</span><span style="color: #080;">&#40;</span>forecast,cross,decision.<span style="">values</span><span style="color: #080;">=</span>TRUE<span style="color: #080;">&#41;</span>
&nbsp;
<span style="color: #228B22;">###############copy positives into result directory###############</span>
<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/dir.create.html"><span style="color: #0000FF; font-weight: bold;">dir.<span style="">create</span></span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'result'</span><span style="color: #080;">&#41;</span>
<a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/file.copy.html"><span style="color: #0000FF; font-weight: bold;">file.<span style="">copy</span></span></a><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/paste.html"><span style="color: #0000FF; font-weight: bold;">paste</span></a><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'crossval/out/slide/'</span>,files<span style="color: #080;">&#91;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/which.html"><span style="color: #0000FF; font-weight: bold;">which</span></a><span style="color: #080;">&#40;</span><a href="http://astrostatistics.psu.edu/su07/R/html/graphics/html/as.double.html"><span style="color: #0000FF; font-weight: bold;">as.<span style="">double</span></span></a><span style="color: #080;">&#40;</span>pred<span style="color: #080;">&#41;</span><span style="color: #080;">&gt;</span><span style="color: #ff0000;">0.6</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#93;</span>,sep<span style="color: #080;">=</span><span style="color: #ff0000;">''</span><span style="color: #080;">&#41;</span>,<span style="color: #ff0000;">'result/'</span><span style="color: #080;">&#41;</span></pre></td></tr></table></div>

<p>&nbsp;</p>
<p>Classified as positive by the algorithm:<br />
<a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=pos.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/pos.png" alt="Photobucket" border="0" /></a></p>
<p>Classified as negative by the algorithm:<br />
<a href="http://s176.photobucket.com/albums/w180/investuotojas/?action=view&amp;current=neg.png" target="_blank"><img src="http://i176.photobucket.com/albums/w180/investuotojas/neg.png" alt="Photobucket" border="0" /></a></p>
<p>&nbsp;</p>
<p><strong>Conclusion</strong>. It is truly amazing how well the algorithm separates the wheat from the chaff without additional tuning. Mind you, my impression is biased after so many failures with financial data, which is noisy and where good predictions are scarce.<br />
Nevertheless, this project is far from ideal: it doesn&#8217;t take into account weather conditions, traffic jams, the perspective of the view, camera movement, etc. But I leave this fun for the data-dive event.</p>
<p><strong>Fork the code</strong>: <a href="https://github.com/kafka399/carCount/" target="_blank">https://github.com/kafka399/carCount/</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.investuotojas.eu/2012/04/22/machine-learning-for-identification-of-cars/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		<feedburner:origLink>http://www.investuotojas.eu/2012/04/22/machine-learning-for-identification-of-cars/</feedburner:origLink></item>
		<item>
		<title>How to organize R user group</title>
		<link>http://feedproxy.google.com/~r/investuotojas/~3/N_XJHhbQIkc/</link>
		<comments>http://www.investuotojas.eu/2012/04/18/how-to-organize-r-user-group/#comments</comments>
		<pubDate>Wed, 18 Apr 2012 11:12:32 +0000</pubDate>
		<dc:creator>Dzidorius Martinaitis</dc:creator>
				<category><![CDATA[EN]]></category>
		<category><![CDATA[R-language]]></category>
		<category><![CDATA[Lithuania]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.investuotojas.eu/?p=795</guid>
		<description><![CDATA[The first thing you have to do is estimate how many users will be interested in a local R group. I would say that out of one million inhabitants you can expect 10-20 users. Based on this rough number, you will know what challenges await you. If you expect 100 or more users, you have [...]]]></description>
			<content:encoded><![CDATA[<p>The first thing you have to do is estimate how many users will be interested in a local R group. I would say that out of one million inhabitants you can expect 10-20 users. Based on this rough number, you will know what challenges await you. If you expect 100 or more users, you have to think about an appropriate place to hold the first meetup, how to manage so many people, and what topics to present the first time.<br />
However, if you expect a small community (as I did), your challenge is spreading the news about the local R group. Get to know a few local users and ask them what they think about a meetup. The right place to find such users is the local university. Most likely the university will be happy to provide a place for the first official meeting.</p>
<p>Fortunately for me, I met a <a href="http://vzemlys.wordpress.com/" target="_blank">powerful</a> R user &#8211; <a href="https://twitter.com/#!/mpiktas" target="_blank">@mpiktas</a>, who is a lecturer at Vilnius University. With his help I was able to identify more R-infected users and get premises at the university for the presentation.</p>
<p>The next step, once you are sure that you are not alone, is to choose a name for the group and build a simple web site, either from scratch or by using a service like <a href="http://www.meetup.com">meetup</a>. This won&#8217;t cost you a fortune, but I should mention that you can <a href="http://www.revolutionanalytics.com/news-events/r-user-group/" target="_blank">apply for sponsorship</a>. Revolution Analytics not only provides sponsorship but also maintains <a href="http://blog.revolutionanalytics.com/local-r-groups.html" target="_blank">a directory of R user groups</a>.</p>
<p>Once you have created a virtual community, you have to think about a meetup. As the founder, prepare an introduction about the local R group, its future plans, and your own usage of R. Here is mine, which I used for the <a href="http://www.VilniusR.org" target="_blank">VilniusR</a> introduction:</p>
<div id="__ss_12506088" style="width: 425px;">
<p><strong style="display: block; margin: 12px 0 4px;"><a title="R language presentation" href="http://www.slideshare.net/kafka399/vilniusr-group" target="_blank">R language presentation</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/12506088" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" width="425" height="355"></iframe></p>
</div>
<p>&nbsp;</p>
<p>What&#8217;s next? During the first meetup you can outline future meetups. What I found fascinating is that many participants expressed an interest in a datadive event. So now we are in the process of organizing such an event!</p>
<p>P.S. If you happen to be from Luxembourg and are interested in a local user group &#8211; <a href="http://www.investuotojas.eu/contact-me/" target="_blank">let me know</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.investuotojas.eu/2012/04/18/how-to-organize-r-user-group/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		<feedburner:origLink>http://www.investuotojas.eu/2012/04/18/how-to-organize-r-user-group/</feedburner:origLink></item>
	</channel>
</rss>
