<?xml version='1.0' encoding='UTF-8'?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" xmlns:blogger="http://schemas.google.com/blogger/2008" xmlns:georss="http://www.georss.org/georss" xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr="http://purl.org/syndication/thread/1.0" version="2.0"><channel><atom:id>tag:blogger.com,1999:blog-7662744293957770471</atom:id><lastBuildDate>Thu, 29 Aug 2024 08:58:09 +0000</lastBuildDate><category>R</category><title>Pass the ROC</title><description>Adventures in Data Science</description><link>http://pleasepasstheroc.blogspot.com/</link><managingEditor>noreply@blogger.com (Pass The ROC)</managingEditor><generator>Blogger</generator><openSearch:totalResults>2</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><item><guid isPermaLink="false">tag:blogger.com,1999:blog-7662744293957770471.post-5331086836620668290</guid><pubDate>Fri, 22 Apr 2011 07:55:00 +0000</pubDate><atom:updated>2011-04-26T06:46:15.054-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">R</category><title>How to Build a Dataset in R using an RSS feed or Web page</title><description>I recently wanted to build a dataset from content in an RSS feed - the feed of crimes in Newark provided by &lt;a href=&quot;http://spotcrime.com/&quot;&gt;SpotCrime&lt;/a&gt;.&amp;nbsp; (They have feeds for lots of US cities, but I just wanted Newark.&amp;nbsp; Please read their Terms of Service before using this code on their feed.)&amp;nbsp; After some tinkering, I got it to work using the &lt;a href=&quot;http://cran.r-project.org/web/packages/XML/index.html&quot;&gt;XML package&lt;/a&gt; in R.&amp;nbsp; &lt;br /&gt;
The first step is to read in the RSS feed XML file:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;
#install.packages(&quot;XML&quot;)  &lt;br /&gt;
library(XML)&lt;br /&gt;
doc&amp;lt;-xmlTreeParse(&quot;http://s3.spotcrime.com/cache/rss/newark.xml&quot;) &lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
The xmlTreeParse command &quot;parses an XML or HTML file or string containing XML/HTML content, and generates an R structure representing the XML/HTML tree.&quot;&amp;nbsp; There are tons of optional arguments, but as you can see, I didn&#39;t use any of them, and frankly, I don&#39;t understand many of them.&amp;nbsp; But the function did what I wanted.&lt;br /&gt;
Next, I used the command xmlRoot to isolate the &quot;top level XMLNode object resulting from parsing an XML document.&quot;&amp;nbsp; Now is a good time to look at what we have:&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;
&amp;gt; xmlRoot(doc)&lt;br /&gt;
&amp;lt;rss version=&quot;2.0&quot; xmlns:media=&quot;http://search.yahoo.com/mrss/&quot; xmlns:atom=&quot;http://www.w3.org/2005/Atom&quot; xmlns:geo=&quot;http://www.w3.org/2003/01/geo/wgs84_pos#&quot; xmlns:georss=&quot;http://www.georss.org/georss&quot;&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;lt;channel&amp;gt;&lt;br /&gt;
&amp;nbsp; &amp;lt;atom:link href=&quot;http://spotcrime.com&quot; rel=&quot;self&quot; type=&quot;application/rss+xml&quot;/&amp;gt;&lt;br /&gt;
&amp;nbsp; &amp;lt;title&amp;gt;Spotcrime.com Crime Listing - Newark, NJ&amp;lt;/title&amp;gt;&lt;br /&gt;
&amp;nbsp; &amp;lt;description&amp;gt;Crime feed - RSS - 5 incidents. To see more visit http://spotcrime.com&amp;lt;/description&amp;gt;&lt;br /&gt;
&amp;nbsp; &amp;lt;language&amp;gt;en-us&amp;lt;/language&amp;gt;&lt;br /&gt;
&amp;nbsp; &amp;lt;link&amp;gt;http://spotcrime.com&amp;lt;/link&amp;gt;&lt;br /&gt;
&amp;nbsp; &amp;lt;ttl&amp;gt;180&amp;lt;/ttl&amp;gt;&lt;br /&gt;
&amp;nbsp; &amp;lt;copyright&amp;gt;ReportSee, Inc.&amp;lt;/copyright&amp;gt;&lt;br /&gt;
&amp;nbsp; &amp;lt;item&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp; &amp;lt;guid isPermaLink=&quot;true&quot;&amp;gt;http://spotcrime.com/crime-report/18002873/robbery+on+easton+avenue%2C+franklin%2C+nj&amp;lt;/guid&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp; &amp;lt;link&amp;gt;http://spotcrime.com/crime-report/18002873/robbery+on+easton+avenue%2C+franklin%2C+nj&amp;lt;/link&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp; &amp;lt;pubDate&amp;gt;Mon, 18 Apr 2011 00:00:00 -0400&amp;lt;/pubDate&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp; &amp;lt;title&amp;gt;Robbery on EASTON AVENUE, Franklin, NJ (via spotcrime.com)&amp;lt;/title&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp; &amp;lt;description&amp;gt;Police are seeking a man who robbed the Financial Resources Federal Credit Union&amp;lt;/description&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp; &amp;lt;georss:point&amp;gt;40.5242061 -74.495662&amp;lt;/georss:point&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp; &amp;lt;geo:Point&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;geo:lat&amp;gt;40.5242061&amp;lt;/geo:lat&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;geo:long&amp;gt;-74.495662&amp;lt;/geo:long&amp;gt;&lt;br /&gt;
&amp;nbsp;&amp;nbsp; &amp;lt;/geo:Point&amp;gt;&lt;br /&gt;
&amp;nbsp; &amp;lt;/item&amp;gt;&lt;br /&gt;
&lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This is only a portion of the full output - there are more &amp;lt;item&amp;gt; nodes, one for each crime.&lt;br /&gt;
So the feed starts with a header full of stuff we don&#39;t need, followed by the content in the &amp;lt;item&amp;gt; node, which is the good stuff: a link to the crime on SpotCrime, the publication date (more on this later), the crime &quot;title,&quot; a description, and the Lat/Lon, in two different formats.&amp;nbsp; How do we get at that meaty stuff, and put it into a friendly R dataframe?&amp;nbsp; We&#39;ll use the xpathApply command:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;
src&amp;lt;-xpathApply(xmlRoot(doc), &quot;//item&quot;) &lt;/code&gt;&lt;br /&gt;
&lt;br /&gt;
xpathApply is a &quot;way to find XML nodes that match a particular criterion&quot; using XPath syntax.&amp;nbsp; XPath is a way to navigate XML trees.&amp;nbsp; My approach for a project like this is to aim, first and foremost, for code that works, and worry about advanced techniques later.&amp;nbsp; So I did a simple search for nodes identified as &quot;item,&quot; ignoring all the other possible arguments to xpathApply.&amp;nbsp; src is now a list with 5 elements, one for each &quot;item&quot; node in the feed (recall that above, I only showed the first item node - four more followed).&amp;nbsp; We can now iterate through the 5 elements of src and convert the data into a dataframe:&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;
&lt;br /&gt;
for (i in seq_along(src)) {&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; if (i==1) {&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; foo&amp;lt;-xmlSApply(src[[i]], xmlValue)&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; DATA&amp;lt;-data.frame(t(foo), stringsAsFactors=FALSE)&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; else {&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; foo&amp;lt;-xmlSApply(src[[i]], xmlValue)&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; tmp&amp;lt;-data.frame(t(foo), stringsAsFactors=FALSE)&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; DATA&amp;lt;-rbind(DATA, tmp)&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;br /&gt;
&lt;/code&gt;xmlSApply applies a function to the subnodes of an XML node.&amp;nbsp; In this case, the function is xmlValue, which returns the text content of a node.&amp;nbsp; So foo becomes a named character vector containing all of those nice data bits for crime &lt;i&gt;i&lt;/i&gt;. We then transpose foo into a one-row matrix and convert it to a (1 row) data.frame.&amp;nbsp; Setting stringsAsFactors=FALSE prevents R from converting the strings to factors, which makes sense in this case - it might not in yours.&lt;br /&gt;
The first time through the loop, we want to create the data.frame; subsequent iterations, we just want to rbind a row on the bottom.&amp;nbsp; When we&#39;re done, we have what we want: the data from the RSS feed nicely formatted in a data.frame named (descriptively) DATA.&amp;nbsp; &lt;br /&gt;
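As an aside, the same stacking can be written without the first-iteration check.&amp;nbsp; The sketch below is only an illustration in base R: the list named rows is made-up stand-in data shaped like the named character vectors that xmlSApply returns, not output from the real feed, and the names one_row and DATA2 are hypothetical.&lt;br /&gt;
&lt;pre&gt;
```r
# Hypothetical stand-in for xmlSApply(src[[i]], xmlValue): one named
# character vector per item (the second item is invented for illustration).
rows = list(
  c(title = "Robbery on EASTON AVENUE", pubDate = "Mon, 18 Apr 2011 00:00:00 -0400"),
  c(title = "Theft on MAIN STREET", pubDate = "Tue, 19 Apr 2011 00:00:00 -0400")
)

# Turn each vector into a one-row data.frame, as the loop does.
one_row = function(x) data.frame(t(x), stringsAsFactors = FALSE)

# Stack all rows in a single rbind call instead of growing DATA step by step.
DATA2 = do.call(rbind, lapply(rows, one_row))
print(DATA2)
```
&lt;/pre&gt;
With only 5 items the difference is cosmetic; the single rbind call mainly avoids copying the data.frame on every iteration, which starts to matter for longer feeds.&lt;br /&gt;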
&lt;br /&gt;
Now, returning to the date and time.&amp;nbsp; SpotCrime reports the &lt;i&gt;publication&lt;/i&gt; date and time, not the date and time &lt;i&gt;that the crime actually occurred&lt;/i&gt;.&amp;nbsp; What can we do?&amp;nbsp; It looks like SpotCrime reports the date and time we want on the webpage for the crime, the link to which was helpfully provided in the RSS feed.&amp;nbsp; Take a look:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;http://1.bp.blogspot.com/-iT5f6kDxmCY/TbEyx5IrJaI/AAAAAAAAAr4/1itD1Q69dno/s1600/Screen+shot+2011-04-22+at+5.47.27+PM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;http://1.bp.blogspot.com/-iT5f6kDxmCY/TbEyx5IrJaI/AAAAAAAAAr4/1itD1Q69dno/s1600/Screen+shot+2011-04-22+at+5.47.27+PM.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;
&lt;br /&gt;
So, let&#39;s read in the html for that page, and grab the correct date and time!&lt;br /&gt;
&lt;br /&gt;
&lt;code&gt;&lt;br /&gt;
# Looping through the crimes, going to web page and grabbing actual date and time&lt;br /&gt;
date_time&amp;lt;-character(length(src))&lt;br /&gt;
for (i in seq_along(src)) {&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; res&amp;lt;-htmlTreeParse(DATA$link[i], useInternalNodes=TRUE)&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; title&amp;lt;-xpathApply(xmlRoot(res), &quot;//title&quot;)&lt;br /&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp; date_time[i]&amp;lt;-xmlSApply(title[[1]], xmlValue)&lt;br /&gt;
}&lt;br /&gt;
DATA&amp;lt;-cbind(DATA,date_time)&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;/code&gt;Here, we used many of the same commands we used for the RSS feed.  The real date and time were stored in a node called &quot;title,&quot; so we just grabbed that node for each crime, stuck it into the appropriate slot in a vector, and slapped that vector onto our DATA data.frame.  &lt;br /&gt;
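Everything collected so far is still a character string.&amp;nbsp; Here is a rough sketch of converting two of the fields in base R.&amp;nbsp; The two input strings are copied from the sample item shown earlier; the variable names and the choice of UTC are assumptions made for illustration.&amp;nbsp; The format of the date and time scraped from the page title is not shown in this post, so it is left out.&lt;br /&gt;
&lt;pre&gt;
```r
# Sample values copied from the first item in the feed output above.
point = "40.5242061 -74.495662"
pub_date = "Mon, 18 Apr 2011 00:00:00 -0400"

# georss:point holds latitude and longitude separated by a single space.
parts = strsplit(point, " ", fixed = TRUE)[[1]]
lat = as.numeric(parts[1])
lon = as.numeric(parts[2])

# Drop the weekday name, then parse the rest, including the UTC offset.
# Note: %b depends on the locale; this assumes English month names.
stamp = sub("^[A-Za-z]+, ", "", pub_date)
when = as.POSIXct(strptime(stamp, "%d %b %Y %H:%M:%S %z", tz = "UTC"))

print(c(lat = lat, lon = lon))
print(when)
```
&lt;/pre&gt;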
With a little string processing to extract and convert lat/lon and date/time to appropriate data types, the data collection code is finished!</description><link>http://pleasepasstheroc.blogspot.com/2011/04/how-to-build-dataset-in-r-using-rss.html</link><author>noreply@blogger.com (Pass The ROC)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://1.bp.blogspot.com/-iT5f6kDxmCY/TbEyx5IrJaI/AAAAAAAAAr4/1itD1Q69dno/s72-c/Screen+shot+2011-04-22+at+5.47.27+PM.png" height="72" width="72"/><thr:total>5</thr:total></item><item><guid isPermaLink="false">tag:blogger.com,1999:blog-7662744293957770471.post-8264676432157175022</guid><pubDate>Thu, 21 Apr 2011 00:25:00 +0000</pubDate><atom:updated>2011-04-26T06:46:15.054-07:00</atom:updated><category domain="http://www.blogger.com/atom/ns#">R</category><title>How to Source an R script automatically on a Mac using Automator and iCal</title><description>I wrote an R script that pulled data from an RSS feed.&amp;nbsp; The RSS feed updated frequently, so I wanted to be able to schedule the script to run automatically.&amp;nbsp; After some tinkering, I got it to work by implementing the steps below.&amp;nbsp; Note that these steps assume you do not want to save your workspace - that you are saving the objects you need explicitly within the script.&amp;nbsp; &lt;br /&gt;
&lt;br /&gt;
&lt;table cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;float: right; margin-left: 1em; text-align: right;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;http://2.bp.blogspot.com/--weWAQ8gAKQ/Ta9vGUElCwI/AAAAAAAAArs/RykEnfl49zk/s1600/Screen+shot+2011-04-21+at+9.36.40+AM.png&quot; imageanchor=&quot;1&quot; style=&quot;clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; height=&quot;320&quot; src=&quot;http://2.bp.blogspot.com/--weWAQ8gAKQ/Ta9vGUElCwI/AAAAAAAAArs/RykEnfl49zk/s320/Screen+shot+2011-04-21+at+9.36.40+AM.png&quot; width=&quot;279&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Step 3: The Run Shell Script Action&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;ol&gt;&lt;li&gt;Test your R script in regular ole&#39; R to make sure it runs without error.&lt;/li&gt;
&lt;li&gt;Add a &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;quit(save=&quot;no&quot;)&lt;/span&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt; command at the end of your script.&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;li style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-size: small;&quot;&gt;Open Automator&amp;nbsp; (Applications -&amp;gt; Automator)&lt;/span&gt;&lt;/li&gt;
&lt;li style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-size: small;&quot;&gt;Select Application from the Template Selection menu&lt;/span&gt;&lt;/li&gt;
&lt;li style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-size: small;&quot;&gt;Find the Run Shell Script action, then double-click it or drag it over to the workflow.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Type:&amp;nbsp; &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;Rscript --no-save --no-restore /Users..../YourScript.R&lt;/span&gt;&lt;/li&gt;
&lt;ul&gt;&lt;li style=&quot;font-family: inherit;&quot;&gt;The last argument should be the path of the script file you want to run&lt;/li&gt;
&lt;li style=&quot;font-family: inherit;&quot;&gt;The --no-save and --no-restore options tell R not to save the workspace when it&#39;s done and not to restore previously saved objects on startup, respectively&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;To see a full list of arguments you can pass to Rscript, just open Terminal and type&lt;/span&gt; &lt;span style=&quot;font-family: &amp;quot;Courier New&amp;quot;,Courier,monospace;&quot;&gt;Rscript&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;li style=&quot;font-family: inherit;&quot;&gt;&lt;span style=&quot;font-family: inherit;&quot;&gt;You can test your &quot;Application&quot; now by hitting the Run button in the upper right corner.&amp;nbsp; Ignore the little robot&#39;s warning.&amp;nbsp; Hopefully, you&#39;ll see some green checkmarks and a &quot;Workflow completed&quot; message.&lt;/span&gt; &lt;/li&gt;
&lt;li style=&quot;font-family: inherit;&quot;&gt;Save your &quot;Application&quot; somewhere clever.&lt;/li&gt;
&lt;li style=&quot;font-family: inherit;&quot;&gt;Now, open iCal and create a new event.&amp;nbsp; Set the date, time, and repeat values as you wish.&amp;nbsp; Select alarm-&amp;gt;Open File.&lt;/li&gt;
&lt;li style=&quot;font-family: inherit;&quot;&gt;It will default to iCal; click iCal, select Other, and navigate to your &quot;Application,&quot; which you saved somewhere clever.&lt;/li&gt;
&lt;/ol&gt;That&#39;s it, you&#39;re done.&amp;nbsp; When the scheduled time comes, the script will run in the background without even opening R.&amp;nbsp; You may see a little gear up top on your menu bar, that&#39;s it. &amp;nbsp;</description><link>http://pleasepasstheroc.blogspot.com/2011/04/how-to-source-r-script-automatically-on.html</link><author>noreply@blogger.com (Pass The ROC)</author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="http://2.bp.blogspot.com/--weWAQ8gAKQ/Ta9vGUElCwI/AAAAAAAAArs/RykEnfl49zk/s72-c/Screen+shot+2011-04-21+at+9.36.40+AM.png" height="72" width="72"/><thr:total>3</thr:total></item></channel></rss>