<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atomfull.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" version="0.3" xml:lang="en">
	<title>Data Wrangling</title>
	<link rel="alternate" type="text/html" href="http://www.datawrangling.com" />
	<tagline>Machine Learning, Data Mining, and More</tagline>
	<modified>2009-10-06T02:01:05Z</modified>
	<copyright>Copyright 2009</copyright>
	<generator url="http://wordpress.org/" version="2.0.3">WordPress</generator>
			<link rel="start" href="http://feeds.feedburner.com/DataWrangling" type="application/atom+xml" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" /><entry>
	  	<author>
			<name>Peter Skomoroch</name>
		</author>
		<title type="text/html" mode="escaped"><![CDATA[Slides &#038; Thoughts from Hadoop World NYC]]></title>
		<link rel="alternate" type="text/html" href="http://www.datawrangling.com/slides-thoughts-from-hadoop-world-nyc" />
		<id>http://www.datawrangling.com/slides-thoughts-from-hadoop-world-nyc</id>
		<modified>2009-10-06T01:53:44Z</modified>
		<issued>2009-10-06T01:53:44Z</issued>
		
	<dc:subject>Python</dc:subject>
	<dc:subject>Ruby</dc:subject>
	<dc:subject>Web Mashups</dc:subject>
	<dc:subject>Amazon EC2</dc:subject>
	<dc:subject>mapreduce</dc:subject>
	<dc:subject>hadoop</dc:subject> 
		<summary type="text/plain" mode="escaped"><![CDATA[

Big data hackers, Apache Hadoop developers, and early adopters from several industries descended on the Roosevelt Hotel this weekend for Hadoop World NYC.  I gave a talk on rapid prototyping of data intensive web applications using Hadoop, Hive, Python, and Ruby on Rails.  The talk also had a few bits about using R [...]]]></summary>
		<content type="text/html" mode="escaped" xml:base="http://www.datawrangling.com/slides-thoughts-from-hadoop-world-nyc">&lt;p&gt;&lt;img src="http://datawrangling.s3.amazonaws.com/Hangover.png" alt="High level languages for MapReduce" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://dataspora.com/blog/sexy-data-geeks/"&gt;Big data hackers&lt;/a&gt;, &lt;a href="http://hadoop.apache.org/"&gt;Apache Hadoop&lt;/a&gt; developers, and early adopters from several industries descended on the Roosevelt Hotel this weekend for &lt;a href="http://www.cloudera.com/hadoop-world-nyc"&gt;Hadoop World NYC&lt;/a&gt;.  I gave a talk on rapid prototyping of data intensive web applications using Hadoop, Hive, Python, and Ruby on Rails.  The talk also had a few bits about using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; with Hadoop for statistical computing at scale.  The sessions were taped, so I&amp;#8217;ll update this post with a link to the video when it becomes available.&lt;/p&gt;

&lt;p&gt;&lt;center&gt;&lt;/p&gt;

&lt;div style='width:425px;text-align:left'&gt;&lt;object style='margin:0px' width='425' height='355'&gt;&lt;param name='movie' value='http://static.slideshare.net/swf/ssplayer2.swf?doc=trendingtopicstalk-091003125043-phpapp01&amp;#038;stripped_title=prototyping-data-intensive-apps-trendingtopicsorg' /&gt;&lt;param name='allowFullScreen' value='true'/&gt;&lt;param name='allowScriptAccess' value='always'/&gt;&lt;center&gt;&lt;embed src='http://static.slideshare.net/swf/ssplayer2.swf?doc=trendingtopicstalk-091003125043-phpapp01&amp;#038;stripped_title=prototyping-data-intensive-apps-trendingtopicsorg' type='application/x-shockwave-flash' allowscriptaccess='always' allowfullscreen='true' width='425' height='355'&gt;&lt;/embed&gt;&lt;/center&gt;&lt;/object&gt;&lt;/div&gt;

&lt;p&gt;&lt;/center&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://datawrangling.s3.amazonaws.com/trendingtopics_talk.pdf"&gt;slides&lt;/a&gt; give a high level overview of how I built the open source trend tracking site &lt;a href="http://www.trendingtopics.org"&gt;trendingtopics.org&lt;/a&gt; over a few weeks last June using Amazon EC2 and Cloudera tools.  The code for the site is on Github and the raw data it is powered by is available on &lt;a href="http://aws.amazon.com/publicdatasets/"&gt;Amazon Public Data Sets&lt;/a&gt;.  I&amp;#8217;ve also posted a series of tutorials related to trendingtopics on the Cloudera blog over the past few months:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.cloudera.com/blog/2009/07/31/tracking-trends-with-hadoop-and-hive-on-ec2/"&gt;Tracking Trends with Hadoop and Hive on EC2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.cloudera.com/hadoop-data-intensive-application-tutorial"&gt;Building a Data Intensive Web Application with Cloudera, Hadoop, Hive, Pig, and EC2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.cloudera.com/blog/2009/09/28/grouping-related-trends-with-hadoop-and-hive/"&gt;Grouping Related Trends with Hadoop and Hive&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are a few resources mentioned in the talk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://github.com/datawrangling/trendingtopics"&gt;Trendingtopics code on github&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.datawrangling.com/wikipedia-page-traffic-statistics-dataset"&gt;Wikipedia Page Traffic Statistics Dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://developer.amazonwebservices.com/connect/thread.jspa?threadID=32112&amp;amp;tstart=0"&gt;EMR Forum discussion about using R with Hadoop&lt;/a&gt; (scroll down for R code that runs on Twitter data)&lt;/li&gt;
&lt;li&gt;David Rosenberg&amp;#8217;s &lt;a href="http://cran.r-project.org/web/packages/HadoopStreaming/index.html"&gt;R Streaming package&lt;/a&gt; on CRAN&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data"&gt;How FlightCaster Squeezes Predictions from Flight Data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conference Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It felt like around half of the attendees of Hadoop World were developers or data hackers I know of &lt;a href="http://twitter.com/peteskomoroch/following"&gt;via Twitter&lt;/a&gt; or the Hadoop mailing lists.  This resulted in some decent Twitter coverage via the &lt;a href="http://search.twitter.com/search?q=%23hadoopworld"&gt;hadoopworld hash tag&lt;/a&gt;.  The other half of attendees represented enterprise IT, media companies, government, and financial firms who are either early adopters or interested in using Hadoop.&lt;/p&gt;

&lt;p&gt;Some interesting announcements were made in the morning.  Amazon added &lt;a href="http://aws.typepad.com/aws/2009/10/new-elastic-mapreduce-goodies-apache-hive-hadoop-studio-clouderas-hadoop-distribution.html"&gt;new features for the Elastic MapReduce service&lt;/a&gt;, including support for Hive, Cloudera&amp;#8217;s Hadoop Distribution, and integration with &lt;a href="http://www.hadoopstudio.org/"&gt;Karmasphere Studio&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cloudera&amp;#8217;s big news was the launch of &lt;a href="http://cloudera.com/desktop"&gt;Cloudera Desktop&lt;/a&gt;, a new web-based unified user interface for users and operators of Hadoop clusters.  Note that you can also run the &lt;a href="http://archive.cloudera.com/desktop/desktop_on_ec2.html"&gt;Cloudera Desktop on Amazon EC2&lt;/a&gt;.  Cloudera announced support for their distribution on Softlayer and Rackspace.  They also outlined new features in the &lt;a href="http://www.cloudera.com/blog/2009/09/10/cdh2-clouderas-distribution-for-hadoop-2/"&gt;latest Hadoop distribution (CDH2)&lt;/a&gt;, which includes support for HBase and Hadoop 0.20.1.&lt;/p&gt;

&lt;p&gt;Vertica &lt;a href="http://www.vertica.com/company/news/Vertica-announces-partnership-with-Cloudera-at-Hadoop-World"&gt;announced a partnership with Cloudera&lt;/a&gt;, which is an interesting development considering the &lt;a href="http://databasecolumn.vertica.com/2008/01/mapreduce-a-major-step-back.html"&gt;RDBMS vs. MapReduce&lt;/a&gt; debates that took place last year.&lt;/p&gt;

&lt;p&gt;I think I actually spent more time talking data with fellow hackers like &lt;a href="http://twitter.com/i2pi"&gt;Joshua Reich&lt;/a&gt; and &lt;a href="http://www.hilarymason.com/"&gt;Hilary Mason&lt;/a&gt; than I did in the talks, but still managed to catch some good ones by the eHarmony team, &lt;a href="http://twitter.com/stuartsierra"&gt;Stuart Sierra&lt;/a&gt;, &lt;a href="http://mndoci.com/"&gt;Deepak Singh&lt;/a&gt;, and several Yahoo people.  As a big Python user, it was exciting to hear that &lt;a href="http://www.jakehofman.com/"&gt;Jake Hofman&lt;/a&gt; from Yahoo! Research, NY plans on an open source release of a Python based Social Network Library for Hadoop, which he used to generate the &lt;a href="http://bit.ly/hadoopworldjmh"&gt;Twitter analysis in his talk&lt;/a&gt;.  A big theme in my talk and others I attended was the use of high level languages on top of Hadoop to accelerate development.  Most of the teams I talked to actively use multiple abstractions on top of Hadoop, including Pig, Hive, Clojure, or other languages like Python through Hadoop Streaming.&lt;/p&gt;

&lt;p&gt;For further details check out these notes from other attendees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://twitter.com/atveit"&gt;Amund Tveit&lt;/a&gt; has comprehensive notes of the &lt;a href="http://atbrox.com/2009/10/02/hadoop-world-2009-some-notes-from-morning-session/"&gt;morning&lt;/a&gt; and &lt;a href="http://atbrox.com/2009/10/03/hadoop-world-2009-notes-from-application-session/"&gt;afternoon Hadoop World sessions&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;The HubSpot team has two posts: &lt;a href="http://dev.hubspot.com/bid/27047/Hadoop-World-NYC-2009"&gt;Hadoop World 2009&lt;/a&gt; and &lt;a href="http://dev.hubspot.com/bid/27054/Hadoop-World-impressions"&gt;Hadoop World Impressions&lt;/a&gt;  &lt;/li&gt;
&lt;li&gt;Hilary Mason wrote up &lt;a href="http://www.hilarymason.com/blog/hadoop-world-nyc/"&gt;some observations on her blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Deepak Singh, who presented on Hadoop in Bioinformatics, gives &lt;a href="http://mndoci.com/2009/10/03/post-hadoop-world-thoughts/"&gt;his perspective on the conference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
		<entry>
	  	<author>
			<name>Peter Skomoroch</name>
		</author>
		<title type="text/html" mode="escaped"><![CDATA[How FlightCaster Squeezes Predictions from Flight Data]]></title>
		<link rel="alternate" type="text/html" href="http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data" />
		<id>http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data</id>
		<modified>2009-08-24T13:36:40Z</modified>
		<issued>2009-08-24T13:36:40Z</issued>
		
	<dc:subject>Machine learning</dc:subject>
	<dc:subject>Data mining</dc:subject>
	<dc:subject>Amazon EC2</dc:subject>
	<dc:subject>mapreduce</dc:subject>
	<dc:subject>hadoop</dc:subject>
	<dc:subject>Clojure</dc:subject> 
		<summary type="text/plain" mode="escaped"><![CDATA[During the last several years, an increasing number of systems within government and industry have been collecting massive amounts of raw data which often sits untapped in large data warehouses.  FlightCaster strikes me as a great example of the next generation of web applications that will leverage that data: bootstrapped startups that apply machine [...]]]></summary>
		<content type="text/html" mode="escaped" xml:base="http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data">&lt;p&gt;&lt;a href="http://www.flightcaster.com/"&gt;&lt;img hspace=5 vspace=5 align=right src="http://datawrangling.s3.amazonaws.com/bg-logo.gif" alt="FlightCaster" /&gt;&lt;/a&gt;During the last several years, an increasing number of systems within government and industry have been collecting &lt;a href="http://delicious.com/pskomoroch/dataset"&gt;massive amounts of raw data&lt;/a&gt; which often sits untapped in large data warehouses.  &lt;a href="http://www.flightcaster.com/"&gt;FlightCaster&lt;/a&gt; strikes me as a great example of &lt;a href="http://mndoci.com/2009/08/03/making-sense-of-all-that-data/"&gt;the next generation of web applications&lt;/a&gt; that will leverage that data: bootstrapped startups that apply machine learning and data processing at scale to solve a focused problem people actually care about.&lt;/p&gt;

&lt;p&gt;From &lt;a href="http://www.flightcaster.com/about"&gt;the site&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;&amp;#8220;FlightCaster predicts flight delays. We use an advanced algorithm that scours data on every domestic flight for the past 10-years and matches it to real-time conditions.  We help you evaluate alternative options and help connect you to the right person to make the change.&amp;#8221;&lt;/blockquote&gt;

&lt;p&gt;FlightCaster uses data from:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;Bureau of Transportation Statistics&lt;/li&gt;
    &lt;li&gt;FAA Air Traffic Control System Command Center&lt;/li&gt;
    &lt;li&gt;FlightStats&lt;/li&gt;
    &lt;li&gt;National Weather Service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I&amp;#8217;ve been following the data crunching exploits of &lt;a href="http://twitter.com/bradfordcross"&gt;Bradford Cross&lt;/a&gt; on Twitter, and the launch of FlightCaster seemed like a great opportunity for an &amp;#8220;in the trenches&amp;#8221; interview on building a machine learning application with &lt;a href="http://rubyonrails.org/"&gt;Rails&lt;/a&gt; &amp;amp; &lt;a href="http://en.wikipedia.org/wiki/Hadoop"&gt;Hadoop&lt;/a&gt;.  During the interview on FlightCaster, Brad describes some of the challenges of working with flight data, statistical approaches for flight prediction, false negatives in FlightCaster, &lt;a href="http://clojure.org/"&gt;Clojure&lt;/a&gt;, Hadoop &amp;amp; &lt;a href="http://aws.amazon.com/ec2/"&gt;Amazon EC2&lt;/a&gt;, &lt;a href="http://ycombinator.com/"&gt;YCombinator&lt;/a&gt;, and lots more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Of the 9? people at Flightcaster, how did the roles break down?  How did you get started on the problem?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;People at YC make fun of us a lot on account of our &lt;a href="http://www.flightcaster.com/team"&gt;monolithic team&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We have a core team of 5 founders, which breaks down as follows: a CEO, an air travel domain expert, two engineers working on web service, apps, and production issues, and me for research.&lt;/p&gt;

&lt;p&gt;I have a secret agent collaborator working with me who has been very helpful with research and scalable compute infrastructure.&lt;/p&gt;

&lt;p&gt;We also have a few other engineers doing a mix of stuff.  This is a speed thing.  We wanted to launch an iPhone app, BlackBerry app, and website all at the same time.  We built it all very fast.&lt;/p&gt;

&lt;p&gt;We got started on the problem thanks to Evan Konwiser and his peculiar and endearing obsession with the commercial air travel industry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Did you have someone on the team with domain expertise who was familiar with the flight or weather data sources?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Absolutely, that is Evan Konwiser.  He has been instrumental, and not just for familiarity with the data.&lt;/p&gt;

&lt;p&gt;The airline industry is a very idiosyncratic industry.  It would be hard to learn a lot of the subtle domain logic via induction alone.  Our machine learning approach uses a mix of analytical and inductive learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;
&lt;em&gt;What was the biggest challenge in working with the public flight and weather data?  Is there a dataset that isn&amp;#8217;t out there right now that might make predicting flight delays easier for you?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The public data set that we use is the &amp;#8220;on-time database&amp;#8221; published by the FAA.  The data set is tricky to get all in one place since the FAA does not provide any decent API to it.  The biggest issue is that we make real time predictions, so we needed a historical set of captured real time data, which we had to create ourselves.&lt;/p&gt;

&lt;p&gt;Having a more amalgamated real time dataset going back historically for a decade would be a big help.   Having more modernized ways of accessing the data would be helpful.&lt;/p&gt;

&lt;p&gt;Until then, if anyone wants to buy it, we will sell it to them for a very high price &amp;#8230; and to sweeten, they must throw in a few obscure, expensive machine learning books that I have on my Amazon wish list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;I&amp;#8217;ve been a big fan of using the combination of &lt;a href="http://www.cloudera.com/blog/2009/07/31/tracking-trends-with-hadoop-and-hive-on-ec2/"&gt;Rails, Hadoop, and Amazon EC2&lt;/a&gt; along with a high level language (in your case Clojure).  Any tips for people out there thinking of using a similar technology stack?  How cost effective is running Hadoop on EC2 for you?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building layer upon layer of abstraction is a big key.  On the JVM, you have to do this; it is the path around the verbosity of Java and the vast abyss of poorly done APIs.  You just keep searching until you finally find the folks who have built a sane, high level API on top of the thing you want to use - then you wrap it in a high level language like &lt;a href="http://en.wikipedia.org/wiki/Clojure"&gt;Clojure&lt;/a&gt;.   The technical term for this is &amp;#8220;wrap the crap.&amp;#8221;&lt;/p&gt;

&lt;p&gt;In our case, we use &lt;a href="http://www.cascading.org/"&gt;Cascading&lt;/a&gt; as our step up in abstraction on top of Hadoop.&lt;/p&gt;

&lt;p&gt;S3 -&gt; EC2 -&gt; &lt;a href="http://www.cloudera.com/"&gt;Cloudera&lt;/a&gt; -&gt; HDFS -&gt; Hadoop -&gt; Cascading -&gt; Clojure.  I&amp;#8217;m not sure if those layers are in exactly the right order, but you get the point.  The key is to keep layering until you encapsulate the plumbing and get to the level of abstraction that lets you focus on solving your problem.&lt;/p&gt;

&lt;p&gt;Running Hadoop on EC2 has been very cost effective.  The biggest issues have come into play with the disconnect between Hadoop and &lt;a href="http://aws.amazon.com/s3/"&gt;S3&lt;/a&gt;.  S3 expects open connections to keep reading, and if they don&amp;#8217;t, S3 terminates them.  S3 is very much the Arnold of the distributed file system world.  So if your Hadoop jobs are compute intensive, and they are buffering in data in a lazy loading fashion, they tend to lose the connection to S3 during long processing phases.  We&amp;#8217;ve worked around this with some hackery, and we are working with &lt;a href="http://chris.wensel.net/"&gt;Chris Wensel&lt;/a&gt; (of Cascading fame) on a more industrial strength solution to the problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;There are many existing machine learning and statistical computing packages for R, Python, Java, C++, why did you choose to go with Clojure?  What were the pros/cons of that approach?  Do you use any of those other tools for prototyping or visualization?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Over the years, I have found that, in practice, the statistical and machine learning code is not the big thing to worry about.  That code is fun to write, and often you want to tweak and extend the libraries in ways that they have not been designed for.  So we use libraries and frameworks for the basics where we can, but we are OK to implement statistical and machine learning algorithms ourselves.  I&amp;#8217;m quite experienced with this anyway; all the way down to efficient custom data structures built on arrays.&lt;/p&gt;

&lt;p&gt;The bigger problem to worry about is in the title of your site; the data wrangling.  Especially pre-processing (filtering, transforming, etc) and the general fault-tolerant distributed compute infrastructure.  I worked on this sort of thing during my time at Google, and it is far more complex than it seems.  It is easy to grasp the concepts, and get an initial implementation, but the edge cases and last mile issues with respect to the fault tolerance are where you take the hit.&lt;/p&gt;

&lt;p&gt;Hadoop is a wonderful solution for distributed computation, and since our code is all purely functional, it is very natural for us.&lt;/p&gt;

&lt;p&gt;Clojure is ideal for both the data wrangling and the statistical and machine learning code.  Also, Clojure plays nice with everything on the JVM, so we wrap and use lots of libs from the Java world.  Put this together with the distributed compute infrastructure that you get with Hadoop, and it starts to make a lot more sense to build these systems in Clojure on the JVM and use Hadoop than it does to use R, Python, etc.&lt;/p&gt;

&lt;p&gt;That said, if you just need to do quick and dirty prototypes, or don&amp;#8217;t have the need or the option to invest in infrastructure for distributed computing, R and SciPy are probably still the place to turn.&lt;/p&gt;

&lt;p&gt;At Google, the research scientists prototype in Python and R, and then port to C++ for the real scalable MapReduce runs.  We prototype and run in production on the same language and platform, and although it is not as fast as Google&amp;#8217;s C++ infrastructure, we do have the benefit that Clojure is very high level.  For large runs, we parallelize with Hadoop, and we just run smaller tasks locally as if our Hadoop infrastructure were not even there.  We are not coupled to it, and yet we don&amp;#8217;t need to port code or do anything special to run it in distributed mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Can you talk more about how you are using Hadoop for Flightcaster?  Are most of the jobs I/O intensive, preprocessing a large volume of historical input data, or are they more often cross-validation or simulation runs?  Do you tweak your live models then re-train against all the historical data with Hadoop?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our jobs are CPU intensive - we do a lot of computation per unit of data, even in our data transformation jobs.&lt;/p&gt;

&lt;p&gt;We train and test all our models offline, on captured real time data.&lt;/p&gt;

&lt;p&gt;Eventually we would like to move to a more incremental online learning approach, but for now we re-train and re-test against newly captured data and re-deploy the newly trained model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Any tips for handling bad data/records in large MapReduce jobs?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We&amp;#8217;ve been talking to Chris Wensel, and Cascading has some really cool stuff for this in the form of &amp;#8220;filters&amp;#8221; that we are not leveraging yet.  We do a lot of filtering and scrubbing of our own during the preprocessing phase, but we will be looking at what Cascading can do for us here very soon.&lt;/p&gt;

&lt;p&gt;The biggest lesson we have learned about bad data is that we should spend more time up front with visualizing the data and making assertions (or filtering via predicates) to verify that our assumptions about the data hold.&lt;/p&gt;

&lt;p&gt;I have had these issues in the past with financial data, but not as bad.  Can a flight land before it takes off?  Are there 68 hours in a day?  These are examples of data that is not malformed or missing, but that violates fundamental properties that you expect to hold true, so they can be trickier to catch.&lt;/p&gt;

&lt;p&gt;The big problem with not spending more time on this up front is that these issues are expensive to catch downstream because you will be looking through non-intuitive results of complex analysis jobs and it takes a long time to track down the root of such issues.  It is similar to the argument for having good unit tests; it is cheaper to find issues at that level than at the system test level.  It is also more sanity-preserving.&lt;/p&gt;
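
&lt;p&gt;The kind of up-front assertions described above can be sketched as a few predicates applied during preprocessing.  The Python below is only an illustration: the record layout (a dict with departure and arrival timestamps), the 24 hour ceiling, and every function name are assumptions made for the example, not FlightCaster code, which is written in Clojure.&lt;/p&gt;

&lt;pre&gt;
```python
from datetime import timedelta

# Hypothetical record layout: each record is a dict holding "departure" and
# "arrival" datetime values. These field names are assumptions.


def lands_after_takeoff(record):
    """A flight cannot land before it takes off."""
    return record["arrival"] > record["departure"]


def plausible_duration(record, max_hours=24):
    """Reject elapsed times longer than an assumed ceiling (default 24 hours)."""
    elapsed = record["arrival"] - record["departure"]
    return timedelta(hours=max_hours) >= elapsed


PREDICATES = [lands_after_takeoff, plausible_duration]


def split_records(records, predicates=PREDICATES):
    """Separate clean records from ones that violate any predicate.

    Returns (good, bad), where bad is a list of (record, failed_names)
    pairs so that violations can be counted and inspected per rule.
    """
    good, bad = [], []
    for record in records:
        failed = [p.__name__ for p in predicates if not p(record)]
        if failed:
            bad.append((record, failed))
        else:
            good.append(record)
    return good, bad
```
&lt;/pre&gt;

&lt;p&gt;Keeping the rejected records together with the names of the rules they broke makes it cheap to check each assumption before any expensive analysis job runs.&lt;/p&gt;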

&lt;p&gt;&lt;strong&gt;&lt;em&gt;You probably can&amp;#8217;t say much about your secret sauce or algorithms, but can you discuss some general issues you encounter handling conflicting information sources or incomplete data?   Do ensemble approaches or online learning play a role?&lt;br /&gt;
&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As I mentioned in a previous response, we want to get into more of an incremental online learning approach, and I think we will be able to soon.  Of course this means certain approaches are in and others are out, but that may be OK in our case.&lt;/p&gt;

&lt;p&gt;As of right now, we are using a combination of analytical and inductive learning.  We compose individual classifiers that are themselves composed of rich domain logic.  This is not an ensemble approach, though we may head in that direction very soon.&lt;/p&gt;

&lt;p&gt;What we are doing now is more of a network-of-classifiers approach.  It is a bit of a strange beast right now, so we would be remiss to call it a Bayesian network.  Maybe we could call it a Cardano network in honor of Gerolamo Cardano.  It is a bit of an eccentric network that is ahead of its time but waiting for some formalism to come along and straighten it out.&lt;/p&gt;

&lt;p&gt;To a large extent, our current approach is an artifact of how quickly we have built our initial model.  We have these rich individual domain specific classifiers that are targeted at different features from different data sources, and the approach we are using to learn the composition of these individual classifiers is an area of active research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;What is your general style for attacking prediction problems?  Do you start with a single data source and get the entire pipeline running with a simple model, or do you dive in building a more complex model with multiple data sources?  How did Flightcaster compare to financial prediction?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I like to let simpler cases drive out a lot of the demand for infrastructure early on.  As you suggest, I tend to start by getting something working end-to-end with a simple model and a single data source.  Then you can start to evolve faster,  join in other data sources, and try more deeply theoretical ideas; all on top of a quality infrastructure.&lt;/p&gt;

&lt;p&gt;That said, some infrastructure is not required until you get to more complex models, so it is always an evolutionary process where model drives infrastructure.&lt;/p&gt;

&lt;p&gt;I made a lot of mistakes early in my career in building trading models where I let my theories get too far ahead of what I could really test in practice.  That is not a good place to be.  Unfortunately, this is an easy mistake to make.&lt;/p&gt;

&lt;p&gt;Flightcaster has been very different from my work in investing in a number of ways.  FlightCaster makes predictions by turning the probability distribution estimation problem into a k-way classification problem.&lt;/p&gt;

&lt;p&gt;This is a simpler problem than designing holistic investment strategies.&lt;/p&gt;

&lt;p&gt;In investing, you have to answer many questions.  Selecting the point at which buying and selling occurs is akin to predictive problems in other domains.  However, this only answers the question of when to buy or sell.&lt;/p&gt;

&lt;p&gt;You also have to decide what to buy or sell - which is called portfolio selection.  You also have to decide how much to buy or sell - which is called portfolio allocation.  There are also the topics that I like to call risk and exposure accommodation.&lt;/p&gt;

&lt;p&gt;The latter subjects of portfolio selection, portfolio allocation, and risk and exposure accommodation are arguably more important than the former subject of timing or predicting entry and exit points.  Parameter sensitivity analysis tends to show that return and risk-return metrics are less sensitive to variability in the entry-exit approach as compared with portfolio selection, portfolio allocation, and risk accommodation approaches.&lt;/p&gt;

&lt;p&gt;These aspects of financial modeling make it significantly more difficult, and they also seem to largely account for the repeated demise of many so-called &amp;#8220;quantitative trading&amp;#8221; operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;
&lt;em&gt;Can you say anything about decision points and false alarm rates in travel delay prediction? It seems like there is a big penalty (risk) if somebody misses a flight based on a bad prediction, but a minor annoyance if the flight is delayed and you don&amp;#8217;t predict it correctly.   Will you guys publish some ROC curves at some point?  &lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are always two key metrics in learning - in IR they are called precision and recall.  We seem to be at somewhere around 85% precision and 60% recall.  So discovering more delays is where we need to focus, and our false positive delay predictions are not the big issue for us right now.&lt;/p&gt;

&lt;p&gt;We compute confusion matrices, and use them to derive precision, recall, and our false positive and false negative rates.&lt;/p&gt;

&lt;p&gt;When you turn numerical probability distribution estimation into k-way classification, one side effect is that not all false positives are equal.  If we say you will be delayed by over an hour, but you are delayed by 45 minutes, that is not bad at all.  But if we say you will be delayed by over an hour and you are on time, that is much worse.  So there is a distance aspect to false positives.&lt;/p&gt;

&lt;p&gt;We don&amp;#8217;t see a lot of bad false positives now, that is, we don&amp;#8217;t appear to tell people they will experience a long delay when in fact they will be on time.  The bigger issue seems to be our recall - there are a lot of delays that we are just not able to detect yet.  Alternatively stated, we have a problem with false negatives in the on time class.&lt;/p&gt;

&lt;p&gt;We also have a strong temporal element to all this.  How do these numbers change as we get closer or further away from your departure time?  We&amp;#8217;re working on this analysis right now, and we do hope to publish a lot of useful metrics soon.&lt;/p&gt;
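
&lt;p&gt;As a rough illustration of the metrics discussed in this answer, the Python sketch below bins delays into ordered classes, builds a confusion matrix, derives precision and recall for the question of whether a flight is delayed at all, and measures how far over-predictions miss.  The three classes, the 15 and 60 minute cutoffs, and all function names are assumptions chosen for the example; FlightCaster has not published its actual bins or code.&lt;/p&gt;

&lt;pre&gt;
```python
from collections import Counter

# Assumed delay classes, ordered by severity (illustrative only).
CLASSES = ["on_time", "under_1h", "over_1h"]


def to_class(delay_minutes, on_time_cutoff=15, long_cutoff=60):
    """Bin a delay in minutes into one of the ordered classes above."""
    if delay_minutes >= long_cutoff:
        return 2
    if delay_minutes >= on_time_cutoff:
        return 1
    return 0


def confusion_matrix(actual, predicted, k=len(CLASSES)):
    """matrix[a][p] counts flights of actual class a predicted as class p."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in range(k)] for a in range(k)]


def delay_precision_recall(matrix):
    """Precision and recall for "delayed at all" (any class above 0)."""
    k = len(matrix)
    true_pos = sum(matrix[a][p] for a in range(1, k) for p in range(1, k))
    false_pos = sum(matrix[0][p] for p in range(1, k))
    false_neg = sum(matrix[a][0] for a in range(1, k))
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


def mean_overprediction_distance(matrix):
    """Average number of classes by which over-predictions (p > a) miss."""
    k = len(matrix)
    misses = [(p - a, matrix[a][p]) for a in range(k) for p in range(a + 1, k)]
    total = sum(n for _, n in misses)
    return sum(d * n for d, n in misses) / total if total else 0.0
```
&lt;/pre&gt;

&lt;p&gt;Predicting over an hour for a flight that arrives 45 minutes late is an over-prediction of distance 1 in this scheme, while predicting over an hour for an on-time flight has distance 2, which is one simple way to express that not all false positives are equal.&lt;/p&gt;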

&lt;p&gt;&lt;strong&gt;&lt;em&gt;How was it working with YCombinator? &lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://ycombinator.com/"&gt;YCombinator&lt;/a&gt; is amazing.  The people are all great; both YC founders and the people they invest in.  The exposure we have gotten from being part of the program is itself worth the equity stake they take in the company.  YC has put together a really fascinating business model.  I would like to be an investor in YC.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://tlb.org/"&gt;Trevor Blackwell&lt;/a&gt; is a YC founder, friend, and maker of fine quality luxury robots.  Actually, his big new thing is telepresence robots. Check out the bots at anybots.com.&lt;/p&gt;

&lt;p&gt;Trevor is the one who lured me into working with the Flightcaster team.  I hadn&amp;#8217;t met Paul Graham and his wife Jessica before YC - they totally rock!&lt;/p&gt;

&lt;p&gt;It has been an amazing experience and I would encourage anyone to apply; especially if you want to build something that solves a real problem.&lt;/p&gt;
</content>
	</entry>
		<entry>
	  	<author>
			<name>Peter Skomoroch</name>
		</author>
		<title type="text/html" mode="escaped"><![CDATA[Wikipedia Page Traffic Statistics Dataset]]></title>
		<link rel="alternate" type="text/html" href="http://www.datawrangling.com/wikipedia-page-traffic-statistics-dataset" />
		<id>http://www.datawrangling.com/wikipedia-page-traffic-statistics-dataset</id>
		<modified>2009-06-11T23:22:33Z</modified>
		<issued>2009-06-11T23:22:33Z</issued>
		
	<dc:subject>Data mining</dc:subject>
	<dc:subject>Amazon EC2</dc:subject>
	<dc:subject>dataset</dc:subject> 
		<summary type="text/plain" mode="escaped"><![CDATA[

I&#8217;ve published a Wikipedia Page Traffic Data Set containing a 320 GB sample of the data used to power trendingtopics.org (I&#8217;ll talk about Trending Topics more in an upcoming post). The EBS snapshot includes 7 months of hourly page traffic statistics for over 8 million Wikipedia articles (~ 1 TB uncompressed) along with the associated [...]]]></summary>
		<content type="text/html" mode="escaped" xml:base="http://www.datawrangling.com/wikipedia-page-traffic-statistics-dataset">&lt;p&gt;&lt;a href="http://www.trendingtopics.org/"&gt;&lt;img src="http://datawrangling.s3.amazonaws.com/wikipedia_pageviews_lost_episodes.png"&gt;&lt;/img&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I&amp;#8217;ve published a &lt;a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2596"&gt;Wikipedia Page Traffic Data Set&lt;/a&gt; containing a 320 GB sample of the data used to power &lt;a href="http://www.trendingtopics.org"&gt;trendingtopics.org&lt;/a&gt; (I&amp;#8217;ll talk about Trending Topics more in an upcoming post). The EBS snapshot includes 7 months of hourly page traffic statistics for over 8 million Wikipedia articles (~ 1 TB uncompressed) along with the associated Wikipedia content, linkgraph, and metadata.  The English Wikipedia subset contains ~2.5 million articles.&lt;/p&gt;

&lt;p&gt;It only takes a couple of minutes to sign up for an &lt;a href="http://aws.amazon.com/ec2/"&gt;Amazon EC2 account&lt;/a&gt; and set up access to the data as an &lt;a href="http://aws.amazon.com/ebs/"&gt;EBS volume&lt;/a&gt; from the &lt;a href="https://console.aws.amazon.com/"&gt;Amazon Management Console&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to work entirely from the command line, you will need to complete the steps in the &lt;a href="http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/"&gt;Getting Started Guide&lt;/a&gt;.  When you are set up to use EC2, launch a small EC2 Ubuntu instance from your local machine:&lt;/p&gt;

&lt;pre&gt;
    $ ec2-run-instances ami-5394733a -k gsg-keypair -z us-east-1a
&lt;/pre&gt;

&lt;p&gt;Once it is running and you have the instance id, create and attach an EBS volume using the public snapshot snap-753dfc1c.  Make sure the volume is created in the same availability zone as the EC2 instance, and substitute your own volume and instance ids for the ones shown here:&lt;/p&gt;

&lt;pre&gt;
    $ ec2-create-volume --snapshot snap-753dfc1c -z us-east-1a
    $ ec2-attach-volume vol-ec06ea85 -i i-df396cb6 -d /dev/sdf
&lt;/pre&gt;

&lt;p&gt;Next, ssh into the instance and mount the volume:&lt;/p&gt;

&lt;pre&gt;
    $ ssh root@ec2-12-xx-xx-xx.z-1.compute-1.amazonaws.com
    root@domU-12-xx-xx-xx-75-81:/mnt# mkdir /mnt/wikidata
    root@domU-12-xx-xx-xx-75-81:/mnt# mount /dev/sdf /mnt/wikidata
&lt;/pre&gt;

&lt;p&gt;See the README files in each subdirectory for more details on these datasets&amp;#8230;&lt;/p&gt;

&lt;h3&gt;Wikistats&lt;/h3&gt;

&lt;p&gt;The good stuff is sitting in roughly 5,000 files in /mnt/wikidata/wikistats/pagecounts/&lt;/p&gt;

&lt;pre&gt;
    /mnt/wikidata/wikistats/pagecounts# ls -l | wc -l
    5068
    /mnt/wikidata/wikistats/pagecounts# ls -lh |head
    total 260G
    -rw-r--r-- 1 root root  49M 2009-02-26 13:34 pagecounts-20081001-000000.gz
    -rw-r--r-- 1 root root  46M 2009-02-26 13:34 pagecounts-20081001-010000.gz
    -rw-r--r-- 1 root root  47M 2009-02-26 13:34 pagecounts-20081001-020000.gz
    -rw-r--r-- 1 root root  44M 2009-02-26 13:34 pagecounts-20081001-030000.gz
    -rw-r--r-- 1 root root  45M 2009-02-26 13:34 pagecounts-20081001-040000.gz
    -rw-r--r-- 1 root root  47M 2009-02-26 13:35 pagecounts-20081001-050001.gz
    -rw-r--r-- 1 root root  45M 2009-02-26 13:35 pagecounts-20081001-060000.gz
    -rw-r--r-- 1 root root  50M 2009-02-26 13:35 pagecounts-20081001-070000.gz
    -rw-r--r-- 1 root root  51M 2009-02-26 13:35 pagecounts-20081001-080000.gz
&lt;/pre&gt;

&lt;p&gt;This directory contains hourly Wikipedia article traffic logs covering the 7 month period from October 01 2008 to April 30 2009.  This data is regularly &lt;a href="http://dammit.lt/2007/12/10/wikipedia-page-counters/"&gt;logged from the Wikipedia squid proxy&lt;/a&gt; by &lt;a href="http://dammit.lt/"&gt;Domas Mituzas&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each log file is named with the date and time of collection: pagecounts-20090430-230000.gz&lt;/p&gt;

&lt;p&gt;Each line has 4 space-separated fields:&lt;/p&gt;

&lt;pre&gt;projectcode, pagename, pageviews, bytes&lt;/pre&gt;

&lt;pre&gt;
    en Barack_Obama 997 123091092
    en Barack_Obama%27s_first_100_days 8 850127
    en Barack_Obama,_Jr 1 144103
    en Barack_Obama,_Sr. 37 938821
    en Barack_Obama_%22HOPE%22_poster 4 81005
    en Barack_Obama_%22Hope%22_poster 5 102081
&lt;/pre&gt;
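
&lt;p&gt;A minimal sketch of reading these records in Python 3, using two of the sample lines above.  The names PageCount and parse_pagecount_line are made up for illustration, and the sketch assumes page names are percent-encoded and never contain spaces:&lt;/p&gt;

&lt;pre&gt;
```python
from collections import namedtuple
from urllib.parse import unquote

# Hypothetical record type for one line of a pagecounts file.
PageCount = namedtuple("PageCount", "project page views bytes_sent")


def parse_pagecount_line(line):
    """Split one space-separated pagecounts record into typed fields."""
    project, page, views, bytes_sent = line.split()
    # Page names are percent-encoded in the logs, e.g. %27 is an apostrophe.
    return PageCount(project, unquote(page), int(views), int(bytes_sent))


sample = [
    "en Barack_Obama 997 123091092",
    "en Barack_Obama%27s_first_100_days 8 850127",
]
records = [parse_pagecount_line(line) for line in sample]
total_views = sum(r.views for r in records)
print(records[1].page, total_views)
```
&lt;/pre&gt;

&lt;p&gt;The same function can be applied line by line to a file opened with gzip.open in text mode, so the hourly logs never need to be decompressed on disk.&lt;/p&gt;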

&lt;h3&gt;Wikilinks (1.1G)&lt;/h3&gt;

&lt;p&gt;Contains a &lt;a href="http://users.on.net/~henry/home/wikipedia.htm"&gt;wikipedia linkgraph dataset&lt;/a&gt; provided by &lt;a href="http://haselgrove.id.au/"&gt;Henry Haselgrove&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These files contain all links between proper English language Wikipedia pages, that is, pages in &amp;#8220;namespace 0&amp;#8221;. This includes disambiguation pages and redirect pages.&lt;/p&gt;

&lt;p&gt;In links-simple-sorted.txt, there is one line for each page that has links from it. The format of the lines is ready for processing by &lt;a href="http://hadoop.apache.org/core/"&gt;Hadoop&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;
    from1: to11 to12 to13 ...
    from2: to21 to22 to23 ...
    ...
&lt;/pre&gt;

&lt;p&gt;where from1 is an integer labelling a page that has links from it, and to11 to12 to13 &amp;#8230; are integers labelling all the pages that the page links to. To find the page title that corresponds to integer n, just look up the n-th line in the file titles-sorted.txt.&lt;/p&gt;
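
&lt;p&gt;As a rough illustration of that format, here is a Python 3 sketch that parses link lines and resolves ids to titles.  The function names and the three-page sample are made up, and it assumes page numbering starts at 1, so page n is the n-th line of titles-sorted.txt:&lt;/p&gt;

&lt;pre&gt;
```python
def parse_link_line(line):
    """Turn 'from: to1 to2 ...' into (from_id, [to_ids])."""
    source, _, targets = line.partition(":")
    return int(source), [int(t) for t in targets.split()]


def title_for(page_id, titles):
    """Look up a title, assuming page n sits on line n (counting from 1)."""
    return titles[page_id - 1]


# Tiny made-up stand-ins for links-simple-sorted.txt and titles-sorted.txt.
links = ["1: 2 3", "3: 1"]
titles = ["Alpha", "Beta", "Gamma"]

graph = dict(parse_link_line(line) for line in links)
out_titles = [title_for(n, titles) for n in graph[1]]
print(out_titles)
```
&lt;/pre&gt;

&lt;p&gt;For the real files, the titles list would be too large to rebuild on every lookup, so loading it once or indexing it is the more practical choice.&lt;/p&gt;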

&lt;h3&gt;Wikidump (29G)&lt;/h3&gt;

&lt;p&gt;Contains the &lt;a href="http://en.wikipedia.org/wiki/Wikipedia_database"&gt;raw Wikipedia dumps&lt;/a&gt; from March along with some processed versions of the data.  One of the files I created, page_lookup_redirects.txt, provides a direct lookup table for Wikipedia article redirects, which is useful for name standardization and search.&lt;/p&gt;

&lt;p&gt;Here is a sample query run when the file is loaded into MySQL:&lt;/p&gt;

&lt;pre&gt;
   mysql&gt; select redirect_title, true_title from page_lookups
               where page_id = 534366;
   +------------------------------------------------+--------------+
   | redirect_title                                 | true_title   |
   +------------------------------------------------+--------------+
   | Barack_Obama                                   | Barack Obama |
   | Barak_Obama                                    | Barack Obama |
   | 44th_President_of_the_United_States            | Barack Obama |
   | Barach_Obama                                   | Barack Obama |
   | Senator_Barack_Obama                           | Barack Obama | 
                          .....                           .....         

   | Rocco_Bama                                     | Barack Obama |
   | Barack_Obama's                                 | Barack Obama | 
   | B._Obama                                       | Barack Obama |
   +------------------------------------------------+--------------+
   110 rows in set (11.15 sec)    
&lt;/pre&gt;

&lt;p&gt;The raw Wikipedia dump file latest-pages-articles.xml was also post-processed using &lt;a href="http://meta.wikimedia.org/wiki/Xml2sql"&gt;xml2sql&lt;/a&gt; to produce a set of tab delimited text files for use with Hadoop and other tools:&lt;/p&gt;

&lt;pre&gt;
692M page.txt
115M redirect.txt
987M revision.txt
17G text.txt
&lt;/pre&gt;

&lt;p&gt;The corresponding namespace0 files were created by filtering page.txt and redirect.txt on the tab-delimited namespace column as follows:&lt;/p&gt;

&lt;pre&gt;
# grep -P '^[0-9]*\t0\t' page.txt &gt; page_namespace0.txt
# grep -P '^[0-9]*\t0\t' redirect.txt &gt; redirect_namespace0.txt
&lt;/pre&gt;
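
&lt;p&gt;A comparable filter can be sketched in Python 3 for use inside a Hadoop Streaming mapper or a quick script.  The function name and the sample rows are made up, and the sketch assumes the first tab-separated column is a numeric id and the second is the namespace:&lt;/p&gt;

&lt;pre&gt;
```python
def in_namespace0(line):
    """True when a numeric id is followed by a namespace field equal to 0."""
    fields = line.split("\t")
    # fields[1:2] is empty when there is no second column, so short lines fail.
    return fields[0].isdigit() and fields[1:2] == ["0"]


# Made-up rows in an id, namespace, title layout.
rows = [
    "12\t0\tAnarchism\n",
    "13\t1\tTalk_page\n",
    "14\t0\tAlbedo\n",
]
kept = [row for row in rows if in_namespace0(row)]
print(len(kept))
```
&lt;/pre&gt;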
</content>
	</entry>
		<entry>
	  	<author>
			<name>Peter Skomoroch</name>
		</author>
		<title type="text/html" mode="escaped"><![CDATA[Quick Visualization of irs.gov Search Queries]]></title>
		<link rel="alternate" type="text/html" href="http://www.datawrangling.com/quick-visualization-of-irs-search-queries" />
		<id>http://www.datawrangling.com/quick-visualization-of-irs-search-queries</id>
		<modified>2009-04-15T22:00:22Z</modified>
		<issued>2009-04-15T22:00:22Z</issued>
		
	<dc:subject>Information Retrieval</dc:subject>
	<dc:subject>Data mining</dc:subject>
	<dc:subject>visualization</dc:subject>
	<dc:subject>dataset</dc:subject> 
		<summary type="text/plain" mode="escaped"><![CDATA[Here is a quick visualization I did in honor of April 15th to investigate what people are looking for on tax day&#8230;



This &#8220;query tree&#8221; shows the most frequent searches starting with the term &#8220;irs&#8221;. Each branch in the tree represents a query where the words are sized according to frequency of occurrence.   I like [...]]]></summary>
		<content type="text/html" mode="escaped" xml:base="http://www.datawrangling.com/quick-visualization-of-irs-search-queries">&lt;p&gt;Here is a quick visualization I did in honor of April 15th to investigate what people are looking for on tax day&amp;#8230;&lt;/p&gt;

&lt;p&gt;&lt;img src="http://datawrangling.s3.amazonaws.com/dataviz/irs-queries.png"&gt;&lt;/img&gt;&lt;/p&gt;

&lt;p&gt;This &amp;#8220;query tree&amp;#8221; shows the most frequent searches starting with the term &amp;#8220;irs&amp;#8221;. Each branch in the tree represents a query where the words are sized according to frequency of occurrence.   I like how you can see at a glance what the most popular tax forms are by following the &amp;#8220;irs tax form &amp;#8230;&amp;#8221; branch.  Apparently form 8868, &lt;em&gt;Application for Extension of Time To File&lt;/em&gt;, is in high demand.&lt;/p&gt;
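
&lt;p&gt;The data structure behind a tree like this can be sketched in a few lines of Python 3.  This is only an illustration of the idea and not how Concentrate builds its trees; the function name build_query_tree is made up, and the three counts are taken from the query table in this post:&lt;/p&gt;

&lt;pre&gt;
```python
def build_query_tree(query_counts):
    """Nest queries word by word, summing search counts at each node."""
    root = {"count": 0, "children": {}}
    for query, count in query_counts:
        node = root
        node["count"] += count
        for word in query.split():
            empty = {"count": 0, "children": {}}
            node = node["children"].setdefault(word, empty)
            # A node's count is what would drive the word's size in the tree.
            node["count"] += count
    return root


tree = build_query_tree([
    ("irs", 4787),
    ("irs forms", 608),
    ("irs tax forms", 196),
])
irs = tree["children"]["irs"]
print(irs["count"], sorted(irs["children"]))
```
&lt;/pre&gt;

&lt;p&gt;Each branch below a node is a possible next word, so following the largest counts from &amp;#8220;irs&amp;#8221; downward traces the most common queries.&lt;/p&gt;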

&lt;p&gt;
It was created by uploading search queries from AOL users leading to clicks on &lt;a href="http://www.irs.gov/"&gt;irs.gov&lt;/a&gt; during Spring 2006 to &lt;a href="http://www.concentrateme.com"&gt;Concentrate&lt;/a&gt;, which generated the query tree.  This image is a snapshot of an interactive Flash visualization in Concentrate, where the focus term was &amp;#8220;irs&amp;#8221;.  Looking at query patterns like this can help you get an idea of what people are looking for and how to better organize your site so they can find it quickly.
&lt;/p&gt;

&lt;p&gt;The interactive Flash visualization was developed by &lt;a href="http://twitter.com/chrisgemignani"&gt;Chris Gemignani&lt;/a&gt; using &lt;a href="http://flare.prefuse.org/"&gt;Flare&lt;/a&gt;, with some input from Zach Gemignani and me and inspiration from the &lt;a href="http://manyeyes.alphaworks.ibm.com/manyeyes/page/Word_Tree.html"&gt;Many Eyes WordTree&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
The raw data is from the released &lt;a href="http://www.gregsadetsky.com/aol-data/"&gt;AOL Search data sample&lt;/a&gt;, and consists of the subset of unique queries leading to clicks on irs.gov from March to May 2006. The IRS queries used to make the visualization can be downloaded here: &lt;a href="http://datawrangling.s3.amazonaws.com/irs.gov.queries.csv"&gt;irs.gov.queries.csv&lt;/a&gt; (191K)
&lt;/p&gt;

&lt;p&gt;Here are the top 12 queries in the file:&lt;/p&gt;

&lt;p&gt;
&lt;table border=0 cellpadding=0 cellspacing=0 width=197 style='border-collapse:
 collapse;table-layout:fixed'&gt;
 &lt;col width=140&gt;
 &lt;col width=57&gt;

 &lt;tr height=13&gt;
  &lt;td height=13 width=140&gt;&lt;strong&gt;Query&lt;/strong&gt;&lt;/td&gt;
  &lt;td width=57&gt;&lt;strong&gt;Searches&lt;/strong&gt;&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr height=13&gt;
  &lt;td height=13&gt;irs&lt;/td&gt;
  &lt;td align=right&gt;4787&lt;/td&gt;

 &lt;/tr&gt;
 &lt;tr height=13&gt;
  &lt;td height=13&gt;irs.gov&lt;/td&gt;
  &lt;td align=right&gt;2282&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr height=13&gt;
  &lt;td height=13&gt;www.irs.gov&lt;/td&gt;

  &lt;td align=right&gt;1975&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr height=13&gt;
  &lt;td height=13&gt;internal revenue service&lt;/td&gt;
  &lt;td align=right&gt;1154&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr height=13&gt;

  &lt;td height=13&gt;irs forms&lt;/td&gt;
  &lt;td align=right&gt;608&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr height=13&gt;
  &lt;td height=13&gt;tax forms&lt;/td&gt;
  &lt;td align=right&gt;361&lt;/td&gt;
 &lt;/tr&gt;

 &lt;tr height=13&gt;
  &lt;td height=13&gt;irs tax forms&lt;/td&gt;
  &lt;td align=right&gt;196&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr height=13&gt;
  &lt;td height=13&gt;internal revenue&lt;/td&gt;
  &lt;td align=right&gt;158&lt;/td&gt;

 &lt;/tr&gt;
 &lt;tr height=13&gt;
  &lt;td height=13&gt;taxes&lt;/td&gt;
  &lt;td align=right&gt;142&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr height=13&gt;
  &lt;td height=13&gt;wheres my refund&lt;/td&gt;

  &lt;td align=right&gt;139&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr height=13&gt;
  &lt;td height=13&gt;federal tax forms&lt;/td&gt;
  &lt;td align=right&gt;125&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr height=13&gt;

  &lt;td height=13&gt;irs refunds&lt;/td&gt;
  &lt;td align=right&gt;106&lt;/td&gt;
 &lt;/tr&gt;
&lt;/table&gt;

&lt;/p&gt;
</content>
	</entry>
		<entry>
	  	<author>
			<name>Peter Skomoroch</name>
		</author>
		<title type="text/html" mode="escaped"><![CDATA[Amazon Elastic MapReduce: A Web Service API for Hadoop]]></title>
		<link rel="alternate" type="text/html" href="http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop" />
		<id>http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop</id>
		<modified>2009-04-02T07:42:21Z</modified>
		<issued>2009-04-02T07:42:21Z</issued>
		
	<dc:subject>Data mining</dc:subject>
	<dc:subject>Python</dc:subject>
	<dc:subject>Amazon EC2</dc:subject>
	<dc:subject>mapreduce</dc:subject>
	<dc:subject>collaborative filtering</dc:subject>
	<dc:subject>scipy</dc:subject>
	<dc:subject>dataset</dc:subject>
	<dc:subject>hadoop</dc:subject> 
		<summary type="text/plain" mode="escaped"><![CDATA[AWS just launched a new service called Amazon Elastic MapReduce that provides the same kind of developer friendly API used for Amazon EC2 or S3 for running Hadoop jobs in the Cloud.  You submit a job request and number of instances to the API (pointing to input data and code on S3), and AWS [...]]]></summary>
		<content type="text/html" mode="escaped" xml:base="http://www.datawrangling.com/amazon-elastic-mapreduce-a-web-service-api-for-hadoop">&lt;p&gt;AWS just launched a new service called &lt;a href="http://aws.amazon.com/elasticmapreduce"&gt;Amazon Elastic MapReduce&lt;/a&gt; that provides the same kind of developer-friendly API used for Amazon EC2 or S3 for running &lt;a href="http://hadoop.apache.org/core/"&gt;Hadoop&lt;/a&gt; jobs in the cloud.  You submit a job request and a number of instances to the API (pointing to input data and code on S3), and AWS spins up a private Hadoop cluster on EC2, submits your job, and reports back on status through the API.  You can cancel or modify jobs using the API, and can even add additional steps to a running job.&lt;/p&gt;

&lt;p&gt;I was part of the private beta and wrote a short code sample that shows how to run Python streaming jobs using the service: &lt;a href="http://developer.amazonwebservices.com/connect/entry!default.jspa?categoryID=265&amp;amp;externalID=2294"&gt;Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming&lt;/a&gt;.  As part of the code example, I also pulled together a cleaned up version of the &lt;a href="http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html"&gt;AudioScrobbler dataset&lt;/a&gt; for use in music recommendations (it is about 1/4 the size of the Netflix Prize data).  The code sample basically implements a Python streaming version of the Pairwise Similarity algorithm found in this &lt;a href="http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf"&gt;paper&lt;/a&gt; by Tamer Elsayed, Jimmy Lin, and Douglas Oard and applies it to Netflix Prize ratings and Audioscrobbler playlist data.&lt;/p&gt;
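
&lt;p&gt;The pairwise idea can be sketched in plain Python 3.  This is a toy illustration and not the code from the AWS article: the names map_user and reduce_pairs and the ratings are made up, the score is an unnormalized sum of rating products, and a real Streaming job would read and write tab-separated lines on stdin and stdout instead of passing Python objects around:&lt;/p&gt;

&lt;pre&gt;
```python
from collections import defaultdict
from itertools import combinations


def map_user(ratings):
    """Emit ((item_a, item_b), product) for every item pair one user rated."""
    for (a, ra), (b, rb) in combinations(sorted(ratings.items()), 2):
        yield (a, b), ra * rb


def reduce_pairs(emitted):
    """Sum the partial products per item pair, as a reducer would."""
    totals = defaultdict(float)
    for pair, value in emitted:
        totals[pair] += value
    return dict(totals)


# Made-up ratings keyed by user.
users = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 2, "C": 1},
}
emitted = [kv for ratings in users.values() for kv in map_user(ratings)]
scores = reduce_pairs(emitted)
print(scores[("A", "B")])
```
&lt;/pre&gt;

&lt;p&gt;Sorting each user&amp;#8217;s items before pairing keeps every pair in one canonical order, so partial scores for the same two items always land on the same key.&lt;/p&gt;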

&lt;p&gt;The base EC2 images underlying the service are running &lt;a href="http://hadoop.apache.org/core/docs/r0.18.3/"&gt;Hadoop 0.18.3&lt;/a&gt; on Debian and include &lt;a href="http://numpy.scipy.org"&gt;NumPy&lt;/a&gt;, &lt;a href="http://www.scipy.org/"&gt;SciPy&lt;/a&gt;, &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, &lt;a href="http://www.crummy.com/software/BeautifulSoup/"&gt;BeautifulSoup&lt;/a&gt;, and other preinstalled packages useful for Streaming Hadoop jobs.  You can use the &lt;a href="http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/"&gt;distributed cache&lt;/a&gt; to install other packages like &lt;a href="http://code.google.com/p/nltk/source/browse/trunk/nltk#nltk/nltk_contrib/hadoop"&gt;nltk&lt;/a&gt; at runtime.&lt;/p&gt;

&lt;p&gt;My initial impression is that this will evolve into a powerful tool for people who want to run ad hoc MapReduce jobs, prototype MapReduce code on EC2, or interface with on-demand clusters from within their apps.  Hopefully we&amp;#8217;ll see a MapReduce code/task sharing facility at some point similar to the EC2 public AMI system.&lt;/p&gt;

&lt;p&gt;Note that in the current release of Elastic MapReduce, input data is copied down from S3 at the start of the job and your cluster shuts itself down upon completion by default (you can override this with the API).  Mounting data directly from &lt;a href="http://aws.amazon.com/ebs/"&gt;EBS volumes&lt;/a&gt; isn&amp;#8217;t supported yet, but I wouldn&amp;#8217;t be surprised to see that soon given the potential for integrating with &lt;a href="http://www.datawrangling.com/amazon-web-services-public-datasets"&gt;Amazon Public Datasets&lt;/a&gt;.  Running &lt;a href="http://wiki.github.com/klbostee/dumbo"&gt;Dumbo&lt;/a&gt; jobs isn&amp;#8217;t supported yet since it requires a Hadoop patch for 0.18.3, but it should be possible when AWS moves to Hadoop 0.21 (which will also bring in a number of other important Hadoop features that are missing in 0.18.3).&lt;/p&gt;

&lt;p&gt;For maintaining a permanent cluster in-house or even a semi-permanent cluster on EC2 with a large amount of data, I would recommend using the &lt;a href="http://www.cloudera.com/hadoop-ec2"&gt;Cloudera distribution for Hadoop&lt;/a&gt; (it is a one-liner to start an EC2 Hadoop cluster from the command line).  I would often bounce between running jobs on my Cloudera EC2 cluster and Elastic MapReduce during development of the code example.  If you are getting started with Hadoop, the &lt;a href="http://www.cloudera.com/hadoop-training-basic"&gt;Cloudera training videos&lt;/a&gt; are a great place to get up to speed.&lt;/p&gt;

&lt;p&gt;So what can you do with Elastic MapReduce?  Here are a few initial ideas:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;Offload background processing from your Rails or Django app to Hadoop by sending job requests to the Elastic MapReduce API that point to data stored on S3: convert PDFs, classify spam, deduplicate records, batch geocode addresses, etc.&lt;/li&gt;
    &lt;li&gt;Process large amounts of retail sales and inventory transaction data for sales forecasting and optimization&lt;/li&gt;
    &lt;li&gt;Use the AddJobFlowSteps method in the API to run iterative machine learning algorithms using MapReduce on a remote Hadoop cluster and shut it down when your results converge to an answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I&amp;#8217;ll post more on this later today - including a detailed explanation of using Netflix Prize data in the code example and some next steps for using Elastic MapReduce.&lt;/p&gt;
</content>
	</entry>
		<entry>
	  	<author>
			<name>Peter Skomoroch</name>
		</author>
		<title type="text/html" mode="escaped"><![CDATA[Updated List of Datasets &#038; Video Lectures]]></title>
		<link rel="alternate" type="text/html" href="http://www.datawrangling.com/updated-list-of-datasets-video-lectures" />
		<id>http://www.datawrangling.com/updated-list-of-datasets-video-lectures</id>
		<modified>2009-02-12T23:37:49Z</modified>
		<issued>2009-02-12T23:37:49Z</issued>
		
	<dc:subject>Machine learning</dc:subject>
	<dc:subject>Computational Finance</dc:subject>
	<dc:subject>Python</dc:subject>
	<dc:subject>Web Frameworks</dc:subject>
	<dc:subject>Software engineering</dc:subject>
	<dc:subject>visualization</dc:subject>
	<dc:subject>Computer Science</dc:subject>
	<dc:subject>Education</dc:subject>
	<dc:subject>dataset</dc:subject> 
		<summary type="text/plain" mode="escaped"><![CDATA[New Datasets

It&#8217;s spring cleaning time at Data Wrangling. I&#8217;ve bookmarked 230 new datasets since publishing my first dataset linkdump in January 2008, so at the request of @mrflip, I&#8217;ve appended them to the original post along with a json dump of the tagged links.  Flip and the other Infochimps will be pulling anything they [...]]]></summary>
		<content type="text/html" mode="escaped" xml:base="http://www.datawrangling.com/updated-list-of-datasets-video-lectures">&lt;h3&gt;New Datasets&lt;/h3&gt;

&lt;p&gt;It&amp;#8217;s spring cleaning time at Data Wrangling. I&amp;#8217;ve bookmarked 230 new datasets since publishing my first dataset linkdump in January 2008, so at the request of &lt;a href="http://twitter.com/mrflip"&gt;@mrflip&lt;/a&gt;, I&amp;#8217;ve appended them to the original post along with a json dump of the tagged links.  Flip and the other &lt;a href="http://blog.infochimps.org/"&gt;Infochimps&lt;/a&gt; will be pulling anything they might have missed into the &lt;a href="http://infochimps.org"&gt;infochimps.org dataset repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can check out the new list of datasets at the same url:&lt;BR&gt; &lt;a href="http://www.datawrangling.com/some-datasets-available-on-the-web"&gt;&amp;#8220;Some Datasets Available on the Web&amp;#8221;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Around 85 of these datasets can be redistributed publicly: &lt;a href="http://delicious.com/pskomoroch/redistributable+dataset"&gt;http://delicious.com/pskomoroch/redistributable+dataset&lt;/a&gt;.  The rest are mostly free for academic use, but the license conditions vary.  Some appear to adhere to the terms on &lt;a href="http://opendefinition.org/"&gt;http://opendefinition.org/&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;New Video Courses&lt;/h3&gt;

&lt;p&gt;In addition to the datasets, my bookmarks included 20 new video courses since the original video lecture post was published in April, 2008.  These are mostly graduate and advanced undergraduate courses in Physics, Mathematics, and Computer Science.  Among these are full video courses in Parallel programming, Loop Quantum Gravity, Machine Learning, Financial Markets, and other fun subjects.&lt;/p&gt;

&lt;p&gt;The new videos have been added to the post:&lt;BR&gt;&lt;a href="http://www.datawrangling.com/hidden-video-courses-in-math-science-and-engineering"&gt;&amp;#8220;Hidden Video Courses in Math, Science, and Engineering&amp;#8221;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Videos of Talks &amp;#038; Seminars&lt;/h3&gt;

&lt;p&gt;As an added bonus, here is a completely unorganized list of interesting programming, machine learning, and visualization talks which caught my eye in 2008:&lt;/p&gt;

&lt;p&gt;&lt;a id="more-36"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=http://www.youtube.com/watch?v=yOOJzQRJfIw rel="nofollow" tags=education,video,talk,computerscience,towatch&gt;YouTube - Google University Inaugural Lecture: Expanding the Frontiers of Computer S&amp;#8230;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.researchchannel.org/itunesu/ rel="nofollow" tags=video,research,via:jolby,talks,itunes&gt;ResearchChannel @ iTunes U&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.facebook.com/video/video.php?v=631826881803 rel="nofollow" tags=facebook,video,talk,scalability,memcache,towatch&gt;Videos Posted by Facebook Engineering Tech Talks: Memcached Tech Talk with Mark Zuckerberg | Facebook&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.johnmwillis.com/ccatl/cloud-camp-atlanta-recap/ rel="nofollow" tags=sysadmin,ec2,video,ruby,talk,deployment,puppet&gt;Cloud Camp Atlanta Recap | IT Management and Cloud Blog&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://loops05.aei.mpg.de/index_files/Programme.html rel="nofollow" tags=video,talk,physics,conference,gravity,quantum,loop&gt;Loops&amp;#039;05&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videocast.nih.gov/pastevents.asp?c=36 rel="nofollow" tags=biology,bioinformatics,video,nih,talks,proteomics&gt;NIH VideoCasting Past Events&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/fws06_yates_aqm/ rel="nofollow" tags=web,search,queryminer,towatch,video,talk,yahoo&gt;Applications of Query Mining&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/mloss08_whistler/ rel="nofollow" tags=nips,machinelearning,video,talk,towatch,software,python,mlpy,mdp&gt;NIPS &amp;#039;08 Workshop: Machine Learning Open Source Software&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.softdevtube.com/?p=153 rel="nofollow" tags=google,screencast,talk,selenium,amazon,ec2,testing,video&gt;Extending Selenium | Software Development Videos&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://michaelnielsen.org/blog/?page_id=503 rel="nofollow" tags=mapreduce,google,pagrank,talk,video,search,python&gt;Michael Nielsen » Lectures on the Google Technology Stack: Syllabus and Background Reading&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/google_datar_gnp/ rel="nofollow" tags=google,collaborative,filtering,news,personalization,mapreduce,em,video,talk,towatch,plsi&gt;Google News Personalization: Scalable Online Collaborative Filtering&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/kdd07_agarwal_pdlfm/ rel="nofollow" tags=talk,video,machinelearning,recommender,collaborative,filtering,mapreduce,hadoop,yahoo,towatch,MMDS,deepak_agarwal&gt;Predictive Discrete Latent Factor Models for Large Scale Dyadic Data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/icml07_domingos_psr/ rel="nofollow" tags=statistics,markov,graphicalmodel,video,talk,towatch,refresher,beginner&gt;Practical Statistical Relational Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://video.google.com/videosearch?q=doug+cutting&amp;#038;emb=0&amp;#038;aq=f rel="nofollow" tags=nutch,hadoop,video,talk,doug_cutting,people,towatch&gt;doug cutting - Google Video&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/mlas06_mitchell_sla/ rel="nofollow" tags=video,talk,machinelearning,semisupervised,mitchell&gt;Semisupervised Learning Approaches&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.ted.com/index.php/talks/murray_gell_mann_on_beauty_and_truth_in_physics.html rel="nofollow" tags=video,talk,physics,gell-mann,towatch&gt;Murray Gell-Mann on beauty and truth in physics | Video on TED.com&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.redditall.com/2008/11/startup-insights-from-dharmesh-shah-in.html rel="nofollow" tags=video,talk,startup,towatch&gt;reddit all: alien artist blog: startup insights from dharmesh shah - in video!&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.youtube.com/watch?v=MsRTrO_p6yE rel="nofollow" tags=youtube,similarity,search,recommendation,personalization,video,talk,towatch&gt;YouTube - Similarity Search: A Web Perspective&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://research.google.com/roundtable/MR.html rel="nofollow" tags=video,talk,mapreduce,google&gt;Google Technology RoundTable: Map Reduce&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://video.google.com/videoplay?docid=7803400155407155553 rel="nofollow" tags=hadoop,video,talk,panel&gt;Hadoop Panel Discussion&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/wsdm08_mei_esl/ rel="nofollow" tags=search,log,query,talk,video,machinelearning,microsoft,entropy,queryminer&gt;Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.youtube.com/watch?v=23s9Wc3aWGY rel="nofollow" tags=python,video,towatch,google,talk&gt;YouTube - Slightly Advanced Python: Some Python Internals&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.youtube.com/watch?v=fzQ00f1oETs rel="nofollow" tags=python,video,talk,towatch&gt;YouTube - BayPIGgies Meeting - SF Bay Area Python Interest Group - Python Callbacks&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.youtube.com/watch?v=NFCZuzA4cFc&amp;#038;feature=channel rel="nofollow" tags=named_entity,machinelearning,wikipedia,video,talk,google,towatch&gt;YouTube - Knowledge-based Information Retrieval with Wikipedia.&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.balsamiq.com/blog/?p=375 rel="nofollow" tags=video,talk,startup,development,towatch,37signals&gt;Some Excellent ISV Advice from Jason Fried | The Balsamiq Blog&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.youtube.com/watch?v=UXy1bRSOJP4 rel="nofollow" tags=video,talk,django,python,towatch,teams&gt;YouTube - DjangoCon 2008 Panel: Django Success Stories&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.youtube.com/watch?v=i6Fr65PFqfk rel="nofollow" tags=video,talks,python,scalability,djangocon,django,towatch&gt;YouTube - DjangoCon 2008 Keynote: Cal Henderson&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.infoq.com/presentations/fernandez-restful-rails-apps rel="nofollow" tags=video,talk,rubyonrails,rails,rest&gt;InfoQ: Designing RESTful Rails Applications&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/icml08_collobert_lsl/ rel="nofollow" tags=video,talk,nlp,via:chl,machinelearning&gt;Large Scale Learning Which Is Actually Useful&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.youtube.com/watch?v=dddFfRaBPqg rel="nofollow" tags=kahneman,psychology,finance,economics,strategy,video,talk,to:watch,behavior,statistics,via:chl,bias&gt;YouTube - Explorations of the Mind: Intuition&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://gigaom.com/2008/07/11/early-youtube-engineer-tells-all/ rel="nofollow" tags=talk,video,towatch,youtube,scalability,architecture,performance,python,mysql,protocol_buffers,bigtable&gt;Early YouTube Engineer Tells All - GigaOM&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://radar.oreilly.com/archives/2008/06/rich-wolski-eucalyptus-open-source-ec2.html rel="nofollow" tags=velocity,conference,video,talk,ec2,eucalyptus,automation,scalability,sysadmin&gt;Video of Rich Wolski&amp;#039;s EUCALYPTUS talk at Velocity - O&amp;#039;Reilly Radar&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/cmulls08_singh_rlm/ rel="nofollow" tags=machinelearning,matrix,nmf,pca,svd,lsi,ica,factorization,collaborative,filtering,video,talk,cmu,textmining,via:chl,towatch&gt;Relational Learning as Collective Matrix Factorization&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://sites.google.com/site/io/ rel="nofollow" tags=video,talk,google,conference&gt;Google I/O Sessions (Google I/O Session Videos and Slides)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://blip.tv/file/948814 rel="nofollow" tags=manyeyes,visualization,finance,video,talk,treemap,wordtree,text,analytics,queryminer&gt;Martin Wattenberg, &amp;quot;Money is Beautiful: Looking at Markets in New Ways&amp;quot;&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://en.oreilly.com/money2008/public/content/home rel="nofollow" tags=finance,web2.0,conference,oreilly,video,talk,towatch&gt;O&amp;#039;Reilly Money:Tech Conference 2008 — O&amp;#039;Reilly Conferences, February 06 - 07, 2008, New York, NY&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://deus-ex-machine.blogspot.com/2008/05/we-use-all-100-percent-of-our-brain.html rel="nofollow" tags=video,talk,neuroscience,google,layman&gt;deus ex machine: We Use All 100 Percent of Our Brain&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://mtnwestrubyconf2008.confreaks.com/05younger.html rel="nofollow" tags=video,talk,kato,ruby,ec2,sqs,s3&gt;Confreaks: MountainWest Ruby Conference 2008&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://video.google.com/videoplay?docid=-6459171443654125383&amp;#038;hl=en rel="nofollow" tags=usability,video,talk,google,design,analytics,testing&gt;The Science and Art of User Experience at Google&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://video.google.com/videoplay?docid=4042661206682688471 rel="nofollow" tags=video,talk,erik_hatcher,search,lucene&gt;Code4Lib 2007: Erik Hatcher Keynote&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://aws.typepad.com/aws/2008/05/use-amazon-sqs.html rel="nofollow" tags=amazon,ec2,sqs,aws,howto,video,screencast,talk&gt;Amazon Web Services Blog: Use Amazon SQS to Build Self-Healing Applications&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.bestechvideos.com/2007/03/09/linuxconfau-puppet-a-system-administration-abstraction-and-automation-framework rel="nofollow" tags=puppet,admin,sysadmin,automation,video,talk,computerscience&gt;LinuxConf.Au: Puppet: A System Administration Abstraction and Automation Framework :: Tech Videos, Screencasts, Tutorials, Webinars, Techtalks, Tutorials&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://youtube.com/watch?v=KYUay3dCWBc rel="nofollow" tags=database,scalability,analytics,statistics,video,talk,performance&gt;YouTube - Supporting Scalable Online Statistical Processing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://next.yahoo.net/archives/79/big-data-viewpoints-from-the-facebook-data-team rel="nofollow" tags=facebook,video,talk,analytics,hadoop&gt;next.yahoo » Blog Archive » Big Data: Viewpoints from the Facebook Data Team&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://highscalability.com/scaling-mania-mysql-conference-2008 rel="nofollow" tags=database,mysql,scalability,performance,presentations,links,talk,video&gt;Scaling Mania at MySQL Conference 2008 | High Scalability&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://research.yahoo.com/node/2104 rel="nofollow" tags=hadoop,video,lectures,talk,yahoo,summit,slides,parallel,mapreduce,towatch,nlp,hbase,amazon,ec2,distributed,search,microsoft,facebook,powerset,berkeley,google,intel,cmu,rapleaf&gt;Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides | Yahoo! Research&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://bio-alive.com/seminars/biodefense.htm rel="nofollow" tags=biodefense,video,talks&gt;Biodefense Video Lectures and Seminars&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://ejohn.org/blog/javascript-talk-at-northeastern/ rel="nofollow" tags=video,talk,javascript,jquery&gt;John Resig - JavaScript Talk at Northeastern&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://video.google.com/videoplay?docid=-7111461506729989490&amp;#038;q=norvig+language&amp;#038;total=3&amp;#038;start=0&amp;#038;num=10&amp;#038;so=0&amp;#038;type=search&amp;#038;plindex=0 rel="nofollow" tags=data,analysis,norvig,google,video,talk&gt;Google Developers Day US - Theorizing from Data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://youtube.com/watch?v=3boKlkPBckA rel="nofollow" tags=video,talk,yann_lecun,neuralnetwork,vision,machinelearning,via:chl&gt;YouTube - Visual Perception with Deep Learning&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/mmdss07_bottou_lume/ rel="nofollow" tags=video,talk,machinelearning,statisticallearning,via:chl&gt;Learning with Large Datasets [Video]&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img src="http://feeds.feedburner.com/~r/DataWrangling/~4/BHoXke_rDao" height="1" width="1"/&gt;</content>
	</entry>
		<entry>
	  	<author>
			<name>Peter Skomoroch</name>
		</author>
		<title type="text/html" mode="escaped"><![CDATA[Search map: interactive visualization of search query clusters]]></title>
		<link rel="alternate" type="text/html" href="http://www.datawrangling.com/search-map-interactive-visualization-of-query-clusters" />
		<id>http://www.datawrangling.com/search-map-interactive-visualization-of-query-clusters</id>
		<modified>2009-02-10T01:12:16Z</modified>
		<issued>2009-02-10T01:12:16Z</issued>
		
	<dc:subject>Machine learning</dc:subject>
	<dc:subject>Information Retrieval</dc:subject>
	<dc:subject>Data mining</dc:subject>
	<dc:subject>Python</dc:subject>
	<dc:subject>Web Mashups</dc:subject>
	<dc:subject>Juice</dc:subject>
	<dc:subject>numpy</dc:subject>
	<dc:subject>scipy</dc:subject>
	<dc:subject>visualization</dc:subject> 
		<summary type="text/plain" mode="escaped"><![CDATA[Last month, our team at Juice launched a Django web analytics app called Concentrate that ingests search queries from sources like Google Analytics or Hitwise, then enhances this raw data by discovering common query patterns, generating segmented reports, and offering visual interfaces for data exploration.  Jeff Barr wrote about the technology stack we used [...]]]></summary>
		<content type="text/html" mode="escaped" xml:base="http://www.datawrangling.com/search-map-interactive-visualization-of-query-clusters">&lt;p&gt;Last month, our team at &lt;a href="http://www.juiceanalytics.com/"&gt;Juice&lt;/a&gt; launched a &lt;a href="http://www.djangoproject.com/"&gt;Django&lt;/a&gt; web analytics app called &lt;b&gt;&lt;a href="https://www.concentrateme.com/"&gt;Concentrate&lt;/a&gt;&lt;/b&gt; that ingests search queries from sources like Google Analytics or Hitwise, then enhances this raw data by discovering common query patterns, generating segmented reports, and offering visual interfaces for data exploration.  Jeff Barr &lt;a href="http://aws.typepad.com/aws/2009/01/aws-links-tuesday-january-27-2009.html"&gt;wrote about the technology stack&lt;/a&gt; we used to build the app itself a couple of weeks ago at the AWS blog.  I&amp;#8217;ll provide some more detail on that topic later this week.  This post will give a basic description of Concentrate&amp;#8217;s pattern discovery algorithm and show it in action.&lt;/p&gt;

&lt;p&gt;The following mashup provides a visual interface for exploring search patterns used by readers of the Data Wrangling blog by combining output from &lt;a href="https://www.concentrateme.com/"&gt;concentrateme.com&lt;/a&gt; with the &lt;a href="http://code.google.com/apis/ajaxsearch/"&gt;Google AJAX search API&lt;/a&gt;.  Each bubble in the visualization below represents a search query typed into Google during the last 2 months that led to clicks on this site (~2000 unique queries, ~3400 searches).  The size of each bubble represents the number of visitors referred by that particular query, and the bubbles are colored by query cluster based on phrase pattern structure (&amp;#8216;python [x]&amp;#8217;, &amp;#8216;[x] video&amp;#8217;, etc).  The search results below the chart are highlighted in yellow if they lead to datawrangling.com pages, which allows you to see at a glance where the site ranks for each query.&lt;/p&gt;

&lt;p&gt;&lt;H3&gt;Search map of queries leading to clicks on datawrangling.com&lt;/H3&gt;
Click to open the query browser in a new window, then mouse over a query bubble and click to update the search results.&lt;/p&gt;

&lt;p&gt;&lt;a title="Data Wrangling Search Traffic: Dec-Jan" rel="nofollow" href="http://datawrangling.s3.amazonaws.com/visits-datawrangling.html" target="_blank"&gt;&lt;img hspace=5 vspace=5 HEIGHT=380 WIDTH=650 style="border:1px solid black" src="http://datawrangling.s3.amazonaws.com/scatter_visits.png" ALT="Interactive Search Query Map"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a id="more-35"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Search query referrals from Google depend on a number of factors, including what content you have, how well it ranks in Google, and how often people actually search for it.  I&amp;#8217;m not the most prolific blogger, so the topical coverage seen in the chart is somewhat sparse.  Some alternate views which can offer additional insight include &lt;a href="http://datawrangling.s3.amazonaws.com/average_time_on_site-datawrangling.html"&gt;sizing the bubbles by engagement metrics like average time on site&lt;/a&gt; instead of visit count.  You can download the raw data here if you want to experiment further: &lt;a href="http://datawrangling.s3.amazonaws.com/datawrangling_dec_jan.csv"&gt;datawrangling_dec_jan.csv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pattern discovery in Concentrate grew out of discussions I had with clients about current pain points in web analytics.  A common theme seemed to be that they had large amounts of internal and competitive search referral data, but found managing and deriving insights from the data to be difficult using existing tools.  My first instinct was to give them a summary view and trend reports by clustering queries based on topic using methods like &lt;a href="http://en.wikipedia.org/wiki/Non-negative_matrix_factorization"&gt;NMF&lt;/a&gt; (I&amp;#8217;ll be doing a few posts on topic classification later this month), but it turned out that the use cases they described were better served by another approach which automatically discovered common text patterns in the data and segmented queries based on phrase structure.&lt;/p&gt;

&lt;p&gt;What I&amp;#8217;m calling &amp;#8220;patterns&amp;#8221; are really regular expression templates for searches that share a similar structure. For instance, the pattern “jobs in [x]” represents searches for jobs in some location.   The “[x]” is a wildcard that can stand for one or more words. These wildcards are often variants of a similar concept, like locations, brands, or celebrity names.  As it turned out, a number of web analytics teams I talked to were spending a lot of time mining this kind of information manually from their search data in an effort to get a picture of their search traffic.&lt;/p&gt;
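
&lt;p&gt;As a rough illustration of reading a template like “jobs in [x]” as a regular expression, here is a minimal Python sketch.  It is not Concentrate's implementation: the helper names template_to_regex and match_query are hypothetical, and the wildcard is simply assumed to match one or more whitespace-separated words.&lt;/p&gt;

&lt;pre&gt;
```python
import re


def template_to_regex(template):
    """Turn a template such as 'jobs in [x]' into a compiled regex.

    Assumption: every '[x]' stands for one or more whitespace-separated words.
    """
    parts = template.split("[x]")
    # Escape the literal text and join the pieces with a capturing wildcard.
    body = r"(\S+(?:\s+\S+)*)".join(re.escape(p) for p in parts)
    return re.compile("^" + body + "$", re.IGNORECASE)


def match_query(template, query):
    """Return the wildcard fills if the query fits the template, else None."""
    m = template_to_regex(template).match(query.strip())
    return list(m.groups()) if m else None


print(match_query("jobs in [x]", "jobs in new york"))
print(match_query("[x] video", "python video"))
print(match_query("jobs in [x]", "python video"))
```
&lt;/pre&gt;

&lt;p&gt;Queries that fit the same template fall into the same segment, and the captured words show which locations, brands, or names fill the wildcard.&lt;/p&gt;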

&lt;p&gt;Clustering searches based on phrase similarity is a problem that people have looked at before in fields like &lt;a href="http://en.wikipedia.org/wiki/Question_answering"&gt;question answering&lt;/a&gt;.  Several interesting papers are referenced in my delicious &lt;a href="http://delicious.com/pskomoroch/clustering?setcount=100"&gt;clustering&lt;/a&gt; tag if you want to dig deeper.  I think the novel part in Concentrate is that it combines a custom distance metric with an iterative algorithm that alternates between clustering queries and extracting common patterns within those clusters in a scalable fashion.  Since the algorithm is part of a commercial service I can&amp;#8217;t go into much more detail about how this all works, but hopefully you get the general idea of how this kind of pattern discovery can be useful.&lt;/p&gt;

&lt;p&gt;The original idea for the interactive query scatter plot was inspired by a &lt;a href="http://abeautifulwww.com/2007/04/03/an-interactive-visualization-of-the-netflix-prize-dataset/"&gt;Netflix Prize data visualization&lt;/a&gt; at &lt;a href="http://abeautifulwww.com/"&gt;A Beautiful WWW&lt;/a&gt;.  A similar analysis was run by &lt;a href="http://twitter.com/mdreid"&gt;Mark Reid&lt;/a&gt; using book borrowing data at &lt;a href="http://mark.reid.name/iem/visualising-19th-century-reading.html"&gt;Inductio ex Machina&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To generate the scatter plot, I tied my hands a bit and only allowed myself to use the CSV files downloaded from Concentrate as an input to the mashup (an API may be on the horizon).  I made similar plots to aid in development and debugging of the clustering &amp;amp; pattern extraction algorithms, but this version is nice because you don&amp;#8217;t need any extra information to generate it yourself.  I computed inter-query distance for all pairs of query strings in the file using a string edit distance metric based on a combination of the queries and the phrase pattern labels discovered by Concentrate (also in the CSV).&lt;/p&gt;

&lt;p&gt;The scatter plot layout was generated by running &lt;a href="http://delicious.com/pskomoroch/multidimensional_scaling"&gt;multidimensional scaling (MDS)&lt;/a&gt; on the resulting full distance matrix.  The MDS approach consisted of applying the &lt;a href="http://www.scipy.org/doc/numpy_api_docs/numpy.linalg.linalg.html#svd"&gt;SVD function&lt;/a&gt; built into &lt;a href="http://www.scipy.org/"&gt;NumPy&lt;/a&gt; to the distance matrix to find the first two basis vectors, which were then used as X-Y coordinates.  To produce the actual scatter plot HTML imagemap I used &lt;a href="http://matplotlib.sourceforge.net/search.html?q=scatter"&gt;Matplotlib&lt;/a&gt; and a smattering of Python code from the following links:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;
&lt;a href="http://www.depthfirstsearch.net/blog/2008/04/18/multidimensional-scaling/"&gt;Depth First Search: Multidimensional Scaling&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;
&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2005/04/24/interactive_html.html"&gt;Dalke Scientific: Interactive HTML&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;
&lt;a href="http://hackmap.blogspot.com/2008/06/pylab-matplotlib-imagemap.html"&gt;Bio and Geo Informatics: pylab matplotlib imagemap&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
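
&lt;p&gt;The layout step can be sketched roughly as follows.  This is a minimal example under stated assumptions, not the code behind the chart: it uses plain Levenshtein distance on the query strings only (the pattern labels are left out), and it follows the textbook classical MDS recipe of double-centering the squared distances before calling NumPy's SVD.  The function names and sample queries are made up for illustration.&lt;/p&gt;

&lt;pre&gt;
```python
import numpy as np


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def distance_matrix(queries):
    """Full symmetric matrix of pairwise edit distances."""
    n = len(queries)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = edit_distance(queries[i], queries[j])
    return d


def mds_2d(d):
    """Classical MDS: double-center the squared distances, then take the SVD.

    Edit distances are not Euclidean, so the 2-D layout is an approximation.
    """
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering.dot(d ** 2).dot(centering)
    u, s, _ = np.linalg.svd(b)
    # The first two singular vectors, scaled, serve as X-Y coordinates.
    return u[:, :2] * np.sqrt(s[:2])


# Hypothetical sample queries; real input would come from the CSV export.
queries = ["python video", "python tutorial", "hadoop video", "hadoop ec2"]
coords = mds_2d(distance_matrix(queries))
print(coords.shape)
```
&lt;/pre&gt;

&lt;p&gt;Each row of coords gives one query's position, ready to pass to a Matplotlib scatter call with bubble sizes taken from the visit counts.&lt;/p&gt;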

&lt;p&gt;This was just intended as a quick one-off example, but if anyone out there wants to generate the same visualization using another site&amp;#8217;s search traffic, feel free to contact me.  If you have Google Analytics on your site, you can sign up for a &lt;a href="https://www.concentrateme.com/pricing/"&gt;free Concentrate account&lt;/a&gt; and email me a CSV report containing your pattern clusters.  If you want to find out more about how you can apply search patterns for web analytics applications, check out these posts from the Juice blog:&lt;/p&gt;

&lt;ul&gt;
    &lt;li&gt;
&lt;a href="http://www.juiceanalytics.com/writing/search-competition-travel-sites/"&gt;Search Competition Among Travel Sites&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;
&lt;a href="http://www.juiceanalytics.com/writing/target-long-tail-searches-keyword-patterns/"&gt;Target Long Tail Searches with Keyword Patterns&lt;/a&gt;&lt;/li&gt;
    &lt;li&gt;
&lt;a href="http://www.juiceanalytics.com/writing/introducing-concentrate-long-tail-search-analytics/"&gt;Introducing Concentrate for Long Tail Search Analytics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img src="http://feeds.feedburner.com/~r/DataWrangling/~4/aFB40hd93AQ" height="1" width="1"/&gt;</content>
	</entry>
		<entry>
	  	<author>
			<name>Peter Skomoroch</name>
		</author>
		<title type="text/html" mode="escaped"><![CDATA[Conversation with Eric Siegel on Predictive Analytics World]]></title>
		<link rel="alternate" type="text/html" href="http://www.datawrangling.com/conversation-with-eric-siegel-on-predictive-analytics-world" />
		<id>http://www.datawrangling.com/conversation-with-eric-siegel-on-predictive-analytics-world</id>
		<modified>2009-01-29T16:54:18Z</modified>
		<issued>2009-01-29T16:54:18Z</issued>
		
	<dc:subject>Netflixprize</dc:subject>
	<dc:subject>Machine learning</dc:subject>
	<dc:subject>Data mining</dc:subject>
	<dc:subject>mapreduce</dc:subject>
	<dc:subject>collaborative filtering</dc:subject>
	<dc:subject>Computer Science</dc:subject> 
		<summary type="text/plain" mode="escaped"><![CDATA[

The Predictive Analytics World Conference is taking place Feb 18-19, 2009 in San Francisco, CA and seems to have an interesting lineup of speakers (including one of the winners of this year&#8217;s Netflix Progress Prize).  I&#8217;m going to be in the Bay Area during the week of Feb 15th, so I&#8217;m planning on checking [...]]]></summary>
		<content type="text/html" mode="escaped" xml:base="http://www.datawrangling.com/conversation-with-eric-siegel-on-predictive-analytics-world">&lt;p&gt;&lt;a href="http://www.predictiveanalyticsworld.com" &gt;&lt;img hspace=5 vspace=5 align=right HEIGHT=92 WIDTH=375 src="http://datawrangling.s3.amazonaws.com/predictive_analytics.png" ALT="predictive analytics world conference"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="http://www.predictiveanalyticsworld.com"&gt;Predictive Analytics World Conference&lt;/a&gt; is taking place Feb 18-19, 2009 in San Francisco, CA and seems to have an interesting lineup of speakers (including one of the winners of this year&amp;#8217;s &lt;a href="http://www.netflixprize.com/community/viewtopic.php?id=1193"&gt;Netflix Progress Prize&lt;/a&gt;).  I&amp;#8217;m going to be in the Bay Area during the week of Feb 15th, so I&amp;#8217;m planning on checking out some of the talks.  Data Wrangling readers can &lt;a href="http://www.predictiveanalyticsworld.com/register.php"&gt;register&lt;/a&gt; using this code: &lt;strong&gt;datawranglingpaw09&lt;/strong&gt;  and get 15% off the conference registration fee.  Drop me a line if you are attending and want to meet up.&lt;/p&gt;

&lt;p&gt;It also might be worth stopping by if you are an R user, as &lt;a href="http://twitter.com/dataspora"&gt;Mike E. Driscoll&lt;/a&gt; at &lt;a href="http://dataspora.com/blog/"&gt;Data Evolution&lt;/a&gt; mentioned:&lt;/p&gt;

&lt;blockquote&gt;The Bay Area R UseRs group is doing a free, co-located event on Wed evening of the conference — so if you’re interested in mingling with some PAW folks as well as some R users — you can sign up at: &lt;a href="http://ia.meetup.com/67/calendar/9573566/"&gt;http://ia.meetup.com/67/calendar/9573566/&lt;/a&gt;&lt;/blockquote&gt;

&lt;p&gt;The organizers of the conference are coordinating a nice media blitz across several machine learning blogs; check out the &lt;a href="http://anyall.org/blog/2009/01/sf-conference-for-data-mining-mercenaries/"&gt;post by Brendan O’Connor&lt;/a&gt; and John Langford&amp;#8217;s &lt;a href="http://hunch.net/?p=516"&gt;interview at Machine Learning (Theory)&lt;/a&gt;.  I thought I would join in the fun by interviewing Eric about a few topics related to the conference, mostly focusing on customer modeling and machine learning in the business world.&lt;/p&gt;

&lt;p&gt;Read on for the transcript of our email interview:
&lt;a id="more-32"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="http://www.cs.columbia.edu/~evs/"&gt;Eric Siegel, Ph.D.&lt;/a&gt;, is the conference chair of &lt;a href="www.predictiveanalyticsworld.com"&gt;Predictive Analytics World&lt;/a&gt;, coming to San Francisco Feb 18-19 - the event for predictive analytics professionals, managers and commercial practitioners. This conference delivers case studies, expertise and resources in order to strengthen the business impact delivered by predictive analytics.  &lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pete: Can you give readers an overview of your background?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Eric: I&amp;#8217;ve been in data mining for 16 years and commercially applying predictive analytics with Prediction Impact since 2003.  As a professor at Columbia University, I taught graduate courses in predictive modeling (referred to as &amp;#8220;machine learning&amp;#8221; at universities), and have continued to lead training seminars in predictive analytics as part of my consulting career.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There appear to be a few talks on the schedule related to the Netflix Prize.  During the first year of the competition I found it surprising how effective ensemble approaches turned out to be, with multiple teams often pooling a number of algorithms together to improve accuracy.  Are you seeing similar ensemble methods used in practice more often now?
&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Netflix is addressed during at least three sessions of PAW, including a presentation by the current Netflix leader (and winner of the 2008 Progress Prize).&lt;/p&gt;

&lt;p&gt;I do see ensemble models as key to many successful deployments of predictive analytics, both internally (my firm, Prediction Impact) and otherwise.  In fact, the Netflix leader uses a &amp;#8220;mini-&amp;#8221; ensemble, combining two competing methods with a meta-model, which turned out to be key to their success. The only reason team &amp;#8220;BellKor in BigChaos&amp;#8221; did well enough to qualify for the 2008 Progress Prize was by combining the two approaches developed by the separate teams &amp;#8220;BigChaos&amp;#8221; and &amp;#8220;BellKor&amp;#8221;, who had operated completely separately and developed different methods (in Austria and the U.S., respectively). The method to combine models is called &amp;#8220;meta-learning&amp;#8221; or &amp;#8220;ensemble models&amp;#8221;. The &amp;#8220;meta-model&amp;#8221; learned which of the two methods was better at which kinds of cases, and weighted their outputs accordingly.&lt;/p&gt;

&lt;p&gt;Ensembles are also covered at PAW-09 outside of the Netflix leader&amp;#8217;s session. John Elder, who published a very interesting article (now a book chapter), &lt;a href="http://www.datamininglab.com/pubs/Paradox_JCGS.pdf"&gt;&amp;#8220;The Generalization Paradox of Ensembles&amp;#8221;&lt;/a&gt;, will speak on the topic during his workshop, &lt;a href="http://www.predictiveanalyticsworld.com/predictive_modeling_methods.php"&gt;&amp;#8220;The Best and the Worst of Predictive Analytics: Predictive Modeling Methods and Common Data Mining Mistakes&amp;#8221;&lt;/a&gt;.  Dr. Elder is also speaking on case studies during &lt;a href="http://www.predictiveanalyticsworld.com/agenda.php#multiplecase"&gt;a regular conference session&lt;/a&gt;.  And Dean Abbott&amp;#8217;s session covers a case study of his work for the NRA employing ensemble models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Peter Norvig gave a talk a couple of years ago, &lt;a href="http://www.youtube.com/watch?v=nU8DcBF-qo4"&gt;&amp;#8220;Theorizing from Data: Avoiding the Capital Mistake&amp;#8221;&lt;/a&gt;, which discussed how relatively simple statistical approaches could outperform more complex algorithms given large enough datasets.  Use of the MapReduce approach is often pointed to as a key factor enabling companies like Google and Yahoo to tackle problems like this.  Lately, I&amp;#8217;m hearing more interest from enterprise clients about using Hadoop and cloud computing to analyze large internal datasets in a similar manner.  Do you think this kind of large scale parallel data mining will see increased adoption in traditional enterprises?  Or will this approach be limited to a handful of large organizations?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, even with simple models, there can be a benefit to parallelizing the learning process.  On the other hand, even with a million training cases, one reasonably modern desktop can be enough.  Even if most organizations don&amp;#8217;t need to go parallel, this is one of those &amp;#8220;raise the floor by raising the ceiling&amp;#8221; kind of situations - it&amp;#8217;s important to keep this moving and push to new heights.&lt;/p&gt;

&lt;p&gt;By the way, the advantage of simpler approaches relates to the article by Dr. Elder I mentioned above.  An ensemble is by definition a more complex model, yet it is &lt;em&gt;less&lt;/em&gt; inclined to succumb to the central risk of higher complexity: overadapting/overfitting to the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the past, I worked with offline retailers on predictive modeling of price elasticity and promotional effectiveness.  Back then, we were often limited to analysis of historical price and sales POS data, without the capability to conduct live A/B testing in physical store locations.  Over the last few years, sites like Amazon and Google have &lt;a href="http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html"&gt;taken advantage of live A/B testing&lt;/a&gt; for price and advertising optimization on the web.  In your experience, is this experimental approach to decision making becoming more common in traditional enterprises or is it still limited to the web?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You know, I know of few to no public case studies of offline (non-web) businesses employing online/realtime learning. Generally, it isn&amp;#8217;t worth the integration requirements and increased analytical challenges - companies usually just refresh the predictive model periodically with the usual offline process of learning/modeling over updated data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Earlier this month, the &lt;a href="http://en.oreilly.com/money2009/"&gt;O&amp;#8217;Reilly Money:Tech Conference&lt;/a&gt; was postponed in light of the financial crisis. Have you seen much fallout in the predictive analytics community from the economic downturn?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is hard to generalize. People are more risk averse in employing a technology to which they may be new, but, at the same time, there&amp;#8217;s more attention on improving efficiency with better decisions or more precisely targeted marketing. My keynote at PAW-09 is entitled, &lt;a href="www.predictiveanalyticsworld.com/agenda.php#fiveways"&gt;&amp;#8220;Five Ways to Lower Costs with Predictive Analytics&amp;#8221;&lt;/a&gt;. And I would say my consulting colleagues and I are as busy as usual.&lt;/p&gt;

&lt;p&gt;You may think conferences are another thing, since attendance/travel are often the first costs cut. Well, I&amp;#8217;ve got a bias, but my opinion that predictive analytics is too important to circumvent is founded: registration for PAW-09 has surpassed the goals we set last spring. I credit in large part predictive analytics&amp;#8217; increase in buzz and attention; it&amp;#8217;s starting to get its due.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What problem areas appear to be most in demand by enterprise customers right now?  Credit scoring, recommendation systems, predictive text mining?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Credit scoring is established - an application of predictive analytics that has crossed the chasm and achieved wide adoption.  It is fairly specialized, almost a field unto itself.  Product recommendation is certainly on its way up, as vendors render it more accessible to companies that aren&amp;#8217;t an Amazon or Netflix. Augmenting predictive analytics with text mining has great potential, but I haven&amp;#8217;t seen enough case studies to get a feel for how pervasive it&amp;#8217;s become - at least not yet.&lt;/p&gt;

&lt;p&gt;Human resource applications, including human capital retention, are an up-and-coming contrast to marketing applications &amp;#8212; predict which employee will quit rather than the more standard prediction of which customer will defect.&lt;/p&gt;

&lt;p&gt;Beyond those, I consider the following the remaining hot areas (all represented by named case studies at PAW-09, by the way):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Marketing and CRM (both offline and online)
&lt;ul&gt;
&lt;li&gt;Response modeling&lt;/li&gt;
&lt;li&gt;Customer retention with churn modeling&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Online marketing optimization
&lt;ul&gt;
&lt;li&gt;Behavior-based advertising&lt;/li&gt;
&lt;li&gt;Email targeting&lt;/li&gt;
&lt;li&gt;Website content optimization&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Insurance pricing&lt;/li&gt;
&lt;/ul&gt;
&lt;img src="http://feeds.feedburner.com/~r/DataWrangling/~4/OUpkZFoNUrQ" height="1" width="1"/&gt;</content>
	</entry>
		<entry>
	  	<author>
			<name>Peter Skomoroch</name>
		</author>
		<title type="text/html" mode="escaped"><![CDATA[Amazon Web Services Public Datasets]]></title>
		<link rel="alternate" type="text/html" href="http://www.datawrangling.com/amazon-web-services-public-datasets" />
		<id>http://www.datawrangling.com/amazon-web-services-public-datasets</id>
		<modified>2008-11-22T02:00:02Z</modified>
		<issued>2008-11-22T02:00:02Z</issued>
		
	<dc:subject>Data mining</dc:subject>
	<dc:subject>Amazon EC2</dc:subject>
	<dc:subject>mapreduce</dc:subject>
	<dc:subject>Amazon</dc:subject> 
		<summary type="text/plain" mode="escaped"><![CDATA[Amazon announced their Hosted Public Data Sets service today, and I expect it to be a game changer.  Finding and using datasets on the web just got a lot easier.  Similar to how developers can share Amazon Machine Images on EC2, you can now freely share large datasets in the cloud using Amazon [...]]]></summary>
		<content type="text/html" mode="escaped" xml:base="http://www.datawrangling.com/amazon-web-services-public-datasets">&lt;p&gt;Amazon announced their &lt;a href="http://aws.amazon.com/publicdatasets/"&gt;Hosted Public Data Sets service&lt;/a&gt; today, and I expect it to be a game changer.  Finding and using &lt;a href="http://www.datawrangling.com/some-datasets-available-on-the-web"&gt;datasets on the web&lt;/a&gt; just got a lot easier.  Similar to how developers can share &lt;a href="http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=101"&gt;Amazon Machine Images on EC2&lt;/a&gt;, you can now freely share large datasets in the cloud using Amazon EBS snapshots.&lt;/p&gt;

&lt;p&gt;A few months ago, &lt;a href="http://www.jeff-barr.com/"&gt;Jeff Barr&lt;/a&gt; stopped by &lt;a href="http://www.juiceanalytics.com/"&gt;Juice&lt;/a&gt; to talk with our team about how we are using Amazon EC2 and SQS to scale our data mining efforts.  One of the issues I brought up was the potential cost and hassle of shuffling large datasets on and off AWS.  Jeff discussed his concept of using Amazon as a kind of data &amp;amp; application ecosystem, where various companies, researchers, and data providers interact on AWS and take advantage of the transfer efficiencies of staying within the Amazon infrastructure and using data and APIs locally.&lt;/p&gt;

&lt;p&gt;This seems to be a part of that vision, and I&amp;#8217;m looking forward to &lt;a href="http://www.cloudera.com/hadoophack/hacks"&gt;unleashing Hadoop&lt;/a&gt; on whatever data flows into the system.&lt;/p&gt;

&lt;p&gt;From the &lt;a href="http://aws.amazon.com/publicdatasets/"&gt;AWS Public Data site&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;Select public data sets are hosted on Amazon EC2 for free as &lt;a href="http://aws.amazon.com/ebs/"&gt;Amazon Elastic Block Store (Amazon EBS) snapshots&lt;/a&gt;. Amazon EC2 customers can access this data by creating their own personal Amazon EBS volumes, using the public data set snapshots as a starting point. They can then access, modify and perform computation on these volumes directly using their Amazon EC2 instances and just pay for the compute and storage resources that they use. If available, researchers can also use pre-configured Amazon Machine Images (AMIs) with tools like &lt;a href="http://www.bioteam.net/inquiry/index.html"&gt;Inquiry by BioTeam&lt;/a&gt; to perform their analysis.

To get started using the Public Data Sets on AWS, simply perform these three easy steps:

&lt;ul&gt;
   &lt;li&gt;1. Sign up for an Amazon EC2 account.&lt;/li&gt;
   &lt;li&gt;2. Launch an Amazon EC2 instance.&lt;/li&gt;
   &lt;li&gt;3. Create an Amazon EBS volume using the Snapshot ID listed in the catalog above for your chosen snapshot.&lt;/li&gt;
&lt;/ul&gt;

&amp;#8230;If you have a public domain or non-proprietary data set that you think is useful and interesting to the AWS community, please submit a request below and the AWS team will review your submission and get back to you. Typically the data sets in the repository are between 1 GB to 1 TB in size (based on the Amazon EBS volume limit), but we can work with you to host larger data sets as well. You must have the right to make the data freely available.&lt;/blockquote&gt;
</content>
	</entry>
		<entry>
	  	<author>
			<name>Peter Skomoroch</name>
		</author>
		<title type="text/html" mode="escaped"><![CDATA[Hidden Video Courses in Math, Science, and Engineering]]></title>
		<link rel="alternate" type="text/html" href="http://www.datawrangling.com/hidden-video-courses-in-math-science-and-engineering" />
		<id>http://www.datawrangling.com/hidden-video-courses-in-math-science-and-engineering.html</id>
		<modified>2008-04-09T21:04:57Z</modified>
		<issued>2008-04-09T21:04:57Z</issued>
		
	<dc:subject>Machine learning</dc:subject>
	<dc:subject>Neuroscience</dc:subject>
	<dc:subject>physics</dc:subject>
	<dc:subject>Mathematics</dc:subject>
	<dc:subject>Computer Science</dc:subject>
	<dc:subject>Lifehacks</dc:subject>
	<dc:subject>Education</dc:subject> 
		<summary type="text/plain" mode="escaped"><![CDATA[Over the last few years, a large number of open courseware directories and video lecture aggregators have popped up on the web.  These sites often include introductory courses and research seminars, but it can be difficult to find full courses covering advanced topics.  For budgetary and copyright reasons, most upper level and smaller [...]]]></summary>
		<content type="text/html" mode="escaped" xml:base="http://www.datawrangling.com/hidden-video-courses-in-math-science-and-engineering">&lt;p&gt;Over the last few years, a large number of &lt;a href=http://ocw.mit.edu/OcwWeb/web/courses/av/ tags=lectures,video,MIT,download&gt;open&lt;/a&gt; &lt;a href=http://webcast.berkeley.edu/courses.php tags=lectures,video&gt;courseware&lt;/a&gt; &lt;a href=http://stanfordocw.org/&gt;directories&lt;/a&gt; and &lt;a href=http://videolectures.net/ tags=machinelearning,lectures,video,course,tutorial,learning&gt;video lecture&lt;/a&gt; &lt;a href=http://freescienceonline.blogspot.com/ tags=programming,video,lectures,screencast,links,science,talks&gt;aggregators&lt;/a&gt; have popped up on the web.  These sites often include introductory courses and research seminars, but it can be difficult to find full courses covering advanced topics.  For &lt;a href=http://mybiasedcoin.blogspot.com/2007/08/advertising-anyone-can-take-my.html&gt;budgetary and copyright reasons&lt;/a&gt;, most upper level and smaller attendance courses are not recorded, or are only offered online for a fee.   Many schools provide access-restricted videos of advanced courses to current students, but do not make them available to the wider community.  To help remedy this, I have pulled together a big list of advanced courses with publicly available video lectures in math, physics, finance, and computer science that seem to have slipped through the cracks and included them in this post (scroll down to skip to the links).&lt;/p&gt;

&lt;p&gt;&lt;a href=http://www.kennedy-center.org/calendar/index.cfm?fuseaction=showEvent&amp;amp;event=XICRD&gt;&lt;img hspace=5 vspace=5 align=left HEIGHT=134 WIDTH=196 src="http://datawrangling.s3.amazonaws.com/burnout.jpg" ALT="Book Burnout at MIT"&gt;&lt;/a&gt;
What motivated me to pull this together? Like many people who are working full time while taking grad courses, blogging, or burning the midnight oil on a startup, I looked up after a couple of years to find I had gained a bunch of weight and was no longer in the best shape of my life.  I had too much to do, and couldn&amp;#8217;t tear myself away from coding every day for a couple of hours at the gym.   In addition to my gym problem, I had just moved to DC and missed the huge number of courses available in the Boston area. 
&lt;a href="http://www.amazon.com/gp/product/B000S5UY2G?ie=UTF8&amp;amp;tag=datawr-20&amp;amp;linkCode=as2&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=B000S5UY2G"&gt;&lt;img  hspace=5 vspace=5 align=right  border="0" src="http://datawrangling.s3.amazonaws.com/51yQ4SG63CL._AA280_.jpg"&gt;&lt;/a&gt;&lt;img src="http://www.assoc-amazon.com/e/ir?t=datawr-20&amp;amp;l=as2&amp;amp;o=1&amp;amp;a=B000S5UY2G"  width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /&gt;  It is difficult to find advanced math and physics courses that fit into a full time work schedule.  Being a geek, my first instinct was to look for a technical solution to non-technical problems.&lt;/p&gt;

&lt;p&gt;The approach I came up with was to load an Archos video player with video lectures from the web (an iPhone would probably work just as well).  After 3 months of watching machine learning lectures while on the elliptical machine, I had lost 30 lbs and learned a few things at the same time.  The motivation problems for self-study using open courseware videos are a lot like those with working out: you really intend to do something to improve yourself, but you never seem to find the time.  Somehow putting the two together and forcing myself to get things done appealed to the part of my brain which seeks extreme efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://en.wikipedia.org/wiki/Ludovico_technique" &gt;&lt;img hspace=5 vspace=5 align=left HEIGHT=268 WIDTH=392 src="http://datawrangling.s3.amazonaws.com/tabula_rasa.jpg" ALT="forcing yourself to learn something"&gt;&lt;/a&gt;&lt;br /&gt;
Most video players now come with Wi-Fi built in, so if you have wireless access at your gym you should be ready to go.  If you need to download the videos and the author&amp;#8217;s license permits it, you can use mplayer or other Linux utilities to rip the stream and encode it for your player.  Check out my del.icio.us &lt;a href=http://del.icio.us/pskomoroch/stream&gt;video streaming links&lt;/a&gt; for details.&lt;br /&gt;
There was a lot of buzz last week about the pace of technology causing &lt;a href=http://gigaom.com/2008/04/06/relax-chill-and-maybe-blog/&gt;bloggers to sacrifice health for work&lt;/a&gt;, but this might be a way for technology to actually help improve the situation.  You can force yourself to watch some video lectures and get back in shape at the same time&amp;#8230;&lt;/p&gt;

&lt;p&gt;Enough motivation, on with the links:
&lt;br /&gt;&lt;/p&gt;

&lt;h3&gt;Links to Advanced Courses with Complete Video Lectures:&lt;/h3&gt;

&lt;hr /&gt;

&lt;p&gt;See &lt;a href=http://del.icio.us/pskomoroch/video%2Blectures&gt;http://del.icio.us/pskomoroch/video+lectures&lt;/a&gt; to find updated links for complete courses&amp;#8230;this list is mostly composed of courses I hadn&amp;#8217;t seen in other directories, but includes links to some of the better Berkeley, Stanford, and MIT videos as well.&lt;/p&gt;

&lt;p&gt;&lt;img src="http://datawrangling.s3.amazonaws.com/new.gif"/&gt;&lt;b&gt;Update (02/10/09)&lt;/b&gt;: I&amp;#8217;ve bookmarked 20 new video courses since the original post was published on April 09, 2008.  The new video links have been added to the sections below and are in &lt;b&gt;bold type&lt;/b&gt;.&lt;/p&gt;

&lt;h4&gt;Physics&lt;/h4&gt;

&lt;ul&gt;
&lt;b&gt;
&lt;li&gt;&lt;a href=http://cosmicposts.wordpress.com/loop-quantum-gravity/ rel="nofollow" tags=links,resources,lectures,video,physics,gravity,quantum,loop&gt;Loop Quantum Gravity &amp;laquo;  Cosmic Posts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://www.math.hc.keio.ac.jp/coe/videos/spivak2004/ rel="nofollow" tags=video,lectures,mathematics,physics,differential,geometry,spivak,via:chl&gt;Prof. Michael D. Spivak Pathway Lectures&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://pirsa.org/C06001 rel="nofollow" tags=lectures,video,physics,gravity,quantum,loop&gt;PIRSA - Perimeter Institute Recorded Seminar Archive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-002Spring-2007/VideoLectures/ rel="nofollow" tags=video,lectures,mit,electronics,physics,circuits,opamps&gt;MIT OpenCourseWare - 6.002 Circuits and Electronics, Spring 2007 &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://bethe.cornell.edu/index.html rel="nofollow" tags=video,lectures,quantum,physics,bethe&gt;Personal and Historical Perspectives of Hans Bethe&lt;/a&gt;&lt;/li&gt;
&lt;/b&gt;
  &lt;li&gt;&lt;a href=http://www.physics.harvard.edu/about/Phys253.html rel="nofollow" tags=video,lectures,harvard,qft,quantum,field,theory,physics&gt;Harvard Physics: Quantum Field Theory by Sidney Coleman&lt;/a&gt; - 50 videos&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://bio.phys.unm.edu/524/ rel="nofollow" tags=quantumfieldtheory,physics,course,video,lectures&gt;University of New Mexico: Physics 524 Quantum Field Theory II&lt;/a&gt; - 27 videos&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://bio.phys.unm.edu/521/ rel="nofollow" tags=video,physics,quantum,course,lectures&gt;University of New Mexico: Physics 521 Quantum Mechanics&lt;/a&gt; - 32 videos&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://physicsstream.ucsd.edu/courses/spring2003/physics130a/ rel="nofollow" tags=quantum,physics,video,lectures,course&gt;UCSD Quantum Physics 130A&lt;/a&gt;, &lt;a href=http://physicsstream.ucsd.edu/courses/fall2003/physics130b/ rel="nofollow" tags=quantum,physics,video,lectures,ucsd&gt;130B&lt;/a&gt;, &lt;a href=http://physicsstream.ucsd.edu/courses/winter2004/physics130c/ rel="nofollow" tags=quantum,physics,video,lectures,ucsd&gt;130C&lt;/a&gt; ~ 25 videos each&lt;/li&gt; 
  &lt;li&gt;&lt;a href=http://shepherd.physics.sc.edu/~jjohnson/lectures.html rel="nofollow" tags=physics,video,lectures,group,theory,liegroup,mathematics&gt;University of South Carolina PHYS 729 - Applied Group Theory&lt;/a&gt; - 22 Videos, The Foundations of Theoretical Physics Using Lie Groups &amp;#038; Algebras&lt;/li&gt; 
  &lt;li&gt;&lt;a href=http://www.physics.fau.edu/~cbeetle/PHY6938.07F/index.html rel="nofollow" tags=physics,general_relativity,video,lectures,course&gt;Florida Atlantic University: PHY 6938 General Relativity &amp;#8212; Fall 2007&lt;/a&gt; - 28 videos&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.bnl.gov/video/lectures.asp rel="nofollow" tags=physics,video,lectures,cosmology&gt;Brookhaven National Laboratory Streaming Video: Cosmology for Beginners&lt;/a&gt; - 5 videos&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://ocw.mit.edu/OcwWeb/web/courses/av/#Physics rel="nofollow" tags=physics,gravity,lectures,course,video,relativity&gt;MIT OpenCourseWare | Physics | Video Lectures&lt;/a&gt; - Physics I: Classical Mechanics, 8.02 E &amp;#038; M, 8.03 Vibrations and Waves, 8.224 GR &amp;#038; Astrophysics&lt;/li&gt;    
  &lt;li&gt;&lt;a href=http://www.physics.orst.edu/~rubin/COURSES/ph464/index.html rel="nofollow" tags=video,lectures,course,computing,scientific,physics,computerscience&gt;Oregon State University - Physics 464/564, Computational Physics&lt;/a&gt; - 23 videos, based on &amp;#8220;A Survey of Computational Physics&amp;#8221;, Landau, Paez, Bordeianu&lt;/li&gt;   
  &lt;li&gt;&lt;a href=http://mediaplayer.group.cam.ac.uk/component/option,com_mediadb/task,view/idstr,CU-MSM_HB-TD-Thermodynamics/Itemid,69 rel="nofollow" tags=physics,video,lectures,thermodynamics&gt;Cambridge University Video - Thermodynamics and Phase Diagrams with Harry Bhadeshia&lt;/a&gt; - 7 videos&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://info.phys.unm.edu/talks/index.phtml rel="nofollow" tags=quantum_computing,physics,course,lectures,video&gt;University of New Mexico: Prof. Ivan H. Deutsch, Short Course in Quantum Information&lt;/a&gt; - 8 videos&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.vega.org.uk/video/subseries/16 rel="nofollow" tags=astronomy,chemistry,physics,video,lectures&gt;The Vega Science Trust - Astrophysical Chemistry by Harry Kroto&lt;/a&gt; - 8 videos&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://indico.cern.ch/conferenceDisplay.py?confId=a032483 rel="nofollow" tags=video,lectures,stringtheory,introduction,physics&gt;CERN: Introduction to String Theory - W. Lerche &lt;/a&gt; - 4 videos&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://indico.cern.ch/conferenceDisplay.py?confId=a044402 rel="nofollow" tags=video,lectures,introduction,stringtheory,physics&gt;CERN: String Theory - Johnson, C. (University of Southern California) &lt;/a&gt; - 5 videos &lt;/li&gt;
  &lt;li&gt;&lt;a href=http://indico.cern.ch/conferenceDisplay.py?confId=a063319 rel="nofollow" tags=video,lectures,physics,stringtheory&gt;CERN: String Theory for Pedestrians - Zwiebach, B. (MIT)&lt;/a&gt; - 3 videos, by the author of &amp;#8220;A First Course in String Theory&amp;#8221;&lt;/li&gt;
 &lt;li&gt;&lt;a href=http://teachers.web.cern.ch/teachers/archiv/HST2004/web/TeachersGuide/aboutPP/index.html rel="nofollow" tags=video,lectures,particle,physics,cern&gt;CERN Short Courses in Particle Physics&lt;/a&gt; - Accelerators, Detectors, Bubble Chambers, Feynman Diagrams, etc.&lt;/li&gt; 
&lt;/ul&gt;

&lt;p&gt;&lt;a id="more-29"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Mathematics&lt;/h4&gt;

&lt;ul&gt;
&lt;b&gt;
&lt;li&gt;&lt;a href=http://ocw.mit.edu/OcwWeb/Mathematics/18-02Fall-2007/VideoLectures/ rel="nofollow" tags=video,lectures,mit,multivariable,calculus,vector,physics&gt;MIT OpenCourseWare - 18.02 Multivariable Calculus, Fall 2007 &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://nptel.iitm.ac.in/ rel="nofollow" tags=video,lectures,technical,engineering,list,links,india&gt;National Programme on Technology Enhanced Learning(NPTel)&lt;/a&gt;- Chaos, Fractals &amp;#038; Dynamic Systems&lt;/li&gt;
&lt;/b&gt;
  &lt;li&gt;&lt;a href=http://www.stanford.edu/class/ee364/videos.html rel="nofollow" tags=video,lectures,optimization,stanford&gt;Stanford EE364a: Optimization Lecture Videos&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://tableau.stanford.edu/~lall/ee263-aut0708/videos.html rel="nofollow" tags=video,lectures,linear,dynamical,systems,mathematics,engineering,stanford,kalman,course&gt;Stanford EE263: Linear Dynamical Systems Lecture Videos&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://ocw.mit.edu/OcwWeb/hs/geb/VideoLectures/index.htm rel="nofollow" tags=education,course,video,lectures,mit,godel,escher,bach,computation,philosophy&gt;MIT Courseware: Godel, Escher, Bach: A Mental Space Odyssey&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.iiia.csic.es/summerschools/sscp2007/ rel="nofollow" tags=video,lectures,optimization,constraint,programming&gt;Constraint Programming Summer School 2007&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.uccs.edu/~math/vidarchive.html rel="nofollow" tags=mathematics,video,lectures&gt;University of Colorado at Colorado Springs UCCS - Mathematics Video Courses&lt;/a&gt; - Requires free registration.. lots of courses&lt;/li&gt; 
&lt;li&gt;&lt;a href=http://cmes.uccs.edu/Spring2008/Math432/archive.php?type=valid rel="nofollow" tags=analysis,mathemetics,video,lectures&gt;UCCS Math 432 Modern Analysis II | Spring 2008&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://cmes.uccs.edu/Spring2008/Math311/archive.php?type=valid rel="nofollow" tags=numbertheory,mathematics,video,lectures&gt;UCCS Math 311 Number Theory | Spring 2008&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://cmes.uccs.edu/Spring2006/Math535/archive.php?type=valid rel="nofollow" tags=mathematics,functional_analysis,video,lectures&gt;UCCS Math 535 Applied Functional Analysis | Spring 2006&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.math.tamu.edu/~mpilant/math614/videos.html rel="nofollow" tags=video,lectures,chaos,mathematics,nonlinear&gt;Texas A&amp;#038;M University - Math 614 Dynamical Systems and Chaos&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href=http://ocw.mit.edu/OcwWeb/web/courses/av/#Mathematics rel="nofollow" tags=mit,video,lectures,applied_math,mathematics&gt;MIT OpenCourseWare | Mathematics | Video Lectures&lt;/a&gt;- 18.03 Differential Equations, 18.06   Linear Algebra, 18.085  Computational Science and Engineering I, 18.086 Mathematical Methods for Engineers II&lt;/li&gt; 
 &lt;li&gt;&lt;a href=http://www.youtube.com/watch?v=EX_is9LzFSY rel="nofollow" tags=mathematics,video,lectures,calculus,funny&gt;Calculus in 20 Minutes Part 1&lt;/a&gt; &lt;a href=http://www.youtube.com/watch?v=Q9OkFTDG4fY&amp;#038;feature=related&gt;Part 2&lt;/a&gt; - funny and educational.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Computer Science &amp;#038; Engineering &lt;/h4&gt;

&lt;ul&gt;
&lt;b&gt;
&lt;li&gt;&lt;a href=http://www.hlrs.de/organization/par/par_prog_ws/ rel="nofollow" tags=video,lectures,mpi,computerscience,parallel,course,openmp,PETSc,fortran,c,towatch&gt;HLRS - Organization - Parallel Computing - Parallel Programming Workshop ONLINE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://www.clustermonkey.net//content/view/228/2/ rel="nofollow" tags=video,lectures,mpi,parallel,openmpi,cluster&gt;ClusterMonkey - MPI-Tube: Learn MPI the Internet Way&lt;/a&gt;&lt;/li&gt;
&lt;/b&gt;
  &lt;li&gt;&lt;a href=http://electures.informatik.uni-freiburg.de/catalog/chapter.do?courseId=algoIR2006&amp;#038;chapter=15# rel="nofollow" tags=video,lectures,informationretrieval,crawling,course,search&gt;Information Retrieval / Web Crawling Course - University of Freiburg&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://electures.informatik.uni-freiburg.de/catalog/chapter.do?courseId=advancedAD2006&amp;#038;chapter=3# rel="nofollow" tags=video,lectures,algorithms,computerscience,rtree,parallel&gt;Advanced Topics in Algorithms and Datastructures 2006 - University of Freiburg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://electures.informatik.uni-freiburg.de/catalog/chapter.do?courseId=advancedAD2005&amp;#038;chapter=1# rel="nofollow" tags=video,lectures,parallel,programming,computerscience,algorithms&gt;University of Freiburg - Advanced Topics in Algorithms and Datastructures 2005: Parallel Algorithms&lt;/a&gt;&lt;/li&gt;   
  &lt;li&gt;&lt;a href=http://swiss.csail.mit.edu/classes/6.001/abelson-sussman-lectures/ rel="nofollow" tags=video,lectures,lisp,mit,scheme,sicp&gt;MIT Structure and Interpretation of Computer Programs, Video Lectures&lt;/a&gt;&lt;/li&gt; 
  &lt;li&gt;&lt;a href=http://vanets.vuse.vanderbilt.edu/cs251.html rel="nofollow" tags=c++,video,lectures&gt;CS 251: Intermediate Software Design with C++ - Vanderbilt University&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-046JFall-2005/LectureNotes/ rel="nofollow" tags=algorithm,video,lectures,mit&gt;MIT OpenCourseWare | Electrical Engineering and Computer Science | 6.046J Introduction to Algorithms (SMA 5503), Fall 2005 | Lecture Notes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://savenseek.com/page/Algorithms_Video_Lectures__brainDead rel="nofollow" tags=lectures,video,course&gt;Algorithms Video Lectures from ArsDigita University&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.aduni.org/courses/theory/index.php?view=cw rel="nofollow" tags=video,lectures,compiler,computation&gt;Theory of Computation Video Lectures from ArsDigita University&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.cs.washington.edu/education/courses/582/02au/lectures/index.html rel="nofollow" tags=compiler,course,video,lectures,computerscience&gt;University of Washington CSE 582: Compilers&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.cs.washington.edu/education/courses/csep505/06sp/lectures/ rel="nofollow" tags=programming,languages,course,lectures,video,ocaml&gt;University of Washington CSE P505: Programming Languages&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://wally.cs.iupui.edu/csci240/ rel="nofollow" tags=c++,video,lectures,course,computerscience&gt;Indiana University CSCI 240 Object Oriented Programming with C++&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.nanohub.org/resources/99/ rel="nofollow" tags=python,video,lectures,scipy,numpy,mpi,swig,weave,fortran&gt;nanoHUB - Scientific Computing with Python&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="http://www.archive.org/search.php?query=Fernando+Perez+scientific+python" rel="nofollow" tags=python,video,lectures,scipy,numpy,mpi,swig,weave,fortran,py4science&gt;Py4Science - Python for scientific computing&lt;/a&gt; - 10 videos, by Fernando Perez et al. (&lt;a href="https://cirl.berkeley.edu/fperez/py4science/workshop_berkeley_2008.html"&gt;materials&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.cs.wustl.edu/~jain/cse567-06/index.html rel="nofollow" tags=video,lectures,computing,simulation,queue&gt;CSE567M: Computer Systems Analysis (2006) - Washington University in St Louis &lt;/a&gt; Comparing systems using measurement, simulation, and queueing models &lt;/li&gt;
  &lt;li&gt;&lt;a href=http://cpe.njit.edu/dlvideos/CIS/CS631/ rel="nofollow" tags=database,lectures,video&gt;NJIT Distance Learning Class Videos  for CS 631 Data Management System Design&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://cpe.njit.edu/dlvideos/CIS/CIS375-602/ rel="nofollow" tags=java,lectures,video&gt;NJIT Distance Learning Class Videos  for CIS 375_602 Applications Development and Java&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://cpe.njit.edu/dlvideos/CS630/ rel="nofollow" tags=video,lectures,operatingsystems&gt;NJIT Distance Learning Class Videos  for CS 630 Operating Systems&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://cone.informatik.uni-freiburg.de/teaching/lecture/wsn-w06/movies.html rel="nofollow" tags=video,course,lectures,wireless,sensor,networks&gt;Wireless Sensor Networks - University of Freiburg - 2006&lt;/a&gt;&lt;/li&gt;  
  &lt;li&gt;&lt;a href=http://www.soe.ucsc.edu/classes/cmpe118/Spring05/#handouts rel="nofollow" tags=video,lectures,electronics,embedded,robotics,sensors&gt;UC Santa Cruz CMPE 118 - Introduction to Mechatronics&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.ecse.rpi.edu/Homepages/shivkuma/teaching/sp2007/wbn2007/index.html rel="nofollow" tags=video,lectures,course,wireless,electricalengineering,communications,networks&gt;RPI - ECSE-6961: Fundamentals of Wireless Broadband Networks. Spring 2007.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Machine Learning&lt;/h4&gt;

&lt;ul&gt;
&lt;b&gt;
&lt;li&gt;&lt;a href=http://see.stanford.edu/SEE/lecturelist.aspx?coll=63480b48-8819-4efd-8412-263f1a472f5a rel="nofollow" tags=video,lectures,stanford,machinelearning,nlp,textmining&gt;Video Lectures: Stanford Natural Language Processing&lt;/a&gt; - Christopher Manning&lt;/li&gt;
&lt;li&gt;&lt;a href=http://www.youtube.com/results?search_query=%22Machine+Learning+%28Stanford%29%22&amp;#038;search_type=&amp;#038;aq=f rel="nofollow" tags=stanford,video,lectures,machinelearning&gt;Machine Learning (Stanford CS 229) Course Videos&lt;/a&gt;- Andrew Ng (also on iTunesU)&lt;/li&gt;
&lt;li&gt;&lt;a href=http://www.google.com/translate?u=http%3A%2F%2Favva.livejournal.com%2F1912151.html&amp;#038;langpair=ru%7Cen&amp;#038;hl=en&amp;#038;ie=UTF8 rel="nofollow" tags=video,lectures,search,google,algorithms,russian,course&gt;Yuri Lifshits - course &amp;quot;Algorithms for the Internet&amp;quot; (3 short courses in Russian)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://www.cedar.buffalo.edu/~srihari/CSE574/ rel="nofollow" tags=machinelearning,course,video,lectures,suny_buffalo&gt;SUNY Buffalo - Machine Learning (CSE574)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://www.math.uaa.alaska.edu/~afkjm/cs405/calendar.html rel="nofollow" tags=video,lectures,ai,computerscience,machinelearning&gt;U Alaska: CS A405 - Artificial Intelligence&lt;/a&gt;&lt;/li&gt;
&lt;/b&gt;
  &lt;li&gt;&lt;a href=http://www.youtube.com/results?search_query=%22Machine+Learning+%28Stanford%29%22&amp;#038;search_type=&amp;#038;aq=f rel="nofollow" tags=machinelearning,video,lectures,stanford&gt;Stanford Machine Learning CS229 (Fall 07)&lt;/a&gt; 20 lectures - thanks to Max Khesin + on iTunes, &lt;a href=http://www.stanford.edu/class/cs229/&gt;course page&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://webcast.berkeley.edu/event_details.php?webcastid=21091&amp;#038;p=1&amp;#038;ipp=15&amp;#038;category= rel="nofollow" tags=machinelearning,video,lectures,berkeley&gt;UC Berkeley Machine Learning Workshop&lt;/a&gt; 11 lectures&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://inst.eecs.berkeley.edu/~cs281a/fa05/lectures/lectures.html rel="nofollow" tags=machinelearning,lectures,video,course,berkeley&gt;CS 281A / Stat 241A: Statistical Learning Theory&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.cs.washington.edu/education/courses/577/04sp/contents.html#BP rel="nofollow" tags=video,lectures,machinelearning,vision,uw&gt;U Washington Machine Learning Videos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://electures.informatik.uni-freiburg.de/catalog/chapter.do?courseId=advancedAI2005&amp;#038;chapter=3# rel="nofollow" tags=machinelearning,video,lectures,reinforcement,learning&gt;University of Freiburg - Advanced AI Techniques - Reinforcement Learning, NLP, Bayesian Networks&lt;/a&gt;&lt;/li&gt;   
&lt;/ul&gt;

&lt;h4&gt;Neuroscience &amp;#038; Biology&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=https://www.ipam.ucla.edu/schedule.aspx?pc=gss2007 rel="nofollow" tags=video,lectures,machinelearning,neuroscience,probabilistic,mathematics,talk,ucla&gt;Graduate Summer School: Probabilistic Models of Cognition: The Mathematics of Mind&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://ctbp.ucsd.edu/curriculum/physics272/ rel="nofollow" tags=video,course,lectures,biology,molecularbiology,physics,ucsd&gt;UCSD: Quantitative Molecular Biology - Physics 172/272&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.ks.uiuc.edu/Training/SumSchool/lectures.html rel="nofollow" tags=video,lectures,biophysics,biology,physics,computational,nsf&gt;University of Illinois at Urbana-Champaign - NSF Biophysics Summer School Lectures&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.nanohub.org/resources/courses/ rel="nofollow" tags=nanotechnology,video,lectures&gt;nanoHUB - Resources &amp;#x3E; Courses&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://online.itp.ucsb.edu/online/neuro01/ rel="nofollow" tags=lectures,video,neuroscience,seung,audio,neuralnetworks&gt;ITP Program on Dynamics of Neural Networks&lt;/a&gt;- Dynamics of Neural Networks: From Biophysics to Behavior&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.hsph.harvard.edu/bioinfocore/training.html rel="nofollow" tags=machinelearning,bioinformatics,snp,genomics,video,lectures&gt;Harvard School of Public Health: Bioinformatics Core&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://webcast.berkeley.edu/course_details.php?seriesid=1906978433 rel="nofollow" tags=biology,video,lectures,cell&gt;UC Berkeley Webcasts | Video and Podcasts: MCB 130 Cell Biology&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://webcast.berkeley.edu/course_details.php?seriesid=1906978486 rel="nofollow" tags=biology,video,biochemistry,lectures&gt;UC Berkeley Webcasts | Video and Podcasts: MCB 110: General Biochemistry and Molecular Biology &lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://pathmicro.med.sc.edu/book/video.htm rel="nofollow" tags=microbiology,immunology,video,lectures&gt;University of South Carolina - Microbiology and Immunology - Streaming Video&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://media.med.sc.edu/video_index/index.php?course=microbiology2006&amp;#038;category=immunology rel="nofollow" tags=video,lectures,immunology,course&gt;University of South Carolina - Microbiology Video Index&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Finance and Econometrics&lt;/h4&gt;

&lt;ul&gt;    
&lt;b&gt;
&lt;li&gt;&lt;a href=http://oyc.yale.edu/economics/game-theory/ rel="nofollow" tags=video,gametheory,course,lectures,yale,economics,finance&gt;Game Theory — Open Yale Courses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://oyc.yale.edu/economics/financial-markets/ rel="nofollow" tags=video,finance,economics,yale,course,lectures&gt;Financial Markets — Open Yale Courses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=http://economistsview.typepad.com/videolectures/ rel="nofollow" tags=econometrics,economics,video,lectures,course,links,university_of_oregon&gt;U Oregon Economics Video Lectures&lt;/a&gt; - E421 Econometrics, E470 Monetary Theory &amp;#038; Policy, and more&lt;/li&gt;
&lt;/b&gt;
  &lt;li&gt;&lt;a href=http://www.utstat.toronto.edu/sjaimung/courses/2007-2008/sta2502/main.htm rel="nofollow" tags=stochastic,calculus,course,video,lectures,finance,toronto&gt;University of Toronto ACT 460 / STA2502 - Stochastic Methods for Actuarial Science&lt;/a&gt; - S. Jaimungal, Department of Statistics and Mathematical Finance Program &lt;/li&gt;
  &lt;li&gt;&lt;a href=http://economistsview.typepad.com/economics421/ rel="nofollow" tags=econometrics,video,lectures&gt;Economics 421 - Econometrics&lt;/a&gt;- Mark Thoma: Department of Economics, University of Oregon&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.ats.ucla.edu/STAT/seminars/ed231e/default.htm rel="nofollow" tags=statistics,video,lectures,survival,latent,factor,mixture,model,course,via:ulmerham&gt;Course Video Lectures: Latent Variable Analysis&lt;/a&gt; Professor Bengt Muth&amp;eacute;n of the UCLA Graduate School of Education &amp;#038; Information Studies&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://instruct1.cit.cornell.edu/courses/info747/ rel="nofollow" tags=datacleaning,record_linkage,video,lectures,course,cornell,economics,finance,dataset,publicdata,for:chrisgemignani,for:zgemignani&gt;INFO 747 - Social and Economic Data - Cornell Record Linkage Course Lecture Videos&lt;/a&gt; Prof. John M. Abowd&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://webcast.berkeley.edu/course_details.php?seriesid=1906978197 rel="nofollow" tags=econometrics,lectures,video&gt;UC Berkeley Webcasts: Econometrics 244&lt;/a&gt; - Discrete Choice Methods with Simulation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Seminars, Talks, and Conference Videos:&lt;/h3&gt;

&lt;hr /&gt;

&lt;p&gt;See &lt;a href=http://del.icio.us/pskomoroch/talk%2Bvideo&gt;http://del.icio.us/pskomoroch/talk+video&lt;/a&gt; for more links&amp;#8230;&lt;/p&gt;

&lt;h4&gt;Physics&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=http://www.perimeterinstitute.ca/Outreach/Public_Lectures/View_Past_Public_Lectures/ rel="nofollow" tags=physics,video,lectures,science&gt;View Past Public Lectures - Perimeter Institute for Theoretical Physics&lt;/a&gt;&lt;/li&gt;  
  &lt;li&gt;&lt;a href=http://www.asti.ac.za/lectures.php rel="nofollow" tags=physics,video,qft,lectures&gt;African Summer Theory Institute (ASTI): Online Lectures&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.physics.rutgers.edu/het/video/het-video.html rel="nofollow" tags=physics,video,lectures&gt;Rutgers Physics: NHETC video seminars&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.math.washington.edu/Seminars/Milliman/milliman-archives.php rel="nofollow" tags=mathematics,physics,video,lectures&gt;UW Math: Milliman Lectures Archive&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.vega.org.uk/video/subseries/8 rel="nofollow" tags=physics,feynman,video,lectures,electrodynamics&gt;The Vega Science Trust - Richard Feynman Videos&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://online.itp.ucsb.edu/online/ rel="nofollow" tags=physics,mathematics,neuroscience,biology,video,talk,lectures,links&gt;Kavli Institute for Theoretical Physics (KITP) Online Conferences, Lectures and Seminars&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Mathematics&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=http://www.msri.org/communications/vmath/index_html rel="nofollow" tags=video,lectures,mathematics&gt;MSRI Video Archive&lt;/a&gt;&lt;/li&gt;    
  &lt;li&gt;&lt;a href=http://www.math.duke.edu/computing/Broadcasts/General.html rel="nofollow" tags=video,lectures,mathematics,duke&gt;Duke University Mathematics Department Video Archive&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.mth.msu.edu/Seminar/Stream/ rel="nofollow" tags=mathematics,video,lectures&gt;Michigan State University Math Department - Video Lectures&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Computer Science &amp;#038; Engineering&lt;/h4&gt;

&lt;ul&gt;
 &lt;li&gt;&lt;a href=http://research.yahoo.com/node/2104 rel="nofollow" tags=hadoop,video,lectures,talks,yahoo,summit,mapreduce,hbase,amazon,ec2,distributed&gt;Hadoop Summit and Data-Intensive Computing Symposium Videos and Slides&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href=http://scpd.stanford.edu/knuth/ rel="nofollow" tags=knuth,video,lectures&gt;SCPD - Donald E. Knuth&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href=http://www.quiprocone.org/Protected/DD_lectures.htm rel="nofollow" tags=video,quantum,lectures,quantumcomputing,computation&gt;David Deutsch Video Lectures on Quantum Computation&lt;/a&gt;&lt;/li&gt;   
&lt;/ul&gt;

&lt;h4&gt;Machine Learning&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepLearningWorkshopNIPS2007 rel="nofollow" tags=machinelearning,video,lectures,NIPS&gt;Deep Learning Workshop, NIPS 2007&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://nips.cc/Conferences/2006/Program/schedule.php?Session=Tutorials rel="nofollow" tags=machinelearning,video,tutorials,lectures,NIPS&gt;NIPS : Conferences : 2006 : Program : NIPS 2006 Schedule&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://nips.cc/Conferences/2006/Media/ rel="nofollow" tags=machinelearning,nips,video,lectures,tutorials&gt;NIPS : Conferences : 2006 : Media : NIPS 2006 Media&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://nips.cc/Conferences/2005/Tutorials/ rel="nofollow" tags=nips,machinelearning,video,lectures&gt;NIPS : Conferences : 2005 : Tutorial Videos&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://videolectures.net/mmdss07_gazzada/ rel="nofollow" tags=machinelearning,video,talk,bigdata&gt;NATO Advanced Study Institute on Mining Massive Data Sets for Security&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Neuroscience &amp;#038; Biology&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=http://www.imaginggenetics.uci.edu/archives_2006.htm rel="nofollow" tags=neuroscience,genetics,video,talk,lectures&gt;UC Irvine International Imaging Genetics Conference&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://icnc.huji.ac.il/activities/heller/ rel="nofollow" tags=video,lectures,neuroscience,neuralnetwork&gt;Hebrew University of Jerusalem: Heller Lecture Series in Computational Neuroscience&lt;/a&gt;&lt;/li&gt;  
  &lt;li&gt;&lt;a href=http://videocast.nih.gov/PastEvents.asp?c=16 rel="nofollow" tags=neuroscience,video,lectures,talk,nih&gt;NIH VideoCasting: Past Events&lt;/a&gt;&lt;/li&gt;  
  &lt;li&gt;&lt;a href=http://www.utdallas.edu/~kilgard/lectures.htm rel="nofollow" tags=neuroscience,lectures,video&gt;U Texas: Collection of Online Neuroscience Lectures&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.archive.org/search.php?query=2007%2Bbrain%2Bnetwork%2Bdynamics rel="nofollow" tags=neuroscience,talk,video,lectures&gt;Internet Archive Search: 2007+brain+network+dynamics&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://sulcus.berkeley.edu/wjf/2007BrainNetworkDynamicsUCBConference/2007BrainNetworkDynamicsUCBConferenceVids.html rel="nofollow" tags=neuroscience,conference,talk,video,lectures&gt;Conference on Brain Network Dynamics 2007 - University of California Berkeley&lt;/a&gt;&lt;/li&gt;    
  &lt;li&gt;&lt;a href=https://www.nanohub.org/resources/seminars/ rel="nofollow" tags=video,lectures,nanotechnology,physics,engineering&gt;nanoHUB - Resources &amp;#x3E; Online Presentations&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://mbi.osu.edu/2003/ws4abstracts.html rel="nofollow" tags=video,lectures,talk,cardiac,calcium,biology,computational&gt;Mathematical Biosciences Institute: Workshop on Biophysics and Mathematical Models of Calcium Channels&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;Finance and Economics&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=http://www.law.uconn.edu/news/events/itax/arbitrage.html rel="nofollow" tags=finance,tax,arbitrage,video,lectures&gt;International Tax Lecture Series - University of Connecticut School of Law&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://nobelprize.org/nobel_prizes/economics/laureates/2002/kahneman-lecture.html rel="nofollow" tags=finance,video,lectures&gt;Daniel Kahneman - Nobel Prize Lecture: Maps of Bounded Rationality&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Open Courseware Directories and Other Video Lecture Roundup Posts&lt;/h3&gt;

&lt;hr /&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=http://webcast.berkeley.edu/courses.php tags=lectures,video&gt;Berkeley Course Webcasts&lt;/a&gt;&lt;/li&gt; 
  &lt;li&gt;&lt;a href=http://ocw.mit.edu/OcwWeb/web/courses/av/ tags=lectures,video,MIT,download&gt;MIT OpenCourseWare Videos&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://open.yale.edu/courses/index.html tags=education,yale,opencourseware,video,lectures,for:kfcooke&gt;Open Yale Courses&lt;/a&gt;&lt;/li&gt;  
  &lt;li&gt;&lt;a href=http://videolectures.net/ tags=machinelearning,lectures,video,course,tutorial,learning&gt;VideoLectures - exchange ideas &amp;#038; share knowledge&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://freescienceonline.blogspot.com/ tags=programming,video,lectures,screencast,links,science,talks&gt;Free Science and Video Lectures Online!&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://www.lecturefox.com/ tags=video,lectures,podcasts&gt;Lecturefox: free university lectures - computer science, mathematics, physics&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://machine-learning.blogspot.com/2007/03/machine-learning-online-lectures.html tags=machinelearning,lectures,video&gt;Business Intelligence, Data Mining &amp;#038; Machine Learning: Machine Learning OnLine Lectures&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://emotion.inrialpes.fr/~dangauthier/blog/2007/03/16/machine-learning-videos/ tags=video,machinelearning,lectures&gt;Yet Another Machine Learning Blog » Machine learning videos [Pierre Dangauthier]&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=http://obousquet.googlepages.com/mlvideos tags=machinelearning,lectures,video&gt;obousquet - ML Videos&lt;/a&gt; - Online videos of talks or lectures about Machine Learning related topics&lt;/li&gt;
&lt;/ul&gt;
</content>
	</entry>
	</feed>
