<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-2523819181563716059</id><updated>2026-04-27T04:13:22.861-07:00</updated><category term="data mining"/><category term="rapidminer"/><category term="extract transform load"/><category term="google docs spreadsheets"/><category term="text analysis"/><category term="xpath"/><category term="ajax web scraping scraper"/><category term="business intelligence"/><category term="concept mining"/><category term="crawling rules"/><category term="etl"/><category term="extjs ext js tutorial learn help"/><category term="google spreadsheets"/><category term="how to scrape ajax web pages"/><category term="importxml"/><category term="k-nn"/><category term="knn"/><category term="naive bayes"/><category term="r"/><category term="rapid miner"/><category term="rapidminer data mining etl"/><category term="robots.txt"/><category term="text mining"/><category term="tutorial"/><category term="web crawl"/><category term="web crawling"/><category term="web scraping"/><category term="web scraping rapidminer xpath web scrape rapid miner x-path"/><category term="x-path"/><title type='text'>Vancouver Data Blog by Neil McGuigan</title><subtitle type='html'>Some RapidMiner, some JMP, some Google Docs</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default?redirect=false'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/'/><link 
rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default?start-index=26&amp;max-results=25&amp;redirect=false'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>64</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-3745787646496993296</id><published>2016-08-05T15:35:00.001-07:00</published><updated>2024-04-04T10:30:20.842-07:00</updated><title type='text'>Most of my blogging is on database-patterns.blogspot.com now</title><summary type="text">Go to https://database-patterns.blogspot.com/</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/3745787646496993296/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2016/08/most-of-my-blogging-is-on.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/3745787646496993296'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/3745787646496993296'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2016/08/most-of-my-blogging-is-on.html' title='Most of my blogging is on database-patterns.blogspot.com now'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-1513370372912776998</id><published>2013-07-30T14:25:00.002-07:00</published><updated>2013-07-30T14:25:53.291-07:00</updated><title type='text'>JMP 11 statistics sneak peek</title><summary type="text">JMP 11 Sneak Peek just came out today.</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/1513370372912776998/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2013/07/jmp-11-statistics-sneak-peek.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/1513370372912776998'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/1513370372912776998'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2013/07/jmp-11-statistics-sneak-peek.html' title='JMP 11 statistics sneak peek'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-3013544504700835089</id><published>2013-05-26T23:02:00.001-07:00</published><updated>2013-05-26T23:05:15.808-07:00</updated><title type='text'>Text Mining Performance in RapidMiner</title><summary type="text">I did load testing with RapidMiner 5.3 on my laptop (Core i3, 8 GB RAM, non-SSD hard drive). Here are the results.

I set Java's maximum memory to 6500 MB.

I used the Read Database operator to get the documents. Each document consisted of random Latin words and was 20 to 500 words long.

The text processing was purposefully simple: tokenize the document and get the binary word vector.

I then stored the </summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/3013544504700835089/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2013/05/text-mining-performance-in-rapidminer.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/3013544504700835089'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/3013544504700835089'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2013/05/text-mining-performance-in-rapidminer.html' title='Text Mining Performance in RapidMiner'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-2130792509713501426</id><published>2013-05-16T12:41:00.001-07:00</published><updated>2013-05-16T12:41:19.697-07:00</updated><title type='text'>AWS Redshift: How Amazon Changed The Game</title><summary type="text">A good blog post on Amazon RedShift - their Postgres-based massive data warehouse. Some good analysis on performance and costs:&amp;nbsp; 

http://blog.aggregateknowledge.com/2013/05/16/aws-redshift-how-amazon-changed-the-game/</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/2130792509713501426/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2013/05/aws-redshift-how-amazon-changed-game.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/2130792509713501426'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/2130792509713501426'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2013/05/aws-redshift-how-amazon-changed-game.html' title='AWS Redshift: How Amazon Changed The Game'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-7440313607929380140</id><published>2013-04-18T22:53:00.001-07:00</published><updated>2013-04-18T22:53:48.886-07:00</updated><title type='text'>Vancouver Training: Introduction to Data Mining and Predictive Analytics with RapidMiner - Save $500</title><summary type="text">I&#39;ll be teaching a RapidMiner course here in Vancouver next week:Tuesday, April 23, 2013 at 8:30 AM - Wednesday, April 24, 2013 at 5:00 PM (PDT)Details here:http://rapid-i_us_20130423-eorg.eventbrite.com/Save $500 with the coupon VAN_BLOG !</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/7440313607929380140/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://vancouverdata.blogspot.com/2013/04/vancouver-training-introduction-to-data.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/7440313607929380140'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/7440313607929380140'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2013/04/vancouver-training-introduction-to-data.html' title='Vancouver Training: Introduction to Data Mining and Predictive Analytics with RapidMiner - Save $500'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-8783676131394604283</id><published>2013-02-12T17:56:00.000-08:00</published><updated>2013-02-12T17:56:00.998-08:00</updated><title type='text'>Google&#39;s Data Mining Research Papers</title><summary type="text">In case you missed it, here are Google&#39;s 104 data mining research papers:

http://research.google.com/pubs/DataMining.html

</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/8783676131394604283/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2013/02/googles-data-mining-research-papers.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/8783676131394604283'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/8783676131394604283'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2013/02/googles-data-mining-research-papers.html' title='Google&#39;s Data Mining Research Papers'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-5468419230103865951</id><published>2012-12-20T20:15:00.000-08:00</published><updated>2012-12-20T20:15:51.932-08:00</updated><title type='text'>The Google F1 slides</title><summary type="text">Google F1 is a relational database query engine that works on top of Google Spanner, which is a distributed storage system that sits on top of Google File System. Got it? :)

Basically, it&#39;s a really big, distributed relational database, and Google is using F1 to replace MySQL for AdWords.

http://www.stanford.edu/class/cs347/slides/f1.pdf

</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/5468419230103865951/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/12/the-google-f1-slides.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/5468419230103865951'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/5468419230103865951'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/12/the-google-f1-slides.html' title='The Google F1 slides'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-4801634930985173130</id><published>2012-11-01T22:56:00.001-07:00</published><updated>2012-11-01T22:56:46.276-07:00</updated><title type='text'>Chomsky on Where AI Went Wrong</title><summary type="text">If one were to rank a list of civilization&#39;s greatest and most elusive intellectual challenges, the problem of &quot;decoding&quot; ourselves -- understanding the inner workings of our minds and our brains, and how the architecture of these elements is encoded in our genome -- would surely be at the top. 
Yet the diverse fields that took on this challenge, from philosophy and psychology to computer science </summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/4801634930985173130/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/11/chomsky-on-where-ai-went-wrong.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/4801634930985173130'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/4801634930985173130'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/11/chomsky-on-where-ai-went-wrong.html' title='Chomsky on Where AI Went Wrong'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-6417931691516940891</id><published>2012-11-01T22:54:00.002-07:00</published><updated>2012-11-01T22:54:49.872-07:00</updated><title type='text'>The father of fractals</title><summary type="text">A nice little piece on Mandelbrot in The Economist:

http://www.economist.com/node/2246127</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/6417931691516940891/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/11/the-father-of-fractals.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/6417931691516940891'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/6417931691516940891'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/11/the-father-of-fractals.html' title='The father of fractals'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-7588855649941313006</id><published>2012-09-26T13:53:00.002-07:00</published><updated>2012-09-26T13:53:32.489-07:00</updated><title type='text'>As I predicted, Self-driving cars a reality for &#39;ordinary people&#39; within 5 years, says Google&#39;s Sergey Brin</title><summary type="text">Link here:

http://www.computerworld.com/s/article/9231707/Self_driving_cars_a_reality_for_39_ordinary_people_39_within_5_years_says_Google_39_s_Sergey_Brin</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/7588855649941313006/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/09/self-driving-cars-reality-for-ordinary.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/7588855649941313006'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/7588855649941313006'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/09/self-driving-cars-reality-for-ordinary.html' title='As I predicted, Self-driving cars a reality for &#39;ordinary people&#39; within 5 years, says Google&#39;s Sergey Brin'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-4755800189553045656</id><published>2012-09-26T13:13:00.001-07:00</published><updated>2012-09-26T13:15:16.321-07:00</updated><title type='text'>The Google Spanner Paper</title><summary type="text">Google spanner is a massively distributed database. It needs atomic clocks on each machine to work though...

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012.pdf</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/4755800189553045656/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/09/the-google-spanner-paper.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/4755800189553045656'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/4755800189553045656'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/09/the-google-spanner-paper.html' title='The Google Spanner Paper'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-4425466551555268857</id><published>2012-09-07T20:30:00.002-07:00</published><updated>2012-10-27T12:47:51.752-07:00</updated><title type='text'>The Google Dremel Paper</title><summary type="text">Here is the paper describing Google Dremel, which may replace Hive one day. There does not seem to be anyone working on an open-source version though

Link (PDF)

Update: Apache Drill is the open source version of Dremel (hat tip to Zoltan).

Also, Cloudera&#39;s Impala looks similar.</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/4425466551555268857/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/09/the-google-dremel-paper.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/4425466551555268857'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/4425466551555268857'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/09/the-google-dremel-paper.html' title='The Google Dremel Paper'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-5744926602260808931</id><published>2012-09-07T20:22:00.001-07:00</published><updated>2012-09-07T20:23:07.274-07:00</updated><title type='text'>Self-driving cars:  The next revolution</title><summary type="text">Here is a recent report from KPMG about self-driving cars:

http://www.kpmg.com/US/en/IssuesAndInsights/ArticlesPublications/Documents/self-driving-cars-next-revolution.pdf</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/5744926602260808931/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/09/self-driving-cars-next-revolution.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/5744926602260808931'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/5744926602260808931'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/09/self-driving-cars-next-revolution.html' title='Self-driving cars:  The next revolution'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-994590053096386889</id><published>2012-08-07T17:36:00.000-07:00</published><updated>2012-10-24T23:09:54.200-07:00</updated><title type='text'>Google’s Self-Driving Cars Are Going to Change Everything</title><summary type="text">Recent News:

Google’s Self-Driving Cars Complete 300K Miles Without Accident, Deemed Ready For Commuting
http://techcrunch.com/2012/08/07/google-cars-300000-miles-without-accident/



Here&#39;s what is going to happen in the next 5-10 years. It won&#39;t all happen right away.



The car insurance industry will cease to exist. These cars aren&#39;t going to crash. Even if there are hold-outs that drive </summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/994590053096386889/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/08/googles-self-driving-cars-are-going-to.html#comment-form' title='56 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/994590053096386889'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/994590053096386889'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/08/googles-self-driving-cars-are-going-to.html' title='Google’s Self-Driving Cars Are Going to Change Everything'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>56</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-6460311686727702164</id><published>2012-02-11T20:36:00.000-08:00</published><updated>2012-02-12T20:34:44.357-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="ajax web scraping scraper"/><title type='text'>Less Painful AJAX / Javascript Web Scraping</title><summary type="text">If you read my previous post, you&#39;ll see that scraping ajax pages can be a pain. So I wrote a little Java program to make it easier. It takes a list of URLs to scrape, and will render them in a browser, and save the (normal and ajax) rendered HTML and screenshots to a folder. 

Here&#39;s the how-to video:



You need Firefox 3+ installed, as well as Java 1.6. This is a beta project, and no warranty </summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/6460311686727702164/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/02/less-painful-ajax-javascript-web.html#comment-form' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/6460311686727702164'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/6460311686727702164'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/02/less-painful-ajax-javascript-web.html' title='Less Painful AJAX / Javascript Web Scraping'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img.youtube.com/vi/wB9-rRmjT2E/default.jpg" height="72" width="72"/><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-3679647273351338746</id><published>2012-02-09T16:01:00.000-08:00</published><updated>2012-06-11T21:12:23.865-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="how to scrape ajax web pages"/><title type='text'>Web Scraping AJAX Pages</title><summary type="text">This is part four of a series of video tutorials on web scraping and web crawling.

You can probably skip this one, and go to the easy version.


Part 1: Web scraping with Google Spreadsheets and XPath

Part 2: Web Crawling with RapidMiner

Part 3: Web Scraping with RapidMiner and Xpath

This post explains how to capture HTML from Ajax / Javascript generated pages.

Here is the </summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/3679647273351338746/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/02/web-scraping-ajax-pages.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/3679647273351338746'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/3679647273351338746'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/02/web-scraping-ajax-pages.html' title='Web Scraping AJAX Pages'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-8179698928628077273</id><published>2012-01-29T17:59:00.000-08:00</published><updated>2012-01-29T17:59:03.714-08:00</updated><title type='text'>On Making Videos</title><summary type="text">Here is what I use to make my videos:


1. CamStudio. This is a nice free and open-source desktop video capture program. Make sure to use their Lossless Codec, and go with these settings:

Set Keyframes Every = 30 frames
Capture Frames Every = 50 milliseconds
Playback Rate = 20 frames per second
Video codec: CamStudio Lossless Codec 
Quality: 70%


2. Handbrake Video Transcoder. This will help</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/8179698928628077273/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2012/01/on-making-videos.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/8179698928628077273'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/8179698928628077273'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2012/01/on-making-videos.html' title='On Making Videos'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-3878839214082603433</id><published>2011-12-31T19:38:00.003-08:00</published><updated>2011-12-31T19:38:49.203-08:00</updated><title type='text'>Happy New Year</title><summary type="text">75,000 pageviews this year! Thanks to everyone for visiting. I will post some new material in the new year.

Have a safe and fun 2012!

Neil</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/3878839214082603433/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2011/12/happy-new-year.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/3878839214082603433'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/3878839214082603433'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2011/12/happy-new-year.html' title='Happy New Year'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-5679397571750587883</id><published>2011-11-04T14:39:00.001-07:00</published><updated>2012-09-18T23:27:10.651-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="extjs ext js tutorial learn help"/><title type='text'>My new blog about learning ExtJS</title><summary type="text">I have a new blog. It&#39;s about learning to use ExtJS, a great rich internet application library in javascript. Here it is:

http://extjs-tutorials.blogspot.com/

Check it out. Thanks
Don&#39;t worry, I&#39;ll keep posting here too</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/5679397571750587883/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2011/11/extjs-ext-js-learn-tutorial-help.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/5679397571750587883'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/5679397571750587883'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2011/11/extjs-ext-js-learn-tutorial-help.html' title='My new blog about learning ExtJS'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-6633676995435803511</id><published>2011-10-09T22:26:00.000-07:00</published><updated>2011-10-09T22:26:31.748-07:00</updated><title type='text'>How Obama&#39;s data-crunching prowess may get him re-elected</title><summary type="text">An article on CNN about how the Obama 2012 campaign has hired many data miners and statisticians to help boost fundraising and support.

http://www.cnn.com/2011/10/09/tech/innovation/obama-data-crunching-election/index.html?hpt=hp_c1</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/6633676995435803511/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2011/10/how-obamas-data-crunching-prowess-may.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/6633676995435803511'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/6633676995435803511'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2011/10/how-obamas-data-crunching-prowess-may.html' title='How Obama&#39;s data-crunching prowess may get him re-elected'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-1939800431561533890</id><published>2011-10-08T15:31:00.001-07:00</published><updated>2012-11-06T12:15:18.679-08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="data mining"/><category scheme="http://www.blogger.com/atom/ns#" term="r"/><category scheme="http://www.blogger.com/atom/ns#" term="rapidminer"/><category scheme="http://www.blogger.com/atom/ns#" term="text analysis"/><category scheme="http://www.blogger.com/atom/ns#" term="text mining"/><title type='text'>Text Analytics with RapidMiner Part 6 of 6 - Applying the Model to New Documents</title><summary type="text">After my last series, I got a lot of questions about how to apply a model to new data, so here is the real final installment in the series.

I show how to save a wordlist and model to the repository. I use them later to read the wordlist and model and apply them to new documents that RapidMiner hasn&#39;t seen before. It correctly labels 11 of the 12 documents.



Files from the video.</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/1939800431561533890/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2011/10/rapidminer-text-mining-r-analytics.html#comment-form' title='19 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/1939800431561533890'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/1939800431561533890'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2011/10/rapidminer-text-mining-r-analytics.html' title='Text Analytics with RapidMiner Part 6 of 6 - Applying the Model to New Documents'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img.youtube.com/vi/9I0BcMuhPe8/default.jpg" height="72" width="72"/><thr:total>19</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-6512368803019856563</id><published>2011-09-02T21:06:00.000-07:00</published><updated>2011-09-02T21:05:54.648-07:00</updated><title type='text'>September sunset</title><summary type="text"></summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/6512368803019856563/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2011/09/september-sunset.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/6512368803019856563'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/6512368803019856563'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2011/09/september-sunset.html' title='September sunset'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqm1KKbbxQFxzWiqjWUccC4pfcabaLmRgBF6StJWuVmyrjBzsoslaUIdcC6ig0euALqdTP-CGEeVhPjaWCRrQUQOAdktF4f_EZ85cGgMebfwqW3Ky5S9qRQtPmnxho5SELdNZEJT_kTjM/s72-c/%253D%253Futf-8%253FB%253FVmFuY291dmVyLTIwMTEwOTAyLTAwMDY3LmpwZw%253D%253D%253F%253D-754650" height="72" width="72"/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-1128744730928246656</id><published>2011-08-27T20:01:00.002-07:00</published><updated>2011-08-28T11:10:13.003-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="data mining"/><category scheme="http://www.blogger.com/atom/ns#" term="extract transform load"/><category scheme="http://www.blogger.com/atom/ns#" term="rapidminer"/><title type='text'>RapidMiner ETL - Transforming Attributes with Functions</title><summary type="text">In this video I show how to transform features in RapidMiner using operators such as log, sqrt, absolute value, and multiplying columns.

</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/1128744730928246656/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2011/08/rapidminer-etl-transforming-attributes.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/1128744730928246656'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/1128744730928246656'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2011/08/rapidminer-etl-transforming-attributes.html' title='RapidMiner ETL - Transforming Attributes with Functions'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img.youtube.com/vi/6uBKg9-EMRk/default.jpg" height="72" width="72"/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-6208340498780104596</id><published>2011-08-27T20:01:00.000-07:00</published><updated>2011-08-28T11:06:25.178-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="data mining"/><category scheme="http://www.blogger.com/atom/ns#" term="extract transform load"/><title type='text'>RapidMiner ETL - Normalizing, Discretizing, Recoding</title><summary type="text">In this video I show how to normalize an attribute, including z-normalization, how to discretize a column, and how to recode values.


</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/6208340498780104596/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2011/08/rapidminer-etl-normalizing-discretizing.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/6208340498780104596'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/6208340498780104596'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2011/08/rapidminer-etl-normalizing-discretizing.html' title='RapidMiner ETL - Normalizing, Discretizing, Recoding'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img.youtube.com/vi/XfvSIgcTDZs/default.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2523819181563716059.post-4366878170088799710</id><published>2011-08-25T18:18:00.000-07:00</published><updated>2011-08-26T10:43:23.852-07:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="rapidminer data mining etl"/><title type='text'>RapidMiner ETL - Sampling, Selecting Rows, Attributes</title><summary type="text">In this video I show how to sample rows, including balancing class labels and bootstrap sampling. I also show how to filter rows by value and select a subset of attributes.



You can get the dataset here</summary><link rel='replies' type='application/atom+xml' href='http://vancouverdata.blogspot.com/feeds/4366878170088799710/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://vancouverdata.blogspot.com/2011/08/rapidminer-etl-sampling-selecting-rows.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/4366878170088799710'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2523819181563716059/posts/default/4366878170088799710'/><link rel='alternate' type='text/html' href='http://vancouverdata.blogspot.com/2011/08/rapidminer-etl-sampling-selecting-rows.html' title='RapidMiner ETL - Sampling, Selecting Rows, Attributes'/><author><name>Neil McGuigan</name><uri>http://www.blogger.com/profile/14122981831780837323</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://img.youtube.com/vi/DtKE2aaRhAU/default.jpg" height="72" width="72"/><thr:total>2</thr:total></entry></feed>