<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Entrepreneurial Geekiness</title>
	
	<link>http://ianozsvald.com</link>
	<description>My thoughts on screencasting, the A.I. Cookbook and high-tech entrepreneurship</description>
	<lastBuildDate>Tue, 18 Jun 2013 11:01:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=</generator>
		<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/rss+xml" href="http://feeds.feedburner.com/EntrepreneurialGeekiness" /><feedburner:info uri="entrepreneurialgeekiness" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><item>
		<title>Demonstrating the first Brand Disambiguator (a hacky, crappy classifier that does something useful)</title>
		<link>http://feedproxy.google.com/~r/EntrepreneurialGeekiness/~3/WZ77TbxGtyE/</link>
		<comments>http://ianozsvald.com/2013/06/17/demonstrating-the-first-brand-disambiguator-a-hacky-crappy-classifier-that-does-something-useful/#comments</comments>
		<pubDate>Mon, 17 Jun 2013 19:13:44 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[ArtificialIntelligence]]></category>
		<category><![CDATA[Data science]]></category>
		<category><![CDATA[Life]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SocialMediaBrandDisambiguator]]></category>
		<category><![CDATA[Api]]></category>
		<category><![CDATA[Apple Juice]]></category>
		<category><![CDATA[Apple Sauce]]></category>
		<category><![CDATA[Baseline]]></category>
		<category><![CDATA[Cl]]></category>
		<category><![CDATA[Classifier]]></category>
		<category><![CDATA[Crappy]]></category>
		<category><![CDATA[Dense Mat]]></category>
		<category><![CDATA[Disambiguate]]></category>
		<category><![CDATA[Disambiguation]]></category>
		<category><![CDATA[Entity Recognition]]></category>
		<category><![CDATA[Pleasure]]></category>
		<category><![CDATA[Py]]></category>
		<category><![CDATA[Real Frequency]]></category>
		<category><![CDATA[Scientists]]></category>
		<category><![CDATA[Shabby Apple]]></category>
		<category><![CDATA[Slides]]></category>
		<category><![CDATA[Test Train]]></category>
		<category><![CDATA[Threshold]]></category>
		<category><![CDATA[Tokens]]></category>
		<category><![CDATA[Tweet]]></category>
		<category><![CDATA[Tweets]]></category>
		<category><![CDATA[Validation]]></category>
		<category><![CDATA[Week 1]]></category>
		<category><![CDATA[Word Sense]]></category>

		<guid isPermaLink="false">http://ianozsvald.com/?p=1865</guid>
		<description><![CDATA[Last week I had the pleasure of talking at both BrightonPython and DataScienceLondon to about 150 people in total (Robin East wrote-up the DataScience night). The updated code is in github. The goal is to disambiguate the word-sense of a token (e.g. &#8220;Apple&#8221;) in a tweet as being either the-brand-I-care-about (in this case &#8211; Apple [...]]]></description>
				<content:encoded><![CDATA[<p>Last week I had the pleasure of talking at both <a href="http://brightonpy.org/meetings/2013-06-11/">BrightonPython</a> and <a href="http://www.meetup.com/Data-Science-London/events/123032212/">DataScienceLondon</a> to about 150 people in total (Robin East <a href="https://robineast.wordpress.com/2013/06/14/data-science-london-meetup-june-2013/">wrote-up</a> the DataScience night). The <a href="https://github.com/ianozsvald/social_media_brand_disambiguator">updated code</a> is in github.</p>
<p>The goal is to disambiguate the <a href="https://en.wikipedia.org/wiki/Word_sense">word-sense</a> of a token (e.g. &#8220;Apple&#8221;) in a tweet as being either the-brand-I-care-about (in this case &#8211; Apple Inc.) or anything-else (e.g. apple sauce, Shabby Apple clothing, apple juice etc). This is related to named entity recognition; I&#8217;m exploring simple techniques for disambiguation. In both talks people asked if this could classify an arbitrary tweet as being &#8220;about Apple Inc or not&#8221; and whilst this is possible, for this project I&#8217;m restricting myself to the (achievable, I think) goal of robust disambiguation within the one-month timeline I&#8217;ve set myself.</p>
<p>Below are the <a href="https://speakerdeck.com/ianozsvald/detecting-the-right-apples-and-oranges-1-hour-talk-on-python-for-brand-disambiguation-using-scikit-learn-at-brightonpython-june-2013">slides</a> from the longer of the two talks at BrightonPython:<br />
<script class="speakerdeck-embed" type="text/javascript" src="//speakerdeck.com/assets/embed.js" async="" data-id="08288690b59f0130552832ce4b0305c5" data-ratio="1.33333333333333"></script></p>
<p>As noted in the slides for week 1 of the project I built a trivial <a href="http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression">LogisticRegression</a> classifier using the default <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer">CountVectorizer</a>, applied a threshold and tested the resulting model on a held-out validation set. Now I have a few more weeks to build on the project before returning to <a href="http://morconsulting.com/">consulting work</a>.</p>
<p>Currently I use a JSON file of tweets filtered on the term &#8216;apple&#8217;, obtained from Twitter&#8217;s free streaming API using cURL. I then annotate the tweets as being in-class (apple-the-brand) or out-of-class (any other use of the term &#8220;apple&#8221;). I use the <a href="https://pypi.python.org/pypi/chromium_compact_language_detector">Chromium Language Detector</a> to filter out non-English tweets and I also discard English tweets that I can&#8217;t disambiguate for this data set. In total I annotated 2014 tweets. This set contains many duplicates (e.g. retweets) which I&#8217;ll probably thin out later; possibly they over-represent the real frequency of important tokens.</p>
<p>Next I built a validation set using 100 in- and 100 out-of-class tweets at random and created a separate test/train set with 584 tweets of each class (a balanced set from the two classes but ignoring the issue of duplicates due to retweets inside each class).</p>
<p>To convert the tweets into a dense matrix for learning I used the CountVectorizer with all the defaults (simple tokenizer [which is not great for tweets], minimum document frequency=1, unigrams only).</p>
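<p>A minimal sketch of what those defaults do to a tweet, assuming the documented default behaviour of CountVectorizer (lowercasing, and a token pattern that keeps runs of two or more word characters). This is a pure-Python approximation for illustration, not scikit-learn itself and not the code from the repo; the example tweets are made up. Note that CountVectorizer returns a sparse matrix by default, whereas the sketch builds a small dense list-of-lists.</p>

```python
import re
from collections import Counter

# Approximation of CountVectorizer's defaults: lowercase the text, then keep
# runs of two or more word characters (the default token_pattern).
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def unigram_counts(tweet):
    """Return a Counter of unigram tokens for one tweet."""
    return Counter(DEFAULT_TOKEN_PATTERN.findall(tweet.lower()))


def build_matrix(tweets):
    """Return (sorted vocabulary, dense list-of-lists count matrix)."""
    counts = [unigram_counts(t) for t in tweets]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical example tweets, not taken from the annotated data set.
tweets = [
    "Loving my new #apple iphone @timcook",
    "apple juice and apple sauce",
]
vocabulary, matrix = build_matrix(tweets)
print(vocabulary)
print(matrix)
```

<p>The hashtag and the @user lose their prefixes under this pattern, which is one reason the simple tokenizer is not great for tweets.</p>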
<p>Using the simplest possible approach that could work, I trained a LogisticRegression classifier with all its defaults on the dense matrix of 1168 input tweets. I then applied this classifier to the held-out validation set using a confidence threshold (&gt;92% for in-class, anything less is assumed to be out-of-class). It classified 51 of the 100 in-class examples as in-class and made no errors (100% precision, 51% recall). This threshold was chosen arbitrarily on the validation set rather than being derived from the test/train set (poor hackery on my part), but it satisfied me that this basic approach was learning something useful from this first data set.</p>
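<p>The thresholding and scoring step can be sketched as below. It assumes you already have a per-tweet probability of being in-class (for example from the classifier&#8217;s <code>predict_proba</code>); the function name and the toy numbers are hypothetical, chosen only to show how a strict threshold trades recall for precision.</p>

```python
def precision_recall(probabilities, labels, threshold=0.92):
    """Score in-class predictions made only when probability > threshold.

    probabilities: estimated P(in-class) for each tweet.
    labels: 1 for in-class, 0 for out-of-class.
    """
    predicted = [p > threshold for p in probabilities]
    pairs = list(zip(predicted, labels))
    true_pos = sum(1 for pred, y in pairs if pred and y == 1)
    false_pos = sum(1 for pred, y in pairs if pred and y == 0)
    false_neg = sum(1 for pred, y in pairs if not pred and y == 1)
    # Guard against division by zero when nothing clears the threshold.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Made-up probabilities and labels, purely for illustration.
probabilities = [0.99, 0.95, 0.80, 0.60, 0.30, 0.10]
labels = [1, 1, 1, 1, 0, 0]
precision, recall = precision_recall(probabilities, labels)
print(precision, recall)
```

<p>With the figures quoted above the same arithmetic gives precision 51/51 and recall 51/100.</p>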
<p>The strong (but not generalised at all!) result for the very basic LogisticRegression classifier will be due to token artefacts in the time period I chose (March 13th 2013 around 7pm for the 2014 tweets). Extracting the top features from LogisticRegression shows that it is identifying terms like &#8220;Tim&#8221;, &#8220;Cook&#8221;, &#8220;CEO&#8221; as significant features (along with other features that you&#8217;d expect to see like &#8220;iphone&#8221; and &#8220;sauce&#8221; and &#8220;juice&#8221;) &#8211; this is due to their prevalence in this small dataset (in this set examples like <a href="https://twitter.com/trendblognet/statuses/311959699010502656">this</a> are very frequent). Once a larger dataset is used this advantage will disappear.</p>
<p>I&#8217;ve added some TODO items to the <a href="https://github.com/ianozsvald/social_media_brand_disambiguator/blob/master/README.md">README</a>; maybe someone wants to tinker with the code? Building an interface to the open source <a href="http://dbpedia-spotlight.github.io/demo/">DBPediaSpotlight</a> (based on Wikipedia data using e.g. this <a href="https://github.com/newsgrape/pyspotlight">Python wrapper</a>) would be a great start for validating progress, along with building some naive classifiers (a capital-letter-detecting one and a more complex heuristic-based one, to use as controls against the machine learning approach).</p>
<p>Looking at the data, 6% of the out-of-class examples are retweets and 20% of the in-class examples are retweets. I suspect that the repeated strings are distorting each class, so I think they need to be thinned out so that we have just one unique example of each tweet.</p>
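<p>One way the thinning could be done is sketched below. It assumes retweets are marked with a leading &#8220;RT @user:&#8221; prefix and treats two tweets as duplicates when the remaining text matches after lowercasing and whitespace normalisation; both the rule and the function names are assumptions for illustration, not what the repo does.</p>

```python
import re

# Assumed retweet convention: one or more leading "RT @user:" prefixes.
RETWEET_PREFIX = re.compile(r"^(?:RT\s+@\w+:?\s*)+", re.IGNORECASE)


def normalise(tweet):
    """Strip retweet prefixes, lowercase and collapse whitespace."""
    text = RETWEET_PREFIX.sub("", tweet.strip())
    return " ".join(text.lower().split())


def thin_duplicates(tweets):
    """Keep the first occurrence of each normalised tweet, in order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalise(tweet)
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


# Hypothetical examples.
tweets = [
    "Tim Cook is the CEO of Apple",
    "RT @someone: Tim Cook is the CEO of Apple",
    "apple sauce for dinner",
]
unique = thin_duplicates(tweets)
print(unique)
```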
<p>Counting the number of capital letters in-class and out-of-class might be useful; in this set a count of &lt;5 capital letters per tweet suggests an out-of-class example:</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/06/nbr_capitals_scikit_testtrain_apple.png"><img class="aligncenter size-medium wp-image-1869" alt="nbr_capitals_scikit_testtrain_apple" src="http://ianozsvald.com/wp-content/uploads/2013/06/nbr_capitals_scikit_testtrain_apple-300x226.png" width="300" height="226" /></a><br />
This histogram of tweet lengths for in-class and out-of-class tweets might also suggest that shorter tweets are more likely to be out-of-class (though the evidence is much weaker):</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/06/histogram_tweet_lengths_scikit_testtrain_apple.png"><img class="aligncenter size-medium wp-image-1870" alt="histogram_tweet_lengths_scikit_testtrain_apple" src="http://ianozsvald.com/wp-content/uploads/2013/06/histogram_tweet_lengths_scikit_testtrain_apple-300x226.png" width="300" height="226" /></a></p>
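<p>The capital-letter observation could be turned into a naive control classifier along these lines. The cut-off of 5 comes from the plot above for this data set only; the function names and example tweets are hypothetical.</p>

```python
def count_capitals(tweet):
    """Count the upper-case characters in a tweet."""
    return sum(1 for ch in tweet if ch.isupper())


def looks_out_of_class(tweet, threshold=5):
    """Naive control: fewer than `threshold` capitals suggests out-of-class."""
    return count_capitals(tweet) < threshold


# Hypothetical examples.
brand_tweet = "Tim Cook, CEO of Apple Inc, unveils new iPhone"
fruit_tweet = "i love apple sauce"
print(count_capitals(brand_tweet), looks_out_of_class(brand_tweet))
print(count_capitals(fruit_tweet), looks_out_of_class(fruit_tweet))
```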
<p>Next I need to:</p>
<ul>
<li>Update the docs so that a contributor can play with the code; this includes exporting a list of tweet-ids and class annotations so the data can be archived and recreated</li>
<li>Spend some time looking at the most-important features (I want to properly understand the numbers so I know what is happening). I&#8217;ll probably also use a Decision Tree (and maybe Random Forests) to see what they identify (since they&#8217;re much easier to debug)</li>
<li>Improve the tokenizer so that it respects some of the structure of tweets (preserving #hashtags and @users would be a start, along with URLs)</li>
<li>Build a bigger data set that doesn&#8217;t exhibit the easily-fitted unigrams that appear in the current set</li>
</ul>
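<p>For the tokenizer item in the list above, a regular-expression sketch that keeps URLs whole and preserves the # and @ prefixes might look like this. The pattern is an assumption about what &#8220;respecting tweet structure&#8221; could mean, not the tokenizer the project ends up with; a callable like this could be passed to CountVectorizer via its <code>tokenizer</code> parameter.</p>

```python
import re

TWEET_TOKEN = re.compile(
    r"https?://\S+"  # URLs are kept whole
    r"|[#@]\w+"      # #hashtags and @users keep their prefix
    r"|\w\w+"        # ordinary words of two or more characters
)


def tokenize_tweet(tweet):
    """Split a tweet into tokens, lowercasing everything except URLs."""
    tokens = TWEET_TOKEN.findall(tweet)
    return [tok if tok.startswith("http") else tok.lower() for tok in tokens]


# Hypothetical example tweet and URL.
tokens = tokenize_tweet("Loving my #Apple iPhone @TimCook http://t.co/AbC123")
print(tokens)
```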
<p>Longer term I&#8217;ve got a set of Homeland tweets (to disambiguate the TV show vs references to the US Department and various sayings related to the term) which I&#8217;d like to play with &#8211; I figure making some progress here opens the door to analysing media commentary in tweets.</p>
<hr>
Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.
<img src="http://feeds.feedburner.com/~r/EntrepreneurialGeekiness/~4/WZ77TbxGtyE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://ianozsvald.com/2013/06/17/demonstrating-the-first-brand-disambiguator-a-hacky-crappy-classifier-that-does-something-useful/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://ianozsvald.com/2013/06/17/demonstrating-the-first-brand-disambiguator-a-hacky-crappy-classifier-that-does-something-useful/</feedburner:origLink></item>
		<item>
		<title>Active Countermeasures for Privacy in a Social Networking age?</title>
		<link>http://feedproxy.google.com/~r/EntrepreneurialGeekiness/~3/SsPLIvoZCOQ/</link>
		<comments>http://ianozsvald.com/2013/06/17/active-countermeasures-for-privacy-in-a-social-networking-age/#comments</comments>
		<pubDate>Mon, 17 Jun 2013 13:39:47 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Life]]></category>
		<category><![CDATA[Android]]></category>
		<category><![CDATA[Backdoors]]></category>
		<category><![CDATA[Camouflage]]></category>
		<category><![CDATA[Countermeasures]]></category>
		<category><![CDATA[Data Stream]]></category>
		<category><![CDATA[Disclosures]]></category>
		<category><![CDATA[Distro]]></category>
		<category><![CDATA[Doctorow]]></category>
		<category><![CDATA[Ec2]]></category>
		<category><![CDATA[Edge Nodes]]></category>
		<category><![CDATA[Encrypted Traffic]]></category>
		<category><![CDATA[Gmail]]></category>
		<category><![CDATA[Good Reason]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Government Agency]]></category>
		<category><![CDATA[Hat Tip]]></category>
		<category><![CDATA[Hoops]]></category>
		<category><![CDATA[Inferences]]></category>
		<category><![CDATA[Metadata]]></category>
		<category><![CDATA[Mobile Os]]></category>
		<category><![CDATA[Mobile Phones]]></category>
		<category><![CDATA[Monopolies]]></category>
		<category><![CDATA[Nsa]]></category>
		<category><![CDATA[Optio]]></category>
		<category><![CDATA[Paul Revere]]></category>
		<category><![CDATA[Personal Privacy]]></category>
		<category><![CDATA[Pgp]]></category>
		<category><![CDATA[Private Web]]></category>
		<category><![CDATA[Relays]]></category>
		<category><![CDATA[Six Ways]]></category>
		<category><![CDATA[Social Networking]]></category>
		<category><![CDATA[Sysadmin]]></category>
		<category><![CDATA[Web Usage]]></category>
		<category><![CDATA[Whatsmyip]]></category>

		<guid isPermaLink="false">http://ianozsvald.com/?p=1852</guid>
		<description><![CDATA[This is a bit of a rambling post covering some thoughts on data privacy, mobile phones and social networking. A general and continued decrease in personal privacy  seems inevitable in our age of data (NSA Files at The Guardian). We generate a lot of data, we rarely know how or where it is stored and [...]]]></description>
				<content:encoded><![CDATA[<p>This is a bit of a rambling post covering some thoughts on data privacy, mobile phones and social networking.</p>
<p>A general and continued decrease in personal privacy seems inevitable in our age of data (<a href="http://www.guardian.co.uk/world/the-nsa-files">NSA Files</a> at The Guardian). We generate a lot of data, we rarely know <a href="http://www.zdnet.com/six-ways-to-protect-yourself-from-the-nsa-and-other-eavesdroppers-7000016860/">how or where</a> it is stored and we don&#8217;t understand how easy it is to make certain <a href="http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/">inferences based on aggregated forms</a> of our data. Cory Doctorow has <a href="http://www.guardian.co.uk/technology/blog/2013/jun/14/nsa-prism">some points</a> on why we should care about this topic.</p>
<p>Will we now see the introduction of active countermeasures in a data stream by way of protest or camouflage by regular folk?</p>
<p><strong>Update</strong> &#8211; hat tip to Kyran for <a href="http://prism-break.org/">prism-break.org</a>, listing open-source alternatives to Operating Systems and communication clients/systems. I had a play earlier today with the Tor-powered <a href="https://play.google.com/store/apps/details?id=info.guardianproject.browser&amp;hl=en">Orweb</a> on Android &#8211; it Just Worked and whatsmyip.org didn&#8217;t know where my device was coming from (running traceroute went from whatsmyip to the Tor entry node and [of course] no further). It seems that installing <a href="http://www.instructables.com/id/Raspberry-Pi-Tor-relay/">Tor on a Raspberry Pi</a> or <a href="https://cloud.torproject.org/">Tor on EC2</a> is pretty easy too (Tor runs faster when more people start Tor relays [which carry the internal encrypted traffic, so there's none of the fear of running an edge node that sends the traffic onto the unencrypted Internet]). Here are some <a href="http://torstatus.blutmagie.de/network_detail.php">Tor network statistic graphs</a>.</p>
<p>I&#8217;ve long been unhappy with the fact that my email is known to be transmitted and stored in the clear (accepting that I turn on <a href="http://gmailblog.blogspot.co.uk/2010/01/default-https-access-for-gmail.html">HTTPS-only in Gmail</a>). I&#8217;d really like for it to be readable only for the recipient, not for anyone (sysadmin or Government agency) along the chain. Maybe someone can tell me if adding PGP into Gmail via the browser and Android phone is an easy thing to do?</p>
<p>I&#8217;m curious to see how long it&#8217;ll be before we have a cypherpunk mobile OS, preconfigured with sensible defaults. CyanogenMod is an open build of Android (so you could double-check for Government backdoors [if you took the time]); there&#8217;s no good reason why a distro couldn&#8217;t be set up that uses <a href="https://www.torproject.org/docs/android.html.en">Tor</a>, <a href="https://www.eff.org/https-everywhere">HTTPSEverywhere</a> (eff.org post <a href="https://www.eff.org/pages/tor-and-https">on this combo</a>, this Tor blog post comments on <a href="https://blog.torproject.org/blog/prism-vs-tor">Tor vs PRISM</a>) and <a href="https://plus.google.com/app/basic/stream/z12afdsybredwhed2220xb2reqilvvbw504">Incognito Mode</a> by default as a start for private web usage. Add on a secure and open source VoIP client (<a href="https://en.wikipedia.org/wiki/Skype_security#Eavesdropping_by_design">not Skype</a>) and an IM tool and you&#8217;re most of the way there for better-than-normal-folk privacy.</p>
<p>Compared to an iOS device it&#8217;ll be a bit clunky (so maybe my mum won&#8217;t use it) but I&#8217;d like the option, even if I have to jump through a few hoops. You might also choose not to trust your handset provider; we&#8217;re just starting to see designs for <a href="http://abcnews.go.com/Technology/build-cell-phone/story?id=18952735#.Ub8DrRVlGqR">build-it-yourself cellphones</a> (albeit very basic non-data phones at present).</p>
<p>Maybe we&#8217;ll start to consider the dangers of entrusting our data to near-monopolies in the hope that they do no evil (and aren&#8217;t subject to US Government secret &amp; uninvestigable disclosures to people who we personally may or may not trust, and may or may not be decent, upright, solid, incorruptible citizens). Perhaps far-sighted governments in other countries will start to educate their citizens about the dangers of trusting US Data BigCorps (&#8220;<a href="http://www.noholtzbarred.com/loose-lips-sink-ships/">Loose Lips Sink Ships</a>&#8220;)?</p>
<p>So what about active countermeasures? For the social networking example above we&#8217;d look at communications traffic (&#8216;friends&#8217; are cheap to acquire but communication takes effort). What if we started to lie about who we talk to? What if my email client builds a commonly-communicated-with list and picks someone from outside of that list, then starts to send them reasonably sensible-looking emails automatically? Perhaps it contains a pre-agreed codeword, then their client responds at a sensible frequency with more made-up but intelligible text. Suddenly they appear to be someone I closely communicate with, but that&#8217;s a lie.</p>
<p>My email client knows this so I&#8217;m not bothered by it but an eavesdropper has to process this text. It might not pass human inspection but it ought to tie up more resources, forcing more humans to get involved, driving up the cost and slowing down response times. Maybe our email clients then seed these emails with provocative keywords in innocuous phrases (&#8220;I&#8217;m going to get the bomb now! The bomb is of course the name for my football&#8221;) which tie up simple keyword scanners.</p>
<p>The above will be a little like the war on fake website signups for spam being defeated by <a href="https://en.wikipedia.org/wiki/Captcha">CAPTCHAs</a> (and in turn <a href="http://www.theregister.co.uk/2011/11/02/popular_captchas_easily_defeated/">defeating</a> <a href="http://hackaday.com/2013/01/16/script-defeats-minteye-captcha/">the</a> CAPTCHAs), perhaps driving improvements in <a href="https://en.wikipedia.org/wiki/Natural_language_processing">NLP</a> technologies. I seem to recall that <a href="https://en.wikipedia.org/wiki/Hari_Seldon">Hari Seldon</a> in Asimov&#8217;s Foundation novels used auto-generated plausible speech generators to mask private in-person communications from external eavesdropping (I can&#8217;t find a reference &#8211; am I making this up?). This stuff doesn&#8217;t feel like science fiction any more.</p>
<p>Maybe with FourSquare people will practice fake check-ins. Maybe during a protest you comfortably sit at home and take part in remote virtual check-ins to spots that&#8217;ll upset the police (&#8220;quick! join the mass check-in in the underground coffee shop! the police will have to spend resources visiting it to see if we&#8217;re actually there!&#8221;). Maybe you&#8217;ll physically be in the protest but will send spoofed GPS co-ords with your check-ins <a href="http://latimesblogs.latimes.com/technology/2010/02/confessions-of-a-foursquare-cheater.html">pretending to be elsewhere</a>.</p>
<p>Maybe people start to record and replay another person&#8217;s check-ins, a form of &#8216;identity theft&#8217; where they copy the behaviour of another to mask their own movements?</p>
<p>Maybe we can extend this idea to photo sharing. Some level of <a href="https://en.wikipedia.org/wiki/Face.com">face detection and recognition</a> already exists and it is pretty good, especially if you bound the face recognition problem to a known social group. What if we use a graphical smart-paste to blend a person-of-interest&#8217;s face into some of our group photos? Maybe <a href="https://en.wikipedia.org/wiki/Julian_Assange">Julian Assange</a> appears in background shots around London or a member of Barack Obama&#8217;s Government in photos from Iranian photobloggers?</p>
<p>The photos could be small and perhaps reasonably well disguised so they&#8217;re not obvious to humans, but obvious enough to good face detection &amp; recognition algorithms. Again this ties up resources (and computer vision algorithms are terribly CPU-expensive). It would no doubt upset the intelligence services if it impacted their automated analysis; maybe this becomes a form of citizen protest?</p>
<p><a href="https://en.wikipedia.org/wiki/Hidden_Mickeys">Hidden Mickeys</a> appear in lots of places (did you spot the one in Tron?), yet we don&#8217;t notice them. I&#8217;m pretty sure a smart paste could hide a small or distorted or rotated or blended image of a face in some photos, without too much degradation.</p>
<p>Figuring out who is doing what given the absence of information is another interesting area. With <a href="http://socialtiesapp.com/">SocialTies</a> (built by <a href="http://www.emilytoop.com/">Emily</a> and me) I could track who was at a conference via their <a href="http://lanyrd.com/">Lanyrd</a> sign-up, and also track people via nearby FourSquare check-ins and geo-tagged tweets (there are plenty of <a href="http://ianozsvald.com/2013/04/17/visualising-london-brighton-and-the-uk-using-geo-tweets/">geo-tagged tweets in London</a>&#8230;). Inferring where you were was quite possible, even if you only tweeted (and had geo-locations enabled). Double checking your social group and seeing that friends are marked as attending the event that you are near only strengthens the assertion that you&#8217;re also present.</p>
<p>Facebook typically knows the address book of your friends, so even if you haven&#8217;t joined the service it&#8217;ll <a href="http://www.bbc.co.uk/blogs/thereporters/rorycellanjones/2010/10/not_on_facebook_facebook_still.html">still have your email</a>. If 5 members of Facebook have your email address then that&#8217;s 5 directed edges in a social network graph pointing at a not-yet-active profile with your name on it. You might never join Facebook but they still have your email, name and some of your social connections. You can&#8217;t make those edges disappear. You just leaked your social connectivity without ever going near the service.</p>
<p>Anyhow, enough with the prognostications. Privacy is dead. C&#8217;est la vie. As long as we <a href="http://yro.slashdot.org/story/13/06/16/1526223/snowden-nsa-claims-partially-confirmed-says-rep-jerrold-nadler">trust the good guys</a> to only be good, <a href="https://en.wikipedia.org/wiki/Nineteen_Eighty-Four">nothing bad</a> can happen.</p>
]]></content:encoded>
			<wfw:commentRss>http://ianozsvald.com/2013/06/17/active-countermeasures-for-privacy-in-a-social-networking-age/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://ianozsvald.com/2013/06/17/active-countermeasures-for-privacy-in-a-social-networking-age/</feedburner:origLink></item>
		<item>
		<title>Open Sourcing “The Screencasting Handbook”</title>
		<link>http://feedproxy.google.com/~r/EntrepreneurialGeekiness/~3/jtprQccUW4I/</link>
		<comments>http://ianozsvald.com/2013/06/17/open-sourcing-the-screencasting-handbook/#comments</comments>
		<pubDate>Mon, 17 Jun 2013 10:02:14 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[The Screencasting Handbook]]></category>
		<category><![CDATA[3 Years]]></category>
		<category><![CDATA[Book Download]]></category>
		<category><![CDATA[Collaborative Fashion]]></category>
		<category><![CDATA[Creative Commons License]]></category>
		<category><![CDATA[Ebook]]></category>
		<category><![CDATA[Finished Version]]></category>
		<category><![CDATA[Knowledge]]></category>
		<category><![CDATA[Little Bit]]></category>
		<category><![CDATA[Open Source Version]]></category>
		<category><![CDATA[Preview Audience]]></category>
		<category><![CDATA[Price Tag]]></category>
		<category><![CDATA[Sourced]]></category>

		<guid isPermaLink="false">http://ianozsvald.com/?p=1846</guid>
		<description><![CDATA[Back in 2010 I released the finished version of my first commercial eBook The Screencasting Handbook. It was 129 pages of distilled knowledge for the budding screencaster, written in part to introduce my (then) screencasting company ProCasts to the world (which I sold years back) and based on experience teaching through ShowMeDo. Today I release [...]]]></description>
				<content:encoded><![CDATA[<p>Back in 2010 I released the finished version of my first commercial eBook <a href="http://thescreencastinghandbook.com/">The Screencasting Handbook</a>. It was 129 pages of distilled knowledge for the budding screencaster, written in part to introduce my (then) screencasting company ProCasts to the world (which I sold years back) and based on experience teaching through <a href="http://showmedo.com/">ShowMeDo</a>. Today I release the Handbook under a Creative Commons License. After 3 years the content is showing its age (the procedures are good, the software-specific information is well out of date). I moved out of screencasting a while back and have no plans to update this book.</p>
<p>The download link for the open sourced version is at <a href="http://thescreencastinghandbook.com/">thescreencastinghandbook.com</a>.</p>
<p>I&#8217;m using the Creative Commons Unported license &#8211; it allows anyone to derive a new version and/or make commercial usage without requiring any additional permissions from me, but it does require attribution. This is the most open license I can give that still gives me a little bit of value (by way of attribution). The license must not be modified.</p>
<p>If <em>someone would like to derive an updated version</em> (with or without a price tag) <em>you are very welcome to</em> &#8211; just remember to attribute back to the original site and to this site with my name please (as noted at the download point). You can <em>not</em> change the license (but if you wanted to make a derived and non-open-source version of the book for commercial use, I&#8217;m sure we can come to an arrangement).</p>
<p>Previously <a href="http://ianozsvald.com/2009/08/30/3000-words-written-for-the-screencasting-handbook/">I&#8217;ve</a> <a href="http://ianozsvald.com/2009/11/22/how-im-writing-the-screencasting-handbook/">discussed</a> how I wrote the Handbook in an open, collaborative fashion (with monthly chapter releases to the preview audience); this was a good procedure that I&#8217;d use again. Other posts discussing the Handbook are under the &#8220;<a href="http://ianozsvald.com/category/the-screencasting-handbook/">screencasting-handbook</a>&#8221; tag.</p>
]]></content:encoded>
			<wfw:commentRss>http://ianozsvald.com/2013/06/17/open-sourcing-the-screencasting-handbook/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		<feedburner:origLink>http://ianozsvald.com/2013/06/17/open-sourcing-the-screencasting-handbook/</feedburner:origLink></item>
		<item>
		<title>Social Media Brand Disambiguator first steps</title>
		<link>http://feedproxy.google.com/~r/EntrepreneurialGeekiness/~3/r2I3NayeQGk/</link>
		<comments>http://ianozsvald.com/2013/06/03/social-media-brand-disambiguator-first-steps/#comments</comments>
		<pubDate>Mon, 03 Jun 2013 19:24:14 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[ArtificialIntelligence]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SocialMediaBrandDisambiguator]]></category>
		<category><![CDATA[Annotate]]></category>
		<category><![CDATA[Apis]]></category>
		<category><![CDATA[Apple Orange]]></category>
		<category><![CDATA[Benchmark]]></category>
		<category><![CDATA[Distinction]]></category>
		<category><![CDATA[Entity Recognition]]></category>
		<category><![CDATA[Few Days]]></category>
		<category><![CDATA[First Steps]]></category>
		<category><![CDATA[Frustrations]]></category>
		<category><![CDATA[Gold Standard]]></category>
		<category><![CDATA[Honeymoon]]></category>
		<category><![CDATA[Iphon]]></category>
		<category><![CDATA[Json]]></category>
		<category><![CDATA[Media Messages]]></category>
		<category><![CDATA[Media Tools]]></category>
		<category><![CDATA[Nltk]]></category>
		<category><![CDATA[Python Module]]></category>
		<category><![CDATA[Recognition Tools]]></category>
		<category><![CDATA[Software Names]]></category>
		<category><![CDATA[Spelling Errors]]></category>
		<category><![CDATA[Sqlite]]></category>
		<category><![CDATA[Tweet]]></category>
		<category><![CDATA[Valve Seat]]></category>
		<category><![CDATA[Vine]]></category>
		<category><![CDATA[Word Apple]]></category>

		<guid isPermaLink="false">http://ianozsvald.com/?p=1838</guid>
		<description><![CDATA[As noted a few days back I&#8217;m spending June working on a social-media focused brand disambiguator using Python, NLTK and scikit-learn. This project has grown out of frustrations using existing Named Entity Recognition tools (like OpenCalais and DBPediaSpotlight) to recognise brands in social media messages. These tools are generally trained to work on long-form clean [...]]]></description>
				<content:encoded><![CDATA[<p>As noted a few days back I&#8217;m spending June working on a <a href="http://ianozsvald.com/2013/05/05/june-project-disambiguating-brands-in-social-media/">social-media focused brand disambiguator</a> using Python, NLTK and scikit-learn. This project has grown out of frustrations using existing Named Entity Recognition tools (like OpenCalais and DBPediaSpotlight) to recognise brands in social media messages. These tools are generally trained to work on long-form clean text and tweets are anything but long or cleanly written!</p>
<p>The problem is this: in a short tweet (e.g. &#8220;Loving my apple, like how it werks with the iphon&#8221;) we have little context to differentiate the sense of the word &#8220;apple&#8221;. As humans we see the typos and deliberate spelling errors and know that this use of &#8220;apple&#8221; is for the brand, not for the fruit. Existing APIs don&#8217;t make this distinction; typically they want a lot more text with fewer errors. I&#8217;m hypothesising that with a supervised learning system (using scikit-learn and NLTK) and hand-tagged data I can outperform the existing APIs.</p>
<p>I started on Saturday (freshly back from honeymoon) and a very small <a href="https://github.com/ianozsvald/social_media_brand_disambiguator">github repo</a> is online. Currently I can ingest tweets from a JSON file (captured <a href="http://mike.teczno.com/notes/streaming-data-from-twitter.html">using curl</a>) and mark each one in a SQLite db as either mentioning the brand (in-class) or using the same word as not-a-brand (out-of-class). I&#8217;ll benchmark my results against my hand-tagged Gold Standard to see how I do.</p>
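<p>As a rough illustration of the ingest-and-tag step described above, here is a minimal sketch using only the standard library. The table layout, the label constants and the function names are assumptions made for this example and are not taken from the repo. It expects one JSON tweet per line, which is what a curl capture of the streaming API typically produces.</p>

```python
import json
import sqlite3

# Hypothetical label values for the hand-tagging step.
UNKNOWN, IN_CLASS, OUT_OF_CLASS = 0, 1, 2


def create_table(conn):
    """Create a small table holding one row per tweet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "tweet_id INTEGER PRIMARY KEY, text TEXT NOT NULL, label INTEGER NOT NULL)"
    )


def ingest(conn, lines):
    """Store every tweet found in an iterable of JSON strings."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank keep-alive lines appear in streaming captures
        tweet = json.loads(line)
        if "text" not in tweet:
            continue  # e.g. delete notices carry no text
        # OR IGNORE skips tweets whose id is already stored.
        conn.execute(
            "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
            (tweet["id"], tweet["text"], UNKNOWN),
        )
    conn.commit()


def tag(conn, tweet_id, label):
    """Record a hand-assigned label for one tweet."""
    conn.execute("UPDATE tweets SET label = ? WHERE tweet_id = ?", (label, tweet_id))
    conn.commit()
```

<p>Calling <code>ingest(conn, open("tweets.json"))</code> (the filename is a placeholder) and then <code>tag(conn, some_id, IN_CLASS)</code> for each tweet that mentions the brand would build up a small hand-tagged set.</p>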
<p>Currently I&#8217;m using my <a href="https://github.com/ianozsvald/python_template_with_config">Python template</a> to allow environment-variable controlled configurations, simple logging, argparse and unittests. I&#8217;ll also be using the <a href="https://pypi.python.org/pypi/twitter-text-python/">twitter text python</a> module that I&#8217;m now supporting to parse some structure out of the tweets.</p>
<p>I&#8217;ll be presenting my progress next week at <a href="http://brightonpy.org/meetings/2013-06-11/">Brighton Python</a>. My goal is to have a useful MIT-licensed tool that is pre-trained with some obvious brands (e.g. Apple, Orange, Valve, Seat) and software names (e.g. Python, vine, Elite) by the end of this month, with instructions so anyone can train their own models. Assuming all goes well I can then plumb it into my planned <a href="http://annotate.io/">annotate.io</a> online service later.</p>
<hr>
<img src="http://feeds.feedburner.com/~r/EntrepreneurialGeekiness/~4/r2I3NayeQGk" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://ianozsvald.com/2013/06/03/social-media-brand-disambiguator-first-steps/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		<feedburner:origLink>http://ianozsvald.com/2013/06/03/social-media-brand-disambiguator-first-steps/</feedburner:origLink></item>
		<item>
		<title>Thoughts from a month’s backpacking honeymoon</title>
		<link>http://feedproxy.google.com/~r/EntrepreneurialGeekiness/~3/jAAEkiJ5skM/</link>
		<comments>http://ianozsvald.com/2013/05/28/thoughts-from-a-months-backpacking-honeymoon/#comments</comments>
		<pubDate>Tue, 28 May 2013 10:41:40 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Life]]></category>
		<category><![CDATA[Travel]]></category>
		<category><![CDATA[Androids]]></category>
		<category><![CDATA[Background Info]]></category>
		<category><![CDATA[Backpacking]]></category>
		<category><![CDATA[Cache Images]]></category>
		<category><![CDATA[Caches]]></category>
		<category><![CDATA[Dearth]]></category>
		<category><![CDATA[European Languages]]></category>
		<category><![CDATA[Gap]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Google Maps]]></category>
		<category><![CDATA[Honeymoon]]></category>
		<category><![CDATA[Hungary]]></category>
		<category><![CDATA[Independent Reading]]></category>
		<category><![CDATA[Iphone]]></category>
		<category><![CDATA[Language Dictionary]]></category>
		<category><![CDATA[Languages]]></category>
		<category><![CDATA[Lastminute]]></category>
		<category><![CDATA[Offline Mode]]></category>
		<category><![CDATA[Phrase Book]]></category>
		<category><![CDATA[Pictogram]]></category>
		<category><![CDATA[Poor Man]]></category>
		<category><![CDATA[Pronunciation]]></category>
		<category><![CDATA[Pronunciation Guides]]></category>
		<category><![CDATA[Reading System]]></category>
		<category><![CDATA[Roughguide]]></category>
		<category><![CDATA[Train Stations]]></category>
		<category><![CDATA[Travel Data]]></category>
		<category><![CDATA[Tripadvisor]]></category>
		<category><![CDATA[Wifi]]></category>
		<category><![CDATA[Wikipedia]]></category>

		<guid isPermaLink="false">http://ianozsvald.com/?p=1824</guid>
		<description><![CDATA[I&#8217;m publishing this on the hoof, right now we&#8217;re in Istanbul near the end of our honeymoon back home. Here are some app-travelling notes (for our Nexus 4 Androids). Google Translate offers Offline dictionaries for all the European languages, each is 150mb. We downloaded new ones before each country hop. Generally they were very useful, [...]]]></description>
				<content:encoded><![CDATA[<p dir="ltr">I&#8217;m publishing this on the hoof, right now we&#8217;re <del>in Istanbul near the end of our honeymoon</del> back home. Here are some app-travelling notes (for our Nexus 4 Androids).</p>
<p dir="ltr">Google Translate offers offline dictionaries for all the European languages, each is 150MB. We downloaded new ones before each country hop. Generally they were very useful, though some phrases were wrong or not colloquial (often for things like &#8220;the bill please&#8221;). Some languages had pronunciation guides; they were ok but a phrase book would be better. It worked well as a glorified language dictionary.</p>
<p dir="ltr">Google Maps offline maps were great, except in Hungary where offline use wasn&#8217;t allowed (it didn&#8217;t explain why).</p>
<p dir="ltr">The lack of phrase or dictionary apps was a pain, there&#8217;s a real dearth on Android. Someone should fill this gap!</p>
<p dir="ltr">WiFi was fairly common throughout our travels so we rarely used our paper Guides. WiFi was free in all hotels, sometimes in train stations, often in cafes and bars even in Romania.</p>
<p dir="ltr"><a href="http://www.wikisherpa.com/">WikiSherpa</a> caches recent search results which are pulled out of Wikipedia and Wikivoyage, this works like a poor man&#8217;s RoughGuide. It doesn&#8217;t link to any maps or cache images but if you search on a city, you can read up on it (e.g. landmarks, how to get a taxi etc) whilst you travel.</p>
<p dir="ltr">The official Wikipedia app has page saving, which is useful for background info on a city when reading offline.</p>
<p dir="ltr"><a href="http://anymemo.org/">AnyMemo</a> is useful for learning phrases in new languages. It is chaotic as the learning files aren&#8217;t curated. You can edit the files to remove the phrases you don&#8217;t need and to add useful new ones in.</p>
<p dir="ltr">Emily notes that TripAdvisor on Android doesn&#8217;t work well (the iPhone version was better but still not great). Emily also notes that hotels.com, lastminute and booking.com were all useful for booking most of our travels and hotels.</p>
<p dir="ltr">We used Foursquare when we had WiFi; sadly there is no offline mode so I just starred locations using Google Maps. Foursquare needs a language-independent reading system: trying to figure out if a series of Turkish reviews were positive or not based on the prevalence of smileys wasn&#8217;t easy (Google Translate integration would have helped). An offline Foursquare would have been useful (e.g. for cafes near to our spot).</p>
<p dir="ltr">We really should have bought a WiFi 3G dongle. The lack of data was a pain. We used Emily&#8217;s £5 travel data day plans on occasion (via Three). It works for most of Europe but not Switzerland or Turkey.</p>
<p dir="ltr">Given that we have Wikipedia and Wiktionary, how come we don&#8217;t have a &#8220;WikiPhrases&#8221; (&#8220;wikilingo&#8221;?) with multi-language forms of common phrases? It would be just like the phrase books for travel that we can buy, but with good local phrases and idioms across any language that gets written up. This feels like it&#8217;d have a lot of value.</p>
<hr>
<img src="http://feeds.feedburner.com/~r/EntrepreneurialGeekiness/~4/jAAEkiJ5skM" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://ianozsvald.com/2013/05/28/thoughts-from-a-months-backpacking-honeymoon/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		<feedburner:origLink>http://ianozsvald.com/2013/05/28/thoughts-from-a-months-backpacking-honeymoon/</feedburner:origLink></item>
		<item>
		<title>June project: Disambiguating “brands” in Social Media</title>
		<link>http://feedproxy.google.com/~r/EntrepreneurialGeekiness/~3/WdkTBRiOXaY/</link>
		<comments>http://ianozsvald.com/2013/05/05/june-project-disambiguating-brands-in-social-media/#comments</comments>
		<pubDate>Sun, 05 May 2013 13:32:00 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[ArtificialIntelligence]]></category>
		<category><![CDATA[Life]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SocialMediaBrandDisambiguator]]></category>
		<category><![CDATA[Abbreviation]]></category>
		<category><![CDATA[Aim]]></category>
		<category><![CDATA[Apis]]></category>
		<category><![CDATA[Apple Brand]]></category>
		<category><![CDATA[Ba]]></category>
		<category><![CDATA[Brands Products]]></category>
		<category><![CDATA[British Airways]]></category>
		<category><![CDATA[Classifier]]></category>
		<category><![CDATA[Client Projects]]></category>
		<category><![CDATA[Contractions]]></category>
		<category><![CDATA[Entity Recognition]]></category>
		<category><![CDATA[Fruit Drink]]></category>
		<category><![CDATA[Honeymoon]]></category>
		<category><![CDATA[Love Apple]]></category>
		<category><![CDATA[Nltk]]></category>
		<category><![CDATA[Reuters Articles]]></category>
		<category><![CDATA[Social Group]]></category>
		<category><![CDATA[Sourced]]></category>
		<category><![CDATA[Tweet]]></category>
		<category><![CDATA[Tweets]]></category>

		<guid isPermaLink="false">http://ianozsvald.com/?p=1822</guid>
		<description><![CDATA[Having returned from Chile last year, settled in to consulting in London, got married and now on honeymoon I&#8217;m planning on a change for June. I&#8217;m taking the month off from clients to work on my own project, an open sourced brand disambiguator for social media. As an example this will detect that the following [...]]]></description>
				<content:encoded><![CDATA[<p>Having returned from Chile last year, settled in to consulting in London, got married and now on honeymoon I&#8217;m planning on a change for June.</p>
<p>I&#8217;m taking the month off from clients to work on my own project, an open sourced brand disambiguator for social media. As an example this will detect that the following tweet mentions Apple-the-brand:<br />
&#8220;I love my apple, though leopard can be a pain&#8221;<br />
and that this tweet does not:<br />
&#8220;Really enjoying this apple, very tasty&#8221;</p>
<p>I&#8217;ve used AlchemyAPI, OpenCalais, DBPedia Spotlight and others for client projects and it turns out that these APIs expect long-form text (e.g. Reuters articles) written with good English.</p>
<p>Tweets are short-form, messy, use colloquialisms, can be compressed (e.g. using contractions) and rely on local context (both local in time and social group). Linguistically a lot is expressed in 140 characters and it doesn&#8217;t look like &#8220;good English&#8221;.</p>
<p>A second problem with existing APIs is that they cannot be trained and often don&#8217;t know about European brands, products, people and places. I plan to build a classifier that learns whatever you need to classify.</p>
<p>Examples for disambiguation will include <em>Apple</em> vs apple (brand vs e.g. fruit/drink/pie), <em>Seat</em> vs seat (brand vs furniture), cold vs cold (illness vs temperature), ba (when used as an abbreviation for British Airways).</p>
<p>The goal of the June project will be to out-perform existing Named Entity Recognition APIs for well-specified brands on Tweets, developed openly with a liberal licence. The aim will be to solve new client problems that can&#8217;t be solved with existing APIs.</p>
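<p>One way to make &#8220;out-perform&#8221; concrete is to score each tool against hand-tagged labels using precision and recall. The choice of measure is an assumption here, as the post does not name one, and the function below is a hypothetical sketch rather than project code.</p>

```python
def precision_recall(gold, predicted):
    """Compare predicted labels with hand-tagged gold labels.

    Both arguments are equal-length sequences of booleans,
    where True means "this tweet mentions the brand".
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_pos = sum(1 for g, p in pairs if p and not g)
    false_neg = sum(1 for g, p in pairs if g and not p)
    # Precision: of the tweets flagged as the brand, how many really were.
    flagged = true_pos + false_pos
    precision = true_pos / float(flagged) if flagged else 0.0
    # Recall: of the real brand tweets, how many were flagged.
    actual = true_pos + false_neg
    recall = true_pos / float(actual) if actual else 0.0
    return precision, recall
```

<p>Running the same hand-tagged tweets through an existing API and through the new classifier, then comparing the two pairs of numbers, would give a like-for-like benchmark.</p>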
<p>I&#8217;ll be using Python, NLTK, scikit-learn and Tweet data. I&#8217;m speaking on progress at BrightonPy and DataScienceLondon in June.</p>
<p>Probably for now I should focus on having no computer on my honeymoon&#8230;</p>
<hr>
<img src="http://feeds.feedburner.com/~r/EntrepreneurialGeekiness/~4/WdkTBRiOXaY" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://ianozsvald.com/2013/05/05/june-project-disambiguating-brands-in-social-media/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		<feedburner:origLink>http://ianozsvald.com/2013/05/05/june-project-disambiguating-brands-in-social-media/</feedburner:origLink></item>
		<item>
		<title>Visualising London, Brighton and the UK using Geo-Tweets</title>
		<link>http://feedproxy.google.com/~r/EntrepreneurialGeekiness/~3/SXeeVWZXDnE/</link>
		<comments>http://ianozsvald.com/2013/04/17/visualising-london-brighton-and-the-uk-using-geo-tweets/#comments</comments>
		<pubDate>Wed, 17 Apr 2013 10:38:31 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[ArtificialIntelligence]]></category>
		<category><![CDATA[Data science]]></category>
		<category><![CDATA[Entrepreneur]]></category>
		<category><![CDATA[Life]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Brighton Marina]]></category>
		<category><![CDATA[Brighton Pier]]></category>
		<category><![CDATA[Brighton University]]></category>
		<category><![CDATA[Canary Wharf]]></category>
		<category><![CDATA[Chitchat]]></category>
		<category><![CDATA[Coastline]]></category>
		<category><![CDATA[Conversation Analysis]]></category>
		<category><![CDATA[Dataset]]></category>
		<category><![CDATA[Firehose]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Heading North]]></category>
		<category><![CDATA[Heatmap]]></category>
		<category><![CDATA[Hyde Park]]></category>
		<category><![CDATA[Ins And Outs]]></category>
		<category><![CDATA[Journalism]]></category>
		<category><![CDATA[Languages]]></category>
		<category><![CDATA[London Bridge]]></category>
		<category><![CDATA[London Brighton]]></category>
		<category><![CDATA[London Parks]]></category>
		<category><![CDATA[London River]]></category>
		<category><![CDATA[Ly]]></category>
		<category><![CDATA[M25 Motorway]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Nbsp]]></category>
		<category><![CDATA[Networkx]]></category>
		<category><![CDATA[News Sites]]></category>
		<category><![CDATA[O2]]></category>
		<category><![CDATA[Oxford Street]]></category>
		<category><![CDATA[Oyster Card]]></category>
		<category><![CDATA[Pleasure Trips]]></category>
		<category><![CDATA[Pycon]]></category>
		<category><![CDATA[Railway Stations]]></category>
		<category><![CDATA[Rivington Street]]></category>
		<category><![CDATA[Shopping Centre]]></category>
		<category><![CDATA[Stamen]]></category>
		<category><![CDATA[Stratford]]></category>
		<category><![CDATA[Tottenham Court Road]]></category>
		<category><![CDATA[Train Stations]]></category>
		<category><![CDATA[Tweets]]></category>
		<category><![CDATA[Uk News]]></category>
		<category><![CDATA[Visualising]]></category>
		<category><![CDATA[West Edge]]></category>
		<category><![CDATA[Workspace]]></category>

		<guid isPermaLink="false">http://ianozsvald.com/?p=1776</guid>
		<description><![CDATA[Recently I&#8217;ve been grabbing Tweets for some natural language processing analysis (in Python using NetworkX and NLTK) &#8211; see this PyCon and PyData conversation analysis. Using the London dataset (visualised in the PyData post) I wondered if the geo-tagged tweets would give a good-looking map of London. It turns out that it does: You can [...]]]></description>
				<content:encoded><![CDATA[<p>Recently I&#8217;ve been grabbing Tweets for some natural language processing analysis (in Python using NetworkX and NLTK) &#8211; see this <a href="http://ianozsvald.com/2013/03/18/semantic-map-of-pycon2013-twitter-topics/">PyCon</a> and <a href="http://ianozsvald.com/2013/03/22/analysing-pydata-london-and-brighton-tweets-for-concept-mapping/">PyData</a> conversation analysis. Using the London dataset (visualised in the PyData post) I wondered if the geo-tagged tweets would give a good-looking map of London. It turns out that it does:</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/04/london_all_r1_nomap.png"><img class="aligncenter size-medium wp-image-1791" alt="london_all_r1_nomap" src="http://ianozsvald.com/wp-content/uploads/2013/04/london_all_r1_nomap-300x225.png" width="300" height="225" /></a></p>
<p>You can see the bright centre of London, the Thames is visible wiggling left-to-right through the centre. The black region to the left of the centre is <a href="https://en.wikipedia.org/wiki/Hyde_Park,_London">Hyde Park</a>. If you look around the edges you can even see the M25 motorway circling the city. This is about a week&#8217;s worth of geo-filtered Tweets from the Twitter 10% firehose. It is easier to locate using the following Stamen tiles:</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/04/london_all_r5.png"><img class="aligncenter size-medium wp-image-1792" alt="london_all_r5" src="http://ianozsvald.com/wp-content/uploads/2013/04/london_all_r5-300x265.png" width="300" height="265" /></a></p>
<p>Can you see Canary Wharf and the O2 arena to its east? How about Heathrow to the west edge of the map? And the string of reservoirs heading north north east from Tottenham?</p>
<p>Here&#8217;s a zoom around Victoria and London Bridge, we see a lot of Tweets around the railway stations, Oxford Street and Soho. I&#8217;m curious about all the dots in the Thames &#8211; presumably people Tweeting about their pleasure trips?</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/04/centrallondon_r3_map.png"><img class="aligncenter size-medium wp-image-1793" alt="centrallondon_r3_map" src="http://ianozsvald.com/wp-content/uploads/2013/04/centrallondon_r3_map-300x231.png" width="300" height="231" /></a></p>
<p>Here&#8217;s a zoom around the Shoreditch/Tech City area. I was surprised by the cluster of Tweets in the roundabout (Old Street tube station), there&#8217;s a cluster in Bonhill Street (where <a href="http://www.campuslondon.com/">Google&#8217;s Campus</a> is located &#8211; I work above there in <a href="http://www.centralworking.com/">Central Working</a>). The cluster off of Old Street onto Rivington Street seems to be at the location of the new and fashionable outdoor eatery spot (with <a href="http://www.burgeraddict.org/2013/01/greedy-bear-burger-bear-london.html">Burger Bear</a>). Further to the east is a more pubby/restauranty area.</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/04/london_shoreditch_all.png"><img class="aligncenter size-medium wp-image-1794" alt="london_shoreditch_all" src="http://ianozsvald.com/wp-content/uploads/2013/04/london_shoreditch_all-266x300.png" width="266" height="300" /></a></p>
<p>I&#8217;ve yet to analyse the content of these tweets (doing something like phrase extraction from the PyCon/PyData tweets onto this map would be great). As such I&#8217;m not sure what&#8217;s being discussed, probably a bunch of the banal along with chitchat between people (&#8220;I&#8217;m on my way&#8221;&#8230;). Hopefully some of it discusses the nearby environment.</p>
<p>I&#8217;m using <a href="http://www.sethoscope.net/heatmap/">Seth&#8217;s Python heatmap</a> (inspired by his lovely visuals). In addition I&#8217;m using <a href="http://maps.stamen.com/#terrain/12/37.7706/-122.3782">Stamen</a> map tiles (via OpenStreetMap). I&#8217;m using curl to consume the Twitter firehose via a geo-defined area for London, saving the results to a JSON file which I consume later (shout if you&#8217;d like the code and I&#8217;ll put it in github) &#8211; here&#8217;s a <a href="http://mike.teczno.com/notes/streaming-data-from-twitter.html">tutorial</a>.</p>
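<p>Since the capture step is only described in prose, here is a minimal sketch of how the saved JSON might be turned into points for plotting. It assumes one tweet per line and the standard tweet <code>coordinates</code> field (a GeoJSON point with longitude first). The bounding box values and the function name are rough assumptions for illustration, not the ones used for the maps in this post.</p>

```python
import json

# Rough bounding box for Greater London (an assumption for illustration):
# (min_longitude, min_latitude, max_longitude, max_latitude)
LONDON_BBOX = (-0.56, 51.26, 0.28, 51.69)


def geo_points(lines, bbox=LONDON_BBOX):
    """Yield (latitude, longitude) for each geo-tagged tweet inside bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        tweet = json.loads(line)
        point = tweet.get("coordinates")
        if not point:
            continue  # most tweets carry no exact location
        lon, lat = point["coordinates"]  # GeoJSON order: longitude first
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            yield lat, lon
```

<p>The resulting latitude/longitude pairs can then be written out in whatever form the plotting tool expects.</p>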
<p>During London Fashion Week I grabbed the tagged tweets (for &#8220;#lfw&#8221; and those mentioning &#8220;london fashion week&#8221; in the London area). If you zoom on the <a href="http://www.londonfashionweek.co.uk/Map_EventsList.aspx">official event map</a> you&#8217;ll see that the primary Tweet locations correspond to the official venue sites.</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/04/lfw.png"><img class="aligncenter size-medium wp-image-1795" alt="lfw" src="http://ianozsvald.com/wp-content/uploads/2013/04/lfw-300x275.png" width="300" height="275" /></a></p>
<p>What about <a href="https://en.wikipedia.org/wiki/Brighton">Brighton</a>? Down on the south coast (about 1 hour on the train south of London), it is where I&#8217;ve spent the last 10 years (before my <a href="http://ianozsvald.com/2012/11/25/startupchile-round-2-1-all-finished-thoughts/">recent move</a> to London). You can see the coastline, also Sussex University&#8217;s campus (north east corner). Western Road (the thick line running west a little way back from the sea) is the main shopping street with plenty of bars.</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/04/brighton_gps_to0103_nomap.png"><img class="aligncenter size-medium wp-image-1796" alt="brighton_gps_to0103_nomap" src="http://ianozsvald.com/wp-content/uploads/2013/04/brighton_gps_to0103_nomap-300x225.png" width="300" height="225" /></a></p>
<p>It&#8217;ll make more sense with the Stamen tiles, Brighton Marina (south east corner) is clear along with the small streets in the centre of Brighton:</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/04/brighton_gps_to0403_map.png"><img class="aligncenter size-medium wp-image-1797" alt="brighton_gps_to0403_map" src="http://ianozsvald.com/wp-content/uploads/2013/04/brighton_gps_to0403_map-300x253.png" width="300" height="253" /></a></p>
<p>Zooming to the centre is very nice, the <a href="https://en.wikipedia.org/wiki/North_Laine">North Laines</a> are obvious (to the north) and the pedestrianised area below (the &#8220;south laines&#8221;) is clear too. Further south we see the <a href="https://en.wikipedia.org/wiki/Brighton_Pier">Brighton Pier</a> reaching into the sea. To the north west on the edge of the map is another cluster inside <a href="https://en.wikipedia.org/wiki/Brighton_station">Brighton Station</a>:</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/04/brighton_gps_to0403_map_zoomed.png"><img class="aligncenter size-medium wp-image-1798" alt="brighton_gps_to0403_map_zoomed" src="http://ianozsvald.com/wp-content/uploads/2013/04/brighton_gps_to0403_map_zoomed-300x288.png" width="300" height="288" /></a></p>
<p>Finally &#8211; what about all the geo-tagged Tweets for the UK (annoyingly I didn&#8217;t go far enough west to get Ireland)? I&#8217;m pleased to see that the entirety of the mainland is well defined. I&#8217;m guessing many of the tweets around the coastline come from pretty spots that people visit.</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/04/uk_gps_to0404_map_r5_zoomed.png"><img class="aligncenter size-medium wp-image-1800" alt="uk_gps_to0404_map_r5_zoomed" src="http://ianozsvald.com/wp-content/uploads/2013/04/uk_gps_to0404_map_r5_zoomed-184x300.png" width="184" height="300" /></a></p>
<p>How might this compare with a satellite photograph of the UK at night? Population centres are clearly visible but tourist spots are far less visible, the edge of the country is much less defined (via <a href="http://www.dailymail.co.uk/sciencetech/article-2243891/Sleepless-Britain-Nasas-stunning-images-UK-night.html">dailymail</a>):</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/04/uk_nightlights_via_dailymail_article-0-165F7719000005DC-828_964x592.jpg"><img class="aligncenter size-medium wp-image-1801" alt="Europe satellite" src="http://ianozsvald.com/wp-content/uploads/2013/04/uk_nightlights_via_dailymail_article-0-165F7719000005DC-828_964x592-300x184.jpg" width="300" height="184" /></a></p>
<p>I&#8217;m guessing we can use these Tweets for:</p>
<ul>
<li>Understanding what people talk about in certain areas (e.g. Oxford Street at rush-hour?)</li>
<li>Learning why Foursquare check-ins (below) aren&#8217;t in the same place as tweet locations (can we filter locations away by using Foursquare data?)</li>
<li>Seeing how people discuss the weather &#8211; is it correlated with local weather reports?</li>
<li>Learning if people talk about their environment (e.g. too many cars, poor London tube climate control, bad air, too noisy, shops and signs, events)</li>
<li>Seeing how shops, gigs and events are discussed &#8211; could we recommend places and events in real time based on their discussion?</li>
<li>Figuring out how people discuss landmarks and tourist spots &#8211; maybe this helps with recommending good spots to visit?</li>
<li>Looking at the trail people leave as they Tweet over time &#8211; can we figure out their commute and what they talk about before and after? Maybe this is a sort of survey process that happens using public data?</li>
</ul>
<p>Here are some other geo-based visualisations I&#8217;ve recently seen:</p>
<ul>
<li>Nice video of Oyster London Underground checkins from 2012 (<a href="http://oliverobrien.co.uk/2013/03/londons-tidal-oyster-card-flow/">write-up</a>)</li>
<li>Foursquare&#8217;s 500,000,000 <a href="https://foursquare.com/infographics/500million?">check-in visualisation</a> (<a href="http://blog.foursquare.com/2013/01/17/what-the-last-500000000-check-ins-look-like-and-what-they-show-about-the-future-of-foursquare/">Jan blog post</a>) for the world, zoom on London to see how the map is <em>different</em> to the tweet data I have above</li>
<li>Another FourSquare <a href="http://thebackofyourhand.com/">check-in visualisation</a> just for London filtered by location-type</li>
<li>Language-tagged <a href="http://ny.spatial.ly/">geo-tweets for New York</a></li>
<li>Language-tagged <a href="http://spatialanalysis.co.uk/2012/10/londons-twitter-languages/">geo-tweets for London</a></li>
<li>Language-tagged <a href="https://secure.flickr.com/photos/walkingsf/6276642489/">geo-tweets for Europe</a> (uses the Chromium <a href="https://code.google.com/p/chromium-compact-language-detector/">compact language detector</a>)</li>
</ul>
<p>If you want help with this sort of work then note that I run my own <a href="http://morconsulting.com/">AI consultancy</a>, analysing and visualising social media like Twitter is an active topic for me at present (and will be more so via my planned API at <a href="http://annotate.io/">annotate.io</a>).</p>
<hr>
<img src="http://feeds.feedburner.com/~r/EntrepreneurialGeekiness/~4/SXeeVWZXDnE" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://ianozsvald.com/2013/04/17/visualising-london-brighton-and-the-uk-using-geo-tweets/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		<feedburner:origLink>http://ianozsvald.com/2013/04/17/visualising-london-brighton-and-the-uk-using-geo-tweets/</feedburner:origLink></item>
		<item>
		<title>More Python 3.3 downloads than Python 2.7 for past 3 months</title>
		<link>http://feedproxy.google.com/~r/EntrepreneurialGeekiness/~3/dwdzdSIPtBQ/</link>
		<comments>http://ianozsvald.com/2013/04/15/more-python-3-3-downloads-than-python-2-7-for-past-3-months/#comments</comments>
		<pubDate>Mon, 15 Apr 2013 13:03:32 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Life]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Armin]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Conversations]]></category>
		<category><![CDATA[Cusp]]></category>
		<category><![CDATA[Datatypes]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Download Linux]]></category>
		<category><![CDATA[Download Mac]]></category>
		<category><![CDATA[Downloads]]></category>
		<category><![CDATA[Fabric]]></category>
		<category><![CDATA[Flask]]></category>
		<category><![CDATA[Gist]]></category>
		<category><![CDATA[Libraries]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Linux Mac]]></category>
		<category><![CDATA[Mac Linux]]></category>
		<category><![CDATA[Natural Language]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Nltk]]></category>
		<category><![CDATA[Pil]]></category>
		<category><![CDATA[Popularity]]></category>
		<category><![CDATA[Preinstalled Linux]]></category>
		<category><![CDATA[Pycon]]></category>
		<category><![CDATA[Python Project]]></category>
		<category><![CDATA[Science Work]]></category>
		<category><![CDATA[Scipy]]></category>
		<category><![CDATA[Sprint]]></category>
		<category><![CDATA[Sqlalchemy]]></category>
		<category><![CDATA[Top Dog]]></category>
		<category><![CDATA[Top Mac]]></category>
		<category><![CDATA[Ubuntu]]></category>
		<category><![CDATA[Web Dev]]></category>

		<guid isPermaLink="false">http://ianozsvald.com/?p=1779</guid>
		<description><![CDATA[Since PyCon 2013 I&#8217;ve been in a set of conversations that start with &#8220;should I be using Python 3.3 for science work?&#8221;. Here&#8217;s a recent reddit thread on the subject. Last year I solidly recommended using Python 2.7 for scientific work (as many key libraries weren&#8217;t yet supported). I&#8217;m on the cusp of changing my [...]]]></description>
				<content:encoded><![CDATA[<p>Since PyCon 2013 I&#8217;ve been in a set of conversations that start with &#8220;should I be using Python 3.3 for science work?&#8221;. Here&#8217;s a recent <a href="http://www.reddit.com/r/Python/comments/19eu32/im_a_student_brushing_up_on_my_python_should_i/">reddit thread</a> on the subject. Last year I solidly recommended using Python 2.7 for scientific work (as many key libraries weren&#8217;t yet supported). I&#8217;m on the cusp of changing my recommendation.</p>
<p><strong>Update</strong> there&#8217;s a nice thread on <a href="http://www.reddit.com/r/Python/comments/1cdxi6/more_python_33_downloads_than_python_27_each/">Reddit/r/python</a> discussing what&#8217;s required and where the numbers are coming from.</p>
<p>I last looked at the rate of Python downloads via ShowMeDo <a href="http://blog.showmedo.com/news/growth-in-python-project-popularity/">during 2008</a> when Python 2.5 was the top dog. The Windows 2.5.1 installer was getting 500,000 downloads a month. In the last 3 months I&#8217;m pleasantly surprised to see that Python 3.3 for Windows is downloaded more each month than Python 2.7. We can see:</p>
<ul>
<li><a href="http://www.python.org/webstats/usage_201303.html">March 2013</a> Python 3.3 for Windows has 647k downloads vs Python 2.7 with 630k</li>
<li><a href="http://www.python.org/webstats/usage_201302.html">February 2013</a> Python 3.3 for Windows has 553k downloads vs Python 2.7 with 498k</li>
<li><a href="http://www.python.org/webstats/usage_201301.html">January 2013</a> Python 3.3 for Windows has 533k downloads vs Python 2.7 with 495k (Python 2.7 has been the less popular download since January 2013)</li>
<li><a href="http://www.python.org/webstats/usage_201212.html">December 2012</a> Python 3.3 for Windows has 412k downloads vs Python 2.7 with 525k</li>
</ul>
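<p>To put the gap in proportion, here is a small sketch that turns the figures in the list above into a percentage lead for Python 3.3 over Python 2.7. The numbers are copied from the list (in thousands); the helper name <code>py33_lead_percent</code> is made up for this illustration.</p>

```python
# Monthly Windows installer downloads in thousands, copied from the list above.
downloads = [
    ("December 2012", 412, 525),
    ("January 2013", 533, 495),
    ("February 2013", 553, 498),
    ("March 2013", 647, 630),
]


def py33_lead_percent(py33, py27):
    """Return how far Python 3.3 is ahead of Python 2.7, as a percentage of 2.7."""
    return 100.0 * (py33 - py27) / py27


# A negative value means Python 2.7 was still ahead that month.
leads = {month: py33_lead_percent(py33, py27) for month, py33, py27 in downloads}

for month, py33, py27 in downloads:
    print("{}: {:+.1f}%".format(month, leads[month]))
```

<p>The sign flips between December and January, which is the switch described above.</p>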
<p>These figures only tell a part of the story of course. For Windows you have to download Python. On Linux and Mac it comes pre-installed (so we can&#8217;t measure those numbers).</p>
<p>Python 2.7 has been the default on Ubuntu for a while; that&#8217;s <a href="https://wiki.ubuntu.com/RaringRingtail/TechnicalOverview#Python_3.3">changing with Ubuntu 13.04</a>. There are <a href="https://py3ksupport.appspot.com/">two</a> <a href="https://python3wos.appspot.com/">lists</a> of Python-3 compatible packages. It seems that Django is listed, and at PyCon there was a <a href="http://www.pyvideo.org/video/1787/porting-django-apps-to-python-3">how-to-port-to-py3 video</a>. <a href="https://gist.github.com/untitaker/5321447">Flask isn&#8217;t there yet</a> (<strong>update:</strong> Armin is <a href="https://twitter.com/mitsuhiko/status/323177177367601152">tweeting for sprint help</a> for Py3 support), SQLAlchemy is ported (but MySQL-python is not) and Fabric isn&#8217;t ready yet. For web-dev it seems to be a mixed bag but I&#8217;m guessing Python 3 support will be across the board this year.</p>
<p>For scientific use we already have Python-3 compatible numpy, scipy and matplotlib. scikit-learn is &#8216;<a href="https://github.com/scikit-learn/scikit-learn/pull/1361">nearly</a>&#8217; ported and Pillow (the recent fork of PIL) is ready for Python 3. NLTK is also <a href="http://nltk.org/nltk3-alpha/">being ported</a>.</p>
<p>For scientific use around natural language processing the switch to unicode-by-default looks most attractive (the mix of strings and unicode datatypes has burnt hours for me over the years in Python 2.x). Here&#8217;s a PyCon video on the use of <a href="http://www.pyvideo.org/video/1704/why-you-should-use-python-3-for-text-processing">Python 3 for text processing</a> and another reviewing <a href="http://www.pyvideo.org/video/1730/python-33-trust-me-its-better-than-27">why Python 3.3 is superior to Python 2.7</a>.</p>
<p>It is still slightly too early for me to want to switch but I&#8217;m starting to experiment. I&#8217;ve added some __future__ imports to new code so I know I&#8217;m writing Python 2.7 in a 3-like style. I&#8217;m also increasingly using Ned Batchelder&#8217;s <a href="http://nedbatchelder.com/code/coverage/">coverage.py</a> via nosetests to make sure I have good coverage. I currently run 2to3 to check that things convert cleanly to Python 3 but rarely run the result with Python 3 (I haven&#8217;t needed to do this yet). There&#8217;s a set of useful advice on <a href="http://python3porting.com/">python3porting</a>, including various __future__ imports (division, print_function, unicode_literals, absolute_import).</p>
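<p>As a minimal sketch of what those four __future__ imports change, the module below should behave the same under Python 2.7 and Python 3.3. The variable names and sample values are purely illustrative.</p>

```python
# The __future__ import has to be the first statement in the module.
from __future__ import absolute_import, division, print_function, unicode_literals

half = 1 / 2         # division: true division gives 0.5 (plain Python 2 gives 0)
floored = 7 // 2     # floor division is now spelled explicitly and gives 3
label = "caf\u00e9"  # unicode_literals: a unicode string of 4 characters on 2.7 too

# print_function: print is called as a function, as in Python 3.
print("half:", half, "floored:", floored)
print("characters in label:", len(label))
```

<p>absolute_import has no visible effect in a single file; it stops a package-local module from shadowing a top-level module of the same name when importing.</p>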
<hr>
Ian applies Data Science as an AI/Data Scientist for companies in Mor Consulting, founded the image and text annotation API Annotate.io, co-authored SocialTies, programs Python, authored The Screencasting Handbook, lives in London and is a consumer of fine coffees.
]]></content:encoded>
			<wfw:commentRss>http://ianozsvald.com/2013/04/15/more-python-3-3-downloads-than-python-2-7-for-past-3-months/feed/</wfw:commentRss>
		<slash:comments>28</slash:comments>
		<feedburner:origLink>http://ianozsvald.com/2013/04/15/more-python-3-3-downloads-than-python-2-7-for-past-3-months/</feedburner:origLink></item>
		<item>
		<title>Applied Parallel Computing (PyCon 2013 Tutorial) slides and code</title>
		<link>http://feedproxy.google.com/~r/EntrepreneurialGeekiness/~3/XJ8gRFQFsiI/</link>
		<comments>http://ianozsvald.com/2013/04/02/applied-parallel-computing-pycon-2013-tutorial-slides-and-code/#comments</comments>
		<pubDate>Tue, 02 Apr 2013 07:32:33 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[ArtificialIntelligence]]></category>
		<category><![CDATA[Life]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[5k Run]]></category>
		<category><![CDATA[5km Runs]]></category>
		<category><![CDATA[Amin]]></category>
		<category><![CDATA[Applied Parallel Computing]]></category>
		<category><![CDATA[Birds Of A Feather]]></category>
		<category><![CDATA[Bof]]></category>
		<category><![CDATA[Cancer Research]]></category>
		<category><![CDATA[Concept Map]]></category>
		<category><![CDATA[Consulting Ltd]]></category>
		<category><![CDATA[Decks]]></category>
		<category><![CDATA[Disco]]></category>
		<category><![CDATA[Feather Sessions]]></category>
		<category><![CDATA[John Hunter]]></category>
		<category><![CDATA[Map]]></category>
		<category><![CDATA[Minesh]]></category>
		<category><![CDATA[Mor]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Optimisation]]></category>
		<category><![CDATA[Parallelism]]></category>
		<category><![CDATA[Pycon]]></category>
		<category><![CDATA[Random Search]]></category>
		<category><![CDATA[Repo]]></category>
		<category><![CDATA[Runners]]></category>
		<category><![CDATA[Topics Of Conversation]]></category>
		<category><![CDATA[Tutorial Slides]]></category>
		<category><![CDATA[Tweets]]></category>
		<category><![CDATA[Two Birds]]></category>

		<guid isPermaLink="false">http://ianozsvald.com/?p=1720</guid>
		<description><![CDATA[Minesh B. Amin (MBASciences) and I (Mor Consulting Ltd) taught Applied Parallel Computing over 3 hours at PyCon 2013. PyCon this year was a heck of a lot of fun: I did the fun run (mentioned below), received one of the 2,500 free Raspberry Pis that were given away, met an awful lot of interesting people [...]]]></description>
				<content:encoded><![CDATA[<p>Minesh B. Amin (<a href="http://www.mbasciences.com/">MBASciences</a>) and I (<a href="http://www.morconsulting.com/">Mor Consulting</a> Ltd) taught <a href="https://us.pycon.org/2013/schedule/presentation/27/">Applied Parallel Computing</a> over 3 hours at PyCon 2013. PyCon this year was a heck of a lot of fun: I did the fun run (mentioned below), received one of the 2,500 free Raspberry Pis that were given away, met an awful lot of interesting people and ran two birds-of-a-feather sessions (one on parallel computing for our tutorial, another on natural language processing).</p>
<p>I held off posting this entry until the video was ready (it came out yesterday). All the code and slides are in the <a href="https://github.com/ianozsvald/pycon2013_applied_parallel_computing/tree/master/Presentation%20slides">github repo</a>. Currently (but not indefinitely) there&#8217;s a VirtualBox image <a href="http://ianozsvald.com/2013/03/15/use-of-virtualbox-to-prepare-students-pycon-tutorials/">with everything</a> (Redis, Disco etc.) pre-installed.</p>
<p>After the conference, partly as a result of the BoF NLP session I created a Twitter graph <a href="http://ianozsvald.com/2013/03/18/semantic-map-of-pycon2013-twitter-topics/">&#8220;Concept Map&#8221; based on #pycon tweets</a>, then <a href="http://ianozsvald.com/2013/03/22/analysing-pydata-london-and-brighton-tweets-for-concept-mapping/">another for #pydata</a>. They neatly summarise many of the topics of conversation.</p>
<p>Here&#8217;s our room of 60+ students; slides and video are below:</p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/03/students_left.jpg"><img class="aligncenter size-medium wp-image-1724" alt="Applied Parallel Computing PyCon 2013 (left side of room)" src="http://ianozsvald.com/wp-content/uploads/2013/03/students_left-300x225.jpg" width="300" height="225" /></a></p>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/03/students_right1.jpg"><img class="aligncenter size-medium wp-image-1723" alt="Applied Parallel Computing PyCon 2013 (right side of room)" src="http://ianozsvald.com/wp-content/uploads/2013/03/students_right-300x225.jpg" width="300" height="225" /></a></p>
<p>The video runs for 2 hours 40 minutes:</p>
<p><iframe width="420" height="315" src="http://www.youtube.com/embed/vL0UtbJOKR0" frameborder="0" allowfullscreen></iframe></p>
<p>Here&#8217;s a list of our slides:</p>
<ol>
<li><a href="https://github.com/ianozsvald/pycon2013_applied_parallel_computing/blob/master/Presentation%20slides/IntroParallelism.pdf">Intro to Parallelism</a> (Minesh)</li>
<li><a href="https://github.com/ianozsvald/pycon2013_applied_parallel_computing/blob/master/Presentation%20slides/LessonsLearned_AppliedParallelComputing_PyCon2013.pdf">Lessons Learned</a> (Ian)</li>
<li><a href="https://github.com/ianozsvald/pycon2013_applied_parallel_computing/blob/master/Presentation%20slides/ListOfTasks_AppliedParallelComputing_PyCon2013.pdf">List of Tasks with Mandelbrot set</a> (Ian)</li>
<li><a href="https://github.com/ianozsvald/pycon2013_applied_parallel_computing/blob/master/Presentation%20slides/MapReduce_AppliedParallelComputing_PyCon2013.pdf">Map/Reduce with Disco</a> (Ian)</li>
<li><a href="https://github.com/ianozsvald/pycon2013_applied_parallel_computing/blob/master/Presentation%20slides/IntroOS.pdf">Hyperparameter optimisation with grid and random search</a> (Minesh)</li>
</ol>
<p>These are each of the slide decks:</p>
<p><script async class="speakerdeck-embed" data-id="840fbb406fe90130792122000a1d8862" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script></p>
<p><script async class="speakerdeck-embed" data-id="850c89c06fe90130111922000a918550" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script></p>
<p><script async class="speakerdeck-embed" data-id="8596d5706fe90130111922000a918550" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script></p>
<p><script async class="speakerdeck-embed" data-id="865259a06fe90130792122000a1d8862" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script></p>
<p><script async class="speakerdeck-embed" data-id="78db06d06fe90130792122000a1d8862" data-ratio="1.77777777777778" src="//speakerdeck.com/assets/embed.js"></script></p>
<p>I also had fun in the 5k fun run (coming around 77th of 150 runners); we raised $7k or so for cancer research and the <a href="http://numfocus.org/johnhunter/">John Hunter Memorial Fund</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://ianozsvald.com/2013/04/02/applied-parallel-computing-pycon-2013-tutorial-slides-and-code/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		<feedburner:origLink>http://ianozsvald.com/2013/04/02/applied-parallel-computing-pycon-2013-tutorial-slides-and-code/</feedburner:origLink></item>
		<item>
		<title>Analysing #pydata, London and Brighton tweets for concept mapping</title>
		<link>http://feedproxy.google.com/~r/EntrepreneurialGeekiness/~3/1CoLcH45ag4/</link>
		<comments>http://ianozsvald.com/2013/03/22/analysing-pydata-london-and-brighton-tweets-for-concept-mapping/#comments</comments>
		<pubDate>Fri, 22 Mar 2013 00:16:10 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[ArtificialIntelligence]]></category>
		<category><![CDATA[Life]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Analysing]]></category>
		<category><![CDATA[Attendance]]></category>
		<category><![CDATA[Bigram]]></category>
		<category><![CDATA[Brighton]]></category>
		<category><![CDATA[Collocation]]></category>
		<category><![CDATA[Concept Mapping]]></category>
		<category><![CDATA[Education Source]]></category>
		<category><![CDATA[Fernando Perez]]></category>
		<category><![CDATA[Few Days]]></category>
		<category><![CDATA[Friendly Software]]></category>
		<category><![CDATA[Innovation]]></category>
		<category><![CDATA[Inspiration]]></category>
		<category><![CDATA[Ipython]]></category>
		<category><![CDATA[London Brighton]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Nltk]]></category>
		<category><![CDATA[Notebook]]></category>
		<category><![CDATA[Noun Phrase]]></category>
		<category><![CDATA[Oth]]></category>
		<category><![CDATA[Peter Norvig]]></category>
		<category><![CDATA[Phrases]]></category>
		<category><![CDATA[Props]]></category>
		<category><![CDATA[Pycon]]></category>
		<category><![CDATA[Santa Clara Ca]]></category>
		<category><![CDATA[Scientist]]></category>
		<category><![CDATA[Social Networks]]></category>
		<category><![CDATA[Software Carpentry]]></category>
		<category><![CDATA[Tweets]]></category>

		<guid isPermaLink="false">http://ianozsvald.com/?p=1748</guid>
		<description><![CDATA[Below I&#8217;ve visualised tweets for the #PyData conference and the cities of London and Brighton &#8211; this builds on my &#8216;concept cloud&#8217; from a few days ago at the #PyCon conference. Props to Maksim for his Social Media Analysis tutorial for inspiration. Update &#8211; Maksim&#8217;s Analyzing Social Networks tutorial video is online. For the earlier #PyCon [...]]]></description>
				<content:encoded><![CDATA[<p>Below I&#8217;ve visualised tweets for the #PyData conference and the cities of London and Brighton &#8211; this builds on my &#8216;<a href="http://ianozsvald.com/2013/03/18/semantic-map-of-pycon2013-twitter-topics/">concept cloud</a>&#8217; from a few days ago at the #PyCon conference. Props to Maksim for his <a href="https://us.pycon.org/2013/schedule/presentation/29/">Social Media Analysis</a> tutorial for inspiration.</p>
<p><strong>Update</strong> &#8211; Maksim&#8217;s <a href="http://pyvideo.org/video/1714/analyzing-social-networks-with-python">Analyzing Social Networks</a> tutorial video is online.</p>
<p>For the earlier <a href="https://us.pycon.org/2013/">#PyCon 2013</a> analysis I visualised #hashtags and @usernames from #pycon tagged tweets during the conference. I&#8217;ve built upon this to add some natural language processing for &#8216;noun phrase extraction&#8217; which I detail below &#8211; this helps me to pull out phrases that are descriptive but haven&#8217;t been tagged. It also helps us to see which people are connected with which subjects. For the PyCon analysis I collected 22k tweets; after removing retweets I was left with 7,853 for analysis.</p>
<h2>#PyData (PyData Santa Clara 2013)</h2>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/03/pydata_weds_afternoon.png"><img class="aligncenter size-medium wp-image-1749" alt="pydata_weds_afternoon" src="http://ianozsvald.com/wp-content/uploads/2013/03/pydata_weds_afternoon-300x300.png" width="300" height="300" /></a></p>
<p><a href="http://sv2013.pydata.org/">PyData 2013</a> is a much smaller conference than PyCon (PyCon had 2,500 people and 20% female attendance, PyData had around 400 with 10% female attendance). Being smaller it had far fewer tweets &#8211; after removing retweets I had just 225 tweets to analyse. Cripes! This is clearly <em>not big data</em>. The other problem was that people weren&#8217;t using many #hashtags; they were referring to topics using natural language. For example:</p>
<blockquote><p>&#8220;Peter Norvig was giving a talk at PyData in Santa Clara, CA on the topic of innovation in education.&#8221; (<a href="https://twitter.com/jdunck/status/314073172360187906">source</a>)</p></blockquote>
<p>Clearly some natural language processing was required. I took two approaches:</p>
<ul>
<li>Extract capitalised sub-phrases (e.g. &#8220;Peter Norvig&#8221;, &#8220;Santa Clara&#8221;) of one or more words</li>
<li>Use NLTK&#8217;s <a href="https://en.wikipedia.org/wiki/N-gram">bigram</a> <a href="https://nltk.googlecode.com/svn/trunk/doc/howto/collocations.html">collocation</a> analyser (to find lowercased phrases such as &#8220;ipython notebook&#8221;, &#8220;machine learning&#8221;)</li>
</ul>
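<p>A rough sketch of the two approaches, using only the standard library. The regular expression and function names are illustrative guesses rather than the code in the repo, and the raw bigram count is only a stand-in for NLTK&#8217;s collocation analysis, which ranks word pairs by an association measure instead of plain frequency.</p>

```python
import re
from collections import Counter

# Runs of one or more capitalised words, e.g. "Peter Norvig" or "Santa Clara".
CAPITALISED_PHRASE = re.compile(r"[A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*")


def capitalised_phrases(text):
    """Return capitalised sub-phrases of one or more words, in order."""
    return CAPITALISED_PHRASE.findall(text)


def lowercase_bigrams(texts):
    """Count adjacent word pairs (lowercased) across all texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(zip(words, words[1:]))
    return counts


tweet = ("Peter Norvig was giving a talk at PyData in Santa Clara, CA "
         "on the topic of innovation in education.")
print(capitalised_phrases(tweet))

# Made-up example texts, just to exercise the bigram counter.
samples = [
    "loving the ipython notebook",
    "ipython notebook demo was great",
    "machine learning talk next",
]
bigrams = lowercase_bigrams(samples)
print(bigrams.most_common(3))
```

<p>Note that a pattern this simple also picks up any capitalised word at the start of a sentence, so some filtering of common words would be needed in practice.</p>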
<p>The plot uses three colours:</p>
<ul>
<li>white is for #hashtags</li>
<li>light blue is for @usernames</li>
<li>dark green is for phrases (extracted using natural language processing)</li>
</ul>
<p>Starting at the bottom of the plot we see three clusters of references: one around <a href="https://twitter.com/fperez_org">@fperez_org</a> (Fernando Perez of IPython), one around <a href="https://twitter.com/swcarpentry">@swcarpentry</a> (the scientist-friendly software carpentry movement) and one around IPython and the IPython Notebook (<a href="https://twitter.com/minrk">@minrk</a> of IPython/parallel is linked too). I like the connection to Julia &#8211; Fernando discussed during his keynote that Julia now interoperates with Python.</p>
<p>The day before we had <a href="https://en.wikipedia.org/wiki/Peter_Norvig">Peter Norvig</a> (Director of Research at Google) giving a keynote on the use of Python in education at Udacity. It included a discussion of how machine learning could be used to identify the mistakes that new coders make, so that friendlier error messages could help students correct their code. See the clustering around this at the top of the graph.</p>
<p>Later the same day Henrik (<a href="https://twitter.com/brinkar">@brinkar</a>) spoke on <a href="http://about.wise.io/">Wise.io</a>&#8217;s Random Forest classifier. Their approach was efficient enough to demo live on a Raspberry Pi. The connection from Peter to Henrik goes via #venturebeat, who <a href="http://venturebeat.com/2013/03/19/data-science-nerds-bring-machine-learning-to-the-masses-exclusive/">covered</a> wise.io&#8217;s new software release during the conference.</p>
<p>Connecting IPython and Wise.io is <a href="https://twitter.com/ogrisel">@ogrisel</a> (Olivier Grisel) of scikit-learn. He gave an impressive (and given the variability of conference wifi &#8211; slightly ballsy) live demo of scaling a machine learning system via IPython Parallel on EC2.</p>
<p>In the middle we see <a href="https://twitter.com/teoliphant">@teoliphant</a> (Travis Oliphant) joined to Continuum (his company). Off to the right I get to blow my own trumpet &#8211; the phrases &#8220;awesome python&#8221; and &#8220;network analysis&#8221; connect to &#8220;russel brand&#8221; which is how one wag described my lightning talk. I got a chance to demo the earlier version of this at the end of <a href="https://twitter.com/katychuang">@katychuang</a>&#8217;s talk on networkx.</p>
<h2>London (geo-tagged tweets)</h2>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/03/londonout.png"><img class="aligncenter size-medium wp-image-1750" alt="londonout" src="http://ianozsvald.com/wp-content/uploads/2013/03/londonout-300x300.png" width="300" height="300" /></a></p>
<p>For the last month I&#8217;ve been grabbing tweets in the London geo area for another project. I had to raise my filtering levels to bring the network down to a sane (and easily visualised) number of nodes. After removing ReTweets I have 497,771 tweets from just a subset of my data. Some obvious clusters can be seen:</p>
<ul>
<li>#weather and #rain and (presumably a rather wet) &#8220;St Albans&#8221; (a very British discussion)</li>
<li>The &#8220;O2 Arena&#8221; near the centre with &#8220;Justin Beiber&#8221; and #believetour, linked with #amazing, #excited, #nowplaying</li>
<li><a href="https://twitter.com/onedirection">@onedirection</a> must have been playing (connected with band members <a href="https://twitter.com/louis_tomlinson">@louis_tomlinson</a> and <a href="https://twitter.com/real_liam_payne">@real_liam_payne</a> amongst others)</li>
<li>To the top-right we have a football cluster with &#8220;Manchester United&#8221;, &#8220;Champions League&#8221;, #cpfc, #realmadrid and &#8220;Old Trafford&#8221;</li>
<li>The usual tourist spots like &#8220;Tower Bridge&#8221;, &#8220;Covent Garden&#8221;, &#8220;Hyde Park&#8221;, &#8220;Big Ben&#8221;, &#8220;Trafalgar Square&#8221; are discussed with #happy #sun #loveit; linked just off here are &#8220;London Heathrow Airport&#8221; and &#8220;New York&#8221;</li>
</ul>
<h2>Brighton (geo-tagged tweets)</h2>
<p><a href="http://ianozsvald.com/wp-content/uploads/2013/03/brighton.png"><img class="aligncenter size-medium wp-image-1751" alt="brighton" src="http://ianozsvald.com/wp-content/uploads/2013/03/brighton-300x300.png" width="300" height="300" /></a></p>
<p>This is my favourite, analysed using 40,379 tweets after removing ReTweets. The nature of the two cities (Brighton is 50 miles south of London on the coast; it is a university town with a young &amp; party-friendly population) is quite apparent:</p>
<ul>
<li>Top left there is discussion around &#8220;One Direction&#8221;, #justinbeiber and #seo (a particular Brighton tech <em>thing</em>)</li>
<li>Just south of <a href="https://twitter.com/justinbieber">@justinbieber</a> is a single chain of not-safe-for-work ranting (another particular Brighton <em>thing</em>)</li>
<li>If you jump to the bottom right you&#8217;ll see #underwear, #lingerie, #teenagers &#8211; not as dodgy as you might expect, Sweetling were doing a <a href="http://sweetling.co.uk/products">social media</a> <a href="https://twitter.com/dollysweetling/status/308610700694130689/photo/1">bra</a> campaign</li>
<li>#hove is joined with #sunny #morning and nearby places #lewes #shoreham</li>
<li>#brightonbeach and &#8220;Brighton Pier&#8221; connect with #birds (Seagulls &#8211; a bane!) and #sun</li>
<li>#friends, #memories, #happy, #goodtimes, #marina, #fun, #girls cluster around the centre (Brighton does like a party)</li>
<li>Off down to the bottom left is some sort of political discussion (what were they doing in Brighton?)</li>
</ul>
<h2>Reproducing this</h2>
<p>All the code is on GitHub at <a href="https://github.com/ianozsvald/twitter_networkx_concept_map">twitter_networkx_concept_map</a>, including the one-line cURL command to capture the data. An example .gephi file is included for visualisation in <a href="https://gephi.org/">Gephi</a>. The built-in <a href="https://networkx.lanl.gov/">networkx</a> viewer (optionally using <a href="http://www.graphviz.org/">GraphViz</a>) works reasonably well but isn&#8217;t interactive. Maksim&#8217;s tutorial and utils class were jolly useful (utils is in my repo). I&#8217;m also using <a href="https://pypi.python.org/pypi/twitter-text-python/">twitter-text-python</a> for parsing @usernames, #hashtags and URLs from the tweets.</p>
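<p>For a feel of the underlying idea without installing anything, here is a standard-library sketch of building weighted co-occurrence edges from tweets. It is not the code from the repo: the regular expression is a crude stand-in for twitter-text-python, the &#8220;RT @&#8221; retweet check is an assumption, and the sample tweets are invented.</p>

```python
import re
from collections import Counter
from itertools import combinations

# Crude stand-in for a proper tweet parser: matches #hashtags and @usernames.
TAG_OR_USER = re.compile(r"[#@]\w+")


def is_retweet(text):
    """Assumed retweet test: the tweet text starts with 'RT @'."""
    return text.startswith("RT @")


def cooccurrence_edges(tweets):
    """Count how often each pair of #hashtags/@usernames shares a tweet."""
    edges = Counter()
    for text in tweets:
        if is_retweet(text):
            continue  # retweets would only inflate the weights
        nodes = sorted(set(token.lower() for token in TAG_OR_USER.findall(text)))
        edges.update(combinations(nodes, 2))
    return edges


# Invented sample tweets for illustration.
tweets = [
    "Great keynote by @fperez_org on #ipython at #pydata",
    "RT @someone: Great keynote by @fperez_org on #ipython at #pydata",
    "#ipython notebook demo at #pydata",
]
edges = cooccurrence_edges(tweets)
for (left, right), weight in edges.most_common():
    print(left, right, weight)
```

<p>Each weighted pair maps naturally onto an edge in a graph library such as networkx, and raising a minimum weight threshold is one way to bring a large network down to a size that can be visualised.</p>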
<p>If you want some custom work around this, give me a shout via <a href="http://www.morconsulting.com/">Mor Consulting</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://ianozsvald.com/2013/03/22/analysing-pydata-london-and-brighton-tweets-for-concept-mapping/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		<feedburner:origLink>http://ianozsvald.com/2013/03/22/analysing-pydata-london-and-brighton-tweets-for-concept-mapping/</feedburner:origLink></item>
	</channel>
</rss>
