<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/atom10full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?><feed xmlns="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
 
 <title>Thomson Nguyen | Data Scientist, eats bacon</title>
 
 <link href="http://squareheadgroup.com/" />
 <updated>2013-02-04T11:18:56-08:00</updated>
 <id>http://squareheadgroup.com</id>
 <author>
   <name>Thomson Nguyen</name>
 </author>
 
 
 <atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" type="application/atom+xml" href="http://feeds.feedburner.com/ThomsonNguyen" /><feedburner:info uri="thomsonnguyen" /><atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com/" /><feedburner:browserFriendly></feedburner:browserFriendly><entry>
   <title>Getting Things Done</title>
   <link href="http://squareheadgroup.com/getting-things-done" />
   <updated>2012-11-04T00:00:00-07:00</updated>
   <id>http://squareheadgroup.com/getting-things-done</id>
   <content type="html">&lt;p&gt;I&amp;rsquo;m currently in New York with &lt;a href="http://en.wikipedia.org/wiki/Hurricane_Sandy"&gt;no power or
internet&lt;/a&gt;, so I&amp;rsquo;ve decided to knock
off some tasks that have been sitting in my to-do list. I&amp;rsquo;m currently using
&lt;a href="http://www.asana.com"&gt;Asana&lt;/a&gt; to
track my progress on things that I want or need to do:&lt;/p&gt;

&lt;p&gt;&lt;a href="/img/asana.png"&gt;&lt;img src="/img/asana.png" alt="Tracking things in Asana" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve been on it for the past month or so, and I think it&amp;rsquo;s been pretty useful.
When I started caring about rigorously tracking my to-do list, I came up with
the following list of features that I wanted from my tracking tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ability to collaborate with other people (e.g., shared lists)&lt;/li&gt;
&lt;li&gt;Deadline tracking and reminders&lt;/li&gt;
&lt;li&gt;Hooks into GitHub for personal coding projects&lt;/li&gt;
&lt;li&gt;Mobile alerts&lt;/li&gt;
&lt;li&gt;Historical archiving of finished tasks&lt;/li&gt;
&lt;li&gt;Personal analytics of how fast I&amp;rsquo;m closing tasks&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Six months and five tools later, I&amp;rsquo;ve updated my feature list to reflect what
actually helps me get things done:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sustained adherence to one tool and one methodology, no matter what&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This point is trite and beaten to death, but it bears repeating for me, so maybe
someday you might find this post helpful: all the time you spend figuring out
which to-do list, text editor, or mail client to use is time taken away from
actually getting things done. This especially applies if you use any tool that
requires extensive &lt;a href="https://github.com/itsthomson/dotfiles"&gt;dotfile
configuration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I used to justify putting extensive work into my dotfiles and tool selection as
a way of investing in future productivity. What I&amp;rsquo;ve found though is that it hasn&amp;rsquo;t
amortized for me personally, and that I reinvent my toolchain months down the
line anyway.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>On Hiring a Data Scientist</title>
   <link href="http://squareheadgroup.com/on-hiring-a-data-scientist" />
   <updated>2012-09-26T00:00:00-07:00</updated>
   <id>http://squareheadgroup.com/on-hiring-a-data-scientist</id>
   <content type="html">&lt;p&gt;There&amp;rsquo;s this &lt;a href="http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1"&gt;article that&amp;rsquo;s being passed
around&lt;/a&gt; by &lt;a href="http://www.tomdavenport.com/"&gt;Thomas
Davenport&lt;/a&gt; and &lt;a href="https://twitter.com/dpatil"&gt;DJ
Patil&lt;/a&gt; on the newfound &amp;ldquo;sexiness&amp;rdquo; of what I do for a
living, and why data science is here to stay. I&amp;rsquo;m well behind the curve on this one so
I&amp;rsquo;ll leave it to you to find the many other analyses and responses to this
article, but I wanted to call attention to the one paragraph that really
resonated with me:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Data scientists don’t do well on a short leash. They should have the freedom to
experiment and explore possibilities. That said, they need close relationships
with the rest of the business. The most important ties for them to forge are
with executives in charge of products and services rather than with people
overseeing business functions. As the story of Jonathan Goldman illustrates,
their greatest opportunity to add value is not in creating reports or
presentations for senior executives but in innovating with customer-facing
products and processes.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;A very clear and present danger that exists among companies wanting to
capitalize on the recent influx of data talent is that they&amp;rsquo;ll want to put these
men and women right away to work on short-term business objectives that add
immediate value to the company. That&amp;rsquo;s fine and dandy, but what makes this
dangerous is that continued prioritization of these tactical units of work will
only take time away from the long-term strategic research directions that
(competent) data scientists were hired to enact. As a result, shortsighted
companies will fail to realize any long-term value from their data scientists
(now glorified BI analysts), and the data scientists will become increasingly
frustrated with being siloed into a pure analyst role.&lt;/p&gt;

&lt;p&gt;I write this as a couple of my peers are exhibiting symptoms of a &amp;ldquo;failure to thrive&amp;rdquo; in their respective companies, either because their charter of data creativity and research has been perverted into being a full-time data puller, or because the peer is inflexible in doing anything other than pure research. As a cofounder, you can&amp;rsquo;t help the latter case (you hired poorly), but you can help the former by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting expectations for data science hires early on by writing a very clear list of responsibilities and expectations in your job requisition, and&lt;/li&gt;
&lt;li&gt;Once the data scientist has been hired, staying true to these roles and
responsibilities.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I suspect that first point is something a lot of startups haven&amp;rsquo;t really thought about&amp;mdash;they simply assumed &amp;ldquo;big data&amp;rdquo; was a thing and that hiring data scientists blindly was going to save us all (hint: it won&amp;rsquo;t). Likewise, fresh graduate students might assume being a data scientist for a startup is just like working in an academic research lab (hint: it&amp;rsquo;s not).&lt;/p&gt;

&lt;p&gt;One last note: I&amp;rsquo;m not saying it&amp;rsquo;s entirely the company&amp;rsquo;s fault in the paragraphs above&amp;mdash;sometimes we all have to grit our teeth and do what&amp;rsquo;s necessary to fight fires at any given moment in time. All I&amp;rsquo;m saying is that while the onus falls on data scientists to prove that their research and products add value to a business (Hilary Mason does an &lt;a href="http://www.hilarymason.com/blog/how-do-you-prioritize-research/"&gt;excellent job&lt;/a&gt; explaining how to prioritize research directions), so too should companies ensure that their developing data teams possess a significant, stated commitment to data science and research.&lt;/p&gt;

&lt;p&gt;Otherwise, there are plenty of companies that are doing just fine without data scientists.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>It's the little things</title>
   <link href="http://squareheadgroup.com/its-the-little-things" />
   <updated>2012-07-19T00:00:00-07:00</updated>
   <id>http://squareheadgroup.com/its-the-little-things</id>
   <content type="html">&lt;p&gt;I recently took
a look at our users' first name initials at
&lt;a href="http://www.causes.com"&gt;Causes&lt;/a&gt;, and found the following distribution of
first-name initials in our userbase:&lt;/p&gt;

&lt;p&gt;&lt;a href="/img/names1.png"&gt;&lt;img src="/img/names1.png" alt="Distribution of first name initials at
causes.com" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What&amp;rsquo;s interesting is that we see an unusually high number of users whose names
start with A. Was it actually that unusual though? How could I find out whether
this distribution of first name initials was anomalous?&lt;/p&gt;

&lt;h2&gt;Prior art&lt;/h2&gt;

&lt;p&gt;I initially started with some very creative ideas to construct a model
distribution to compare against: I tried scraping &lt;a href="http://www.behindthename.com/top/"&gt;top 1000
names&lt;/a&gt;; I had grand ideas to ping 500k
random public FBUIDs, etc. Sometimes though, it&amp;rsquo;s best to find prior art:&lt;/p&gt;

&lt;p&gt;&lt;a href="/img/google-answers.png"&gt;&lt;img src="/img/google-answers.png" alt="Google Answers saves the day" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Someone had already &lt;a href="http://answers.google.com/answers/threadview/id/347668.html"&gt;asked the same
question&lt;/a&gt; on Google Answers. Granted, it&amp;rsquo;s based on twelve-year-old data, and it&amp;rsquo;s only for names in the US, but I figured that was a good rough distribution to compare against the 179 million names in our userbase.&lt;/p&gt;

&lt;h2&gt;The A&amp;rsquo;s have it&lt;/h2&gt;

&lt;p&gt;By taking the total percentages from the page and creating an expected value
based off of our user numbers, I came up with this:&lt;/p&gt;

&lt;p&gt;&lt;a href="/img/names2.png"&gt;&lt;img src="/img/names2.png" alt="Distribution of first name initials compared with its expected
value" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was very surprised at how well our distribution actually fit with the
expected value calculated from the 2000 US Census. That is, except for the
extraordinarily high count of A-names. In fact, if we&amp;rsquo;re to take this census data
on good faith, we&amp;rsquo;re about 95.8% over our expected value, or almost double the number of
A-names we&amp;rsquo;d expect from the US distribution. That&amp;rsquo;s unusually high!&lt;/p&gt;
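
&lt;p&gt;The comparison above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: &lt;code&gt;percent_over&lt;/code&gt; is a hypothetical helper name, and the numbers in the example call are placeholders, not Causes or census figures.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="bash"&gt;#!/bin/sh
# Hypothetical helper: percent by which an observed count exceeds the
# count expected from a reference share of the population.
# usage: percent_over OBSERVED TOTAL REFERENCE_SHARE_PERCENT
percent_over() {
  awk -v obs="$1" -v total="$2" -v share="$3" 'BEGIN {
    expected = total * share / 100   # expected count under the reference distribution
    printf "%.1f\n", (obs - expected) / expected * 100
  }'
}

# Placeholder numbers only: 1000 users and a 10% reference share give an
# expected count of 100, so 150 observed users is 50% over.
percent_over 150 1000 10   # prints 50.0
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Running the same calculation once per initial gives one &amp;ldquo;percent over expected&amp;rdquo; figure per letter, which makes an outlier easy to spot.&lt;/p&gt;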

&lt;h2&gt;Smoking gun&lt;/h2&gt;

&lt;p&gt;Why do you think we have so many A-names? For those unfamiliar with the product
relaunch, Causes is a platform that allows you to create a pledge, a petition,
or a fundraising campaign for something you believe in. After creating an
action, we encourage you to invite your friends to take action with you. And
just what does that inviter look like?&lt;/p&gt;

&lt;p&gt;&lt;a href="/img/inviter.png"&gt;&lt;img src="/img/inviter.png" alt="Smoking gun" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I&amp;rsquo;ve whited out the surnames, but the point is clear. With such a nontrivial proportion of our userbase coming from viral channels,
sometimes it becomes necessary to revisit the little things that we gloss over
when shipping a minimum viable relaunch.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>I use this</title>
   <link href="http://squareheadgroup.com/usesthis" />
   <updated>2012-02-24T00:00:00-08:00</updated>
   <id>http://squareheadgroup.com/usesthis</id>
   <content type="html">&lt;p&gt;If you&amp;rsquo;ve ever wondered what I use on a daily basis, I was recently interviewed
by &lt;a href="http://thomson.nguyen.usesthis.com"&gt;The
Setup&lt;/a&gt;. The only thing that&amp;rsquo;s changed from when I
did the interview is that I now have a 4S, after losing my Droid Bionic. Oh
well.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>On Data Journalism</title>
   <link href="http://squareheadgroup.com/on-data-journalism" />
   <updated>2011-12-25T00:00:00-08:00</updated>
   <id>http://squareheadgroup.com/on-data-journalism</id>
   <content type="html">&lt;p&gt;Last week I had a really good Skype converation with &lt;a href="http://uxmag.com/contributors/hunter-whitney"&gt;Hunter Whitney&lt;/a&gt;, someone I met at the R
Users Meetup where I &lt;a href="http://www.squareheadgroup.com/bay-area-r-users-talk/"&gt;gave my talk&lt;/a&gt; a couple of weeks ago. He&amp;rsquo;s
currently writing a book on data visualization and we talked about all kinds
of data science-y topics, but there was one thing we touched on that I
feel strongly about: that all data scientists are really also data
journalists.&lt;/p&gt;

&lt;p&gt;By this, I don&amp;rsquo;t mean to say that we write data articles for a non-technical audience (although the
&lt;a href="http://www.guardian.co.uk/news/datablog/2011/dec/09/data-journalism-reading-riots"&gt;Guardian&lt;/a&gt; does this &lt;em&gt;really&lt;/em&gt; well). What
I mean is that as data scientists who acquire, parse, filter, mine,
represent, and refine data (totally stealing Ben Fry&amp;rsquo;s
&lt;a href="http://benfry.com/phd/"&gt;computational information
design&lt;/a&gt;), we have to acknowledge the fact that
at every step in this process, we &lt;em&gt;editorialize&lt;/em&gt; something on some
level. We have to: our
job is to turn data into some kind of statistically significant narrative for
people who have neither the analytical background nor the time to validate our research themselves. It&amp;rsquo;s in that respect that we&amp;rsquo;re sort of like journalists (without the credentials or beautiful prose, of course).&lt;/p&gt;

&lt;p&gt;Maybe I do actually mean that we write data articles for a non-technical
audience, but our audience varies&amp;mdash;we create dashboards and internal
reports for various teams, we visualize data in handy infographics for
public consumption, we machine learn datasets for product features, and so on. All I know is that somewhere along the way, I became just as concerned with
what I was trying to communicate as with what I was hacking. I think
everyone who works with data has at some point come to this same
realization.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Data mining in ten minutes</title>
   <link href="http://squareheadgroup.com/bay-area-r-users-talk" />
   <updated>2011-12-14T00:00:00-08:00</updated>
   <id>http://squareheadgroup.com/bay-area-r-users-talk</id>
   <content type="html">&lt;script src="http://speakerdeck.com/embed/4ee8ef2cd69857004901222f.js"&gt;&lt;/script&gt;


&lt;p&gt;I gave a &amp;ldquo;lightning talk&amp;rdquo; at last night&amp;rsquo;s &lt;a href="barug"&gt;Bay Area R Users Group&lt;/a&gt;. This
was a format where we had ten minutes to talk about the various ways in which we
use R. I decided to talk about what little progress I made on the &lt;a href="HHP"&gt;Heritage Health
Prize&lt;/a&gt;. This was a concerted three-month effort at Courant to learn R, Hadoop, and
data mining. PS: The rankings there are inaccurate now&amp;mdash;I haven&amp;rsquo;t touched it
since the first progress prize in August and I think I&amp;rsquo;m about 4321748th place
now.&lt;/p&gt;

&lt;p&gt;I had a great time&amp;mdash;everyone was nice and I had a lot of fruitful discussions
with people after the event.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Deploying Jekyll</title>
   <link href="http://squareheadgroup.com/jekyll-part-two" />
   <updated>2011-11-27T00:00:00-08:00</updated>
   <id>http://squareheadgroup.com/jekyll-part-two</id>
   <content type="html">&lt;p&gt;After spending some time  &lt;a href="http://www.squareheadgroup.com/getting-set-up-with-jekyll/"&gt;setting up Jekyll&lt;/a&gt;, I realized I
was serving my &lt;code&gt;.git&lt;/code&gt; directory as well. Whoops. A quick google search yielded
a pretty cool (but probably simple) trick that allows for handy deployment
without serving embarassing commit comments.&lt;/p&gt;

&lt;p&gt;In your &lt;code&gt;hooks/&lt;/code&gt; directory in your repo, create a file called &lt;code&gt;post-receive&lt;/code&gt;
containing the following:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="bash"&gt;&lt;span class="c"&gt;#!/bin/sh&lt;/span&gt;
&lt;span class="nv"&gt;GIT_WORK_TREE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/public_html/ git checkout -f
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This is a post-receive hook that checks out the latest commit into my &lt;code&gt;public_html&lt;/code&gt; directory upon
receiving a push from my local computer. I feel like I should&amp;rsquo;ve known this ages
ago.&lt;/p&gt;
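
&lt;p&gt;To see the whole flow end to end, here is a rough, self-contained sketch that runs entirely inside a scratch directory. The directory names (&lt;code&gt;site.git&lt;/code&gt;, &lt;code&gt;work&lt;/code&gt;) and the demo commit identity are assumptions made up for illustration; on a real server the bare repo and &lt;code&gt;public_html&lt;/code&gt; would live wherever your host puts them.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="bash"&gt;#!/bin/sh
set -e

# Scratch directory so nothing outside it is touched.
root=$(mktemp -d)
site="$root/public_html"   # stand-in for /path/to/public_html
bare="$root/site.git"      # hypothetical bare repo that receives pushes
work="$root/work"          # hypothetical local working copy

mkdir -p "$site"
git init --quiet --bare "$bare"
mkdir -p "$bare/hooks"

# Install the hook from above; git only runs it if it is executable.
printf '#!/bin/sh\nGIT_WORK_TREE=%s git checkout -f\n' "$site" | tee "$bare/hooks/post-receive"
chmod +x "$bare/hooks/post-receive"

# Make one commit locally and push it to the bare repo.
git init --quiet "$work"
cd "$work"
echo "hello" | tee index.html
git add index.html
git -c user.name=demo -c user.email=demo@example.com commit --quiet -m "first post"

# Push to whichever branch the bare repo's HEAD names, so the hook's
# checkout has a commit to check out.
branch=$(git --git-dir="$bare" symbolic-ref --short HEAD)
git push --quiet "$bare" "HEAD:refs/heads/$branch"

# The hook has checked the pushed commit out into the site directory,
# and there is no .git directory alongside it.
ls -A "$site"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The key design point is that the repository and the served directory are separate: the web server only ever sees the checked-out files.&lt;/p&gt;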
</content>
 </entry>
 
 <entry>
   <title>On blogging</title>
   <link href="http://squareheadgroup.com/on-blogging" />
   <updated>2011-11-14T00:00:00-08:00</updated>
   <id>http://squareheadgroup.com/on-blogging</id>
   <content type="html">&lt;p&gt;Blogging is hard.&lt;/p&gt;

&lt;p&gt;Well, it&amp;rsquo;s hard for me because I always feel like I have nothing
&lt;a href="http://zachholman.com/"&gt;meaningful&lt;/a&gt;, &lt;a href="http://flowingdata.com/"&gt;insightful&lt;/a&gt;, &lt;a href="http://blog.echen.me/"&gt;educational&lt;/a&gt;,
&lt;a href="http://www.drewconway.com/zia/"&gt;intelligent&lt;/a&gt;, &lt;a href="http://www.theatlantic.com/megan-mcardle/"&gt;poignant&lt;/a&gt;, &lt;a href="http://hilarymason.com"&gt;inspiring&lt;/a&gt; or
&lt;a href="http://unethicalblogger.com/"&gt;amusing&lt;/a&gt; to say without coming off as trite, forced, corny, pompous,
hokey, flat, or just plain stupid (I&amp;rsquo;m not going to provide examples of those,
but they&amp;rsquo;re everywhere). All blogs are an exercise in thinly-veiled narcissism,
but the successful ones manage to convince you that they are, in some way,
deserving of their vanity domain and your two minutes of attention.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Getting set up with Jekyll</title>
   <link href="http://squareheadgroup.com/getting-set-up-with-jekyll" />
   <updated>2011-10-30T00:00:00-07:00</updated>
   <id>http://squareheadgroup.com/getting-set-up-with-jekyll</id>
   <content type="html">&lt;p&gt;After several hours of futzing around with &lt;a href="https://github.com/mojombo/jekyll"&gt;Jekyll&lt;/a&gt;, we&amp;rsquo;re now live. The
hardest parts were setting up Disqus, and getting &lt;a href="https://github.com/holman/boastful"&gt;boastful&lt;/a&gt; to work (which is
probably &lt;a href="http://en.wikiquote.org/wiki/Donald_Knuth#Computer_Programming_as_an_Art_.281974.29"&gt;premature optimization&lt;/a&gt;, given the complete lack of traffic to
this site).&lt;/p&gt;

&lt;h2&gt;Jekyll for fun and profit&lt;/h2&gt;

&lt;p&gt;I&amp;rsquo;m lazy, so I pretty much cloned the Jekyll repo:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="bash"&gt;&amp;gt; git clone https://github.com/mojombo/jekyll.git
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Created a .css from &lt;a href="http://960.gs/"&gt;960 grid&lt;/a&gt;, and downgraded liquid (as Jekyll &lt;a href="http://stackoverflow.com/questions/7801197/syntax-highlighting-with-pygments-is-failing-via-liquid-templates-string-error"&gt;doesn&amp;rsquo;t
play well with 2.3.0&lt;/a&gt;):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="bash"&gt;&amp;gt; sudo gem uninstall liquid
&amp;gt; sudo gem install liquid --version &lt;span class="s1"&gt;&amp;#39;2.2.2&amp;#39;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The rest is pretty straightforward. Set up a YAML config file:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code class="bash"&gt;auto: &lt;span class="nb"&gt;true&lt;/span&gt;
markdown: rdiscount
permalink: /blag/:title
rdiscount:
  extensions: &lt;span class="o"&gt;[&lt;/span&gt;smart&lt;span class="o"&gt;]&lt;/span&gt;
pygments: &lt;span class="nb"&gt;true&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;And start writing posts in markdown!&lt;/p&gt;
</content>
 </entry>
 
 
</feed>
