Born to be geek!
2023-01-05T18:15:16+01:00
http://herraiz.org/blog
Israel Herraiz
isra@herraiz.org
First steps in Deep Learning (workshop @ KSchool)
2018-02-09T00:00:00+01:00
http://herraiz.org/blog/2018/02/09/deep-learning
<p>
Although <a href="http://herraiz.org/blog/2017/08/29/data-science-sw-eng/">I am not the biggest fan of notebooks</a> for data science
projects, they are undoubtedly a great tool for purposes like teaching
or delivering a talk.
</p>
<p>
In fact, yesterday I delivered a small workshop at <a href="http://kschool.com/">KSchool</a> on taking
your first baby steps in deep learning, using <a href="https://colab.research.google.com/">Google Colab</a>,
Google's Python notebook tool.
</p>
<p>
Google Colab works like Jupyter, except that several people can work
on the same notebook at once, à la Google Docs. But it comes with even
nicer things: you can use a Tesla K80 GPU with TensorFlow and Keras
to get lightning-fast training even when dealing with large
datasets. <b>For free</b>, just with your Google account.
</p>
<p>
In the workshop yesterday, we used a small dataset, the famous MNIST
handwritten digits set. But I have tried working with much larger
images in Colab (loading them from Google Drive!), and it works like a
charm.
</p>
<p>
If you are curious about Deep Learning, or about using a GPU entirely
<b>for free</b>, have a look at the Keras notebook I prepared for the
workshop yesterday:
</p>
<ul class="org-ul">
<li><a href="https://colab.research.google.com/drive/17V67nca1HVwwOlC_ruC8SOhQRC3olMBB">First steps in Deep Learning with Keras - a Google Colab notebook</a></li>
</ul>
<p>
If you don't have a Google account, <a href="https://drive.google.com/file/d/17V67nca1HVwwOlC_ruC8SOhQRC3olMBB/view?usp=sharing">grab the notebook from this open
link</a>; you can download it anonymously and run it locally using
Jupyter (with Python 3 and Keras).
</p>
<p>
If you do have an account, you can also copy it to your Google Drive
and run it in your own environment at Google Colab.
</p>
<p>
The notebook includes detailed explanations of every step we took at
the workshop, and links to external materials and videos, just in case
you want to dig deeper into some of the details.
</p>
Killing machines - why data science needs software engineering
2017-08-29T00:00:00+02:00
http://herraiz.org/blog/2017/08/29/data-science-sw-eng
<p>
<a href="http://jupyter.org/">Jupyter notebooks</a> (and other notebook-based systems) are a very
popular and handy tool in data science. With a notebook, you can
quickly explore different alternatives, getting immediate feedback,
producing plots, combining code and text, achieving a form of <a href="https://en.wikipedia.org/wiki/Literate_programming">literate
programming</a>, and allowing hassle-free sharing of your results, a
key enabler for <a href="https://en.wikipedia.org/wiki/Reproducibility">reproducibility</a>, which is an essential feature of any empirical
discipline.
</p>
<p>
However, notebooks are not enough.
</p>
<p>
Why? Let's get to know a story about artificial intelligence and the
singularity.
</p>
<p>
<a href="https://www.theguardian.com/technology/2017/jul/25/elon-musk-mark-zuckerberg-artificial-intelligence-facebook-tesla">Elon Musk and Mark Zuckerberg have recently engaged in a public debate
about the perils of artificial intelligence</a> (AI). On one side of the
debate, Elon Musk argues that AI may give birth to future killing
machines that will exploit humanity as slaves. The moment when this
happens is known as <a href="https://en.wikipedia.org/wiki/Technological_singularity">the singularity</a>. From that moment on, machines may
autonomously decide to kill us, or to enslave us, or to do to us anything
they please.
</p>
<p>
What Elon Musk fails to notice is that we have already created machines
that kill humans autonomously: <a href="https://en.wikipedia.org/wiki/Therac-25">the Therac 25</a>, a radiotherapy machine
that killed 4 people and severely injured 2 more.
</p>
<p>
How did we as humankind design such a horrible machine? What kind of
hyper-sophisticated super intelligence did we create for the Therac
25?
</p>
<p>
It was much easier than that. We were just sloppy. We did not follow
(now common) best practices for software development.
</p>
<p>
Whether we like it or not, we as data scientists write code on a daily
basis. If you don't want humankind to be enslaved by machines,
don't forget: notebooks are great, but not enough. You need software
engineering too.
</p>
<p>
Want to know more? Find the story and links with more details in the
slides I presented at <a href="https://databeers.tumblr.com/post/162044820416/databeers-madrid-xix-2017-06-29">Databeers Madrid on June 29</a>:
</p>
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSGShLJ2cL26BJ2Isk-5avc6QhV325jm5VBr_ZT0oTWafXnxM0E2zlvRniD7UGCOLlrrGgdYhrikwzf/embed?start=false&loop=false&delayms=3000" frameborder="0" width="480" height="299" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
<p>
(<a href="https://docs.google.com/presentation/d/1TMjYVn3jJoAoXekbIrPIHT0ATFWgU0Imxa27VWUB0UY/edit?usp=sharing">grab a copy of the slides</a>)
</p>
Trying to write from iPad
2016-07-03T00:00:00+02:00
http://herraiz.org/blog/2016/07/03/trying-to-write-from-ipad
<p>
My blog works using <a href="https://jekyllrb.com">Jekyll</a> and <a href="http://orgmode.org">Org Mode</a>,
along with a Git repository. Every time I push a new org file to
the repository (a new blog post), it gets translated into HTML
by Emacs, and published on the web.
</p>
<p>
This is very cool, as I can write in the blog while I am offline,
and just push the changes whenever I get online again.
</p>
<p>
But there is a big drawback: I can only post from devices that run Emacs.
</p>
<p>
In recent months I have started to write more and more with
an iPad, and I have always felt that it really sucks that I
cannot post to my blog directly from the iPad.
</p>
<p>
After some investigation, I think I have found the setup that will
allow me to write from the iPad, and maybe this also means that I will revive
my blog.
</p>
<p>
In order to publish a blog post from the iPad, I need two pieces:
</p>
<ul class="org-ul">
<li>A text editor, with nice syntax highlighting if possible</li>
<li>A Git client that allows me to push to my personal repo using SSH</li>
</ul>
<p>
I have not been able to find a text editor with Org mode syntax
highlighting. But other than that, <a href="http://www.textasticapp.com">Textastic</a> works just fine.
</p>
<p>
It integrates very well with <a href="http://workingcopyapp.com">Working Copy</a>, a very nice
Git client for iPad. This app allows me to commit to a local clone on the iPad,
and then push to my server.
</p>
<p>
So now, I can write the Org file in the editor, save it in Working Copy,
commit and push it.
The hooks in my cloned repo on the server, where Emacs and Jekyll are installed,
take care of the rest.
</p>
Work like in the teams of the future
2014-08-11T00:00:00+02:00
http://herraiz.org/blog/2014/08/11/working-together
<p>
I have always been convinced that recruiting is critical for a
team to be successful. If you don't hire the best and the brightest,
you are wasting everyone's time. Recently, I stumbled upon a
<a href="http://firstround.com/article/Heres-Why-Youre-Not-Hiring-the-Best-and-the-Brightest">blog post with that very same title</a>.
</p>
<p>
You may think (as I did until not so long ago) that the
solution to this problem is just setting the bar very
high. However, there is a trade-off between getting the people you need
and setting the bar too high. You might be hiring the truly best
and brightest, but if they are just a few people, or you take too
long to hire, then you are not solving the problem.
</p>
<p>
The solution? Widen your focus and hire globally, <i>but</i> don't
relocate people; just let them work from where they are. This is of
course like <i>discovering the Mediterranean Sea</i>, that is, nothing
new: many open source projects already work in globally distributed teams.
</p>
<p>
However, making a remote team work like a team is not
straightforward. You have to set up the proper environment to allow the
team to work, regardless of where everyone is located.
</p>
<p>
The post contains thorough advice on how to achieve this; I
particularly liked the following points:
</p>
<ul class="org-ul">
<li>“Enable people to leave a record of the useful things they’ve
done. <b>Not a 'to do' list, but a 'done' list</b>.”</li>
<li>“<b>Don't lock important information</b> into water cooler chats and
hallway meetings where nobody except the locals who happened to be
in earshot benefit.” (or in emails to specific people, share
information using mailing lists, or forwarding messages to lists if
they contain relevant information)</li>
<li>“Every time you see something like this arrive in your inbox, you
better believe in your heart of hearts that it contains useful
information. The minute these posts become just another "whenever I
have time to read that stuff" … you’ve let someone cry wolf too much
and ruined it. So tread carefully here.
<b>Everything that gets shared in this public discussion space has to be Need to Know</b>.”</li>
<li>“Never underestimate
<b>the power of actually talking to another human being</b>.”</li>
<li>“Monday Team Status Reports: Every Monday, every team at your company
(even if you just have one) should produce
<b>a brief, summarized rundown of</b>:
<ul class="org-ul">
<li>What we did last week.</li>
<li>What we’re planning to do this week.</li>
<li>Anything that is blocking us or we are concerned about.”</li>
</ul></li>
</ul>
<p>
Many of these suggestions can probably be useful too for teams that
work in the same physical location.
</p>
<p>
Some of this advice can be easily implemented in a team using tools
such as a wiki (e.g. <a href="https://www.mediawiki.org/">MediaWiki</a>) and project management software
(e.g. <a href="http://www.redmine.org/">Redmine</a>). Of course, the crucial issue is not having these tools
installed, but making sure that everyone shares the vision on how to
organize the team, how to report activities, the importance of sharing
information, etc. In short, fostering this culture of work.
</p>
Will it work in the MIR?
2014-07-10T00:00:00+02:00
http://herraiz.org/blog/2014/07/10/does-it-work-in-the-mir
<p>
The Big Data hype, the cool visualizations of data, the popularity of
Hadoop for data processing and other sophisticated tools can make us
forget that one of the most powerful data science tools has been
at our fingertips for decades.
</p>
<p>
Many data wrangling tasks can be done directly in the shell using UNIX
commands that are older than most of us. This <a href="http://strataconf.com/stratany2014/public/schedule/detail/36204">session at the Strata
conference</a> has reminded me of this once again.
</p>
<p>
Handling data using shell commands is probably faster than other
options (e.g., a Python script), but it also helps to fulfill what is (IMO) a
crucial requirement for all the code that we write to do data science:
<b>it should even work in the MIR.</b>
</p>
<p>
Yes, <a href="http://en.wikipedia.org/wiki/Mir">the MIR</a>, the Russian space station.
</p>
<p>
When NASA realized they had to collaborate with the Russians to get
their people and experiments in space, they had to design everything
<i>to work with the MIR</i>. Because the MIR was so old-school, designing for it
required some extra effort, and sometimes it even seemed kind of
outdated. But the benefits clearly outweighed the additional
effort. The Americans saw a significant reduction in the number of
problems they had to face when everything was already in space and
there was no turning back (*).
</p>
<p>
In our case, to make our code work even in the MIR, we should
always ask ourselves questions such as:
</p>
<ul class="org-ul">
<li>Will this code work unattended on a server even if it fails for some of the cases?</li>
<li>Can I extend it to more cases/files without touching the code?</li>
<li>If I give my code to a third party, instead of hating me, will she/he love me? (Is it documented, commented, etc.?)</li>
<li>And probably many more questions…</li>
</ul>
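<p>
As a rough illustration of the first two questions, here is a minimal
Python sketch. The names <code>count_lines</code>, <code>process_all</code> and <code>main</code>, and the
line-counting task itself, are hypothetical and only stand in for whatever
your real processing is. The idea is that the input files are passed in
rather than hard-coded, and that a failure in one file is logged without
stopping the rest of the run.
</p>

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")


def count_lines(path):
    """Hypothetical per-file task: count the lines in one text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def process_all(paths):
    """Run the task on every path; log failures and keep going."""
    results, failures = {}, []
    for path in paths:
        try:
            results[str(path)] = count_lines(path)
        except (OSError, UnicodeDecodeError) as error:
            # One bad file must not stop an unattended run.
            logging.error("skipping %s: %s", path, error)
            failures.append(str(path))
    return results, failures


def main(argv):
    """Process the files named in argv; return a shell-style exit code."""
    results, failures = process_all(argv)
    for name, lines in results.items():
        print(f"{name}\t{lines}")
    # A non-zero code lets cron or a scheduler notice partial failures.
    return 1 if failures else 0
```

<p>
In a script you would typically end with <code>sys.exit(main(sys.argv[1:]))</code>, so
that adding more files is just a matter of passing more arguments, with no
changes to the code.
</p>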
<p>
Fulfilling all these requirements is of course not straightforward. I
know sometimes <i>making things work in the MIR</i> can be painful,
discouraging and disheartening. After all, it works on your laptop,
why should you bother making it MIR-compliant?
</p>
<p>
The Americans asked themselves the same question all the time. The answer was:
either you do it this way and get your stuff in space, or you keep your
stuff on Earth and watch the Russians progress in the space race.
Eventually the Americans ended up building their own space station, I suspect
because of how cumbersome it was to make things work in the MIR.
</p>
<p>
So even if we have <a href="http://en.wikipedia.org/wiki/Hadoop">Hadoop</a>, <a href="http://en.wikipedia.org/wiki/Spark_(cluster_computing_framework)">Spark</a>, <a href="http://en.wikipedia.org/wiki/Spark_(cluster_computing_framework)">GraphLab</a>, <a href="http://en.wikipedia.org/wiki/Apache_Mahout">Mahout</a> or any other modern
and sophisticated tool, or even if we don't have them but intend to get
there by doing research and working towards our own space
station, we should never forget the shell to make our
programs MIR-compliant.
</p>
<p>
The speaker at the Strata session is also the author of <a href="http://datasciencetoolbox.org/">Data
Science Toolbox</a>, a collection of shell tools for data science. But any
GNU/Linux distribution will also give you all the tools you need to
rock the MIR from your command line.
</p>
<p>
(*) Slightly made up story, but the point is not invalidated :)
</p>