Born to be geek!

First steps in Deep Learning (workshop @ KSchool)

2018-02-09T00:00:00+00:00

Although I am not the biggest fan of notebooks for data science projects, they are undoubtedly a great tool for purposes like teaching or delivering a talk.

In fact, yesterday I delivered a small workshop at KSchool on taking your first baby steps in deep learning, using Google Colab, Google's Python notebook tool.

Google Colab works like Jupyter, except that several people can work on the same notebook at once, à la Google Docs. But it comes with even nicer things: you can use a Tesla K80 GPU with TensorFlow and Keras, and get lightning-fast training even when dealing with large datasets. For free, just with your Google account.

In the workshop yesterday, we used a small dataset, the famous MNIST handwritten digits set. But I have tried working with much larger images in Colab (loading them from Google Drive!), and it works like a charm.

If you are curious about Deep Learning, or about using a GPU entirely for free, have a look at the Keras notebook I prepared for the workshop yesterday:

First steps in Deep Learning with Keras - a Google Colab notebook

If you don't have a Google account, grab the notebook from this open link: you can download it anonymously and run it locally using Jupyter (with Python 3 and Keras).

If you do have one, you can also copy it to your Google Drive and run it in your own Google Colab environment.

The notebook includes detailed explanations of every step we did at the workshop, and links to external materials and videos, in case you want to dig deeper into some of the details.

Killing machines - why data science needs software engineering

2017-08-29T00:00:00+00:00

Jupyter notebooks (and other notebook-based systems) are a very popular and handy tool in data science. With a notebook, you can quickly explore different alternatives, get immediate feedback, produce plots, and combine code and text, achieving a form of literate programming. Notebooks also make it hassle-free to share your results, a key enabler for reproducibility, which is an essential feature of any empirical discipline.

However, notebooks are not enough.

Why? Let's get to know a story about artificial intelligence and the singularity.

Elon Musk and Mark Zuckerberg have recently engaged in a public debate about the perils of artificial intelligence (AI). On one side of the debate, Elon Musk argues that AI may give birth to future killing machines that will exploit humanity as slaves. The moment when this happens is known as the singularity. From that moment on, machines may autonomously decide to kill us, or to enslave us, or to do to us anything they please.

What Elon Musk fails to notice is that we have already created machines that kill humans autonomously: the Therac 25, a radiotherapy machine that killed 4 people and severely injured 2 more.

How did we as humankind design such a horrible machine? What kind of hyper-sophisticated super intelligence did we create for the Therac 25?

It was much easier than that. We were just sloppy. We did not follow (now common) best practices for software development.

Whether we like it or not, we as data scientists write code on a daily basis. If you don't want humankind to be enslaved by machines, don't forget: notebooks are great, but not enough. You need software engineering too.

Want to know more? Find the story and links with more details in the slides I presented at Databeers Madrid on June 29:

(grab a copy of the slides)

Trying to write from iPad

2016-07-03T00:00:00+00:00

My blog works using Jekyll and Org Mode, along with a Git repository. Every time I push a new Org file to the repository (a new blog post), it gets translated into HTML by Emacs, and published on the web.

This is very cool, as I can write in the blog while I am offline, and just push the changes whenever I get online again.

But there is a big drawback: I can only post from devices that run Emacs.

In recent months I have started to write more and more with an iPad, and I have always felt that it really sucks that I cannot post to my blog directly from the iPad.

After some investigation, I think I have found a setup that will allow me to write from the iPad, and maybe this also means that I will revive my blog.

In order to publish a blog post from the iPad, I need two pieces:

- A text editor, with nice syntax highlighting if possible
- A Git client that allows me to push to my personal repo over SSH

I have not been able to find a text editor with Org mode syntax highlighting. But other than that, Textastic works just fine.

It integrates very well with Working Copy, a very nice Git client for iPad. This app lets me commit to a local clone on the iPad, and then push to my server.

So now, I can write the Org file in the editor, save it in Working Copy, commit and push it. The hooks in my repo on the server, where Emacs and Jekyll are installed, take care of the rest.
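
A server-side hook of that kind could look roughly like the sketch below. It is only an illustration: the paths, the `publish.el` file, the `master` branch and the helper names are all hypothetical, and it assumes that the Org export is driven by Org's publishing system (`org-publish-all`) and that the site is built with `jekyll build`.

```python
#!/usr/bin/env python3
"""Hypothetical post-receive hook: export Org files with Emacs, then build with Jekyll."""
import subprocess
from pathlib import Path

# All paths below are illustrative assumptions, not a real server layout.
CHECKOUT = Path("/srv/blog/checkout")   # working tree with the Org sources
PUBLISH_EL = CHECKOUT / "publish.el"    # hypothetical Org publishing config
SITE_DIR = Path("/srv/blog/public")     # directory served by the web server


def build_commands(checkout=CHECKOUT, publish_el=PUBLISH_EL, site_dir=SITE_DIR):
    """Return the commands the hook would run, in order."""
    return [
        # 1. Update the working tree to the branch that was just pushed
        #    (the branch name is an assumption).
        ["git", f"--work-tree={checkout}", "checkout", "-f", "master"],
        # 2. Let Emacs export the Org files in batch mode (no window needed).
        ["emacs", "--batch", "-l", str(publish_el), "-f", "org-publish-all"],
        # 3. Let Jekyll turn the exported files into the final site.
        ["jekyll", "build", "--source", str(checkout), "--destination", str(site_dir)],
    ]


def run_hook(dry_run=False):
    """Run each step, stopping at the first failure; return the commands handled."""
    done = []
    for cmd in build_commands():
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
        done.append(cmd)
    return done
```

Saved as `hooks/post-receive` in the server repository, made executable, and ending with a call to `run_hook()`, a script like this would run after every push. Calling `run_hook(dry_run=True)` only lists the steps, which is handy for checking the setup without touching the site.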

Work like in the teams of the future

2014-08-11T00:00:00+00:00

I have always been convinced that recruiting is critical for a team to be successful. If you don't hire the best and the brightest, you are wasting everyone's time. Recently, I stumbled upon a blog post with that very same title.

You may think (as I did until not so long ago) that the solution to this problem is just setting the bar very high. However, there is a trade-off between getting the people you need and setting the bar too high. You might be hiring the truly best and brightest, but if they are just a few people, or you take too long to hire, then you are not solving the problem.

The solution? Widen your focus and hire globally. You don't have to relocate people; just let them work from wherever they are. This is, of course, like discovering the Mediterranean Sea: many open source projects already work in globally distributed teams.

However, making a remote team work like a team is not straightforward. You have to set up the proper environment to allow the team to work, regardless of where everyone is located.

The post contains thorough advice on how to achieve this; I particularly liked the following points:

“Enable people to leave a record of the useful things they’ve done. Not a 'to do' list, but a 'done' list.”
“Don't lock important information into water cooler chats and hallway meetings where nobody except the locals who happened to be in earshot benefit.” (or into emails to specific people: share information using mailing lists, or forward messages to the lists if they contain relevant information)
“Every time you see something like this arrive in your inbox, you better believe in your heart of hearts that it contains useful information. The minute these posts become just another 'whenever I have time to read that stuff' … you’ve let someone cry wolf too much and ruined it. So tread carefully here. Everything that gets shared in this public discussion space has to be Need to Know.”
“Never underestimate the power of actually talking to another human being.”
“Monday Team Status Reports: Every Monday, every team at your company (even if you just have one) should produce a brief, summarized rundown of:
- What we did last week.
- What we’re planning to do this week.
- Anything that is blocking us or we are concerned about.”

Many of these suggestions are probably also useful for teams that work in the same physical location.

Some of this advice can be easily implemented in a team using tools such as a wiki (e.g. MediaWiki) and project management software (e.g. Redmine). Of course, the crucial issue is not having these tools installed, but making sure that everyone shares the vision of how to organize the team, how to report activities, the importance of sharing information, etc. In short, fostering this work culture.

Will it work in the MIR?

2014-07-10T00:00:00+00:00

The Big Data hype, the cool visualizations of data, the popularity of Hadoop for data processing and other sophisticated tools can make us forget that one of the most powerful data science tools has been at our fingertips for decades.

Many data wrangling tasks can be done directly in the shell using UNIX commands that are older than most of us. This session at the Strata conference has reminded me of this once again.

Handling data using shell commands is probably faster than other options (e.g., a Python script), but it also helps fulfil an (IMO) crucial requirement for all the code that we write to do data science: it should work even in the MIR.

Yes, the MIR, the Russian space station.

When NASA realized they had to collaborate with the Russians to get their people and experiments into space, they had to design everything to work with the MIR. Because the MIR was so old-school, designing for it required some extra effort, and sometimes it even seemed kind of outdated. But the benefits clearly outweighed the additional effort. The Americans saw a significant reduction in the number of problems they had to face once everything was already in space and there was no turning back (*).

In our case, to make our code work even in the MIR, we should always ask ourselves questions such as:

- Will this code work unattended on a server even if it fails for some of the cases?
- Can I extend it to more cases/files without touching the code?
- If I give my code to a third person, instead of hating me, will she/he love me? (it is documented, commented, etc.)
- And probably many more questions…
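
To make the first questions more concrete, here is a minimal sketch in Python of a batch job written with them in mind. The names (`count_lines`, `process_all`, `main`) are made up for illustration, and counting lines is just a stand-in for whatever the real per-file work would be.

```python
import logging
from pathlib import Path

log = logging.getLogger("mir")


def count_lines(path):
    """Stand-in for the real per-file work (hypothetical): count the lines."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def process_all(paths, work=count_lines):
    """Apply `work` to every path; log failures and keep going.

    Returns (results, failures) so the caller can decide what to do next.
    """
    results, failures = {}, {}
    for path in paths:
        try:
            results[str(path)] = work(path)
        except Exception as error:  # one bad file must not stop the batch
            log.warning("skipping %s: %s", path, error)
            failures[str(path)] = str(error)
    return results, failures


def main(argv):
    """Files come from the command line, so new cases need no code changes."""
    logging.basicConfig(level=logging.INFO)
    results, failures = process_all(Path(arg) for arg in argv)
    for name, value in results.items():
        print(f"{name}\t{value}")
    # A non-zero exit status lets cron or a scheduler notice partial failures.
    return 1 if failures else 0
```

A script built this way would end with `sys.exit(main(sys.argv[1:]))` and be called as, say, `python job.py data/*.csv`. A broken file is logged and skipped instead of killing the whole run, and adding more files only means changing the command line.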

Fulfilling all these requirements is of course not straightforward. I know that making things work in the MIR can sometimes be painful, discouraging and disheartening. After all, it works on your laptop, so why should you bother making it MIR-compliant?

The Americans asked themselves the same thing all the time. The answer was: either you do it this way and get your stuff into space, or you keep your stuff on Earth and watch the Russians progress in the space race. Eventually the Americans ended up building their own space station, I suspect because of how cumbersome it was to make things work in the MIR.

So even if we have Hadoop, Spark, GraphLab, Mahout or any other modern and sophisticated tool, or even if we don't have them yet but intend to get there by doing research and working towards our own space station, we should never forget the shell to make our programs MIR-compliant.

The speaker at the Strata session is also the author of Data Science Toolbox, a collection of shell tools for data science. But any GNU/Linux distribution will also give you all the tools you need to rock the MIR from your command line.

(*) Slightly made up story, but the point is not invalidated :)