Hexacta Engineering - Medium

Testing Pandas

Nico Gallinal — Mon, 07 Jan 2019 21:43:44 GMT

Well, with such a title you may think we started hiring pandas as QA analysts.
I mean, who wouldn’t want a little panda sitting next to them? They are so cute! Unfortunately, this is not the case: the legal team advised us against hiring them.

This story is about the “flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more” — pandas github.

I will not start talking about how great a library it is. It has been out in the wild (pun intended) for quite a long time and the community has embraced it for many data analysis tasks.

What I will do is talk about some ways of testing code that relies on it. The intrepid reader may think of classical unit testing. That could be one approach, until one considers the amount of data a dataframe can hold.

It is simply impossible to manually create all the many scenarios without letting the time consumed by it tend to infinity.

So, we would like some machinery to create the data for us. Not only that, we would also like to create that data in a controlled fashion to recreate the different testing scenarios.

We would need to define some invariant of the function being tested (a property of the function that must always hold) and assert that the output is in accordance with our expectations.

Let’s say we have that, and suppose we are working with dataframes of different sizes: maybe we have 90 columns (not a large number at all) and a varying number of rows. It would be desirable for this machinery to also give us a simple counterexample that violates the invariant (a technique known as shrinking), and a way of replicating that example.

Furthermore, this machinery could run each test many times. Since the data is not the same every time, that may help us find edge cases.

What I have just described above is known as property-based testing.
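As a minimal sketch of the idea (not taken from the code discussed in this post), here is a property-based test written with Hypothesis. The function sort_numbers and its two invariants are hypothetical stand-ins chosen only for illustration.

```python
from hypothesis import example, given, settings, strategies as st


def sort_numbers(values):
    """Hypothetical function under test: returns a new, ascending-sorted list."""
    return sorted(values)


@settings(max_examples=200)   # run the property against up to 200 generated inputs
@example([3, 1, 2])           # always replay this hand-picked case as well
@given(st.lists(st.integers()))
def test_sort_numbers_is_ordered(values):
    result = sort_numbers(values)
    # Invariant 1: every element is <= its successor.
    assert all(a <= b for a, b in zip(result, result[1:]))
    # Invariant 2: sorting neither adds nor drops elements.
    assert len(result) == len(values)


# A Hypothesis test is a plain function, so it can also be called directly.
test_sort_numbers_is_ordered()
```

If an invariant fails, Hypothesis reports a shrunk input, and a decorator such as example can pin that input so the failure is replayed on every run.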
The first of its kind was QuickCheck for Haskell a long time ago and, since then, many ports have been developed for different languages. I’ve never had the chance to use the original, but I have used jsverify for JavaScript, cluckcheck for Scheme and fscheck for C#.

All of them come in different flavors, some better than others, as generally happens.
Then I met Hypothesis: “A Python library for creating unit tests which are simpler to write and more powerful when run, finding edge cases in your code you wouldn’t have thought to look for. It is stable, powerful and easy to add to any existing test suite” and determined it is the best one I’ve used. Let me show you with a few examples why I think this and why you should start testing your code with it as well.

Example 1: Level Beginner

Suppose we have the following function defined in the builder.py file.

NO_MAT_CODE = -1  # placeholder for a missing mat_code (value taken from the comment below)

def fix_new_boxes(raw_prog):
    return (
        raw_prog
        # using -1 as placeholder for mat_code in new boxes
        .assign(mat_code=lambda df:
            df.mat_code.fillna(NO_MAT_CODE).astype("int64")
        )
        .sort_values("prog_start")
    )

Type annotations are not yet ready for pandas, but we can infer that it receives a dataframe with at least two columns: mat_code and prog_start. What it does is fill the mat_codes that do not have a value with NO_MAT_CODE. Then it sorts the dataframe by prog_start, which is a date by the way.
So let’s write some tests for it.

Example 1 Test 1

import pandas as pd

from hypothesis import given
from hypothesis import strategies
from hypothesis.extra.pandas import column, data_frames

import builder

@given(
    data_frames(columns=[
        column(
            name='prog_start',
            elements=strategies.datetimes(
                min_value=pd.Timestamp(2017, 1, 1),
                max_value=pd.Timestamp(2019, 1, 1)
            ),
            unique=True
        ),
        column(
            name='mat_code',
            elements=strategies.just(float('nan'))
        )
    ])
)
def test_fix_new_boxes_nan_replaced(raw_prog):
    prog = builder.fix_new_boxes(raw_prog)
    assert (prog.mat_code == builder.NO_MAT_CODE).all()
    assert prog.shape == raw_prog.shape

Hey Nico, haven’t you just said this was “beginner level”?
It seems a lot of code! Don’t worry, let’s decompose it little by little.

The “given” decorator accepts strategies and … wait, what is a strategy?
Fair enough. Suppose you want to generate datetimes: there are many ways to do that, and each of those is called a strategy. Hypothesis provides one strategy for this called “datetimes”, and it lets you define a min_value and a max_value as you see above.
Let’s look at a few examples from the terminal.
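A minimal sketch of such a session, assuming the same bounds as in the test above. In an interactive terminal, calling .example() on the strategy shows a single draw. The script below instead collects draws through a given-decorated helper (collect_prog_starts is a name made up for this sketch) so that it also runs outside a REPL.

```python
from datetime import datetime

from hypothesis import given, strategies as st

LOWER = datetime(2017, 1, 1)
UPPER = datetime(2019, 1, 1)
prog_starts = st.datetimes(min_value=LOWER, max_value=UPPER)

samples = []


@given(prog_starts)
def collect_prog_starts(value):
    # Every generated datetime respects the bounds given to the strategy.
    assert LOWER <= value <= UPPER
    samples.append(value)


collect_prog_starts()
print(samples[:3])  # a few generated datetimes; they differ between runs
```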

Cool, huh? Every time it is asked for an example, it returns a datetime within the given bounds.

In the Test 1 example we can see that these strategies are composable, which allows us to create more complex strategies.
In particular, the “data_frames” strategy is built from “column” definitions, one of which uses the “datetimes” strategy we saw before.
Let’s go to the terminal and see some similar examples using “integers”.
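A minimal sketch of that composition, assuming two made-up integer columns named a and b with values between 0 and 10. The helper collect_frames exists only for this illustration.

```python
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import column, data_frames

small_ints = st.integers(min_value=0, max_value=10)
frames = data_frames(columns=[
    column(name="a", elements=small_ints, dtype="int64"),
    column(name="b", elements=small_ints, dtype="int64"),
])

row_counts = []


@given(frames)
def collect_frames(df):
    # The column layout is fixed; only the number of rows and the values vary.
    assert list(df.columns) == ["a", "b"]
    # This check also passes for a frame with zero rows.
    assert ((df >= 0) & (df <= 10)).all().all()
    row_counts.append(len(df))


collect_frames()
print(sorted(set(row_counts)))  # the row counts that were generated in this run
```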

This is getting better: look how we compose strategies and how an empty dataframe is a valid example.

With these examples from the terminal we are able to completely understand what data we are generating for the test, except for the “just” strategy, which I haven’t mentioned before. It is very simple: it always returns the value passed to it.
There are other strategies provided by the library, such as “characters”, “booleans” and “lists”, but we won’t tackle them in this post.

But let’s go back to the “given” decorator. As I was saying, it receives strategies and uses them to generate the data that will be used to test the invariant.

There should be one invariant we are testing here, can you think of it? While you think let me show you a pandas image so you get some inspiration.

Awww, it is waving at Hypothesis!!!

The invariant is: “no mat_codes are left as NaNs, they are replaced by NO_MAT_CODE”.
I also added an assertion about the shape of both dataframes to make sure they are actually being replaced and not being filtered out.

Example 1 Test 2

import pandas as pd

from hypothesis import given
from hypothesis import strategies
from hypothesis.extra.pandas import column, data_frames

import builder

@given(
    data_frames(columns=[
        column(
            name='prog_start',
            elements=strategies.datetimes(
                min_value=pd.Timestamp(2017, 1, 1),
                max_value=pd.Timestamp(2019, 1, 1)
            ),
            unique=True
        ),
        column(
            name='mat_code',
            elements=strategies.one_of(
                strategies.just(float('nan')),
                strategies.integers(min_value=100)
            )
        )
    ])
)
def test_fix_new_boxes(raw_prog):
    prog = builder.fix_new_boxes(raw_prog)
    assert prog.mat_code.notna().all()
    assert pd.Index(prog.prog_start).is_monotonic_increasing

Can you tell which is the invariant we are testing? Again, let me show you an image for inspiration.

Here one of them is telling the other that the question was trickier.

We are actually testing two invariants:
1) No NaN values should be present.
2) It must be sorted by prog_start in monotonically increasing order.
This test should be split in two, because we should test one invariant at a time to have exactly one point of failure.

Have you understood how the “one_of” strategy works? If not, the following examples will make it clear.

It receives several strategies and, for each example, returns a value generated by one of them.
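A minimal sketch of that behaviour, reusing the two alternatives from the test above. The helper collect_mat_codes is a name made up for this illustration.

```python
import math

from hypothesis import given, strategies as st

mat_codes = st.one_of(
    st.just(float("nan")),        # a missing material code
    st.integers(min_value=100),   # a real material code
)

kinds = set()


@given(mat_codes)
def collect_mat_codes(value):
    # Each draw comes from exactly one of the two alternatives.
    if isinstance(value, float):
        assert math.isnan(value)
        kinds.add("nan")
    else:
        assert isinstance(value, int) and value >= 100
        kinds.add("int")


collect_mat_codes()
print(kinds)  # which alternatives were picked during this run
```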

Example 2: Level Intermediate

Suppose we have the following function defined in the builder.py file.

import pandas as pd

def add_dayofweek_dummies(data):
    return (
        data
        .assign(dayofweek=lambda df:
            pd.Categorical(
                # Series.dt.weekday_name was removed in pandas 1.0;
                # day_name() returns the same labels
                df.prog_start.dt.day_name(),
                categories=["Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday", "Sunday"])
        )
        .assign(is_weekend=lambda df:
            df.dayofweek.isin(["Saturday", "Sunday"])
        )
        .pipe(pd.get_dummies, columns=["dayofweek"])
    )

First of all, what does this function do? It takes a dataframe with a “prog_start” column, adds a new column called “dayofweek” with the corresponding day name, adds another column called “is_weekend” which is True or False depending on the day name, and finally performs one hot encoding on the “dayofweek” column.

How do we test this, Nico? Good question! Let’s see…

Example 2 Test 1

import pandas as pd

from hypothesis import given
from hypothesis import strategies
from hypothesis.extra.pandas import series, range_indexes

import builder

def assert_is_day(df, i, day):
    assert df.loc[i, day] == 1
    assert (df.loc[i, ~df.columns.isin(
        ['prog_start', 'is_weekend', day]
    )] == 0).all()

@given(
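    series(
        # elements is passed by keyword, which also works on Hypothesis
        # releases that reject positional arguments here
        elements=strategies.datetimes(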
            min_value=pd.Timestamp(2017, 1, 1),
            max_value=pd.Timestamp(2020, 1, 1)
        ),
        index=range_indexes(min_size=7),
        unique=True
    )
    .map(lambda s: s.to_frame('prog_start'))
)
def test_add_dayofweek_dummies_is_day(data):
    iso_to_day_asserts = {
       1: lambda df, i: assert_is_day(df, i, 'dayofweek_Monday'),
       2: lambda df, i: assert_is_day(df, i, 'dayofweek_Tuesday'),
       3: lambda df, i: assert_is_day(df, i, 'dayofweek_Wednesday'),
       4: lambda df, i: assert_is_day(df, i, 'dayofweek_Thursday'),
       5: lambda df, i: assert_is_day(df, i, 'dayofweek_Friday'),
       6: lambda df, i: assert_is_day(df, i, 'dayofweek_Saturday'),
       7: lambda df, i: assert_is_day(df, i, 'dayofweek_Sunday')
    }
    
    dayofweek_dummies = builder.add_dayofweek_dummies(data)

    for i in dayofweek_dummies.index:
        # look up the check for this row's ISO weekday and run it
        iso_weekday = dayofweek_dummies.loc[i, 'prog_start'].isoweekday()
        iso_to_day_asserts[iso_weekday](dayofweek_dummies, i)

Here we are testing that the labels of the day are correctly set and that the one hot encoding was done right. I won’t go into much detail about how the test works, but I will explain the new stuff.

We have a “series” and a “range_indexes” strategy and a “map” function.
The “series” strategy lets us create series whose elements come from a given strategy, in this case datetimes.
The “range_indexes” strategy lets us create indexes. We used it here because we don’t want series with fewer than seven elements.

Some examples of range_indexes strategy and series strategy.
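A minimal sketch of the two strategies working together, assuming small integer elements instead of datetimes to keep the output readable. The helper collect_series and the max_size of 10 are choices made only for this illustration.

```python
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import range_indexes, series

week_or_longer = series(
    elements=st.integers(min_value=0, max_value=5),
    dtype="int64",
    index=range_indexes(min_size=7, max_size=10),
)

lengths = []


@given(week_or_longer)
def collect_series(s):
    # range_indexes(min_size=7) guarantees at least seven rows...
    assert len(s) >= 7
    # ...labelled 0, 1, 2, ... like a default pandas index.
    assert list(s.index) == list(range(len(s)))
    lengths.append(len(s))


collect_series()
print(sorted(set(lengths)))  # the series lengths generated in this run
```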

And we have the “map” function, which lets us transform what was generated before it reaches the test, according to:

s.map(f).example() == f(s.example())

So here we let Hypothesis generate a bunch of series and then we are mapping them to dataframes. Isn’t it awesome?
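A minimal sketch of the same mapping step on a smaller scale, assuming integer elements and a made-up column name, value. The helper check_frames exists only for this illustration.

```python
import pandas as pd
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import range_indexes, series

values = series(
    elements=st.integers(min_value=0, max_value=5),
    dtype="int64",
    index=range_indexes(min_size=1, max_size=5),
)
# map() wraps the strategy: every generated Series is turned into a DataFrame
# before the test function ever sees it.
frames = values.map(lambda s: s.to_frame("value"))

seen_shapes = []


@given(frames)
def check_frames(df):
    assert isinstance(df, pd.DataFrame)
    assert list(df.columns) == ["value"]
    seen_shapes.append(df.shape)


check_frames()
```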

Example 2 Test 2

Here, we should test the invariant that the weekend days are correctly assigned, I’ll leave that as homework for the reader.

THIS LINE INTENTIONALLY LEFT BLANK

Example 3: Level Advanced

Suppose we have the following function defined in the builder.py file.

def add_top_programs_dummies(df):
    top_programs = (
        df[df.selected].prog_name.value_counts()
        [lambda s: s >= 50].index
    )

    return (
        df
        .assign(prog=lambda df: 
            df.prog_name.where(lambda s: s.isin(top_programs), None)
        )
        .pipe(pd.get_dummies, columns=["prog"])
    )

This function receives a dataframe that contains information about different programs in a period of time and when they were selected.
It takes the ones that were selected at least fifty times, defines them as top programs and performs one hot encoding over them. One may think this example is the same as the one before, but it isn’t. The dataframes we will be generating need to be constructed in a more specific way.

For this, we will need the help of a new and more powerful strategy.
Meet the “composite”.

But first, one last inspiring image!

Pandas composition :)

Example 3 Test

import pandas as pd

from hypothesis import given
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

import builder

@st.composite
def prog_generator(draw, top_threshold):
    top1 = draw(data_frames(columns=[
        column(name='prog_name', elements=st.just("TOP1")),
        column(name='selected', elements=st.just(True))
    ], index=range_indexes(min_size=top_threshold)))

    top2 = draw(data_frames(columns=[
        column(name='prog_name', elements=st.just("TOP2")),
        column(name='selected', elements=st.just(True))
    ], index=range_indexes(min_size=top_threshold)))

    notop = draw(data_frames(columns=[
        column(name='prog_name', elements=st.text(
            alphabet=['a', 'b', 'c', 'd'], min_size=2)),
        column(name='selected', elements=st.just(True))
    ], index=range_indexes(max_size=top_threshold - 1)))

    return pd.concat([top1, top2, notop])

@given(prog_generator(top_threshold=50))
def test_get_prog_dummies_top_become_dummies(prog):
    dummies = builder.add_top_programs_dummies(prog)
    
    assert (
        dummies[dummies.prog_name == "TOP1"].prog_TOP1 == 1
    ).all()

    assert (
        dummies[dummies.prog_name == "TOP1"].prog_TOP2 == 0
    ).all()

    assert (
        dummies[dummies.prog_name == "TOP2"].prog_TOP1 == 0
    ).all()

    assert (
        dummies[dummies.prog_name == "TOP2"].prog_TOP2 == 1
    ).all()
    
    # no dummies for no top progs
    assert dummies.shape[1] == 4

According to the documentation: “the composite decorator lets you combine other strategies in more or less arbitrary ways. It’s probably the main thing you’ll want to use for complicated custom strategies.” which is precisely what we want to do!!!

We need to create a single dataframe containing at least top_threshold rows for each of TOP1 and TOP2, plus other programs with random names that add up to at most top_threshold minus one rows, so none of them can reach the threshold.

And that is exactly what we are doing thanks to the draw function.
The draw function is always passed as the first argument of the composite and should be thought of as a function that returns one example of the strategy it was invoked with.
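A minimal sketch of draw in isolation, using a hypothetical ordered_pairs strategy that is unrelated to the programs dataframe. The second draw depends on the value returned by the first one, which is the kind of dependency composite makes easy to express.

```python
from hypothesis import given, strategies as st


@st.composite
def ordered_pairs(draw, upper=100):
    # draw() turns a strategy into one concrete value...
    low = draw(st.integers(min_value=0, max_value=upper))
    # ...which can then shape the next strategy we draw from.
    high = draw(st.integers(min_value=low, max_value=upper))
    return low, high


pairs = []


@given(ordered_pairs(upper=50))
def check_ordered_pairs(pair):
    low, high = pair
    assert 0 <= low <= high <= 50
    pairs.append(pair)


check_ordered_pairs()
```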

There is much more of Hypothesis out there but this is all for now.
I hope you have enjoyed reading this article as much as I have enjoyed writing it. Hopefully the examples were clear enough to take you right away to the Hypothesis site to dive deeper and start using it.

Thanks for reading and stay tuned!

Testing Pandas was originally published in Hexacta Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Chocolatest, a sweet story of yet another testing framework.

Nico Gallinal — Tue, 23 Oct 2018 21:33:20 GMT

a preview of an exercise after checking the solution

Weeks ago, Hexacta asked me to develop an application to challenge computer-engineering students. This application was going to be used at a convention where Hexacta would have a stand. There we’d talk about ourselves and get in touch with young people who want to do cool stuff like us :)

The idea was simple: “let’s have a web application where they can tackle some programming challenges”. The first thing that I thought of was “it would be cool if they are presented with different functions to implement, but it would be cooler if they are given some sample tests to explain how the function should work besides the specification. If we could make it so that they can add their tests and run them, that would be very neat. And what about giving them a score running some hidden tests for each exercise, that would definitely be very nice…”

I was up to the challenge and then asked how much time I had… Only a week!

It was something really cool to build, so I told myself that I would do my best to make it happen. There were no guarantees and not much time, so I began.

I started, as always, by googling around to see if there was something I could make good use of. I just had to run JavaScript tests in the browser, but these were not the kind of tests where you have functions and assertions. These were tests where your implementation is a string and your tests are strings too!!!

It is very likely that you are now asking yourself: what the heck is Nico talking about?! Do you remember that I said that the user would have to implement a function? Maybe using an editor such as Monaco (the one that powers Visual Studio Code)? Do you also remember that I said that the user could add more tests to the samples? Strings and strings. Now you are following me; I feel better.

I had to implement a testing framework in one week that would evaluate JavaScript as strings and also make the application. Clearly there was no time to waste!!!

I had experience with Node’s vm module and I said “I need something like that”. https://github.com/browserify/vm-browserify was the answer.

Then I said, I would need some assertions but I don’t want to reinvent the wheel. Maybe I can find something like Node’s assert module, right? And again I was lucky to find https://github.com/browserify/commonjs-assert.

I should really thank the Browserify team for the only two dependencies Chocolatest has; without them I would have run out of time!

Now that I had the building blocks, I needed a way of intercepting the calls to the different assert methods and create a log of what was going on during the execution of the tests.

The word intercepting made me think about Aspect Oriented Programming but we all know we do not have all those fancy things in JavaScript! What we do have is meta-programming thanks to the Proxy object.

Most of the magic happens in the following lines.

import { Assertion } from './types';
import * as assert from 'assert';

type Omit<T, K extends keyof T> = Pick<T, Exclude<keyof T, K>>;

const applyWrapper = (operator: string, method: (...args: any[]) => any, thisArg: any, args: any[], log: (assertion: Assertion) => void) => {
  try {
    let result = method.apply(thisArg, args);
    log({'ok': true, 'operator': operator, 'args': args, 'type': 'assert'});
    return result;
  } catch (e) {
    log({'ok': false, 'operator': operator, 'args': args, 'type': 'assert'});
    // we don't want to fast fail, we want to run all asserts in test
    // throw e; 
  }
}

const generateProxy = (log: (assertion: Assertion) => void) => {
  return new Proxy(assert,
  {
    apply(target, thisArg, args) {
      return applyWrapper('assert', target, thisArg, args, log);
    },
    get(target, propKey: keyof Omit) {
      const origMethod = target[propKey];
      return function (...args: any[]) {
        return applyWrapper(propKey, origMethod, target, args, log);
      };
    }
  });
}

export {
  generateProxy
}

First of all, you will notice that the code is written in TypeScript. If you haven’t tried it yet I suggest you do it.

Secondly, I’m sure you noticed the creation of a Proxy of the library assert, where it says “new Proxy(assert, …” and again you probably said what the heck Nico?!

Basically, I’m intercepting all the calls to the methods of assert (even assert itself, which is a function too) and calling the original method in isolation inside a try/catch block. With this interception I’m able to log when the method is about to be called, when it threw an exception and when it executed successfully.

The “generateProxy” function is called with a log as an argument. This is a collector function that gathers all the events described above and some more.

Now that I have explained the framework, here is the link in case you want to play with it: https://github.com/nicoabie/chocolatest. Let me also tell you what happened to the application.

The application was developed with Vue.js, using the Monaco editor to provide a good coding experience, and had some exercises ready in time for the convention. The convention started at 9 AM and the last commit was at about 9:20 AM. I will not brag about its success because in the end no student got to use it, hahaha! There were so many people walking around that it wasn’t easy to stop at a stand, and apparently they were just starting university, so unfortunately they were a little bit shy.

Anyway, at Hexacta we are thinking about publishing it on our website, so you can be challenged with some exercises and test your level!!!

Thanks for reading and stay tuned!

P.S. I recently found out that around 500 modules are published each day on npm according to http://www.modulecounts.com/. I just want you to know that before I started to code I searched for quite a bit if such a framework existed to avoid polluting our environment further hahaha.


Part 2: Install and Setup a 3-node Hadoop Cluster Cloudera

Mariano N. Lirussi — Fri, 21 Sep 2018 15:17:00 GMT

Let’s resume our installation of a Hadoop cluster. First, log in to the site we configured in the first part of the installation: http://master.hexacta.com:7180

user: admin

pass: admin

It is important to read and accept the Cloudera Manager License.

Now we arrive at the point where we must select the edition of Cloudera that we want. For the purpose of this document, we are going to select the option “Cloudera Express”.

Then we are shown some information about the versions and requirements.

The time has come for us to specify the hosts that we are going to configure as nodes for our cluster.

This specification has a relation with the configuration of /etc/hosts that we saw in the first part of the installation.

This is why we are only going to put the IPs of the Master and the nodes.

Now we will see if the hosts are ready for installation and running as needed. If all is well, we continue with the configuration of the nodes.

In the following step, we are going to select the repository method and any additional parcels we need to add to those already in the CDH suite. By default, we select the option ‘Use Parcels’.

To continue we have to read and accept the license of Oracle JDK to be able to install and use it.

Next, we have the option to enable ‘single user mode’. In this case, we leave it as it is without enabling.

We have to configure the user and its credentials for the automatic management of the master and the nodes. Let’s configure it with a user other than root. Select ‘another user’ and enter the user ‘hadoop’.

Select ‘all host accept same private key’ and load the private key file (id_rsa, the counterpart of id_rsa.pub) of the hadoop user that we created on the master in the first part of the installation. We enter the passphrase if we configured one.

If we don’t want to use the SSH key file, we also have the option ‘All hosts accept same password’, in which case we complete the password field. Clearly, the hadoop user must have the same password on all nodes.

The other parameters are left with their default values unless we have to change them.

In the next screen, we see the installation progress of the agents.

Once the installation on the cluster nodes completes successfully, we continue to the next step.

Now we watch the parcels being installed on each node.

If everything went OK, we are doing very well. Next, the installation on the nodes is inspected; this process might take some time. After this, we will have successfully finished the installation of our cluster on all the machines where we plan to run it.

Cluster Setup:

Great, now we are going to start with the Cluster Service Setup. Here we can select the option that best suits our needs. We can select if we need some extra service besides those from Core Hadoop.

Once we know what services we are going to run in our cluster, we can configure how they are assigned to our nodes. There are several ways to set these preferences, but this time we are going to skip them, since they can be configured later from the Cloudera Manager options.

Awesome!

Now we must configure and test the connection to the database. Some Hadoop Core services (like Hive) need a PostgreSQL database to run. There are two ways to run the database:

An external PostgreSQL: this option is the right one for a Hadoop cluster installation in production.
The ‘Embedded Database’ option, which is the one we chose in this guide so that we can continue without additional resources.

After selecting the Embedded Database, we test the connection to PostgreSQL. If it succeeds, we are ready to continue.

We are almost finished, but we still have to check and customize any additional parameters, such as the HDFS block size or the failure tolerance of the volumes, among others.

In general, most of these options can be configured from the system already installed, so we will continue without making changes.

Very well done!

We arrived at the process of installation and execution of each of the services we decided to install and with the preferences we selected.

If we get to this point, it means that we already have our Hadoop cluster installed, configured and running successfully.

Congratulations! On the dashboard screen, we have our Hadoop cluster with Cloudera Manager installed and ready to work. You may get some notifications, such as one about the embedded PostgreSQL database that we selected during installation. For each notification, the system offers a description and documentation on how to address the problem.

Good luck and enjoy the Hadoop cluster.

See you next time.


Part 1: How To install a 3-Node Hadoop Cluster on Ubuntu 16

Mariano N. Lirussi — Tue, 21 Aug 2018 19:27:49 GMT

This guide is intended to provide a quick and easy way to install Cloudera Hadoop on local servers.

For this reason, we need to prepare the infrastructure environments prior to installation.

We assume that you have some basic knowledge of networking, Linux administration and Apache Hadoop.

1.-Requirements

We recommend complying with the following minimum hardware requirements:

Master: CPU x6 core — 12Gb Mem — 80Gb HD

Node: CPU x4 core — 4Gb Mem — 80Gb HD

In this case, we configured the network at each node as follows:

master.hexacta.com 10.0.5.1

node1.hexacta.com 10.0.5.2

node2.hexacta.com 10.0.5.3

Once the network is ready, we make sure we have the latest updates on each node.

-:# apt-get update && apt-get upgrade -y
-:# apt install ssh rsync

2.- Config Hosts

The configuration of host names and their relationship to the corresponding IP addresses is a very important point to consider. These IP addresses will be used in the /etc/hosts file on all nodes.

In our case, we have it this way:

10.0.5.1 master.hexacta.com

10.0.5.2 node1.hexacta.com

10.0.5.3 node2.hexacta.com

127.0.0.1 localhost

127.0.1.1 localhost

3.- Create user and ssh-key

Now we must create the hadoop user on all nodes with sudo permissions. In addition, sudo should not ask this user for a password. First, we create the user:

-:# adduser hadoop

We add the user to the group with sudo permissions.

-:# adduser hadoop sudo

Then, so that no password is required, we must edit the sudo configuration with the command:

-:# visudo

Within the configuration, we must add a line for the hadoop user below the %sudo entry, as shown here:

# Allow members of group sudo to execute any command
 %sudo ALL=(ALL:ALL) ALL
 hadoop ALL=(ALL) NOPASSWD: ALL

Now it is necessary to create the SSH key for the hadoop user so that the Master node can manage the nodes remotely and securely.

In the Master with the hadoop user session, we generate the key with the following command:

-:# ssh-keygen -b 4096

Then we copy the public key to the master and the nodes we want to install.

-:# ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@master.hexacta.com

-:# ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node1.hexacta.com

-:# ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@node2.hexacta.com

4.- Swappiness

To avoid swappiness errors or warnings we will customize a kernel parameter.

We change it in the current execution with the command

-:# sysctl vm.swappiness=10

To persist this change across restarts, we append the configuration line to the end of /etc/sysctl.conf:

-:# echo 'vm.swappiness = 10' >> /etc/sysctl.conf

NOTE: Swappiness is a Linux kernel parameter that controls the relative weight given to swapping out of runtime memory, as opposed to dropping pages from the system page cache.

So far, we have correctly configured the infrastructure for the installation of any Hadoop distribution system.

5.- Install Cloudera

Now let’s install Hadoop-Cloudera-Manager. Cloudera Manager is an administration tool that will help you administer the services on your Hadoop cluster. There is a free version and an Enterprise version. We used the free version to set up the whole cluster.

First, we need to download the installer of the latest version of Cloudera Manager:

-:# wget http://archive.cloudera.com/cm5/installer/5.15.0/cloudera-manager-installer.bin

We have to change the installer permissions to be able to run it.

-:# chmod u+x cloudera-manager-installer.bin

Run the file with sudo to start the installation.

-:# sudo ./cloudera-manager-installer.bin

Cloudera-Manager-README:

This README gives useful details for the subsequent installation of Cloudera Manager, such as the Linux versions it supports. Let’s click on “Next”.

This is the Cloudera Standard License, let’s click on “Next” after reading it.

We accept the license’s terms of use.

We click “Next” to accept the license of the Oracle Java SE Platform.

Accept Oracle License

We wait for the Cloudera Manager Server installation process to complete.

After the installation of Cloudera Manager is finished, we can continue with the second part of Cluster Setup by going to http://master.hexacta.com:7180/ for our example, with the user name: admin and passwd admin.

Installation successfully completed

After the installation is complete, we access the site http://master.hexacta.com:7180/ and continue with the installation and configuration of the Cloudera Manager cluster from that point onward.

In Part 2, we will explain how to set up the Hadoop Cluster with Cloudera Manager.


JSX can do that?

Rodrigo Pombo — Sun, 06 May 2018 14:37:10 GMT

First I’m going to explain how JSX works and then use it in “unusual” ways. If you know how JSX works you can skip the first part. If you are here to learn something useful you can skip the second part.

Last week I tweeted this:

This works
— @pomber

People loved it. You can see it in the replies: “Eww”, “what have we done”, “oh no is XML back again”, “Every day we stray further from god’s light”, “murderer”. I write this post to return the love to them (that, and also I told Lorenzo Palmes I would write this post if the tweet reached 100 likes; thanks for the retweet, Ken Wheeler…).

JSX

If you used React you know JSX, that XML-like syntax for creating React elements:

https://medium.com/media/0187f0f4c1646e17f1ec1dbbd9d5ebbf/href

Because browsers don’t support JSX, your code needs to be changed to normal JavaScript before a browser runs it. This transformation from developer-friendly code to browser-friendly code is done by tools like Babel. Using Babel, the getGreeting function becomes:

https://medium.com/media/2c77b5e23acbcaf5847f9469dd4efe0a/href

All the tag names, attribute names, attribute values, and text content are still there, with a different syntax. But what about React.createElement?

React.createElement is the function that React uses to create elements. Babel injects that function by default because JSX is commonly used together with React, but that doesn't always need to be the case. In fact, JSX is decoupled from React. JSX is a specification for defining tree structures with an XML-like syntax in JS. That tree structure could be the elements rendered by a React component or something entirely different.

In order to use JSX for something different than React, we only need to tell Babel to use another function instead of React.createElement. We do that by adding the comment /** @jsx anotherFunction */ somewhere in the file. For example:

https://medium.com/media/e1fc3e6b030d477e38c08f463d7b1ba6/href

One last thing you need to know is that Babel handles JSX element names differently depending on whether they start with a lowercase or an uppercase letter. Lowercase names are passed as string arguments, like in the last snippet. Capitalized names are passed as identifiers (usually functions), like in the snippet.

JSX for Math

Disclaimer: beyond this point you probably won’t learn anything. I’m going to use JSX for things it shouldn’t be used for.

We can calculate the hypotenuse of a and b with Math.sqrt(a*a + b*b), but that's no fun. Let's write it with JSX:

https://medium.com/media/e29f3facdcfd6477ed4532e61f636019/href
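The same mechanism can be mirrored outside JavaScript. The Python sketch below is only an analogue, not the code from the post: `run` plays the role of the function named in the `@jsx` pragma, and `OPERATIONS` and `Hypotenuse` are hypothetical names chosen for illustration. It assumes the classic call shape Babel emits, `fn(name, props, ...children)`.

```python
import math

# Hypothetical table of primitive operations, looked up by element name.
OPERATIONS = {
    "sum": lambda *xs: sum(xs),
    "square": lambda x: x * x,
    "sqrt": math.sqrt,
}


def run(name, props, *children):
    """Stand-in for the pragma function: called once per element."""
    if callable(name):
        # Capitalized element names arrive as references to functions.
        return name(**(props or {}), children=children)
    # Lowercase element names arrive as plain strings.
    return OPERATIONS[name](*children)


def Hypotenuse(a, b, children=()):
    # Mirrors <sqrt><sum><square>{a}</square><square>{b}</square></sum></sqrt>
    return run("sqrt", None,
               run("sum", None,
                   run("square", None, a),
                   run("square", None, b)))


# Mirrors <Hypotenuse a={3} b={4} />
result = run(Hypotenuse, {"a": 3, "b": 4})
print(result)  # 5.0
```

The inner calls are evaluated first, so by the time `run("sqrt", ...)` executes, its only child is already the number 25.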

There’s also a version of hypotenuse that can receive more than two arguments:

https://medium.com/media/374700da073b0e8647a35ec17f7dcfe4/href

JSX for Everything

Let’s try something more ambitious, let’s try merge sort.

We’ll use RamdaJS to have some primitives for our components. This is how FizzBuzz looks in ramda:

https://medium.com/media/633a27a903d3876b396358e10729c9f1/href

We can write the same code using JSX, we only need to write the function that calls the ramda function that matches the JSX element name:

https://medium.com/media/393356ba7cb49ea44b51265f2580c127/href

Beautiful, right?… No.

We can make the run function smarter in order to increase the JSX “purity” of the code:

https://medium.com/media/2734dd7e4872c0f528ae1346c238cfb8/href

Now, anything you can do with ramda you can also do with JSX. And you can do everything with ramda, including merge sort:

codepen

Thanks for reading.

Follow @pomber on twitter for more (awful) stuff like this.

JSX can do that? was originally published in Hexacta Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Reactive Programming basics

Emanuel — Wed, 14 Sep 2016 00:00:00 GMT

As its name indicates, reactive programming is oriented to reaction, to the data flow and to the principle of causality, meaning that each cause is connected to its effects.

This is a paradigm, meaning that most problems that can be solved with reactive programming can also be solved by other types of programming; object-oriented, procedural, functional, etc. The most important thing is to recognize which approach is the most appropriate, because this decision impacts the elegance and the quality of the resolution obtained.

Perhaps the best-known example is the spreadsheet, where modifying a cell (the event) triggers the subsequent update of all the cells that were watching it. “Watching” is one of the keywords here, because the topic is easier to understand if we know the GoF Observer pattern.

A little theory

The Reactive Manifesto was rewritten at the end of 2014 and, according to it, reactive systems have four features:

Responsiveness: they focus on fast and consistent response times. This simplifies error handling and encourages interaction with the user.
Resilience: the system remains responsive even in the presence of failures. To achieve this, failures should be isolated and contained within components, which should be able to recover without compromising the integrity of the system.
Elasticity: they adapt to variations in workload, allocating and freeing resources dynamically, and they are designed so that their components do not form bottlenecks.
Message orientation: they rely on the exchange of asynchronous messages. There is no blocking communication.

Elasticity, failures and messages are points to consider here.

Elasticity differs from scalability: the resources are not fixed when the program is defined; they are allocated automatically at runtime (very similar to functional programming).

Failures differ from errors: a failure is unexpected and is not used for flow control. You can disable the part of the system that depends directly on the failing component.

Messages differ from events: in an event-based system, components actively wait for entities to change state. Messages, instead, have an associated recipient that is activated when they are received.

Viewing data flows

To visualize the data, marble diagrams are often used. They show each data stream and the operations that can be applied to it:

These diagrams are read from left to right, and each ball represents the occurrence of an event that is being observed. The rectangles in between indicate the operation applied to each one. For example, the image includes the following:

• Map: applies a transformation, passed as a parameter, to the emissions of a data stream.

• FlatMap: applies an asynchronous operation to each emission of a stream, and then flattens the resulting emissions into a single observable.

• Concat: concatenates the emissions of two or more streams without interleaving them (note that each stream must “finish” before the next one can start).

• Merge: combines multiple streams and interleaves their emissions.
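The four operators can be approximated without any reactive library. The sketch below is a simplification under stated assumptions, not how ReactiveX implements them: a finite stream is modelled as a list of `(time, value)` pairs sorted by time, and the function names (`map_stream`, `flat_map`, `concat_streams`, `merge_streams`) are hypothetical.

```python
from heapq import merge as heap_merge


def map_stream(fn, stream):
    # Apply fn to every emitted value, keeping emission times unchanged.
    return [(t, fn(v)) for t, v in stream]


def merge_streams(*streams):
    # Interleave emissions from all streams in time order.
    return list(heap_merge(*streams, key=lambda emission: emission[0]))


def concat_streams(first, second):
    # Assumption: `first` completes at its last emission, so the
    # emissions of `second` are shifted to start after that moment.
    offset = first[-1][0] if first else 0
    return first + [(t + offset, v) for t, v in second]


def flat_map(fn, stream):
    # fn turns each value into an inner stream whose times are relative
    # to the moment the value was emitted; the inner streams are merged.
    inner = [[(t + dt, v) for dt, v in fn(value)] for t, value in stream]
    return merge_streams(*inner)


a = [(1, "a1"), (3, "a2"), (5, "a3")]
b = [(2, "b1"), (4, "b2")]

upper = map_stream(str.upper, a)
merged = merge_streams(a, b)
concatenated = concat_streams(a, b)
flattened = flat_map(lambda v: [(0, v), (1, v + "!")], b)
```

With these inputs, `merged` alternates between the two streams, while `concatenated` emits everything from `a` before anything from `b`.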

All these diagrams can be displayed interactively on this website. Can you guess what these two operations do? Operation1, Operation2.

Perhaps the trickiest part is not reactive programming itself, but how difficult it is to think about “how” to deal with common problems in this way. For example, if I need an action to be triggered when a user performs multiple clicks on an item, the communication flow should be approximately as follows:

As you can see, there are no temporary variables, buffers, promises, active checks or other traditional programming resources. Only flows, emissions and four lines of code.
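The multi-click flow can be sketched over a recorded list of click timestamps. This is an offline approximation, not the reactive code the post refers to: the 250 ms quiet period and the function names are assumptions made for the example. Clicks are grouped whenever the gap between them stays within the quiet period, each group is mapped to its size, and groups with fewer than two clicks are filtered out.

```python
def group_clicks(timestamps_ms, quiet_ms=250):
    """Split sorted click timestamps into bursts separated by silence."""
    groups, current = [], []
    for t in timestamps_ms:
        # A gap longer than quiet_ms closes the current burst.
        if current and t - current[-1] > quiet_ms:
            groups.append(current)
            current = []
        current.append(t)
    if current:
        groups.append(current)
    return groups


def multi_clicks(timestamps_ms, quiet_ms=250):
    """Return the size of every burst that contains two or more clicks."""
    sizes = map(len, group_clicks(timestamps_ms, quiet_ms))
    return [n for n in sizes if n >= 2]


# Made-up timestamps: a double click, a single click, a triple click.
clicks = [0, 100, 1000, 2000, 2100, 2200]
bursts = multi_clicks(clicks)
print(bursts)  # [2, 3]
```

In a reactive library the same three steps (buffer, map, filter) would be chained on a live stream instead of a list.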

Environments and languages to use reactive programming

There are specifically reactive languages such as Elm and R. Most frameworks with a reactive part are available for .NET and JavaScript. In JavaScript we can find MeteorJS, ProActJS, BaconJS and React.

Perhaps the main exponent is Reactive Extensions, also known as ReactiveX or simply Rx (all names refer to the same thing). It is an API for asynchronous and reactive programming, available for Java, JavaScript (+ Angular), C# (+ Unity), Ruby, Python, PHP and others. In total there are 402 observable operations, each with its corresponding marble diagram.

Here is an example of integration with Angular, when searching for a page on Wikipedia:

https://medium.com/media/1f2542e2c8c57898458a68e7c2b93452/href

Reactive Programming basics was originally published in Hexacta Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Machine Learning — Classification

Emanuel — Wed, 05 Oct 2016 00:00:00 GMT

In this article, written by one of our software engineering experts from the HAT, learn about Machine Learning and classification algorithms.

Classification algorithms predict the class or category of a single instance of data. For example, email filters use binary classification to determine if an email is spam. There are two forms of classification tasks. The first is binary classification, where the goal is to predict one of two outcomes.

The other is multiclass classification, where the goal is to predict one of many outcomes. The output of a classification algorithm is called a classifier, which can be used to predict the label of a new (unlabeled) instance.

These are supervised learning algorithms, which make predictions based on a set of examples. For instance, historical stock prices can be used to hazard guesses at future prices. Each example used for training is labeled with the value of interest.

Let’s start with the question: Is this A or B?

This family of algorithms is called two-class classification.
It’s useful for any question that has just two possible answers. There are several algorithms for this kind of question. The next image represents a two-class support vector machine, one of the most popular.

Is this A or B or C or D, etc.?

This is called multiclass classification and it’s useful when you have several — or several thousand — possible answers. Multiclass classification chooses the most likely one.
The next image represents a one-vs-all classification.

AzureML algorithms for Classification

The category Initialize Classification Model includes the following modules:
To see the complete documentation of each one go here!

Example

1- Selection of data set
We use the Adult Census Income Binary Classification dataset.
The column income is the label.

2- Using select column and Split data
We select the columns that we know will be most useful for the prediction, and then we split the data to train and score the model.

3- Using the Two-Class Boosted Decision Tree
We drop the classification model onto the canvas and leave the default parameter values.

4- Score and evaluate model
Run the experiment to check the Score Model and Evaluate Model results.
The two rightmost columns, Scored Labels and Scored Probabilities, are the prediction results. The Scored Probabilities column shows the probability that the instance belongs to the positive class (in this case “> 50K”).

For more documentation on how to interpret model results, see here.
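The relationship between the two result columns, and the kind of numbers an evaluation step reports, can be sketched in plain Python. This is not AzureML code: the 0.5 threshold, the function names and the sample values are assumptions made for illustration.

```python
def score(probabilities, threshold=0.5):
    """Turn scored probabilities into scored labels (True = positive class)."""
    return [p >= threshold for p in probabilities]


def evaluate(actual, predicted):
    """Compute basic binary-classification metrics from two label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives
    tn = sum(1 for a, p in pairs if not a and not p)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up example: True means income "> 50K".
actual_labels = [True, False, True, False, True]
scored_probabilities = [0.9, 0.4, 0.3, 0.7, 0.8]

scored_labels = score(scored_probabilities)
metrics = evaluate(actual_labels, scored_labels)
```

Raising the threshold makes the positive label harder to assign, which usually trades recall for precision.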

5- Finished experiment
The image below represents the entire experiment, ready to run.

Machine Learning — Classification was originally published in Hexacta Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

TensorFlow: The open library for deep learning

Emanuel — Tue, 29 Mar 2016 00:00:00 GMT

After AlphaGo became the first AI to beat a professional player at the game of Go, and having also beaten the world champion in recent days, it seems that the topic of “artificial neural networks” has regained relevance.

And yes, the software used by Google for most artificial intelligence needs was released a few months ago under the Apache 2.0 license and is now available for download and use by anyone (students, researchers, hackers, engineers, developers and many others).

Its name is TensorFlow; it is programmed through a Python or C/C++ interface and can run in various environments: on multiple CPUs, on graphics cards (GPUs), on cloud servers or on mobile devices with Android and iOS.

The intention of this post is not to go into its code, but rather to give a theoretical introduction to a topic so deep that the expression deep learning falls short of describing it. If you are looking for code, you can always consult TensorFlow’s quick guide on the official website.

Artificial learning?

Artificial learning! One of the most novel and unexplored areas of computer science. However, there are already other software products that can do this sort of thing: Caffe, Deeplearning4j, OpenNN and Torch. In any case, if you still do not know where we stand, here is a list of concepts to help us get into the topic:

Artificial neural network: a computing paradigm that attempts to solve problems from a different approach. These networks are usually made up of many input nodes, one or more layers of intermediate nodes and multiple output nodes. To keep things simple, let’s say that each input node can emit a value between 0 and 1, multiplied by a variable weight “w”. Finally, a threshold “b” is defined for each output node, and if the sum of all its inputs exceeds that threshold, that output is classified as positive.

Genetic algorithm: an algorithm that uses feedback from its results to improve its design. This part is key, since in artificial neural networks there are variables that are at the mercy of the algorithm: the “w” weights.
Deep learning: the process by which computers learn to perform a task: given a set of examples that are defined as true, they refine their neural network so that it can then generalize and be applied to new situations. As can be seen, the programmer is not the one who defines how the circuit is formed; he or she provides the mechanisms needed to evolve it toward its optimal form through many attempts.
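The simplified output node described in the first concept above can be written in a few lines. This is a sketch of that simplification only, not TensorFlow code; the function name and the sample weights are made up for illustration.

```python
def output_is_positive(inputs, weights, threshold):
    """Weighted sum of the inputs compared against the threshold "b"."""
    # Each input (between 0 and 1) is multiplied by its weight "w".
    total = sum(x * w for x, w in zip(inputs, weights))
    return total > threshold


inputs = [1.0, 0.5, 0.0]
weights = [0.6, 0.4, 0.9]

# Weighted sum is 0.6 + 0.2 + 0.0 = 0.8.
fires_low = output_is_positive(inputs, weights, threshold=0.7)
fires_high = output_is_positive(inputs, weights, threshold=0.9)
print(fires_low, fires_high)  # True False
```

Training, in this picture, means searching for the weights that make the node fire for the right inputs.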

Where can I find TensorFlow?

While Google has used deep learning almost since its beginning, with technologies such as the Prediction API and DistBelief (TensorFlow’s previous generation), this renowned software library is now used in many popular applications (where the integration is so natural that we forget it exists).

Here are some examples:

User behavior: RankBrain has been one of the ways Google ranks its search results since October 2015. It can learn which sites are relevant to the user depending on which links the user clicks, but it does not do so algorithmically; instead, it learns by adjusting its neural network.
Speech recognition and natural language processing: voice-to-text conversion is learned from thousands of spoken samples with their transcripts.
Translator: the translator learns languages from hundreds of texts with official translations. Sometimes it uses users’ suggestions to improve its work.
Predictive text: when using autocorrect or typing on text keyboards on Android devices, the words that are suggested depend on the user and their vocabulary.
Image recognition: try searching for “red car” in image search. How can there be so many red cars on the Internet? The truth is that most of the images that are presented are not called “red_car.jpg”, but TensorFlow is doing its job by recognizing the “red” and “car” abstractions by looking at the pictures.

How does it all work?

TensorFlow represents information as a multidimensional array (not very surprisingly called a tensor). The available data is loaded into this array, and usually one dimension is reserved for the number of samples used for training. So, if we have, for example, 55,000 handwritten digits to recognize as images of 28×28 pixels, we build an array of shape 55000×28×28.
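The idea of a shape with one dimension reserved for samples can be illustrated with nested Python lists. This is only a sketch, not TensorFlow code: `make_batch` and `shape` are hypothetical helpers, and the batch holds 5 blank images instead of 55,000 to keep it small.

```python
def make_batch(n_samples, height=28, width=28):
    """Build n_samples blank images as nested lists of pixel values."""
    return [[[0.0] * width for _ in range(height)] for _ in range(n_samples)]


def shape(nested):
    """Return the size of each dimension of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(dims)


batch = make_batch(5)  # 5 samples here; the article's example has 55,000
batch_shape = shape(batch)
print(batch_shape)  # (5, 28, 28)
```

The first dimension is the number of samples; the remaining two are the pixel rows and columns of each image.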

The necessary neural layers are then deployed to solve the problem and operations to be performed are determined. All these operations can be viewed using TensorBoard, a tool that provides a visual interface. The more layers, the greater the need for processing; that is why each section of the graph can run on different processing units.
After the AI has been trained, you can obtain very interesting visualizations, such as the ones in the following boxes, which show the evidence for (blue) and against (red) each stroke representing the digit shown:

And here is an example of what the source code would look like:

https://medium.com/media/606985674213d4f5a122e749ac2d9d61/href

Pandas by example: columns

Rodrigo Pombo — Tue, 27 Feb 2018 17:18:52 GMT

Let’s review the many ways to do the most common operations over dataframe columns using pandas.

import pandas as pd

Adding columns to a dataframe

The three most popular ways to add a new column are: indexing, loc and assign:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

df["C"] = [1,2,3]
df.loc[:, "D"] = [1,2,3]
df = df.assign(E=[1,2,3])
df

Indexing is usually the simplest method for adding new columns, but it gets trickier to use together with chained indexing. It may add the column to a copy of the dataframe instead of adding it to the original. When this happens pandas will show a warning:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

df[df["A"] < 3]["C"] = 100
df

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

To avoid those cases, it’s better to use loc:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

df.loc[df["A"] < 3, "C"] = 100
df

loc has two limitations: it mutates the dataframe in-place and it can't be used with method chaining. If that's a problem for you, use assign:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

df = df.assign(C=[1,2,3]).assign(D=4)
df

assign is particularly useful when you want to create a new column based on a column from an intermediate dataframe. You can pass a lambda to assign to get the intermediate dataframe:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

df = df.assign(C=[1,2,3]).assign(D=lambda idf: idf["C"] * 2)
df

In the previous examples the column name is fixed, but you can also use variable column names:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

my_column_name = "C"
another_name = "D"
df = df.assign(**{my_column_name: [1,2,3], another_name: 100})
df

When you need to insert a column in a specific location, another option is insert:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

df.insert(loc=1, column="C", value=[1,2,3])
df

Finally, you can also use concat to add a new column:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

new_column = pd.Series([1,2,3])
df = pd.concat([df, new_column.rename("C")], axis=1)
df

No matter what method you use, a common mistake is adding a column with a different index. Pandas aligns the new values on the index labels, so rows without a matching label get NaN:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]},
                  index=[0,1,2])

new_column = pd.Series([1,2,3],index=[2,3,4])
df["C"] = new_column
df

If you don’t care about the indexes and just want to add the column using the current items order, you can use the values array:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]},
                  index=[0,1,2])

new_column = pd.Series([1,2,3],index=[2,3,4])
df["C"] = new_column.values
df

Renaming columns

The easiest way to rename columns is:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

df = df.rename(columns={"A":"X", "B":"Y"})
df

If you need to do something more complex with the name you can pass a lambda to rename:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

df = df.rename(columns=lambda cname: cname + "_" + cname)
df

You can also manipulate columns directly:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8]})

df.columns = "column-" + df.columns.str.lower()
df

Changing columns order

You can change the order of columns by explicitly listing each column:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8],
                   "C": [5,5,5]})

df = df[["A", "C", "B"]]
df

For larger dataframes it is easier to use list operations to reorder the columns:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8],
                   "C": [5,5,5]})

cols = df.columns.tolist()
column_to_move = "C"
new_position = 1

cols.insert(new_position, cols.pop(cols.index(column_to_move)))
df = df[cols]
df

Deleting columns

You can use dict operations, like del and pop, to remove columns:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8],
                   "C": [5,5,5]})
del df["B"]
C = df.pop("C")

df

For multiple columns (or for keeping the original dataframe intact) you can use drop:

df = pd.DataFrame({"A": [1,2,3],
                   "B": [2,4,8],
                   "C": [5,5,5]})

df = df.drop(["B", "C"], axis=1)
df

You can also use columns to select the columns to drop:

df = pd.DataFrame({"A1": [1,2,3],
                   "B2": [2,4,8],
                   "C2": [5,5,5]})

cols_to_drop = [cname for cname in df.columns if cname.endswith("2")]
df = df.drop(cols_to_drop, axis=1)
df

Sometimes it is easier to select the columns you want to keep:

df = pd.DataFrame({"A1": [1,2,3],
                   "B2": [2,4,8],
                   "C2": [5,5,5]})

cols_to_keep = [cname for cname in df.columns if cname.endswith("2")]
df = df[cols_to_keep]
df

Thanks for reading.

Pandas by example: columns was originally published in Hexacta Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Infographic: Timeline of software development methodologies

Tomas Henseler — Mon, 05 Feb 2018 00:00:00 GMT

This is a brief account of how software development methodologies have evolved, to better understand the changes we are experiencing.

I recently attended an Agile conference in Buenos Aires, where I had a good time sharing experiences and knowledge about the world of Agile methodologies in software development with colleagues from the industry (Scrum coaches, architects, developers, etc.).

As usual, during a coffee break, an interesting debate with one of the attendees emerged. Is Scrum better than Waterfall methodology? Are iterations actually good for learning from mistakes? Did Scrum invent the iterative methodology? Do mistakes cost more when using Waterfall methodology? What happened during the years between the birth of one methodology and another?

Certainly, these questions could give rise to a long conversation; however, we concluded that one of the problems of software engineering is that, in our discipline, we often forget the lessons of the past, conceiving everything new as good and everything old as bad.

In the article Hype Driven Development, Marek Kirejczyk commented on one of the components of this problem. We could all agree that software engineering is a very dynamic discipline that requires constant updating. Nevertheless, it is important to learn from history and experience in order to avoid repeating the mistakes of the past. Not for nothing is this one of the cornerstones of Agile thinking.

With this in mind, I decided to put together a small, perhaps arbitrary, account of how software methodologies have evolved, to better understand the changes we are experiencing.

1910: Henry Gantt invents his diagram (the Gantt chart) for project management. Along with Frederick Taylor, he intended to improve industrial efficiency by defining processes to perform repetitive tasks:

“It is only through enforced standardization of methods, enforced adoption of the best implements and working conditions, and enforced cooperation that this faster work can be assured.

And the duty of enforcing the adoption of standards and enforcing this cooperation rests with management alone”.

1916: Henri Fayol introduces concepts such as division of work, unity of command, and centralization of decisions.

1927–1932: Elton Mayo carries out the Hawthorne experiment, where he concludes, among other things, that the team productivity gain occurred because of the motivational effect on the workers from the interest being shown to them.

1953: The word software was coined as a prank.

1956: First formal description of Waterfall methodology made by Herbert D. Benington at a symposium on advanced programming methods for digital computers on June 29.

1976: The earliest use of the term Waterfall may have been in a 1976 paper by Bell and Thayer.

1985: The United States Department of Defense standardizes this methodology (Waterfall) and defines it as a standard for its software development providers. This standardization defined 6 stages: Preliminary Design, Detailed Design, Development and Unit Testing, Integration, and Testing.

1986: Barry Boehm describes the Spiral model. It introduces the concept of iterative development to mitigate risks.

1986: Fred Brooks publishes his paper No Silver Bullet, where he explains the difficulties inherent in software development. His book, The Mythical Man-Month, becomes the foundational text for software engineering. In 1999, Brooks wins the Turing award for his contributions.

1986: Takeuchi and Nonaka introduce the term Scrum for the first time in their article, The New New Product Development Game. In their work, the authors claim that Scrum is an approach to organizational knowledge creation, which is particularly good for bringing innovation in a continuous and incremental way.

1990: Ken Schwaber used what would become Scrum at his company: Advanced Development Methods. Meanwhile, Jeff Sutherland developed a similar approach, referring to it with the single word Scrum.

1995: As a result of their experience, Schwaber and Sutherland publish the paper Scrum methodology.

1996–1998: Rational Software Company develops its unified, iterative, and incremental software process. Focused on architecture and guided by use cases, this process becomes the standard of the industry.

1999: Kent Beck publishes Extreme Programming Explained, setting the stage for the Agile revolution.

2001: The Agile Manifesto is published:

We are uncovering better ways of developing software by doing it and helping others do it.

Through this work we have come to value:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

2003: Tom and Mary Poppendieck publish their book Lean Software Development, a translation for the software industry of Toyota’s Just in Time system.

2008: During the Agile conference of 2008, Bob Martin proposes a fifth value for the Agile Manifesto: Craftsmanship over Crap. A year later, the principles of the movement known as Software Craftsmanship are established. The metaphor for the software professional changes, going from Engineer to “Medieval Artisan”.

2009: The term DevOps becomes popular in a series of “devopsdays”.

2010: David Anderson publishes his book Kanban, a methodology that is also based on Toyota’s Production System.

Today: DevOps is gaining more and more ground with tools like Docker, Puppet, and Chef. The borders between development and operations are becoming blurred. Agile frameworks have scalability problems, leading to new implementations such as LeSS (Large-Scale Scrum) and SAFe (Scaled Agile Framework).

And so, we continue looking for the silver bullet in this awesome discipline.

To Sum Up

The software industry is very dynamic and demands constant updating. Almost by definition, it is a field that is constantly looking to the future. In this context, it is normal for the past to be overlooked and with it, the opportunity to learn from valuable lessons.

“Some of the best lessons we ever learn are learned from past mistakes. The error of the past is the wisdom and success of the future”. — Dale Turner-


Originally published at www.hexacta.com on February 5, 2018.

Infographic: Timeline of software development methodologies was originally published in Hexacta Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.
