# How to make a barplot using ggplot

A barplot (or bar graph) is one of the most commonly used plots for visualizing the relationship between a numerical and a categorical variable. Each entity of the categorical variable is represented as a bar, and the height of the bar represents its numeric value.

A typical barplot looks as below.

The above barplot shows the relationship between the categorical variable group and the numerical variable height. The bars are proportional to the heights, and the plot can be used to derive various insights about the groups, such as which group has the highest or lowest height, or how the height of group1 compares to group3.

To start with, let's create a basic bar chart using ggplot. I have also included reproducible code samples for each type.

The data:

To create a barplot using ggplot first install the ggplot2 library and create the dataset.

# install.packages("ggplot2")
library(ggplot2)

df <- data.frame(name = c("A", "B", "C", "D", "E"),
                 value = c(3, 12, 5, 18, 45))
head(df)

The above data set has two columns namely, name and value where value is numeric and name is considered as a categorical variable.

To create a barplot we will be using the geom_bar() of ggplot by specifying the aesthetics as aes(x=name, y=value) and also passing the data as input.

# basic barplot
ggplot(df, aes(x = name, y = value)) +
  geom_bar(stat = "identity")

And the barplot looks as below.

To get a bar graph of counts, don't map a variable to y and let geom_bar() use its default stat = "count" instead of stat = "identity", as below.

# basic barplot with count
ggplot(df, aes(x = factor(name))) +
  geom_bar()

And the barplot plotted against count looks as below.

In the above barplot the bars are of equal height of 1. This is because there is only one entry for each of the name values A, B, C, D and E.

The bars of a barplot can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart. We will see how to create a horizontal barplot later in this article below.

Now let’s see how to customize the above barplot by changing theme, colors, title, labels, barwidth etc.

Using ggplot2 it is possible to change the theme of a barplot to any of the built-in themes, such as theme_gray() (the default), theme_bw(), theme_minimal(), theme_classic(), theme_light(), theme_dark() and theme_void().

To change the theme of a barplot to a dark theme, add theme_dark() as in the code below.

# customizing barplot - changing theme
ggplot(df, aes(x = as.factor(name), y = value)) +
  geom_bar(stat = "identity") +
  theme_dark()

Adding labels to a bar graph will help you interpret the graph correctly. You can label each bar with the actual value by using the geom_text() and adjusting the position of the labels using vjust and size parameters.

For the above barplot let's place the labels outside the bars by setting vjust=-0.3 and size=5.

# customizing barplot - labels outside bars
ggplot(data = df, aes(x = name, y = value)) +
  geom_bar(stat = "identity", fill = "blue") +
  geom_text(aes(label = value), vjust = -0.3, size = 5)

Now let’s create labels inside the bars by setting vjust=1.6 and size=5 as below.

# customizing barplot - labels inside bars
ggplot(data = df, aes(x = name, y = value)) +
  geom_bar(stat = "identity", fill = "blue") +
  geom_text(aes(label = value), vjust = 1.6, size = 5)

Let’s see how to change the line colors of a barplot by coloring them based on their group.

Line colors can be set based on group by specifying the color argument of ggplot as color=as.factor(name).

# customizing barplot - change barplot line colors by groups
ggplot(df, aes(x = name, y = value, color = as.factor(name))) +
  geom_bar(stat = "identity", fill = "white")

The width of the bars can be changed using the width argument of geom_bar(). The width argument typically takes values between 0 and 1, with 1 being full width: larger values make the bars wider, smaller values make them narrower, and the default width is 0.9.

Let’s create a barplot with width=0.2 and see how it differs from the default barplot.

# customizing barplot - changing bar width
ggplot(df, aes(x = name, y = value)) +
  geom_bar(stat = "identity", width = 0.2)

As we can see, changing the bar width to 0.2 has created narrower bars compared to the default barplot.

When creating a bar plot with categorical labels, ggplot by default orders the bars in alphabetical order of the categorical labels. In ggplot, bars can be reordered in ascending or descending order using the reorder().

To reorder the bars in ascending order of their height use the reorder() by passing the name and value as arguments.

# customizing barplot - reordering bars
ggplot(df, aes(x = reorder(name, value), y = value)) +
  geom_bar(stat = "identity")

To reorder them in descending order, use a minus sign on the value variable as below.

# customizing barplot - reordering bars descending
ggplot(df, aes(x = reorder(name, -value), y = value)) +
  geom_bar(stat = "identity")

And the resulting barplot looks as below.

In ggplot it is possible to limit which categories are displayed using the limits parameter of scale_x_discrete(). Say we want to display the bars for only A and B; this can be done by passing just A and B to the limits argument of scale_x_discrete() as below.

# customizing barplot - choosing items to display
ggplot(df, aes(x = name, y = value)) +
  geom_bar(stat = "identity", width = 0.2) +
  scale_x_discrete(limits = c("A", "B"))

For the below barplots we have used the mtcars dataset of R and have used the categorical variable gear to create bars based on their count.

The mtcars dataset looks as below,

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Let’s see how to change the bar colors using the fill argument of geom_bar().

# customizing barplot - same color for all the bars
ggplot(mtcars, aes(x = as.factor(gear))) +
  geom_bar(color = "blue", fill = rgb(0.1, 0.4, 0.5, 0.7))

In the above plot color=”blue” represents the line color and fill=rgb(0.1,0.4,0.5,0.7) is the color filled inside each of the bars.

Now let’s use the scale_fill_hue() to fill colors based on categories.

# customizing barplot - using hue
ggplot(mtcars, aes(x = as.factor(gear), fill = as.factor(gear))) +
  geom_bar() +
  scale_fill_hue(c = 50) +
  theme(legend.position = "none")

As we can see in the above barplot each of the 3 categories has been filled with three different colors.

Using ggplot it is also possible to fill the bars with manually chosen colors using scale_fill_manual(). Let's create a barplot, manually specifying the bar colors as red, green and blue.

# customizing barplot - manual colors
ggplot(mtcars, aes(x = as.factor(gear), fill = as.factor(gear))) +
  geom_bar() +
  scale_fill_manual(values = c("red", "green", "blue")) +
  theme(legend.position = "none")

Using ggplot it is also possible to create the above barplot with scale_fill_brewer() by setting a palette. Let's create a barplot with palette = "Set2".

# customizing barplot - using RColorBrewer
ggplot(mtcars, aes(x = as.factor(gear), fill = as.factor(gear))) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")

Apart from Set2, RColorBrewer provides many other palettes; run `RColorBrewer::display.brewer.all()` to see the full set.

Barcolors can be filled with greyscale values using the scale_fill_grey() as below

# customizing barplot - using grey scale
ggplot(mtcars, aes(x = as.factor(gear), fill = as.factor(gear))) +
  geom_bar() +
  scale_fill_grey() +
  theme(legend.position = "none")

In the above barplot, the bars have been filled with darker to lighter grey scales based on their height.

We can change the legend position of a barplot using legend.position argument of the theme() of ggplot.

Let’s create a barplot with legends placed on top of the plot.

# customizing barplot - change legend position (top)
ggplot(mtcars, aes(x = as.factor(gear), fill = as.factor(gear))) +
  geom_bar() +
  scale_fill_brewer(palette = "Blues") +
  theme(legend.position = "top")

**Note:** The allowed values for legend.position are: "left", "top", "right" and "bottom".

In certain cases where the labels are long it often makes sense to turn your barplot horizontal. Using ggplot it can be done using the coord_flip().

The horizontal barplot for the df dataset looks as below.

# horizontal barplot
ggplot(df, aes(x = name, y = value)) +
  geom_bar(stat = "identity") +
  coord_flip()

If the data contains several groups of categories it can be displayed in a bar graph in one of two ways.

You can either decide to show the bars in groups (grouped barplot) or you can choose to have them stacked (stacked barplot). Let’s see more about these grouped and stacked barcharts below.

Let's first create a dataset with groups to illustrate the barcharts below.

# dataset for grouped, stacked and percent stacked barplots
set.seed(1234)
name  <- c(rep("A", 3), rep("B", 3), rep("C", 3), rep("D", 3))
group <- rep(c("group1", "group2", "group3"), 4)
value <- abs(rnorm(12, 0, 15))
data  <- data.frame(name, group, value)

The dataset looks as below.

head(data)
##   name  group     value
## 1    A group1 18.105986
## 2    A group2  4.161439
## 3    A group3 16.266618
## 4    B group1 35.185466
## 5    B group2  6.436870
## 6    B group3  7.590838

The dataset has three variables: the numeric variable (value) and two categorical variables for the group (name) and the subgroup levels (group).

In a grouped bar chart, for each categorical group there are two or more bars. These bars are color-coded to represent a particular grouping.

For example, a business owner with two stores might make a grouped bar chart with different colored bars to represent each store: the horizontal axis would show the months of the year and the vertical axis would show the revenue and the barplot as a whole can be used to visualize the revenue of the two stores for all the months of the year.

To create a grouped barplot just set the position=”dodge” in the geom_bar() function and map the categorical variable group to fill as below.

# grouped barplot
p <- ggplot(data, aes(fill = group, y = value, x = name)) +
  geom_bar(position = "dodge", stat = "identity")
p

The above grouped barchart has created bars for the three groups for each of the name values A, B, C and D.

In a grouped barchart, there is no space between bars within each group by default. However, some space can be added between bars within a group, by making the width smaller and setting the value for position_dodge to be larger than width as below,

# adding space between bars within each group
ggplot(data, aes(fill = group, y = value, x = name)) +
  geom_bar(position = position_dodge(width = 0.8), width = 0.7, stat = "identity")

And the grouped barchart with space added between the bars looks as below,

A stacked barplot is very similar to the grouped barplot. However, a stacked bar chart stacks bars that represent different groups on top of each other. The height of the resulting bar shows the combined result of the groups. i.e., the subgroups are just displayed on top of each other, not beside as in a grouped barchart.

The only thing to change to get a stacked barchart is to switch the position argument to stack.

# stacked barplot
ggplot(data, aes(fill = group, y = value, x = name)) +
  geom_bar(position = "stack", stat = "identity")

In the above barchart the groups are stacked on top of each other for each of the name values A, B, C and D.

Note that stacked bar charts are not suited to data sets where some groups have negative values. In such cases, grouped bar charts are preferable.

In a percent stacked barplot the percentage of each subgroup is represented instead of the count or y values. This makes it possible to study how each subgroup's proportion of the whole evolves.

To create a percent stacked barplot just switch to position=”fill”.

# percent stacked barplot
ggplot(data, aes(fill = group, y = value, x = name)) +
  geom_bar(position = "fill", stat = "identity")

As we can see in the above plot, the bars are plotted against percentage values. This makes it visually easier to interpret the proportion of the groups within each bar corresponding to the name variable.

So, in this article we have discussed how to create a basic and grouped barplot using ggplot2 and also discussed its various customization options. Hope this article helped you get a better understanding about ggplot2 barplot.

Do let us know your comments and feedback about this article below.

# Data visualization using ggplot histogram

A histogram is a type of graph commonly used to visualize the univariate distribution of numeric data. Here the data is displayed in the form of bins which represent the occurrence of data points within a range of values. These bins, and the distribution thus formed, can be used to understand useful information about the data such as its central location, spread and shape. A histogram can also be used to find outliers and gaps in the data.

A basic histogram for age looks as below.

From the above histogram it can be interpreted that most of the people fall within the age range 50-60 and there seem to be fewer people in the ranges 70-80 and 90-100. There is also a gap in the histogram for the range 80-90, which indicates that the data for that age range might be missing or not available. So, a histogram like this can be used to visualize useful information about a continuous numeric variable. Let's see more about histograms, how to create them and their various customization options below.

Histograms are sometimes confused with bar charts. Although a histogram looks similar to a bar chart, the major difference is that a histogram is only used to plot the frequency of occurrences in a continuous data set that has been divided into classes, called bins. Bar charts, on the other hand, are used to plot categorical data.

To create a histogram first install and load ggplot2 package.

# install.packages("ggplot2")
library(ggplot2)

We will be using the below dataset to create and explain the histograms. The dataset has two columns namely cond and rating. The variable cond is categorical with two categories A and B and rating is a continuous numeric variable.

set.seed(1234)
data <- data.frame(cond   = factor(rep(c("A", "B"), each = 200)),
                   rating = c(rnorm(200), rnorm(200, mean = .8)))

The dataset looks as below.

head(data)
##   cond     rating
## 1    A -1.2070657
## 2    A  0.2774292
## 3    A  1.0844412
## 4    A -2.3456977
## 5    A  0.4291247
## 6    A  0.5060559

Using ggplot2 histograms can be created in two ways with

- qplot() and
- geom_histogram()

Histogram using qplot can be created as below by passing one numeric argument.

# histogram using qplot
qplot(data$rating, geom = "histogram")

Histogram using geom_histogram() is also created by passing just the numeric variable.

# histogram using geom_histogram
ggplot(data, aes(x = rating)) +
  geom_histogram()

Although the two plots look similar, in practice geom_histogram() is more widely used, since the options for qplot() are more confusing to use.

Note that creating these histograms triggers the warning message `stat_bin() using bins = 30. Pick better value with binwidth.`, which needs to be addressed by changing the binwidth.

To construct a histogram, the first step is to bin the range of values i.e., divide the entire range of values into a series of intervals and then count how many values fall into each interval.

So, a histogram basically forms bins from numeric data where the area of the bin indicates the frequency of occurrences. Hence changing the bin size would result in changing the overall appearance and would result in histograms with different distribution and spread of the values.

Note that the height of the bin does not necessarily indicate how many occurrences of scores there were within each individual bin. It is the product of height multiplied by the width of the bin that indicates the frequency of occurrences within that bin. So, only in case of equally spaced bins(bars), the height of the bin represents the frequency of occurrences.

In ggplot2, the bin size can be changed using the binwidth argument. Now let's explore how changing the bin size affects the histogram by creating two histograms with different bin sizes.

Let’s first create a histogram with a binwidth of 0.5 units.

# changing the binwidth to 0.5
ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5)

Next, let's create the second histogram with a binwidth of 0.1 units.

# changing the binwidth to 0.1
ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .1)

Now let’s compare the histograms.

As we can see changing the binsize has created histograms with different distribution and spread of data. So, choosing the right binsize is important to get useful information from the histogram.

Now let’s see how to customize the histogram by changing the outline, colors, title, axis labels etc.

The outline and color of a histogram can be changed using the color and fill arguments of geom_histogram().

Color represents the outline color and fill represents the color to be filled inside the bins.

For the above basic histogram, lets change the outline color to red and fill color to grey.

# changing histogram outline and fill colors
ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "red", fill = "grey")

A title can be added to a histogram using ggtitle() of ggplot2. Let's set the title of the above histogram to "histogram with ggplot2".

# adding a title
ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "red", fill = "grey") +
  ggtitle("histogram with ggplot2")

Labels can be customized using scale_x_continuous() and scale_y_continuous(). We add the desired name to the name argument as a string to change the labels.

# customizing axis labels
ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "red", fill = "grey") +
  scale_x_continuous(name = "rating") +
  scale_y_continuous(name = "Count of values") +
  ggtitle("histogram with customized axis labels")

Let’s change the x-axis ticks to appear at every 3 units rather than 2 using the breaks = seq(-4,4,3) argument in scale_x_continuous. seq() function indicates the start and endpoints and the units to increment by respectively.

Let’s also change where y-axis begins and ends where we want by adding the argument limits = c(0, 100) to scale_y_continuous.

# customizing axis ticks
ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "red", fill = "grey") +
  scale_x_continuous(name = "rating", breaks = seq(-4, 4, 3)) +
  scale_y_continuous(name = "Count of values", limits = c(0, 100)) +
  ggtitle("histogram with customized axis ticks")

The histogram with new axis ticks looks as below.

Let's transform the x and y axes and see how the transformations affect the ggplot histogram.

Let’s first transform the x-axis by taking the square root of them using the scale_x_sqrt().

m <- ggplot(data, aes(x = rating))

# using a transformed scale for the x-axis
m + geom_histogram() + scale_x_sqrt()

The histogram with new transformed x-axis looks as below.

While applying the above transformation all the infinite values resulting from the transformation have been removed.

Hence the transformed scales for negative x-values are not displayed in the above histogram.

Let's now transform the y-axis, first by taking its square root and then by reversing it.

This can be done using scale_y_sqrt() and scale_y_reverse() as below.

# using transformed scales for the y-axis
m <- ggplot(data, aes(x = rating))
m + geom_histogram(binwidth = 0.5) + scale_y_sqrt()
m + geom_histogram(binwidth = 0.5) + scale_y_reverse()

And the histograms for the transformed y-axis looks as below.

Note that for the transformed scales, binwidth applies to the transformed data and the bins have constant width on the transformed scale.

Vertical and horizontal lines can be added to a histogram using geom_vline() and geom_hline() of ggplot2.

Now let’s see how to add a vertical line along the mean rating to the above histogram.

# adding a mean line
ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "red", fill = "grey") +
  geom_vline(aes(xintercept = mean(rating)),
             color = "blue", linetype = "dashed", size = 1) +
  ggtitle("histogram with mean line")

And the histogram looks as below,

We can also create histograms with density instead of count on y-axis. This can be done by changing the y argument of geom_histogram() as y=..density..

ggplot(data, aes(x = rating)) +
  geom_histogram(aes(y = ..density..), binwidth = .5, colour = "red", fill = "grey") +
  ggtitle("histogram with density instead of count")

As we can see the histogram has been plotted with density instead of count on the y axis.

Let’s customize this further by adding a normal density function curve to the above histogram.

We can also add a normal density function curve on top of our histogram to see how closely it fits a normal distribution. In order to overlay the normal density curve, we have added the geom_density() with alpha and fill parameters for transparency and fill color for the density curve. We have used alpha=.2 and fill color as yellow in this case. Note that the normal density curve will not work if count is used instead of density.

And the code to overlay normal density curve looks as given below.

# histogram with density instead of count on the y-axis
ggplot(data, aes(x = rating)) +
  geom_histogram(aes(y = ..density..), binwidth = .5, colour = "red", fill = "grey") +
  geom_density(alpha = .2, fill = "yellow") +
  ggtitle("histogram with density instead of count")

As we can see the above histogram seems to perfectly fit a normal distribution.

We can also add a gradient to the color scheme that varies according to the frequency of the values, using scale_fill_gradient(). To add the gradient, map fill = ..count.. inside geom_histogram() so that the color changes based on the count values. For lower count values let's set the color to yellow, and red for the higher ones.

The code to customize gradient looks as below.

m <- ggplot(data, aes(x = rating))
m + geom_histogram(aes(fill = ..count..), binwidth = 0.7) +
  scale_fill_gradient("Count", low = "yellow", high = "red")

As we can see, in the above histogram the color is changed from yellow to red based on the count of values.

Using ggplot2 it is possible to create more than one histogram in the same plot. Now let’s see how to create a stacked histogram for the two categories A and B in the cond column in the dataset.

Stacked histograms can be created using the fill argument of ggplot(). Let's set the fill argument to cond and see how the histogram looks.

# histogram with categories - stacked histograms
ggplot(data, aes(rating, fill = cond)) +
  geom_histogram(binwidth = .5)

We can see two histograms have been created for the two categories A and B, differentiated by color. By default, ggplot creates a stacked histogram as above. Let's customize this further by creating overlaid and interleaved histograms using the position argument of geom_histogram().

Overlaid histograms are created by setting the argument position=”identity”. We have also set the alpha parameter as alpha=.5 for transparency.

# overlaid histograms
ggplot(data, aes(x = rating, fill = cond)) +
  geom_histogram(binwidth = .5, alpha = .5, position = "identity")

Interleaved histograms can by created by changing the position argument as position=”dodge”.

# interleaved histograms
ggplot(data, aes(x = rating, fill = cond)) +
  geom_histogram(binwidth = .5, position = "dodge")

Facets can be created for histogram plots using facet_grid(). Here let's create a facet grid for the histograms based on the categories A and B of cond, by adding facet_grid(cond ~ .) to the ggplot call.

# using facets
ggplot(data, aes(x = rating)) +
  geom_histogram(binwidth = .5, colour = "black", fill = "grey") +
  facet_grid(cond ~ .)

As we can see, we have created a facet grid with two histograms for the categories A and B of cond. This can be used in cases where the histograms need to be compared or more than one histogram needs to be plotted in the same graph.

In this article we have discussed how to create histograms using ggplot2 and its various customization options. We first created a basic histogram using qplot() and geom_histogram() of ggplot2.

We then discussed bin size and how it affects the appearance of a histogram. We then customized the histogram by adding a title, axis labels, ticks, a gradient and a mean line. We also discussed the density curve and created a histogram with a normal density curve to see how it fits a normal distribution.

We then moved on to multiple histograms by creating stacked, interleaved and overlaid histograms for the two categories A and B. Finally, we created a facet grid with two histogram plots.

Hope this article helped you get a good understanding about ggplot2 histogram. Do let us know your feedback about this article below.

# How to read data using pandas read_csv

pandas read_csv() is used to read a csv (comma separated values) file and convert it to a pandas dataframe.

pandas is a very important library used in data science projects using python.

Let’s convert this csv file containing data about Fortune 500 companies into a pandas dataframe.

import pandas as pd

df = pd.read_csv("f500.csv")
df.head(2)

  company     rank  revenues  revenue_change  profits  assets  profit_change  ceo                  industry               sector     previous_rank  country  hq_location      website                 years_on_global_500_list  employees  total_stockholder_equity
0 Walmart     1     485873    0.8             13643.0  198825  -7.2           C. Douglas McMillon  General Merchandisers  Retailing  1              USA      Bentonville, AR  http://www.walmart.com  23                        2300000    77798
1 State Grid  2     315199    -4.4            9571.3   489838  -6.2           Kou Wei              Utilities              Energy     2              China    Beijing, China   http://www.sgcc.com.cn  17                        926067     209456

Let's now try to understand the different parameters of pandas read_csv and how to use them.

If the separator between the fields of your data is not a comma, use the sep argument. For example, suppose we want to convert pipe-separated values to a dataframe using the pandas read_csv separator.

Use sep = “|” as shown below

import pandas as pd

df2 = pd.read_csv("/Users/HoningDS/Desktop/dataset.txt", sep="|")

The delimiter argument of pandas read_csv is the same as sep. A question may arise: if sep and delimiter do the same thing, why have two arguments? Most likely for backward compatibility.

sep is more commonly used than delimiter.

Use the header parameter of pandas read_csv to specify which line in your data is to be considered the header. For example, the header is already present in the first line of our dataset.

In this case, we need to either use header = 0 or don’t use any header argument.

import pandas as pd

df = pd.read_csv("f500.csv", header=0)

header = 1 means consider second line of the dataset as header.

If your csv file does not have a header, set header = None while reading it. pandas will then use auto-generated integer values as the header.
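A minimal sketch of this behaviour, using a small in-memory sample in place of a real file:

```python
import io
import pandas as pd

# A small headerless sample, standing in for a real csv file.
raw = io.StringIO("Walmart,1,485873\nState Grid,2,315199\n")

df = pd.read_csv(raw, header=None)
print(df.columns.tolist())  # auto-generated integer column names: [0, 1, 2]
```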

Use the names attribute if you would want to specify column names to the dataframe explicitly. All the column names should be mentioned within a list.

Refer to the example below where I mention the column names as COLUMN1, COLUMN2, and so on up to COLUMN18. Notice how the header of the dataframe changes from the earlier header.
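A small sketch of the idea, using a three-column in-memory sample rather than the full f500 data. Note that when names is given for a file that already has a header row, header=0 is also needed so that the old header row is dropped rather than read as data:

```python
import io
import pandas as pd

raw = io.StringIO("company,rank,revenues\nWalmart,1,485873\n")

# header=0 drops the existing header row; names supplies the new column names.
df = pd.read_csv(raw, header=0, names=["COLUMN1", "COLUMN2", "COLUMN3"])
print(df.columns.tolist())  # ['COLUMN1', 'COLUMN2', 'COLUMN3']
```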

Use this argument to specify the row labels to use. If you set index_col to 0, then the first column of the dataframe will become the row label. Notice how the row labels change.

Earlier the row labels were 0,1,2,…etc. Now, the row labels have changed to Walmart, State Grid etc.

df = pd.read_csv("f500.csv", index_col=0)

Note that you can specify more than one column index (or) a combination of multiple column names as argument for index_col .

In such a case, you need to enclose all of these column indexes or column names in a list.

So I can either have [0,1] (or) [“company”, “rank”] as index_col.

See the impact of this change on the dataframe in the example below. Now the values of both “company” and “rank” constitute the row label.

df = pd.read_csv("f500.csv", index_col=[0, 1])
df = pd.read_csv("f500.csv", index_col=["company", "rank"])

Use pandas usecols when you want to load specific columns into dataframe. When your input dataset contains a large number of columns, and you want to load a subset of those columns into a dataframe , then usecols will be very useful.

Performance wise, it is better because instead of loading an entire dataframe into memory and then deleting the not required columns, we can select the columns that we’ll need, while loading the dataset itself.

As a parameter to usecols , you can pass either a list of strings corresponding to the column names or a list of integers corresponding to column index . Refer to below example where I am passing a list of strings as usecols parameter.

import pandas as pd df = pd.read_csv("f500.csv", usecols = ["company", "rank", "revenues"])

You can also use column index positions as parameter to usecols

import pandas as pd df = pd.read_csv("f500.csv", usecols = [0,1,2])

Note that element order is ignored when using usecols. So, whether you pass [2,1,0] or [0,1,2] as the usecols parameter, the resulting dataframe columns will have the same order, namely company, rank, revenues.

This behaviour holds true when passing a list of column names as well. So, whether you pass ["company", "rank", "revenues"] or ["rank", "company", "revenues"] to usecols, the resulting dataframe will have the same column order, namely company, rank, revenues.

If you want column order to be enforced while using usecols , then you need to pass the list containing column names explicitly – see below.

df = pd.read_csv("f500.csv", usecols=["company", "rank", "revenues"])[["company", "revenues", "rank"]]

One more use of the usecols parameter is to skip certain columns in your dataframe. See the example below, where I use a callable as the usecols parameter to exclude the columns company, rank and revenues, and retain all the others. Notice the change in the columns present in the dataframe after using usecols.

df = pd.read_csv("f500.csv",
                 usecols=lambda column: column not in ["company", "rank", "revenues"])

If your dataset contains only one column, and you want to return a Series from it , set the squeeze option to True.

I created a file containing only one column and read it using pandas read_csv with squeeze = True. We get a pandas Series object as output instead of a pandas DataFrame.
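A hedged sketch of this: the squeeze parameter of read_csv was deprecated and later removed in recent pandas releases, so the portable equivalent is to call DataFrame.squeeze("columns") after reading.

```python
import io
import pandas as pd

# Single-column in-memory sample.
raw = io.StringIO("company\nWalmart\nState Grid\n")

# Older pandas: pd.read_csv(raw, squeeze=True)
# Current pandas: squeeze the one-column frame into a Series afterwards.
s = pd.read_csv(raw).squeeze("columns")
print(type(s).__name__)  # Series
```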

When a data set doesn’t have any header , and you try to convert it to dataframe by (header = None), pandas read_csv generates dataframe column names automatically with integer values 0,1,2,…

If we want to prefix each column’s name with a string, say, “COLUMN”, such that dataframe column names will now become COLUMN0, COLUMN1, COLUMN2 etc. we use prefix argument.
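A sketch of the idea: the prefix argument was removed in pandas 2.0, so a portable way to get the same COLUMN0, COLUMN1, ... names is DataFrame.add_prefix after reading with header=None.

```python
import io
import pandas as pd

raw = io.StringIO("Walmart,1,485873\nState Grid,2,315199\n")

# Older pandas: pd.read_csv(raw, header=None, prefix="COLUMN")
# Current pandas: add the prefix to the auto-generated integer names.
df = pd.read_csv(raw, header=None).add_prefix("COLUMN")
print(df.columns.tolist())  # ['COLUMN0', 'COLUMN1', 'COLUMN2']
```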

If a dataset has duplicate column names, convert it to a dataframe by setting mangle_dupe_cols to True.

Refer to example below. The **revenues** column appears twice.

In the dataframe, the second revenues column will be named as revenues.1

Use dtype to set the datatype for the data or for individual dataframe columns. If you want to set the data type for multiple columns, separate them with commas within the dtype parameter, like {"col1": "float64", "col2": "Int64"}.

In the below example, I am setting data type of “revenues” column to float64.

df = pd.read_csv("f500.csv", dtype = {"revenues" : "float64"})

Use engine to specify the parsing engine, which can be python or c based.

The c engine is faster, but the python-based parser is more feature rich.

df = pd.read_csv("f500.csv", engine = "c")

Use converters to convert values in certain columns, using a dict of functions.

In the below example, I have declared a function f which replaces the decimal point with a comma. I am using converters to call the function f on the profits column.

Upon doing this, the decimal point in the profits column gets changed to a comma (,).
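A minimal runnable version of this idea, with a made-up two-row profits column standing in for f500.csv:

```python
import io

import pandas as pd

csv = "company,profits\nA,13.64\nB,9.57\n"

# f receives each raw string value of the column and replaces '.' with ','.
f = lambda x: x.replace(".", ",")
df = pd.read_csv(io.StringIO(csv), converters={"profits": f})

print(df["profits"].tolist())  # ['13,64', '9,57']
```

Note that a converted column keeps whatever type the function returns, a string in this case.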

Suppose your dataset contains Yes and No string which you want to interpret as True and False.

We can tell pandas to convert 'Yes' strings to True and 'No' strings to False using true_values and false_values.

If you have leading or trailing spaces in a field, then pandas read_csv may not work as expected.

Note the below example where you can see two spaces in first two rows of col2.

Use skipinitialspace in this scenario so that ' No' is interpreted as False. If skipinitialspace is not set to True, then col2 will still contain the original string instead of False.

import pandas as pd
df = pd.read_csv("f500_new.csv", encoding = "ISO-8859-1", true_values = ['Yes'], false_values = ['No'], skipinitialspace = True)
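Here is a self-contained sketch of the same call on inline data (illustrative values; note the leading space before 'No' in the first row of col2):

```python
import io

import pandas as pd

csv = "col1,col2\nYes, No\nNo,No\n"

# skipinitialspace strips the space after the delimiter, so ' No' matches 'No'.
df = pd.read_csv(io.StringIO(csv), true_values=["Yes"], false_values=["No"],
                 skipinitialspace=True)

print(df["col1"].tolist())  # [True, False]
print(df["col2"].tolist())  # [False, False]
```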

You can use pandas read_csv skip rows to

- Exclude reading specified number of rows from the beginning of a csv file , by passing an integer argument (or)
- Skip reading specific row indices from a csv file, by passing a list containing row indices to skip.

Let's use the below dataset to understand skiprows.

To skip reading the first 4 rows from this csv file, you can use skiprows = 4

import pandas as pd
df = pd.read_csv("f500.csv", skiprows = 4)

To skip reading rows with indices 2 to 4 , you can use

import pandas as pd
df = pd.read_csv("f500.csv", skiprows = [2,3,4])
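One detail worth checking yourself: the indices in the skiprows list count physical file lines starting at 0, and line 0 is the header. A small sketch with made-up data:

```python
import io

import pandas as pd

csv = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n"

# File lines 2, 3 and 4 (0-based; the header is line 0) are skipped.
df = pd.read_csv(io.StringIO(csv), skiprows=[2, 3, 4])

print(df.values.tolist())  # [[1, 2], [9, 10]]
```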

skipfooter indicates the number of rows to skip from the bottom of the file.

Ensure that the engine parameter is 'python' and not 'c' while using skipfooter.

So, to skip the last 4 rows of a file, you can use skipfooter = 4.

import pandas as pd
df = pd.read_csv("f500.csv", skipfooter = 4, engine = 'python')

If you want to read a limited number of rows, instead of all the rows in a dataset, use nrows. This is especially useful when reading a large file into a pandas dataframe.
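A quick sketch of nrows on inline data (values invented for illustration):

```python
import io

import pandas as pd

csv = "a,b\n1,2\n3,4\n5,6\n"

# Read only the first two data rows of the file.
df = pd.read_csv(io.StringIO(csv), nrows=2)

print(len(df))           # 2
print(df["a"].tolist())  # [1, 3]
```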

Please note that all these strings are considered as default NaN values by the pandas read_csv function: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.

In the below example, col1 has spaces and col2 has #N/A. Observe how both these values are interpreted as NaN when read using pandas read_csv.

When parsing data, you can choose whether or not to include the default NaN values.

For example, if you don't want to consider '' and '#N/A' as NaN, then you need to set keep_default_na to False.

If you want any additional strings to be considered as NaN, other than the default NaN values, then use na_values.

In the below example, I have added the list of strings [“Walmart”, “Energy”] as an attribute for na_values, so that both these values will be replaced by NaN.

In the below example, I want to interpret the string “Energy” within the column “sector” to be interpreted as NaN. Hence I included them in a dictionary attribute of na_values

df = pd.read_csv("f500.csv", na_values = {"sector": "Energy"})

Often in data science projects, you might get a scenario where you don’t want to consider all of the default NaN values while parsing.

I mean, suppose you do not want to consider '' as NaN, but do want to consider '#N/A' as NaN.

Then you need to set keep_default_na to False, and set na_values to ‘#N/A’.

import pandas as pd
df = pd.read_csv("f500_new.csv", encoding = "ISO-8859-1", keep_default_na = False, na_values = '#N/A')

Please note the below important points:

- If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.
- If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.
- If keep_default_na is False, and na_values are specified, only the NaN values specified in na_values are used for parsing.
- If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.

Source : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
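These combinations are easy to verify on a small made-up column:

```python
import io

import pandas as pd

csv = "col\nNA\n#N/A\nok\n"

# Default behaviour: both NA and #N/A are parsed as NaN.
d1 = pd.read_csv(io.StringIO(csv))
# Only '#N/A' is treated as NaN; 'NA' survives as a plain string.
d2 = pd.read_csv(io.StringIO(csv), keep_default_na=False, na_values="#N/A")

print(d1["col"].isna().tolist())  # [True, True, False]
print(d2["col"].isna().tolist())  # [False, True, False]
```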

na_filter is used to find and interpret missing values in your dataset.

If you have a large file, and you are sure that data does not contain any missing values, set na_filter as False. This will boost the performance of data parsing.

Using verbose = True will print additional information.

The time taken for each stage of converting the file into a dataframe, like tokenization, type conversion and memory clean up, will be printed.

import pandas as pd
df = pd.read_csv("f500.csv", verbose = True)

Tokenization took: 0.53 ms
Type conversion took: 2.14 ms
Parser memory cleanup took: 0.01 ms

If skip_blank_lines option is set to False, then wherever blank lines are present, NaN values will be inserted into the dataframe.

If set to True, then the entire blank line will be skipped.
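A two-line sketch of both settings, with an invented blank line in the middle of the data:

```python
import io

import pandas as pd

csv = "a,b\n1,2\n\n3,4\n"

d1 = pd.read_csv(io.StringIO(csv))                          # blank line skipped
d2 = pd.read_csv(io.StringIO(csv), skip_blank_lines=False)  # blank line kept as a NaN row

print(len(d1), len(d2))  # 2 3
```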

We can use pandas parse_dates to parse columns as datetime. You can either use parse_dates = True or parse_dates = [‘column name’]

Let’s convert col3, that has the string content, to a datetime datatype. If you don’t use parse_dates in the read_csv call, col3 will be represented as an object.

#pandas read_csv without parse_dates parameter
df = pd.read_csv("f500_new.csv", encoding = "iso-8859-1")
#See the dtypes for this dataframe.
df.dtypes

company     object
rank       float64
col1       float64
col2       float64
col3        object

Let’s use parse_dates argument of pandas read_csv and set it to the list of column names that we intend to parse as a datetime. See the dtypes after setting parse_dates to col3. It has been read in as datetime64[ns] now !!

#pandas read_csv with parse_dates parameter
df = pd.read_csv("f500_new.csv", encoding = "iso-8859-1", parse_dates = ['col3'])
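The same effect on an inline stand-in for f500_new.csv (dates made up; ISO-formatted so pandas can parse them without extra hints):

```python
import io

import pandas as pd

csv = "company,col3\nA,2019-01-05\nB,2019-02-10\n"

# col3 comes back as datetime64 instead of object.
df = pd.read_csv(io.StringIO(csv), parse_dates=["col3"])

print(df["col3"].dtype)  # datetime64[ns]
```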

If pandas is unable to convert a particular column to datetime, even after using parse_dates, it will return the object data type.

If you set infer_datetime_format to True and enable parse_dates for a column, pandas read_csv will try to infer the datetime format of the strings in that column and switch to a faster way of parsing them.

The advantage of infer_datetime_format is that the parsing happens very quickly.

If it is successful in inferring the datetime format of the strings, the parsing speed can be increased by up to 10 times!

Often in data science, you will work with date columns for data analysis.

To understand keep_date_col, let us consider that you want to combine the three columns day, month and year to derive a new date column, called date_col.

Note that by default, pandas read_csv will retain just the new date column and drop the columns from which you derived it.

Hence, you can see only date_col in the dataframe's dtypes.

import pandas as pd
df = pd.read_csv("f500_new.csv", encoding = "iso-8859-1", parse_dates = {'date_col' : ["day", "month", "year"]})
df['date_col'] = pd.to_datetime(df['date_col'])
df.dtypes

date_col          datetime64[ns]
company                   object
rank                       int64
revenues                   int64
revenue_change           float64
profits                  float64

If you set keep_date_col to True, the original date columns, namely day, month and year, will be retained along with the new date column date_col in the pandas dataframe.

Using the date_parser is the most flexible way to parse a file containing date columns

Suppose you have a column called date_col in your dataset which has date in the format YYYY DD MM HH:MM:SS , like shown below

The easiest way is to write a lambda function which can read the data in this format, and pass this lambda function to the date_parser argument.

import pandas as pd
from datetime import datetime

mydateparser = lambda x: datetime.strptime(x, "%Y %d %m %H:%M:%S")
df = pd.read_csv("f500_new.csv", encoding = "iso-8859-1", parse_dates = ['date_col'], date_parser = mydateparser)

Observe now , that the date_col has been populated correctly in pandas dataframe.

Use the dayfirst parameter to indicate that day comes first in your column representing dates.

Consider the below data .

The first row contains “04/10/96” , and pandas considers 04 as the month .

What if, you want pandas to consider 04 as the day instead of month?

Use dayfirst in such a scenario to indicate that the day part comes first in your data.

Notice the change in interpretation after setting dayfirst to True: pandas has interpreted the same value as the 4th of October 1996, instead of the 10th of April 1996!

Using the iterator parameter, you can define how much data you want to read in each iteration. By setting iterator to True, pandas read_csv returns a TextFileReader object instead of a dataframe.

import pandas as pd
df = pd.read_csv("f500.csv", iterator = True)
type(df)

pandas.io.parsers.TextFileReader

You can use df.get_chunk(n) to retrieve the rows from this object. Each execution of get_chunk will retrieve n number of rows from the last retrieved row.
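A small sketch of get_chunk on inline data (values invented), showing that each call continues where the previous one stopped:

```python
import io

import pandas as pd

csv = "a\n1\n2\n3\n4\n5\n"

reader = pd.read_csv(io.StringIO(csv), iterator=True)
first = reader.get_chunk(2)   # rows with a = 1, 2
second = reader.get_chunk(2)  # continues from the last retrieved row: a = 3, 4

print(first["a"].tolist(), second["a"].tolist())  # [1, 2] [3, 4]
```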

By specifying a chunksize, you can retrieve the data in same-sized 'chunks'.

This is especially useful when reading a huge dataset as part of your data science project.

Instead of putting the entire dataset into memory , this is a ‘lazy’ way to read equal sized portions of the data. Setting chunksize will return a TextFileReader object.

See an example below; I have specified a chunk size of 2 here, so two rows will be returned for each iteration of the for loop.

df = pd.read_csv(io.StringIO(temp), chunksize = 2)
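Filling in the temp string with made-up data, the loop looks like this:

```python
import io

import pandas as pd

temp = "a,b\n1,2\n3,4\n5,6\n7,8\n"

sizes = []
for chunk in pd.read_csv(io.StringIO(temp), chunksize=2):
    sizes.append(len(chunk))  # each chunk is a regular DataFrame

print(sizes)  # [2, 2]
```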

pandas read_csv has the ability to read compressed files.

By default , compression parameter is set to ‘infer’ , which means it will try to decompress the files of the type (gzip , zip , bz2 , xz ) and read them into dataframe.

If you want to do analysis on a huge file, it is always better to use the compressed file.

Don't uncompress the file before reading it into a dataframe.

I mean, if you have a gz file, then set compression = 'gzip' and work with the gz file directly.

If a column within your dataset contains a comma to indicate the thousands place, and you try to convert this dataset to a dataframe using pandas read_csv, then this column will be considered as a string!

Refer to the example below.

The revenues column contains a comma, and when you try to convert to a dataframe, the dtype for revenues becomes object.

To ensure that such a column is interpreted as int64, you need to indicate to pandas read_csv that the comma is a thousands separator, like shown below.

Observe that the dtype of the revenues column has now changed to int64 instead of object.

import pandas as pd
df = pd.read_csv("f500_data.csv", thousands = ",")
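A runnable sketch with quoted, comma-grouped numbers standing in for f500_data.csv (the figures are invented):

```python
import io

import pandas as pd

csv = 'company,revenues\nA,"500,343"\nB,"348,903"\n'

# thousands="," strips the grouping commas so the column parses as integers.
df = pd.read_csv(io.StringIO(csv), thousands=",")

print(df["revenues"].tolist())  # [500343, 348903]
print(df["revenues"].dtype)     # int64
```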

decimal is used to indicate which character should be considered the decimal point.

In some countries within Europe, a comma may be used as the decimal point indicator. We need to set decimal = "," to parse such data.

Suppose we have data like shown below. You want to convert this data to a pandas dataframe by considering ~ as the line terminator.

data = 'a,b,c~1,2,3~4,5,6'

Then specify lineterminator = '~' within the call to read_csv.
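Putting the two lines together:

```python
import io

import pandas as pd

data = "a,b,c~1,2,3~4,5,6"

# '~' separates records instead of a newline.
df = pd.read_csv(io.StringIO(data), lineterminator="~")

print(df.values.tolist())  # [[1, 2, 3], [4, 5, 6]]
```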

The post How to read data using pandas read_csv appeared first on Honing Data Science.


A confusion matrix, also known as an error matrix, is a table layout that is used to visualize the performance of a classification model where the true values are already known.

A typical confusion matrix looks as below:

As seen above a confusion matrix has two dimensions namely Actual class and Predicted class. Each row of the matrix represents the number of instances in a predicted class while each column represents the number of instances in an actual class (or vice versa).

In this section I will explain the confusion matrix of the Bank Marketing dataset.

The dataset can be downloaded from the below link https://archive.ics.uci.edu/ml/machine-learning-databases/00222/.

Here I have used the bank.csv with 10% of the examples (4521 rows) and 17 attributes.

The goal of the Bank Marketing dataset is to predict if the client will subscribe to the term deposit.

The train data contains the following categorical variables such as

- job – job type of the clients,
- marital – marital status of the clients,
- education – education level,
- default – if the credit is in default(yes/no),
- housing – if there is a housing loan(yes/no),
- loan – if the customer currently has a personal loan(yes/no),
- contact – type of contact,
- poutcome – result of the previous marketing campaign contact, and
- y – if the client actually subscribed to the term deposit(yes/no).

Here Attributes (1) through (8) are input variables, and (9) is considered the outcome.

The outcome "y" is either yes (meaning the customer will subscribe to the term deposit) or no (meaning the customer won't subscribe). For example, the confusion matrix of a Naive Bayes classifier on 100 clients, predicting whether they would subscribe to the term deposit, looks as below.

From the above table it can be seen that of the 11 clients who actually subscribed to the term deposit, the model predicted 3 subscribed and 8 not subscribed. Similarly, of the 89 clients who did not subscribe to the term deposit, the model predicted 2 subscribed and 87 not subscribed.

Now let’s see the basic terminologies from a confusion matrix which could be used to analyze our classifier results.

True positives (TP) are the number of positive instances the classifier correctly identified as positive. For the above bank marketing case this is the number of correct classifications of the "subscribed" class, or potential clients that are willing to subscribe to a term deposit, which is 3 in this case.

False positives (FP) are the number of instances the classifier identified as positive but that, in reality, are negative. For the above case this is the number of incorrect classifications into the "subscribed" class: potential clients that are not willing to subscribe to a term deposit but that the model has predicted as belonging to the "subscribed" class.

In the above case, 2 customers have been predicted as "subscribed" but in reality belong to the "non-subscribed" class.

True negatives (TN) are the number of negative instances the classifier correctly identified as negative. For the above case,this is the number of correct classifications of the “Not Subscribed” class or potential clients that are not willing to subscribe to a term deposit which is 87 in this case.

False negatives (FN) are the number of instances classified as negative but that, in reality, are positive. For the above case, this is the number of incorrect classifications into the "Not Subscribed" class: potential clients that are willing to subscribe to a term deposit but that the model predicted as not subscribing.

In the above case, 8 customers have been predicted as not willing to subscribe to the term deposit but in reality belong to the "subscribed" class.

TP and TN are the correct guesses. A good classifier should have large TP and TN and small (ideally zero) numbers for FP and FN.

Accuracy: It is the percentage of number of correctly classified instances among all other instances. It is defined as the sum of TP and TN divided by the total number of instances.

             TP + TN
Accuracy = ------------------- * 100
           TP + TN + FP + FN

A good model should have a high accuracy score, but having a high accuracy score alone does not guarantee the model is well established.

Let’s see the other measures that can be used to better evaluate the performance of a classifier.

True positive rate (TPR), also called recall, shows the percentage of positive instances correctly classified as positive.

          TP
TPR = --------- * 100
       TP + FN

In the above case TPR is the percentage of customers correctly predicted as “subscribed”.

False positive rate (FPR) shows the percentage of negative instances incorrectly classified as positive. The FPR is also called the false alarm rate or the type I error rate.

          FP
FPR = --------- * 100
       FP + TN

In the above case FPR is the percentage of customers of the “non subscribed” class who has been incorrectly classified as “subscribed”.

The false negative rate (FNR) shows what percent of positives the classifier marked as negatives. It is also known as the miss rate or type II error rate. Note that the sum of TPR and FNR is 1.

          FN
FNR = --------- * 100
       TP + FN

In the above case FNR is the percentage of customers who are willing to subscribe to the term deposit but the model has predicted as belonging to the “not-subscribed” class.

A well-performed model should have a high TPR that is ideally 1 and a low FPR and FNR that are ideally 0.

Precision is the percentage of correctly classified positive instances among all other positive instances.

                TP
Precision = --------- * 100
             TP + FP

In the above case, Precision is the percentage of correctly classified customers of the "subscribed" class among all the customers classified as "subscribed".
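Although this post works in R, the arithmetic behind these definitions is easy to check in a few lines of Python, using the counts from the 100-client example above (TP = 3, FP = 2, FN = 8, TN = 87):

```python
# Counts from the 100-client Naive Bayes example above.
TP, FP, FN, TN = 3, 2, 8, 87

accuracy = (TP + TN) / (TP + TN + FP + FN) * 100
tpr = TP / (TP + FN) * 100   # recall
fpr = FP / (FP + TN) * 100   # false alarm rate
fnr = FN / (TP + FN) * 100   # miss rate; TPR + FNR = 100
precision = TP / (TP + FP) * 100

print(accuracy, precision)           # 90.0 60.0
print(round(tpr, 1), round(fnr, 1))  # 27.3 72.7
```

The high accuracy next to the low recall already hints at the point made later in the post: accuracy alone can look good on an imbalanced dataset.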


Now let’s see how to create a confusion matrix in R and analyze the performance of a classifier.

Here we will be using the Bank Marketing dataset discussed above.

First importing the libraries and the dataset

#importing libraries
library(randomForest)

#importing the bank marketing dataset
bank <- read.csv("Bank.csv", sep=";")

The bank marketing dataset looks as below

> head(bank)
  age         job marital education default balance housing loan  contact day month
1  30  unemployed married   primary      no    1787      no   no cellular  19   oct
2  33    services married secondary      no    4789     yes  yes cellular  11   may
3  35  management  single  tertiary      no    1350     yes   no cellular  16   apr
4  30  management married  tertiary      no    1476     yes  yes  unknown   3   jun
5  59 blue-collar married secondary      no       0     yes   no  unknown   5   may
6  35  management  single  tertiary      no     747      no   no cellular  23   feb
  duration campaign pdays previous poutcome  y
1       79        1    -1        0  unknown no
2      220        1   339        4  failure no
3      185        1   330        1  failure no
4      199        4    -1        0  unknown no
5      226        1    -1        0  unknown no
6      141        2   176        3  failure no

Now let’s explore the structure of the dataset.

str(bank)
## 'data.frame': 4521 obs. of 17 variables:
## $ age      : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance  : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day      : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays    : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Now let’s create a new dataframe bank_new with only Age, Job, Marital, Education, Housing, Loan, contact, poutcome and y.

Converting all these variables except the outcome variable(y) to numeric and storing it in the new dataframe bank_new.

bank_new <- data.frame(as.numeric(as.factor(bank$age)),
                       as.numeric(as.factor(bank$job)),
                       as.numeric(as.factor(bank$marital)),
                       as.numeric(as.factor(bank$education)),
                       as.numeric(as.factor(bank$housing)),
                       as.numeric(as.factor(bank$loan)),
                       as.numeric(as.factor(bank$contact)),
                       as.numeric(as.factor(bank$poutcome)),
                       bank$y)

Renaming the columns as below

colnames(bank_new) <- c("Age", "Job", "Marital", "Education", "Housing", "Loan","contact","poutcome" ,"y")

Now let’s split the dataset into train and test data with 80% train data and the remaining 20% as test.

set.seed(2262)
train_ind <- sample(seq_len(nrow(bank_new)), size = floor(0.80 * nrow(bank_new)))
train <- bank_new[train_ind, ]
test <- bank_new[-train_ind, ]

Now let’s use the Random Forest algorithm to predict the customer classes y.

#random forest classifier
fitRF <- randomForest(y~., train)

Random Forest output:

> fitRF

Call:
 randomForest(formula = y ~ ., data = train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 11.31%
Confusion matrix:
      no yes  class.error
no  3169  25 0.007827176
yes  384  38 0.909952607

The Random Forest output has by default displayed the confusion matrix for the above predicted values of the train data. We will discuss more about the results of confusion matrix below.

Now let’s use the above model to predict the classes for test data.

#predicting the classes for the test data and storing the result in predicted
predicted <- predict(fitRF, test)

Now predicted has the predicted results and test$y has the actual results. Let's use them to create a confusion matrix using the confusionMatrix() function of the "caret" library.

#confusion matrix
library(caret)
confusionMatrix(predicted, test$y)

The results of the above confusion matrix looks as below

> confusionMatrix(predicted, test$y)
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  797  82
       yes   9  17

               Accuracy : 0.8994
                 95% CI : (0.878, 0.9183)
    No Information Rate : 0.8906
    P-Value [Acc > NIR] : 0.2137

                  Kappa : 0.2373
 Mcnemar's Test P-Value : 4.432e-14

            Sensitivity : 0.9888
            Specificity : 0.1717
         Pos Pred Value : 0.9067
         Neg Pred Value : 0.6538
             Prevalence : 0.8906
         Detection Rate : 0.8807
   Detection Prevalence : 0.9713
      Balanced Accuracy : 0.5803

       'Positive' Class : no

As we can see, the accuracy of the above predicted results is 0.8994, i.e., 89.9% of the predictions are correctly classified.

Now let's see how to create another confusion matrix using the table() function.

confusion_table <- table(predicted, test$y)

The results of the above table look as below

> confusion_table

predicted  no yes
      no  797  82
      yes   9  17

Now let’s use the confusion table to calculate accuracy, precision and recall.

n <- sum(confusion_table)      # number of instances
diag <- diag(confusion_table)
accuracy <- sum(diag) / n      # Calculate the Accuracy

> accuracy [1] 0.8994475

So, as mentioned, 89.9% of the instances have been correctly classified using the above Random Forest model.

TP = confusion_table[2,2]
FP = confusion_table[2,1]
FN = confusion_table[1,2]
precision <- TP/(TP+FP)  # Calculate the Precision

> precision [1] 0.6538462

The above precision means that only 65.38% of the customers belong to the actual "subscribed" class among all the customers predicted to be "subscribed". Now let's see the recall results.

recall <- TP/(TP+FN) # Calculate the Recall

> recall [1] 0.1717172

The above recall means that only 17% of the “subscribed” customers have been correctly classified as “subscribed”.

From the above results, although the overall accuracy of the above model seems to be good, the precision and recall results show that the model still needs some improvement in predicting the positive instances.

In this article we discussed the confusion matrix and its various terminologies. We also discussed how to create a confusion matrix in R using the confusionMatrix() and table() functions, and analyzed the results using accuracy, recall and precision.

Hope this article helped you get a good understanding about Confusion Matrix. Do let me know your feedback about this article below.

The post Confusion Matrix in R appeared first on Honing Data Science.

The post Weighted Least Squares appeared first on Honing Data Science.

To get a better understanding of Weighted Least Squares (WLS), let's first see what Ordinary Least Squares (OLS) is and how it differs from Weighted Least Squares.

In a simple linear regression model of the form

yᵢ = β₀ + β₁xᵢ + εᵢ

where

- xᵢ is the independent variable,
- yᵢ is the dependent variable,
- β₀ and β₁ are the regression coefficients, and
- εᵢ is the random error or the residual.

The goal is to find a line that best fits the relationship between the outcome variable y and the input variable x. With OLS, the linear regression model finds the line through these points such that the sum of the squares of the difference between the actual and predicted values is minimum, i.e., to find β₀ and β₁ such that

Σᵢ (yᵢ − β₀ − β₁xᵢ)²

is minimum.

In such linear regression models, the OLS assumes that the error terms or the residuals (the difference between actual and predicted values) are normally distributed with mean zero and constant variance. This constant variance condition is called homoscedasticity.

If this assumption of homoscedasticity does not hold, the various inferences made with this model might not be true.

To check for constant variance across all values along the regression line, a simple plot of the residuals and the fitted outcome values and the histogram of residuals such as below can be used.

In an ideal case, with normally distributed error terms with mean zero and constant variance, the plots should look like this.

From the above plots it is clearly seen that the error terms are evenly distributed on both sides of the zero reference line, suggesting that they are normally distributed with mean = 0 and constant variance.

The histogram of the residuals also has data points symmetric on both sides, supporting the normality assumption.

In some cases, the variance of the error terms might be heteroscedastic, i.e., there might be changes in the variance of the error terms with increase/decrease in predictor variable.

In those cases of non-constant variance, Weighted Least Squares (WLS) can be used instead to estimate the coefficients of a linear regression model.

Now let’s see in detail about WLS and how it differs from OLS.

In a Weighted Least Squares model, instead of minimizing the residual sum of squares seen in Ordinary Least Squares,

Σᵢ (yᵢ − β₀ − β₁xᵢ)²

it minimizes the sum of squares after attaching a weight to each term, as shown below:

Σᵢ wᵢ(yᵢ − β₀ − β₁xᵢ)²

where wᵢ is the weight for each observation i.

The idea behind weighted least squares is that observations with higher weights count for more: a given residual is penalized more when its observation carries a big weight than when it carries a small one.

Note: OLS can be considered as a special case of WLS with all the weights equal to 1.

The weighted least squares estimates in this case are given as

β̂₁ = Σ wᵢ(xᵢ − x̄w)(yᵢ − ȳw) / Σ wᵢ(xᵢ − x̄w)²
β̂₀ = ȳw − β̂₁x̄w

where the weighted means are

x̄w = Σ wᵢxᵢ / Σ wᵢ,   ȳw = Σ wᵢyᵢ / Σ wᵢ

Suppose we consider a model where the weights are taken as wᵢ = 1/σᵢ². Then the residual sum of squares of the transformed model looks as below:

Σᵢ (1/σᵢ²)(yᵢ − β₀ − β₁xᵢ)²
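The weighted-mean estimates above can be sanity-checked numerically. The sketch below (in Python, with randomly generated data standing in for a real dataset) computes β̂₀ and β̂₁ from the weighted means and confirms they agree with an ordinary least squares fit on the √wᵢ-scaled data, which is how WLS is usually solved in practice:

```python
import numpy as np

# Made-up heteroscedastic data: noise grows with x, so use w_i = 1 / x_i**2.
rng = np.random.default_rng(0)
x = rng.uniform(10, 25, size=30)
y = 17.0 + 3.4 * x + x * rng.normal(0.0, 0.3, size=30)
w = 1.0 / x**2

# Closed-form WLS estimates via the weighted means.
xw = np.sum(w * x) / np.sum(w)
yw = np.sum(w * y) / np.sum(w)
b1 = np.sum(w * (x - xw) * (y - yw)) / np.sum(w * (x - xw) ** 2)
b0 = yw - b1 * xw

# Cross-check: WLS is OLS on the sqrt(w)-scaled design matrix.
sw = np.sqrt(w)
A = np.column_stack([sw, sw * x])
coef, *_ = np.linalg.lstsq(A, sw * y, rcond=None)

print(np.allclose([b0, b1], coef))  # True
```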

To understand WLS better let’s implement it in R. Here we have used the Computer assisted learning dataset which contains the records of students who had done computer assisted learning. The variables include

cost – the cost of used computer time (in cents) and

num.responses – the number of responses in completing the lesson

Let’s first download the dataset from the ‘HoRM’ package.

install.packages('HoRM')
library(HoRM)
data(compasst)
attach(compasst)

> head(compasst, 6)
  num.responses cost
1            16   77
2            14   70
3            22   85
4            10   50
5            14   62
6            17   70

Let’s first use Ordinary Least Square in the lm function to predict the cost and visualize the results.

learning.lm <- lm(cost ~ num.responses, data = compasst)
plot(compasst$num.responses, compasst$cost)
abline(learning.lm, col = 'red')

The scatter plot of residuals vs responses is

plot(num.responses, learning.lm$residuals)

Clearly from the above two plots there seems to be a linear relationship between the input and outcome variables, but the standard deviation of the residuals seems to increase linearly with the response.

Also, the below histogram of residuals shows clear signs of non normally distributed error term.

#plotting the histogram of residuals
hist(learning.lm$residuals, main = "histogram of residuals")

Hence let’s use WLS in the lm function as below,

As mentioned above weighted least squares weighs observations with higher weights more and those observations with less important measurements are given lesser weights.

Hence weights inversely proportional to the variance of the observations are normally used for better predictions.

So, in this case, since the response is proportional to the standard deviation of the residuals,

σ² ∝ Response²

Let’s take the weights as

wᵢ = 1/Response²

Using the above weights in the lm function gives the predictions below.

w = 1/(num.responses^2)

#predicting cost by using WLS in lm function
learning.wlm <- lm(cost ~ num.responses, data = compasst, weight = w)

#results of learning.wlm
> summary(learning.wlm)

Call:
lm(formula = cost ~ num.responses, data = compasst, weights = w)

Weighted Residuals:
    Min      1Q  Median      3Q     Max
-0.3603 -0.2508 -0.0104  0.3052  0.3447

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    17.4530     4.8970   3.564  0.00515 **
num.responses   3.4100     0.3649   9.346 2.94e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2975 on 10 degrees of freedom
Multiple R-squared: 0.8973, Adjusted R-squared: 0.887
F-statistic: 87.34 on 1 and 10 DF, p-value: 2.945e-06

Whereas the results of OLS looks like this

#results of learning.lm
> summary(learning.lm)

Call:
lm(formula = cost ~ num.responses, data = compasst)

Residuals:
   Min     1Q Median     3Q    Max
-6.389 -3.536 -0.334  3.319  6.418

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    19.4727     5.5162   3.530  0.00545 **
num.responses   3.2689     0.3651   8.955 4.33e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.598 on 10 degrees of freedom
Multiple R-squared: 0.8891, Adjusted R-squared: 0.878
F-statistic: 80.19 on 1 and 10 DF, p-value: 4.33e-06

Comparing the residuals in both the cases, note that the residuals in the case of WLS is much lesser compared to those in the OLS model.

#OLS residuals
Residuals:
   Min     1Q Median     3Q    Max
-6.389 -3.536 -0.334  3.319  6.418

#WLS residuals
Weighted Residuals:
    Min      1Q  Median      3Q     Max
-0.3603 -0.2508 -0.0104  0.3052  0.3447

Now let’s compare the R-Squared values in both the cases.

> summary(learning.lm)$r.squared
[1] 0.8891177
> summary(learning.wlm)$r.squared
[1] 0.8972716

From the above R squared values it is clearly seen that adding weights to the lm model has improved the overall predictability.

Now let’s implement the same example in Python.

Let’s now import the same dataset which contains records of students who had done computer assisted learning. The dataset can be found here.

#importing libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm  #for OLS, WLS
import seaborn as sns
import matplotlib.pyplot as plt

#importing the dataset
learning = pd.read_csv("learning.csv")
learning.head()

Out[1]:
   num.responses  cost
0             16    77
1             14    70
2             22    85
3             10    50
4             14    62

The goal here is to predict the cost which is the cost of used computer time given the num.responses which is the number of responses in completing the lesson.

Now let’s first use Ordinary Least Square method to predict the cost.

#OLS
Y = learning.cost
X = learning["num.responses"]
learning_ols = sm.OLS(Y, X).fit()

Visualizing the results

#cost vs num.responses
Y_pred = learning_ols.predict(X)
plt.scatter(X, Y)
plt.xlabel("num.responses")
plt.ylabel("cost")
plt.plot(X, Y_pred, color='red')
plt.show()

The above scatter plot shows a linear relationship between cost and number of responses. Now let’s plot the residuals to check for constant variance(homoscedasticity).

#residual plot
sns.residplot(x=X, y=Y)

The above residual plot shows that the number of responses seems to increase linearly with the standard deviation of residuals, hence proving heteroscedasticity (non-constant variance).

Now let’s check the histogram of the residuals.

#Histogram of residuals
ax = plt.hist(learning_ols.resid)
plt.xlim(-40, 50)
plt.xlabel('Residuals')

The histogram of the residuals shows clear signs of non-normality. So, the above predictions, which were made under the assumption of normally distributed error terms with mean = 0 and constant variance, might be suspect.

Now let’s use Weighted Least Square method to predict the cost and see how the results vary.

#WLS
w = 1 / (learning["num.responses"] ** 2)   # note: use **, not ^ (^ is bitwise XOR in Python)
learning_wls = sm.WLS(Y, X, weights=w).fit()
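Under the hood, WLS minimizes the weighted sum of squared residuals. For a single predictor with no intercept (as in the statsmodels call above), the closed-form slope is b = Σwᵢxᵢyᵢ / Σwᵢxᵢ². A minimal pure-Python sketch of that formula, using just the five rows shown in learning.head() above (so the value is illustrative, not the full-dataset estimate):

```python
# Weighted least squares for a no-intercept model y ≈ b * x.
# Five rows from learning.head(); weights 1/x^2 mirror the
# 1/num.responses^2 weighting used above.
x = [16, 14, 22, 10, 14]
y = [77, 70, 85, 50, 62]
w = [1 / xi ** 2 for xi in x]

# closed form: b = sum(w_i * x_i * y_i) / sum(w_i * x_i^2)
b = (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
     / sum(wi * xi ** 2 for wi, xi in zip(w, x)))
print(round(b, 4))  # 4.6209
```

With these particular weights each term wᵢxᵢ² equals 1, so the weighted slope reduces to the average of the per-point ratios yᵢ/xᵢ.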

Comparing the R-squared values:

print('R2_ols: ', learning_ols.rsquared)
print('R2_WLS: ', learning_wls.rsquared)

R2_ols:  0.9915861646070941
R2_WLS:  0.9916229612643661

(Since these models were fit without an intercept, statsmodels reports uncentered R-squared, which is why the values are much higher than in the R output above.)

One of the biggest advantages of Weighted Least Square is that it gives better predictions on regression with datapoints of varying quality.

In a Weighted Least Square regression it is easy to remove an observation from the model by simply setting its weight to zero. Outliers or poorly measured observations can likewise be down-weighted to improve the overall performance of the model.
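As a sketch of that idea (pure Python, toy numbers rather than the learning dataset): giving an observation zero weight produces exactly the same fit as deleting it.

```python
# Weighted slope through the origin: b = sum(w*x*y) / sum(w*x^2)
def wls_slope(x, y, w):
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

x = [1, 2, 3, 4, 10]   # the last point is an outlier
y = [2, 4, 6, 8, 50]

# zeroing the outlier's weight...
b_zeroed = wls_slope(x, y, [1, 1, 1, 1, 0])
# ...matches dropping the observation entirely
b_dropped = wls_slope(x[:4], y[:4], [1, 1, 1, 1])
print(b_zeroed, b_dropped)  # 2.0 2.0
```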

One of the biggest disadvantages of Weighted Least Squares is that it is based on the assumption that the weights are known exactly. But exact weights are almost never known in real applications, so estimated weights must be used instead.

The effect of using estimated weights is difficult to assess, but experience indicates that small variations in the weights due to estimation do not often affect a regression analysis or its interpretation.

So, in this article we have learned what Weighted Least Square is, how it performs regression, when to use it, and how it differs from Ordinary Least Square. We have also implemented it in R and Python on the Computer Assisted Learning dataset and analyzed the results.

Hope this article helped you get an understanding about Weighted Least Square estimates.

Do let us know your comments and feedback about this article below.

The post Weighted Least Squares appeared first on Honing Data Science.


Topic modeling provides us with methods to organize, understand and summarize large collections of text data.

There are many approaches for obtaining topics from a text document. In this post, I will explain one of the widely used topic model called Latent Dirichlet Allocation (LDA).

Latent Dirichlet Allocation (LDA) is an example of topic model where each document is considered as a collection of topics and each word in the document corresponds to one of the topics.

So, given a document LDA basically clusters the document into topics where each topic contains a set of words which best describe the topic.

For example, consider the following product reviews:

Review 1: A Five Star Book: I just finished reading. I expected an average romance read, but instead I found one of my favorite books of all time. If you are a lover of romance novels then this is a must read.

Review 2: Delicious cookie mix: This is the first time I have ever tried baking with a cookie mix. Mixing up the dough can get VERY messy. However, with a cookie mix like this you have a lot of flexibility in the ratio of ingredients (I like to add some extra butter) and was able to make no mess super delicious cookies.

Review 3: A fascinating insight into the life of modern Japanese teens: I thoroughly enjoyed reading this book. Steven Wardell is clearly a talented young author, adopted for some of his schooling into this family of four teens, and thus able to view family life in Japan from the inside out. A great read!

In this case LDA considers each review as a document and finds the topics corresponding to these documents. Each topic group contains a set of words along with their percentage contribution to the topic.

In the case of above reviews, the results of LDA would be

Topic 1: 40% books, 30% read, 20% romance

Topic 2: 45% japan, 30% read, 20% author

Topic 3: 30% cookie, 30% mix, 20% delicious

From the above, we could interpret that Topic 3 is related to Review 2 and Topics 1 and 2 are partially related to Reviews 1 and 3.

To get a much better understanding let me explain this by implementing LDA in python.

The following are the steps to implement LDA in Python.

- Import the dataset.
- Preprocess the text data
- Create Gensim dictionary and corpus
- Building the Topic Model
- Analyze the results
- Dominant topic within documents

Here we will be using the Amazon reviews dataset which contains the customer reviews of different amazon products.

import pandas as pd
import numpy as np

#read the csv file with amazon reviews
reviews_df = pd.read_csv('reviews.csv', error_bad_lines=False)
reviews_df['Reviews'] = reviews_df['Reviews'].astype(str)
reviews_df.head(6)

Importing text preprocessing libraries

#text processing
import re
import string
import nltk
from gensim import corpora, models, similarities
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

Here we are using three functions to preprocess the text data.

The initial_clean function performs an initial clean by removing punctuation, lowercasing the text and tokenizing it.

def initial_clean(text):
    """
    Function to clean text - remove punctuation, lowercase text etc.
    """
    text = re.sub("[^a-zA-Z ]", "", text)
    text = text.lower()                # lower case text
    text = nltk.word_tokenize(text)
    return text

The words are then tokenized, i.e. the text is split into individual words. For example, for the text below:

text = "All work and no play makes jack a dull boy, all work and no play"

The cleaned and tokenized output would be

initial_clean(text)
['all', 'work', 'and', 'no', 'play', 'makes', 'jack', 'a', 'dull', 'boy', 'all', 'work', 'and', 'no', 'play']

The next function, remove_stop_words(), removes all the stop words from the text data. Stopwords are the most commonly used words in the English language, such as "the", "an", "is", etc.

It is common to remove these stopwords from text data as they could be considered as noise or distracting features when used in text algorithms.

stop_words = stopwords.words('english')
stop_words.extend(['news', 'say', 'use', 'not', 'would', 'could', '_', 'be', 'know',
                   'good', 'go', 'get', 'do', 'took', 'time', 'year', 'done', 'try',
                   'many', 'some', 'nice', 'thank', 'think', 'see', 'rather', 'easy',
                   'easily', 'lot', 'lack', 'make', 'want', 'seem', 'run', 'need',
                   'even', 'right', 'line', 'also', 'may', 'take', 'come', 'new',
                   'said', 'like', 'people'])

def remove_stop_words(text):
    return [word for word in text if word not in stop_words]

The next function, stem_words(), stems words to their base forms to reduce variant forms of words.

For example, the sentence "obesity causes many problems" would be stemmed as "obes caus mani problem". Here we are using the Porter stemming algorithm to perform stemming.

stemmer = PorterStemmer()

def stem_words(text):
    """
    Function to stem words
    """
    try:
        text = [stemmer.stem(word) for word in text]
        text = [word for word in text if len(word) > 1]  # no single letter words
    except IndexError:
        pass
    return text

Applying all the above preprocessing steps using apply_all() function.

def apply_all(text):
    """
    This function applies all the functions above into one
    """
    return stem_words(remove_stop_words(initial_clean(text)))

# clean reviews and create new column "tokenized"
import time

t1 = time.time()
reviews_df['tokenized_reviews'] = reviews_df['Reviews'].apply(apply_all)
t2 = time.time()
print("Time to clean and tokenize", len(reviews_df), "reviews:", (t2 - t1) / 60, "min")

#Time to clean and tokenize 3209 reviews: 0.21254388093948365 min

The new cleaned and tokenized data looks as below.

Importing the LDA Gensim libraries

#LDA
import gensim
import pyLDAvis.gensim

To perform topic modeling using LDA the two main inputs are the dictionary (id2word) and the corpus. Here we are using the gensim library for building both.

In Gensim, the words are referred to as "tokens" and the index of each word in the dictionary is called its "id". The dictionary is simply the collection of unique word-ids, and the corpus is the mapping of (word_id, word_frequency) pairs. Let's create them as below.

#Create a Gensim dictionary from the tokenized data
tokenized = reviews_df['tokenized_reviews']

#Creating term dictionary of corpus, where each unique term is assigned an index.
dictionary = corpora.Dictionary(tokenized)

#Filter terms which occur in less than 1 review or in more than 80% of the reviews.
dictionary.filter_extremes(no_below=1, no_above=0.8)

#convert the dictionary to a bag of words corpus
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
print(corpus[:1])

Below is the output

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 1), (13, 1)]]

In the above output of the corpus, (0, 1) implies that word-id 0 has occurred only once in the first document, (1, 1) implies that word-id 1 has occurred once, and so on. It simply maps word-ids to their frequency of occurrence.
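Conceptually, doc2bow just counts token ids per document. The following pure-Python stand-in (an illustration only, not gensim's actual implementation) shows what the (word_id, word_frequency) pairs mean:

```python
from collections import Counter

# Hypothetical stand-in for gensim's Dictionary + doc2bow.
tokens = ['small', 'read', 'small', 'kindl', 'recommend']

# assign an integer id to every unique token (gensim does this internally)
word2id = {w: i for i, w in enumerate(sorted(set(tokens)))}

# map the document to sorted (word_id, frequency) pairs, like doc2bow
bow = sorted((word2id[w], n) for w, n in Counter(tokens).items())
print(bow)  # [(0, 1), (1, 1), (2, 1), (3, 2)]
```

Here 'small' appears twice, so its (id, frequency) pair carries a count of 2, just like word-id 11 in the real corpus output above.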

Let’s find the corpus with the words and their frequencies using the below code.

[[(dictionary[id], freq) for id, freq in cp] for cp in corpus[:1]]

The output is:

[[('big', 1), ('comfort', 1), ('definit', 1), ('instead', 1), ('kindl', 1), ('palm', 1), ('paper', 1), ('paperwhit', 1), ('read', 1), ('recommend', 1), ('regular', 1), ('small', 2), ('thought', 1), ('turn', 1)]]

The corpus output thus created as above is also called the Document Term Matrix and is given as input for the LDA topic model.

#LDA
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=7, id2word=dictionary, passes=15)
ldamodel.save('model_combined.gensim')

topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

Here num_topics is the number of topics to be created and passes is the number of times to iterate through the entire corpus.

The LDA algorithm creates two matrices called the document-topic matrix and a topic-words matrix.

The Topic-Words matrix contains the probability distribution of words generated from those topics. Running the LDA algorithm on the above data produces the output below.

(0, '0.046*"echo" + 0.033*"alexa" + 0.026*"show" + 0.025*"music"')
(1, '0.049*"read" + 0.047*"book" + 0.040*"kindl" + 0.029*"love"')
(2, '0.042*"kid" + 0.023*"great" + 0.018*"tablet" + 0.014*"set"')
(3, '0.025*"work" + 0.024*"great" + 0.023*"amazon" + 0.022*"app"')
(4, '0.029*"kindl" + 0.017*"read" + 0.016*"one" + 0.015*"screen"')
(5, '0.107*"love" + 0.065*"bought" + 0.040*"gift" + 0.038*"one"')
(6, '0.088*"tablet" + 0.051*"great" + 0.031*"price" + 0.026*"fire"')

This output shows the Topic-Words matrix for the 7 topics created and the 4 words within each topic which best describe them. From the above output we could guess that each topic and its corresponding words revolve around a common theme (e.g., topic 0 is related to Alexa and Echo's music, whereas topic 1 is about reading books on the Amazon Kindle).
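To make the idea concrete, here is a small sketch using hypothetical probabilities: one row of the topic-words matrix is a probability distribution over the whole vocabulary (summing to 1), and print_topics simply reports its highest-probability entries.

```python
# One hypothetical row of the topic-words matrix: P(word | topic).
# A real row assigns a probability to every vocabulary word; only a
# handful of illustrative entries are shown here.
topic_row = {'read': 0.049, 'book': 0.047, 'kindl': 0.040,
             'love': 0.029, 'screen': 0.010}

# the "top words" of a topic are just its highest-probability entries
top3 = sorted(topic_row, key=topic_row.get, reverse=True)[:3]
print(top3)  # ['read', 'book', 'kindl']
```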

Document-Topic matrix contains the probability distribution of the topics present in the documents. Now, let’s use the Document-Topic matrix to find the probability distribution of the topics present in each document.

get_document_topics = ldamodel.get_document_topics(corpus[0])
print(get_document_topics)

Running the above code on the first review:

“I thought it would be as big as small paper but turn out to be just like my palm. I think it is too small to read on it… not very comfortable as regular Kindle. Would definitely recommend a paperwhite instead.”

The topic proportions produced are

[(4, 0.94627726)]

It is clearly evident from the output that the above review, which speaks about the readability of Kindle screens, is 95% related to Topic 4 (4, '0.029*"kindl" + 0.017*"read" + 0.016*"one" + 0.015*"screen"'), which seems pretty accurate.
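Since get_document_topics returns a list of (topic_id, proportion) pairs, a document's dominant topic is simply the pair with the largest proportion. A minimal sketch with hypothetical values:

```python
# Hypothetical get_document_topics-style output: (topic_id, proportion) pairs
doc_topics = [(1, 0.031), (4, 0.946), (6, 0.020)]

# the dominant topic is the pair with the largest proportion
dominant = max(doc_topics, key=lambda t: t[1])
print(dominant)  # (4, 0.946)
```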

Using Gensim's pyLDAvis feature, the topics created can be visualized as below.

#visualizing topics
lda_viz = gensim.models.ldamodel.LdaModel.load('model_combined.gensim')  # the model saved earlier
lda_display = pyLDAvis.gensim.prepare(lda_viz, corpus, dictionary, sort_topics=True)
pyLDAvis.display(lda_display)

The above display shows the correlation between the topics as well as the top most relevant terms for each selected topic (topic 1 in this case).

Now, to get a much better idea and also to verify our results, let's create a function called dominant_topic() which finds the most dominant topic for each review and displays it along with its topic proportion and keywords.

def dominant_topic(ldamodel, corpus, texts):
    #Function to find the dominant topic in each review
    sent_topics_df = pd.DataFrame()

    # Get main topic in each review
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each review
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num, topn=4)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(
                    pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]),
                    ignore_index=True)
            else:
                break

    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df

df_dominant_topic = dominant_topic(ldamodel=ldamodel, corpus=corpus, texts=reviews_df['Reviews'])
df_dominant_topic.head()

From the above output it is clearly seen that the topics created and their percentage contributions relate closely to the context of the reviews.

So, to summarize, in this article we explained about Topic Modeling using LDA, how it works, steps involved in creating an LDA topic model, visualizing the topics, finding dominant topics etc.

Hope the above article helped you get an overall idea about LDA topic modeling. Do let us know your comments and feedback about this article below.

The post Topic Modeling using Latent Dirichlet Allocation (LDA) appeared first on Honing Data Science.
