
Features are simply the columns (dimensions) of a dataset, and feature engineering is the process of creating new features or predictors based on domain knowledge or statistical principles. Feature engineering has always been part of machine learning, but the latest development is automated feature engineering, which has recently become a topic of research, with machine learning itself being used to create new features that can improve model accuracy. While most automated feature engineering addresses numeric data, text data has usually been left out of this race because of its inherently unstructured nature. No more, I could say.

Michael Kearney, Assistant Professor at the University of Missouri, well known in the R community for the modern Twitter package rtweet, has come up with a new R package called `textfeatures` that generates a bunch of features for any text data you supply. Before you dream of a deep-learning-based package for automated text feature engineering: this isn't that. It uses very simple text analysis principles to generate features like the number of upper-case letters and the number of punctuation characters. Plain, simple stuff, nothing fancy, but pretty useful.
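Under the hood these are simple counts. Here is a minimal base-R sketch (not the actual `textfeatures` implementation, just an illustration of the idea) of this kind of counting feature:

```r
# Count how many times a regex pattern matches in each string
count_pattern <- function(x, pattern) {
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

txt <- c("Great app!!", "Too many ads, 1 star.")
n_caps     <- count_pattern(txt, "[A-Z]")  # upper-case letters
n_exclaims <- count_pattern(txt, "!")      # exclamation marks
n_digits   <- count_pattern(txt, "[0-9]")  # digits
```

The package computes many such counts at once and returns them as columns of a data frame.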

`textfeatures` can be installed directly from CRAN, and the development version is available on GitHub.

```
install.packages("textfeatures")
```

In this post, we will use the `textfeatures` package to generate features for the UK reviews of the official FIFA World Cup iOS app. We will use the R package `itunesr` to extract the reviews and `tidyverse` for data manipulation and plotting.

Let us load all the required packages.

```
# install.packages("itunesr")
# install.packages("textfeatures")
# install.packages("tidyverse")
library(itunesr)
library(textfeatures)
library(tidyverse)
```

```
# Get UK reviews of the FIFA official World Cup iOS app
# https://itunes.apple.com/us/app/2018-fifa-world-cup-russia/id756904853?mt=8
reviews1 <- getReviews(756904853, "GB", 1)
reviews2 <- getReviews(756904853, "GB", 2)
reviews3 <- getReviews(756904853, "GB", 3)
reviews4 <- getReviews(756904853, "GB", 4)

# Combining all the reviews into one dataframe
reviews <- rbind(reviews1, reviews2, reviews3, reviews4)
```

Now that we have the reviews, let us allow `textfeatures` to do its magic, using the function `textfeatures()`.

```
# generate text features
feat <- textfeatures(reviews$Review)

# check what features were generated
glimpse(feat)
```

```
Observations: 200
Variables: 17
$ n_chars        149, 13, 263, 189, 49, 338, 210, 186, 76, 14, 142, 114, 242, ...
$ n_commas       1, 0, 0, 0, 0, 1, 2, 1, 1, 0, 1, 1, 0, 3, 0, 0, 1, 0, 3, 1, 0...
$ n_digits       0, 0, 6, 3, 0, 4, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0...
$ n_exclaims     0, 0, 2, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0...
$ n_extraspaces  1, 0, 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 0...
$ n_hashtags     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ n_lowers       140, 11, 225, 170, 46, 323, 195, 178, 70, 12, 129, 106, 233, ...
$ n_lowersp      0.9400000, 0.8571429, 0.8560606, 0.9000000, 0.9400000, 0.9557...
$ n_mentions     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ n_periods      2, 0, 5, 1, 0, 0, 3, 1, 2, 0, 2, 1, 1, 2, 0, 0, 4, 2, 0, 4, 0...
$ n_urls         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ n_words        37, 2, 55, 45, 12, 80, 42, 41, 14, 3, 37, 28, 50, 16, 15, 8, ...
$ n_caps         4, 1, 12, 8, 2, 7, 3, 4, 2, 2, 6, 4, 6, 2, 3, 1, 6, 4, 29, 9,...
$ n_nonasciis    0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ n_puncts       2, 1, 7, 0, 1, 3, 4, 1, 1, 0, 4, 2, 2, 0, 1, 0, 1, 1, 0, 7, 2...
$ n_capsp        0.03333333, 0.14285714, 0.04924242, 0.04736842, 0.06000000, 0...
$ n_charsperword 3.947368, 4.666667, 4.714286, 4.130435, 3.846154, 4.185185, 4...
```

As you can see above, `textfeatures` has created 17 new features. Note that the same set of features is generated for any text data you pass in.

For this post, we won't build a machine learning model, but these features can very well be used to build a classification model, such as sentiment classification or category classification.

But right now, we will just visualize the outcome with some features.

We can check whether there is any relationship between the number of characters and the number of characters per word, with respect to the review rating. A hypothesis could be that people who give a good rating don't write long reviews, or the other way around. We are not going to validate it here, just visualize it with a scatter plot.

```
# merging features with original reviews
reviews_all <- cbind(reviews, feat)

reviews_all %>%
  ggplot(aes(n_charsperword, n_chars, colour = Rating)) +
  geom_point()
```

Let's bring a different perspective to the same hypothesis with a different plot, comparing against the number of words instead of the number of characters.

```
reviews_all %>%
  ggplot(aes(n_charsperword, n_words)) +
  geom_point() +
  facet_wrap(~Rating) +
  stat_smooth()
```

Thus, you can use `textfeatures` to automatically generate new features and gain a better understanding of your text data. I hope this post helps you get started with this beautiful package; if you'd like to know more about text analysis, check out this tutorial by Julia Silge. The complete code used here is available on my GitHub.


So, let’s start!

First of all, we need a single list with all the results to facilitate the next steps. I am assuming at this step that you have already designed a model and can calculate predictions on your test set. So, my list contains the following objects:

- Index (row id, it can be a user_id, email, lead_id…)
- Tag (known label)
- Score (calculated with the model we are studying)

- Train set
- Test set

- nfolds, ntrees, max_depth, seed, sample_rate….

- log_loss
- auc

Once we automate our results object, we can start with our beautiful plots!

NOTE: when a function asks for 'tag', pass your classifier's **labels** (it works best with only 2 classes [binary], though it works with more if needed). When asked for 'score', you should input the model's resulting [continuous] **values** (scores / predictions).
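For illustration, here is a hypothetical `tag`/`score` pair of the shape these functions expect (simulated, not from a real model):

```r
# Simulated binary labels and predicted probabilities for 200 test rows.
set.seed(42)
tag <- sample(c("good", "bad"), 200, replace = TRUE)
score <- ifelse(tag == "good",
                rbeta(200, 4, 2),   # positives skewed towards 1
                rbeta(200, 2, 4))   # negatives skewed towards 0
```

Any vector of two labels plus a same-length vector of scores in [0, 1] works.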

I have always given importance to the density plot because it gives us visual information on skewness, distribution and our model’s facility to distinguish each class. Here we can see how the model has distributed both our categories, our whole test set and the cumulative of each category (the more separate, the better).

```
mplot_density <- function(tag, score, model_name = NA, subtitle = NA,
                          save = FALSE, file_name = "viz_distribution.png") {
  require(ggplot2)
  require(gridExtra)
  if (length(tag) != length(score)) {
    message("The tag and score vectors should be the same length.")
    stop(message(paste("Currently, tag has", length(tag),
                       "rows and score has", length(score))))
  }
  if (length(unique(tag)) != 2) {
    stop("This function is for binary models. You should only have 2 unique values for the tag value!")
  }
  out <- data.frame(tag = as.character(tag), score = as.numeric(score))
  p1 <- ggplot(out) + theme_minimal() +
    geom_density(aes(x = 100 * score, group = tag, fill = as.character(tag)),
                 alpha = 0.6, adjust = 0.25) +
    guides(fill = guide_legend(title = "Tag")) +
    xlim(0, 100) +
    labs(title = "Score distribution for binary model",
         y = "Density by tag", x = "Score")
  p2 <- ggplot(out) + theme_minimal() +
    geom_density(aes(x = 100 * score),
                 alpha = 0.9, adjust = 0.25, fill = "deepskyblue") +
    labs(x = "", y = "Density")
  p3 <- ggplot(out) + theme_minimal() +
    geom_line(aes(x = score * 100, y = 100 * (1 - ..y..),
                  color = as.character(tag)),
              stat = 'ecdf', size = 1) +
    geom_line(aes(x = score * 100, y = 100 * (1 - ..y..)),
              stat = 'ecdf', size = 0.5, colour = "black", linetype = "dotted") +
    ylab('Cumulative') + xlab('') +
    guides(color = FALSE)
  if (!is.na(subtitle)) p1 <- p1 + labs(subtitle = subtitle)
  if (!is.na(model_name)) p1 <- p1 + labs(caption = model_name)
  if (save == TRUE) {
    png(file_name, height = 1800, width = 2100, res = 300)
    grid.arrange(p1, p2, p3, ncol = 2, nrow = 2, heights = 2:1,
                 layout_matrix = rbind(c(1, 1), c(2, 3)))
    dev.off()
  }
  return(grid.arrange(p1, p2, p3, ncol = 2, nrow = 2, heights = 2:1,
                      layout_matrix = rbind(c(1, 1), c(2, 3))))
}
```

The ROC curve gives us an idea of how our model is performing on the test set. You should know by now that if the AUC is close to 50%, the model is as good as a random selector; on the other hand, if the AUC is near 100%, you have a "perfect model" (wanting it or not, you must have been giving the model the answer this whole time!). So it is always good to look at this plot and verify that we are getting a reasonable area under the curve, with a nice, tight 95% confidence range.

```
# ROC Curve
mplot_roc <- function(tag, score, model_name = NA, subtitle = NA,
                      interval = 0.2, plotly = FALSE,
                      save = FALSE, file_name = "viz_roc.png") {
  require(pROC)
  require(ggplot2)
  if (length(tag) != length(score)) {
    message("The tag and score vectors should be the same length.")
    stop(message(paste("Currently, tag has", length(tag),
                       "rows and score has", length(score))))
  }
  roc <- pROC::roc(tag, score, ci = T)
  coords <- data.frame(x = rev(roc$specificities),
                       y = rev(roc$sensitivities))
  ci <- data.frame(roc$ci, row.names = c("min", "AUC", "max"))
  p <- ggplot(coords, aes(x = x, y = y)) +
    geom_line(colour = "deepskyblue", size = 1) +
    geom_point(colour = "blue3", size = 0.9, alpha = 0.4) +
    geom_segment(aes(x = 0, y = 1, xend = 1, yend = 0),
                 alpha = 0.2, linetype = "dotted") +
    scale_x_reverse(name = "% Specificity [False Positive Rate]",
                    limits = c(1, 0), breaks = seq(0, 1, interval),
                    expand = c(0.001, 0.001)) +
    scale_y_continuous(name = "% Sensitivity [True Positive Rate]",
                       limits = c(0, 1), breaks = seq(0, 1, interval),
                       expand = c(0.001, 0.001)) +
    theme_minimal() +
    theme(axis.ticks = element_line(color = "grey80")) +
    coord_equal() +
    ggtitle("ROC Curve: AUC") +
    annotate("text", x = 0.25, y = 0.10, vjust = 0, size = 4.2,
             label = paste("AUC =", round(100 * ci[c("AUC"), ], 2))) +
    annotate("text", x = 0.25, y = 0.05, vjust = 0, size = 2.8,
             label = paste0("95% CI: ", round(100 * ci[c("min"), ], 2), "-",
                            round(100 * ci[c("max"), ], 2)))
  if (!is.na(subtitle)) p <- p + labs(subtitle = subtitle)
  if (!is.na(model_name)) p <- p + labs(caption = model_name)
  if (plotly == TRUE) {
    require(plotly)
    p <- ggplotly(p)
  }
  if (save == TRUE) {
    p <- p + ggsave(file_name, width = 6, height = 6)
  }
  return(p)
}
```

If we had to cut the score into n equal-sized buckets, what would the score cuts be? Is the result a ladder (as it should be), a huge wall, or a valley? Is our score distribution linear and easy to split?

```
mplot_cuts <- function(score, splits = 10, subtitle = NA, model_name = NA,
                       save = FALSE, file_name = "viz_ncuts.png") {
  require(ggplot2)
  if (splits > 25) {
    stop("You should try with less splits!")
  }
  deciles <- quantile(score, probs = seq((1 / splits), 1, length = splits),
                      names = TRUE)
  deciles <- data.frame(cbind(Deciles = row.names(as.data.frame(deciles)),
                              Threshold = as.data.frame(deciles)))
  p <- ggplot(deciles, aes(x = reorder(Deciles, deciles), y = deciles * 100,
                           label = round(100 * deciles, 1))) +
    geom_col(fill = "deepskyblue") +
    xlab('') + theme_minimal() + ylab('Score') +
    geom_text(vjust = 1.5, size = 3, inherit.aes = TRUE,
              colour = "white", check_overlap = TRUE) +
    labs(title = paste("Cuts by score: using", splits, "equal-sized buckets"))
  if (!is.na(subtitle)) p <- p + labs(subtitle = subtitle)
  if (!is.na(model_name)) p <- p + labs(caption = model_name)
  if (save == TRUE) {
    p <- p + ggsave(file_name, width = 6, height = 6)
  }
  return(p)
}
```

This parameter is the easiest to sell to the C-level guys. “Did you know that with this model, if we chop the worst 20% of leads we would have avoided 60% of the frauds and only lose 8% of our sales?” That’s what this plot will give you.

The math behind the plot might be a bit foggy for some readers, so let me explain further: if you sort all your observations / people / leads from the lowest to the highest score, you can literally select, for instance, the top 5% or the bottom 15%. What we do next is split all those ranked rows into similar-sized buckets to get the best bucket, the second best, and so on. Then, if you split the "Goods" and the "Bads" into two columns, keeping their buckets' colours, the data is still sorted and separated, right? To conclude: if you decided to take action on the worst 20% of cases (all from the same worst colour and bucket), how many of each label would that represent in your test set? There you go!
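The "ranked rows into similar-sized buckets" step can be sketched in base R (this mimics what `dplyr::ntile()` does inside mplot_splits):

```r
set.seed(123)
score <- runif(100)   # 100 simulated scores
splits <- 5

# Rank the scores, then assign each rank to one of 5 equal-sized buckets
buckets <- ceiling(rank(score, ties.method = "first") / (length(score) / splits))
table(buckets)        # 20 observations per bucket
```

Bucket 1 holds the 20 lowest-scored rows, bucket 5 the 20 highest.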

```
mplot_splits <- function(tag, score, splits = 5, subtitle = NA, model_name = NA,
                         facet = NA, save = FALSE, file_name = "viz_splits.png") {
  require(ggplot2)
  require(dplyr)
  require(RColorBrewer)
  if (length(tag) != length(score)) {
    message("The tag and score vectors should be the same length.")
    stop(message(paste("Currently, tag has", length(tag),
                       "rows and score has", length(score))))
  }
  if (splits > 10) {
    stop("You should try with less splits!")
  }
  df <- data.frame(tag, score, facet)
  npersplit <- round(nrow(df) / splits)
  names <- df %>%
    mutate(quantile = ntile(score, splits)) %>%
    group_by(quantile) %>%
    summarise(n = n(),
              max_score = round(100 * max(score), 1),
              min_score = round(100 * min(score), 1)) %>%
    mutate(quantile_tag = paste0(quantile, " (", min_score, "-", max_score, ")"))
  p <- df %>%
    mutate(quantile = ntile(score, splits)) %>%
    group_by(quantile, facet, tag) %>%
    tally() %>%
    ungroup() %>%
    group_by(facet, tag) %>%
    arrange(desc(quantile)) %>%
    mutate(p = round(100 * n / sum(n), 2),
           cum = cumsum(100 * n / sum(n))) %>%
    left_join(names, by = c("quantile")) %>%
    ggplot(aes(x = as.character(tag), y = p, label = as.character(p),
               fill = as.character(quantile_tag))) +
    theme_minimal() +
    geom_col(position = "stack") +
    geom_text(size = 3, position = position_stack(vjust = 0.5),
              check_overlap = TRUE) +
    xlab("Tag") + ylab("Total Percentage by Tag") +
    guides(fill = guide_legend(title = paste0("~", npersplit, " p/split"))) +
    labs(title = "Tag vs Score Splits Comparison") +
    scale_fill_brewer(palette = "Spectral")
  if (!is.na(subtitle)) p <- p + labs(subtitle = subtitle)
  if (!is.na(model_name)) p <- p + labs(caption = model_name)
  if (!is.na(facet)) p <- p + facet_grid(. ~ facet, scales = "free")
  if (save == TRUE) {
    p <- p + ggsave(file_name, width = 6, height = 6)
  }
  return(p)
}
```

Once we have defined the functions above, we can create a new one that brings everything together into a single plot. If you pay attention to the variables needed to create this dashboard, you will notice it actually only needs two: the label or tag, and the score. You can also customize the splits for the upper-right plot, set a subtitle, define the model's name, save the output in a new folder, or change the image's name.

```
mplot_full <- function(tag, score, splits = 8, subtitle = NA, model_name = NA,
                       save = FALSE, file_name = "viz_full.png", subdir = NA) {
  require(ggplot2)
  require(gridExtra)
  options(warn = -1)
  if (length(tag) != length(score)) {
    message("The tag and score vectors should be the same length.")
    stop(message(paste("Currently, tag has", length(tag),
                       "rows and score has", length(score))))
  }
  p1 <- mplot_density(tag = tag, score = score,
                      subtitle = subtitle, model_name = model_name)
  p2 <- mplot_splits(tag = tag, score = score, splits = splits)
  p3 <- mplot_roc(tag = tag, score = score)
  p4 <- mplot_cuts(score = score)
  if (save == TRUE) {
    if (!is.na(subdir)) {
      dir.create(file.path(getwd(), subdir))
      file_name <- paste(subdir, file_name, sep = "/")
    }
    png(file_name, height = 2000, width = 3200, res = 300)
    grid.arrange(p1, p2, p3, p4, widths = c(1.3, 1),
                 layout_matrix = rbind(c(1, 2), c(1, 2), c(1, 3), c(4, 3)))
    dev.off()
  }
  return(grid.arrange(p1, p2, p3, p4, widths = c(1.3, 1),
                      layout_matrix = rbind(c(1, 2), c(1, 2), c(1, 3), c(4, 3))))
}
```

That's it. This dashboard gives us almost everything we need to visually evaluate our model's performance on the test set.

One bonus tip for these plots: you can set the subtitle and subdirectory before you plot everything so you don’t have to change it whenever you are trying a new model.

```
subtitle <- paste(results$project, "- AUC:", round(100 * results$auc_test, 2))
subdir <- paste0("Models/", round(100 * results$auc_test, 2), "-", results$model_name)
```

If you are working with an ML algorithm that lets you see the importance of each variable, you can use the following function to plot the results:

```
mplot_importance <- function(var, imp, colours = NA, limit = 15, model_name = NA,
                             subtitle = NA, save = FALSE,
                             file_name = "viz_importance.png", subdir = NA) {
  require(ggplot2)
  require(gridExtra)
  options(warn = -1)
  if (length(var) != length(imp)) {
    message("The variables and importance values vectors should be the same length.")
    stop(message(paste("Currently, there are", length(var), "variables and",
                       length(imp), "importance values!")))
  }
  if (is.na(colours)) {
    colours <- "deepskyblue"
  }
  out <- data.frame(var = var, imp = imp, Type = colours)
  if (length(var) < limit) {
    limit <- length(var)
  }
  output <- out[1:limit, ]
  p <- ggplot(output, aes(x = reorder(var, imp), y = imp * 100,
                          label = round(100 * imp, 1))) +
    geom_col(aes(fill = Type), width = 0.1) +
    geom_point(aes(colour = Type), size = 6) +
    coord_flip() + xlab('') + theme_minimal() + ylab('Importance') +
    geom_text(hjust = 0.5, size = 2, inherit.aes = TRUE, colour = "white") +
    labs(title = paste0("Variables Importances. (", limit, " / ",
                        length(var), " plotted)"))
  if (length(unique(output$Type)) == 1) {
    p <- p + geom_col(fill = colours, width = 0.2) +
      geom_point(colour = colours, size = 6) +
      guides(fill = FALSE, colour = FALSE) +
      geom_text(hjust = 0.5, size = 2, inherit.aes = TRUE, colour = "white")
  }
  if (!is.na(model_name)) p <- p + labs(caption = model_name)
  if (!is.na(subtitle)) p <- p + labs(subtitle = subtitle)
  if (save == TRUE) {
    if (!is.na(subdir)) {
      dir.create(file.path(getwd(), subdir))
      file_name <- paste(subdir, file_name, sep = "/")
    }
    p <- p + ggsave(file_name, width = 7, height = 6)
  }
  return(p)
}
```

Hope you guys enjoyed this post; any further comments or suggestions are more than welcome. Not a programmer here, but I surely enjoy sharing my code and ideas! Feel free to connect with me on LinkedIn and/or write below in the comments.

You can also install my library, `lares`, which already has these functions in it:

```
devtools::install_github("laresbernardo/lares")
```


The function does the following:

- Cleans the data of NAs and blanks
- Separates the clean data into type-specific data frames: integer, double, factor, numeric, and factor-and-numeric
- Views the new data frames
- Creates a view of the `summary` and `describe` output of the clean data
- Creates histograms of the data frames
- Saves all the objects

This will happen in seconds.

First, load the `Hmisc` package. I always save the original file.

The line below is the engine that cleans the data file:

```
cleandata <- dataname[complete.cases(dataname), ]
```
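A quick toy example of what `complete.cases()` does here:

```r
# complete.cases() is TRUE only for rows with no missing values
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
clean <- df[complete.cases(df), ]
nrow(clean)  # only the first row survives
```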

The function is below. Copy the code and save it in an R file; run the code, and the function `cleanme` will appear.

```
cleanme <- function(dataname) {
  # SAVE THE ORIGINAL FILE
  oldfile <- write.csv(dataname, file = "oldfile.csv", row.names = FALSE, na = "")

  # CLEAN THE FILE, SAVE THE CLEAN FILE, IMPORT IT, AND CONVERT TO A DATA FRAME
  cleandata <- dataname[complete.cases(dataname), ]
  cleanfile <- write.csv(cleandata, file = "cleanfile.csv", row.names = FALSE, na = "")
  cleanfileread <- read.csv(file = "cleanfile.csv")
  cleanfiledata <- as.data.frame(cleanfileread)

  # SUBSETTING THE DATA BY TYPE
  logicmeint <- cleanfiledata[, sapply(cleanfiledata, is.integer)]
  logicmedouble <- cleanfiledata[, sapply(cleanfiledata, is.double)]
  logicmefactor <- cleanfiledata[, sapply(cleanfiledata, is.factor)]
  logicmenum <- cleanfiledata[, sapply(cleanfiledata, is.numeric)]
  mainlogicmefactors <- cleanfiledata[, sapply(cleanfiledata, is.factor) |
                                        sapply(cleanfiledata, is.numeric)]

  # VIEW ALL FILES
  View(cleanfiledata)
  View(logicmeint)
  View(logicmedouble)
  View(logicmefactor)
  View(logicmenum)
  View(mainlogicmefactors)
  # describeFast(mainlogicmefactors)

  # ANALYTICS OF THE MAIN DATAFRAME
  cleansum <- summary(cleanfiledata)
  print(cleansum)
  cleandec <- describe(cleanfiledata)
  print(cleandec)

  # ANALYTICS OF THE FACTOR DATAFRAME
  factorsum <- summary(logicmefactor)
  print(factorsum)
  factordec <- describe(logicmefactor)
  print(factordec)

  # ANALYTICS OF THE NUMBER DATAFRAME
  numbersum <- summary(logicmenum)
  print(numbersum)
  numberdec <- describe(logicmenum)
  print(numberdec)

  mainlogicmefactorsdec <- describe(mainlogicmefactors)
  print(mainlogicmefactorsdec)
  mainlogicmefactorssum <- describe(mainlogicmefactors)
  print(mainlogicmefactorssum)

  # savemenow <- saveRDS("cleanmework.rds")
  # readnow <- readRDS(savemenow)

  # HISTOGRAM PLOTS OF ALL TYPES
  hist(cleanfiledata)
  hist(logicmeint)
  hist(logicmedouble)
  hist(logicmefactor)
  hist(logicmenum)
  # plot(mainlogicmefactors)

  save(cleanfiledata, logicmeint, mainlogicmefactors, logicmedouble,
       logicmefactor, logicmenum, numberdec, numbersum, factordec, factorsum,
       cleandec, oldfile, cleandata, cleanfile, cleanfileread,
       file = "cleanmework.RData")
}
```

Type in and run:

```
cleanme(dataname)
```

When all the data frames appear, type the following to load the workspace as objects.

```
load("cleanmework.RData")
```

Enjoy


The first step after data gathering and manipulation is creating your linear model by selecting the two numeric variables.

For a small dataset, and with a little bit of domain knowledge, we can find such critical variables and start our analysis. But at times, some pre-examination can make our variable selection simpler. This step can be referred to as the selection of prominent variables for our "LM" model.

So in this post, we will discuss two very straightforward methods that can visually help us identify correlations between all available variables in our dataset.

We will analyze the most basic and universal dataset, `iris`.

```
summary(iris)
```

```
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
```

```
str(iris)
```

```
## 'data.frame': 150 obs. of 5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

Use of the `pairs(dataset)` method

```
pairs(iris)
```

From the above results for `mtcars` and `iris`, we can see that the `pairs()` method produces a matrix of scatterplots between all possible variables of our dataset. How do we read this matrix to gain insights? Every variable constitutes one row and one column of the pairs matrix. For example, say we want to see the scatterplot and correlation between `Sepal.Length` and `Petal.Length`. `Sepal.Length` constitutes the 1st row and 1st column of the matrix, while `Petal.Length` constitutes the 3rd row and 3rd column. The plot at the intersection of the 1st row and 3rd column (or the 3rd row and 1st column) shows the correlation between `Sepal.Length` and `Petal.Length` as a scatterplot. (Refer to the plot at position (1,3) for `Sepal.Length` on the Y axis and `Petal.Length` on the X axis, and to the plot at position (3,1) for `Petal.Length` on the Y axis and `Sepal.Length` on the X axis.)

Note: Interchanging of axes will only impact visualization of the plot but the correlation coefficient remains unchanged.

```
cor(iris$Sepal.Length, iris$Petal.Length)
## [1] 0.8717538
cor(iris$Petal.Length, iris$Sepal.Length)
## [1] 0.8717538
```

From the above matrix for `iris`, we can deduce the following insights:

- The correlation between `Sepal.Length` and `Petal.Length` is strong and dense.
- `Sepal.Length` and `Sepal.Width` seem to show very little correlation, as the data points are spread throughout the plot area.
- `Petal.Length` and `Petal.Width` also show a strong correlation.

Note: these insights come from interpreting the scatterplots (no absolute value of the correlation coefficient has been calculated). Some more examination will be required once significant variables are obtained for linear regression modeling (with the help of residual plots and the coefficient of determination, i.e. Multiple R-squared, we can get closer to our results).

As we saw, the matrix just provides scatterplots showing the relationship between the variables of our dataset. We have to visually trust that the correlations shown are real, or use the `cor(x, y)` function to evaluate our insights further.
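For instance, `cor()` can evaluate all numeric iris columns at once, giving the coefficient for every pair in one matrix:

```r
# Correlation matrix for the four numeric iris columns
m <- cor(iris[, 1:4])
round(m, 2)  # e.g. Sepal.Length vs Petal.Length is about 0.87
```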

To overcome this, we have one more method that is very simple to implement: `ggpairs(dataset)`. This method belongs to the GGally package, which also needs ggplot2 to be imported.

```
install.packages("ggplot2")
install.packages("GGally")
library(GGally)
library(ggplot2)
```

Now just one more line of code and it is done.

`ggpairs(iris)`

(While rendering, `ggpairs()` prints a per-panel progress bar, along with `stat_bin()` messages for the histogram panels: "using `bins = 30`. Pick better value with `binwidth`.")

We have all the necessary results above, and it even gives us some more insights. Compared to `pairs()`, we are not limited to just scatterplots; with `ggpairs()` we get various plot types for numeric and categorical variables.

What `ggpairs()` provides beyond `pairs()`:

- Along with the scatterplots, correlation coefficients are provided in the same panel.
- Density plots for every continuous numeric variable help us identify skewness, kurtosis and distribution information.
- Box plots represent the statistical summary for each categorical variable against the respective numeric variable.
- Bar charts
- Histograms

Additionally, you can explore the `ggcorr()` method of `GGally`, which gives a graphical representation of only the correlation coefficients, without any plots.

So we can conclude that the panel gives ample necessary insights, but the process is a little more time-consuming compared to the `pairs()` method.

Using some pre-examination of your dataset at the primary stage can help identify significant variables for suitable regression models.


We will go through the following tools:

- Charts in R
- Hypothesis Testing in R

In Six Sigma projects, charts play a very important role. They support the interpretation of data, as a graphical representation provides a better understanding of the process. So charts are used not only in the analyze phase, but at every step of a Six Sigma project.

A bar chart is a very simple chart where quantities are shown as the heights of bars. Each bar represents a level of the factor for which the variable under study is measured. A bar chart is usually the best graphical representation for counts. For example, the printer cartridge manufacturer distributes its product to five regions, and an unexpected number of defective cartridges has been returned in the last month. A bar plot of defects by region is shown below:

```
library("SixSigma")
with(ss.data.pc.r,
     barplot(pc.def, names.arg = pc.regions, las = 2,
             main = "Barplot of defects by region",
             sub = "Printer cartridge example"))
```

A histogram is a bar chart for continuous variables. It shows the distribution of the measurements of a variable and is used to examine that distribution: is the variable centered or biased? Are the observations close to the central values, or is the distribution spread out? Is it a normal distribution?

For example, let's draw histograms for the variables volume and density in the `ss.data.pc` data set.

```
hist(ss.data.pc$pc.volume, main = "Printer Cartridge Volume",
     xlab = "Volume", col = "#DDDDDD")
```

Let's superimpose a density line for a theoretical normal distribution with mean 16 and standard deviation 1:

```
hist(ss.data.pc$pc.volume, main = "Printer Cartridge Volume",
     xlab = "Volume", col = "#BBBBBB", border = "white",
     freq = FALSE, ylim = c(0, 0.4))
curve(dnorm(x, 16, 1), add = TRUE, lty = 2, lwd = 2)
lines(density(ss.data.pc$pc.volume), lwd = 2)
```

A scatter plot is an important tool for revealing relationships between two variables. There are three types of correlation between two variables: positive correlation (high values of one variable go with high values of the other), negative correlation (inversely proportional: high values of one variable go with low values of the other), and no correlation (the variables are independent).

In the `ss.data.pc` dataset, we have two continuous variables: `pc.volume` and `pc.density`. A scatterplot can be used to find patterns in this relationship.

```
# In the formula pc.volume ~ pc.density, volume goes on the y axis
# and density on the x axis, so the labels follow that orientation.
plot(pc.volume ~ pc.density, data = ss.data.pc,
     main = "Searching correlation between Density and Volume",
     col = "#666666", pch = 16,
     sub = "Printer Cartridge Example",
     xlab = "Density", ylab = "Volume of Ink")
```

The box-whisker chart, also known as the box plot, graphically summarizes the distribution of a continuous variable. The sides of the box are the first and third quartiles (25th and 75th percentiles, respectively); thus, inside the box we have the middle 50% of the data. The median is plotted as a line that crosses the box. The box plot tells us whether the distribution is centered or biased (the position of the median with respect to the rest of the data), whether there are outliers (points outside the whiskers), and whether the data are close to the central values (small whiskers or boxes).
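Those box edges are just the quartiles, which you can compute directly with `quantile()`. For example, on the `iris` data used earlier:

```r
# First quartile, median, and third quartile of Sepal.Length
q <- quantile(iris$Sepal.Length, c(0.25, 0.5, 0.75))
q  # 5.1, 5.8, 6.4 -- matching the box edges and median line of its box plot
```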

It is useful when we want to compare groups and check if there are differences among them. Example: In a production line, we have three fillers for the printer cartridges. We want to determine if there are any differences among the fillers and identify outliers.

boxplot(pc.volume ~ pc.filler, data = ss.data.pc, col = "#CCCCCC",
        main = "Box Plot of Volume by Filler",
        sub = "Printer Cartridge Example",
        xlab = "Filler", ylab = "Volume")
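As an aside, the quartile and whisker rule described above (outliers fall more than 1.5 × IQR beyond the quartiles) can be sketched in Python with NumPy. The measurements below are made up for illustration, not taken from `ss.data.pc`:

```python
import numpy as np

# Hypothetical volume measurements for one filler
volumes = np.array([15.2, 15.8, 16.0, 16.1, 16.3, 16.5, 16.6, 18.9])

q1, q3 = np.percentile(volumes, [25, 75])
iqr = q3 - q1

# Conventional whisker limits: 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = volumes[(volumes < lower) | (volumes > upper)]
print(outliers)  # → [18.9]
```

Points outside these limits are exactly the ones `boxplot` draws as isolated dots beyond the whiskers.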

A statistical hypothesis is an assumption about a population parameter. This assumption may or may not be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses. So it is intended to confirm or validate some conjectures about the process we are analyzing. For example, if we have data from a process that are normally distributed and we want to verify if the mean of the process has changed with respect to the historical mean, we should make the following hypothesis test:

H0: μ = μ0

H1: μ ≠ μ0

where H0 denotes the null hypothesis and H1 denotes the alternative hypothesis. Thus we are testing H0 (“the mean has not changed”) vs. H1 (“the mean has changed”).

Hypothesis testing can be performed in two ways: one-sided tests and two-sided tests. An example of the former is when we want to know if the mean of a process has increased:

H0: μ = μ0,

H1: μ > μ0.

Hypothesis testing tries to find evidence about the refutability of the null hypothesis using probability theory. We want to check if a situation of the alternative hypothesis is arising. Subsequently, we will reject the null hypothesis if the data do not support it with “enough evidence.” The threshold for enough evidence can be set by expressing a significance level α. A 5% significance level is a widely accepted value in most cases.
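The decision rule just described (reject H0 when the p-value falls below the significance level α) can be sketched in Python with SciPy. The sample here is made up so that its mean is exactly the hypothesized value; it is not the strings data used later:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements whose mean equals the target of 950
sample = np.array([949.9, 950.1, 950.0, 949.95, 950.05])

alpha = 0.05  # 5% significance level
t_stat, p_value = stats.ttest_1samp(sample, popmean=950.0)

if p_value < alpha:
    print("Reject H0: the mean differs from 950")
else:
    print("Fail to reject H0: no evidence the mean differs from 950")
```

The same reject/fail-to-reject logic underlies every test function used in the rest of this section.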

There are several functions in R to perform hypothesis tests, for example, `t.test` for means, `prop.test` for proportions, `var.test` and `bartlett.test` for variances, `chisq.test` for contingency table and goodness-of-fit tests, `poisson.test` for Poisson distributions, `binom.test` for binomial distributions, and `shapiro.test` for normality. Usually, these functions also provide a confidence interval for the parameter tested.

Example: Hypothesis test to verify if the length of the strings is different from the target value of 950 mm:

H0: μ = 950,

H1: μ ≠ 950.

t.test(ss.data.strings$len, mu = 950, conf.level = 0.95)

        One Sample t-test

data:  ss.data.strings$len
t = 0.6433, df = 119, p-value = 0.5213
alternative hypothesis: true mean is not equal to 950
95 percent confidence interval:
 949.9674 950.0640
sample estimates:
mean of x
 950.0157

As the p-value is greater than 0.05, we cannot reject the null hypothesis: there is no evidence that the mean differs from the target.

Two means can also be compared, for example, for two types of strings. To compare the length of the string types E6 and E1, we use the following code:

data.E1 <- ss.data.strings$len[ss.data.strings$type == "E1"]
data.E6 <- ss.data.strings$len[ss.data.strings$type == "E6"]
t.test(data.E1, data.E6)

        Welch Two Sample t-test

data:  data.E1 and data.E6
t = -0.3091, df = 36.423, p-value = 0.759
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1822016  0.1339911
sample estimates:
mean of x mean of y
 949.9756  949.9997

Variances can be compared using the `var.test` function:

var.test(data.E1, data.E6)

        F test to compare two variances

data:  data.E1 and data.E6
F = 1.5254, num df = 19, denom df = 19, p-value = 0.3655
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6037828 3.8539181
sample estimates:
ratio of variances
          1.525428

In this case, the statistic used is the ratio between variances, and the null hypothesis is “the ratio of variances is equal to 1,” that is, the variances are equal.
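This variance-ratio statistic is easy to compute by hand. A Python sketch with SciPy (on two made-up samples, not the E1/E6 string data) shows where the F statistic and its two-sided p-value come from:

```python
import numpy as np
from scipy import stats

# Two hypothetical samples for illustration
a = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2])
b = np.array([10.0, 10.4, 9.6, 10.5, 9.5, 10.0])

# F statistic: ratio of the two sample variances
f = np.var(a, ddof=1) / np.var(b, ddof=1)
df1, df2 = len(a) - 1, len(b) - 1

# Two-sided p-value from the F distribution
p = 2 * min(stats.f.cdf(f, df1, df2), stats.f.sf(f, df1, df2))
print(f, p)
```

Under H0 (equal variances) the ratio follows an F distribution with n1 − 1 and n2 − 1 degrees of freedom, which is exactly what `var.test` reports.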

Proportions can be compared using the `prop.test` function. Do we have the same proportion of defects for every string type? Let us compare all six types:

defects <- data.frame(type = ss.data.strings$type, res = ss.data.strings$res < 3)
defects <- aggregate(res ~ type, data = defects, sum)
prop.test(defects$res, rep(20, 6))

        6-sample test for equality of proportions without continuity correction

data:  defects$res out of rep(20, 6)
X-squared = 5.1724, df = 5, p-value = 0.3952
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4 prop 5 prop 6
  0.05   0.00   0.00   0.10   0.00   0.05

The p-value for the hypothesis test of equal proportions is greater than 0.05, so we cannot reject the null hypothesis; we do not reject that the proportions are equal.
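For reference, this equal-proportions test is equivalent to a chi-square test on the 2 × 6 table of defects vs. non-defects, and can be reproduced in Python with SciPy (the defect counts are read off the proportions in the output above: 1, 0, 0, 2, 0, 1 out of 20 each):

```python
import numpy as np
from scipy.stats import chi2_contingency

defects = np.array([1, 0, 0, 2, 0, 1])   # defects per string type
totals = np.full(6, 20)                  # 20 strings of each type
table = np.vstack([defects, totals - defects])

chi2, p, dof, expected = chi2_contingency(table)
print(round(chi2, 4), round(p, 4))  # → 5.1724 0.3952
```

The statistic and p-value match R's `prop.test` output, since both perform the same chi-square test without continuity correction.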

A normality test, to check whether the data follow a normal distribution, can be performed with the `shapiro.test()` function:

shapiro.test(ss.data.strings$len)

        Shapiro-Wilk normality test

data:  ss.data.strings$len
W = 0.98463, p-value = 0.1903

The statistic used to perform this hypothesis test is the Shapiro–Wilk statistic. In this test, the hypotheses are as follows:

H0 : The data are normally distributed.

H1 : The data are not normally distributed.

Since the p-value > 0.05, we do not have enough evidence to reject the hypothesis of normality.

A type I error occurs when we reject the null hypothesis when it is in fact true. We commit a type II error when we fail to reject the null hypothesis when it is false. The probability of the former is denoted α, the significance level of the hypothesis test (1 − α is the confidence level). The probability of the latter is denoted β, and 1 − β is the statistical power of the test.
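The type I error rate can be made concrete with a small simulation: if we repeatedly t-test samples drawn from a process where H0 is actually true, we should reject about α of the time. A Python sketch with simulated data (fixed seed, values loosely inspired by the strings example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, rejections, trials = 0.05, 0, 2000

# Draw samples where H0 is true (mean really is 950): every
# rejection here is a type I error, so the observed rejection
# rate should be close to alpha.
for _ in range(trials):
    sample = rng.normal(loc=950.0, scale=0.27, size=30)
    _, p = stats.ttest_1samp(sample, popmean=950.0)
    rejections += p < alpha

print(rejections / trials)  # close to 0.05
```

Repeating the simulation with a shifted true mean would instead estimate 1 − β, the power.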

The R functions `power.prop.test` and `power.t.test` can be used to determine the power of a test or the required sample size.

Example: A Black Belt plans a new analysis and needs to find the sample size required to estimate the mean length of the strings with a maximum error of δ = 0.1 cm. He sets the significance level (α = 0.05) and the power (1 − β = 0.90). The sample size can be determined with the following command:

power.t.test(delta = 0.1, power = 0.9, sig.level = 0.05,
             sd = sd(ss.data.strings$len))

     Two-sample t test power calculation

              n = 151.2648
          delta = 0.1
             sd = 0.2674321
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

So the sample size must be 152 per group.
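As a cross-check, the classical two-sample sample-size formula behind this calculation, n ≈ 2((z₁₋α/₂ + z₁₋β) · σ/δ)², can be evaluated in Python with normal quantiles (using the sd from the output above). Because it ignores the t-distribution correction, it lands slightly below `power.t.test`'s 151.26:

```python
import math
from scipy.stats import norm

alpha, power = 0.05, 0.90
delta, sd = 0.1, 0.2674321   # sd taken from the R output above

z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
z_beta = norm.ppf(power)           # quantile for the desired power
n = 2 * ((z_alpha + z_beta) * sd / delta) ** 2

print(math.ceil(n))  # a little under the t-based answer of 152
```

The small gap between the normal approximation and the exact t-based answer shrinks as n grows.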

In the next part, we will go through the Improve phase of the Six Sigma DMAIC process. Please let me know your feedback in the comments section. Make sure to like & share it. Happy learning!!


A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories. Hair color is also a categorical variable having a number of categories (blonde, brown, brunette, red, etc.) and again, there is no agreed way to order these from highest to lowest. A purely categorical variable is one that simply allows you to assign categories but you cannot clearly order the variables. If the variable has a clear ordering, then that variable would be an ordinal variable.

Now let’s discuss using seaborn to plot categorical data! There are a few main plot types for this:

- barplot
- countplot
- boxplot
- violinplot
- stripplot
- swarmplot

Let’s go through examples of each!

First, we will import the `seaborn` library.

import seaborn as sns
%matplotlib inline  # to plot the graphs inline in a Jupyter notebook

To demonstrate the various categorical plots in Seaborn, we will use the built-in 'tips' dataset that ships with the library.

t = sns.load_dataset('tips')
t.head()  # check some rows to get an idea of the data

The ‘tips’ dataset is a sample dataset in Seaborn which looks like this.

A barplot can be created with the following command:

sns.barplot(x='sex',y='total_bill',data=t)

Here the parameters `x` and `y` refer to names of variables in the dataset provided via the `data` parameter.

This is essentially the same as a barplot, except the estimator explicitly counts the number of occurrences, which is why we pass only the x value. The command for creating a countplot is:

sns.countplot(x='sex',data=t)

This gives the countplot as follows:
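Since a countplot is just a barplot whose estimator counts occurrences, its bar heights can be reproduced directly with pandas. A tiny made-up frame stands in for the tips data here:

```python
import pandas as pd

# Minimal stand-in for the 'tips' dataset
t = pd.DataFrame({"sex": ["Male", "Female", "Male", "Male", "Female"]})

# These counts are exactly the bar heights sns.countplot(x='sex', data=t) draws
counts = t["sex"].value_counts()
print(counts.to_dict())  # → {'Male': 3, 'Female': 2}
```

This equivalence is handy when you need the underlying numbers rather than the picture.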

A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.

sns.boxplot(x='day',y='total_bill',data=t,palette='rainbow')

We can also make a boxplot for the whole dataframe:

# Can do the entire dataframe with orient='h'
sns.boxplot(data=t, palette='coolwarm', orient='h')

It’s also possible to add a nested categorical variable with the hue parameter.

sns.boxplot(x="day", y="total_bill", hue="smoker", data=t, palette="coolwarm")

A violinplot plays a similar role to a box plot, but it also shows the density of the distribution at each value:

sns.violinplot(x="day", y="total_bill", data=t, palette='rainbow')

`hue` can also be applied to a violin plot.

sns.violinplot(x="day",y="total_bill",data=t,hue='sex',palette='Set1')

The `stripplot` will draw a scatterplot where one variable is categorical. A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

The `swarmplot` is similar to `stripplot()`, but the points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values, although it does not scale as well to large numbers of observations (both in terms of the ability to show all the points and in terms of the computation needed to arrange them).

sns.stripplot(x="day", y="total_bill", data=t)

sns.stripplot(x="day", y="total_bill", data=t, jitter=True, hue='sex', palette='Set1')

sns.swarmplot(x="day", y="total_bill", data=t)

sns.swarmplot(x="day", y="total_bill", hue='sex', data=t, palette="Set1", dodge=True)

Hope you like this post.
