
Multiple correspondence analysis (MCA) is a multivariate data analysis and data mining tool for constructing a low-dimensional visual representation of the associations among groups of categorical variables. Variable clustering, a tool for identifying redundancy, is often applied to get a first impression of variable associations and multivariate data structure.

The motivations of this post are to illustrate: 1) preparing input variables for analysis and predictive modeling, 2) MCA as a multivariate exploratory data analysis and categorical data mining tool for extracting business insights from customer churn data, and 3) variable clustering of categorical variables to identify redundant variables.

**Customer churn data in this analysis**

Customer attrition is a metric businesses use to monitor and quantify the loss of customers and/or clients for various reasons. The data set includes customer-level demographic, account, and services information, including monthly charge amounts and length of service with the company. Customers who left the company for competitors (Yes) or stayed with the company (No) are identified in the last column, labeled Churn.

**Load Packages**

```r
require(caret)
require(plyr)
require(car)
require(dplyr)
require(reshape2)
require(ggplot2)   # needed for theme_set()/theme_bw() and plotting below
theme_set(theme_bw(12))
```

The data set used in this post was obtained from the watson-analytics-blog site. Click the hyperlink “Watson Analytics Sample Dataset – Telco Customer Churn” to download the file “WA_Fn-UseC_-Telco-Customer-Churn.csv”.

```r
setwd("path to the location of your copy of the saved csv data file")
churn <- read.table("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep = ",", header = TRUE)
## inspect data dimensions and structure
str(churn)
## 'data.frame': 7043 obs. of 21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int 0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int 1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num 29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num 29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
```

The raw data set contains 7043 records and 21 variables. Looking at the data structure, some data columns need recoding. For instance, the values "No phone service" and "No internet service" can be changed to "No" for consistency. The following code statements recode those observations and more.

```r
## recode selected observations
churn$MultipleLines <- as.factor(mapvalues(churn$MultipleLines,
                                           from = c("No phone service"), to = c("No")))
churn$InternetService <- as.factor(mapvalues(churn$InternetService,
                                             from = c("Fiber optic"), to = c("Fiberoptic")))
churn$PaymentMethod <- as.factor(mapvalues(churn$PaymentMethod,
                                           from = c("Credit card (automatic)", "Electronic check",
                                                    "Mailed check", "Bank transfer (automatic)"),
                                           to = c("Creditcard", "Electronicheck",
                                                  "Mailedcheck", "Banktransfer")))
churn$Contract <- as.factor(mapvalues(churn$Contract,
                                      from = c("Month-to-month", "Two year", "One year"),
                                      to = c("MtM", "TwoYr", "OneYr")))
cols_recode1 <- c(10:15)
for (i in 1:ncol(churn[, cols_recode1])) {
  churn[, cols_recode1][, i] <- as.factor(mapvalues(churn[, cols_recode1][, i],
                                                    from = c("No internet service"),
                                                    to = c("No")))
}
```

In addition, values in the SeniorCitizen column were entered as 0 and 1. Let's recode this variable to "No" and "Yes", respectively, for consistency.

```r
churn$SeniorCitizen <- as.factor(mapvalues(churn$SeniorCitizen,
                                           from = c("0", "1"), to = c("No", "Yes")))
```

Exclude the customer ID and total charges columns from the data analysis.

```r
cols_drop <- c(1, 20)
churn <- churn[, -cols_drop]
```

Let's compute summary statistics of the two numerical variables to see the distribution of the data.

```r
summary(churn$MonthlyCharges)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  18.25   35.50   70.35   64.76   89.85  118.80
```

```r
summary(churn$tenure)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.00    9.00   29.00   32.37   55.00   72.00
```

On the basis of the data distributions above, values in the tenure and monthly charges numerical columns can be coerced to 3-level categorical variables as follows.

```r
## note: the ranges start at 0 and do not overlap, so a tenure of 0
## months falls into ShortTenure rather than the else (LongTenure) bucket
churn$tenure <- as.factor(car::recode(churn$tenure,
                                      "0:9 = 'ShortTenure'; 10:29 = 'MediumTenure'; else = 'LongTenure'"))
churn$MonthlyCharges <- as.factor(car::recode(churn$MonthlyCharges,
                                              "0:35 = 'LowCharge'; 35:70 = 'MediumCharge'; else = 'HighCharge'"))
```

It’s time to check for missing values in the pre-processed data set.

```r
mean(is.na(churn))
## [1] 0
```

There are no missing values. How about the category levels of each variable?

```r
## check for factor levels in each column
nfactors <- apply(churn, 2, function(x) nlevels(as.factor(x)))
nfactors
##           gender    SeniorCitizen          Partner       Dependents
##                2                2                2                2
##           tenure     PhoneService    MultipleLines  InternetService
##                3                2                2                3
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport
##                2                2                2                2
##      StreamingTV  StreamingMovies         Contract PaperlessBilling
##                2                2                3                2
##    PaymentMethod   MonthlyCharges            Churn
##                4                3                2
```

Now, the data set is ready for analysis.

```r
## set random seed to make the results reproducible
## (the seed must be set before the partition is created)
set.seed(324)
inTrain <- createDataPartition(churn$Churn, p = 0.7, list = FALSE)
training <- churn[inTrain, ]
testing <- churn[-inTrain, ]
```

Check the dimensions of the training and testing data sets:

```r
dim(training); dim(testing)
## [1] 4931   19
## [1] 2112   19
```

As expected, the training data set contains 4931 observations and 19 columns, whereas the testing data set contains 2112 observations and 19 columns.

Invoke the FactoMineR & factoextra packages.

```r
require(FactoMineR)
require(factoextra)
## treat PaymentMethod (column 17) and Churn (column 19) as supplementary variables
res.mca <- MCA(training, quali.sup = c(17, 19), graph = FALSE)
fviz_mca_var(res.mca, repel = TRUE)
```

A general guide to extracting business insights from the figure above is to note how close the input variable categories are to the target variable and to each other. For instance, customers on month-to-month contracts, those with fiber optic internet service, senior citizens, customers with high monthly charges, single customers or customers with no dependents, and those with paperless billing are associated with a short tenure with the company and an elevated risk of churn.

On the other hand, customers with one- or two-year contracts, those with DSL internet service, younger customers, those with low monthly charges, and customers with multiple lines, online security, and tech support services are associated with a long tenure and a tendency to stay with the company.

Load the ClustOfVar package, whose hclustvar function produces a tree of variable groups. A paper detailing the ClustOfVar package is here.

```r
require(ClustOfVar)
# run variable clustering excluding the target variable (Churn)
variable_tree <- hclustvar(X.quali = training[, 1:18])
# plot the dendrogram of variable groups
plot(variable_tree)
```

Here is a tree of categorical variable groups.

The dendrogram suggests that the 18 input variables can be combined into approximately 7 to 9 groups of variables. That is one way of going about it. The good news is that the ClustOfVar package offers a function to cut the tree into any desired number of groups (clusters) of variables. The syntax below runs 25 bootstrap samples of the trees to produce a plot of the stability of the variable cluster partitions.

```r
# request 25 bootstrap samples and a stability plot
stability(variable_tree, B = 25)
```

The plot of the stability of variable cluster partitions suggests approximately a 7- to 9-cluster solution. The syntax below lists a consensus 9-cluster solution along with the variable names included in each cluster.

```r
## cut the variable tree into 9 groups of variables
clus <- cutreevar(variable_tree, 9)
## print the list of variables in each cluster group
print(clus$var)
## $cluster1
##        squared loading
## gender               1
##
## $cluster2
##               squared loading
## SeniorCitizen               1
##
## $cluster3
##            squared loading
## Partner          0.7252539
## Dependents       0.7252539
##
## $cluster4
##               squared loading
## tenure              0.7028741
## Contract            0.7162027
## PaymentMethod       0.4614923
##
## $cluster5
##               squared loading
## PhoneService        0.6394947
## MultipleLines       0.6394947
##
## $cluster6
##                 squared loading
## InternetService       0.9662758
## MonthlyCharges        0.9662758
##
## $cluster7
##                squared loading
## OnlineSecurity       0.5706136
## OnlineBackup         0.4960295
## TechSupport          0.5801424
##
## $cluster8
##                  squared loading
## DeviceProtection       0.5319719
## StreamingTV            0.6781781
## StreamingMovies        0.6796616
##
## $cluster9
##                  squared loading
## PaperlessBilling               1
```

The 9 clusters and the variable names in each cluster are listed above. A practical guide to minimizing redundancy is to select one representative variable from each cluster. However, subject-matter considerations should also inform the selection of candidate representatives from each variable cluster group.
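As a sketch of that selection rule, one simple heuristic is to take, within each cluster, the variable with the largest squared loading. The helper below (`pick_representatives` is a hypothetical name, not part of ClustOfVar) assumes the list-of-matrices structure that `clus$var` printed above; a toy input mimicking that structure is used for illustration.

```r
## pick one representative variable per cluster: the variable with the
## largest squared loading (ties resolve to the first variable listed)
pick_representatives <- function(cluster_vars) {
  sapply(cluster_vars, function(g) rownames(g)[which.max(g[, "squared loading"])])
}

## toy input mimicking the structure of clus$var shown above
toy <- list(
  cluster3 = matrix(c(0.7252539, 0.7252539), ncol = 1,
                    dimnames = list(c("Partner", "Dependents"), "squared loading")),
  cluster6 = matrix(c(0.9662758, 0.9662758), ncol = 1,
                    dimnames = list(c("InternetService", "MonthlyCharges"), "squared loading"))
)
pick_representatives(toy)
```

With the actual `clus$var` object this returns one candidate variable name per cluster; subject-matter judgment can then override the purely statistical choice.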

What was the overall customer churn rate in the training data set?

```r
# overall customer churn rate
round(prop.table(table(training$Churn)) * 100, 1)
##   No  Yes
## 73.5 26.5
```

The overall customer attrition rate was approximately 26.5%.

**Customer churn rate by demography, account and service information**

```r
cols_aggr_demog <- c(1:4, 6:7, 9:14, 16)
variable <- rep(names(training[, cols_aggr_demog]), each = 4)
demog_counts <- c()
for (i in 1:ncol(training[, cols_aggr_demog])) {
  demog_count <- ddply(training, .(training[, cols_aggr_demog][, i], training$Churn), "nrow")
  names(demog_count) <- c("class", "Churn", "count")
  demog_counts <- as.data.frame(rbind(demog_counts, demog_count))
}
demog_churn_rate <- as.data.frame(cbind(variable, demog_counts))
demog_churn_rate1 <- dcast(demog_churn_rate, variable + class ~ Churn, value.var = "count")
demog_churn_rate2 <- mutate(demog_churn_rate1,
                            churn_rate = round((Yes / (No + Yes) * 100) - 26.5, 1))
demog <- as.data.frame(paste(demog_churn_rate2$variable, demog_churn_rate2$class))
names(demog) <- c("Category")
demog2 <- as.data.frame(cbind(demog, demog_churn_rate2))

cols_aggr_nlev3 <- c(5, 8, 15, 18)
variable <- rep(names(training[, cols_aggr_nlev3]), each = 6)
nlev3_counts <- c()
for (i in 1:ncol(training[, cols_aggr_nlev3])) {
  nlev3_count <- ddply(training, .(training[, cols_aggr_nlev3][, i], training$Churn), "nrow")
  names(nlev3_count) <- c("class", "Churn", "count")
  nlev3_counts <- as.data.frame(rbind(nlev3_counts, nlev3_count))
}
nlev3_churn_rate <- as.data.frame(cbind(variable, nlev3_counts))
nlev3_churn_rate1 <- dcast(nlev3_churn_rate, variable + class ~ Churn, value.var = "count")
nlev3_churn_rate2 <- mutate(nlev3_churn_rate1,
                            churn_rate = round((Yes / (No + Yes) * 100) - 26.5, 1))
nlev3 <- as.data.frame(paste(nlev3_churn_rate2$variable, nlev3_churn_rate2$class))
names(nlev3) <- c("Category")
nlev3 <- as.data.frame(cbind(nlev3, nlev3_churn_rate2))

variable <- rep("PaymentMethod", 8)
nlev4_count <- ddply(training, .(training[, 17], training$Churn), "nrow")
names(nlev4_count) <- c("class", "Churn", "count")
nlev4_churn_rate <- as.data.frame(cbind(variable, nlev4_count))
nlev4_churn_rate1 <- dcast(nlev4_churn_rate, variable + class ~ Churn, value.var = "count")
nlev4_churn_rate2 <- mutate(nlev4_churn_rate1,
                            churn_rate = round((Yes / (No + Yes) * 100) - 26.5, 1))
nlev4 <- as.data.frame(paste(nlev4_churn_rate2$variable, nlev4_churn_rate2$class))
names(nlev4) <- c("Category")
nlev4 <- as.data.frame(cbind(nlev4, nlev4_churn_rate2))

final_agg <- as.data.frame(rbind(demog2, nlev3, nlev4))

ggplot(final_agg, aes(Category, churn_rate, color = churn_rate < 0)) +
  geom_segment(aes(x = reorder(Category, -churn_rate), xend = Category,
                   y = 0, yend = churn_rate), size = 1.1, alpha = 0.7) +
  geom_point(size = 2.5) +
  theme(legend.position = "none") +
  xlab("Variable") + ylab("Customer Churn (%)") +
  ggtitle("Customer Attrition rate \n Difference from the overall average (%)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()
```

Looking at the figure above, customers with higher-than-average attrition rates include those paying by electronic check, on month-to-month contracts, with higher monthly charges, and with paperless billing. On a positive note, customers with low monthly charges, longer contracts, online security services, dependents, or partners, and those paying by credit card or bank transfer showed much lower-than-average rates of attrition.

Variables such as contract length, bill payment method, internet service type, and even customer demography appeared to play a role in customer attrition and retention. The next step for this company would be to deploy predictive and prescriptive models to score prospective customers for churn risk. I hope this post is helpful. Please leave your comments or suggestions below. Feel free to connect with the author on LinkedIn.


There are two separate data sets for web scraping in this post. The first data set is from the recently released World Happiness Report 2017 by the United Nations Sustainable Development Solutions Network. The 2017 report, launched on March 20, the International Day of Happiness, contained global rankings for happiness and social well-being. The second data set for web scraping is the 2015 social progress index of countries in the world. The Social Progress Index has been described as measuring the extent to which countries provide for the social and environmental needs of their citizens. In this exercise, the two data sets, joined by the country column, were pre-processed prior to principal component analysis (PCA) and clustering. The goal of the clustering approach in this post was to segment the more than 150 countries in the data into separate groups (clusters). The expectation is that countries within a cluster are as similar as possible to each other in happiness and social progress, and as dissimilar as possible from the countries assigned to the other clusters.

**Load Required Packages**

```r
require(rvest)
require(magrittr)
require(plyr)
require(dplyr)
require(reshape2)
require(ggplot2)
require(FactoMineR)
require(factoextra)
require(cluster)
require(useful)
```

```r
# Import the first data set from the site
url1 <- "https://en.wikipedia.org/wiki/World_Happiness_Report"
happy <- read_html(url1) %>%
  html_nodes("table") %>%
  extract2(1) %>%
  html_table()
# inspect imported data structure
str(happy)
## 'data.frame': 155 obs. of 12 variables:
##  $ Overall Rank                : int 1 2 3 4 5 6 7 8 9 10 ...
##  $ Change in rank              : int 3 -1 0 -2 0 1 -1 0 0 0 ...
##  $ Country                     : chr "Norway" "Denmark" "Iceland" "Switzerland" ...
##  $ Score                       : num 7.54 7.52 7.5 7.49 7.47 ...
##  $ Change in score             : num 0.039 -0.004 0.003 -0.015 0.056 0.038 -0.088 -0.02 -0.029 -0.007 ...
##  $ GDP per capita              : num 1.62 1.48 1.48 1.56 1.44 ...
##  $ Social support              : num 1.53 1.55 1.61 1.52 1.54 ...
##  $ Healthy life expectancy     : num 0.797 0.793 0.834 0.858 0.809 0.811 0.835 0.817 0.844 0.831 ...
##  $ Freedom to make life choices: num 0.635 0.626 0.627 0.62 0.618 0.585 0.611 0.614 0.602 0.613 ...
##  $ Generosity                  : num 0.362 0.355 0.476 0.291 0.245 0.47 0.436 0.5 0.478 0.385 ...
##  $ Trust                       : num 0.316 0.401 0.154 0.367 0.383 0.283 0.287 0.383 0.301 0.384 ...
##  $ Dystopia                    : num 2.28 2.31 2.32 2.28 2.43 ...
```

Pre-process the first data set for analysis:

```r
## Exclude columns with ranks and scores, retain the other columns
happy <- happy[c(3, 6:11)]
### rename column headers
colnames(happy) <- gsub(" ", "_", colnames(happy), perl = TRUE)
names(happy)
## [1] "Country"                      "GDP_per_capita"
## [3] "Social_support"               "Healthy_life_expectancy"
## [5] "Freedom_to_make_life_choices" "Generosity"
## [7] "Trust"
### standardize names of selected countries to conform to country names in the map database
happy$Country <- as.character(mapvalues(happy$Country,
                                        from = c("United States", "Congo (Kinshasa)",
                                                 "Congo (Brazzaville)", "Trinidad and Tobago"),
                                        to = c("USA", "Democratic Republic of the Congo",
                                               "Democratic Republic of the Congo", "Trinidad")))
```

```r
### scrape the social progress index data from the site
url2 <- "https://en.wikipedia.org/wiki/List_of_countries_by_Social_Progress_Index"
social <- read_html(url2) %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(fill = TRUE)
# check imported data structure
str(social)
## 'data.frame': 163 obs. of 9 variables:
##  $ Country                  : chr "Norway" "Sweden" "Switzerland" "Iceland" ...
##  $ Rank (SPI)               : chr "1" "2" "3" "4" ...
##  $ Social Progress Index    : chr "88.36" "88.06" "87.97" "87.62" ...
##  $ Rank (BHN)               : chr "9" "8" "2" "6" ...
##  $ Basic Human Needs        : chr "94.8" "94.83" "95.66" "95" ...
##  $ Rank (FW)                : chr "1" "3" "2" "4" ...
##  $ Foundations of Well-being: chr "88.46" "86.43" "86.5" "86.11" ...
##  $ Rank (O)                 : chr "9" "5" "10" "11" ...
##  $ Opportunity              : chr "81.82" "82.93" "81.75" "81.73" ...
```

Pre-process the second data set for analysis:

```r
### Again, exclude columns with ranks, keep the rest
social <- social[c(1, 5, 7, 9)]
### rename column headers
names(social) <- c("Country", "basic_human_needs", "foundations_well_being", "opportunity")
### standardize country names to conform to country names in the map dataset
social$Country <- as.character(mapvalues(social$Country,
                                         from = c("United States", "Côte d'Ivoire",
                                                  "Democratic Republic of Congo", "Congo",
                                                  "Trinidad and Tobago"),
                                         to = c("USA", "Ivory Coast",
                                                "Democratic Republic of the Congo",
                                                "Democratic Republic of the Congo", "Trinidad")))
## coerce character data columns to numeric
social[, 2:4] <- sapply(social[, 2:4], as.numeric)
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
```

Join the two imported data sets

```r
### perform a left join on the country column
soc.happy <- left_join(happy, social, by = c("Country" = "Country"))
### check for missing values in the combined data set
mean(is.na(soc.happy[, 2:10]))
## [1] 0.0353857
```

Left joining the two data files introduced missing values (approximately 3.5% of the total data set) in the combined data set. R offers a variety of imputation algorithms to populate missing values. In this post, a median imputation strategy was applied by replacing missing values with the median for each column.

```r
### median imputation
for (i in 1:ncol(soc.happy[, 2:10])) {
  soc.happy[, 2:10][is.na(soc.happy[, 2:10][, i]), i] <- median(soc.happy[, 2:10][, i],
                                                                na.rm = TRUE)
}
### summary of the raw data
summary(soc.happy[, 2:10])
##  GDP_per_capita   Social_support  Healthy_life_expectancy
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000
##  1st Qu.:0.6600   1st Qu.:1.042   1st Qu.:0.3570
##  Median :1.0550   Median :1.252   Median :0.6050
##  Mean   :0.9779   Mean   :1.187   Mean   :0.5474
##  3rd Qu.:1.3150   3rd Qu.:1.412   3rd Qu.:0.7190
##  Max.   :1.8710   Max.   :1.611   Max.   :0.9490
##  Freedom_to_make_life_choices   Generosity       Trust
##  Min.   :0.0000               Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.3010               1st Qu.:0.1530   1st Qu.:0.0570
##  Median :0.4370               Median :0.2320   Median :0.0890
##  Mean   :0.4078               Mean   :0.2461   Mean   :0.1224
##  3rd Qu.:0.5140               3rd Qu.:0.3220   3rd Qu.:0.1530
##  Max.   :0.6580               Max.   :0.8380   Max.   :0.4640
##  basic_human_needs foundations_well_being  opportunity
##  Min.   :26.81     Min.   :44.12          Min.   :21.12
##  1st Qu.:61.94     1st Qu.:63.09          1st Qu.:41.43
##  Median :75.73     Median :69.41          Median :49.55
##  Mean   :71.82     Mean   :68.29          Mean   :52.12
##  3rd Qu.:84.73     3rd Qu.:74.79          3rd Qu.:62.83
##  Max.   :96.03     Max.   :88.46          Max.   :86.58
```

An important procedure for meaningful clustering and dimension reduction involves transforming and scaling the variables. The simple function below transforms all variables to values between 0 and 1.

```r
## transform variables to a range of 0 to 1
range_transform <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
soc.happy[, 2:10] <- as.data.frame(apply(soc.happy[, 2:10], 2, range_transform))
### summary of transformed data shows success of transformation
summary(soc.happy[, 2:10])
##  GDP_per_capita   Social_support   Healthy_life_expectancy
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.3528   1st Qu.:0.6468   1st Qu.:0.3762
##  Median :0.5639   Median :0.7772   Median :0.6375
##  Mean   :0.5227   Mean   :0.7367   Mean   :0.5768
##  3rd Qu.:0.7028   3rd Qu.:0.8765   3rd Qu.:0.7576
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
##  Freedom_to_make_life_choices   Generosity       Trust
##  Min.   :0.0000               Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.4574               1st Qu.:0.1826   1st Qu.:0.1228
##  Median :0.6641               Median :0.2768   Median :0.1918
##  Mean   :0.6198               Mean   :0.2936   Mean   :0.2638
##  3rd Qu.:0.7812               3rd Qu.:0.3842   3rd Qu.:0.3297
##  Max.   :1.0000               Max.   :1.0000   Max.   :1.0000
##  basic_human_needs foundations_well_being  opportunity
##  Min.   :0.0000    Min.   :0.0000         Min.   :0.0000
##  1st Qu.:0.5075    1st Qu.:0.4278         1st Qu.:0.3103
##  Median :0.7067    Median :0.5704         Median :0.4343
##  Mean   :0.6503    Mean   :0.5451         Mean   :0.4735
##  3rd Qu.:0.8368    3rd Qu.:0.6917         3rd Qu.:0.6372
##  Max.   :1.0000    Max.   :1.0000         Max.   :1.0000
```

Although the previous transformation step could have been adequate for the next steps of this analysis, the function below re-scales all variables to a mean of 0 and a standard deviation of 1.

```r
## Scaling variables to a mean of 0 and standard deviation of 1
sd_scale <- function(x) {
  (x - mean(x)) / sd(x)
}
soc.happy[, 2:10] <- as.data.frame(apply(soc.happy[, 2:10], 2, sd_scale))
### summary of the scaled data
summary(soc.happy[, 2:10])
##  GDP_per_capita    Social_support    Healthy_life_expectancy
##  Min.   :-2.3045   Min.   :-4.1377   Min.   :-2.2984
##  1st Qu.:-0.7492   1st Qu.:-0.5050   1st Qu.:-0.7994
##  Median : 0.1817   Median : 0.2271   Median : 0.2419
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000
##  3rd Qu.: 0.7944   3rd Qu.: 0.7849   3rd Qu.: 0.7206
##  Max.   : 2.1046   Max.   : 1.4787   Max.   : 1.6863
##  Freedom_to_make_life_choices   Generosity        Trust
##  Min.   :-2.7245              Min.   :-1.8320   Min.   :-1.2102
##  1st Qu.:-0.7137              1st Qu.:-0.6929   1st Qu.:-0.6467
##  Median : 0.1949              Median :-0.1048   Median :-0.3304
##  Mean   : 0.0000              Mean   : 0.0000   Mean   : 0.0000
##  3rd Qu.: 0.7093              3rd Qu.: 0.5652   3rd Qu.: 0.3023
##  Max.   : 1.6713              Max.   : 4.4067   Max.   : 3.3767
##  basic_human_needs foundations_well_being  opportunity
##  Min.   :-2.6059   Min.   :-2.5434        Min.   :-1.9025
##  1st Qu.:-0.5721   1st Qu.:-0.5471        1st Qu.:-0.6559
##  Median : 0.2263   Median : 0.1180        Median :-0.1575
##  Mean   : 0.0000   Mean   : 0.0000        Mean   : 0.0000
##  3rd Qu.: 0.7474   3rd Qu.: 0.6842        3rd Qu.: 0.6576
##  Max.   : 1.4016   Max.   : 2.1228        Max.   : 2.1154
```

Now, the data is ready to explore variable relationships with each other.

```r
corr <- cor(soc.happy[, 2:10], method = "pearson")
ggplot(melt(corr, varnames = c("x", "y"), value.name = "correlation"),
       aes(x = x, y = y)) +
  geom_tile(aes(fill = correlation)) +
  scale_fill_gradient2(low = "green", mid = "yellow", high = "red",
                       guide = guide_colorbar(ticks = FALSE, barheight = 5),
                       limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Heatmap of Correlation Matrix", x = NULL, y = NULL)
```

Principal component analysis is a multivariate statistical technique widely used for examining relationships among several quantitative variables. PCA identifies patterns in the variables to reduce the dimensionality of the data set for use in multiple regression and clustering, among others.

```r
soc.pca <- PCA(soc.happy[, 2:10], graph = FALSE)
fviz_screeplot(soc.pca, addlabels = TRUE, ylim = c(0, 65))
```

The percentages on each bar indicate the proportion of total variance explained by the respective principal component. As seen in the scree plot, the first three principal components accounted for ~80% of the total variance.

```r
soc.pca$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1  5.0714898              56.349887                          56.34989
## comp 2  1.4357885              15.953205                          72.30309
## comp 3  0.6786121               7.540134                          79.84323
## comp 4  0.6022997               6.692219                          86.53544
## comp 5  0.4007136               4.452373                          90.98782
## comp 6  0.3362642               3.736269                          94.72409
## comp 7  0.2011131               2.234590                          96.95868
## comp 8  0.1471443               1.634937                          98.59361
## comp 9  0.1265747               1.406386                         100.00000
```

The output suggests that the first two principal components are worthy of consideration. A practical guide to determining the optimum number of principal component axes is to consider only those components with eigenvalues greater than or equal to 1.
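As a quick sketch of that rule, the eigenvalues below are copied from the table above; with the fitted object you would use `soc.pca$eig[, 1]` instead of the hard-coded vector.

```r
## Kaiser-style rule: retain components whose eigenvalue is >= 1
eig <- c(5.0714898, 1.4357885, 0.6786121, 0.6022997, 0.4007136,
         0.3362642, 0.2011131, 0.1471443, 0.1265747)
keep <- which(eig >= 1)
keep  # components 1 and 2
```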

```r
# Contributions of variables to principal components
soc.pca$var$contrib[, 1:3]
##                                   Dim.1        Dim.2        Dim.3
## GDP_per_capita               15.7263477 2.455323e+00 3.162470e-02
## Social_support               12.0654754 5.445993e-01 1.345610e+00
## Healthy_life_expectancy      15.1886385 2.259270e+00 3.317633e+00
## Freedom_to_make_life_choices  6.6999181 2.207049e+01 8.064596e+00
## Generosity                    0.3270114 4.189437e+01 5.406678e+01
## Trust                         4.3688692 2.570658e+01 3.211058e+01
## basic_human_needs            14.5402807 3.836956e+00 1.021076e+00
## foundations_well_being       15.1664220 1.232353e+00 4.169125e-02
## opportunity                  15.9170370 6.170376e-05 4.113192e-04
```

The first principal component explained approximately 57% of the total variation and appears to represent opportunity, GDP per capita, healthy life expectancy, Foundations of Well-being, and basic human needs.

The second principal component explained a further 16% of the total variation and represented opportunity, social support, and generosity.

The syntax for clustering algorithms requires specifying the number of desired clusters (k) as an input. The practical issue is: what value should k take? In the absence of subject-matter knowledge, R offers various empirical approaches for selecting a value of k. One such R tool for suggesting the best number of clusters is the NbClust package.

```r
require(NbClust)
nbc <- NbClust(soc.happy[, 2:10], distance = "manhattan",
               min.nc = 2, max.nc = 30, method = "ward.D", index = "all")
## Warning in pf(beale, pp, df2): NaNs produced
## *** : The Hubert index is a graphical method of determining the number of clusters.
##       In the plot of Hubert index, we seek a significant knee that corresponds to a
##       significant increase of the value of the measure i.e the significant peak in Hubert
##       index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
##       In the plot of D index, we seek a significant knee (the significant peak in Dindex
##       second differences plot) that corresponds to a significant increase of the value of
##       the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 3 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 16 as the best number of clusters
## * 1 proposed 28 as the best number of clusters
## * 2 proposed 29 as the best number of clusters
## * 1 proposed 30 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
## *******************************************************************
```

The NbClust algorithm suggested a 3-cluster solution for the combined data set. So, we will apply k = 3 in the next steps.

```r
set.seed(4653)
pamK3 <- pam(soc.happy[, 2:10], diss = FALSE, 3, keep.data = TRUE)
# number of countries assigned to each of the three clusters
fviz_silhouette(pamK3)
##   cluster size ave.sil.width
## 1       1   28          0.40
## 2       2   82          0.29
## 3       3   47          0.27
```

The number of countries assigned in each cluster is displayed in the table above.

An interesting aspect of the K-medoids algorithm is that it finds observations from the data that are typical of each cluster. The following code lists the "typical countries" selected by the algorithm as closest to the center of each cluster.

```r
## which countries were typical of each cluster?
soc.happy$Country[pamK3$id.med]
## [1] "Germany"      "Macedonia"    "Burkina Faso"
```

We can now display individual countries and superimpose their cluster assignments on the plane defined by the first two principal components.

```r
soc.happy['cluster'] <- as.factor(pamK3$clustering)
fviz_pca_ind(soc.pca,
             label = "none",
             habillage = soc.happy$cluster,   # color by cluster
             palette = c("#00AFBB", "#E7B800", "#FC4E07",
                         "#7CAE00", "#C77CFF", "#00BFC4"),
             addEllipses = TRUE)
```

```r
map.world <- map_data("world")   # requires the maps package
# LEFT JOIN the cluster assignments onto the map data
map.world_joined <- left_join(map.world, soc.happy, by = c("region" = "Country"))
ggplot() +
  geom_polygon(data = map.world_joined,
               aes(x = long, y = lat, group = group, fill = cluster, color = cluster)) +
  labs(title = "Applied Clustering World Happiness and Social Progress Index",
       subtitle = "Based on data from: https://en.wikipedia.org/wiki/World_Happiness_Report and\n https://en.wikipedia.org/wiki/List_of_countries_by_Social_Progress_Index",
       x = NULL, y = NULL) +
  coord_equal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        panel.background = element_blank())
```

Cluster analysis and dimension reduction algorithms such as PCA are widely used to split data sets into distinct subsets of groups and to reduce the dimensionality of the variables, respectively. Together, these approaches: 1) make the structure of the data easier to understand, and 2) make subsequent data mining tasks easier to perform. Hopefully, this post has illustrated the applications of web scraping, pre-processing data for analysis, PCA, and cluster analysis using various tools available in R. Please let us know your comments and suggestions. Feel free to connect with the author on LinkedIn.


```r
setwd("your directory")
sms_data <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
```

Next, check the proportion of spam and ham in your data set

```r
prop.table(table(sms_data$type))
##       ham      spam
## 0.8659849 0.1340151
```

As you can see, the proportion of ham in the data set is roughly 6.5 times that of spam. Let us select a portion of this data set to train a model for our classifier.

```r
library(tm)  # text-mining package providing Corpus(), tm_map(), DocumentTermMatrix()

simply_text <- Corpus(VectorSource(sms_data$text))
# note: newer versions of tm may require content_transformer(tolower) here
cleaned_corpus <- tm_map(simply_text, tolower)
cleaned_corpus <- tm_map(cleaned_corpus, removeNumbers)
cleaned_corpus <- tm_map(cleaned_corpus, removeWords, stopwords())
sms_dtm <- DocumentTermMatrix(cleaned_corpus)

sms_train <- cleaned_corpus[1:3000]
sms_test <- cleaned_corpus[3001:5574]
freq_term <- findFreqTerms(sms_dtm, lowfreq = 10, highfreq = Inf)
sms_freq_train <- DocumentTermMatrix(sms_train, list(dictionary = freq_term))
sms_freq_test <- DocumentTermMatrix(sms_test, list(dictionary = freq_term))

y_train <- factor(sms_data$type[1:3000])
y_test <- factor(sms_data$type[3001:5574])

print(NCOL(sms_freq_train))
print(NCOL(sms_freq_test))
prop.table(table(y_train))
## [1] 856
## [1] 856
## y_train
##       ham      spam
## 0.8636667 0.1363333
```

Notice that the proportion of spam and ham in the training data set is similar to that of the entire data set. One widely used classifier is the linear Support Vector Machine. From my earlier post on linear Support Vector Machines, recall that a linear SVM solves the following optimization problem.

$$Minimize_{w,b}\frac{\vert\vert w\vert\vert^{2}}{2}$$

subject to the constraint

$$y_{i}(w^{T}x_{i}+b)\geq \frac{1}{\vert \vert w\vert\vert}~\forall ~x_{i}~in ~Training~Data~set$$.

Here \(w\) is a weight vector while \(b\) is a scalar also known as bias. For linearly inseparable case, we introduce some penalty \(0\leq \xi_{i} \leq 1\) in the objective function and our constraint.

$$Minimize_{w,b}(\frac{\vert\vert w\vert\vert^{2}}{2}+ C\sum_{i=1}^{n}\xi_{i})$$

subject to the constraints

$$(1)~y_{i}(w^{T}x_{i}+b)\geq 1-\xi_{i}~\forall ~x_{i}~in ~Training~Data~set~of~size~n$$ and

$$(2)~\xi_{i}\geq 0~\forall ~n~points~in~the~Training~Data~set$$

\(C\) is a user-defined parameter known as the regularization parameter. The regularization parameter tells the optimizer which term to emphasize in the minimization: \(\frac{\vert\vert w\vert\vert^{2}}{2}\) or \(\sum_{i=1}^{n}\xi_{i}\). For example, if \(C=0\), then only \(\frac{\vert\vert w\vert\vert^{2}}{2}\) is minimized, whereas if \(C\) is a large value, the emphasis shifts to minimizing \(\sum_{i=1}^{n}\xi_{i}\).

In a case like this, where the class proportions are so skewed, the choice of regularization parameter is a concern. Notice that \(C\) is the scalar weight on the penalty for mis-classification. Intuitively, since the proportion of ham is 6.6 times higher than that of spam in the training data, the linear SVM may become biased towards classifying hams accurately, because mis-classifying many hams leads to a larger aggregated penalty.

Thus, instead of using Linear SVM directly on such data set, it is better to use weighted Linear SVM where instead of using one regularization parameter, we use two separate regularization parameters, \(C_{1}, C_{2}\) where \(C_{1}\) (respectively \(C_{2}\)) is the weight on penalty of mis-classifying a ham sample (respectively a spam sample). So this time our objective function becomes,

$$Minimize_{w,b}(\frac{\vert\vert w\vert\vert^{2}}{2}+ C_{1}\sum_{i=1}^{n_{1}}\xi_{i} +C_{2}\sum_{j=1}^{n_{2}}\xi_{j})$$

where \(n_{1}\) (respectively \(n_{2}\)) is the number of hams (respectively spams) in our training data.

One simple way is to set

$$C_{1}=\frac{1}{n_{1}}$$ and

$$C_{2}=\frac{1}{n_{2}}$$

The R code for doing this is as follows:

y <- sms_data$type
wts <- 1/table(y)
print(wts)

         ham        spam 
 0.000207168 0.001338688

Now train your weighted linear SVM. An easy way of doing so is as follows:

library(e1071)
sms_freq_matrx <- as.matrix(sms_freq_train)
sms_freq_dtm <- as.data.frame(sms_freq_matrx)
sms_freq_matrx_test <- as.matrix(sms_freq_test)
sms_freq_dtm_test <- as.data.frame(sms_freq_matrx_test)
trained_model <- svm(sms_freq_dtm, y_train, type = "C-classification",
                     kernel = "linear", class.weights = wts)

Let us use the model for some predictions

y_predict <- predict(trained_model, sms_freq_dtm_test)
library(gmodels)
CrossTable(y_predict, y_test, prop.chisq = FALSE)

             | y_test 
   y_predict |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      2002 |       233 |      2235 | 
             |     0.896 |     0.104 |     0.868 | 
             |     0.895 |     0.689 |           | 
             |     0.778 |     0.091 |           | 
-------------|-----------|-----------|-----------|
        spam |       234 |       105 |       339 | 
             |     0.690 |     0.310 |     0.132 | 
             |     0.105 |     0.311 |           | 
             |     0.091 |     0.041 |           | 
-------------|-----------|-----------|-----------|
Column Total |      2236 |       338 |      2574 | 
             |     0.869 |     0.131 |           | 
-------------|-----------|-----------|-----------|

Of the 2236 hams, 2002 were correctly classified as ham while 234 were mis-classified as spam. However, the picture is not so rosy for spams: of the 338 spams, only 105 were correctly classified as spam while 233 were mis-classified as ham.

Now let us see what happens without using the weights.

trained_model2 <- svm(sms_freq_dtm, y_train, type = "C-classification", kernel = "linear")
y_predict <- predict(trained_model2, sms_freq_dtm_test)
library(gmodels)
CrossTable(y_predict, y_test, prop.chisq = FALSE)

             | y_test 
   y_predict |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1922 |       224 |      2146 | 
             |     0.896 |     0.104 |     0.834 | 
             |     0.860 |     0.663 |           | 
             |     0.747 |     0.087 |           | 
-------------|-----------|-----------|-----------|
        spam |       314 |       114 |       428 | 
             |     0.734 |     0.266 |     0.166 | 
             |     0.140 |     0.337 |           | 
             |     0.122 |     0.044 |           | 
-------------|-----------|-----------|-----------|
Column Total |      2236 |       338 |      2574 | 
             |     0.869 |     0.131 |           | 
-------------|-----------|-----------|-----------|

Well, the number of spams correctly classified increased, but the number of hams correctly classified decreased.

Let us repeat our experiment by increasing the values of \(C_{1}\) and \(C_{2}\).

wts <- 100/table(y)
print(wts)

      ham      spam 
0.0207168 0.1338688

The results presented below show that the mis-classification of spam has been reduced and the accuracy of spam classification has increased. However, the accuracy for ham has decreased.

             | y_test 
   y_predict |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1908 |       219 |      2127 | 
             |     0.897 |     0.103 |     0.826 | 
             |     0.853 |     0.648 |           | 
             |     0.741 |     0.085 |           | 
-------------|-----------|-----------|-----------|
        spam |       328 |       119 |       447 | 
             |     0.734 |     0.266 |     0.174 | 
             |     0.147 |     0.352 |           | 
             |     0.127 |     0.046 |           | 
-------------|-----------|-----------|-----------|
Column Total |      2236 |       338 |      2574 | 
             |     0.869 |     0.131 |           | 
-------------|-----------|-----------|-----------|

We continued our experiments further by setting

wts <- 10/table(y)
print(wts)

       ham       spam 
0.00207168 0.01338688

The results are

             | y_test 
   y_predict |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1931 |       230 |      2161 | 
             |     0.894 |     0.106 |     0.840 | 
             |     0.864 |     0.680 |           | 
             |     0.750 |     0.089 |           | 
-------------|-----------|-----------|-----------|
        spam |       305 |       108 |       413 | 
             |     0.738 |     0.262 |     0.160 | 
             |     0.136 |     0.320 |           | 
             |     0.118 |     0.042 |           | 
-------------|-----------|-----------|-----------|
Column Total |      2236 |       338 |      2574 | 
             |     0.869 |     0.131 |           | 
-------------|-----------|-----------|-----------|

One can repeat the experiment with different values of wts and see the results.
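As a minimal sketch of that experiment, note that all the wts vectors tried above have the form k/table(y). A small hypothetical helper (the toy label vector below stands in for sms_data$type, and the scale factors match the ones used above) generates them for several values of k:

```r
# Hypothetical helper: inverse-frequency class weights scaled by a factor k,
# mirroring the wts <- 1/table(y), 10/table(y) and 100/table(y) lines above
class_weights <- function(y, k = 1) {
  k / table(y)
}

# Toy labels standing in for sms_data$type (60 ham, 10 spam)
y_toy <- factor(c(rep("ham", 60), rep("spam", 10)))

for (k in c(1, 10, 100)) {
  print(class_weights(y_toy, k))
}
```

The rarer class always receives the larger weight; k only rescales both penalties together, which is why the relative bias between ham and spam stays the same while the overall strength of the penalty term changes.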


In this post, we leverage a few other NLP techniques to analyze another text corpus – A collection of tweets. Given a tweet, we would like to extract the key words or phrases which conveys the gist of the meaning of the tweet. For the purpose of this demo, we will extract President Donald Trump’s tweets (~3000 in total) from twitter using Twitter’s API.

I would not cover the twitter data extraction part in this post and directly jump on to the actual analysis (The data extraction code is in Python). However, I have uploaded a csv file with the extracted tweets here. You can download the file to your local machine & load it in your R console if you wish to follow along with the tutorial.

So let’s jump in now to the actual analysis!

There are various different approaches that one can try for this. We will try out one specific approach in this post: POS (Part of Speech) tagging followed by pattern based chunking and extraction.

Consider the following tweet from our sample:

After being forced to apologize for its bad and inaccurate coverage of me after winning the election, the FAKE NEWS @nytimes is still lost!

A POS tag output for this text would be as follows:

After/IN being/VBG forced/VBN to/TO apologize/VB for/IN its/PRP$ bad/JJ and/CC inaccurate/JJ coverage/NN of/IN me/PRP after/IN winning/VBG the/DT election/NN ,/, the/DT FAKE/NNP NEWS/NNP @nytimes/NNP is/VBZ still/RB lost/VBN !/

A definition of each of these tags can be found here.

We can make a couple of simple observations from the POS tagged text:

- As is obvious, certain tags are more informative than others. For example, noun tags (starting with NN) carry more information than prepositions or conjunctions. Similarly, if we would like to know “what” is being spoken about, noun words may be more relevant than others.
- Secondly, chunks of words carry more meaning than individual words looked at in isolation. For example, in the text above the word “coverage” alone does not adequately convey the meaning of the text. However, “inaccurate coverage” very nicely captures one of the key themes in this tweet.

This is where you may face some challenges if you are trying to implement this in R. Python seems to be more generally preferred for tasks such as this; many of the POS tagging & chunking examples that you will find online are based on the NLTK library in Python. While NLTK is great to work with, I tried implementing this in R as well using the openNLP package.

This is the flow that we will follow:

Text –> Sentence Annotation –> Word Annotation –> POS Tagging –> Chunking

Given below is an explanation of the approach:

The POS tagging code is pretty much based on the code examples that are given as part of openNLP’s documentation. Here is the code:

# We will try the approach for 1 tweet; finally we will convert this into a function
x <- "the text of the tweet"
x <- as.String(x)
# Before POS tagging, we need to do sentence annotation followed by word annotation
wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))
# POS tag the words & extract the "words" from the output
POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
POSwords <- subset(POSAnnotation, type == "word")
# Extract the tags from the words
tags <- sapply(POSwords$features, '[[', "POS")
# Create a data frame with words and tags
tokenizedAndTagged <- data.frame(Tokens = x[POSwords], Tags = tags)

For us to chunk the POS tagged text, we would have to first define what POS pattern we would consider as a chunk. For example, an Adjective-Noun(s) combination (JJ-NN) can be a useful pattern to extract (in the example above this pattern would have given us the “inaccurate coverage” chunk). Similarly, we may wish to chunk and extract proper nouns. So, for example, in this tweet – “Hope you like my nomination of Judge Neil Gorsuch for the United States Supreme Court. He is a good and brilliant man, respected by all.” – chunking and extracting proper nouns (NNP, NNPS) would give us “Judge Neil Gorsuch” & “United States Supreme Court”.

So once we define what POS pattern we consider as a chunk, the next step is to extract them.

This is a bit trickier. (In Python's NLTK, there is a very useful function that helps extract chunks from POS tagged text using RegEx based pattern search. I couldn't find a similar function in openNLP, so I wrote a simple workaround for this function in R.)

Given below is the code with explanatory comments:

# Define a flag (Tags_mod) for POS tags - set to 1 if it contains the POS tag we are interested in, else 0
# In this case we only want noun and adjective tags (NN, JJ)
# Note that this will also capture variations such as NNP, NNPS etc
tokenizedAndTagged$Tags_mod = grepl("NN|JJ", tokenizedAndTagged$Tags)

# Initialize a vector to store chunk indexes
chunk = vector()

# Iterate thru each word and assign each one to a group:
# if the word doesn't belong to NN|JJ tags (i.e. Tags_mod flag is 0), assign it to the default group (0);
# if the ith tag is in "NN|JJ" (i.e. Tags_mod flag is 1), assign it to group i-1 if the (i-1)th Tags_mod flag is also 1, else assign it to a new group
chunk[1] = as.numeric(tokenizedAndTagged$Tags_mod[1])
for (i in 2:nrow(tokenizedAndTagged)) {
  if (!tokenizedAndTagged$Tags_mod[i]) {
    chunk[i] = 0
  } else if (tokenizedAndTagged$Tags_mod[i] == tokenizedAndTagged$Tags_mod[i-1]) {
    chunk[i] = chunk[i-1]
  } else {
    chunk[i] = max(chunk) + 1
  }
}
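To see what this grouping loop produces, here is a toy run on a hypothetical tag sequence (standing in for tokenizedAndTagged$Tags): words outside NN/JJ fall into the default group 0, while consecutive NN/JJ words share a group index.

```r
tags <- c("DT", "JJ", "NN", "IN", "NNP", "NNP")  # hypothetical POS tag sequence
tags_mod <- grepl("NN|JJ", tags)                 # flag the tags we care about

chunk <- vector()
chunk[1] <- as.numeric(tags_mod[1])
for (i in 2:length(tags)) {
  if (!tags_mod[i]) {
    chunk[i] <- 0                                # not NN/JJ: default group
  } else if (tags_mod[i] == tags_mod[i - 1]) {
    chunk[i] <- chunk[i - 1]                     # continue the current chunk
  } else {
    chunk[i] <- max(chunk) + 1                   # start a new chunk
  }
}
print(chunk)  # 0 1 1 0 2 2 -> "JJ NN" forms group 1, "NNP NNP" forms group 2
```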

Finally, extract the matching patterns:

# Split and chunk words
text_chunk <- split(as.character(tokenizedAndTagged$Tokens), chunk)
tag_pattern <- split(as.character(tokenizedAndTagged$Tags), chunk)
names(text_chunk) <- sapply(tag_pattern, function(x) paste(x, collapse = "-"))

# Extract chunks matching pattern
# We will extract JJ-NN chunks and two or more continuous NN tags
# "NN.-NN" -> The "." in this regex will match all variants of NN: NNP, NNS etc
res = text_chunk[grepl("JJ-NN|NN.-NN", names(text_chunk))]

The overall function is available here.

Testing the approach on a few tweets gave fairly promising results.

I am providing below some sample results.

**Sample Tweet 1:**

“Jeff Sessions is an honest man. He did not say anything wrong. He could have stated his response more accurately, but it was clearly not….”

Chunking Output:

c(“Jeff Sessions”, “honest man”)

**Sample Tweet 2:**

Since November 8th, Election Day, the Stock Market has posted $3.2 trillion in GAINS and consumer confidence is at a 15 year high. Jobs!

Chunking Output:

c(“Election Day”, “Stock Market”)

**Sample Tweet 3:**

Russia talk is FAKE NEWS put out by the Dems, and played up by the media, in order to mask the big election defeat and the illegal leaks

Chunking Output:

c(“Russia talk”, “FAKE NEWS”, “big election defeat”, “illegal leaks”)

As can be seen, this approach seems to be working fairly well.

The extracted chunks do convey some of the key themes present in the text.

We can of course try out more complex techniques & alternate approaches and, if time permits, I will try exploring & blogging about a few of them in the future.

However, the intent of this post was to provide readers a quick overview of POS tagging and chunking approaches and the usefulness of these techniques for certain NLP Tasks.

Hope you find this useful!


sms_data <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
head(sms_data)

  type text
1 ham  Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2 ham  Ok lar... Joking wif u oni...
3 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4 ham  U dun say so early hor... U c already then say...
5 ham  Nah I don't think he goes to usf, he lives around here though
6 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Linear SVM tries to find a separating hyper-plane between two classes with maximum gap in-between. A hyper-plane in \(d\)- dimension is a set of points \(x \in \mathbb R^{d}\) satisfying the equation

$$w^{T}x+b=0$$

Let us denote \(h(x)=w^{T}x+b\).

Here \(w \) is a \(d\)-dimensional weight vector while \(b\) is a scalar denoting the bias. Both \(w\) and \(b\) are unknown to you. You will try to find and tune both \(w\) and \(b\) such that \(h(x)\) can separate the spams from hams as accurately as possible. I am drawing an arbitrary line to demonstrate a separating hyper-plane using the following code.

plot(c(-1,10), c(-1,10), type = "n", xlab = "x", ylab = "y", asp = 1)
abline(a = 12, b = -5, col = 3)
text(1, 5, "h(x)=0", col = 2, adj = c(-.1, -.1))
text(2, 3, "h(x)>0, Ham", col = 2, adj = c(-.1, -.1))
text(0, 3, "h(x)<0, Spam", col = 2, adj = c(-.1, -.1))

Oh, by the way, did you notice that the linear SVM has decision rules? Let \(y\) be an indicator variable which takes the value -1 if the message is spam and 1 if it is ham. Then, the decision rules are

$$y=-1~iff~h(x)~is~less~than~0$$

$$y=1~iff~h(x)~is~greater~than~0$$

Notice that for any point \(x\) in your data set, \(h(x)\neq 0\), as we want the function \(w^{T}x+b\) to clearly distinguish between spam and ham. This brings us to an important condition.

Linear SVM assumes that the two classes are linearly separable, that is, a hyper-plane can separate the two classes and the data points from the two classes do not get mixed up. Of course, this is not an ideal assumption, and we will discuss later how linear SVM handles the case of non-linear separability. But for a reader with some experience, here I pose a question: Linear SVM creates a discriminant function, but so does LDA. Yet, both are different classifiers. Why? (Hint: LDA is based on Bayes' theorem while linear SVM is based on the concept of margin. In the case of LDA, one has to make an assumption on the distribution of the data per class. If you are a newbie, please ignore the question; we will discuss this point in detail in some other post.)

Consider two points \(a_{1}\) and \(a_{2}\) lying on the hyper-plane defined by \(w^{T}x+b\). Thus, \(h(a_{1})=w^{T}a_{1}+b=0\) and \(h(a_{2})=w^{T}a_{2}+b=0\). Therefore,

$$w^{T}(a_{1}-a_{2})=0$$.

Notice that the vector \(a_{1}-a_{2}\) is lying on the separating hyper-plane defined by \(w^{T}x+b\). Thus, \(w\) is orthogonal to the hyper-plane.
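A quick numeric sanity check of this orthogonality claim, with toy values for \(w\) and \(b\) (illustrative, not fitted to any data):

```r
w <- c(1, 2); b <- -3
# Two points satisfying w^T x + b = 0, i.e. lying on the hyper-plane
a1 <- c(3, 0)   # 1*3 + 2*0 - 3 = 0
a2 <- c(1, 1)   # 1*1 + 2*1 - 3 = 0
# w^T (a1 - a2) should be 0, confirming w is orthogonal to the hyper-plane
sum(w * (a1 - a2))   # 0
```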

Consider a point \(x_{i}\) and let \(x_{p}\) be the orthogonal projection of \(x_{i}\) onto the hyper-plane. Assume \(h(x_{i})>0\) for \(x_{i}\).

\(\frac{w}{\vert\vert w\vert\vert}\) is a unit vector in the direction of \(w\).

Let \(r\) be the displacement between \(x_{i}\) and \(x_{p}\). Since \(x_{i}\) is projected onto the hyper-plane, there is some loss of information, which is calculated as

$$r=x_{i}-x_{p}$$

Notice that \(r\) is parallel to the unit vector \(\frac{w}{\vert\vert w\vert\vert}\) and therefore \(r\) can be written as some multiple of \(\frac{w}{\vert\vert w\vert\vert}\). Thus, \(r=c\times \frac{w}{\vert\vert w\vert\vert}\), for some scalar value \(c\). It can be shown that

$$c=\frac{h(x_{i})}{\vert\vert w\vert\vert}$$
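This can be verified in one line. Writing \(x_{i}=x_{p}+c\times\frac{w}{\vert\vert w\vert\vert}\) and using \(h(x_{p})=0\):

$$h(x_{i})=w^{T}\left(x_{p}+c\frac{w}{\vert\vert w\vert\vert}\right)+b=(w^{T}x_{p}+b)+c\frac{w^{T}w}{\vert\vert w\vert\vert}=0+c\vert\vert w\vert\vert \quad\Longrightarrow\quad c=\frac{h(x_{i})}{\vert\vert w\vert\vert}$$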

Actually, \(c\times \frac{w}{\vert\vert w\vert\vert}\) represents the orthogonal displacement of the point \(x_{i}\) from the hyper-plane, and distance has to be positive. \(\frac{w}{\vert\vert w\vert\vert}\) cannot be negative, but \(c=\frac{h(x_{i})}{\vert\vert w\vert\vert}\) can be: although before starting the equations we assumed that \(h(x_{i})\) is positive, that is not the general case. In general \(c\) can also be negative. Therefore we multiply \(c\) by \(y\) (remember your \(y\) from the decision rules). That is

$$c=y\times \frac{h(x_{i})}{\vert\vert w\vert\vert}$$

Linear SVM is based on the concept of margin which is defined as the smallest value of \(c\) for all the \(n\) points in your training data set. That is

$$\delta^{\star}=min\{c~\forall~points~in~the~training~data~set\}$$

All points \(x^{\star}\) which lie at an orthogonal distance of \(\delta^{\star}\) from the hyper-plane \(w^{T}x+b=0\) are called support vectors.

Let \(\delta^{\star}=\frac{y^{\star}\times h(x^{\star})}{\vert\vert w\vert\vert}\), where \(x^{\star}\) is a support vector. We want \(\delta^{\star} = \frac{1}{\vert\vert w\vert\vert}\), so clearly we want to multiply by some value \(s\) such that \(s\times y^{\star}\times h(x^{\star})=1\). Therefore, \(s=\frac{1}{y^{\star}\times h(x^{\star})}\).

After scaling down, we have the following constraint for SVM:

$$y\times (w^{T}x+b)\geq \frac{1}{\vert\vert w\vert\vert}$$.

Now consider two support vectors, one denoted by \(a_{1}\) on the positive side and the other, \(a_{2}\), on the negative side of the hyper-plane. For the time being, let us not use \(y\). Then we have \(w^{T}a_{1}+b= \frac{1}{\vert\vert w\vert\vert}\) and \(w^{T}a_{2}+b= \frac{-1}{\vert\vert w\vert\vert}\). Thus, \(w^{T}(a_{1}-a_{2})=\frac{2}{\vert\vert w\vert\vert}\). Here \(w^{T}(a_{1}-a_{2})\) depicts the separation/gap between the support vectors on either side. We want to maximize this gap \(\frac{2}{\vert\vert w\vert\vert}\), for which we need to minimize \(\vert\vert w\vert\vert\). Equivalently, we can minimize \(\frac{\vert\vert w\vert\vert^{2}}{2}\). Formally, we want to solve

$$Minimize_{w,b}\frac{\vert\vert w\vert\vert^{2}}{2}$$

subject to the constraint

$$y_{i}(w^{T}x_{i}+b)\geq \frac{1}{\vert \vert w\vert\vert}~\forall ~x_{i}~in ~Training~Data~set$$.

For linearly inseparable case, we introduce some penalty \(0\leq \xi_{i} \leq 1\) in the objective function and our constraint.

$$Minimize_{w,b}(\frac{\vert\vert w\vert\vert^{2}}{2}+ C\sum_{i=1}^{n}\xi_{i})$$

subject to the constraints

$$(1)~y_{i}(w^{T}x_{i}+b)\geq 1-\xi_{i}~\forall ~x_{i}~in ~Training~Data~set~of~size~n$$ and

$$(2)~\xi_{i}\geq 0~\forall ~n~points~in~the~Training~Data~set$$

\(C\) is a user defined parameter.

In case you wish to see how \(\xi_{i}\) is used as a penalty, please consider the three figures below. You can also choose to skip them if the above details are okay for you.

Let us now start our coding. The needed code is here.

library(tm)
setwd("path to the directory")
print(getwd())
sms_data <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
print(head(sms_data))
sms_data$type <- factor(sms_data$type)
simply_text <- Corpus(VectorSource(sms_data$text))
cleaned_corpus <- tm_map(simply_text, tolower)
cleaned_corpus <- tm_map(cleaned_corpus, removeNumbers)
cleaned_corpus <- tm_map(cleaned_corpus, removeWords, stopwords())
#cleaned_corpus <- tm_map(cleaned_corpus, removePunctuation)
#cleaned_corpus <- tm_map(cleaned_corpus, stripWhitespace)
sms_dtm <- DocumentTermMatrix(cleaned_corpus)
sms_train <- cleaned_corpus[1:4000]
sms_test <- cleaned_corpus[4001:5574]
freq_term <- findFreqTerms(sms_dtm, lowfreq = 2)
sms_freq_train <- DocumentTermMatrix(sms_train, list(dictionary = freq_term))
sms_freq_test <- DocumentTermMatrix(sms_test, list(dictionary = freq_term))
y <- sms_data$type
y_train <- y[1:4000]
y_test <- y[4001:5574]
library(e1071)
tuned_svm <- tune(svm, train.x = sms_freq_train, train.y = y_train, kernel = "linear",
                  range = list(cost = 10^(-2:2), gamma = c(0.1, 0.25, 0.5, 0.75, 1, 2)))
print(tuned_svm)
adtm.m <- as.matrix(sms_freq_train)
adtm.df <- as.data.frame(adtm.m)
svm_good_model <- svm(y_train~., data = adtm.df, kernel = "linear",
                      cost = tuned_svm$best.parameters$cost,
                      gamma = tuned_svm$best.parameters$gamma)
adtm.m_test <- as.matrix(sms_freq_test)
adtm.df_test <- as.data.frame(adtm.m_test)
y_pred <- predict(svm_good_model, adtm.df_test)
table(y_test, y_pred)
library(gmodels)
CrossTable(y_pred, y_test, prop.chisq = FALSE)

  type text
1 ham  Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2 ham  Ok lar... Joking wif u oni...
3 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4 ham  U dun say so early hor... U c already then say...
5 ham  Nah I don't think he goes to usf, he lives around here though
6 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
 cost gamma
    1   0.1
- best performance: 0.01825

Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table: 1574

             | y_test 
      y_pred |       ham |      spam | Row Total | 
-------------|-----------|-----------|-----------|
         ham |      1295 |       191 |      1486 | 
             |     0.871 |     0.129 |     0.944 | 
             |     0.952 |     0.897 |           | 
             |     0.823 |     0.121 |           | 
-------------|-----------|-----------|-----------|
        spam |        66 |        22 |        88 | 
             |     0.750 |     0.250 |     0.056 | 
             |     0.048 |     0.103 |           | 
             |     0.042 |     0.014 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1361 |       213 |      1574 | 
             |     0.865 |     0.135 |           | 
-------------|-----------|-----------|-----------|

In the above example we are not using the fully cleaned corpus (the punctuation-removal and whitespace-stripping steps are commented out). We see that the precision of our classifier for ham is 0.871 and the recall is 0.952. However, for spam messages the precision is only 0.250, while the recall is just 0.103. We found that using the fully cleaned corpus degraded the performance of the SVM even further.


On the other hand, you may want to get a basic understanding of stock price time series forecasting by taking advantage of a simple model that provides sufficient reliability. For that purpose, the Black-Scholes-Merton model, based on the lognormal distribution hypothesis and widely used in financial analysis, can be helpful. The rest of this short dissertation shows how to take advantage of it.

As said, I am going to introduce the Black-Scholes-Merton model which assumes that percentage changes in the stock price in a short period of time are normally distributed.

The return of the stock price **S(t)** at time t can be expressed under these hypotheses as:

$$

\begin{equation}

\begin{aligned}

\frac{S(t)-S(t_{0})}{S(t_{0})}\ \sim\ N(u\ \Delta T,\ \sigma^2\Delta T) \ \ \ \ \ (1) \\

\end{aligned}

\end{equation}

$$

where the left term is the (discrete) return on stock price S at time t. By formulating the same equation in terms of first order differentials for price S (dS) and time t (dt), the equation (1) turns out to be expressed in terms of continuously compounded returns, obtaining:

$$

\begin{equation}

\begin{cases}

t = t_{0} + \Delta T \\

ln(\frac{S(t)}{S(t_{0})})\ \sim \ N((u\ -\ \frac{\sigma^2}{2})\ \Delta T,\ \sigma^2\Delta T) \ \ \ \ \ (2) \\

\end{cases}

\end{equation}

$$

Equation (2) states that the log returns follow a normal distribution, where **u** is the stock price drift and **σ** is the stock price standard deviation. From equation (2), the following relationship binding the stock price S(t) to the stock price S(t0) can be stated:

$$

\begin{equation}

\begin{aligned}

S(t)\ =\ S(t_{0})\ e^{(u\ -\ \frac{\sigma^2}{2})\Delta T\ +\ \sigma B(t)} \ \ \ \ \ (3) \\

\end{aligned}

\end{equation}

$$

The **B(t)** term represents the Brownian motion.
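Equation (3) can be simulated directly. Below is a minimal sketch with illustrative parameter values (not estimated from any data), where the Brownian motion is built as a cumulative sum of independent standard normal increments:

```r
# Simulate one path of equation (3): S(t) = S(t0) * exp((u - sigma^2/2)*t + sigma*B(t))
set.seed(123)
S0 <- 100                      # illustrative starting price S(t0)
u <- 0.001                     # illustrative daily drift
sigma <- 0.02                  # illustrative daily standard deviation
n_days <- 20                   # future horizon in days
t <- 1:n_days
B <- cumsum(rnorm(n_days))     # Brownian motion sampled at t = 1, ..., n_days
S <- S0 * exp((u - 0.5 * sigma^2) * t + sigma * B)
head(S)                        # one simulated price per future day, always positive
```

Repeating the simulation many times and taking quantiles across paths is another way to approximate the confidence band derived analytically below.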

Furthermore, equation (2) allows us to determine the distribution of the stock price, as stated by the following equation:

$$

\begin{equation}

\begin{aligned}

\ln(S(t_{0}+\Delta T))\ \sim \ N(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\Delta T,\ \sigma^2\Delta T)\ \ \ \ \ (4) \\

\end{aligned}

\end{equation}

$$

The drift **u** and the standard deviation **σ** can be estimated from the stock price time series history.

The time interval **ΔT** represents our future horizon. Please note that both the drift and the standard deviation must refer to the same time scale, in the sense that if **ΔT** is measured in days, we have to use the daily returns and daily standard deviation, if **ΔT** is measured in weeks, we have to use weekly returns and weekly standard deviation, and so on.
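This time-scale consistency follows the square-root-of-time rule implied by the σ²ΔT variance term in equation (2): variance and drift scale linearly with time, so the standard deviation scales with the square root of time. A small sketch (the daily values below are illustrative):

```r
# Converting illustrative daily parameters to a weekly (5 trading days) scale
sv_daily <- 0.0189                   # illustrative daily standard deviation
u_daily <- 0.0031                    # illustrative daily drift
n_days <- 5                          # trading days in one week
sv_weekly <- sv_daily * sqrt(n_days) # sigma scales with sqrt of time
u_weekly <- u_daily * n_days         # drift scales linearly with time
c(sv_weekly, u_weekly)
```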

Taking the exponential of both terms of equation (4) we obtain:

$$

\begin{equation}

\begin{aligned}

S(t_{0} + \Delta T)\ \sim \ exp(\ N(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T,\ \sigma^2 \Delta T)) \\

= exp(\ N(\ \hat u(\Delta T),\ \hat\sigma^2(\Delta T))) \ \ \ \ \ (5) \\

\end{aligned}

\end{equation}

$$

The above equation provides a family of normal distributions with known parameters, dependent on the time interval **ΔT** = [0, T]. Lower and upper bounds at each instant of time t in ΔT can be modeled as a 95% confidence interval, determined by the 1.96 multiple of the standard deviation at time t in ΔT. As a result, the expected value and the lower and upper bounds of the stock price S(t) are determined as:

$$

\begin{equation}

\begin{cases}

E(S(t))\ =\ exp(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T) \ \ \ \ \ (6) \\

LB(S(t))\ = exp(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T\ -\ 1.96*\sigma \sqrt \Delta T) \\

UB(S(t))\ = exp(ln(S(t_{0})) + (u\ -\ \frac{\sigma^2}{2})\ \Delta T\ +\ 1.96*\sigma \sqrt \Delta T) \\

\end{cases}

\end{equation}

$$

I will take advantage of the *timeSeries* package for computing returns (see returns() function) and the *quantmod* package for financial stock prices availability (see getSymbols() function). Specifically, I will download the Apple share price history.

suppressPackageStartupMessages(library(timeSeries))
suppressPackageStartupMessages(library(quantmod))
# downloading stock price
getSymbols("AAPL")
head(AAPL)

           AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-03     86.29     86.58    81.90      83.80   309579900      10.85709
2007-01-04     84.05     85.95    83.82      85.66   211815100      11.09807
2007-01-05     85.77     86.20    84.40      85.05   208685400      11.01904
2007-01-08     85.96     86.53    85.28      85.47   199276700      11.07345
2007-01-09     86.45     92.98    85.15      92.57   837324600      11.99333
2007-01-10     94.75     97.80    93.45      97.00   738220000      12.56728

# using adjusted close price
Y <- coredata(AAPL[,"AAPL.Adjusted"])
# history time span
hist_start <- 1
hist_end <- 100
hist <- c(hist_start:hist_end)
# historical prices
Y.hist <- Y[hist]
# historical returns = (Y(t1)-Y(t0))/Y(t0)
Y.hist.ret <- returns(Y.hist)
# standard deviation computed based on history
(sv_hist <- sd(Y.hist.ret, na.rm=T))

[1] 0.01886924

The value shown above is the estimate of the standard deviation of our share price daily returns as determined by historical observations.

It is a good practice to compute the confidence intervals for estimated parameters in order to understand if we have sufficient precision as implied by the samples set size.

# 95% confidence interval
n <- length(hist)
sv_hist_low <- sqrt((n-1)*sv_hist^2/qchisq(.975, df=n-1))
sv_hist_up <- sqrt((n-1)*sv_hist^2/qchisq(.025, df=n-1))
(sv_hist_confint_95 <- c(sv_hist_low, sv_hist, sv_hist_up))

[1] 0.01656732 0.01886924 0.02191993

I am going to show a plot outlining future share price evolution.

# relative future time horizon
t <- 1:20
# martingale hypothesis (the average of future returns is equal to the current value)
u <- 0
# future expected value as based on normal distribution hypothesis
fc <- log(Y.hist[hist_end]) + (u - 0.5*sv_hist^2)*t
# lower bound 95% confidence interval
fc.lb <- fc - 1.96*sv_hist*sqrt(t)
# upper bound 95% confidence interval
fc.ub <- fc + 1.96*sv_hist*sqrt(t)
# collecting lower, expected and upper values
fc_band <- list(lb = exp(fc.lb), m = exp(fc), ub = exp(fc.ub))
# absolute future time horizon
xt <- c(hist_end + t)
# stock price history line plot
plot(Y[hist_start:(hist_end + max(t))], type='l', xlim = c(0, hist_end + max(t)),
     ylim = c(5, 20), xlab = "Time Index", ylab = "Share Price", panel.first = grid())
# starting point for our forecast
suppressWarnings(points(x = hist_end, y = Y.hist[hist_start+hist_end-1], pch = 21, bg = "green"))
# lower bound stock price forecast
lines(x = xt, y = fc_band$lb, lty = 'dotted', col = 'violet', lwd = 2)
# expected stock price forecast
lines(x = xt, y = fc_band$m, lty = 'dotted', col = 'blue', lwd = 2)
# upper bound stock price forecast
lines(x = xt, y = fc_band$ub, lty = 'dotted', col = 'red', lwd = 2)

The plot shows the lower (violet) and upper (red) bounds enclosing the actual future price evolution, and the forecasted expected value (blue). Here I did not account for a drift **u** (u = 0) and, as a consequence, the future expected value appears as a flat line (actually its slope is slightly negative, as determined by the -0.5*σ^2 term).

If you would like the future stock price drift to be more consistent with its recent history, you may compute a return based on the same time scale as the standard deviation.

The lines of code to add are the following:

# added lines of code
(mu_hist <- mean(Y.hist.ret, na.rm=T))
n <- length(hist)
# 95% confidence interval for the mean
mu_hist_low <- mu_hist - qt(0.975, df=n-1)*sv_hist/sqrt(n)
mu_hist_up <- mu_hist + qt(0.975, df=n-1)*sv_hist/sqrt(n)
(mu_hist_confint_95 <- c(mu_hist_low, mu_hist, mu_hist_up))

[1] -0.0006690514  0.0030750148  0.0068190811

Above, the confidence interval for the historical daily returns is shown. We also have to change the assignment to the drift variable u.

# drift computed on historical values
(u <- mu_hist)

[1] 0.0030750148

Furthermore, the code shown above can be easily enhanced with sliders to specify the stock price history to take into account for parameters estimation and the desired future time horizon. That can be done by taking advantage of the manipulate package or by implementing a Shiny gadget or application, for example.

Using the same model, it is possible to compute the probability that the future stock price will be above or below a predetermined value at time t. This can be done by computing the normal distribution parameters as a function of ΔT = t − t0 and using a basic property of the density distribution. Here is how.

This is the historical share price daily drift **u**:

(u <- mu_hist)
## [1] 0.003075015

This is the current share price **S(t0)**:

(curr_share_price <- Y.hist[hist_end])
## [1] 14.72056

This is the mean **mu_t** of our normal distribution computed with ΔT= 10, ten units of time (days) ahead of the current time:

t <- 10
(mu_t <- log(curr_share_price) + u - 0.5*sv_hist^2)*t
## [1] 26.92142

This is the standard deviation **sv_t** of our normal distribution computed with ΔT = 10, ten units of time (days) ahead of the current time:

(sv_t <- sv_hist*sqrt(t))
## [1] 0.05966977

Arbitrarily, I set a target price 10% higher than the current one, hence equal to:

(target_share_price <- 1.10*curr_share_price)
## [1] 16.19261

The probability that the share price at time t is equal to or greater than the target price (please note the lower.tail = FALSE parameter) is the following:

pnorm(q = log(target_share_price), mean = mu_t, sd = sv_t, lower.tail = FALSE)
## [1] 0.06072166

Our model states there is a probability of about 6% that the share price will be at or above the target price.
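The steps above can be collected into a single helper function. This is a sketch with names of my choosing; note that, consistent with the forecast band computed at the beginning of the post, both the drift and the variance correction are scaled by the horizon dt here.

```r
# Under the lognormal model, log S(t0 + dt) is normally distributed with
# mean log(S0) + (u - sv^2/2)*dt and standard deviation sv*sqrt(dt), so the
# chance of ending at or above a target price is a single pnorm call.
prob_above <- function(S0, target, u, sv, dt) {
  m <- log(S0) + (u - 0.5*sv^2)*dt   # mean of the log-price at horizon dt
  s <- sv*sqrt(dt)                   # its standard deviation grows with sqrt(dt)
  pnorm(log(target), mean = m, sd = s, lower.tail = FALSE)
}

# current price, a target 10% higher, drift and daily volatility close to the
# values estimated above, ten days ahead
p <- prob_above(S0 = 14.72, target = 1.10*14.72, u = 0.003075,
                sv = 0.01887, dt = 10)
```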

**The Misbehavior of Markets: criticism of the lognormal hypothesis**

In 1963, Mandelbrot published research highlighting that, contrary to the general assumption that share price movements are normally distributed, they instead follow a Pareto-Levy distribution, which has infinite variance. This implies that events assigned negligible probability under the normal distribution hypothesis are actually not that unlikely to happen under a Pareto-Levy distribution.

Based on that, market booms and crashes are more frequent than we might think. Be aware of this when applying lognormal distribution assumptions.
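Mandelbrot's point can be felt with a short simulation of my own (not from the post): draw the same number of light-tailed and heavy-tailed returns, both rescaled to unit variance, and count the "five-sigma" days in each. A Student-t distribution with 3 degrees of freedom stands in here for a heavy-tailed alternative, since the Pareto-Levy family itself has infinite variance and would make the comparison ill-defined.

```r
# Count extreme daily moves (beyond five standard deviations) under a normal
# model versus a heavy-tailed model with the same variance.
set.seed(42)
n <- 1e5
norm_ret <- rnorm(n)               # light-tailed returns, unit variance
t_ret    <- rt(n, df = 3)/sqrt(3)  # heavy-tailed returns, rescaled to unit variance
norm_extremes <- sum(abs(norm_ret) > 5)
t_extremes    <- sum(abs(t_ret) > 5)
```

Under the normal model a five-sigma move is expected roughly once in a million days, while the heavy-tailed sample produces hundreds of them in the same window: the crashes the lognormal hypothesis calls negligible are routine in the heavy-tailed world.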

We have outlined an approach that can easily be implemented to compute expected values and lower and upper bounds of future stock prices. It is based on the well-known Black-Scholes-Merton model and its normal distribution hypothesis.

The plot showing the future share price expected value with its lower and upper bounds can be enhanced with interactive inputs to let users select the history length and future horizon of interest. Using the same model, probabilities associated with future price thresholds can be computed.

A reference for the Black-Scholes-Merton model discussed in this post can be found in the following book:

- Options, Futures and Other Derivatives, John C. Hull, Prentice Hall, 8th Ed.

A reference for Mandelbrot's criticism of the share price lognormal distribution hypothesis is the following book:

- The Misbehavior of Markets: A Fractal View of Financial Turbulence, Benoit Mandelbrot & Richard L. Hudson, Basic Books Ed.

Disclaimer

Any securities or databases referred to in this post are solely for illustration purposes, and in no way should the findings presented here be interpreted as investment advice or a promotion of any particular security or source.

If you have any question, feel free to comment below.
