Regression analysis is one of the most in-demand machine learning methods in 2018. One branch of regression analysis for measuring economic efficiency is stochastic frontier analysis (SFA). This method is well suited for benchmarking and for finding room for optimization in companies and organizations. It can therefore be used to help companies generate more value for employees and customers.

In this article, you will learn how to program an SFA model in R and also use this model in a benchmarking evaluation setting – before and after changes in organizations or companies.

One of the main advantages of SFA models is that they can predict how you can create more value in companies – either for employees or for customers. This is due to the models’ ability to measure efficiency improvements that leave room for optimization. Basically, an SFA model consists of a parameter you want to explain and optimize (y) and a number of variables that affect this outcome in either a positive or a negative way (the x’s). In this respect it resembles a multiple OLS regression model. The difference is that the explanatory variables (the x’s) are considered inputs that generate the outcome variable (the y). These inputs can be either production inputs or cost inputs, which means the SFA model can be either a production function or a cost function. You therefore add value by finding room for optimization in the cost or production inputs of the company. This can be illustrated with the following graph:

This makes the SFA model very suitable for commercial machine learning analysis. Some really efficient SFA packages have already been published in R – notably `sfa` and `frontier`. The package `sfa` focuses on implementing stochastic frontier analysis in R, and you can build both production and cost models with it. The package `frontier` focuses on more specialised SFA models. In this article we will mainly work with `sfa`. The coding below creates an SFA model in R.

First, we load the required packages in R:

```
# Stochastic frontier analysis in R
# SFA packages
library(sfa)
library(frontier)
# Other packages
library(foreign)
library(readxl)
library(tmvtnorm)
library(htmlTable)
```

The next step is to generate a dataset in R and plot it:

```
# Generation of random data for dataframe
ns = 15
set.seed(121181)
x = runif(ns, 0, 1)
ybar = x^(1/2)
set.seed(121181)
u = rtmvnorm(n = ns, mean = c(0.25), sigma = c(0.2), lower = c(0))
y = ybar/(1 + u)
plot(y ~ x, type = "p", col = "red", ylim = c(0,1))
```

The above coding gives the following plot:

The third step is the creation of the SFA model:

```
# Representation of the efficiency frontier
x.seq <- seq(0, 1, by = 0.01)
t.fr <- x.seq^(1/2)
lines(t.fr ~ x.seq, col = "blue")

# The frontier efficiency measures
lambda = y/sqrt(x)
theta = y^2/x
delta = 1/theta

# Table of the frontier efficiency measures
tab1 <- data.frame(lambda, theta, delta)
htmlTable(tab1)
```

This produces the following table of frontier efficiency measures:

```
      lambda     theta    delta
1  0.6475214 0.4192839 2.385019
2  0.7918257 0.6269879 1.594927
3  0.9576361 0.9170669 1.090433
4  0.7704356 0.5935711 1.684718
5  0.7528086 0.5667209 1.764537
6  0.6541730 0.4279423 2.336763
7  0.8588736 0.7376639 1.355631
8  0.7548694 0.5698277 1.754916
9  0.8929326 0.7973287 1.254188
10 0.5870898 0.3446745 2.901288
11 0.6966463 0.4853160 2.060513
12 0.5375163 0.2889238 3.461120
13 0.6908752 0.4773085 2.095081
14 0.5192932 0.2696654 3.708299
15 0.6621521 0.4384453 2.280786
```

The above coding also generates the following SFA graph:
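As a quick cross-check of the three measures just defined (with the frontier given by y = sqrt(x)), the same arithmetic can be sketched in Python; the x and y values here are hypothetical, chosen only for illustration:

```python
# The three frontier-efficiency measures for one observation,
# using a hypothetical input x and observed output y.
import math

x = 0.49   # hypothetical input
y = 0.56   # hypothetical observed output

lam = y / math.sqrt(x)   # lambda: observed output relative to the frontier sqrt(x)
theta = y ** 2 / x       # theta: note that theta equals lambda squared
delta = 1 / theta        # delta: reciprocal of theta

print(lam, theta, delta)
```

An observation on the frontier would have lambda = theta = delta = 1; values of lambda below 1 indicate inefficiency.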

In this section, I will show how you can use SFA analysis for benchmarking evaluation of changes in companies or organizations, before and after these changes. In the first part of the analysis, the SFA estimates are calculated using the method described in the above section. In order to find the average treatment effect, before and after the intervention, the estimates are divided into a before category and an after category. Thereafter, the estimates are divided into a treatment and a control category. This creates four different categories of SFA estimates, depicted in table I:

As the table shows, the first difference is calculated for the treatment group as the post-intervention estimate minus the pre-intervention estimate (B-A). The second difference is the corresponding post-minus-pre difference for the control group (D-C). Subtracting the two differences forms the DiD design and gives the DiD formula: (D-C) – (B-A).
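As a worked example of the formula, with four hypothetical group means (the letters follow the table's layout, not the healthcare data used later):

```python
# Difference-in-differences from four group means, following (D - C) - (B - A).
# All four values are hypothetical illustrations.
A = 0.90   # pre-intervention mean, first group
B = 0.95   # post-intervention mean, first group
C = 0.88   # pre-intervention mean, second group
D = 0.97   # post-intervention mean, second group

did = (D - C) - (B - A)
print(round(did, 2))  # 0.04
```

The second group improved by 0.09 while the first improved by 0.05, so the differential change is 0.04.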

Now it is time to code the SFA benchmarking evaluation model in R. The coding below uses a DRG healthcare dataset that is not open source. The coding can be applied to any dataset with a time variable (Post) and a treatment variable (Treatment) coded in the above table format. The code below performs the data management and creates an SFA difference-in-differences model for benchmarking evaluation:

```
# Load dataset into R
drg_data <- read_excel("healthcaredata.xlsx")
```

```
# Create data for each benchmarking evaluation group
drg.data.A <- subset(drg_data, drg_data$Post < 1 & drg_data$Treatment < 1)
drg.data.B <- subset(drg_data, drg_data$Post > 0 & drg_data$Treatment < 1)
drg.data.C <- subset(drg_data, drg_data$Post < 1 & drg_data$Treatment > 0)
drg.data.D <- subset(drg_data, drg_data$Post > 0 & drg_data$Treatment > 0)
```

Now it is time to estimate the SFA difference-in-differences models for the benchmarking evaluation:

```
# SFA - group A
SFA_A <- sfa(log(drg) ~ log(los) + log(readmis) + log(agecat) + log(sex), data = drg.data.A)
# SFA - group B
SFA_B <- sfa(log(drg) ~ log(los) + log(readmis) + log(agecat) + log(sex), data = drg.data.B)
# SFA - group C
SFA_C <- sfa(log(drg) ~ log(los) + log(readmis) + log(agecat) + log(sex), data = drg.data.C)
# SFA - group D
SFA_D <- sfa(log(drg) ~ log(los) + log(readmis) + log(agecat) + log(sex), data = drg.data.D)

# Create a table for the models
library(stargazer)
stargazer(SFA_A, SFA_B, SFA_C, SFA_D, type = "text", out = "/SFADiDModel.doc")
```

This gives the following table of estimates:

```
Table I – Estimated results for the various groups
---------------------------------------------------------------------------
                    (1)           (2)           (3)           (4)
VARIABLES         Group A       Group B       Group C       Group D
LOS              -0.174***     -0.123***     -0.310***     -0.342***
                 (0.0297)      (0.0369)      (0.0305)      (0.0437)
Sex              -0.0571***    -0.0537*       0.236***      0.00239
                 (0.0221)      (0.0282)      (0.0248)      (0.0380)
Age category      0.221***      0.170***      0.542***      0.335***
                 (0.0417)      (0.0527)      (0.0338)      (0.0529)
Readmissions      0.504***      0.477***      0.556***      0.714***
                 (0.0134)      (0.0180)      (0.0144)      (0.0206)
Constant          2.689***      2.778***      2.197***      2.632**
                 (0.674)       (0.841)       (0.779)       (1.255)

Observations      8,461         5,327         8,594         4,523
---------------------------------------------------------------------------
Standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1
```

Now it is time to predict the technical efficiency measures for the above pre and post estimates, and display them in a table:

```
# Technical efficiency - group A
eff_A <- efficiencies(SFA_A, margEff = TRUE)
# Technical efficiency - group B
eff_B <- efficiencies(SFA_B, margEff = TRUE)
# Technical efficiency - group C
eff_C <- efficiencies(SFA_C, margEff = TRUE)
# Technical efficiency - group D
eff_D <- efficiencies(SFA_D, margEff = TRUE)

# Create a table for the TE estimates
stargazer(eff_A, eff_B, eff_C, eff_D, type = "text", out = "/SFADiDTEModel.doc")
```

```
Table II – Technical efficiency estimates: DRG as outcome
---------------------------------
VARIABLES            Mean
Group A              0.967
Group B              0.982
Group C              0.966
Group D              0.970
Aggregate DiD        0.00969
---------------------------------
```

As described above, Group A is (pre, treat), Group B is (post, treat), Group C is (pre, comp) and Group D is (post, comp). The aggregate DiD is the DiD estimate (D-C) – (B-A).

- Using Stargazer in R – CRAN.R-project.org
- Using sfa in R – CRAN.R-project.org
- Using frontier in R – CRAN.R-project.org
- Using tmvtnorm in R – CRAN.R-project.org
- Using htmlTable in R – CRAN.R-project.org
- Using readxl in R – CRAN.R-project.org
- Using foreign in R – CRAN.R-project.org

One of the most demanded skills of the data analyst/scientist in 2018 is visualization. Organizations and companies have a huge demand for data visualization because these methods are very efficient for recognizing patterns and gaining insight into big data.

In this article, I show how you can add value to your visualizations by making them interactive with efficient packages in R. Get ready to utilize the power of `ggplot2`, `dygraphs` and `plotly`!

`ggplot2`

`ggplot2` was a real gamechanger in data science when it was released for R Statistical Computing in 2007. The package revolutionised the possibilities for coding high-quality and elegant visualisations in data science software, and it has been optimised and updated ever since. The coding below shows how to make different kinds of visualisations with `ggplot2`:

```
# Load visualisation packages & datasets
library(ggplot2)
library(colorspace)
library(datasets)
head(iris)

# Scatter plot with clusters
spc <- ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point()
spc
```

The above coding first gives a scatter cluster plot with `ggplot2`:

```
# Line plot
lpc <- ggplot(iris, aes(x = Petal.Width, y = Petal.Length, group = Species)) +
  geom_line(aes(color = Species)) +
  geom_point(aes(color = Species))
lpc
```

And it also displays a cluster line plot with `ggplot2`:

`dygraphs`

Another value-generating visualisation package in R is `dygraphs`. This package focuses on creating interactive visualisations with elegant interactive coding modules, and it specialises in visualisations for machine learning methods. The coding below generates different visualisation graphs with `dygraphs`:

```
# Data management packages in R
library(zoo)
library(xts)
# dygraphs in R
library(dygraphs)

# Generate dataset
value <- sample(100:800, 100, replace = T)
time <- seq(from = Sys.time(), to = Sys.time() + 60*60*24*99, by = "day")
data <- data.frame(time = time, value = value)
# Convert to an xts time series, which dygraphs expects
dy_data <- xts(data$value, order.by = data$time)

# dygraph lineplot
dygraph(dy_data)
```

The above coding generates the following interactive lineplot:

```
# dygraph stepplot
dygraph(dy_data) %>%
  dyOptions(colors = "red", pointSize = 2, drawGrid = T, stepPlot = T) %>%
  dyRangeSelector()
```

Thereafter, it also creates the following interactive step plot:

`plotly`

The last package in this article about value-adding visualisation packages in R is `plotly`. This package focuses on creating interactive and value-adding visualisations and has a clean design as well. The coding below creates different visualisations in `plotly`:

```
# plotly graphs
library(plotly)
# Data management packages in R
library(zoo)
library(xts)

# Generate dataset
value <- sample(100:800, 100, replace = T)
time <- seq(from = Sys.time(), to = Sys.time() + 60*60*24*99, by = "day")
data <- data.frame(time = time, value = value)

# plotly lineplot
lp <- plot_ly(data, x = ~time) %>%
  add_trace(y = ~value, mode = 'lines') %>%
  layout(title = "Generated dataset",
         xaxis = list(title = "Months"),
         yaxis = list(title = "Values"))
lp
```

The above coding gives us the following interactive line plot:

```
# plotly line marker plot
lm <- plot_ly(data, x = ~time) %>%
  add_trace(y = ~value, mode = 'lines+markers') %>%
  layout(title = "Generated dataset",
         xaxis = list(title = "Months"),
         yaxis = list(title = "Values"))
lm
```

The new coding gives us an interactive line marker plot in `plotly`:

```
# plotly marker plot
m <- plot_ly(data, x = ~time) %>%
  add_trace(y = ~value, mode = 'markers') %>%
  layout(title = "Generated dataset",
         xaxis = list(title = "Months"),
         yaxis = list(title = "Values"))
m
```

The third piece of coding gives us the following interactive marker plot in `plotly`:

Enjoy your visualisations!

1. Using ggplot2 in R – CRAN.R-project.org

2. Using dygraphs in R – CRAN.R-project.org

3. Using plotly in R – CRAN.R-project.org

A neural network is a computational system that creates predictions based on existing data. Let us train and test a neural network using the *neuralnet* library in R.

A neural network consists of:

- **Input layers:** Layers that take inputs based on existing data
- **Hidden layers:** Layers that use backpropagation to optimise the weights of the input variables in order to improve the predictive power of the model
- **Output layers:** Output of predictions based on the data from the input and hidden layers

In this particular example, our goal is to develop a neural network to determine if a stock pays a dividend or not.

As such, we are using the neural network to solve a classification problem. By classification, we mean problems where the data is sorted into categories; e.g. a fruit can be classified as an apple, banana, orange, etc.

In our dataset, we assign a value of **1** to a stock that pays a dividend. We assign a value of **0** to a stock that does not pay a dividend. The dataset for this example is available at dividendinfo.csv.

Our independent variables are as follows:

- **fcfps:** Free cash flow per share (in $)
- **earnings_growth:** Earnings growth in the past year (in %)
- **de:** Debt to Equity ratio
- **mcap:** Market Capitalization of the stock
- **current_ratio:** Current Ratio (or Current Assets/Current Liabilities)

We firstly set our directory and load the data into the R environment:

```
setwd("your directory")
mydata <- read.csv("dividendinfo.csv")
attach(mydata)
```

Let’s now take a look at the steps we will follow in constructing this model.

One of the most important procedures when forming a neural network is data normalization. This involves adjusting the data to a common scale so as to accurately compare predicted and actual values. Failure to normalize the data will typically result in the prediction value remaining the same across all observations, regardless of the input values.

We can do this in two ways in R:

- Scale the data frame automatically using the *scale* function in R
- Transform the data using a *max-min normalization* technique

We implement both techniques below but choose to use the max-min normalization technique. Please see this useful link for further details on how to use the normalization function.

**Scaled Normalization**

scaleddata<-scale(mydata)

**Max-Min Normalization**

For this method, we invoke the following function to normalize our data:

```
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
```

Then, we use **lapply** to run the function across our existing data (we have termed the dataset loaded into R as **mydata**):

maxmindf <- as.data.frame(lapply(mydata, normalize))
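For comparison only, the same max-min transform can be sketched in plain Python (the R `normalize()` function above is what the tutorial actually uses):

```python
# Max-min normalization: map every value into [0, 1] via
# (x - min) / (max - min), mirroring the R normalize() function.
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so every variable ends up on the same scale.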

We have now scaled our new dataset and saved it into a data frame titled *maxmindf*:

We base our training data (trainset) on 80% of the observations. The test data (testset) is based on the remaining 20% of observations.

```
# Training and Test Data
trainset <- maxmindf[1:160, ]
testset <- maxmindf[161:200, ]
```

We now load the *neuralnet* library into R.

Observe that we are:

- Using neuralnet to “regress” the dependent *“dividend”* variable against the other independent variables
- Setting the number of hidden layers to (2,1) based on the hidden=c(2,1) argument
- The linear.output variable is set to FALSE, given that the impact of the independent variables on the dependent variable (dividend) is assumed to be non-linear
- The threshold is set to 0.01, meaning that if the change in error during an iteration is less than 1%, then no further optimization will be carried out by the model

Deciding on the number of hidden layers in a neural network is not an exact science. In fact, there are instances where accuracy will likely be higher without any hidden layers. Therefore, trial and error plays a significant role in this process.

One possibility is to compare how the accuracy of the predictions change as we modify the number of hidden layers.

For instance, using a (2,1) configuration ultimately yielded *92.5%* classification accuracy for this example.

```
# Neural Network
library(neuralnet)
nn <- neuralnet(dividend ~ fcfps + earnings_growth + de + mcap + current_ratio,
                data = trainset, hidden = c(2,1), linear.output = FALSE, threshold = 0.01)
nn$result.matrix
plot(nn)
```

Our neural network looks like this:

We now generate the error of the neural network model, along with the weights between the inputs, hidden layers, and outputs:

```
nn$result.matrix
```

```
error                          2.027188266758
reached.threshold              0.009190064608
steps                        750.000000000000
Intercept.to.1layhid1          3.287965374794
fcfps.to.1layhid1             -1.723307330428
earnings_growth.to.1layhid1   -0.076629853467
de.to.1layhid1                 1.243670462201
mcap.to.1layhid1              -3.520369700429
current_ratio.to.1layhid1     -3.068677865885
Intercept.to.1layhid2          3.618803162161
fcfps.to.1layhid2              1.109150492946
earnings_growth.to.1layhid2  -11.588713924832
de.to.1layhid2                -1.526458929898
mcap.to.1layhid2              -3.769192938001
current_ratio.to.1layhid2     -4.547481937028
Intercept.to.2layhid1          2.991704593713
1layhid.1.to.2layhid1         -7.372717428050
1layhid.2.to.2layhid1        -22.367528820159
Intercept.to.dividend         -5.673537382132
2layhid.1.to.dividend         17.963989719804
```

As already mentioned, our neural network has been created using the training data. We then compare this to the test data to gauge the accuracy of the neural network forecast.

In the below:

- The “subset” function is used to eliminate the dependent variable from the test data
- The “compute” function then creates the prediction variable
- A “results” variable then compares the predicted data with the actual data
- A confusion matrix is then created with the table function to compare the number of true/false positives and negatives

```
# Test the resulting output
temp_test <- subset(testset, select = c("fcfps", "earnings_growth", "de", "mcap", "current_ratio"))
head(temp_test)
nn.results <- compute(nn, temp_test)
results <- data.frame(actual = testset$dividend, prediction = nn.results$net.result)
```

The predicted results are compared to the actual results:

```
results
```

```
    actual     prediction
161      0 0.003457573932
162      1 0.999946522139
163      0 0.006824520245
...
198      0 0.005474975456
199      0 0.003427332586
200      1 0.999985252611
```

Then, we round up our results using **sapply** and create a confusion matrix to compare the number of true/false positives and negatives:

```
roundedresults <- sapply(results, round, digits = 0)
roundedresultsdf = data.frame(roundedresults)
attach(roundedresultsdf)
table(actual, prediction)
```

```
      prediction
actual  0  1
     0 17  0
     1  3 20
```

A confusion matrix is used to determine the number of true and false positives generated by our predictions. The model generates 17 true negatives (0’s), 20 true positives (1’s), while there are 3 false negatives.

Ultimately, we yield a *92.5% (37/40)* accuracy rate in determining whether a stock pays a dividend or not.
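The headline figure follows directly from the confusion matrix; as a sanity check, here is the same calculation in a few lines of Python, with the counts taken from the matrix above:

```python
# Accuracy from the confusion matrix: 17 true negatives, 20 true positives,
# 3 false negatives and 0 false positives, out of 40 test observations.
tn, tp, fn, fp = 17, 20, 3, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.925
```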

We have already seen how a neural network can be used to solve **classification** problems by attempting to group data based on its attributes. However, what if we wish to solve a **regression** problem using a neural network? i.e. one where the dependent variable is an interval one and can take on a wide range of values?

Let us now visit the gasoline.csv dataset. In this example, we wish to analyze the impact of the explanatory variables **capacity**, **gasoline**, and **hours** on the dependent variable **consumption**.

Essentially, we wish to determine the gasoline spend per year (in $) for a particular vehicle based on different factors.

Accordingly, our variables are as follows:

- **consumption:** Spend (in $) on gasoline per year for a particular vehicle
- **capacity:** Capacity of the vehicle’s fuel tank (in litres)
- **gasoline:** Average cost of gasoline per pump
- **hours:** Hours driven per year by owner

Again, we normalize our data and split into training and test data:

```
# MAX-MIN NORMALIZATION
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
maxmindf <- as.data.frame(lapply(fullData, normalize))

# TRAINING AND TEST DATA
trainset <- maxmindf[1:32, ]
testset <- maxmindf[33:40, ]
```

We then run our neural network and generate our parameters:

```
# 4. NEURAL NETWORK
library(neuralnet)
nn <- neuralnet(consumption ~ capacity + gasoline + hours, data = trainset,
                hidden = c(2,1), linear.output = TRUE, threshold = 0.01)
nn$result.matrix
```

```
error                       0.158611967443
reached.threshold           0.007331578682
steps                      66.000000000000
Intercept.to.1layhid1       1.401987575173
capacity.to.1layhid1        1.307794013481
gasoline.to.1layhid1       -3.102267882386
hours.to.1layhid1          -3.246720660493
Intercept.to.1layhid2      -0.897276576566
capacity.to.1layhid2       -1.934594889387
gasoline.to.1layhid2        3.739470402932
hours.to.1layhid2           1.973830465259
Intercept.to.2layhid1      -1.125920206855
1layhid.1.to.2layhid1       3.175227041522
1layhid.2.to.2layhid1      -2.419360506652
Intercept.to.consumption    0.683726702522
2layhid.1.to.consumption   -0.545431580477
```

Here is what our neural network looks like in visual format:

Then, we validate (or test the accuracy of our model) by comparing the estimated gasoline spend yielded from the neural network to the actual spend as reported in the test output:

```
# Generate predictions for the test set
temp_test <- subset(testset, select = c("capacity", "gasoline", "hours"))
nn.results <- compute(nn, temp_test)
results <- data.frame(actual = testset$consumption, prediction = nn.results$net.result)
results
```

```
        actual   prediction
33 0.7556029883 0.6669224684
34 0.7801494130 0.6458686668
35 0.8356456777 0.6549105183
36 0.8399146211 0.6646982158
37 0.8431163287 0.6631168047
38 0.8890074707 0.6629885579
39 0.9124866596 0.6649999344
40 1.0000000000 0.6665075920
```

In the below code, we are then converting the data back to its original format, and yielding an accuracy of 90% on a mean absolute deviation basis (i.e. the average deviation between estimated and actual gasoline consumption stands at a mean of 10%). Note that we are also converting our data back into standard values given that they were previously scaled using the max-min normalization technique:

```
predicted = results$prediction * abs(diff(range(consumption))) + min(consumption)
actual = results$actual * abs(diff(range(consumption))) + min(consumption)
comparison = data.frame(predicted, actual)
deviation = ((actual - predicted)/actual)
comparison = data.frame(predicted, actual, deviation)
accuracy = 1 - abs(mean(deviation))
accuracy
```

```
[1] 0.9017828022
```

You can see that we obtain 90% accuracy using a (2,1) hidden configuration. This is quite good, especially considering that our dependent variable is in the interval format. However, let’s see if we can get it higher!
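The accuracy metric used above can be sketched in plain Python; the `actual` and `predicted` values here are hypothetical, not the article's test set. Note that because the metric averages signed deviations before taking the absolute value, over- and under-predictions can partly cancel out:

```python
# Mean-absolute-deviation style accuracy, mirroring the R code above:
# deviation = (actual - predicted) / actual; accuracy = 1 - |mean(deviation)|.
# The values below are hypothetical illustrations.
actual    = [100.0, 200.0, 400.0]
predicted = [ 90.0, 220.0, 380.0]

deviation = [(a - p) / a for a, p in zip(actual, predicted)]
accuracy = 1 - abs(sum(deviation) / len(deviation))
print(accuracy)
```

A stricter variant would average `abs(d)` for each deviation `d` instead, which avoids the cancellation effect.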

What happens if we now use a (5,2) hidden configuration in our neural network? Here is the generated output:

```
nn <- neuralnet(consumption ~ capacity + gasoline + hours, data = trainset,
                hidden = c(5,2), linear.output = TRUE, threshold = 0.01)
nn$result.matrix
```

```
error                       0.049463073770
reached.threshold           0.009079608691
steps                     183.000000000000
Intercept.to.1layhid1      -0.484165225327
capacity.to.1layhid1        3.271476705612
gasoline.to.1layhid1      -13.185417334090
hours.to.1layhid1           0.926588147188
Intercept.to.1layhid2      -0.931405056650
capacity.to.1layhid2        0.527977084370
gasoline.to.1layhid2        5.893120354012
hours.to.1layhid2          -0.435230849092
Intercept.to.1layhid3       0.389302962895
capacity.to.1layhid3       -1.502423111329
gasoline.to.1layhid3       -4.684748555999
hours.to.1layhid3          -6.319048800780
Intercept.to.1layhid4      -0.094490811578
capacity.to.1layhid4       -2.399916325456
gasoline.to.1layhid4       -4.115161295471
hours.to.1layhid4           5.013344559754
Intercept.to.1layhid5       0.759624731279
capacity.to.1layhid5       -0.565467044104
gasoline.to.1layhid5       -7.076912238164
hours.to.1layhid5          -6.709144936619
Intercept.to.2layhid1       0.157424617083
1layhid.1.to.2layhid1       7.364054381868
1layhid.2.to.2layhid1      -3.671237007644
1layhid.3.to.2layhid1       6.295218032535
1layhid.4.to.2layhid1      -0.303371875453
1layhid.5.to.2layhid1      12.271950628363
Intercept.to.2layhid2       0.353976458576
1layhid.1.to.2layhid2      -2.460042549015
1layhid.2.to.2layhid2       0.062791089253
1layhid.3.to.2layhid2       2.376623876363
1layhid.4.to.2layhid2      -2.385599836002
1layhid.5.to.2layhid2       5.234292659554
Intercept.to.consumption    0.921627990820
2layhid.1.to.consumption   -0.524918897571
2layhid.2.to.consumption   -0.669503028647
```

```
# Predictions from the (5,2) model
temp_test <- subset(testset, select = c("capacity", "gasoline", "hours"))
nn.results <- compute(nn, temp_test)
results <- data.frame(actual = testset$consumption, prediction = nn.results$net.result)
results
```

```
        actual   prediction
33 0.7556029883 0.6554040151
34 0.7801494130 0.7781191265
35 0.8356456777 0.7611519348
36 0.8399146211 0.7980981880
37 0.8431163287 0.8027250788
38 0.8890074707 0.8047567120
39 0.9124866596 0.7969363797
40 1.0000000000 0.7802800479
```

```
predicted = results$prediction * abs(diff(range(consumption))) + min(consumption)
actual = results$actual * abs(diff(range(consumption))) + min(consumption)
comparison = data.frame(predicted, actual)
deviation = ((actual - predicted)/actual)
comparison = data.frame(predicted, actual, deviation)
accuracy = 1 - abs(mean(deviation))
accuracy
```

```
[1] 0.9577401232
```

We see that our accuracy rate has now increased to nearly 96%, indicating that modifying the number of hidden nodes has enhanced our model!

In this tutorial, you have learned how to use a neural network to solve both classification and regression problems.

Specifically, you saw how we can:

- Normalize data for meaningful analysis
- Classify data using a neural network
- Test accuracy using a confusion matrix
- Determine accuracy when the dependent variable is in interval format

Many thanks for viewing this tutorial. Please visit my blog for this and other posts.

In this post, I will show you how to use visualization and transformation for exploring your data in R. I will use several functions that come with the tidyverse package.

In general, there are two types of variables, categorical and continuous. In this section, I will show the best option to examine their distributions using the data from NHANES.

```
library(tidyverse)
library(RNHANES)
d13 = nhanes_load_data("DEMO_H", "2013-2014") %>%
transmute(SEQN=SEQN, wave=cycle, INDFMIN2, RIDRETH1) %>%
left_join(nhanes_load_data("BMX_H", "2013-2014"), by="SEQN") %>%
select(SEQN, wave, INDFMIN2, RIDRETH1, BMXBMI) %>%
mutate(
annincome = recode_factor(INDFMIN2,
'1' = "lowest",
'2' = "lowest",
'3' = "lowest",
'4' = "low",
'5' = "low",
'6' = "low",
'7' = "medium",
'8' = "medium",
'9' = "medium",
'10' = "high",
'12' = "high",
'13' = "high",
'14' = "highest",
'15' = "highest")) %>%
filter(!is.na(BMXBMI), !is.na(annincome))
```

With the dataset created I will visualize the distribution using a bar chart.

```
ggplot(data = d13) +
geom_bar(aes(annincome))
```

To see the exact number for each category, I can also calculate these values with `count()`

```
d13 %>%
count(annincome)
```

For a continuous variable, it is necessary to use a histogram. I chose to see how BMI is distributed in the NHANES population for 2013, with `binwidth = 5`, which cuts the variable into bins of 5 units.

```
ggplot(data = d13) +
geom_histogram(aes(BMXBMI), binwidth = 5)
```

Combining `ggplot2` and `dplyr`, I can see the relevant counts for BMI with the function `cut_width()`, again in 5-unit bins:

```
d13 %>%
count(cut_width(BMXBMI, 5))
```

To combine the information I showed previously into the same plot, with information about both BMI and annual income, I will use `geom_freqpoly()` and get the multiple overlaid frequency polygons below.

```
ggplot(data = d13, aes(BMXBMI, color = annincome)) +
geom_freqpoly(binwidth = 1)
```

Now I am going to demonstrate the relationship between a continuous variable and a categorical variable using a boxplot.

```
ggplot(data = d13, aes(annincome, BMXBMI)) +
geom_boxplot()
```

For each box, the middle line is the median (50th percentile) for that category. In my case, if I choose the medium category of annual income, the median BMI is ~27. The upper and lower lines of the box show the 75th percentile (BMI = 31) and the 25th percentile (BMI = 20), and the distance between them is called the interquartile range.

For two categorical variables, I need to visualize the relation between them, but I also would like to know the number of observations, so I will use `geom_tile` with the `fill` aesthetic and get the graph below.

```
d13 %>%
mutate(race = recode_factor(RIDRETH1,
`1` = "Mexian American",
`2` = "Hispanic",
`3` = "Non-Hispanic, White",
`4` = "Non-Hispanic, Black",
`5` = "Others")) %>%
count(race, annincome) %>%
ggplot(aes(race, annincome)) +
geom_tile(aes(fill = n))
```

Below, I will see how BMI and total cholesterol relate to each other, drawn in a scatterplot.

```
data13 = d13 %>%
left_join(nhanes_load_data("TCHOL_H", "2013-2014"), by="SEQN") %>%
select(SEQN, wave, INDFMIN2, RIDRETH1, BMXBMI, LBXTC)
ggplot(data = data13) +
geom_point(aes(BMXBMI, LBXTC))
```

Because the points overplot in the previous scatterplot, I can use the `alpha` aesthetic for a more useful graph.

```
ggplot(data = data13) +
geom_point(aes(BMXBMI, LBXTC),
alpha = 1/20)
```

Another way to visualize a relationship between two continuous variables is to bin one of them and treat it as if it were categorical. Adding `cut_number()` makes the comparison fairer, as there is the same number of points in each bin.

```
ggplot(data = data13, aes(BMXBMI, LBXTC)) +
geom_boxplot(aes(group = cut_number(BMXBMI, 20)))
```
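The idea behind `cut_number()` (equal-count bins, as opposed to the equal-width bins of `cut_width()`) can be sketched in Python with the standard library; the data here are hypothetical:

```python
# Equal-count binning in the spirit of cut_number(): split a variable at its
# quartiles so each of the four bins holds the same number of observations.
import statistics

data = list(range(1, 21))               # 20 hypothetical BMI-like values
cuts = statistics.quantiles(data, n=4)  # three interior cut points

def bin_index(v):
    # Number of cut points the value exceeds gives its bin (0..3).
    return sum(v > c for c in cuts)

counts = [0, 0, 0, 0]
for v in data:
    counts[bin_index(v)] += 1
print(counts)  # [5, 5, 5, 5]
```

Each bin receives the same number of points, so boxplots drawn per bin rest on comparable sample sizes.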

I hope this post will help you choose the right way to illustrate distributions and relations within and between variables.


**Keras** is an API used for building and running high-level neural networks. It runs on top of TensorFlow, which was developed by Google.

The main competitor to Keras at this point in time is PyTorch, developed by Facebook. While PyTorch has a somewhat higher level of community support, it is a particularly verbose framework, and I personally prefer Keras for its greater simplicity and ease of use in building and deploying models.

In this particular example, a neural network will be built in Keras to solve a regression problem, i.e. one where our dependent variable (y) is in interval format and we are trying to predict the quantity of y with as much accuracy as possible.

A neural network is a computational system that creates predictions based on existing data. Let us train and test a neural network using Keras.

A neural network consists of:

- **Input layers:** Layers that take inputs based on existing data
- **Hidden layers:** Layers that use backpropagation to optimise the weights of the input variables in order to improve the predictive power of the model
- **Output layers:** Output of predictions based on the data from the input and hidden layers

For this example, we use a **linear activation function** within the **keras** library to create a regression-based neural network. We will use the cars dataset. Essentially, we are trying to predict the value of a potential car sale (i.e. how much a particular person will spend on buying a car) for a customer based on the following attributes:

- Age
- Gender
- Average miles driven per day
- Personal debt
- Monthly income

Firstly, we import our libraries. Note that you will need TensorFlow installed on your system to be able to execute the below code. You can find one of my YouTube tutorials on how to install TensorFlow on Windows 10 here.

```
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense
from tensorflow.python.keras.wrappers.scikit_learn import KerasRegressor
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
```

```
import os
path = "C:/yourdirectory"
os.chdir(path)
os.getcwd()
```

Since we are implementing a neural network, the variables need to be normalized in order for the neural network to interpret them properly. Therefore, our variables are transformed using the **MinMaxScaler()**:

```
# Variables
dataset = np.loadtxt("cars.csv", delimiter=",")
x = dataset[:, 0:5]
y = dataset[:, 5]
y = np.reshape(y, (-1, 1))
# Fit a separate scaler for the inputs and the output, so each
# transform uses the right min/max values
scaler_x = MinMaxScaler()
scaler_y = MinMaxScaler()
print(scaler_x.fit(x))
print(scaler_y.fit(y))
xscale = scaler_x.transform(x)
yscale = scaler_y.transform(y)
```

The data is then split into training and test data:

X_train, X_test, y_train, y_test = train_test_split(xscale, yscale)

Now, we train the neural network. We are using the five **input variables** (age, gender, miles, debt, and income), along with two **hidden layers** of 12 and 8 neurons respectively, and finally the **linear activation function** to process the output.

```python
model = Sequential()
model.add(Dense(12, input_dim=5, kernel_initializer='normal', activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='linear'))
model.summary()
```

The **mean_squared_error (mse)** and **mean_absolute_error (mae)** are our loss metrics – i.e. estimates of how accurate the neural network is in predicting the data. With the validation_split set to 0.2, 80% of the training data is used to train the model, while the remaining 20% is held back for validation purposes.
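For intuition, both metrics are simple averages over the prediction errors and can be computed by hand. A small sketch with made-up predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])  # actual values (made up)
y_pred = np.array([2.5, 5.0, 4.0])  # model predictions (made up)

mse = np.mean((y_true - y_pred) ** 2)    # mean squared error
mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error

print(mse, mae)  # → 1.4166666666666667 0.8333333333333334
```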

```python
model.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae'])
```

From the output, we can see that the more epochs are run, the lower our **MSE** and **MAE** become, indicating improvement in accuracy across each iteration of our model.

Let’s now fit our model.

```python
history = model.fit(X_train, y_train, epochs=150, batch_size=50, verbose=1, validation_split=0.2)
```

```
Train on 577 samples, validate on 145 samples
Epoch 1/150
577/577 [==============================] - 1s 1ms/step - loss: 0.1522 - mean_squared_error: 0.1522 - mean_absolute_error: 0.3003 - val_loss: 0.1368 - val_mean_squared_error: 0.1368 - val_mean_absolute_error: 0.2714
Epoch 2/150
577/577 [==============================] - 0s 56us/step - loss: 0.1153 - mean_squared_error: 0.1153 - mean_absolute_error: 0.2524 - val_loss: 0.1027 - val_mean_squared_error: 0.1027 - val_mean_absolute_error: 0.2341
Epoch 3/150
577/577 [==============================] - 0s 56us/step - loss: 0.0843 - mean_squared_error: 0.0843 - mean_absolute_error: 0.2183 - val_loss: 0.0770 - val_mean_squared_error: 0.0770 - val_mean_absolute_error: 0.2095
...
Epoch 149/150
577/577 [==============================] - 0s 64us/step - loss: 0.0153 - mean_squared_error: 0.0153 - mean_absolute_error: 0.0897 - val_loss: 0.0161 - val_mean_squared_error: 0.0161 - val_mean_absolute_error: 0.0935
Epoch 150/150
577/577 [==============================] - 0s 59us/step - loss: 0.0153 - mean_squared_error: 0.0153 - mean_absolute_error: 0.0901 - val_loss: 0.0162 - val_mean_squared_error: 0.0162 - val_mean_absolute_error: 0.0934
```

Here, we can see that keras is calculating both the **training loss** and **validation loss**, i.e. the deviation between the predicted y and actual y as measured by the mean squared error.

As you can see, we have specified 150 epochs for our model. This means that we are essentially training our model over 150 **forward** and **backward** passes, with the expectation that our loss will decrease with each epoch, meaning that our model is predicting the value of y more accurately as we continue to train the model.
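The mechanics of an epoch — a forward pass, a gradient computation, and a weight update — can be sketched without Keras at all. This toy sketch (synthetic data, a single weight, plain gradient descent) illustrates the idea rather than what Keras does internally:

```python
import numpy as np

# Synthetic data: y ≈ 3x, so the "right" weight is about 3
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
y = 3.0 * x + rng.normal(0.0, 0.05, 100)

w = 0.0    # single weight, initialised at zero
lr = 0.5   # learning rate
for epoch in range(150):
    y_hat = w * x                          # forward pass
    grad = np.mean(2.0 * (y_hat - y) * x)  # backward pass: d(MSE)/dw
    w -= lr * grad                         # weight update

print(round(w, 2))  # converges close to 3.0
```

Each pass nudges the weight in the direction that lowers the mean squared error, which is why the loss shrinks across epochs.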

Let’s see what this looks like when we plot our respective losses:

```python
import matplotlib.pyplot as plt

print(history.history.keys())

# "Loss"
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
```

Both the training and validation loss decrease roughly exponentially as the number of epochs increases, suggesting that the model's accuracy improves as the forward and backward passes accumulate.

So, we’ve seen how we can train a neural network model, and then validate our training data against our test data in order to determine the accuracy of our model.

However, what if we now wish to use the model to estimate unseen data?

Let’s take the following array as an example:

```python
Xnew = np.array([[40, 0, 26, 9000, 8000]])
```

Using this data, let’s plug in the new values to see what our calculated figure for car sales would be:

```python
ynew = model.predict(Xnew)
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))
```

```
X=[  40    0   26 9000 8000], Predicted=[2653.6077]
```

Note that, strictly speaking, a new observation should first be transformed with the same scaler that was fitted on the training inputs, and the scaled prediction should then be mapped back to the original units with inverse_transform; feeding in raw values, as here, will distort the prediction.

In this tutorial, you have learned how to:

- Construct neural networks with Keras
- Scale data appropriately with MinMaxScaler
- Calculate training and test losses
- Make predictions using the neural network model
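One practical detail worth adding: because the target y was scaled to [0, 1], predictions come out on that scale and should be mapped back with inverse_transform. This standalone sketch (made-up numbers, no model involved) shows the round trip:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical car-sale targets in original currency units
y = np.array([[1000.0], [5000.0], [9000.0]])

scaler_y = MinMaxScaler().fit(y)
yscale = scaler_y.transform(y)    # values now in [0, 1]

pred_scaled = np.array([[0.5]])   # a made-up model output on the scaled scale
pred = scaler_y.inverse_transform(pred_scaled)

print(pred[0, 0])  # → 5000.0
```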

Many thanks for viewing this tutorial. Please visit my blog for this and other posts.


Panel data, along with cross-sectional and time series data, are the main data types that we encounter when working with regression analysis.

- Cross-Sectional: Data collected at one particular point in time
- Time Series: Data collected across several time periods
- Panel Data: A mixture of both cross-sectional and time series data, i.e. collected at a particular point in time and across several time periods
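To make the three types concrete, here is a small illustrative sketch (made-up numbers) of the same usage measurements arranged as long-format panel data, from which a cross-section or a time series can be sliced:

```python
# Each row: (company, week, usage_mb) — illustrative numbers only
panel = [
    ("A", 1, 11200), ("A", 2, 11900),
    ("B", 1, 8700),  ("B", 2, 9000),
    ("C", 1, 6100),  ("C", 2, 6400),
]

# Cross-section: one point in time, all entities
cross_section = [row for row in panel if row[1] == 1]

# Time series: one entity, all periods
time_series = [row for row in panel if row[0] == "A"]

print(len(cross_section), len(time_series))  # → 3 2
```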

When it comes to panel data, standard regression analysis often falls short in isolating fixed and random effects.

- Fixed Effects: effects that are constant over time for each unit, i.e. characteristics of an observation that are independent of random disturbances.
- Random Effects: effects that are treated as random disturbances, drawn from a distribution across units.
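Although the tutorial below uses R, the core of the fixed-effects ("within") estimator is easy to sketch: demean y and x within each entity and run OLS on the demeaned data, which sweeps out each entity's time-invariant effect. A minimal Python sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 30, 5                             # 30 entities, 5 periods
entity = np.repeat(np.arange(n), T)
alpha = rng.normal(0.0, 5.0, n)[entity]  # entity-specific fixed effects
x = rng.normal(0.0, 1.0, n * T)
y = 2.0 * x + alpha + rng.normal(0.0, 0.1, n * T)  # true slope = 2

def demean(v):
    """Subtract each entity's own mean (the within transformation)."""
    means = np.array([v[entity == i].mean() for i in range(n)])
    return v - means[entity]

# OLS on demeaned data — the fixed effects alpha drop out entirely
beta = np.sum(demean(x) * demean(y)) / np.sum(demean(x) ** 2)
print(round(beta, 2))  # recovers a slope close to 2.0
```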

Let us see how we can use the `plm` library in R to account for fixed and random effects. There is a video tutorial link at the end of the post.

For this tutorial, we are going to use a dataset of weekly internet usage in MB over 33 weeks for three different companies (A, B, and C). The dataset file is internetplm.csv.

Let us run our fixed (within) and random (random) effect models:

```r
# Fixed Effects (within) and Random Effects (random)
library(plm)
mydata <- read.csv("internetplm.csv")  # load the panel dataset
plmwithin <- plm(usage ~ income + videohours + webpages + gender + age,
                 data = mydata, model = "within")
plmrandom <- plm(usage ~ income + videohours + webpages + gender + age,
                 data = mydata, model = "random")
summary(plmwithin)
```

```
Oneway (individual) effect Within Model

Call:
plm(formula = usage ~ income + videohours + webpages + gender +
    age, data = mydata, model = "within")

Balanced Panel: n=33, T=3, N=99

Residuals:
      Min.   1st Qu.    Median   3rd Qu.      Max.
 -7001.684 -1744.973   -18.793  1212.604  6423.064

Coefficients:
              Estimate Std. Error t-value  Pr(>|t|)
income         0.60389    0.14897  4.0538 0.0001451 ***
videohours  1261.43899  111.13253 11.3508 < 2.2e-16 ***
webpages      97.79839   42.06834  2.3248 0.0234299 *
gender     -1328.98316  812.05060 -1.6366 0.1068685
age           54.15710   36.54555  1.4819 0.1435128
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    4312700000
Residual Sum of Squares: 692730000
R-Squared:      0.83937
Adj. R-Squared: 0.51719
F-statistic: 63.7531 on 5 and 61 DF, p-value: < 2.22e-16
```

```r
summary(plmrandom)
```

```
Oneway (individual) effect Random Effect Model
   (Swamy-Arora's transformation)

Call:
plm(formula = usage ~ income + videohours + webpages + gender +
    age, data = mydata, model = "random")

Balanced Panel: n=33, T=3, N=99

Effects:
                   var  std.dev share
idiosyncratic 11356210     3370 0.912
individual     1093188     1046 0.088
theta: 0.1191

Residuals:
      Min.   1st Qu.    Median   3rd Qu.      Max.
 -8153.757 -1808.112   -90.183  2101.833  8507.735

Coefficients:
               Estimate  Std. Error t-value  Pr(>|t|)
(Intercept) -1831.20613  1482.37880 -1.2353   0.21982
income          0.51416     0.12419  4.1401 7.627e-05 ***
videohours   1265.14090    94.06589 13.4495 < 2.2e-16 ***
webpages       95.35198    37.72137  2.5278   0.01316 *
gender      -1688.46458   708.69203 -2.3825   0.01923 *
age            57.27720    30.01488  1.9083   0.05944 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    5790300000
Residual Sum of Squares: 1031900000
R-Squared:      0.82178
Adj. R-Squared: 0.77198
F-statistic: 85.7674 on 5 and 93 DF, p-value: < 2.22e-16
```

In the case of the first regression, we are accounting for fixed effects (or internet usage independent of time), while the second is accounting for random effects (including time).

Note that the variables *gender* and *age*, which were deemed insignificant in the fixed effects regression, are now deemed significant in the random effects regression.

Now, let us use *fixef* to isolate the effects of time on internet usage.

```r
# Effect of time on internet usage
fixef(plmwithin, type = "dmean")
```

```
         1          2          3          4          5          6          7          8          9         10         11
 3054.6050  -467.6298 -1557.9555   524.9491   -18.3859 -1374.6854  2300.6432  1890.2534  -541.1499  4721.5132   241.6241
        12         13         14         15         16         17         18         19         20         21         22
-1438.6650  2208.1437 -2036.5721  -848.0145   508.1098  -269.4162  2335.4139   803.8547 -3633.5643 -1399.4824  2142.2455
        23         24         25         26         27         28         29         30         31         32         33
  258.5465 -2458.9901 -1724.9200 -3979.0983    70.2551   717.6419  1289.9506 -1763.4756  2241.4846  2945.1578 -4742.3872
```

```r
summary(fixef(plmwithin, type = "dmean"))
```

```
    Estimate Std. Error t-value Pr(>|t|)
1   3054.605   2860.669  1.0678  0.28561
2   -467.630   2452.521 -0.1907  0.84878
3  -1557.955   2368.244 -0.6579  0.51063
4    524.949   2734.229  0.1920  0.84775
5    -18.386   2861.050 -0.0064  0.99487
6  -1374.685   2719.298 -0.5055  0.61319
7   2300.643   2353.954  0.9774  0.32839
8   1890.253   2598.290  0.7275  0.46692
9   -541.150   2518.805 -0.2148  0.82989
10  4721.513   2619.317  1.8026  0.07146 .
11   241.624   2765.534  0.0874  0.93038
12 -1438.665   2868.905 -0.5015  0.61604
13  2208.144   2651.892  0.8327  0.40503
14 -2036.572   2549.967 -0.7987  0.42448
15  -848.015   2643.324 -0.3208  0.74835
16   508.110   2594.214  0.1959  0.84472
17  -269.416   2463.474 -0.1094  0.91291
18  2335.414   2662.246  0.8772  0.38036
19   803.855   2475.448  0.3247  0.74538
20 -3633.564   2504.424 -1.4509  0.14682
21 -1399.482   2727.735 -0.5131  0.60791
22  2142.245   2477.582  0.8647  0.38723
23   258.546   2702.259  0.0957  0.92378
24 -2458.990   3036.375 -0.8098  0.41803
25 -1724.920   2489.795 -0.6928  0.48844
26 -3979.098   2758.028 -1.4427  0.14910
27    70.255   2595.908  0.0271  0.97841
28   717.642   2494.345  0.2877  0.77357
29  1289.951   2575.964  0.5008  0.61654
30 -1763.476   2458.731 -0.7172  0.47323
31  2241.485   2441.171  0.9182  0.35851
32  2945.158   2725.083  1.0808  0.27980
33 -4742.387   2820.176 -1.6816  0.09265 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Observe that weeks 10 and 33 are significant at the 10% level. For week 10, usage across A, B, and C is higher than usual, while for week 33 it is lower than usual.

This could indicate that something is taking place across these particular weeks to cause internet usage to be higher and lower than usual. We see that using *fixef* has allowed us to isolate the effect of time in this regard.
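What fixef(type = "dmean") reports corresponds to each period's mean deviation from the grand mean. A small illustrative sketch (made-up numbers) of that computation:

```python
import numpy as np

# Hypothetical usage (MB) for 4 weeks (rows) across companies A, B, C (columns)
usage = np.array([
    [11000, 8800, 6200],
    [12500, 9900, 7000],
    [10200, 8100, 5600],
    [11300, 9000, 6200],
], dtype=float)

week_means = usage.mean(axis=1)
dmean_effects = week_means - usage.mean()  # each week's deviation from the grand mean

print(np.round(dmean_effects, 1))  # positive values = busier-than-average weeks
```

By construction, these demeaned effects sum to zero across periods.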

Now, let us extract fixed effects – i.e. effects independent of time.

```r
# Extracting fixed effects with fixef
twoway <- plm(usage ~ income + videohours + webpages + gender + age + company,
              data = mydata, model = "within", effect = "time")
fixef(twoway, effect = "time")
```

```
         a          b          c
 -451.3598 -1202.3393 -3451.6416
```

We see that of the three internet companies, company C has the most negative fixed effect.

Let’s subset usage for the three companies and see what is going on:

```r
# Usage by provider
a <- subset(mydata, company == "a")
b <- subset(mydata, company == "b")
c <- subset(mydata, company == "c")

# Mean usage by provider
mean(a$usage)
mean(b$usage)
mean(c$usage)
```

```
[1] 11646.94
[1] 8851.848
[1] 6279.03
```

We see that mean internet usage for company C is lower than that of A and B.

Moreover, visualising usage across the 33 weeks reveals an interesting insight:

```r
# Usage visualisations
plot(a$usage, type = 'l')
plot(b$usage, type = 'l')
plot(c$usage, type = 'l')
```

We see that in the case of C, usage remains low overall but there are certain periods where we see a spike in usage. However, we see these spikes much more frequently for A and B.

In this regard, comparing fixed and random effects has allowed us to isolate the impact of time on usage patterns for C.

In this tutorial, you have learned:

- The difference between a fixed and random effects model
- How to use the plm library
- How to isolate fixed and random effects in a panel dataset

Many thanks for viewing this tutorial. Please visit my blog for this and other posts.
