This is the bimonthly post (for 2015-08-04) for new R Jobs from R-users.com.
Employers: visit this link to post a new R job to the R community (it’s free and quick).
Job seekers: please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).
I’m announcing the alpha launch of EcoPy: Ecological Data Analysis in Python. EcoPy is a Python module that contains a number of techniques (PCA, CA, CCorA, nMDS, MDS, RDA, etc.) for exploring complex multivariate data. For those of you familiar with R, think of this as the Python equivalent to the ‘vegan’ package.
However, I’m not done! This is the alpha launch, which means you should exercise caution before using this package for research. I’ve stress-tested a couple of simple examples to ensure I get equivalent output in numerous programs, but I haven’t tested it out with real, messy data yet. There might be broken bits or quirks I don’t know about. For the moment, be sure to verify your results with other software.
That said, I need help! My coding might be sloppy, you might find errors that I didn’t, or you might suggest improvements to the interface. If you want to contribute, either by helping with coding or stress-testing on real-world examples, let me know. The module is available on GitHub and full documentation is available at readthedocs.org.
“Feature engineering” is a fancy term for making sure that your predictors are encoded in the model in a manner that makes it as easy as possible for the model to achieve good performance. For example, if you have a date field as a predictor and there are larger differences in response for the weekends versus the weekdays, then encoding the date as a weekend/weekday indicator makes it easier to achieve good results.
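As a minimal sketch of that date example (the dates and the weekend/weekday split here are hypothetical, purely for illustration):

```r
# Hypothetical illustration: encode a date as a weekend/weekday factor
dates <- as.Date(c("2015-08-01", "2015-08-03", "2015-08-08", "2015-08-10"))
day_type <- factor(ifelse(weekdays(dates) %in% c("Saturday", "Sunday"),
                          "weekend", "weekday"))
# A model can now split cleanly on day_type instead of having to
# rediscover the weekend effect from the raw date values
```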
However, this depends on a lot of things.
First, it is model-dependent. For example, trees might have trouble with a classification data set if the class boundary is a diagonal line since their class boundaries are made using orthogonal slices of the data (oblique trees excepted).
Second, the process of predictor encoding benefits the most from subject-specific knowledge of the problem. In my example above, you need to know the patterns of your data to improve the format of the predictor. Feature engineering is very different in image processing, information retrieval, RNA expressions profiling, etc. You need to know something about the problem and your particular data set to do it well.
Here is some training set data where two predictors are used to model a two-class system (I’ll unblind the data at the end):
There is also a corresponding test set that we will use below.
There are some observations that we can make:
Depending on which model we might choose to use, the between-predictor correlation might bother us. Also, we should look to see if the individual predictors are important. To measure this, we’ll use the area under the ROC curve on the predictor data directly.
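One way to compute those per-predictor areas is a sketch like this using the pROC package (example_train and its Class column are from the data set shown later in the post):

```r
# Sketch: AUC of each raw predictor against the two-class outcome.
# roc() takes the response first, then the predictor; direction and
# levels are left at their defaults here for simplicity.
library(pROC)
auc_A <- auc(roc(example_train$Class, example_train$PredictorA))
auc_B <- auc(roc(example_train$Class, example_train$PredictorB))
```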
Here are univariate box-plots of each predictor (on the log scale):
There is some mild differentiation between the classes but a significant amount of overlap in the boxes. The area under the ROC curves for predictor A and B are 0.61 and 0.59, respectively. Not so fantastic.
What can we do? Principal component analysis (PCA) is a pre-processing method that does a rotation of the predictor data in a manner that creates new synthetic predictors (i.e. the principal components or PC’s). This is conducted in a way where the first component accounts for the majority of the (linear) variation or information in the predictor data. The second component does the same for any information in the data that remains after extracting the first component and so on. For these data, there are two possible components (since there are only two predictors). Using PCA in this manner is typically called feature extraction.
Let’s compute the components:
> library(caret)
> head(example_train)
PredictorA PredictorB Class
2 3278.726 154.89876 One
3 1727.410 84.56460 Two
4 1194.932 101.09107 One
12 1027.222 68.71062 Two
15 1035.608 73.40559 One
16 1433.918 79.47569 One
> pca_pp <- preProcess(example_train[, 1:2],
+                      method = c("center", "scale", "pca"))
> pca_pp
Call:
preProcess.default(x = example_train[, 1:2], method = c("center",
"scale", "pca"))
Created from 1009 samples and 2 variables
Pre-processing: centered, scaled, principal component signal extraction
PCA needed 2 components to capture 95 percent of the variance
> train_pc <- predict(pca_pp, example_train[, 1:2])
> test_pc <- predict(pca_pp, example_test[, 1:2])
> head(test_pc, 4)
PC1 PC2
1 0.8420447 0.07284802
5 0.2189168 0.04568417
6 1.2074404 -0.21040558
7 1.1794578 -0.20980371
Note that we computed all the necessary information from the training set and then applied those calculations to the test set. What do the test set data look like?
These are the test set predictors simply rotated.
PCA is unsupervised, meaning that the outcome classes are not considered when the calculations are done. Here, the area under the ROC curve is 0.5 for the first component and 0.81 for the second component. These results jibe with the plot above; the first component has a random mixture of the classes while the second seems to separate the classes well. Box plots of the two components reflect the same thing:
There is much more separation in the second component.
This is interesting. First, despite PCA being unsupervised, it managed to find a new predictor that differentiates the classes. Second, it is the last component that is most predictive of the classes even though it explains the least variation in the predictors. It is often said that PCA doesn’t guarantee that any of the components will be predictive, and this is true. Here, we get lucky and it does produce something good.
However, imagine that there are hundreds of predictors. We may only need to use the first X components to capture the majority of the information in the predictors and, in doing so, discard the later components. In this example, the first component accounts for 92.4% of the variation in the predictors; a similar strategy would probably discard the most effective predictor.
How does the idea of feature engineering come into play here? Given these two predictors and seeing the first scatterplot shown above, one of the first things that occurs to me is “there are two correlated, positive, skewed predictors that appear to act in tandem to differentiate the classes”. The second thing that occurs to me is “take the ratio”. What does that data look like?
The corresponding area under the ROC curve is 0.8, which is nearly as good as the second component. A simple transformation based on visually exploring the data can do just as good a job as an unbiased empirical algorithm.
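The engineered ratio feature can be scored the same way as the raw predictors; a sketch (again assuming the pROC package and the example_train data frame from above):

```r
# Sketch: construct the ratio feature and measure its AUC
library(pROC)
ratio_train <- example_train$PredictorA / example_train$PredictorB
auc(roc(example_train$Class, ratio_train))
```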
These data are from the cell segmentation experiment of Hill et al, and predictor A is the “surface of a sphere created by rotating the equivalent circle about its diameter” (labeled as EqSphereAreaCh1 in the data) and predictor B is the perimeter of the cell nucleus (PerimCh1). A specialist in high content screening might naturally take the ratio of these two features of cells because it makes good scientific sense (I am not that person). In the context of the problem, their intuition should drive the feature engineering process.
However, in defense of an algorithm such as PCA, the machine has some benefit. In total, there are almost sixty predictors in these data whose features are just as arcane as EqSphereAreaCh1. My personal favorite is the “Haralick texture measurement of the spatial arrangement of pixels based on the co-occurrence matrix”. Look that one up some time. The point is that there are often too many features to engineer and they might be completely unintuitive from the start.
Another plus for feature extraction is related to correlation. The predictors in this particular data set tend to have high between-predictor correlations and for good reasons. For example, there are many different ways to quantify the eccentricity of a cell (i.e. how elongated it is). Also, the size of a cell’s nucleus is probably correlated with the size of the overall cell and so on. PCA can mitigate the effect of these correlations in one fell swoop. An approach of manually taking ratios of many predictors seems less likely to be effective and would take more time.
Last year, in one of the R&D groups that I support, there was a bit of a war being waged between the scientists who focused on biased analysis (i.e. we model what we know) versus the unbiased crowd (i.e. just let the machine figure it out). I fit somewhere in-between and believe that there is a feedback loop between the two. The machine can flag potentially new and interesting features that, once explored, become part of the standard book of “known stuff”.
Now I’m gonna tell my momma that I’m a traveller, I’m gonna follow the sun (The Sun, Parov Stelar)
Inspired by this book I read recently, I decided to do this experiment. The idea is to compare how easy it is to find sequences of numbers inside Pi, e, the Golden Ratio (Phi) and a randomly generated number. For example, since Pi is 3.1415926535897932384…, the 4-digit sequence 5358 can be easily found near the beginning, as well as the 5-digit sequence 79323. I considered it interesting to compare Pi with a randomly generated number. What I thought before doing the experiment was that it would be easier to find sequences inside the random one. Why? Because, despite being irrational and transcendental, I suspected there should be some kind of residual pattern in Pi that would make it more difficult to find random sequences inside it than inside a randomly generated number.
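The matching itself is just a substring search; a tiny sketch of the idea (the digit string below is the truncated fractional part of Pi, just for illustration):

```r
# Sketch: locate a digit sequence inside the digits of Pi
pi_digits <- "14159265358979323846"   # first fractional digits of Pi
regexpr("5358", pi_digits)            # returns 8: match starts at the 8th digit
grepl("79323", pi_digits)             # returns TRUE: the 5-digit sequence is present
```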
At first sight, it is equally easy (or difficult) to find random sequences inside all four numbers: my hypothesis was wrong.
As you can see here, 100,000 digits is more than enough to find 4-digit sequences. In fact, from 45,000 digits on, I reach 100% successful matches:
I only find 60% of 5-digit sequences inside 100,000 digits:
And only 10% of 6-digit sequences:
Why are these four numbers so similar when it comes to finding random sequences inside them? I don’t know. What I do know is that if you want to find your telephone number inside Pi, you will probably need an enormous number of digits.
library(rvest)
library(stringr)
library(reshape2)
library(ggplot2)
library(extrafont); windowsFonts(Comic = windowsFont("Comic Sans MS"))
library(dplyr)
library(magrittr)
library(scales)

p = html("http://www.geom.uiuc.edu/~huberty/math5337/groupe/digits.html")
f = html("http://www.goldennumber.net/wp-content/uploads/2012/06/Phi-To-100000-Places.txt")
e = html("http://apod.nasa.gov/htmltest/gifcity/e.2mil")

p %>% html_text() %>% substr(., regexpr("3.14", .), regexpr("Go to Historical", .)) %>%
  gsub("[^0-9]", "", .) %>% substr(., 1, 100000) -> p
f %>% html_text() %>% substr(., regexpr("1.61", .), nchar(.)) %>%
  gsub("[^0-9]", "", .) %>% substr(., 1, 100000) -> f
e %>% html_text() %>% substr(., regexpr("2.71", .), nchar(.)) %>%
  gsub("[^0-9]", "", .) %>% substr(., 1, 100000) -> e
r = paste0(sample(0:9, 100000, replace = TRUE), collapse = "")

results = data.frame(Cut = numeric(0), Pi = numeric(0), Phi = numeric(0),
                     e = numeric(0), Random = numeric(0))
bins = 20
dgts = 6
samp = min(10^dgts * 2/100, 10000)
for (i in 1:bins) {
  cut = 100000/bins * i
  p0 = substr(p, start = 0, stop = cut)
  f0 = substr(f, start = 0, stop = cut)
  e0 = substr(e, start = 0, stop = cut)
  r0 = substr(r, start = 0, stop = cut)
  sample(0:(10^dgts - 1), samp, replace = FALSE) %>% str_pad(dgts, pad = "0") -> comb
  comb %>% sapply(function(x) grepl(x, p0)) %>% sum() -> p1
  comb %>% sapply(function(x) grepl(x, f0)) %>% sum() -> f1
  comb %>% sapply(function(x) grepl(x, e0)) %>% sum() -> e1
  comb %>% sapply(function(x) grepl(x, r0)) %>% sum() -> r1
  results = rbind(results, data.frame(Cut = cut, Pi = p1, Phi = f1, e = e1, Random = r1))
}
results = melt(results, id.vars = c("Cut"), variable.name = "number", value.name = "matches")

opts = theme(
  panel.background = element_rect(fill = "darkolivegreen1"),
  panel.border = element_rect(colour = "black", fill = NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour = "black"),
  panel.grid.major = element_line(colour = "white", linetype = 1),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour = "black"),
  axis.text.x = element_text(colour = "black"),
  text = element_text(size = 20, family = "Comic"),
  legend.text = element_text(size = 25),
  legend.key = element_blank(),
  legend.position = c(.75, .2),
  legend.background = element_blank(),
  plot.title = element_text(size = 30))

ggplot(results, aes(x = Cut, y = matches/samp, color = number)) +
  geom_line(size = 1.5, alpha = .8) +
  scale_color_discrete(name = "") +
  scale_x_continuous(breaks = seq(100000/bins, 100000, by = 100000/bins)) +
  scale_y_continuous(labels = percent) +
  theme(axis.text.x = element_text(angle = 90, vjust = .5, hjust = 1)) +
  labs(title = paste0("Finding ", dgts, "-size strings into 100.000-digit numbers"),
       x = "Cut Position", y = "% of Matches") + opts
The post Ternary Interpolation / Smoothing appeared first on ggtern: ternary diagrams in R.
For a long time, people have been sending me requests for a suitable smoothing / contouring / interpolation geometry to be made available via ggtern, over and above the kernel density function. I am very pleased to say that the recent version 1.0.6 has this feature added. Let me demonstrate how it works.
This geometry & stat requires an additional mapping, to ‘value’, in order to conduct the interpolation, which by default is done using multivariate linear regression:
# Start by setting up the environment
library(ggtern)
data(Feldspar)
theme_base <- function() {
  list(theme_bw(),
       theme(legend.position = c(0, 1),
             legend.justification = c(0, 1)))
}

# Demonstrate the simplest example
plot <- ggtern(Feldspar, aes(x = Or, y = An, z = Ab)) +
  geom_interpolate_tern(aes(value = P.Gpa, color = ..level..)) +
  geom_point() +
  theme_base() +
  labs(color = "P/Gpa Model")
plot
There are plenty of options to tailor how the interpolation gets done; let’s reduce the spacing between the contours and add some more vibrant colours:
# Demonstrate the use of some options
plot <- ggtern(Feldspar, aes(x = Or, y = An, z = Ab)) +
  geom_interpolate_tern(aes(value = P.Gpa, fill = ..level..),
                        colour   = "white",
                        formula  = value ~ poly(x, y, degree = 2, raw = TRUE),
                        method   = "lm",
                        binwidth = 25,
                        buffer   = 1.5,
                        n        = 200) +
  geom_point() +
  theme_base() +
  scale_fill_gradient(low = "green", high = "red") +
  labs(fill = "P/GPa Model")
plot
For further information, see the help files: ?stat_interpolate_tern or ?geom_interpolate_tern.
Many processes in chemistry, especially in synthesis, require attaining a certain target value for a property of interest. For example, when synthesizing drug capsules that contain a medicine, a chemist has to ensure that the concentration of the medicine meets a target value. If the concentration is too high or too low, then the patient ingesting the drug capsules could suffer catastrophic health problems. Thus, monitoring this attainment is a very important part of analytical chemistry.
Of course, natural variation in any chemical process will result in some variation in the output, so the target value will rarely be attained exactly. There is usually an acceptable range of values, but any deviation of the output beyond this acceptable range must be discovered and treated with alarm, as the underlying process for generating that output may be inherently faulty. The process should be stopped, examined, and repaired before any more output can be generated. From a statistical perspective, there needs to be some mechanism to monitor for outliers as the process unfolds.
A control chart is a useful tool for monitoring chemical processes to detect outliers. In this tutorial, I will
Read the rest of this blog post to learn how to build the above control chart in R!
A control chart is a scatter plot that allows a chemist to monitor a process as it happens over time. It plots the quantity of interest on the vertical axis against time (or the order of the generation of the data) on the horizontal axis. There are many variations of control charts, but they generally show how far the data deviate from a target value. Here is one type of control chart from Harris (2003) that you can easily plot. There are 5 special horizontal lines in this type of control chart. Suppose that the target value is the population mean, and that the population standard deviation is known***.
(***Although the true values of the population mean and population standard deviation can never be accurately determined, many laboratories and companies have long histories of doing the same chemical process. Thus, they have very unbiased and precise estimates of the population mean and population standard deviation, and those estimates can be considered to be “true” for such practical purposes.)
Then, you can plot the data versus time, and add 5 special lines to this scatter plot. The values of these lines are in brackets below.
A chemist can use these lines to determine if the data are deviating too far away from the target value. Significantly large deviations indicate that something is wrong with the data-generating process. Daniel Harris (2003) suggests stopping the process and examining for malfunctions if any of the following events occur.
a) 1 datum falls outside of the action lines
b) 2 out of 3 consecutive data fall between the warning lines and the action lines
c) 7 consecutive data are all above or all below the target value line
d) 6 consecutive data steadily increase or steadily decrease, regardless of their location
e) 14 consecutive data alternate up and down regardless of their location
f) some obvious non-random pattern
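A couple of these run rules can be checked programmatically. Here is a sketch for criteria a) and c); check_rules is a name I made up, and it mirrors the mu/sigma/action-line definitions used in the script later in the post:

```r
# Sketch: flag two of the run rules on a measurement vector x,
# given the target value mu and standard deviation sigma
check_rules <- function(x, mu, sigma) {
  n <- length(x)
  upper_action <- mu + 3 * sigma / sqrt(n)
  lower_action <- mu - 3 * sigma / sqrt(n)

  # a) any single point falls outside the action lines
  rule_a <- any(x > upper_action | x < lower_action)

  # c) 7 consecutive points all above or all below the target value
  side   <- sign(x - mu)        # +1 above target, -1 below, 0 exactly on it
  runs   <- rle(side)
  rule_c <- any(runs$lengths >= 7 & runs$values != 0)

  c(outside_action = rule_a, seven_one_side = rule_c)
}
```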
Suppose that you are producing vitamin C capsules, and the target weight percentage of vitamin C in each capsule is 95%. Based on past experience in making these capsules, you know that the standard deviation of the weight percentage is 0.005. I have simulated 25 data points for this production process, and you can find this data set at the end of this blog post. I have called the vector of this data set “vitamin_c”.
Notice my use of
Here is the R script for plotting a control chart for this production process according to the specifications as outlined in Harris (2003). Note that I have labeled the 5 horizontal lines using abbreviations:
##### Plotting a Control Chart in Analytical Chemistry
##### By Eric Cai, The Chemical Statistician

# first, import the data vector from the bottom of this blog post
# assign it to the variable "vitamin_c"

# obtain the number of data in this vector
n = length(vitamin_c)

# create a vector of the order of the production
# this will be the horizontal axis in the control chart
ordering = 1:n

# the target weight percentage is 95%
mu = 95

# from past experience, you know that the standard deviation is 0.005
# treat this as the "true" standard deviation
sigma = 0.005

# set the 5 horizontal lines of the control chart
upper_action_line  = mu + 3*sigma/sqrt(n)
upper_warning_line = mu + 2*sigma/sqrt(n)
target_value       = mu
lower_warning_line = mu - 2*sigma/sqrt(n)
lower_action_line  = mu - 3*sigma/sqrt(n)

# put all of the values of the 5 horizontal lines into 1 vector
control_lines = c(upper_action_line, upper_warning_line, target_value,
                  lower_warning_line, lower_action_line)

# create a vector of labels for the 5 horizontal lines
control_labels = c('UA', 'UW', 'TV', 'LW', 'LA')

# export the control chart in PNG format to a folder of your choice
png('Write Your Working Directory Path Here/control chart for vitamin c production.png')

# note the use of the "yaxt = 'n'" option to suppress the default y-axis
plot(ordering, vitamin_c, main = 'Control Chart for Vitamin C Production',
     xlab = 'Order of Production', ylab = 'Weight Percentage', yaxt = 'n',
     ylim = c(lower_action_line - sigma, upper_action_line + sigma))

# draw the 5 horizontal lines along the left vertical axis
abline(h = control_lines)

# write the labels for the 5 horizontal lines
axis(2, at = control_lines, labels = control_labels)
dev.off()
Here is the resulting control chart.
Notice that Data #12-18 (7 consecutive points) all fall below the target value line. Based on criterion c) above, that warrants shutting down the production process to examine for a possible malfunction.
Harris, Daniel C. “Quantitative analytical chemistry” (2003).
Weight Percentage of Vitamin C
94.9991591445192
95.0013843593435
94.9987445081374
95.0000701427664
95.0017114408727
94.9993970920185
94.9995278336148
94.9993646286875
94.9997142263651
95.0001381082248
95.0012276303438
94.9991982205454
94.9989196074000
94.9998424656439
94.9989282399601
94.9998610138594
94.9994026869053
94.9978160332399
95.0002408172559
94.9997406445933
95.0009005119453
95.0009418693939
95.0014679619034
95.0007067610896
95.0008190089303
Originally posted on Models are illuminating and wrong:
I recently interviewed Hadley Wickham, the creator of ggplot2 and a famous R stats person. He works for RStudio, and his job is to work on open-source software aimed at data geeks. Hadley is famous for his contributions to data science tooling and inspires a lot of other languages! I include some light edits.
1. What project that you have worked on do you wish you could go back to and do better? ggplot2! ggplot2 is such a heavily used package, and I knew so little about R programming when I wrote it. It works (mostly) and I love it, but the internals are pretty hairy. Even I don’t understand how a lot of it works these days. Fortunately, I’ve had the opportunity to spend some more time on it lately, as I’ve been working on the 2nd edition of the ggplot2 book.
One thing that I’m particularly excited about is…