The post Factor Analysis with the Principal Factor Method and R appeared first on Aaron Schlegel.

(This article was first published on ** R – Aaron Schlegel**, and kindly contributed to R-bloggers)

As discussed in a previous post on the principal component method of factor analysis, the \hat{\Psi} term in the estimated covariance matrix S, S = \hat{\Lambda} \hat{\Lambda}' + \hat{\Psi}, was excluded and we proceeded directly to factoring S and R. The principal factor method of factor analysis (also called the principal axis method) finds an initial estimate of \hat{\Psi} and factors S - \hat{\Psi}, or R - \hat{\Psi} for the correlation matrix. Rearranging the estimated covariance and correlation matrices with the estimated p \times m matrix \hat{\Lambda} yields:

S - \hat{\Psi} = \hat{\Lambda} \hat{\Lambda}^\prime

R - \hat{\Psi} = \hat{\Lambda} \hat{\Lambda}^\prime

Therefore the principal factor method begins with the eigenvalues and eigenvectors of S - \hat{\Psi} or R - \hat{\Psi}, where \hat{\Psi} is the diagonal matrix of specific variances \hat{\psi}_i. As in the principal component method, the ith communality, \hat{h}^2_i, is equal to s_{ii} - \hat{\psi}_i for S - \hat{\Psi} and 1 - \hat{\psi}_i for R - \hat{\Psi}. The diagonal of S or R is replaced by the respective communalities, which gives the following forms:

S - \hat{\Psi} = \begin{bmatrix} \hat{h}^2_1 & s_{12} & \cdots & s_{1p} \\ s_{21} & \hat{h}^2_2 & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & \hat{h}^2_p \end{bmatrix}

R - \hat{\Psi} = \begin{bmatrix} \hat{h}^2_1 & r_{12} & \cdots & r_{1p} \\ r_{21} & \hat{h}^2_2 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & \hat{h}^2_p \end{bmatrix}

An initial estimate of the communalities is made using the squared multiple correlation between the observation vector y_i and the other p - 1 variables. The squared multiple correlation in the case of R - \hat{\Psi} is equivalent to the following:

\hat{h}^2_i = 1 - \frac{1}{r^{ii}}

Where r^{ii} is the ith diagonal element of R^{-1}. In the case of S - \hat{\Psi}, the above is multiplied by the variance of the respective variable.

The factor loadings are then calculated by finding the eigenvalues and eigenvectors of the R - \hat{\Psi} or S - \hat{\Psi} matrix.
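Before turning to the rootstock data, the full sequence of steps can be sketched in base R on a small toy correlation matrix (the values below are made up purely for illustration and are not from the rootstock data):

```r
# Toy 3x3 correlation matrix (hypothetical values, for illustration only)
R.toy <- matrix(c(1.0, 0.6, 0.3,
                  0.6, 1.0, 0.4,
                  0.3, 0.4, 1.0), nrow = 3)

# Initial communality estimates via squared multiple correlations:
# h2_i = 1 - 1 / r^ii, with r^ii the ith diagonal element of R^-1
h2 <- 1 - 1 / diag(solve(R.toy))

# Form R - Psi-hat by replacing the diagonal with the communalities
R.psi <- R.toy
diag(R.psi) <- h2

# Factor loadings from the first m eigenvalue-eigenvector pairs
m <- 1
e <- eigen(R.psi)
Lambda <- e$vectors[, 1:m, drop = FALSE] %*% diag(sqrt(e$values[1:m]), m)
```

The same three steps (estimate communalities, replace the diagonal, factor the result) are applied to the real data below.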

We will perform factor analysis using the principal factor method on the rootstock data as done previously with the principal component method to see if the approaches differ significantly. The data were obtained from the companion FTP site of the book Methods of Multivariate Analysis by Alvin Rencher. The data contains four dependent variables as follows:

- trunk girth at four years (mm \times 100)
- extension growth at four years (m)
- trunk girth at 15 years (mm \times 100)
- weight of tree above ground at 15 years (lb \times 1000)

Load the data and name the columns.

root <- read.table('ROOT.DAT',
                   col.names = c('Tree.Number', 'Trunk.Girth.4.Years',
                                 'Ext.Growth.4.Years', 'Trunk.Girth.15.Years',
                                 'Weight.Above.Ground.15.Years'))

Since the variables of the rootstock data were measured on different scales, we will proceed with using the correlation matrix R to perform factor analysis.

Find the correlation matrix.

R <- cor(root[,2:5])
round(R, 2)
##                              Trunk.Girth.4.Years Ext.Growth.4.Years Trunk.Girth.15.Years Weight.Above.Ground.15.Years
## Trunk.Girth.4.Years                         1.00               0.88                 0.44                         0.33
## Ext.Growth.4.Years                          0.88               1.00                 0.52                         0.45
## Trunk.Girth.15.Years                        0.44               0.52                 1.00                         0.95
## Weight.Above.Ground.15.Years                0.33               0.45                 0.95                         1.00

Calculate and replace the diagonal of R with the estimated communalities.

R.smc <- 1 - 1 / diag(solve(R))
diag(R) <- R.smc
round(R, 2)
##                              Trunk.Girth.4.Years Ext.Growth.4.Years Trunk.Girth.15.Years Weight.Above.Ground.15.Years
## Trunk.Girth.4.Years                         0.80               0.88                 0.44                         0.33
## Ext.Growth.4.Years                          0.88               0.81                 0.52                         0.45
## Trunk.Girth.15.Years                        0.44               0.52                 0.91                         0.95
## Weight.Above.Ground.15.Years                0.33               0.45                 0.95                         0.91

Now that we have an initial estimate of the communalities, we can find the eigenvalues and eigenvectors of the R - \hat{\Psi} matrix.

r.eigen <- eigen(R)
r.eigen$values
## [1]  2.64598796  0.90756969 -0.03306996 -0.09008523

The R matrix is no longer positive semidefinite after replacing the diagonal with the communalities, so a few small negative eigenvalues appear. Since negative eigenvalues cannot be used to estimate \hat{\Lambda} (taking the square root of the D matrix would be undefined), we proceed with m = 2.

An interesting note: when negative eigenvalues exist, the cumulative proportion of variance calculated from the eigenvalues will exceed 1 before declining back to 1 once the negative eigenvalues are included. The following loop demonstrates this phenomenon:

tot.prop <- 0
for (i in r.eigen$values) {
  tot.prop <- tot.prop + i / sum(r.eigen$values)
  print(tot.prop)
}
## [1] 0.7713346
## [1] 1.035901
## [1] 1.026261
## [1] 1

Obtain the factor loadings as before by multiplying the square root of the first two eigenvalues by their respective eigenvectors.

r.lambda <- as.matrix(r.eigen$vectors[,1:2]) %*% diag(sqrt(r.eigen$values[1:2]))
r.lambda
##            [,1]       [,2]
## [1,] -0.7402973  0.5421762
## [2,] -0.8034091  0.4520610
## [3,] -0.8770380 -0.4050851
## [4,] -0.8266112 -0.4951379

The communalities, specific variances and complexity of the factor loadings can then be calculated.

r.h2 <- rowSums(r.lambda^2)
r.u2 <- 1 - r.h2
com <- rowSums(r.lambda^2)^2 / rowSums(r.lambda^4)

Collect the results into a `data.frame`.

cor.pa <- data.frame(cbind(round(r.lambda, 2), round(r.h2, 2), round(r.u2, 3), round(com, 1)))
colnames(cor.pa) <- c('PA1', 'PA2', 'h2', 'u2', 'com')
cor.pa
##     PA1   PA2   h2    u2 com
## 1 -0.74  0.54 0.84 0.158 1.8
## 2 -0.80  0.45 0.85 0.150 1.6
## 3 -0.88 -0.41 0.93 0.067 1.4
## 4 -0.83 -0.50 0.93 0.072 1.6

The `psych` package also performs factor analysis using the principal factor method with the `fa()` function.

library(psych)

The `fa()` function performs the iterated principal factor method by default, which, as the name implies, iterates the initial communality estimates with those of the resulting \hat{\Lambda} matrix until they converge. This approach to factor analysis will be demonstrated in a future post. Setting the `max.iter` argument to 1 outputs the non-iterated principal factor method results.

root.cor.fa <- fa(root[,2:5], nfactors = 2, rotate = 'none', fm = 'pa', max.iter = 1)
root.cor.fa
## Factor Analysis using method = pa
## Call: fa(r = root[, 2:5], nfactors = 2, rotate = "none", max.iter = 1,
##     fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                               PA1   PA2   h2    u2 com
## Trunk.Girth.4.Years          0.74  0.54 0.84 0.158 1.8
## Ext.Growth.4.Years           0.80  0.45 0.85 0.150 1.6
## Trunk.Girth.15.Years         0.88 -0.41 0.93 0.067 1.4
## Weight.Above.Ground.15.Years 0.83 -0.50 0.93 0.072 1.6
##
##                        PA1  PA2
## SS loadings           2.65 0.91
## Proportion Var        0.66 0.23
## Cumulative Var        0.66 0.89
## Proportion Explained  0.74 0.26
## Cumulative Proportion 0.74 1.00
##
## Mean item complexity = 1.6
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 6 and the objective function was 4.19 with Chi Square of 187.92
## The degrees of freedom for the model are -1 and the objective function was 0.17
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is NA
##
## The harmonic number of observations is 48 with the empirical chi square 0.24 with prob < NA
## The total number of observations was 48 with MLE Chi Square = 7.26 with prob < NA
##
## Tucker Lewis Index of factoring reliability = 1.281
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
##                                                PA1  PA2
## Correlation of scores with factors            0.98 0.93
## Multiple R square of scores with factors      0.95 0.86
## Minimum correlation of possible factor scores 0.90 0.72

The results of the `fa()` function align with ours, apart from an arbitrary scaling of the first factor by -1.
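This sign flip is harmless: for any diagonal matrix D of \pm 1 entries, (\hat{\Lambda} D)(\hat{\Lambda} D)' = \hat{\Lambda} \hat{\Lambda}', so flipping a factor's sign leaves the fitted matrix unchanged. A quick check using the loading values printed above (rounded to two decimals):

```r
# Our loadings, rounded to two decimals
L1 <- matrix(c(-0.74,  0.54,
               -0.80,  0.45,
               -0.88, -0.41,
               -0.83, -0.50), ncol = 2, byrow = TRUE)

# Flip the sign of the first factor, as fa() effectively reports it
L2 <- L1
L2[, 1] <- -L2[, 1]

all.equal(L1 %*% t(L1), L2 %*% t(L2))  # TRUE: identical fitted matrix
```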

Rotate the factors using varimax rotation to improve interpretation. Varimax rotation of the loadings can be done with the `varimax()` function, or in the `fa()` function by setting the `rotate` argument to `'varimax'`.

root.cor.fa.v <- fa(root[,2:5], nfactors = 2, rotate = 'varimax', fm = 'pa', max.iter = 1)
root.cor.fa.v
## Factor Analysis using method = pa
## Call: fa(r = root[, 2:5], nfactors = 2, rotate = "varimax", max.iter = 1,
##     fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                               PA1  PA2   h2    u2 com
## Trunk.Girth.4.Years          0.19 0.90 0.84 0.158 1.1
## Ext.Growth.4.Years           0.30 0.87 0.85 0.150 1.2
## Trunk.Girth.15.Years         0.92 0.29 0.93 0.067 1.2
## Weight.Above.Ground.15.Years 0.95 0.18 0.93 0.072 1.1
##
##                        PA1  PA2
## SS loadings           1.87 1.68
## Proportion Var        0.47 0.42
## Cumulative Var        0.47 0.89
## Proportion Explained  0.53 0.47
## Cumulative Proportion 0.53 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 6 and the objective function was 4.19 with Chi Square of 187.92
## The degrees of freedom for the model are -1 and the objective function was 0.17
##
## The root mean square of the residuals (RMSR) is 0.02
## The df corrected root mean square of the residuals is NA
##
## The harmonic number of observations is 48 with the empirical chi square 0.24 with prob < NA
## The total number of observations was 48 with MLE Chi Square = 7.26 with prob < NA
##
## Tucker Lewis Index of factoring reliability = 1.281
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
##                                                PA1  PA2
## Correlation of scores with factors            0.97 0.94
## Multiple R square of scores with factors      0.94 0.87
## Minimum correlation of possible factor scores 0.88 0.75

The principal factor method (and iterated principal factor method) will usually yield results close to the principal component method when either the correlations or the number of variables is large (Rencher, 2002, p. 424).

Perform the principal component method of factor analysis and compare with the principal factor method.

root.cor.pa <- principal(root[,2:5], nfactors = 2, rotate = 'varimax')
root.cor.pa
## Principal Components Analysis
## Call: principal(r = root[, 2:5], nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                               RC1  RC2   h2    u2 com
## Trunk.Girth.4.Years          0.16 0.96 0.95 0.051 1.1
## Ext.Growth.4.Years           0.28 0.93 0.94 0.061 1.2
## Trunk.Girth.15.Years         0.94 0.29 0.97 0.027 1.2
## Weight.Above.Ground.15.Years 0.97 0.19 0.98 0.022 1.1
##
##                        RC1  RC2
## SS loadings           1.94 1.90
## Proportion Var        0.48 0.48
## Cumulative Var        0.48 0.96
## Proportion Explained  0.50 0.50
## Cumulative Proportion 0.50 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.03
##  with the empirical chi square 0.39 with prob < NA
##
## Fit based upon off diagonal values = 1

Both methods achieved a simple structure of the loadings following rotation. The loadings from each method are rather similar and don’t differ significantly, though the principal component method yielded factors that load more heavily on the variables which the factors hypothetically represent. However, the factors resulting from the principal component method explain 96% of the cumulative variance compared to 89% from the principal factor method. Though not a drastic difference, one is inclined to proceed with the principal component method in this case as the factors account for almost all of the variance in the variables.

Rencher, A. (2002). Methods of Multivariate Analysis (2nd ed.). Brigham Young University: John Wiley & Sons, Inc.


To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Aaron Schlegel**.


IT research firms study software products and corporate strategies, survey customers regarding their satisfaction with the products and services, and then provide their analysis in reports they sell to their clients. Each firm has its own criteria for rating companies, so they don't always agree. However, I find the reports extremely interesting reading. While these reports are expensive, the companies that receive good ratings often purchase copies to give away to potential customers. An Internet search of the report title will often reveal the companies that are distributing such copies.

Gartner, Inc. is one of the companies that provides such reports. Out of the roughly 100 companies selling data science software, Gartner selected 16 which had either high revenue or lower revenue but high growth (see full report for details). After extensive input from both customers and company representatives, Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Figure 3 shows the resulting plot. Note that purely open source software is not rated by Gartner, but nearly all the software in Figure 3 includes the ability to interact with R and Python.

The Leaders quadrant is the place for companies that have a future direction in line with their customers' needs and the resources to execute that vision. The four companies in the Leaders quadrant have remained the same for the last three reports: IBM, KNIME, RapidMiner, and SAS. Of these, they rate IBM as having slightly greater "completeness of vision" due to the extensive integration it offers to open source software compared to SAS Institute. KNIME and RapidMiner are quite similar, as they are driven by an easy-to-use workflow interface. Both offer free and open source versions, but RapidMiner's is limited by a cap on the amount of data that it can analyze. IBM and SAS are market leaders based on revenue and, as we have seen, KNIME and RapidMiner are the ones with high growth.

The companies in the Visionaries quadrant are those that have good future plans but may not have the resources to execute that vision. Of these, Microsoft increased its ability to execute compared to the 2016 report, and Alpine, one of the smallest companies, declined sharply in its ability to execute. The remaining three companies in this quadrant have just been added: H2O.ai, Dataiku, and Domino Data Lab.

Those in the Challengers quadrant have ample resources but less customer confidence in their future plans. Mathworks, the makers of MATLAB, is new to the report. Quest purchased Statistica from Dell, and it appears in roughly the same position as Dell did last year.

Companies in the Niche Players quadrant offer tools that are not as broadly applicable.

In 2017 Gartner dropped coverage of Accenture, Lavastorm, Megaputer, Predixion Software, and Prognoz.

(This article was first published on ** Rstats – OUseful.Info, the blog…**, and kindly contributed to R-bloggers)

Earlier this week, I spent a day chatting to folk from the House of Commons Library as a part of a bit of temporary day-a-week-or-so bit of work I’m doing with the Parliamentary Digital Service.

During one of the conversations on matters loosely geodata-related with Carl Baker, Carl mentioned an NHS Digital data set describing the number of people on a GP Practice list who live within a particular LSOA (Lower Super Output Area). There are possible GP practice closures on the Island at the moment, so I thought this might be an interesting dataset to play with in that respect.

Another thing Carl is involved with is producing a regularly updated briefing on Accident and Emergency Statistics. Excel and QGIS templates do much of the work in producing the updated documents, so much of the data wrangling side of the report generation is automated using those tools. Supporting regular updating of briefings, as well as answering specific, *ad hoc* questions from MPs, producing debate briefings and other current topic briefings, seems to be an important Library activity.

As I’ve been looking for opportunities to compare different automation routes using things like Jupyter notebooks and RMarkdown, I thought I’d have a play with the GP list/LSOA data, showing how we might be able to use each of those two routes to generate maps showing the geographical distribution, across LSOAs at least, for GP practices on the Isle of Wight. This demonstrates several things, including: data ingest; filtering according to practice codes accessed from another dataset; importing a geoJSON shapefile; generating a choropleth map using the shapefile matched to the GP list LSOA codes.

The first thing I tried was using a python/pandas Jupyter notebook to create a choropleth map for a particular practice using the `folium` library. This didn't take long to do at all – I've previously built an NHS admin database that lets me find practice codes associated with a particular CCG, such as the Isle of Wight CCG, as well as a notebook that generates a choropleth over LSOA boundaries, so it was simply a case of copying and pasting old bits of code and adding in the new dataset. You can see a rendered example of the notebook here (download).

One thing you might notice from the rendered notebook is that I actually “widgetised” it, allowing users of the live notebook to select a particular practice and render the associated map.

Whilst I find Jupyter notebooks to provide a really friendly and accommodating environment for pulling together a recipe such as this, the report generation workflows are arguably still somewhat behind those supported by RStudio and the `knitr` tools.

So what does an RStudio workflow have to offer? Using Rmarkdown (Rmd) we can combine text, code and code outputs in much the same way as we can in a Jupyter notebook, but with slightly more control over the presentation of the output.

For example, from a single Rmd file we can knit an output HTML file that incorporates an interactive *leaflet* map, or a static PDF document.

It’s also possible to use a parameterised report generation workflow to generate separate reports for each practice. For example, applying this parameterised report generation script to a generic base template report will generate a set of PDF reports on a per practice basis for each practice on the Isle of Wight.
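A parameterised workflow of the kind described might be sketched as follows (the template filename, practice codes, and the `practice` parameter name are illustrative assumptions, not the actual script):

```r
# Hypothetical practice codes -- illustrative only, not real NHS codes
practices <- c("J84004", "J84016")

# One PDF per practice from a single parameterised template; this assumes
# a template "practice_report.Rmd" whose YAML header declares a `practice`
# parameter (accessed inside the document as params$practice)
for (p in practices) {
  outfile <- paste0("report_", p, ".pdf")
  if (requireNamespace("rmarkdown", quietly = TRUE) &&
      file.exists("practice_report.Rmd")) {
    rmarkdown::render("practice_report.Rmd",
                      params = list(practice = p),
                      output_file = outfile)
  }
}
```

The guard clauses simply make the sketch safe to run without the template present; in a real script you would call `rmarkdown::render()` directly.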

The `bookdown` package, which I haven't played with yet, also looks promising for its ability to generate a single output document from a set of source documents. (I have a question in about the extent to which `bookdown` supports partially parameterised compound document creation.)

Having started thinking about comparisons between Excel, Jupyter and RStudio workflows, possible next steps are:

- to look for sensible ways of comparing the workflow associated with each,
- the ramp-up skills required, and blockers (including cultural blockers) associated with getting started with new tools such as Jupyter or RStudio, and
- the various ways in which each tool/workflow supports: transparency; maintainability; extendibility; correctness; reuse; integration with other tools; ease and speed of use.

It would also be interesting to explore how much time and effort would actually be involved in trying to port a legacy Excel report generating template to Rmd or ipynb, and what sorts of issue would be likely to arise, and what benefits Excel offers compared to Jupyter and RStudio workflows.


(This article was first published on ** Vijay Barve**, and kindly contributed to R-bloggers)

Guest Post by Enjie (Jane) LI

I have been using bdvis package (version 0.2.9) to visualize the iNaturalist records of RAScals project (http://www.inaturalist.org/projects/rascals).

Initially, the mapgrid function in the bdvis version 0.2.9 was written to map the number of records, number of species and completeness in a 1-degree cell grid (~111km x 111km resolution).

I applied this function to the RASCals dataset; see the following code. However, the mapping results were not satisfying. The 1-degree cells are too big to reveal the details in the study area. Also, the raster grid was on top of the basemap, which makes it really hard to associate the mapping results with physical locations.

library(rinat)
library(bdvis)

rascals <- get_inat_obs_project("rascals")
conf <- list(Latitude = "latitude",
             Longitude = "longitude",
             Date_collected = "Observed.on",
             Scientific_name = "Scientific.name")
rascals <- format_bdvis(rascals, config = conf)

## Get rid of a record with a weird location log
rascals <- rascals[!(rascals$Id == 4657868), ]

rascals <- getcellid(rascals)
rascals <- gettaxo(rascals)
bdsummary(rascals)

a <- mapgrid(indf = rascals, ptype = "records",
             title = "distribution of RASCals records",
             bbox = NA, legscale = 0, collow = "blue", colhigh = "red",
             mapdatabase = "county", region = "CA", customize = NULL)
b <- mapgrid(indf = rascals, ptype = "species",
             title = "distribution of species richness of RASCals records",
             bbox = NA, legscale = 0, collow = "blue", colhigh = "red",
             mapdatabase = "county", region = "CA", customize = NULL)

I contacted the developers of the package regarding these two issues, and they were very responsive in resolving them. They quickly added a gridscale argument to the mapgrid function, which allows users to choose the grid scale (0.1 or 1 degree); the default remains the 1-degree cell grid.

Here are mapping results from using the gridscale argument. **Make sure you have bdvis version 0.2.14 or later.**

c <- mapgrid(indf = rascals, ptype = "records",
             title = "distribution of RASCals records",
             bbox = NA, legscale = 0, collow = "blue", colhigh = "red",
             mapdatabase = "county", region = "CA", customize = NULL,
             gridscale = 0.1)
d <- mapgrid(indf = rascals, ptype = "species",
             title = "distribution of species richness of RASCals records",
             bbox = NA, legscale = 0, collow = "blue", colhigh = "red",
             mapdatabase = "county", region = "CA", customize = NULL,
             gridscale = 0.1)

We can see that the new map with a finer resolution definitely revealed more information within the range of our study area. One more thing to note is that in this version developers have adjusted the basemap to be on top of the raster layer. This has definitely made the map easier to read and reference back to the physical space.

Good job team! Thanks for developing and perfecting the bdvis package.

References

- RASCals: http://www.inaturalist.org/projects/rascals
- Barve, V., & Otegui, J. (2016). bdvis: Biodiversity data visualizations Version: 0.2.9 Accessed from https://cran.r-project.org/web/packages/bdvis/index.html
- Barve, V., & Otegui, J. (2017). bdvis: Biodiversity data visualizations Version: 0.2.14 Accessed from https://github.com/vijaybarve/bdvis


The post Euler Problem 13: Large Sum of 1000 Digits appeared first on The Devil is in the Data.

(This article was first published on ** The Devil is in the Data**, and kindly contributed to R-bloggers)

Euler Problem 13 asks us to add one hundred numbers of fifty digits each. This would be a simple problem were it not that most computers are not designed to deal with numbers containing that many digits. For example:

When asking R to compute this value, we get 1.844674e+19, losing most of the digits and limiting the accuracy of the result. Computers solve this problem using arbitrary-precision arithmetic, and there are many software libraries that can process long integers without losing accuracy. Euler Problem 13 requires this type of approach.
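The value being summed is not reproduced above, but the effect is easy to demonstrate with any sufficiently large number; 2^64, for instance, prints as exactly the truncated value quoted:

```r
# Doubles carry only about 15-16 significant decimal digits
print(2^64)              # 1.844674e+19
2^64 + 1 == 2^64         # TRUE: the added 1 is below the representable precision
.Machine$double.digits   # 53 bits in the significand
```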

Work out the first ten digits of the sum of the following one-hundred 50-digit numbers.

The easy way to solve this problem is to use the gmp package for working with very large integers. This package provides special number types, such as Big Rational and Big Integer, whose number of digits is limited only by the available memory.

library(gmp)

numbers <- readLines("Euler/p013_numbers.txt")
# Convert to big integers before summing so no precision is lost
digits <- sum(as.bigz(numbers))
answer <- substr(as.character(digits), 1, 10)

To find the solution to this problem using only base R, I wrote a function that adds numbers using strings instead of integers. The function adds leading zeros to the smaller number to make both the same length, then proceeds to add the numbers the way we were taught in primary school. This function can in principle be used for several other Euler Problems involving large integers.

# Add numbers with many digits
big.add <- function(a, b) {
    # Add leading zeros to the smaller number
    if (nchar(a) < nchar(b))
        a <- paste0(paste(rep(0, nchar(b) - nchar(a)), collapse = ""), a)
    if (nchar(a) > nchar(b))
        b <- paste0(paste(rep(0, nchar(a) - nchar(b)), collapse = ""), b)
    solution <- vector()
    remainder <- 0
    for (i in nchar(b):1) {
        p <- as.numeric(substr(a, i, i))
        q <- as.numeric(substr(b, i, i))
        r <- p + q + remainder
        if (r >= 10 & i != 1) {
            solution <- c(solution, r %% 10)
            remainder <- (r - (r %% 10)) / 10
        } else {
            solution <- c(solution, r)
            remainder <- 0
        }
    }
    return(paste(rev(solution), collapse = ""))
}

With this function, the problem is easy to solve. The second part of the code runs this function over the one hundred numbers provided on the Euler Problem page and calculates the answer.

numbers <- readLines("Euler/p013_numbers.txt")
answer <- "0"
for (i in numbers) {
    answer <- big.add(answer, i)
}
answer <- substr(answer, 1, 10)

You can expand this function to multiply a very large number by a smaller number using the Reduce function. This function adds the number *a* to itself, using the *big.add* function, and the outcome of each addition is used in the next iteration until it has been repeated *b* times. The number *b* needs to be a 'low' number because the function builds a vector of length *b*.

big.mult <- function(a, b) { Reduce(big.add, rep(a, as.numeric(b))) }
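As a quick sanity check, here are the two functions restated together (so the snippet runs on its own) along with a couple of example calls whose results are easy to verify by hand:

```r
# big.add and big.mult restated from above
big.add <- function(a, b) {
    # Pad the shorter string with leading zeros
    if (nchar(a) < nchar(b))
        a <- paste0(paste(rep(0, nchar(b) - nchar(a)), collapse = ""), a)
    if (nchar(a) > nchar(b))
        b <- paste0(paste(rep(0, nchar(a) - nchar(b)), collapse = ""), b)
    solution <- vector()
    remainder <- 0
    for (i in nchar(b):1) {
        p <- as.numeric(substr(a, i, i))
        q <- as.numeric(substr(b, i, i))
        r <- p + q + remainder
        if (r >= 10 & i != 1) {
            solution <- c(solution, r %% 10)
            remainder <- (r - (r %% 10)) / 10
        } else {
            solution <- c(solution, r)
            remainder <- 0
        }
    }
    paste(rev(solution), collapse = "")
}
big.mult <- function(a, b) Reduce(big.add, rep(a, as.numeric(b)))

big.add("99999999999999999999", "1")  # "100000000000000000000"
big.mult("123", "4")                  # "492"
```

The first call exercises the carry propagation across all twenty digits; the second repeats the string addition four times.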


(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

If you're an Excel user (or any other spreadsheet, really), adapting to learn R can be hard. As this blog post by Gordon Shotwell explains, one of the reasons is that simple things can be harder to do in R than in Excel. But it's worth persevering, because complex things can be easier.

While Excel (ahem) excels at things like arithmetic and tabulations (and some complex things too, like presentation), R's programmatic focus introduces concepts like data structures, iteration, and functions. Once you've made the investment in learning R, these abstractions make reducing complex tasks into discrete steps possible, and automating repeated similar tasks much easier. For the full, compelling argument, follow the link to the Yhat blog, below.

Yhat blog: R for Excel Users


My inaugural blog as a Data Science Consultant for SmartCat. The code that accompanies the analyses presented here is available at the respective GitHub repository. On how to use R to estimate the optimal time during the day for aliens to invade Earth and a few more interesting things.

A few days ago, NASA announced a press conference citing a "… Discovery Beyond Our Solar System", and I always tend to get excited about such news. I learned about the #askNASA hashtag, used by the public and media to approach NASA with their questions on Twitter. And when I hear the word "hashtag", social media analytics is what first comes to mind. Then I thought: what could I find out by studying tweets on #askNASA? Nothing much, unfortunately, because the number of such tweets doesn't exactly skyrocket (many questions are posted there, of course, but the volume of tweets is far below what one needs for a serious social media analytics study). Ok, then: I would study the whole twitterverse of NASA and NASA-related accounts, in order to discover the relative position of #AskNASA in the semantic space of what's being tweeted from NASA. I used {TwitteR} to access the Twitter Search API for retrospective tweets from almost all of the NASA Twitter accounts listed on https://www.nasa.gov/socialmedia/ (I haven't included the personal accounts of NASA astronauts; only the main account, and then everything found under the following categories: NASA Centers & Facilities, Organizations & Programs, Missions & Topics, plus NASAAstronauts and NASAPeople; 141 Twitter accounts in total). I also accessed the Twitter Search API to collect all recent tweets with the #askNASA hashtag. In total, I produced a collection of 1575 tweets with the #askNASA hashtag; these were cleaned of all tweets posted on behalf of NASA and NASA-related accounts. From the NASA accounts alone, I was able to get 255,241 tweets, following a several-hours-long exercise of the userTimeline() {TwitteR} function.

While scraping with {TwitteR} mindlessly, I was planning the analysis, and my thoughts started wandering around all the cool work that people at NASA do… What will they announce today? A new exoplanet, I can bet. People are crazy about exoplanets, and the aliens, and the SETI program, and Astrobiology, and all that stuff. A Twin Earth! However, nobody realizes, I thought, that the potential discovery of this Earth by some technologically advanced alien civilization could pose a real existential threat for us humans: a true global catastrophic risk. And with all their antennas, golden plates with pictographs, and Bach on their interstellar probes… nobody seems to worry about Sir Stephen Hawking's well-reasoned warning on how intelligent aliens could destroy humanity (not even he himself, cf. "Stephen Hawking: Intelligent Aliens Could Destroy Humanity, But Let's Search Anyway")! Anyway, it was probably around the third or fourth glass of wine in the Belgrade cafe from which I had left {TwitteR} doing the job on my netbook when I realized what I wanted to do this time with R: estimate the optimal time during the day for aliens to invade our planet by analyzing the daily oscillation in the sentiment and volume of tweets from NASA accounts. Assumption: if aliens somehow figure out where we live, it will be because of these guys with big radio antennas. Next: whatever alien civilization decides to invade Earth will certainly be technologically advanced enough to immediately discover the very source of our quest for them. Finally, given their technological supremacy, they will be able to analyze all the necessary information to ensure the success of their mission, including our (precious!) tweets.

And here it is, with a little help from {tm.plugin.sentiment}, {dplyr} and {ggplot2}:

library(dplyr)
library(tidyr)    # for gather()
library(ggplot2)

emoHours <- tweetsDF %>%
  group_by(Hour) %>%
  summarise(tweets = n(),
            positive = length(which(Polarity > 0)),
            neutral = length(which(Polarity == 0)),
            negative = length(which(Polarity < 0)))

emoHours$positive <- emoHours$positive / emoHours$tweets
emoHours$neutral <- emoHours$neutral / emoHours$tweets
emoHours$negative <- emoHours$negative / emoHours$tweets
emoHours$Hour <- as.numeric(emoHours$Hour)
emoHours$Volume <- emoHours$tweets / max(emoHours$tweets)

emoHours <- emoHours %>%
  gather(key = Measure,
         value = Value,
         positive:Volume)

ggplot(emoHours, aes(x = Hour, y = Value, color = Measure)) +
  geom_path(size = .25) +
  geom_point(size = 1.5) +
  geom_point(size = 1, color = "White") +
  ggtitle("Optimal Time to Invade Earth") +
  scale_x_continuous(breaks = 0:23, labels = as.character(0:23)) +
  theme_bw() +
  theme(plot.title = element_text(size = 12)) +
  theme(axis.text.x = element_text(size = 8, angle = 90))

Figure 1. Optimal time to invade Earth w. {tm.plugin.sentiment}, {dplyr}, and {ggplot2}

The tweetsDF data frame becomes available after running the previous chunks of code that you will find in the GitHub repo for this blog post. The Polarity column comes from the application of {tm.plugin.sentiment} functions over a {tm} pre-processed corpus of all 255,241 tweets that were collected from NASA’s accounts:

### --- Sentiment Analysis

# - as {tm} VCorpus

nasaCorpus <- VCorpus(VectorSource(tweetsDF$text))

# - {tm.plugin.sentiment} sentiment scores

# - Term-Document Matrix

nasaTDM <- TermDocumentMatrix(nasaCorpus,

control = list(tolower = TRUE,

removePunctuation = TRUE,

removeNumbers = TRUE,

removeWords = list(stopwords("english")),

stripWhitespace = TRUE,

stemDocument = TRUE,

minWordLength = 3,

weighting = weightTf))

# - {tm.plugin.sentiment} polarity score

# - NOTE: that would be (n-p)/(n+p)

nasaPolarity <- polarity(nasaTDM)

sum(nasaPolarity != 0)

tweetsDF$Polarity <- nasaPolarity

The optimal time for Alien invasion is obviously somewhere between 7:00 and 9:00 in the morning (NOTE for the aliens: all times are GMT). All tweets were categorized as neutral, positive, or negative in respect to their polarity, which is given as (n-p)/(n+p), n being the count of negative and p of positive words in the respective tweets. Then, instead of going for a time series analyses, I have simply grouped all tweets per hour of the day in which they have occurred, recalculating the counts of positive, negative, and neutral tweets into proportions of total tweets per hour. Thus, the vertical axis presents proportion of tweets per hour. The Volume variable is simply a rescaling of the plain simple counts of number of tweets per hour by the maximum count found in the data set, so that this can be conveniently presented on a 0 to 1 scale in the chart. And what we see is that between 7:00 and 9:00, approximately, an anomaly in the hourly distribution of tweets from NASA’s accounts takes place: a sudden increase in the proportion of neutral and negative tweets, accompanied by the drop in the volume of tweets occurring. So that’s when we’re moody and not relaxed, and probably tweeting less given the pressure of the daily work routine before lunch: the ideal time for an Alien civilization to invade.

Of course, technologically advanced aliens, who know statistics very well, could as well ask whether the described phenomenon is simply a by product of the increased measurement error related to the quite obvious drop in the sample sizes for the respective, invasion-critical hours…

Putting aside the question of alien invasion, I am really very interested to learn from NASA today what is that was discovered beyond the limits of the Solar system. To prove how the discoveries of potentially habitable exoplanets are popular, the following analysis was conducted. In the first step, we simply concatenate all tweets originating from the same account, while treating the tweets with the #askNASA hashtag as a separate group (i.e. as it was a Twitter account in itself). Given that I was interested in the account level of analysis here, and provided that individual tweets offer too little information for typical BoW approaches in text mining, this step is really justified. Then, I have produced a typical Term-Document Matrix from all available tweets, preserving all terms beginning with “@” or “#” there, and then cleaned the matrix from everything else. Finally, the term counts were turned into binary (present/absent) information in order to compute the Jaccard similarity coefficients across the accounts:

tweetTexts <- tweetsDF %>%

group_by(screeName) %>%

summarise(text = paste(text, collapse = " "))

# - accNames is to be used later:

accNames <- tweetTexts$screeName

accNames <- append(accNames, "askNASA")

tweetTexts <- tweetTexts$text

askNasaText <- paste(dT$text, collapse = "")

tweetTexts <- append(tweetTexts, askNasaText)

tweetTexts <- enc2utf8(tweetTexts)

tweetTexts <- VCorpus(VectorSource(tweetTexts))

# - Term-Doc Matrix for this:

removePunctuationSpecial <- function(x) {

x <- gsub("#", "HASHCHAR", x)

x<- gsub("@", "MONKEYCHAR", x)

x <- gsub("[[:punct:]]+", "", x)

x <- gsub("HASHCHAR", "#", x)

x <- gsub("MONKEYCHAR", "@", x)

return(x)

}

tweetTexts <- tm_map(tweetTexts,

content_transformer(removePunctuationSpecial),

lazy = TRUE)

tweetsTDM <- TermDocumentMatrix(tweetTexts,

control = list(tolower = FALSE,

removePunctuation = FALSE,

removeNumbers = TRUE,

removeWords = list(stopwords("english")),

stripWhitespace = TRUE,

stemDocument = FALSE,

minWordLength = 3,

weighting = weightTf))

# - store TDM object:

saveRDS(tweetsTDM, "tweetsTDM.Rds")

# - extract only mention and hashtag features:

tweetsTDM <- t(as.matrix(tweetsTDM))

w <- which(grepl("^@|^#", colnames(tweetsTDM)))

tweetsTDM <- tweetsTDM[, -w]

# - keep only mention and hashtag features w. Freq > 50

wK <- which(colSums(tweetsTDM) > 10)

tweetsTDM <- tweetsTDM[, wK]

# - transform to binary for Jaccard distance

wPos <- which(tweetsTDM > 0, arr.ind = T)

tweetsTDM[wPos] <- 1

# - Jaccard distances for accounts and #asknasa:

simAccounts <- dist(tweetsTDM, method = "Jaccard", by_rows = T)

simAccounts <- as.matrix(simAccounts)

The following {igraph} works this way: each account - of which one, be reminded, #askNASA is not really an account but represents information on all tweets with the respective hashtag- points to the account which is most similar to it in respect to the Jaccard distance computed from the presence and absence of mentions and hashtags used. So this is more some proxy of a “social distance” between accounts than a true distributional semantics measure. It can be readily observed that #askNASA points to @PlanetQuest as its nearest neighbor in this analysis. The Jaccard distance was used since I am not really into using typical Term-Document Count Matrices in analyzing tweets; they simply convey too sparse an information for a typical approach to make any sense.

Figure 2.Social Network of NASA Twitter accounts + #askNASA. {igraph}

Goran S. Milovanović, PhD

Data Science Consultant, SmartCat

(This article was first published on **The Exactness of Mind**, and kindly contributed to R-bloggers)

This post was originally published on SmartCat, 22 Feb 2017.

My inaugural blog post as a Data Science Consultant for SmartCat. The code that accompanies the analyses presented here is available in the respective GitHub repository. On how to use R to estimate the optimal time of day for aliens to invade Earth, and a few more interesting things.

A few days ago, NASA announced a press conference citing a “… Discovery Beyond Our Solar System”, and I always tend to get excited about such news. I learned about the #askNASA hashtag, used by the public and media to approach NASA with their questions on Twitter. And when I hear the word “hashtag”, social media analytics is what first comes to mind. Then I thought: what could I find out by studying tweets on #askNASA? Nothing much, unfortunately, because the number of such tweets doesn’t exactly skyrocket (many questions are posted there, of course, but the volume of tweets is far below what one needs for a serious social media analytics study). Ok, then: I would study the whole twitterverse of NASA and NASA-related accounts, in order to discover the relative position of #askNASA in the semantic space of what’s being tweeted from NASA. I used {TwitteR} to access the Twitter Search API for retrospective tweets from almost all of the NASA Twitter accounts listed on https://www.nasa.gov/socialmedia/ (I haven’t included the personal accounts of NASA astronauts; only the main account, and then everything found under the following categories: NASA Centers & Facilities, Organizations & Programs, Missions & Topics, plus NASAAstronauts and NASAPeople; 141 Twitter accounts in total). I also accessed the Twitter Search API to collect all recent tweets with the #askNASA hashtag: 1,575 tweets in total, cleaned of all tweets posted on behalf of NASA and NASA-related accounts. From the NASA accounts alone, I was able to get 255,241 tweets — following a several-hours-long exercise of the userTimeline() {TwitteR} function.

While {TwitteR} was scraping away, I was planning the analysis, and my thoughts started wandering around all the cool work that people at NASA do… What will they announce today? A new exoplanet, I can bet. People are crazy about exoplanets, and the aliens, and the SETI program, and astrobiology, and all that stuff. A Twin Earth! However, nobody realizes, I thought, that the potential discovery of this Earth by some technologically advanced alien civilization could pose a real existential threat for us humans: a true global catastrophic risk. And with all their antennas, golden records with pictographs and Bach on their interstellar probes… nobody seems to worry about Sir Stephen Hawking’s well-reasoned warning on how intelligent aliens could destroy humanity (not even Hawking himself, cf. “Stephen Hawking: Intelligent Aliens Could Destroy Humanity, But Let’s Search Anyway”)! Anyways, it was probably around the third or fourth glass of wine in the Belgrade cafe from where I had left {TwitteR} to do the job from my netbook that I realized what I wanted to do this time with R: I would estimate the optimal time of day for aliens to invade our planet by analyzing the daily oscillation in the sentiment and volume of tweets from NASA accounts. Assumption: if aliens somehow figure out where we live, it will be because of these guys with the big radio antennas. Next: whatever alien civilization decides to invade Earth will certainly be technologically advanced enough to immediately discover the very source of our quest for them. Finally, given their technological supremacy, they will be able to analyze all the information necessary to ensure the success of their mission: including our (precious!) tweets.

And here it is, with a little help from {tm.plugin.sentiment}, {dplyr} and {ggplot2}:

library(dplyr)
library(tidyr)    # gather() comes from {tidyr}
library(ggplot2)

emoHours <- tweetsDF %>%
  group_by(Hour) %>%
  summarise(tweets = n(),
            positive = length(which(Polarity > 0)),
            neutral = length(which(Polarity == 0)),
            negative = length(which(Polarity < 0)))
emoHours$positive <- emoHours$positive/emoHours$tweets
emoHours$neutral <- emoHours$neutral/emoHours$tweets
emoHours$negative <- emoHours$negative/emoHours$tweets
emoHours$Hour <- as.numeric(emoHours$Hour)
# - rescale the hourly counts to [0, 1] by the busiest hour:
emoHours$Volume <- emoHours$tweets/max(emoHours$tweets)
# - long format for {ggplot2}:
emoHours <- emoHours %>%
  gather(key = Measure,
         value = Value,
         positive:Volume)
ggplot(emoHours, aes(x = Hour, y = Value, color = Measure)) +
  geom_path(size = .25) +
  geom_point(size = 1.5) +
  geom_point(size = 1, color = "white") +
  ggtitle("Optimal Time to Invade Earth") +
  scale_x_continuous(breaks = 0:23, labels = as.character(0:23)) +
  theme_bw() +
  theme(plot.title = element_text(size = 12)) +
  theme(axis.text.x = element_text(size = 8, angle = 90))

**Figure 1.** Optimal time to invade Earth, with {tm.plugin.sentiment}, {dplyr}, and {ggplot2}

The tweetsDF data frame becomes available after running the previous chunks of code, which you will find in the GitHub repo for this blog post. The Polarity column comes from applying {tm.plugin.sentiment} functions to a {tm} pre-processed corpus of all 255,241 tweets collected from NASA’s accounts:

### --- Sentiment Analysis

library(tm)
library(tm.plugin.sentiment)

# - as {tm} VCorpus
nasaCorpus <- VCorpus(VectorSource(tweetsDF$text))
# - Term-Document Matrix for the {tm.plugin.sentiment} sentiment scores
nasaTDM <- TermDocumentMatrix(nasaCorpus,
                              control = list(tolower = TRUE,
                                             removePunctuation = TRUE,
                                             removeNumbers = TRUE,
                                             removeWords = list(stopwords("english")),
                                             stripWhitespace = TRUE,
                                             stemDocument = TRUE,
                                             minWordLength = 3,
                                             weighting = weightTf))
# - {tm.plugin.sentiment} polarity score: (p - n)/(p + n),
# - p and n being the counts of positive and negative words
nasaPolarity <- polarity(nasaTDM)
# - how many tweets carry a non-zero polarity?
sum(nasaPolarity != 0)
tweetsDF$Polarity <- nasaPolarity

The optimal time for an alien invasion is obviously somewhere between 7:00 and 9:00 in the morning (NOTE for the aliens: all times are GMT). All tweets were categorized as neutral, positive, or negative with respect to their polarity, which is given as (p-n)/(p+n), p being the count of positive and n of negative words in the respective tweet. Then, instead of going for a time-series analysis, I simply grouped all tweets by the hour of the day in which they occurred, recalculating the counts of positive, negative, and neutral tweets into proportions of the total tweets per hour. Thus, the vertical axis presents the proportion of tweets per hour. The Volume variable is simply a rescaling of the raw counts of tweets per hour by the maximum count found in the data set, so that it can be conveniently presented on a 0 to 1 scale in the chart. And what we see is that between 7:00 and 9:00, approximately, an anomaly in the hourly distribution of tweets from NASA’s accounts takes place: a sudden increase in the proportion of neutral and negative tweets, accompanied by a drop in the volume of tweets. So that’s when we’re moody and not relaxed, and probably tweeting less given the pressure of the daily work routine before lunch: the ideal time for an alien civilization to invade.
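The per-tweet arithmetic is simple enough to sketch in a few lines (Python here, for illustration only, with toy word counts rather than real lexicon matches; I write the score as (p − n)/(p + n), the convention under which positive scores correspond to positive tweets):

```python
def polarity(p, n):
    """Polarity score (p - n) / (p + n) for p positive and n negative words."""
    return 0.0 if p + n == 0 else (p - n) / (p + n)

def label(score):
    """Map a polarity score to the three-way category used in the chart."""
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# toy tweets as (positive-word count, negative-word count)
tweets = [(3, 1), (0, 0), (1, 4)]
scores = [polarity(p, n) for p, n in tweets]
print([label(s) for s in scores])  # ['positive', 'neutral', 'negative']
```

Grouping those labels by hour and dividing by the hourly tweet count gives exactly the proportions plotted in Figure 1.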

Of course, technologically advanced aliens, who know statistics very well, could just as well ask whether the described phenomenon is simply a by-product of the increased measurement error related to the quite obvious drop in the sample sizes for the respective, invasion-critical hours…

Putting aside the question of alien invasion, I am really very interested to learn from NASA today what it is that was discovered beyond the limits of the Solar System. To illustrate just how popular the discoveries of potentially habitable exoplanets are, the following analysis was conducted. In the first step, we simply concatenate all tweets originating from the same account, while treating the tweets with the #askNASA hashtag as a separate group (i.e. as if it were a Twitter account in itself). Given that I was interested in the account level of analysis here, and provided that individual tweets offer too little information for typical bag-of-words approaches in text mining, this step is really justified. Then, I produced a typical term-document matrix from all available tweets, preserved only the terms beginning with “@” or “#”, and cleaned the matrix of everything else. Finally, the term counts were turned into binary (present/absent) information in order to compute the Jaccard distances across the accounts:

library(tm)
library(dplyr)
library(proxy)    # proxy::dist() provides method = "Jaccard" and by_rows

tweetTexts <- tweetsDF %>%
  group_by(screenName) %>%
  summarise(text = paste(text, collapse = " "))
# - accNames is to be used later:
accNames <- tweetTexts$screenName
accNames <- append(accNames, "askNASA")
tweetTexts <- tweetTexts$text
# - dT holds the separately collected #askNASA tweets (see the GitHub repo):
askNasaText <- paste(dT$text, collapse = " ")
tweetTexts <- append(tweetTexts, askNasaText)
tweetTexts <- enc2utf8(tweetTexts)
tweetTexts <- VCorpus(VectorSource(tweetTexts))
# - Term-Doc Matrix for this:
# - protect "#" and "@" before stripping all other punctuation:
removePunctuationSpecial <- function(x) {
  x <- gsub("#", "HASHCHAR", x)
  x <- gsub("@", "MONKEYCHAR", x)
  x <- gsub("[[:punct:]]+", "", x)
  x <- gsub("HASHCHAR", "#", x)
  x <- gsub("MONKEYCHAR", "@", x)
  return(x)
}
tweetTexts <- tm_map(tweetTexts,
                     content_transformer(removePunctuationSpecial),
                     lazy = TRUE)
tweetsTDM <- TermDocumentMatrix(tweetTexts,
                                control = list(tolower = FALSE,
                                               removePunctuation = FALSE,
                                               removeNumbers = TRUE,
                                               removeWords = list(stopwords("english")),
                                               stripWhitespace = TRUE,
                                               stemDocument = FALSE,
                                               minWordLength = 3,
                                               weighting = weightTf))
# - store TDM object:
saveRDS(tweetsTDM, "tweetsTDM.Rds")
# - keep only the mention and hashtag features:
tweetsTDM <- t(as.matrix(tweetsTDM))
w <- which(grepl("^@|^#", colnames(tweetsTDM)))
tweetsTDM <- tweetsTDM[, w]
# - keep only mention and hashtag features w. Freq > 10
wK <- which(colSums(tweetsTDM) > 10)
tweetsTDM <- tweetsTDM[, wK]
# - transform to binary for the Jaccard distance
wPos <- which(tweetsTDM > 0, arr.ind = TRUE)
tweetsTDM[wPos] <- 1
# - Jaccard distances for accounts and #askNASA:
simAccounts <- dist(tweetsTDM, method = "Jaccard", by_rows = TRUE)
simAccounts <- as.matrix(simAccounts)
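As a sanity check on the Jaccard step: the distance between two binary presence/absence profiles is one minus the share of features present in both out of those present in either. A minimal sketch (Python, for illustration, with made-up vectors rather than the actual account data):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length binary (0/1) vectors."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return 1.0 - both / either if either else 0.0

# toy presence/absence profiles over five hashtag/mention features
profile_a = [1, 1, 0, 1, 0]
profile_b = [1, 1, 0, 0, 0]
print(jaccard_distance(profile_a, profile_b))  # two shared features, three in either: 1 - 2/3
```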

The following {igraph} network works like this: each account (of which one, be reminded, is not really an account — #askNASA represents the information from all tweets carrying that hashtag) points to the account most similar to it with respect to the Jaccard distance computed from the presence and absence of the mentions and hashtags used. So this is more a proxy for a “social distance” between accounts than a true distributional semantics measure. It can be readily observed that #askNASA points to @PlanetQuest as its nearest neighbor in this analysis. The Jaccard distance was used since I am not really into using typical term-document count matrices in analyzing tweets; the information they convey is simply too sparse for a typical approach to make any sense.
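Reading off those edges from the distance matrix is just an argmin per row, excluding the diagonal; a minimal sketch (Python, with hypothetical labels and distances — the actual post builds the graph in {igraph}):

```python
def nearest_neighbour_edges(labels, dist):
    """For each row of a square distance matrix, return (label, nearest label)."""
    edges = []
    for i, row in enumerate(dist):
        # exclude the self-distance at position i when taking the minimum
        j = min((d, k) for k, d in enumerate(row) if k != i)[1]
        edges.append((labels[i], labels[j]))
    return edges

labels = ["askNASA", "PlanetQuest", "NASA_Hubble"]
dist = [[0.0, 0.2, 0.7],
        [0.2, 0.0, 0.5],
        [0.7, 0.5, 0.0]]
print(nearest_neighbour_edges(labels, dist))
# [('askNASA', 'PlanetQuest'), ('PlanetQuest', 'askNASA'), ('NASA_Hubble', 'PlanetQuest')]
```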

**Figure 2.** Social network of NASA Twitter accounts + #askNASA, with {igraph}

Goran S. Milovanović, PhD

Data Science Consultant, SmartCat

To **leave a comment** for the author, please follow the link and comment on their blog: **The Exactness of Mind**.


(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

Radiohead is known for having some fairly maudlin songs, but of all of their tracks, which is the *most* depressing? Data scientist and R enthusiast Charlie Thompson ranked all of their tracks according to a "gloom index", and created the following chart of gloominess for each of the band's nine studio albums. (Click for the interactive version, created with the highcharter package for R, which allows you to explore individual tracks.)

If you're familiar with the albums, this looks pretty reasonable. Radiohead's debut, "Pablo Honey", was fairly poppy musically, but contained some pretty dark lyrics (especially in the break-out hit, *Creep*). Their most recent album, "A Moon Shaped Pool", is a fantastic listen, but it isn't exactly going to get the party started.

The "Gloom Index" charted above is a combination of three quantities, and then scaled from 1 (Radiohead's gloomiest song, *True Love Waits*), to 100 (the cheeriest, *15 Step*).

The first quantity is Spotify's "valence score", which Spotify describes as a "quantity describing the musical positiveness of a track". Valence scores range from 0 (songs that sound depressed or angry) to 1 (songs that are happy or euphoric). Charlie extracted the list of Radiohead's 101 songs and the valence score for each from the Spotify developer API, using the httr package for R. This is useful in its own right, but several songs have the same valence score, so Charlie also looks at song lyrics to further differentiate them.
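The API-fetching itself is routine once you have a token; the part worth sketching is pulling the valence out of an audio-features payload (Python here, for illustration; the sample JSON below is made up, and only the `valence` field matters for the gloom index):

```python
import json

def extract_valence(audio_features_json):
    """Pull the valence score out of a Spotify audio-features JSON payload."""
    return json.loads(audio_features_json)["valence"]

# a made-up payload; real responses carry many more fields
sample = '{"id": "abc123", "valence": 0.104, "energy": 0.39}'
print(extract_valence(sample))  # 0.104
```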

The second quantity is the percentage of words in the lyrics that are "sad". Charlie scraped the song lyrics from Genius using the rvest package, and then used the tidytext package to break the lyrics into words, eliminate common "stop words" like 'the' and 'a', and count the number with negative sentiment.
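That word-level step amounts to tokenizing, dropping stop words, and counting lexicon hits. A toy sketch (Python, with a made-up three-word lexicon standing in for the sentiment lexicon tidytext uses):

```python
STOP_WORDS = {"the", "a", "and", "i", "of"}
SAD_WORDS = {"alone", "cry", "fear"}   # stand-in for a real sentiment lexicon

def pct_sad(lyrics):
    """Fraction of non-stop-word tokens that match the sad-word lexicon."""
    words = [w for w in lyrics.lower().split() if w not in STOP_WORDS]
    return sum(w in SAD_WORDS for w in words) / len(words)

print(pct_sad("I cry alone and the fear remains"))  # 3 sad / 4 kept words -> 0.75
```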

The third quantity is the "lyrical density" (following a method described by Myles Harrison): the number of words per second, easily calculated from the Spotify track length data and the word counts calculated in the prior step.

The three quantities are combined to create the "gloom index" as follows:

$$ \mathrm{gloomIndex} = \frac{1 - \mathrm{valence}}{2} + \frac{\mathrm{pctSad} \times (1 + \mathrm{lyricalDensity})}{2} $$

Roughly, this is the average of the valence score and (almost) the number of sad words per second. (I'm guessing Charlie adds 1 to the lyrical density to get the two terms to about the same scale, so that both have about equal weight.)
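Plugging in toy numbers makes the weighting concrete (a sketch; the function and variable names are mine, not Charlie's):

```python
def gloom_index(valence, pct_sad, lyrical_density):
    """Gloom index: average of (1 - valence) and pct_sad * (1 + lyrical_density)."""
    return (1 - valence) / 2 + pct_sad * (1 + lyrical_density) / 2

# a low-valence track with 15% sad words and 1.2 words per second
print(round(gloom_index(valence=0.2, pct_sad=0.15, lyrical_density=1.2), 3))  # 0.565
```

With these inputs the valence term contributes 0.4 and the lyric term 0.165, so the musical mood dominates unless the lyrics are unusually sad and dense.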

It would be interesting to compare the "Gloom Index" for Radiohead with that of other famously downbeat artists (say, Bon Iver or Low). You'd need to do away with scaling the Gloom Index from 1 to 100 as Charlie has done here, but the formula could easily be adapted to make a universal score. If you'd like to give it a try, all of the R code is included in Charlie's blog post, linked below.

RCharlie: fitteR happieR
