(This article was first published on **Win-Vector Blog » R**, and kindly contributed to R-bloggers)

Illustration from Project Gutenberg

*The goal of cluster analysis is to group the observations in the data into clusters, such that every datum in a cluster is more similar to the other data points in the same cluster than it is to data points in other clusters. It is an analysis method of choice when annotated training data is not readily available. In this article, based on chapter 8 of Practical Data Science with R, the authors discuss one approach to evaluating the clusters that are discovered by a chosen clustering method.*

An important question when evaluating clusters is whether a given cluster is “real”: does the cluster represent actual structure in the data, or is it an artifact of the clustering algorithm? This is especially important with clustering algorithms like k-means, where the user has to specify the number of clusters *a priori*. It’s been our experience that clustering algorithms will often produce several clusters that represent actual structure or relationships in the data, and then one or two clusters that are buckets that represent “other” or “miscellaneous.” Clusters of “other” tend to be made up of data points that have no real relationship to each other; they just don’t fit anywhere else.

One way to assess whether a cluster represents true structure is to see if the cluster holds up under plausible variations in the dataset. The `fpc` package has a function called `clusterboot()` that uses bootstrap resampling to evaluate how stable a given cluster is. (For a full description of the algorithm, see Christian Hennig, “Cluster-wise assessment of cluster stability,” Research Report 271, Dept. of Statistical Science, University College London, December 2006.) `clusterboot()` is an integrated function that both performs the clustering and evaluates the final produced clusters. It has interfaces to a number of R clustering algorithms, including both `hclust` and `kmeans`.

`clusterboot()`’s algorithm uses the *Jaccard coefficient*, a similarity measure between sets. The Jaccard similarity between two sets A and B is the ratio of the number of elements in the intersection of A and B over the number of elements in the union of A and B. The basic general strategy is as follows:

- Cluster the data as usual.
- Draw a new dataset (of the same size as the original) by resampling the original dataset with replacement (meaning that some of the data points may show up more than once, and others not at all). Cluster the new dataset.
- For every cluster in the original clustering, find the most similar cluster in the new clustering (the one that gives the maximum Jaccard coefficient) and record that value. If this maximum Jaccard coefficient is less than 0.5, the original cluster is considered to be *dissolved*: it didn’t show up in the new clustering. A cluster that’s dissolved too often is probably not a “real” cluster.
- Repeat steps 2-3 several times.
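The Jaccard coefficient in step 3 is straightforward to compute for two clusters represented as sets of row indices. A minimal sketch (`clusterboot()` does this internally; the function name here is ours):

```r
# Jaccard similarity between two sets: |intersection| / |union|
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}
jaccard(c(1, 2, 3, 4), c(2, 3, 4, 5))  # 3/5 = 0.6
```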

The cluster stability of each cluster in the original clustering is the mean value of its Jaccard coefficient over all the bootstrap iterations. As a rule of thumb, clusters with a stability value less than 0.6 should be considered unstable. Values between 0.6 and 0.75 indicate that the cluster is measuring a pattern in the data, but there isn’t high certainty about which points should be clustered together. Clusters with stability values above about 0.85 can be considered highly stable (they’re likely to be real clusters).
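The rule of thumb above can be encoded as a small helper. This is our own sketch: the thresholds come from the text, but the label for the 0.75-0.85 range (which the text leaves unnamed) is our assumption:

```r
# Label mean Jaccard stability values per the rule of thumb
stability_label <- function(bootmean) {
  cut(bootmean,
      breaks = c(0, 0.6, 0.75, 0.85, 1),
      labels = c("unstable", "pattern, low certainty",
                 "stable", "highly stable"),
      right = FALSE, include.lowest = TRUE)
}
as.character(stability_label(c(0.5, 0.7, 0.8, 0.93)))
```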

Different clustering algorithms can give different stability values, even when the algorithms produce highly similar clusterings, so `clusterboot()` is also measuring how stable the clustering algorithm itself is.

**The protein dataset**

To demonstrate `clusterboot()`, we’ll use a small dataset from 1973 on protein consumption from nine different food groups in 25 countries in Europe. The original dataset can be found here. A tab-separated text file with the data can be found in this directory. The data file is called `protein.txt`; additional information can be found in the file `protein_README.txt`.

The goal is to group the countries based on patterns in their protein consumption. The dataset is loaded into R as a data frame called `protein`, as shown in the next listing.

```
protein <- read.table("protein.txt", sep="\t", header=TRUE)
summary(protein)
#           Country      RedMeat         WhiteMeat          Eggs
#  Albania       : 1   Min.   : 4.400   Min.   : 1.400   Min.   :0.500
#  Austria       : 1   1st Qu.: 7.800   1st Qu.: 4.900   1st Qu.:2.700
#  Belgium       : 1   Median : 9.500   Median : 7.800   Median :2.900
#  Bulgaria      : 1   Mean   : 9.828   Mean   : 7.896   Mean   :2.936
#  Czechoslovakia: 1   3rd Qu.:10.600   3rd Qu.:10.800   3rd Qu.:3.700
#  Denmark       : 1   Max.   :18.000   Max.   :14.000   Max.   :4.700
#  (Other)       :19
#       Milk            Fish          Cereals          Starch
#  Min.   : 4.90   Min.   : 0.200   Min.   :18.60   Min.   :0.600
#  1st Qu.:11.10   1st Qu.: 2.100   1st Qu.:24.30   1st Qu.:3.100
#  Median :17.60   Median : 3.400   Median :28.00   Median :4.700
#  Mean   :17.11   Mean   : 4.284   Mean   :32.25   Mean   :4.276
#  3rd Qu.:23.30   3rd Qu.: 5.800   3rd Qu.:40.10   3rd Qu.:5.700
#  Max.   :33.70   Max.   :14.200   Max.   :56.70   Max.   :6.500
#
#       Nuts           Fr.Veg
#  Min.   :0.700   Min.   :1.400
#  1st Qu.:1.500   1st Qu.:2.900
#  Median :2.400   Median :3.800
#  Mean   :3.072   Mean   :4.136
#  3rd Qu.:4.700   3rd Qu.:4.900
#  Max.   :7.800   Max.   :7.900

# Use all the columns except the first (Country).
vars.to.use <- colnames(protein)[-1]
# Scale the data columns to be zero mean and unit variance.
# The output of scale() is a matrix.
pmatrix <- scale(protein[, vars.to.use])
# Optionally, store the centers and standard deviations of the
# original data, so you can "unscale" it later.
pcenter <- attr(pmatrix, "scaled:center")
pscale <- attr(pmatrix, "scaled:scale")
```
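To illustrate the “unscale” step mentioned in the comments, here is a toy sketch (made-up data, not from the book): multiplying the scaled columns by the stored standard deviations and adding back the stored means recovers the original values.

```r
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
s <- scale(m)                          # columns: zero mean, unit variance
center <- attr(s, "scaled:center")
scl <- attr(s, "scaled:scale")
# Reverse the scaling: multiply each column by its sd, then add back its mean
restored <- sweep(sweep(s, 2, scl, "*"), 2, center, "+")
all.equal(as.numeric(restored), as.numeric(m))  # TRUE
```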

Before running `clusterboot()` we’ll cluster the data using a hierarchical clustering algorithm (Ward’s method):

```
# Create the distance matrix.
d <- dist(pmatrix, method="euclidean")
# Do the clustering.
pfit <- hclust(d, method="ward")
# Plot the dendrogram.
plot(pfit, labels=protein$Country)
```

The dendrogram suggests five clusters, as shown in the figure below. You can draw the rectangles on the dendrogram using the command `rect.hclust(pfit, k=5)`.

Let’s extract and print the clusters:

```
# A convenience function for printing out the countries in each
# cluster, along with the values for red meat, fish, and
# fruit/vegetable consumption.
print_clusters <- function(labels, k) {
  for(i in 1:k) {
    print(paste("cluster", i))
    print(protein[labels==i, c("Country","RedMeat","Fish","Fr.Veg")])
  }
}
# Get the cluster labels.
groups <- cutree(pfit, k=5)

# -- results --
> print_clusters(groups, 5)
[1] "cluster 1"
      Country RedMeat Fish Fr.Veg
1     Albania    10.1  0.2    1.7
4    Bulgaria     7.8  1.2    4.2
18    Romania     6.2  1.0    2.8
25 Yugoslavia     4.4  0.6    3.2
[1] "cluster 2"
       Country RedMeat Fish Fr.Veg
2      Austria     8.9  2.1    4.3
3      Belgium    13.5  4.5    4.0
9       France    18.0  5.7    6.5
12     Ireland    13.9  2.2    2.9
14 Netherlands     9.5  2.5    3.7
21 Switzerland    13.1  2.3    4.9
22          UK    17.4  4.3    3.3
24   W Germany    11.4  3.4    3.8
[1] "cluster 3"
          Country RedMeat Fish Fr.Veg
5  Czechoslovakia     9.7  2.0    4.0
7       E Germany     8.4  5.4    3.6
11        Hungary     5.3  0.3    4.2
16         Poland     6.9  3.0    6.6
23           USSR     9.3  3.0    2.9
[1] "cluster 4"
   Country RedMeat Fish Fr.Veg
6  Denmark    10.6  9.9    2.4
8  Finland     9.5  5.8    1.4
15  Norway     9.4  9.7    2.7
20  Sweden     9.9  7.5    2.0
[1] "cluster 5"
    Country RedMeat Fish Fr.Veg
10   Greece    10.2  5.9    6.5
13    Italy     9.0  3.4    6.7
17 Portugal     6.2 14.2    7.9
19    Spain     7.1  7.0    7.2
```

There’s a certain logic to these clusters: the countries in each cluster tend to be in the same geographical region. It makes sense that countries in the same region would have similar dietary habits. You can also see that:

- Cluster 2 is made of countries with higher-than-average red meat consumption.
- Cluster 4 contains countries with higher-than-average fish consumption but low produce consumption.
- Cluster 5 contains countries with high fish and produce consumption.

Let’s run `clusterboot()` on the protein data, using hierarchical clustering with five clusters.

```
# Load the fpc package.
library(fpc)
# Set the desired number of clusters.
kbest.p <- 5
# Run clusterboot() with hclust ('clustermethod=hclustCBI'),
# using Ward's method ('method="ward"') and kbest.p clusters
# ('k=kbest.p'). Return the results in an object called cboot.hclust.
cboot.hclust <- clusterboot(pmatrix, clustermethod=hclustCBI,
                            method="ward", k=kbest.p)

# The results of the clustering are in cboot.hclust$result.
# The output of the hclust() function is in cboot.hclust$result$result.
#
# cboot.hclust$result$partition returns a vector of cluster labels.
groups <- cboot.hclust$result$partition

# -- results --
> print_clusters(groups, kbest.p)
[1] "cluster 1"
      Country RedMeat Fish Fr.Veg
1     Albania    10.1  0.2    1.7
4    Bulgaria     7.8  1.2    4.2
18    Romania     6.2  1.0    2.8
25 Yugoslavia     4.4  0.6    3.2
[1] "cluster 2"
       Country RedMeat Fish Fr.Veg
2      Austria     8.9  2.1    4.3
3      Belgium    13.5  4.5    4.0
9       France    18.0  5.7    6.5
12     Ireland    13.9  2.2    2.9
14 Netherlands     9.5  2.5    3.7
21 Switzerland    13.1  2.3    4.9
22          UK    17.4  4.3    3.3
24   W Germany    11.4  3.4    3.8
[1] "cluster 3"
          Country RedMeat Fish Fr.Veg
5  Czechoslovakia     9.7  2.0    4.0
7       E Germany     8.4  5.4    3.6
11        Hungary     5.3  0.3    4.2
16         Poland     6.9  3.0    6.6
23           USSR     9.3  3.0    2.9
[1] "cluster 4"
   Country RedMeat Fish Fr.Veg
6  Denmark    10.6  9.9    2.4
8  Finland     9.5  5.8    1.4
15  Norway     9.4  9.7    2.7
20  Sweden     9.9  7.5    2.0
[1] "cluster 5"
    Country RedMeat Fish Fr.Veg
10   Greece    10.2  5.9    6.5
13    Italy     9.0  3.4    6.7
17 Portugal     6.2 14.2    7.9
19    Spain     7.1  7.0    7.2

# The vector of cluster stabilities.
# Values close to 1 indicate stable clusters.
> cboot.hclust$bootmean
[1] 0.7905000 0.7990913 0.6173056 0.9312857 0.7560000

# The count of how many times each cluster was dissolved.
# By default clusterboot() runs 100 bootstrap iterations.
# Clusters that are dissolved often are unstable.
> cboot.hclust$bootbrd
[1] 25 11 47  8 35
```

The above results show that the cluster of countries with high fish consumption (cluster 4) is highly stable (cluster stability = 0.93). Clusters 1 and 2 are also quite stable; cluster 5 (cluster stability 0.76) less so. Cluster 3 (cluster stability 0.62) has the characteristics of what we’ve been calling the “other” cluster. Note that `clusterboot()` has a random component, so the exact stability values and number of times each cluster is dissolved will vary from run to run.

Based on these results, we can say that the countries in cluster 4 have highly similar eating habits, distinct from those of the other countries (high fish and red meat consumption, with a relatively low amount of fruits and vegetables); we can also say that the countries in clusters 1 and 2 represent distinct eating patterns as well. The countries in cluster 3, on the other hand, show eating patterns that are different from those of the countries in other clusters, but aren’t as strongly similar to each other.

The `clusterboot()` algorithm assumes that you know *k*, the number of clusters. Obviously, determining *k* will be harder for datasets that are larger than our protein example. There are ways of estimating *k*, but they are beyond the scope of this article. Once you have an idea of the number of clusters, however, `clusterboot()` is a useful tool for evaluating the strength of the patterns that you have discovered.

For more about clustering, please refer to our free sample chapter 8 of *Practical Data Science with R*.

To **leave a comment** for the author, please follow the link and comment on his blog: **Win-Vector Blog » R**.

(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

Revolution R Open, the enhanced open source R distribution from Revolution Analytics and Microsoft, is now available for download. This release brings multi-threaded performance to the latest version of the R engine from the R Core Group, which includes several improvements and bug fixes. Significant among these is default support for HTTPS connections, making it easy to follow the best practices for R security recommended by the R Consortium.

Revolution R Open uses a CRAN snapshot taken on August 27, 2015; you can review highlights of new and updated R packages introduced since RRO 3.2.1 at our Package Spotlight page. Don't forget you can always use packages released after this date using the built-in checkpoint package or by using any CRAN mirror. (By the way, the checkpoint package also now downloads packages using HTTPS.)

RRO 3.2.2 is available for download now at the link below for Windows, Mac and Linux systems, and is 100% compatible with R 3.2.2. We welcome comments, suggestions and other discussion on the RRO Forums. If you're new to Revolution R Open, here are some tips to get started, and there are many data sources you can explore with RRO. Thanks go as always to the contributors to the R Project upon which RRO is built.

To **leave a comment** for the author, please follow the link and comment on his blog: **Revolutions**.

(This article was first published on **RStudio Blog**, and kindly contributed to R-bloggers)

dplyr 0.4.3 includes over 30 minor improvements and bug fixes, which are described in detail in the release notes. Here I want to draw your attention to five small, but important, changes:

- `mutate()` no longer randomly crashes! (Sorry it took us so long to fix this – I know it’s been causing a lot of pain.)
- dplyr now has much better support for non-ASCII column names. It’s probably not perfect, but should be a lot better than previous versions.
- When printing a `tbl_df`, you now see the types of all columns, not just those that fit on the screen:

  ```
  data_frame(x = 1:3, y = letters[x], z = factor(y))
  #> Source: local data frame [3 x 3]
  #>
  #>       x     y      z
  #>   (int) (chr) (fctr)
  #> 1     1     a      a
  #> 2     2     b      b
  #> 3     3     c      c
  ```

- `bind_rows()` gains a `.id` argument. When supplied, it creates a new column that gives the name of each data frame:

  ```
  a <- data_frame(x = 1, y = "a")
  b <- data_frame(x = 2, y = "c")
  bind_rows(a = a, b = b)
  #> Source: local data frame [2 x 2]
  #>
  #>       x     y
  #>   (dbl) (chr)
  #> 1     1     a
  #> 2     2     c

  bind_rows(a = a, b = b, .id = "source")
  #> Source: local data frame [2 x 3]
  #>
  #>   source     x     y
  #>    (chr) (dbl) (chr)
  #> 1      a     1     a
  #> 2      b     2     c

  # Or equivalently
  bind_rows(list(a = a, b = b), .id = "source")
  #> Source: local data frame [2 x 3]
  #>
  #>   source     x     y
  #>    (chr) (dbl) (chr)
  #> 1      a     1     a
  #> 2      b     2     c
  ```

- dplyr is now more forgiving of unknown attributes. All functions should now copy column attributes from the input to the output, instead of complaining. Additionally, `arrange()`, `filter()`, `slice()`, and `summarise()` preserve attributes of the data frame itself.

To **leave a comment** for the author, please follow the link and comment on his blog: **RStudio Blog**.

(This article was first published on **Brokering Closure » R**, and kindly contributed to R-bloggers)

In data analysis it is sometimes necessary to use *weights*. Contexts that come to mind include:

- Analysis of data from complex surveys, e.g. stratified samples. Sample inclusion probabilities might have been unequal, and thus observations from different strata should have different weights.
- Application of propensity score weighting, e.g. to correct for data being Missing At Random (MAR).
- Inverse-variance weighting (https://en.wikipedia.org/wiki/Inverse-variance_weighting), when different observations have been measured with different precision that is known a priori.
- Analysis of data in an aggregated form, where the weight variable encodes how many original observations each row in the aggregated data represents.
- Survey data with post-stratification weights.

If you use, or have used, SPSS, you probably know about the possibility of defining one of the variables as weights. This information is used when producing cross-tabulations (cells include sums of weights), regression models, and so on. SPSS weights are *frequency weights* in the sense that $w_i$ is the number of observations that case $i$ represents.
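A frequency weight can always be “played back” by replicating each row $w_i$ times. A toy sketch in base R (the data here is made up for illustration):

```r
# Two cases; 'w' says how many observations each case represents
df <- data.frame(x = c(1, 2), w = c(2, 3))
expanded <- df[rep(seq_len(nrow(df)), df$w), "x", drop = FALSE]
nrow(expanded)    # 5, the sum of the weights
mean(expanded$x)  # 1.6, the weighted mean of x
```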

On the other hand, in R the `lm` and `glm` functions have a `weights` argument that serves a related purpose.

```
suppressMessages(local({
library(dplyr)
library(ggplot2)
library(survey)
library(knitr)
library(tidyr)
library(broom)
}))
```

Let’s compare different ways in which a linear model can be fitted to data with weights. We start by generating some artificial data:

```
set.seed(666)
N <- 30 # number of observations
# Aggregated data
aggregated <- data.frame(x=1:5) %>%
  mutate(y = round(2 * x + 2 + rnorm(length(x))),
         freq = as.numeric(table(sample(1:5, N,
                                        replace=TRUE, prob=c(.3, .4, .5, .4, .3)))))
aggregated
```

```
## x y freq
## 1 1 5 4
## 2 2 8 5
## 3 3 8 8
## 4 4 12 8
## 5 5 10 5
```

```
# Disaggregated data
individuals <- aggregated[ rep(1:5, aggregated$freq) , c("x", "y") ]
```

Visually:

```
ggplot(aggregated, aes(x=x, y=y, size=freq)) + geom_point() + theme_bw()
```

Let’s fit some models:

```
models <- list(
  ind_lm = lm(y ~ x, data=individuals),
  raw_agg = lm(y ~ x, data=aggregated),
  ind_svy_glm = svyglm(y ~ x, design=svydesign(id=~1, data=individuals),
                       family=gaussian()),
  ind_glm = glm(y ~ x, family=gaussian(), data=individuals),
  wei_lm = lm(y ~ x, data=aggregated, weights=freq),
  wei_glm = glm(y ~ x, data=aggregated, family=gaussian(), weights=freq),
  svy_glm = svyglm(y ~ x, design=svydesign(id=~1, weights=~freq, data=aggregated),
                   family=gaussian())
)
```

```
## Warning in svydesign.default(id = ~1, data = individuals): No weights or
## probabilities supplied, assuming equal probability
```

In short, we have the following linear models:

- `ind_lm` is an OLS fit to the individual data (the *true* model).
- `raw_agg` is an OLS fit to the aggregated data (definitely wrong).
- `ind_glm` is an ML fit to the individual data.
- `ind_svy_glm` is an ML fit to the individual data using a simple random sampling with replacement design.
- `wei_lm` is an OLS fit to the aggregated data with frequencies as weights.
- `wei_glm` is an ML fit to the aggregated data with frequencies as weights.
- `svy_glm` is an ML fit to the aggregated data using the “survey” package, with frequencies as weights in the sampling design.

We would expect models `ind_lm`, `ind_glm`, and `ind_svy_glm` to be identical.

Summarise the models and gather the results into long format:

```
results <- do.call("rbind", lapply( names(models), function(n) cbind(model=n, tidy(models[[n]])) )) %>%
gather(stat, value, -model, -term)
```

Check if point estimates of model coefficients are identical:

```
results %>% filter(stat=="estimate") %>%
select(model, term, value) %>%
spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 4.33218 1.474048
## 2 raw_agg 4.40000 1.400000
## 3 ind_svy_glm 4.33218 1.474048
## 4 ind_glm 4.33218 1.474048
## 5 wei_lm 4.33218 1.474048
## 6 wei_glm 4.33218 1.474048
## 7 svy_glm 4.33218 1.474048
```

Apart from the “wrong” `raw_agg` model, the coefficients are identical across models.

Let’s check the inference:

```
# Standard Errors
results %>% filter(stat=="std.error") %>%
select(model, term, value) %>%
spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 0.652395 0.1912751
## 2 raw_agg 1.669331 0.5033223
## 3 ind_svy_glm 0.500719 0.1912161
## 4 ind_glm 0.652395 0.1912751
## 5 wei_lm 1.993100 0.5843552
## 6 wei_glm 1.993100 0.5843552
## 7 svy_glm 1.221133 0.4926638
```

```
# p-values
results %>% filter(stat=="p.value") %>%
mutate(p=format.pval(value)) %>%
select(model, term, p) %>%
spread(term, p)
```

```
## model (Intercept) x
## 1 ind_lm 3.3265e-07 2.1458e-08
## 2 raw_agg 0.077937 0.068904
## 3 ind_svy_glm 2.1244e-09 2.1330e-08
## 4 ind_glm 3.3265e-07 2.1458e-08
## 5 wei_lm 0.118057 0.085986
## 6 wei_glm 0.118057 0.085986
## 7 svy_glm 0.038154 0.058038
```

Recall that the correct model is `ind_lm`. Observations:

- `raw_agg` is clearly wrong, as expected.
- If the `weights` argument to `lm` and `glm` implemented frequency weights, the results for `wei_lm` and `wei_glm` would be identical to those from `ind_lm`. Instead, only the point estimates are correct; all the inference stats are not.
- The model using a design with sampling weights (`svy_glm`) gives correct point estimates, but incorrect inference.
- Surprisingly, the model fit with the “survey” package to the individual data using a simple random sampling design (`ind_svy_glm`) does not give inference stats identical to those from `ind_lm`. They are close, though.

The `weights` argument of `lm` and `glm` implements *precision weights*: inverse-variance weights that can be used to model the differential precision with which the outcome variable was measured.

Functions in the “survey” package implement *sampling weights*: the inverse of the probability that a particular observation is selected from the population into the sample.

Frequency weights are a different animal.
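To make the distinction concrete, here is a minimal sketch of precision weights in `lm`, with synthetic data in which the noise standard deviation grows with `x` (our own toy example, not from the original post):

```r
set.seed(1)
x <- 1:50
sd_i <- x / 10                        # known, differing precision per observation
y <- 2 * x + 1 + rnorm(50, sd = sd_i)
# Inverse-variance weights: more precise observations get more influence
fit <- lm(y ~ x, weights = 1 / sd_i^2)
coef(fit)
```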

However, it is possible to get correct inference statistics for a model fitted to aggregated data using `lm` with frequency weights supplied as `weights`. What needs correcting is the degrees of freedom (see also http://stackoverflow.com/questions/10268689/weighted-regression-in-r).

```
models$wei_lm_fixed <- models$wei_lm
models$wei_lm_fixed$df.residual <- with(models$wei_lm_fixed, sum(weights) - length(coefficients))
results <- do.call("rbind", lapply( names(models), function(n) cbind(model=n, tidy(models[[n]])) )) %>%
gather(stat, value, -model, -term)
```

```
## Warning in summary.lm(x): residual degrees of freedom in object suggest
## this is not an "lm" fit
```

```
# Coefficients
results %>% filter(stat=="estimate") %>%
select(model, term, value) %>%
spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 4.33218 1.474048
## 2 raw_agg 4.40000 1.400000
## 3 ind_svy_glm 4.33218 1.474048
## 4 ind_glm 4.33218 1.474048
## 5 wei_lm 4.33218 1.474048
## 6 wei_glm 4.33218 1.474048
## 7 svy_glm 4.33218 1.474048
## 8 wei_lm_fixed 4.33218 1.474048
```

```
# Standard Errors
results %>% filter(stat=="std.error") %>%
select(model, term, value) %>%
spread(term, value)
```

```
## model (Intercept) x
## 1 ind_lm 0.652395 0.1912751
## 2 raw_agg 1.669331 0.5033223
## 3 ind_svy_glm 0.500719 0.1912161
## 4 ind_glm 0.652395 0.1912751
## 5 wei_lm 1.993100 0.5843552
## 6 wei_glm 1.993100 0.5843552
## 7 svy_glm 1.221133 0.4926638
## 8 wei_lm_fixed 0.652395 0.1912751
```

See model `wei_lm_fixed`. Thus, correcting the degrees of freedom manually gives correct coefficient estimates as well as correct inference statistics.

Aggregating data and using frequency weights can save you quite some time. To illustrate, let’s generate a large dataset in both disaggregated and aggregated form.

```
N <- 10^4 # number of observations
# Aggregated data
big_aggregated <- data.frame(x=1:5) %>%
  mutate(y = round(2 * x + 2 + rnorm(length(x))),
         freq = as.numeric(table(sample(1:5, N,
                                        replace=TRUE, prob=c(.3, .4, .5, .4, .3)))))
# Disaggregated data
big_individuals <- big_aggregated[ rep(1:5, big_aggregated$freq), c("x", "y") ]
```

… and fit `lm` models, weighting the model on the aggregated data. Benchmarking:

```
library(microbenchmark)
speed <- microbenchmark(
big_individual = lm(y ~ x, data=big_individuals),
big_aggregated = lm(y ~ x, data=big_aggregated, weights=freq)
)
speed %>% group_by(expr) %>% summarise(median=median(time / 1000)) %>%
mutate( ratio = median / median[1])
```

```
## Source: local data frame [2 x 3]
##
## expr median ratio
## 1 big_individual 7561.158 1.0000000
## 2 big_aggregated 1492.057 0.1973319
```

So quite an improvement.

The improvement is likely to be bigger the more we are able to aggregate the data.

To **leave a comment** for the author, please follow the link and comment on his blog: **Brokering Closure » R**.

(This article was first published on **Jan Gorecki - R**, and kindly contributed to R-bloggers)

Have you ever thought about accepting payments in your Shiny app?

Probably not, but now you can start ;)

Shiny apps are usually single-task, fairly lightweight websites, so it may not be easy to turn one into an online shop or service provider.

Still, you may find this post interesting, as it presents a paperwork-free implementation for accepting payments.

Some notable features of the presented setup:

- self-hosted
- license-free
- not relying on ANY third party services (banks, brokers, payment processors, exchange markets)
- no new accounts, registrations, verifications
- no new accounts, registrations, verifications also for your customer
- low fee

Suppose you are publishing an article, an expert analysis of a particular topic. Part of your article is publicly available and acts as an introduction and general overview of the subject. The other part is a detailed analysis, and that is the part you want to sell to a viewer for a single dollar per view (1 USD). So what you need is to conditionally display the second part of the article based on whether payment has been received.

Obviously such payments can be employed to sell goods or services, but then it is usually better to add user-account functionality to track the delivery process.

The Shiny app connects to a bitcoin node directly using the httr and jsonlite packages. In the process, two JSON-RPC methods are employed:

- `getnewaddress` – generates a new payment address for the current session
- `getreceivedbyaddress` – refreshes the balance to check whether the payment has been received

The workhorse function to handle JSON-RPC looks like:

```
bitcoind.rpc <- function(host = getOption("rpchost", "127.0.0.1"),
                         user = getOption("rpcuser"),
                         password = getOption("rpcpassword"),
                         port = getOption("rpcport", "8332"),
                         id = NA_integer_, method, params = list()) {
  stopifnot(is.character(user), is.character(password), is.character(method))
  rpcurl <- paste0("http://", user, ":", password, "@", host, ":", port)
  req <- httr::POST(rpcurl, body = jsonlite::toJSON(
    list(jsonrpc = "1.0", id = id, method = method, params = params),
    auto_unbox = TRUE))
  if (httr::http_status(req)$category != "success") {
    message(jsonlite::fromJSON(httr::content(req, "text"))$error$message)
    httr::stop_for_status(req)
  }
  jsonlite::fromJSON(httr::content(req, "text"))
}
```

The Shiny app will additionally:

- calculate the current fiat-currency value based on rates from the kraken exchange ticker, as `(ask+bid)/2`
- produce a clickable payment link
- if the hrbrmstr/qrencoder package is installed, also encode the payment link into a QR code, so payment can easily be made from a smartphone wallet
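The fiat conversion in the first bullet is just the mid-price of the ticker. A minimal sketch (the values below are illustrative, not live Kraken data):

```r
# Mid price from the ticker's best ask and best bid
mid_price <- function(ask, bid) (ask + bid) / 2

usd_per_btc <- mid_price(ask = 230.5, bid = 229.9)  # 230.2
btc_for_1_usd <- 1 / usd_per_btc                    # BTC equivalent of 1 USD
```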

To set this up, you need to:

- install a bitcoin node: Download Bitcoin Core, see also Running A Full Node
- allow RPC access to your bitcoin daemon
- install R and the shiny package; for QR codes, also install the hrbrmstr/qrencoder package

To run the presented shinyPay app, you can fork jangorecki/shinyPay and replace the bitcoin node details below in the `app.R` file:

```
options(rpchost = "192.168.56.103",
rpcuser = "bitcoinduser",
rpcpassword = "userpassbitcoind")
```

If you set up your bitcoin daemon with the same IP/credentials, you can run the app simply with:

```
shiny::runGitHub("jangorecki/shinyPay")
```

The post does not contain reproducible code, as it requires a bitcoin node. Below is a screenshot from the shinyPay app and a short description.

To request payment, I simply click the button to generate an address for the payment.

I choose the amount of 0.0035 BTC (~1 USD), and the QR code is refreshed.

I scanned the QR code on an Android (2.3.5) phone using this wallet app.

My Android wallet made the payment over an HSDPA network on a different ISP than the one my bitcoin node uses (30 Mbps).

The payment arrived after 2-3 seconds. It got its first confirmation after 2 minutes (on average, the first confirmation takes around 10 minutes).

The total transaction fee was 0.0001 BTC (~0.028 USD). In this particular case that was 2.8% of the payment, but the transaction fee is constant no matter how big the transaction is. For a 100-dollar payment the same fee would be unnoticeable (0.028%).

The mentioned transaction can be found in the blockchain.

If you are not comfortable with hosting your own bitcoin node, you can use an external bitcoin node provider.

The most popular would be blockchain.info; a less known alternative is blockcypher.com. There may be others I'm not aware of.

For non-merchant users who are already familiar with blockchain.info, I will add an important note: unlike using blockchain.info over the web, where you do not share your private key with the server, using blockchain.info over RPC requires sharing your private key with the server in order to generate new addresses.

“Unlike when using the javascript wallet, transaction signing is conducted server side, which means your private keys are shared with the server. However many operations such as fetching a balance and viewing transactions are possible without needing to decrypt the private keys. This is possible through the use of double encryption: only when the second password is provided does the server have the ability to access the funds in a wallet. For merchants who need to only receive transactions it may be possible to never provide a second password.”

source: blockchain.info/api/json_rpc_api

If you would like to automatically convert collected funds into fiat currencies, you can use the `sendtoaddress` RPC method to transfer your funds to your deposit address on a bitcoin exchange market, and there automatically buy fiat currency using the market API (see Rbitcoin). This obviously requires creating an account on the market and usually going through a formal KYC/AML verification process.

There are numerous ways to increase the security of your wallet, listing just some of them:

- validate sold content/service/product value against received funds
- increase confirmations required for payment to be accepted
- increase confirmations for low mining-priority transactions
- run multiple nodes to verify against double-spend
- automatically send confirmed payments to offline wallet
- restrict rpc methods
- restrict access to particular ip
- use json-rpc over ssl
- isolation of bitcoin node machine
- ssh tunnel

You should not try to implement every possible measure, but rather weigh the risks and scale your security to the volume/value of received transactions.

Anyway, I'm not an expert in security; for more info on that you can look at Securing your wallet, the bitcoin wiki, or bitcoin stackexchange.

If you are interested in gaining comprehensive knowledge about the money of the internet, you should look at the free online full-semester course (10 ECTS) at the University of Nicosia: Introduction to Digital Currencies.

]]>
(This article was first published on ** Jan Gorecki - R**, and kindly contributed to R-bloggers)

Have you ever think about accepting payments in your shiny app?

Probably not, but now you can start

Shiny apps are usually single task, not very heavy websites. It may be not so easy to turn them into online shop/service provider.

Anyway you can find this post interesting as it presents a paperwork-less implementation to accept payments.

Some notable features of presented setup:

- self-hosted
- license-free
- not relying on ANY third party services (banks, brokers, payment processors, exchange markets)
- no new accounts, registrations, verifications
- no new accounts, registrations, verifications also for your customer
- low fee

You are publishing article, an expertise on particular topic. The part of your article is publicly available and acts as introduction and general overview on the subject. The other part of article is detailed analysis, that part you want to sell to a viewer for a single dollar per view (1 USD). So what you need is to conditionally display the second part of article based on the received payment condition.

Obviously such payments can be employed to sell goods or services, but then it usually better to add user account functionality to track delivery process.

Shiny app connects to bitcoin node directly using httr and jsonlite packages. In the process two JSON-RPC methods are employed:

`getnewaddress`

– generate new payment address for the current session`getreceivedbyaddress`

– refresh the balance to check if payment received

The workhorse function to handle JSON-RPC looks like:

```
bitcoind.rpc <- function(host = getOption("rpchost", "127.0.0.1"),
                         user = getOption("rpcuser"),
                         password = getOption("rpcpassword"),
                         port = getOption("rpcport", "8332"),
                         id = NA_integer_, method, params = list()) {
  stopifnot(is.character(user), is.character(password), is.character(method))
  rpcurl <- paste0("http://", user, ":", password, "@", host, ":", port)
  req <- httr::POST(rpcurl, body = jsonlite::toJSON(
    list(jsonrpc = "1.0", id = id, method = method, params = params),
    auto_unbox = TRUE))
  if (httr::http_status(req)$category != "success") {
    message(jsonlite::fromJSON(httr::content(req, "text"))$error$message)
    httr::stop_for_status(req)
  }
  jsonlite::fromJSON(httr::content(req, "text"))
}
```
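With this helper in place, the two RPC calls listed above reduce to something like the following sketch (the `params` layout follows the standard Bitcoin Core JSON-RPC API; using a `minconf` of 0 to count unconfirmed payments is my assumption):

```r
# generate a fresh payment address for the current session
addr <- bitcoind.rpc(method = "getnewaddress")$result

# check how much has been received on that address;
# minconf = 0 also counts still-unconfirmed transactions
received <- bitcoind.rpc(method = "getreceivedbyaddress",
                         params = list(addr, 0L))$result
```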

The shiny app will additionally:

- calculate the current fiat-currency value based on rates from the kraken exchange ticker as `(ask+bid)/2`
- produce a clickable payment link
- if the hrbrmstr/qrencoder package is installed, also encode the payment link into a QR code so the payment can easily be made from a smartphone wallet
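The `(ask+bid)/2` conversion and the payment link can be sketched as follows (the kraken ticker endpoint and the BIP 21 `bitcoin:` URI scheme are real, but the exact JSON parsing and the address are illustrative assumptions, not the app's code):

```r
library(httr)
library(jsonlite)

# current XBT/USD ask and bid from kraken's public ticker
tick <- fromJSON(content(GET(
  "https://api.kraken.com/0/public/Ticker?pair=XBTUSD"), "text"))
pair_quote <- tick$result[[1]]
rate <- (as.numeric(pair_quote$a[1]) +
         as.numeric(pair_quote$b[1])) / 2   # (ask+bid)/2

# BTC amount corresponding to a 1 USD price
btc_amount <- round(1 / rate, 8)

# BIP 21 clickable payment link (placeholder address)
addr <- "1ExamplePaymentAddressXXXXXXXXXXXX"
uri  <- sprintf("bitcoin:%s?amount=%.8f", addr, btc_amount)
```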

To reproduce the setup you need to:

- install a bitcoin node; download Bitcoin Core, see also Running A Full Node
- allow RPC access to your bitcoin daemon
- install R and the shiny package; for QR codes also install the hrbrmstr/qrencoder package

To run the presented shinyPay app you can fork jangorecki/shinyPay and replace the bitcoin node details below in the `app.R` file:

```
options(rpchost = "192.168.56.103",
        rpcuser = "bitcoinduser",
        rpcpassword = "userpassbitcoind")
```

If you set up your bitcoin daemon with the same IP and credentials, you can run the app simply with:

```
shiny::runGitHub("jangorecki/shinyPay")
```

The post does not contain reproducible code as it requires a bitcoin node. Below is a screenshot from the shinyPay app and a short description.

To request a payment I simply click the button to generate an address for the payment.

I choose the amount of 0.0035 BTC (~1 USD), and the QR code gets refreshed.

I scanned the QR code with an Android (2.3.5) phone using this wallet app.

My android wallet made the payment over an HSDPA network, on a different ISP than the one my bitcoin node uses (30 Mbps).

The payment arrived after 2-3 seconds. It got its first confirmation after 2 minutes (on average the first *conf* takes around 10 minutes).

The total transaction fee was 0.0001 BTC (~0.028 USD). In this particular case that was 2.8% of the payment, but the transaction fee is constant no matter how big the transaction is. For a 100-dollar payment the same ~3-cent fee would be unnoticeable (0.028%).

The mentioned transaction can be found in the blockchain.

If you are not comfortable with hosting your own bitcoin node you can use an external bitcoin node provider.

The most popular would be blockchain.info; a less known alternative is blockcypher.com. There may be some others I’m not aware of.

For non-merchant users who are already familiar with blockchain.info I will add an important note: unlike using blockchain.info over the web, where you do not share your private key with the server, when using blockchain.info over RPC you have to share your private key with the server in order to generate new addresses.

Unlike when using the javascript wallet, transaction signing is conducted server side, which means your private keys are shared with the server. However many operations such as fetching a balance and viewing transactions are possible without needing to decrypt the private keys. This is possible through the use of double encryption: only when the second password is provided does the server have the ability to access the funds in a wallet. For merchants who need to only receive transactions it may be possible to never provide a second password.

source: blockchain.info/api/json_rpc_api

If you would like to automatically convert the collected funds into fiat currencies you can use the `sendtoaddress` RPC method to transfer your funds to your deposit address on a bitcoin exchange market, and there automatically buy fiat currency using the market API (see Rbitcoin). This obviously requires creating an account on the market and usually going through a formal KYC/AML verification process.
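A sketch of that sweep, reusing the `bitcoind.rpc()` helper from earlier in the post (the deposit address and the amount are placeholders):

```r
# transfer collected funds to a deposit address on an exchange market
deposit_addr <- "1MyExchangeDepositAddressXXXXXXXX"
bitcoind.rpc(method = "sendtoaddress",
             params = list(deposit_addr, 0.05))  # amount in BTC
```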

There are numerous ways to increase the security of your wallet, listing just some of them:

- validate sold content/service/product value against received funds
- increase confirmations required for payment to be accepted
- increase confirmations for low mining-priority transactions
- run multiple nodes to verify against double-spend
- automatically send confirmed payments to offline wallet
- restrict rpc methods
- restrict access to particular ip
- use json-rpc over ssl
- isolation of bitcoin node machine
- ssh tunnel

You should not try to implement all possible measures but rather weigh the risk and scale your security to the volume and value of the received transactions.

Anyway I’m not an expert in security, for more info on that you can look at Securing your wallet, bitcoin wiki or bitcoin stackexchange.

If you are interested in getting comprehensive knowledge about the *money of the internet* you should look at the free online full semester course (10 ECTS) at University of Nicosia: Introduction to Digital Currencies.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Jan Gorecki - R**.

R-bloggers.com offers

(This article was first published on ** Thinking inside the box **, and kindly contributed to R-bloggers)

Once again time for the monthly upstream Armadillo update — version 5.500.2 was released earlier today by Conrad. And a new and matching RcppArmadillo release 0.5.500.2.0 is already on CRAN and will go to Debian shortly.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to that of Matlab.

This release contains mostly bug fixes and some internal code refactoring:

## Changes in RcppArmadillo version 0.5.500.2.0 (2015-09-03)

- Upgraded to Armadillo 5.500.2 ("Molotov Cocktail")
  - expanded object constructors and generators to handle `size()` based specification of dimensions
  - faster handling of submatrix rows
  - faster `clamp()`
  - fixes for handling sparse matrices

Courtesy of CRANberries, there is also a diffstat report for the most recent CRAN release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Thinking inside the box **.


(This article was first published on ** Xi'an's Og » R**, and kindly contributed to R-bloggers)

**W**hile my arXiv newspage today had a puzzling entry about modelling UFOs sightings in France, it also broadcast our revision of Reliable ABC model choice via random forests, version that we resubmitted today to Bioinformatics after a quite thorough upgrade, the most dramatic one being the realisation we could also approximate the posterior probability of the selected model via another random forest. (With no connection with the recent post on forest fires!) As discussed a little while ago on the ‘Og. And also in conjunction with our creating the abcrf R package for running ABC model choice out of a reference table. While it has been an excruciatingly slow process (the initial version of the arXived document dates from June 2014, the PNAS submission was rejected for not being enough Bayesian, and the latest revision took the whole summer), the slow maturation of our thoughts on the model choice issues led us to modify the role of random forests in the ABC approach to model choice, in that we reverted our earlier assessment that they could only be trusted for selecting the most likely model, by realising this summer the corresponding posterior could be expressed as a posterior loss and estimated by a secondary forest. As first considered in Stoehr et al. (2014). (In retrospect, this brings an answer to one of the earlier referee’s comments.) Next goal is to incorporate those changes in DIYABC (and wait for the next version of the software to appear). Another best-selling innovation due to Arnaud: we added a practical implementation section in the format of FAQ for issues related with the calibration of the algorithms.

Filed under: Books, pictures, R, Statistics, University life Tagged: ABC model choice, abcrf, Bayesian model choice, DIYABC, France, model posterior probabilities, PNAS, R, random forests, UFOs

To **leave a comment** for the author, please follow the link and comment on his blog: ** Xi'an's Og » R**.


(This article was first published on ** Freakonometrics » R-english**, and kindly contributed to R-bloggers)

In mid-November, a nice workshop on big data and the environment will be organized in Argentina.

We will talk a lot about climate models, and I wanted to play a little bit with those data, stored on http://dods.ipsl.jussieu.fr/mc2ipsl/.

Since Ewen (aka @3wen) has been working on those datasets recently, he kindly told me how to read them (in ncdf format). He showed me some interesting functions and shared some lines of code, which I will mention in this brief post, because I was impressed by how memory issues were not a big deal here… But first of all, we need half a dozen packages

```
> library(ncdf)
> library(lubridate)
> library(dplyr)
> library(stringr)
> library(tidyr)
> library(ggplot2)
```

Then I will download one of those files, covering January 2000 through December 2012.

```
> download.file("http://dods.ipsl.jussieu.fr/mc2ipsl/CMIP5/output1/IPSL/IPSL-CM5A-MR/historicalNat/day/atmos/day/r1i1p1/latest/tas/tas_day_IPSL-CM5A-MR_historicalNat_r1i1p1_20000101-20121231.nc",
+   destfile = "temp.nc")
```

Those files are not huge, but they are big.

```
> file.size("temp.nc")
[1] 390965452
```

That’s almost a 400 MB file. And that is one single file (there are hundreds on those servers). Now, if we ‘read’ that file in R, we get a rather small R object,

```
> nc <- open.ncdf("temp.nc")
> object.size(nc)
155344 bytes
```

That’s only about 150 KB… a factor of 2516 smaller than the file online. Somehow we only extract partial information here; we will scan the file completely later on. We can already create some variables that will be used later, such as the dates, the latitudes, the longitudes, etc.

```
> lon <- get.var.ncdf(nc, "lon")
> (nlon <- dim(lon))
[1] 144
> lat <- get.var.ncdf(nc, "lat")
> nlat <- dim(lat)
> time <- get.var.ncdf(nc, "time")
> tunits <- att.get.ncdf(nc, "time", "units")
> orig <- as.Date(ymd_hms(str_replace(tunits$value, "days since ", "")))
> dates <- orig + time
> ntime <- length(dates)
```

Here we have the range of latitudes and longitudes in the dataset, as well as the dates. Now it is time to read the file, based on the starting date, and asking only for the temperature

```
> commencement <- ind_conserver[1]
> dname <- "tas"
```

we can read the file

```
> tab_val <- get.var.ncdf(nc, dname, start = c(1, 1, commencement))
> dim(tab_val)
[1]  144  143 4745
> object.size(tab_val)
781672528 bytes
```

The object is big now, a bit less than 800 MB.

```
> fill_value <- att.get.ncdf(nc, dname, "_FillValue")
> tab_val[tab_val == fill_value$value] <- NA
```

Let us now summarize all the information in nice data frames that will be used to generate graphs

```
> tab_val_vect <- as.vector(tab_val)
> tab_val_mat <- matrix(tab_val_vect, nrow = nlon * nlat, ncol = ntime)
> lon_lat <- expand.grid(lon, lat)
> lon_lat <- lon_lat %>% mutate(Var1 = as.vector(Var1), Var2 = as.vector(Var2))
```

Note that the object is still big (the same size as before; here we just reshaped the dataset)

```
> res <- data.frame(cbind(lon_lat, tab_val_mat))
> object.size(res)
782496048 bytes
```

Here is the dimension of our dataset

```
> dim(res)
[1] 20592  4747
```

We can give understandable names to variables

```
> noms <- str_c(year(dates_conserver),
+   month.abb[month(dates_conserver)] %>% tolower,
+   sprintf("%02s", day(dates_conserver)), sep = "_")
> names(res) <- c("lon", "lat", noms)
```

Now, let us focus on the plotting part. If we focus on Europe, we need to filter the dataset

```
> res <- res %>%
+   mutate(lon = ifelse(lon >= 180, yes = lon - 360, no = lon)) %>%
+   filter(lon < 40, lon > -20, lat < 75, lat > 30) %>%
+   gather(key, value, -lon, -lat) %>%
+   separate(key, into = c("year", "month", "day"), sep = "_")
```

Let us filter, keeping only the years from 1986 to 2005,

```
> temp_moy_hist <- res %>%
+   filter(year >= 1986, year <= 2005) %>%
+   group_by(lon, lat, month) %>%
+   summarise(temp = mean(value, na.rm = TRUE)) %>%
+   ungroup
```

Now we can plot it: the temperature in Europe, per month.

```
> ggplot(data = temp_moy_hist, aes(x = lon, y = lat, fill = temp)) +
+   geom_raster() + coord_quickmap() + facet_wrap(~month)
```

Actually, it is possible to add contour lines of European countries,

```
> download.file("http://freakonometrics.free.fr/carte_europe_df.rda",
+   destfile = "carte_europe_df.rda")
> load("carte_europe_df.rda")
```

If we aggregate over all the months and produce only one graph, we have

```
> ggplot() +
+   geom_raster(data = temp_moy_hist, aes(x = lon, y = lat, fill = temp)) +
+   geom_polygon(data = carte_pays_europe,
+     aes(x = long, y = lat, group = group), fill = NA, col = "black") +
+   coord_quickmap(xlim = c(-20, 40), ylim = c(30, 75))
```

I will now spend some time analyzing the maps, but I have been really impressed by these functions, which can partially read a ‘big’ dataset to extract some information, after which we can read the file fully to create a dataset that can be used to produce a graph.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Freakonometrics » R-english**.
