(This article was first published on ** Creative Data Solutions » r-bloggers**, and kindly contributed to R-bloggers)

At a recent Saint Louis R users meeting I had the pleasure of giving a basic introduction to the awesome dplyr R package. For me, data analysis ubiquitously involves splitting the data on a grouping variable and then applying some function to the subsets, i.e., the split-apply pattern (in base R, typically a split followed by lapply). Having recently incorporated dplyr into my data wrangling workflows, I've found this package's syntax and performance a joy to work with. My feelings about dplyr are as follows.

Data wrangling without dplyr.

Data wrangling with dplyr.

This tutorial features an introduction to common dplyr verbs and an overview of implementing split-apply in dplyr.

Some of my conclusions: not only does dplyr make data wrangling code clearer and far faster to write, the package's calculation speed is also very high (based on a non-sophisticated comparison to base R).

The plot above shows the calculation time for 10 replications in seconds (y-axis) for calculating the median of a varying number of groups (x-axis), rows (y-facet) and columns (x-facet), with (green line) and without (red line) dplyr.
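As a minimal sketch of the benchmarked split-apply computation (the data frame and column names here are invented for illustration), the same group-wise median can be written in base R and in dplyr:

```
library(dplyr)

d <- data.frame(group = rep(letters[1:3], each = 100),
                value = rnorm(300))

## base R: split-apply via aggregate()
res_base <- aggregate(value ~ group, data = d, FUN = median)

## dplyr: the same split-apply expressed with group_by() + summarise()
res_dplyr <- d %>%
  group_by(group) %>%
  summarise(value = median(value))
```

The dplyr version reads top-to-bottom as a pipeline, which is much of what makes the syntax pleasant at larger scale.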

To **leave a comment** for the author, please follow the link and comment on his blog: ** Creative Data Solutions » r-bloggers**.

(This article was first published on ** R(adiant) news**, and kindly contributed to R-bloggers)

Radiant is a platform-independent browser-based interface for business analytics in R, based on the Shiny package.

- Explore: Quickly and easily summarize, visualize, and analyze your data
- Cross-platform: It runs in a browser on Windows, Mac, and Linux
- Reproducible: Recreate results at any time and share work with others as a state file or an Rmarkdown report
- Programming: Integrate Radiant’s analysis functions into your own R-code
- Context: Data and examples focus on business applications

Radiant is interactive. Results update immediately when inputs are changed (i.e., no separate dialog boxes). This greatly facilitates exploration and understanding of the data.

Radiant works on Windows, Mac, or Linux. It can run without an Internet connection and no data will leave your computer. You can also run the app as a web application on a server.

Simply saving output is not enough. You need the ability to recreate results for the same data and/or when new data become available. Moreover, others may want to review your analysis and results. Save and load the state of the application to continue your work at a later time or on another computer. Share state files with others and create reproducible reports using Rmarkdown.

If you are using Radiant on a server you can even share the URL (include the SSUID) with others so they can see what you are working on. Thanks for this feature go to Joe Cheng.

Although Radiant’s web-interface can handle quite a few data and analysis tasks, at times you may prefer to write your own code. Radiant provides a bridge to programming in R(studio) by exporting the functions used for analysis. For more information about programming with Radiant see the programming page on the documentation site.

Radiant focuses on business data and decisions. It offers context-relevant tools, examples, and documentation to reduce the business analytics learning curve.

- Required: R version 3.1.2 or later
- Required: A modern browser (e.g., Chrome or Safari). Internet Explorer (version 11 or higher) should work as well
- Recommended: RStudio

Radiant is available on CRAN. To install the latest version with complete documentation for offline access open R(studio) and copy-and-paste the command below:

```
install.packages("radiant", repos = "http://vnijs.github.io/radiant_miniCRAN/")
```

Once all packages are installed use the commands below to launch the app:

```
library(radiant); radiant("marketing")
```

See also the `Installing Radiant` video:

Documentation and tutorials are available at http://vnijs.github.io/radiant/ and in the Radiant web interface (the `?` icons and the `Help` menu).

Want some help getting started? Watch the tutorials on the documentation site.

Not ready to install Radiant on your computer? Try it online at the links below:

Please send questions and comments to: radiant@rady.ucsd.edu.



(This article was first published on ** Statistical Reflections of a Medical Doctor » R**, and kindly contributed to R-bloggers)

The ability of PGAMs to estimate the log-baseline hazard rate endows them with the capability to be used as smooth alternatives to the Kaplan-Meier curve. If we assume, for the sake of simplicity, that there are no proportional covariates in the PGAM regression, then the quantity modeled corresponds to the log-hazard of the survival function. Note that the only assumptions made by the PGAM are that a) the log-hazard is a smooth function, with b) a given *maximum* complexity (number of degrees of freedom) and c) continuous second derivatives. A PGAM provides estimates of the log-hazard constant, λ₀, and the time-varying deviation, f(t). These can be used to *predict* the value of the survival function, S(t) = exp(−∫₀ᵗ h(u) du), by approximating the integral appearing in this definition by numerical quadrature.
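As a toy illustration of the quadrature step (my addition, not from the original post): for a constant hazard h(u) = 0.5 the survival function is exactly S(t) = exp(−0.5t), so a quadrature-based estimate can be checked against the closed form. The nodes and weights below are those of the 5-point Gauss-Lobatto rule on [−1, 1], hard-coded here:

```
## 5-point Gauss-Lobatto nodes/weights on [-1, 1]
x <- c(-1, -0.6546537, 0, 0.6546537, 1)
w <- c(0.1, 0.5444444, 0.7111111, 0.5444444, 0.1)

t <- 2
## map nodes and weights from [-1, 1] to [0, t]
u  <- 0.5 * (x * t + t)
wu <- 0.5 * t * w

## cumulative hazard by quadrature, then survival
S_quad  <- exp(-sum(wu * 0.5))   # h(u) = 0.5 at every node
S_exact <- exp(-0.5 * t)
## S_quad and S_exact agree up to the rounding of the tabulated weights
```

The same mapping of nodes and weights from [−1, 1] to [0, t] is what the prediction function below performs internally for each requested time point.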

From the above definition it is obvious that the value of the survival distribution at any given time point is a non-linear function of the PGAM estimate. Consequently, the predicted survival value, S(t), cannot be derived in closed form; as with all non-linear PGAM estimates, a simple Monte Carlo simulation algorithm may be used to derive both the expected value of S(t) and its uncertainty. For the case of the survival function, the simulation steps are provided in the Appendix (Section A3) of our paper. The following R function can be used to predict the survival function and an associated confidence interval at a grid of points. It accepts as arguments a) the vector of time points, b) a PGAM object for the fitted log-hazard function, c) a list with the nodes and weights of a Gauss-Lobatto rule for the integration of the predicted survival, d) the number of Monte Carlo samples to obtain and optionally e) the seed of the random number generation. Of note, the order of the quadrature used to predict the survival function is not necessarily the same as the order used to fit the log-hazard function.

```
## Calculate survival and confidence interval over a grid of points
## using a GAM
SurvGAM <- Vectorize(function(t, gm, gl.rule, CI = 0.95, Nsim = 1000, seed = 0)
## t       : time at which to calculate relative risk
## gm      : gam model for the fit
## gl.rule : GL rule (list of weights and nodes)
## CI      : CI to apply
## Nsim    : Number of replicates to draw
## seed    : RNG seed
{
  q <- (1 - CI) / 2.0
  ## create the nonlinear contrast
  pdfnc <- data.frame(stop = t, gam.dur = 1)
  L <- length(gl.rule$x)
  start <- 0  ## only for right cens data
  ## map the weights from [-1,1] to [start,t]
  gl.rule$w <- gl.rule$w * 0.5 * (t - start)
  ## expand the dataset
  df <- Survdataset(gl.rule, pdfnc, fu = 1)
  ## linear predictor at each node
  Xp <- predict(gm, newdata = df, type = "lpmatrix")
  ## draw samples
  set.seed(seed)
  br <- rmvn(Nsim, coef(gm), gm$Vp)
  res1 <- rep(0, Nsim)
  for (i in 1:Nsim) {
    ## hazard function at the nodes
    hz <- exp(Xp %*% br[i, ])
    ## cumulative hazard
    chz1 <- gl.rule$w %*% hz[1:L, ]
    ## survival
    res1[i] <- exp(-chz1)
  }
  ret <- data.frame(t = t, S = mean(res1),
                    LCI = quantile(res1, prob = q),
                    UCI = quantile(res1, prob = 1 - q))
  ret
}, vectorize.args = c("t"))
```

The function makes use of another function, *Survdataset*, that internally expands the vector of time points into a survival dataset. This dataset is used to obtain predictions of the log-hazard function by calling the *predict* function from the *mgcv* package.

```
## Function that expands a prediction dataset
## so that a GL rule may be applied
## Used in num integration when generating measures of effect
Survdataset <- function(GL, data, fu)
## GL  : Gauss Lobatto rule
## data: survival data
## fu  : column number containing fu info
{
  ## append artificial ID in the set
  data$id <- 1:nrow(data)
  Gllx <- data.frame(stop = rep(GL$x, length(data$id)),
                     t    = rep(data[, fu], each = length(GL$x)),
                     id   = rep(data$id, each = length(GL$x)),
                     start = 0)
  ## Change the final indicator to what was observed, map node
  ## positions and weights from [-1,1] back to the study time
  Gllx <- transform(Gllx,
                    stop = 0.5 * (stop * (t - start) + (t + start)))
  ## now merge the remaining covariate info
  Gllx <- merge(Gllx, data[, -c(fu)])
  nm <- match(c("t", "start", "id"), colnames(Gllx))
  Gllx[, -nm]
}
```

The ability to draw samples from the multivariate normal distribution corresponding to the model estimates and its covariance matrix is provided by another function, *rmvn*:

```
## function that draws multivariate normal random variates with
## a given mean vector and covariance matrix
## n   : number of samples to draw
## mu  : mean vector
## sig : covariance matrix
rmvn <- function(n, mu, sig) { ## MVN random deviates
  L <- mroot(sig)
  m <- ncol(L)
  t(mu + L %*% matrix(rnorm(m * n), m, n))
}
```

To illustrate the use of these functions we revisit the PBC example from the 2nd part of this blog series. Firstly, let’s obtain a few Gauss-Lobatto lists of weights/nodes for the integration of the survival function:

```
## Obtain a few Gauss Lobatto rules to integrate the survival
## distribution
GL5  <- GaussLobatto(5)
GL10 <- GaussLobatto(10)
GL20 <- GaussLobatto(20)
```

Subsequently, we fit the log-hazard rate to the coarsely (5 nodes) and more finely discretized (using a 10 point Gauss Lobatto rule) versions of the PBC dataset, created in Part 2. The third command obtains the Kaplan Meier estimate in the PBC dataset.

```
fSurv1 <- gam(gam.ev ~ s(stop, bs = "cr") + offset(log(gam.dur)),
              data = pbcGAM, family = "poisson", scale = 1, method = "REML")
fSurv2 <- gam(gam.ev ~ s(stop, bs = "cr") + offset(log(gam.dur)),
              data = pbcGAM2, family = "poisson", scale = 1, method = "REML")
KMSurv <- survfit(Surv(time, status) ~ 1, data = pbc)
```

We obtained survival probability estimates for the 6 combinations of time discretization for fitting (either a 5th or 10th order Lobatto rule) and prediction (a 5th, 10th or 20th order rule):

```
t <- seq(0, 4500, 100)
s1 <- SurvGAM(t, fSurv1, GL5)
s2 <- SurvGAM(t, fSurv1, GL10)
s3 <- SurvGAM(t, fSurv1, GL20)
s4 <- SurvGAM(t, fSurv2, GL5)
s5 <- SurvGAM(t, fSurv2, GL10)
s6 <- SurvGAM(t, fSurv2, GL20)
```

In all cases 1000 Monte Carlo samples were obtained for the calculation of survival probability estimates and their pointwise 95% confidence intervals. We can plot these against the Kaplan Meier curve estimates:

```
par(mfrow = c(2, 3))
plot(KMSurv, xlab = "Time (days)", ylab = "Surv Prob", ylim = c(0.25, 1), main = "Fit(GL5)/Predict(GL5)")
lines(s1[1, ], s1[2, ], col = "blue", lwd = 2)
lines(s1[1, ], s1[3, ], col = "blue", lwd = 2, lty = 2)
lines(s1[1, ], s1[4, ], col = "blue", lwd = 2, lty = 2)
plot(KMSurv, xlab = "Time (days)", ylab = "Surv Prob", ylim = c(0.25, 1), main = "Fit(GL5)/Predict(GL10)")
lines(s2[1, ], s2[2, ], col = "blue", lwd = 2)
lines(s2[1, ], s2[3, ], col = "blue", lwd = 2, lty = 2)
lines(s2[1, ], s2[4, ], col = "blue", lwd = 2, lty = 2)
plot(KMSurv, xlab = "Time (days)", ylab = "Surv Prob", ylim = c(0.25, 1), main = "Fit(GL5)/Predict(GL20)")
lines(s3[1, ], s3[2, ], col = "blue", lwd = 2)
lines(s3[1, ], s3[3, ], col = "blue", lwd = 2, lty = 2)
lines(s3[1, ], s3[4, ], col = "blue", lwd = 2, lty = 2)
plot(KMSurv, xlab = "Time (days)", ylab = "Surv Prob", ylim = c(0.25, 1), main = "Fit(GL10)/Predict(GL5)")
lines(s4[1, ], s4[2, ], col = "blue", lwd = 2)
lines(s4[1, ], s4[3, ], col = "blue", lwd = 2, lty = 2)
lines(s4[1, ], s4[4, ], col = "blue", lwd = 2, lty = 2)
plot(KMSurv, xlab = "Time (days)", ylab = "Surv Prob", ylim = c(0.25, 1), main = "Fit(GL10)/Predict(GL10)")
lines(s5[1, ], s5[2, ], col = "blue", lwd = 2)
lines(s5[1, ], s5[3, ], col = "blue", lwd = 2, lty = 2)
lines(s5[1, ], s5[4, ], col = "blue", lwd = 2, lty = 2)
plot(KMSurv, xlab = "Time (days)", ylab = "Surv Prob", ylim = c(0.25, 1), main = "Fit(GL10)/Predict(GL20)")
lines(s6[1, ], s6[2, ], col = "blue", lwd = 2)
lines(s6[1, ], s6[3, ], col = "blue", lwd = 2, lty = 2)
lines(s6[1, ], s6[4, ], col = "blue", lwd = 2, lty = 2)
```

Overall, there is a close agreement between the Kaplan-Meier estimate and the PGAM estimates despite the different function spaces in which the corresponding estimators "live": the space of all piecewise constant functions (KM) vs. that of smooth functions with bounded, continuous second derivatives (PGAM). Furthermore, the 95% confidence interval of each estimator (dashed lines) contains the expected value of the other estimator. This suggests that there is no systematic difference between the KM and the PGAM survival estimators. This was confirmed in simulated datasets (see Fig 2 in our PLoS One paper).


(This article was first published on ** Statistical Reflections of a Medical Doctor » R**, and kindly contributed to R-bloggers)

In the third part of the series on survival analysis with GAMs we will review the use of the baseline hazard estimates provided by this regression model. In contrast to the Cox model, the *log-baseline hazard* is estimated along with other quantities (e.g. the log-hazard ratios) by the Poisson GAM (PGAM) as:

log h(t) = λ₀ + f(t) + xᵀβ

In the aforementioned expression, the baseline hazard is equivalently modeled as a time-varying deviation (f(t)) from a constant (the intercept λ₀), or as a time-varying function (λ₀ + f(t)). In the latter case, the constant is absorbed into the smooth term. The choice between these equivalent forms is dictated by the application at hand; in particular, the intercept may be switched on or off by centering the smooth terms appearing in the call to the gam function. Hence, in the PGAM formulation the log-baseline hazard is yet another covariate that one estimates by a smooth function; other covariates may modify this hazard in a *proportional* fashion by additively shifting the log-baseline hazard (by xᵀβ).

In the “standard” way of fitting a PGAM by *mgcv*, the log-baseline hazard is estimated in the constant+deviation form. Exponentiation may be used to derive the baseline hazard and its standard errors. Continuing the analysis of the Primary Biliary Cirrhosis example from the second part of the series, we may write:

```
par(mfrow = c(2, 2))
plot(fGAM,  main = "Gauss Lobatto (5)",  ylab = "log-basehaz")
plot(fGAM2, main = "Gauss Lobatto (10)", ylab = "log-basehaz")
plot(fGAM,  main = "Gauss Lobatto (5)",  ylab = "basehaz", trans = exp)
plot(fGAM2, main = "Gauss Lobatto (10)", ylab = "basehaz", trans = exp)
```

There is no substantial difference in the estimates obtained by the coarse (Gauss Lobatto (5)) and finer (Gauss Lobatto (10)) discretizations. Note that as a result of fitting the log-hazard as a constant plus a time-varying deviation, the standard error of the curve vanishes at ~1050: the value of the log-hazard at that instant, in events per unit time, is provided by the intercept term.

Estimation of the log-baseline hazard allows the PGAM to function as a parametric, smooth alternative to the Kaplan Meier estimator. This will be examined in the fourth part of this series.


(This article was first published on ** Statistical Reflections of a Medical Doctor » R**, and kindly contributed to R-bloggers)

In the second part of the series we will consider the time discretization that makes the Poisson GAM approach to survival analysis possible.

Consider a set of *s* individual observations at times tᵢ, with censoring indicators dᵢ assuming the value of 0 if the corresponding observation was censored and 1 otherwise. Under the assumption of non-informative censoring, the likelihood of the sample is given by:

L = ∏ᵢ₌₁ˢ h(tᵢ)^dᵢ exp(−∫₀^tᵢ h(u) du)

where h(t) is the hazard function. By using an **interpolatory** quadrature rule, one may substitute the integral with a **weighted sum** evaluated at a distinct number of **nodes**.

where tᵢⱼ and wᵢⱼ are the nodes and weights of the integration rule and dᵢⱼ is an indicator variable equal to 1 if the corresponding node corresponds to an event time and zero otherwise. By including additional "pseudo-observations" at the nodes of the quadrature rule, we convert the survival likelihood to the kernel of a Poisson regression with variable exposures (weights). Conditional on the adoption of an efficient quadrature rule, this is a highly accurate approximation.

In order for the construct to work one has to ensure that the corresponding lifetimes are mapped to a node of the integration scheme. In our paper, this was accomplished by the adoption of the **Gauss-Lobatto** rule. The nodes and weights of the Gauss-Lobatto rule (which is defined in the interval [−1, 1]) depend on the Legendre polynomials in a complex way. The following R function will calculate the weights and nodes for the N-th order Gauss-Lobatto rule:

```
GaussLobatto <- function(N)
{
  N1 <- N
  N  <- N - 1
  ## initial guess: Chebyshev-Gauss-Lobatto nodes
  x <- cos(pi * (0:N) / N)
  P <- matrix(0, N1, N1)
  xold <- 2
  ## Newton iteration on the Legendre polynomial recurrence
  while (max(abs(x - xold)) > 2.22044604925031e-16) {
    xold <- x
    P[, 1] <- 1
    P[, 2] <- x
    for (k in 2:N) {
      P[, k + 1] <- ((2 * k - 1) * x * P[, k] - (k - 1) * P[, k - 1]) / k
    }
    x <- xold - (x * P[, N1] - P[, N]) / (N1 * P[, N1])
  }
  w <- 2 / (N * N1 * P[, N1]^2)
  ret <- list(x = rev(x), w = w)
  attr(ret, "order") <- N
  ret
}
```

which can be called to return a list of the nodes and their weights:

```
> GaussLobatto(5)
$x
[1] -1.0000000 -0.6546537  0.0000000  0.6546537  1.0000000

$w
[1] 0.1000000 0.5444444 0.7111111 0.5444444 0.1000000

attr(,"order")
[1] 4
```
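As a quick sanity check (my addition, assuming the `GaussLobatto()` function defined above), the returned weights integrate low-order polynomials over [−1, 1] essentially exactly:

```
GL <- GaussLobatto(5)
## the weights integrate the constant 1 over [-1, 1]
sum(GL$w)            # approx. 2
## a 5-point rule is exact for x^2; the true integral is 2/3
sum(GL$w * GL$x^2)   # approx. 0.6666667
```

Checks of this kind are a cheap way to verify the nodes and weights before they are mapped onto the study time scale below.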

To prepare a survival dataset for GAM fitting, one needs to call this function to obtain a Gauss Lobatto rule of the required order. Once this has been obtained, the following R function will expand the (right-censored) dataset to include the pseudo-observations at the nodes of the quadrature rule:

```
GAMSurvdataset <- function(GL, data, fu, d)
## GL  : Gauss Lobatto rule
## data: survival data
## fu  : column number containing fu info
## d   : column number with event indicator
{
  ## append artificial ID in the set
  data$id <- 1:nrow(data)
  Gllx <- data.frame(stop    = rep(GL$x, length(data$id)),
                     gam.dur = rep(GL$w, length(data$id)),
                     t       = rep(data[, fu], each = length(GL$x)),
                     ev      = rep(data[, d], each = length(GL$x)),
                     id      = rep(data$id, each = length(GL$x)),
                     gam.ev  = 0, start = 0)
  ## Change the final indicator to what was observed, map node
  ## positions and weights from [-1,1] back to the study time
  Gllx <- transform(Gllx,
                    gam.ev  = as.numeric((gam.ev | ev) * I(stop == 1)),
                    gam.dur = 0.5 * gam.dur * (t - start),
                    stop    = 0.5 * (stop * (t - start) + (t + start)))
  ## now merge the remaining covariate info
  Gllx <- merge(Gllx, data[, -c(fu, d)])
  Gllx
}
```

We illustrate the use of these functions on the Primary Biliary Cirrhosis dataset that comes with R:

```
> data(pbc)
> ## Change transplant to alive
> pbc$status[pbc$status==1]<-0
> ## Change event code of death(=2) to 1
> pbc$status[pbc$status==2]<-1
>
> head(pbc)
  id time status trt      age sex ascites hepato spiders edema bili chol albumin copper
1  1  400      1   1 58.76523   f       1      1       1   1.0 14.5  261    2.60    156
2  2 4500      0   1 56.44627   f       0      1       1   0.0  1.1  302    4.14     54
3  3 1012      1   1 70.07255   m       0      0       0   0.5  1.4  176    3.48    210
4  4 1925      1   1 54.74059   f       0      1       1   0.5  1.8  244    2.54     64
5  5 1504      0   2 38.10541   f       0      1       1   0.0  3.4  279    3.53    143
6  6 2503      1   2 66.25873   f       0      1       0   0.0  0.8  248    3.98     50
  alk.phos    ast trig platelet protime stage
1   1718.0 137.95  172      190    12.2     4
2   7394.8 113.52   88      221    10.6     3
3    516.0  96.10   55      151    12.0     4
4   6121.8  60.63   92      183    10.3     4
5    671.0 113.15   72      136    10.9     3
6    944.0  93.00   63       NA    11.0     3
>
> GL<-GaussLobatto(5)
> pbcGAM<-GAMSurvdataset(GL,pbc,2,3)
> head(pbcGAM)
  id      stop  gam.dur    t ev gam.ev start trt      age sex ascites hepato spiders
1  1   0.00000  20.0000  400  1      0     0   1 58.76523   f       1      1       1
2  1  69.06927 108.8889  400  1      0     0   1 58.76523   f       1      1       1
3  1 200.00000 142.2222  400  1      0     0   1 58.76523   f       1      1       1
4  1 330.93073 108.8889  400  1      0     0   1 58.76523   f       1      1       1
5  1 400.00000  20.0000  400  1      1     0   1 58.76523   f       1      1       1
6  2   0.00000 225.0000 4500  0      0     0   1 56.44627   f       0      1       1
  edema bili chol albumin copper alk.phos    ast trig platelet protime stage
1     1 14.5  261    2.60    156   1718.0 137.95  172      190    12.2     4
2     1 14.5  261    2.60    156   1718.0 137.95  172      190    12.2     4
3     1 14.5  261    2.60    156   1718.0 137.95  172      190    12.2     4
4     1 14.5  261    2.60    156   1718.0 137.95  172      190    12.2     4
5     1 14.5  261    2.60    156   1718.0 137.95  172      190    12.2     4
6     0  1.1  302    4.14     54   7394.8 113.52   88      221    10.6     3
>
> dim(pbc)
[1] 418  20
> dim(pbcGAM)
[1] 2090   24
```

The original (pbc) dataset has been expanded to include the pseudo-observations at the nodes of the Lobatto rule. There are multiple records (5 per individual in this particular case), as can be seen by examining the data for the first patient (id=1). The corresponding times are found in the variable **stop**, their associated weights in the variable **gam.dur**, and the event indicators are in the column **gam.ev**. Note that nodes and weights are expressed on the scale of the survival dataset, not in the scale of the Lobatto rule (the interval [−1, 1]). To fit the survival dataset one needs to load the mgcv package and fit a Poisson GAM, using a flexible (penalized spline) model for the **log-hazard rate** function.

The following code will obtain an adjusted (for age and sex) hazard ratio using the PGAM or the Cox model:

```
> library(survival) ## for coxph
> library(mgcv)     ## for mgcv
>
> ## Prop Hazards Modeling with PGAM
> fGAM<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+          data=pbcGAM,family="poisson",scale=1,method="REML")
>
> ## Your Cox Model here
> f<-coxph(Surv(time,status)~trt+age+sex,data=pbc)
>
> summary(fGAM)

Family: poisson
Link function: log

Formula:
gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.345236   0.655176 -15.790  < 2e-16 ***
trt           0.069546   0.181779   0.383    0.702
age           0.038488   0.008968   4.292 1.77e-05 ***
sexf         -0.370260   0.237726  -1.558    0.119
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
          edf Ref.df Chi.sq p-value
s(stop) 1.008  1.015  4.186  0.0417 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.249   Deviance explained = 2.25%
-REML = 693.66  Scale est. = 1  n = 1560
> f
Call:
coxph(formula = Surv(time, status) ~ trt + age + sex, data = pbc)

        coef exp(coef) se(coef)      z       p
trt   0.0626     1.065    0.182  0.344 7.3e-01
age   0.0388     1.040    0.009  4.316 1.6e-05
sexf -0.3377     0.713    0.239 -1.414 1.6e-01

Likelihood ratio test=22.5 on 3 df, p=5.05e-05
n= 312, number of events= 125
(106 observations deleted due to missingness)
```

The estimates for the log-hazard ratios of the three covariates (trt, age, and female gender) are numerically very close. Any numerical differences reflect the different assumptions made about the baseline hazard: flexible spline (PGAM) vs. piecewise exponential (Cox).

Increasing the number of nodes of the Lobatto rule does not materially affect the estimates of the PGAM:

```
> GL<-GaussLobatto(10)
> pbcGAM2<-GAMSurvdataset(GL,pbc,2,3)
> fGAM2<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)),
+           data=pbcGAM2,family="poisson",scale=1,method="REML")
>
> summary(fGAM2)

Family: poisson
Link function: log

Formula:
gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur))

Parametric coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.345288   0.655177 -15.790  < 2e-16 ***
trt           0.069553   0.181780   0.383    0.702
age           0.038487   0.008968   4.292 1.77e-05 ***
sexf         -0.370340   0.237723  -1.558    0.119
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
          edf Ref.df Chi.sq p-value
s(stop) 1.003  1.005  4.163  0.0416 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = -0.124   Deviance explained = 1.7%
-REML = 881.67  Scale est. = 1  n = 3120
```

Nevertheless, the estimates of the “baseline log-hazard” become more accurate (decreased standard errors and significance of the smooth term) as the number of nodes increases.

In simulations (see Fig 3) we show that the estimates of the hazard ratio generated by the GAM are comparable in bias, variance and coverage to those obtained by the Cox model. Even though this is an important benchmark for the proposed method, it does not provide a compelling reason for replacing the Cox model with the PGAM. In fact, the advantages of the PGAM will only become apparent once we consider contexts which depend on the baseline hazard function, or problems in which the proportionality of hazards assumption is violated. So stay tuned.


(This article was first published on ** Thinking inside the box **, and kindly contributed to R-bloggers)

The new release 0.11.6 of Rcpp arrived on the CRAN network for GNU R yesterday; the corresponding Debian package has also been uploaded.

Rcpp has become *the* most popular way of enhancing GNU R with C++ code. As of today, 373 packages on CRAN depend on Rcpp for making analyses go faster and further; BioConductor adds another 57 packages, and casual searches on GitHub suggest many more.

This version adds a little more polish and refinement around things we worked on in previous releases to solidify builds, installation and the run-time experience. It does not bring anything new or major; the release continues the 0.11.* release cycle, adding another large number of small bug fixes, polishes and enhancements. As always, you can follow the development via the GitHub repo and particularly the Issue tickets and Pull Requests. And any discussions, questions, etc. regarding Rcpp are always welcome at the rcpp-devel mailing list.

See below for a detailed list of changes extracted from the `NEWS` file.

## Changes in Rcpp version 0.11.6 (2015-05-01)

Changes in Rcpp API:

- The unwinding of exceptions was refined to protect against inadvertent memory leaks.
- Header files now try even harder not to let macro definitions leak.
- Matrices have a new default constructor for zero-by-zero dimension matrices (via a pull request by Dmitrii Meleshko).
- A new `empty()` string constructor was added (via another pull request).
- Better support for Vectors with a storage policy different from the default, i.e. `NoProtectStorage`, was added.

Changes in Rcpp Attributes:

- Rtools 3.3 is now supported.

Thanks to CRANberries, you can also look at a diff to the previous release. As always, even fuller details are on the Rcpp Changelog page and the Rcpp page, which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. A local directory has source and documentation too. Questions, comments etc. should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


(This article was first published on ** Thinking inside the box **, and kindly contributed to R-bloggers)

A new minor release 5.100.1 of Armadillo was released by Conrad yesterday. Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab.

Our corresponding RcppArmadillo release 0.5.100.1.0 also reached CRAN and Debian yesterday. See below for the brief list of changes.

## Changes in RcppArmadillo version 0.5.100.1.0 (2015-05-01)

- Upgraded to Armadillo release 5.100.1 ("Ankle Biter Deluxe")
  - added `interp1()` for 1D interpolation
  - added `.is_sorted()` for checking whether a vector or matrix has sorted elements
  - updated physical constants to NIST 2010 CODATA values

Courtesy of CRANberries, there is also a diffstat report for the most recent CRAN release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


(This article was first published on ** Wiekvoet**, and kindly contributed to R-bloggers)

A while ago I had a post, 'Should I use premium Diesel? Setup'. Since that time the data has been acquired; this post describes the results. Results from distribution fitting and a t-test confirm the visual observation.

```
Standard
      mean           sd
  3.59517971   0.19314598
 (0.04828649) (0.03414371)

Premium
      mean           sd
  3.54015573   0.23905047
 (0.05976262) (0.04225855)
```

```
data:  usage by Fuel
t = -0.6934, df = 28.732, p-value = 0.4936
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2173822  0.1073343
sample estimates:
 mean in group Premium mean in group Standard
              3.540156               3.595180
```

The result is a density for mean usage quite close to 3.6.

```
Iterations = 501:20500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 20000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

       Mean      SD  Naive SE Time-series SE
[1,] 3.5409 0.06413 0.0004535       0.001722
[2,] 0.2699 0.05353 0.0003785       0.001448

2. Quantiles for each variable:

       2.5%   25%    50%    75%  97.5%
var1 3.4147 3.494 3.5465 3.5884 3.6523
var2 0.1873 0.232 0.2629 0.2998 0.3953
```

While premium Diesel had a few occasions with remarkable fuel usage, it did not result in a relevant decrease in fuel usage overall. In addition, while the advised prices for standard and premium are 5% apart, the achievable price difference is 10%: the real price fighters do not sell premium, hence the pressure to discount it is lower. This 10% price difference is more than I initially expected. Hence the data do not show a benefit to using premium Diesel.

```
library(MASS)
library(ggplot2)
library(MCMCpack)

r1 <- read.table(text='l km
28.25 710.6
22.93 690.4
28.51 760.5
23.22 697.9
31.52 871.2
24.68 689.6
30.85 826.9
23.04 699
29.96 845.3
30.16 894.7
25.71 696
23.6 669.8
28.57 739
27.23 727.4
18.31 499.9
24.28 689.5', header=TRUE)
r1$usage <- 100*r1$l/r1$km
r1$Fuel <- 'Standard'

r2 <- read.table(text='l km
23.97 696
25.33 693.5
24.19 698.5
23.27 689.6
25.78 707.8
15.84 510.1
32.23 860.2
23.29 674.6
25.33 678.5
26.44 686
36.43 913.2
26.46 850.1
24.38 678.8
34.85 1001.3
28.34 858.5
25.85 698.8', header=TRUE)
r2$usage <- 100*r2$l/r2$km
r2$Fuel <- 'Premium'

rr <- rbind(r1, r2)
rr$Fuel <- factor(rr$Fuel)
ggplot(rr, aes(Fuel, usage)) + geom_violin() +
  scale_y_continuous('usage (l/100 km)')

fitdistr(r1$usage, 'normal')
fitdistr(r2$usage, 'normal')
## the grouping variable is Fuel (the original post had 'kind' here,
## which is not a column of rr)
t.test(usage ~ Fuel, data=rr)

## log prior density for mean usage: a mixture of three normals
## around 3.6, 3.6/1.05 and their midpoint
priormeandens <- function(usage) {
  dens <- (dnorm(usage, 3.6, .05) +
           dnorm(usage, 3.6/1.05, .05) +
           dnorm(usage, (3.6 + 3.6/1.05)/2, .15))/3
  log(dens)
}

## log posterior kernel: normal likelihood plus priors
mydens <- function(theta, x) {
  musage  <- theta[1]
  sdusage <- theta[2]
  datdens   <- sum(dnorm(x, musage, sdusage, log=TRUE))
  priormean <- priormeandens(musage)
  priorsd   <- dunif(sdusage, 0, 1, log=TRUE)
  datdens + priormean + priorsd
}

mcmc1 <- MCMCmetrop1R(mydens,
                      theta.init=c(3.8, .2),
                      seed=as.integer((Sys.time())),
                      x=r2$usage)
plot(mcmc1)
summary(mcmc1)
```
