Employers: visit this link to post a new R job to the R community (it’s free and quick).
Job seekers: please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).
]]>
A new version, now at 0.0.4, of the drat package arrived on CRAN yesterday. Its name stands for drat R Archive Template, and it helps with easy-to-create and easy-to-use repositories for R packages, and is finding increasing by other projects.
Version 0.0.4 brings both new code and more documentation:
initRepo()
to create a git-based repository, and pruneRepo()
to remove older versions of packages;insertRepo()
functions now uses tryCatch()
around git commands (with thanks to Carl Boettiger);file:/some/path
with one colon but not two slashes (also thanks to Stefan Bache);Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
The post R Recipe: Aligning Axes in ggplot2 appeared first on Exegetic Analytics.
]]>Faceted plots in ggplot2 are phenomenal. They give you a simple way to break down an array of plots according to the values of one or more categorical variables. But what if you want to stack plots of different variables? Not quite so simple. But certainly possible. I gathered together this solution from a variety of sources on stackoverflow.
To illustrate, first let's generate some sample data.
> data <- data.frame(x = seq(0, 100, 1)) > # > data = transform(data, + y1 = sin(x * pi / 10), + y2 = x**2 + )
Then we'll import a couple of libraries: ggplot2, of course, and gridExtra which will give us the stacking functionality.
> library(ggplot2) > library(gridExtra)
We'll generate the two plots.
> p1 <- ggplot(data, aes(x = x)) + geom_line(aes(y = y1)) + theme_classic() > p2 <- ggplot(data, aes(x = x)) + geom_bar(aes(y = y2), stat = "identity") + theme_classic()
We could just slap these plots onto a grid, which would produce an acceptable plot. But my inner perfectionist is not happy with the fact that the two vertical axes do not line up and, consequently, the horizontal scales are slightly different.
The first step towards fixing this small issue is to take the plots and convert them into gtables.
> p1 <- ggplot_gtable(ggplot_build(p1)) > p2 <- ggplot_gtable(ggplot_build(p2))
Next the pivotal bit: find the widths of each of the plots, calculate the maximum and then apply it to each of them individually. This effectively applies a uniform layout to each of the plots.
> maxWidth = unit.pmax(p1$widths[2:3], p2$widths[2:3]) > > p1$widths[2:3] <- maxWidth > p2$widths[2:3] <- maxWidth
And the final result is a stacked plot with perfectly aligned vertical axes. Success!
> grid.arrange(p1, p2, heights = c(3, 2))
The post R Recipe: Aligning Axes in ggplot2 appeared first on Exegetic Analytics.
Another trip in the métro today (to work with Pierre Jacob and Lawrence Murray in a Paris Anticafé!, as the University was closed) led me to infer—warning!, this is not the exact distribution!—the distribution of x, namely
since a path x of length l(x) will corresponds to N draws if N-l(x) is an even integer 2p and p undistinguishable annihilations in 4 possible directions have to be distributed over l(x)+1 possible locations, with Feller’s number of distinguishable distributions as a result. With a prior π(N)=1/N on N, hence on p, the posterior on p is given by
Now, given N and x, the probability of no annihilation on the last round is 1 when l(x)=N and in general
which can be integrated against the posterior. The numerical expectation is represented for a range of values of l(x) in the above graph. Interestingly, the posterior probability is constant for l(x) large and equal to 0.8125 under a flat prior over N.
Getting back to Pierre Druilhet’s approach, he sets a flat prior on the length of the path θ and from there derives that the probability of annihilation is about 3/4. However, “the uniform prior on the paths of lengths lower or equal to M” used for this derivation which gives a probability of length l proportional to 3^{l} is quite different from the distribution of l(θ) given a number of draws N. Which as shown above looks much more like a Binomial B(N,1/2).
However, being not quite certain about the reasoning involving Fieller’s trick, I ran an ABC experiment under a flat prior restricted to (l(x),4l(x)) and got the above, where the histogram is for a posterior sample associated with l(x)=195 and the gold curve is the potential posterior. Since ABC is exact in this case (i.e., I only picked N’s for which l(x)=195), ABC is not to blame for the discrepancy! Here is the R code that goes with the ABC implementation:
#observation: elo=195 #ABC version T=1e6 el=rep(NA,T) N=sample(elo:(4*elo),T,rep=TRUE) for (t in 1:T){ #generate a path paz=sample(c(-(1:2),1:2),N[t],rep=TRUE) #eliminate U-turns uturn=paz[-N[t]]==-paz[-1] while (sum(uturn>0)){ uturn[-1]=uturn[-1]*(1- uturn[-(length(paz)-1)]) uturn=c((1:(length(paz)-1))[uturn==1], (2:length(paz))[uturn==1]) paz=paz[-uturn] uturn=paz[-length(paz)]==-paz[-1] } el[t]=length(paz)} #subsample to get exact posterior poster=N[abs(el-elo)==0]
I set aside a small bit of time to give rbokeh a try and figured I’d share a small bit of code that shows how to make the “same” chart in both ggplot2 and rbokeh.
rbokeh is an htmlwidget wrapper for the Bokeh visualization library that has become quite popular in Python circles. Bokeh makes creating interactive charts pretty simple and rbokeh lets you do it all with R syntax.
This is not a comprehensive introduction into rbokeh. You can get that here (officially). I merely wanted to show how a ggplot idiom would map to an rbokeh one for those that may be looking to try out the rbokeh library and are familiar with ggplot. They share a very common “grammar of graphics” base where you have a plot structure and add layers and aesthetics. They each do this a tad bit differently, though, as you’ll see.
First, let’s plot a line graph with some markers in ggplot. The data I’m using is a small time series that we’ll use to plot a cumulative sum of via a line graph. It’s small enough to fit inline:
library(ggplot2) library(rbokeh) library(htmlwidgets) structure(list(wk = structure(c(16069, 16237, 16244, 16251, 16279, 16286, 16300, 16307, 16314, 16321, 16328, 16335, 16342, 16349, 16356, 16363, 16377, 16384, 16391, 16398, 16412, 16419, 16426, 16440, 16447, 16454, 16468, 16475, 16496, 16503, 16510, 16517, 16524, 16538, 16552, 16559, 16566, 16573), class = "Date"), n = c(1L, 1L, 1L, 1L, 3L, 1L, 3L, 2L, 4L, 2L, 3L, 2L, 5L, 5L, 1L, 1L, 3L, 3L, 3L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 7L, 1L, 2L, 6L, 7L, 1L, 1L, 1L, 2L, 2L, 7L, 1L)), .Names = c("wk", "n"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -38L)) -> by_week events <- data.frame(when=as.Date(c("2014-10-09", "2015-03-20", "2015-05-15")), what=c("Thing1", "Thing2", "Thing2")) |
The ggplot version is pretty straightforward:
gg <- ggplot() gg <- gg + geom_vline(data=events, aes(xintercept=as.numeric(when), color=what), linetype="dashed", alpha=1/2) gg <- gg + geom_text(data=events, aes(x=when, y=1, label=what, color=what), hjust=1.1, size=3) gg <- gg + geom_line(data=by_week, aes(x=wk, y=cumsum(n))) gg <- gg + scale_x_date(expand=c(0, 0)) gg <- gg + scale_y_continuous(limits=c(0, 100)) gg <- gg + labs(x=NULL, y="Cumulative Stuff") gg <- gg + theme_bw() gg <- gg + theme(panel.grid=element_blank()) gg <- gg + theme(panel.border=element_blank()) gg <- gg + theme(legend.position="none") gg |
We:
events
dates)cumsum(n)
vs pre-compute itThat gives us this:
Here’s a similar structure in rbokeh:
figure(width=550, height=375, logo="grey", outline_line_alpha=0) %>% ly_abline(v=events$when, color=c("red", "blue", "blue"), type=2, alpha=1/4) %>% ly_text(x=events$when, y=5, color=c("red", "blue", "blue"), text=events$what, align="right", font_size="7pt") %>% ly_lines(x=wk, y=cumsum(n), data=by_week) %>% y_range(c(0, 100)) %>% x_axis(grid=FALSE, label=NULL, major_label_text_font_size="8pt", axis_line_alpha=0) %>% y_axis(grid=FALSE, label="Cumulative Stuff", minor_tick_line_alpha=0, axis_label_text_font_size="10pt", axis_line_alpha=0) -> rb rb |
Here, we set the width
and height
and configure some of the initial aesthetic options. Note that outline_line_alpha=0
is the equivalent of theme(panel.border=element_blank())
.
The markers and text do not work exactly as one might expect since there’s no way to specify a data
parameter, so we have to set the colors manually. Also, since the target is a browser, points are specified in the same way you would with CSS. However, it’s a pretty easy translation from geom_[hv]line
to ly_abline
and geom_text
to ly_text
.
The ly_lines
works pretty much like geom_line
.
Notice that both ggplot and rbokeh can grok dates for plotting (though we do not need the as.numeric
hack for rbokeh).
rbokeh will auto-compute bounds like ggplot would but I wanted the scale to go from 0 to 100 in each plot. You can think of y_range
as ylim
in ggplot.
To configure the axes, you work directly with x_axis
and y_axis
parameters vs theme
elements in ggplot. To turn off only lines, I set the alpha to 0 in each and did the same with the y axis minor tick marks.
Here’s the rbokeh result:
NOTE: you can save out the widget with:
saveWidget(rb, file="rbokeh001.html") |
and I like to use the following iframe
settings to include the widgets:
<iframe style="max-width=100%" src="rbokeh001.html" sandbox="allow-same-origin allow-scripts" width="100%" height="400" scrolling="no" seamless="seamless" frameBorder="0"></iframe> |
Hopefully this helped a bit with translating some ggplot idioms over to rbokeh and developing a working mental model of rbokeh plots. As I play with it a bit more I’ll add some more examples here in the event there are “tricks” that need to be exposed. You can find the code up on github and please feel free to drop a note in the comments if there are better ways of doing what I did or if you have other hints for folks.
Recently, in various research projects, the Lambert-W function arose a number of times. Somewhat frustratingly, there is no built-in function in R to calculate it. The only options were those in the gsl and LambertW packages, the latter merely importing the former. Importing the entire GNU Scientific Library (GSL) can be a bit of a hassle, especially for those of us restricted to a Windows environment.
Therefore, I spent a little time and built a package for R whose sole purpose is to calculate the real-valued versions of the Lambert W function without the need for importing the GSL: the lamW package. It does depends on Rcpp, though. It could have been written in pure R, but as there are a number of loops involved which cannot be vectorized, and as Rcpp is fast becoming almost a base package, I figured that the speed and convenience was worth it.
A welcome outcome of this was that I think I finally wrapped my head around basic Padé approximation, which I use when calculating some parts of the primary branch of the Lambert-W. Eventually, I’d like to write a longer post about Padé approximation; when that will happen, who knows 8-).
By Mark Malter
A few weeks ago, I wrote about my Baseball Stats R shiny application, where I demonstrated how to calculate runs expectancies based on the 24 possible bases/outs states for any plate appearance. In this article, I’ll explain how I expanded on that to calculate the probability of winning the game, based on the current score/inning/bases/outs state. While this is done on other websites, I have added some unique play attempt features -- steal attempt, sacrifice bunt attempt, and tag from third attempt -- to show the probability of winning with and without the attempt, as well as the expected win probability given a user determined success rate for the play. That way, a manager can not only know the expected runs based on a particular decision, but the actual probability of winning the game if the play is attempted.
After the user enters the score, inning, bases state, and outs, the code runs through a large number of simulated games using the expected runs to be scored over the remainder of the current half inning, as well as each succeeding half inning for the remainder of the game.
When there are runners on base and the user clicks on any of the ‘play attempt’ tabs, a new table is generated showing the new probabilities. I allow for sacrifice bunts with less than two outs and a runner on first, second, or first and second. The stolen base tab can be used with any number of outs and the possibility of stealing second, third, or both. The tag from third tab will work as long as there is a runner on third and less than two outs prior to the catch.
I first got this idea after watching game seven of the 2014 World Series. Trailing 3-2 with two outs and nobody on base, Alex Gordon singled to center off of Madison Bumgarner and made it all the way to third base after a two base error. As Gordon was approaching third base, the coach gave him the stop sign, as shortstop Brandon Crawford was taking the relay throw in short left field. Had Gordon attempted to score on the play, he probably would have been out at the plate- game and series over. However, by holding up at third base, the Royals are also still ‘probably’ going to lose since runners only score from third base with two outs about 26% of the time.
The calculator shows that the probability of winning a game down by one run with a runner on third and two outs in the bottom of the ninth (or an extra inning) is roughly 17%. If we click on the ‘Tag from Third Attempt’ tab (Gordon attempting to score would have been equivalent to tagging from third after a catch for the second out), and play with the ‘Base running success rate’ slider, we see that the break even success rate is roughly 0.3. I don’t know the probability of Gordon beating the throw home, but if it was greater than 0.3 then making the attempt would have improved the Royals chances of winning the World Series. In fact, if the true success rate was as much as 0.5 then the Royals win probability would have jumped by 11% to 28%.
Here is the UI code. Here is the server code. And here is the Shiny App.
>