There should be a word for “the regret felt when an R 📦, which would have saved untold hours of your life, is released”… #rstats 🤔 https://t.co/2THN4MwedO
— Mara Averick (@dataandme) May 31, 2017
But seriously, I have worked with US Census data a lot in the past, and this package would have saved me untold hours.
I was working this weekend on a side project with an old friend about opioid usage in Texas and needed to download some Census data again. A perfect opportunity to give this new package a little run-through!
Before running code like the following from tidycensus, you need to obtain an API key from the Census Bureau and then set it in R with the census_api_key() function.
library(tidyverse)
library(tidycensus)
texas_pop <- get_acs(geography = "county",
variables = "B01003_001",
state = "TX",
geometry = TRUE)
texas_pop
## Simple feature collection with 254 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -106.6456 ymin: 25.83738 xmax: -93.50829 ymax: 36.5007
## epsg (SRID): 4269
## proj4string: +proj=longlat +datum=NAD83 +no_defs
## # A tibble: 254 x 6
## GEOID NAME variable estimate moe geometry
## <chr> <chr> <chr> <dbl> <dbl> <S3: sfc_MULTIPOLYGON>
## 1 48007 Aransas County, Texas B01003_001 24292 0 <S3: sfc_MULTIPOLYGON>
## 2 48025 Bee County, Texas B01003_001 32659 0 <S3: sfc_MULTIPOLYGON>
## 3 48035 Bosque County, Texas B01003_001 17971 0 <S3: sfc_MULTIPOLYGON>
## 4 48067 Cass County, Texas B01003_001 30328 0 <S3: sfc_MULTIPOLYGON>
## 5 48083 Coleman County, Texas B01003_001 8536 0 <S3: sfc_MULTIPOLYGON>
## 6 48091 Comal County, Texas B01003_001 119632 0 <S3: sfc_MULTIPOLYGON>
## 7 48103 Crane County, Texas B01003_001 4730 0 <S3: sfc_MULTIPOLYGON>
## 8 48139 Ellis County, Texas B01003_001 157058 0 <S3: sfc_MULTIPOLYGON>
## 9 48151 Fisher County, Texas B01003_001 3858 0 <S3: sfc_MULTIPOLYGON>
## 10 48167 Galveston County, Texas B01003_001 308163 0 <S3: sfc_MULTIPOLYGON>
## # ... with 244 more rows
There we go! The total population in each county in Texas, in a tidyverse-ready data frame. If you want to get information for multiple states, just use purrr. The US Census tabulates lots of important kinds of information here in the United States, although there has been troubling uncertainty about leadership and funding there in recent months.
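To make the multi-state pattern concrete, here is a runnable sketch. It uses a stand-in function instead of a live get_acs() call (the real call needs an API key and a network connection), and base R's lapply() plus rbind plays the role purrr::map_dfr() would play with the real data.

```r
# Stand-in for get_acs() so the pattern runs offline; the real call would be
# get_acs(geography = "county", variables = "B01003_001", state = st).
fake_get_acs <- function(st) {
  data.frame(state = st, estimate = c(100, 200, 300))
}

states <- c("TX", "OK", "NM")

# Apply the download function per state, then row-bind the results
# (purrr::map_dfr(states, fake_get_acs) would do the same in one step).
multi_state <- do.call(rbind, lapply(states, fake_get_acs))
nrow(multi_state)  # 9 rows: 3 per state
```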
So we have this data in a form that will be easy to manipulate; what if we want to map it? Kyle Walker again has this taken care of, with his tigris package (a dependency of tidycensus). If you set geometry = TRUE the way I did when I downloaded the Census data above, tigris handles downloading the shapefiles from the Census Bureau, with support for sf simple features. Kyle has a vignette for mapping using ggplot2, but you can also pipe straight into leaflet.
library(leaflet)
library(stringr)
library(sf)
pal <- colorQuantile(palette = "viridis", domain = texas_pop$estimate, n = 10)
texas_pop %>%
st_transform(crs = "+init=epsg:4326") %>%
leaflet(width = "100%") %>%
addProviderTiles(provider = "CartoDB.Positron") %>%
addPolygons(popup = ~ str_extract(NAME, "^([^,]*)"),
stroke = FALSE,
smoothFactor = 0,
fillOpacity = 0.7,
color = ~ pal(estimate)) %>%
addLegend("bottomright",
pal = pal,
values = ~ estimate,
title = "Population percentiles",
opacity = 1)
What is that st_transform doing? Well, I am no cartographer and I am still fuzzy on these issues, but it reprojects the spatial information contained in the sf column onto a particular coordinate reference system. EPSG code 4326 specifies the WGS84 longitude/latitude system, which is what leaflet expects its input coordinates to be in.
Let’s look at the counties in Utah (where I live) while we’re at it. Let’s map color to population here, instead of quantiles, just for something different.
utah_pop <- get_acs(geography = "county",
variables = "B01003_001",
state = "UT",
geometry = TRUE)
pal <- colorNumeric(palette = "plasma",
domain = utah_pop$estimate)
utah_pop %>%
st_transform(crs = "+init=epsg:4326") %>%
leaflet(width = "100%") %>%
addProviderTiles(provider = "CartoDB.Positron") %>%
addPolygons(popup = ~ str_extract(NAME, "^([^,]*)"),
stroke = FALSE,
smoothFactor = 0,
fillOpacity = 0.7,
color = ~ pal(estimate)) %>%
addLegend("bottomright",
pal = pal,
values = ~ estimate,
title = "County Populations",
opacity = 1)
Yep, that is right, although still remarkable to me. Utah is largely an extremely rural state, with lots of people here in Salt Lake City where I live and then in the corridor to the north and south.
There is so much other information available from the Census. For example, what if I want to look at the median home value in Salt Lake County, at the census tract level?
slc_value <- get_acs(geography = "tract",
variables = "B25077_001",
state = "UT",
county = "Salt Lake County",
geometry = TRUE)
pal <- colorNumeric(palette = "viridis",
domain = slc_value$estimate)
slc_value %>%
st_transform(crs = "+init=epsg:4326") %>%
leaflet(width = "100%") %>%
addProviderTiles(provider = "CartoDB.Positron") %>%
addPolygons(popup = ~ str_extract(NAME, "^([^,]*)"),
stroke = FALSE,
smoothFactor = 0,
fillOpacity = 0.7,
color = ~ pal(estimate)) %>%
addLegend("bottomright",
pal = pal,
values = ~ estimate,
title = "Median Home Value",
labFormat = labelFormat(prefix = "$"),
opacity = 1)
The two census tracts with NA
values are the airport on the west side and the University of Utah on the east side. You can very obviously see the east-to-west gradient that comes as no surprise to us locals, and that priciest tract is up against one of the canyons with beautiful views. But mainly please notice with what ease I made this interactive map!
Maybe the main reason I wrote up this blog post is to say how streamlined and easy it now is to get Census data into R and plot it, but another reason is to demonstrate, at least to myself, how little effort it takes to make a blog post with, say, interactive leaflet components with my new blogging workflow. I recently changed my blog from a Jekyll blog hosted on GitHub Pages to a blog built with blogdown and Hugo, deployed using Netlify. I am finding this workflow so great, and this post with its leaflet maps went off without a hitch! I would like to say a big THANK YOU to Yihui Xie for his work on R Markdown, knitr, and blogdown. Let me know if you have any questions!
So you are using the pipe to send data through a sequence of functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R:
> data(retailers, package="validate")
> head(retailers, 3)
size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1 sc0 0.02 75 NA NA 1130 NA 18915 20045 NA
2 sc3 0.14 9 1607 NA 1607 131 1544 63 NA
3 sc3 0.14 NA 6886 -33 6919 324 6493 426 NA
This data is dirty with missings and full of errors. Let us do some imputations with simputation.
> out <- retailers %>%
+ impute_lm(other.rev ~ turnover) %>%
+ impute_median(other.rev ~ size)
>
> head(out,3)
size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1 sc0 0.02 75 NA 6114.775 1130 NA 18915 20045 NA
2 sc3 0.14 9 1607 5427.113 1607 131 1544 63 NA
3 sc3 0.14 NA 6886 -33.000 6919 324 6493 426 NA
>
Ok, cool, we know all that. But what if you’d like to know what value was imputed with which method? That’s where the lumberjack comes in.
The lumberjack operator is a 'pipe'[1] operator that allows you to track changes in data.
> library(lumberjack)
> retailers$id <- seq_len(nrow(retailers))
> out <- retailers %>>%
+ start_log(log=cellwise$new(key="id")) %>>%
+ impute_lm(other.rev ~ turnover) %>>%
+ impute_median(other.rev ~ size) %>>%
+ dump_log(stop=TRUE)
Dumped a log at cellwise.csv
>
> read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3)
step time expression key variable old new
1 2 2017-06-23 21:11:05 CEST impute_median(other.rev ~ size) 1 other.rev NA 6114.775
2 1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover) 2 other.rev NA 5427.113
3 1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover) 6 other.rev NA 6341.683
>
So, to track changes we only need to switch from %>% to %>>% and add the start_log() and dump_log() function calls to the data pipeline. (To be sure: it works with any function, not only with simputation.) The package is on CRAN now; please see the introductory vignette for more examples and ways to customize it.
There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.
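To make the logging idea concrete, here is a toy sketch in base R. The %track% operator and its change-counting rule are made up for illustration; the real lumberjack loggers are far more capable and record the step, expression, and old/new values as shown above.

```r
# Toy illustration only: a pipe-like operator that applies f to x
# and reports how many cells changed (NA-aware).
`%track%` <- function(x, f) {
  out <- f(x)
  changed <- xor(is.na(x), is.na(out)) |
    (!is.na(x) & !is.na(out) & x != out)
  message(sum(changed), " cell(s) changed")
  out
}

# Replace NAs with 0 and log the change count:
res <- c(1, 2, NA) %track% (function(v) ifelse(is.na(v), 0, v))
# res is c(1, 2, 0); the operator reports "1 cell(s) changed"
```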
If this post got you interested, please install the package using
install.packages('lumberjack')
You can get started with the introductory vignette, or even just use the lumberjack operator %>>% as a (close) replacement for the %>% operator.
As always, I am open to suggestions and comments, for instance through the package's GitHub page.
Also, I will be talking at useR2017 about the simputation package, but I will sneak in a bit of lumberjack as well :p.
And finally, here’s a picture of a lumberjack smoking a pipe.
[1] It really should be called a function composition operator, but potato/potahto.
R is incredible software for statistics and data science. But while the bits and bytes of software are an essential component of its usefulness, software needs a community to be successful. And that's an area where R really shines, as Shannon Ellis explains in this lovely ROpenSci blog post. For software, a thriving community offers developers, expertise, collaborators, writers and documentation, testers, agitators (to keep the community and software on track!), and so much more. Shannon provides links where you can find all of this in the R community:
I'll add a couple of others as well:
As I've said before, the R community is one of the greatest assets of R, and is an essential component of what makes R useful, easy, and fun to use. And you couldn't find a nicer and more welcoming group of people to be a part of.
To learn more about the R community, be sure to check out Shannon's blog post linked below.
ROpenSci Blog: Hey! You there! You are welcome here
Applying a log turns a multiplicative relationship into an additive one: for y = a · b we have log(y) = log(a) + log(b), so that after transformation products become sums. Observe this on a 1D plane: equal ratios map to equal distances. A logarithmic scale is simply a log transformation applied to all of a feature's values before plotting them. In our example we used it on both trading partners' features, imports and exports, which gives the bubble chart a new look:
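The additive identity behind the log scale is easy to check in R:

```r
# The log of a product equals the sum of the logs:
a <- 4; b <- 25
stopifnot(all.equal(log(a * b), log(a) + log(b)))

# So values at 1, 10, 100 become equally spaced (0, 1, 2) on a log10 scale:
log10(c(1, 10, 100))
```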
In an important 2005 article in the Australian Journal of Political Science, Simon Jackman set out a statistically-based approach to pooling polls in an election campaign. He describes the sensible intuitive approach of modelling a latent, unobserved voting intention (unobserved except on the day of the actual election) and treats each poll as a random observation based on that latent state space. Uncertainty associated with each measurement comes from sample size and bias coming from the average effect of the firm conducting the poll, as well as of course uncertainty about the state of the unobserved voting intention. This approach allows house effects and the latent state space to be estimated simultaneously, quantifies the uncertainty associated with both, and in general gives a much more satisfying method of pooling polls than any kind of weighted average.
Jackman gives a worked example of the approach in his excellent book Bayesian Analysis for the Social Sciences, using voting intention for the Australian Labor Party (ALP) in the 2007 Australian federal election for data. He provides JAGS code for fitting the model, but notes that with over 1,000 parameters to estimate (most of those parameters are the estimated voting intention for each day between the 2004 and 2007 elections) it is painfully slow to fit in general-purpose MCMC-based Bayesian tools such as WinBUGS or JAGS – several days of CPU time on a fast computer in 2009. Jackman estimated his model with Gibbs sampling implemented directly in R.
Down the track, I want to implement Jackman’s method of polling aggregation myself, to estimate latent voting intention for New Zealand to provide an alternative method for my election forecasts. I set myself the familiarisation task of reproducing his results for the Australian 2007 election. New Zealand’s elections are a little complex to model because of the multiple parties in the proportional representation system, so I wanted to use a general Bayesian tool for the purpose to simplify my model specification when I came to it. I use Stan because its Hamiltonian Monte Carlo method of exploring the parameter space works well when there are many parameters – as in this case, with well over 1,000 parameters to estimate.
Stan describes itself as “a state-of-the-art platform for statistical modeling and high-performance statistical computation. Thousands of users rely on Stan for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.” It lets the programmer specify a complex statistical model, and given a set of data will return a range of parameter estimates that were most likely to produce the observed data. Stan isn’t something you use as an end-to-end workbench – it’s assumed that data manipulation and presentation is done with another tool such as R, Matlab or Python. Stan focuses on doing one thing well – using Hamiltonian Monte Carlo to estimate complex statistical models, potentially with many thousands of hierarchical parameters, with arbitrarily set prior distributions.
Caveat! – I’m fairly new to Stan and I’m pretty sure my Stan programs that follow aren’t best practice, even though I am confident they work. Use at your own risk!
I approached the problem in stages, gradually making my model more realistic. First, I set myself the task of modelling latent first-preference support for the ALP in the absence of polling data. If all we had were the 2004 and 2007 election results, where might we have thought ALP support went between those two points? Here’s my results:
For this first analysis, I specified that support for the ALP had to be a random walk that changed by a normally distributed variable with standard deviation of 0.25 percentage points for each daily change. Why 0.25? Just because Jim Savage used it in his rough application of this approach to the US Presidential election in 2016. I’ll be relaxing this assumption later.
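To get a feel for what that prior implies, here is a quick base-R simulation of the daily Gaussian random walk. The ~1,100-day campaign length and the starting level are illustrative numbers, not Jackman's data.

```r
# Simulate a latent-support random walk with daily innovations ~ N(0, 0.25)
# (in percentage points). Start value and day count are illustrative only.
set.seed(42)
n_days <- 1100
innovations <- rnorm(n_days - 1, mean = 0, sd = 0.25)
support <- cumsum(c(38, innovations))

# After n daily steps the walk's prior sd has grown to 0.25 * sqrt(n),
# so the prior allows substantial drift between elections:
implied_sd <- 0.25 * sqrt(n_days - 1)
round(implied_sd, 1)  # about 8.3 percentage points
```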
Here’s the R code that sets up the session, brings in the data from Jackman’s pscl
R package, and defines a graphics function that I’ll be using for each model I create.
Here’s the Stan program that specifies this super simple model of changing ALP support from 2004 to 2007:
And here’s the R code that calls that Stan program and draws the resulting summary graphic. Stan works by compiling a program in C++ that is based on the statistical model specified in the *.stan
file. Then the C++ program zooms around the high-dimensional parameter space, moving slower around the combinations of parameters that seem more likely given the data and the specified prior distributions. It can use multiple processors on your machine and works super fast given the complexity of what it’s doing.
Next I wanted to add a single polling firm. I chose Nielsen’s 42 polls because Jackman found they had a fairly low bias, which removed one complication for me as I built up my familiarity with the approach. Here’s the result:
That model was specified in Stan as set out below. The Stan program is more complex now; I've had to specify how many polls I have (y_n), the values for each poll (y_values), and the days since the last election each poll was taken (y_days). This way I only have to specify 42 measurement errors as part of the probability model – other implementations I've seen of this approach ask for an estimate of measurement error for each poll on each day, treating the days with no polls as missing values to be estimated. That obviously adds a huge computational load I wanted to avoid.
In this program, I haven't yet added in the notion of a house effect for Nielsen; each measurement Nielsen made is assumed to have been an unbiased one. Again, I'll be relaxing this later. The state model is also the same as before, i.e. the standard deviation of the day-to-day innovations is still hard-coded as 0.25 percentage points.
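The measurement error attached to each poll comes mostly from sampling: a poll of n respondents measuring a true share p has standard error sqrt(p(1-p)/n). A quick sanity check of the magnitudes involved (the p and n values here are illustrative, not from Jackman's data):

```r
# Sampling standard error of a poll measuring proportion p with sample size n:
poll_se <- function(p, n) sqrt(p * (1 - p) / n)

# A typical poll of ~1,400 people measuring ~45% support:
round(100 * poll_se(0.45, 1400), 2)  # about 1.33 percentage points
```

This is why a single poll is a noisy measurement of the latent state, and why pooling many polls (while estimating house effects) tightens the estimate so much.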
Here’s the R code to prepare the data and pass it to Stan. Interestingly, fitting this model is noticeably faster than the one with no polling data at all. My intuition for this is that now the state space is constrained to being reasonably close to some actually observed measurements, it’s an easier job for Stan to know where is good to explore.
Finally, the complete model replicating Jackman’s work:
As well as adding the other four sets of polls, I've introduced five house effects that need to be estimated (i.e. the bias for each polling firm/mode), and I've told Stan to estimate the standard deviation of the day-to-day innovations in the latent support for the ALP rather than hard-coding it as 0.25. Jackman specified a uniform prior on [0, 1] for that parameter, but I found this led to lots of estimation problems for Stan. The Stan developers give some great practical advice on this sort of issue, and I adapted some of that to specify the prior distribution for the standard deviation of the day-to-day innovation as N(0.5, 0.5), constrained to be positive.
Here’s the Stan program:
Hard-coding the fact that there are 5 polling firms (or firm/mode combinations, as Morgan is in there twice) directly into the program must be bad practice, but seeing as there are different numbers of polls taken by each firm, on different days, I couldn't work out a better way to do it. Stan doesn't support ragged arrays, or objects like R's lists, or (I think) convenient subsetting of tables, which would be the three ways I'd normally try to do that in another language. So I settled for the approach above, even though it has some ugly bits of repetition.
Here’s the R code that sorts the data and passes it to Stan
Here’s the house effects estimated by me with Stan, compared to those in Jackman’s 2009 book:
Basically we got the same results – certainly close enough anyway. Jackman writes:
“The largest effect is for the face-to-face polls conducted by Morgan; the point estimate of the house effect is 2.7 percentage points, which is very large relative to the classical sampling error accompanying these polls.”
Interestingly, Morgan’s phone polls did much better.
Here’s the code that did that comparison:
So there we go – state space modelling of voting intention, with variable house effects, in the Australian 2007 federal election.
What's that? You've heard of R? You use R? You develop in R? You know someone else who's mentioned R? Oh, you're breathing? Well, in that case, welcome! Come join the R community!
We recently had a group discussion at rOpenSci's #runconf17 in Los Angeles, CA about the R community. I initially opened the issue on GitHub. After this issue was well-received (check out the emoji-love below!), we realized people were keen to talk about this and decided to have an optional and informal discussion in person.
To get the discussion started I posed two general questions and then just let discussion fly. I prompted the group with the following:
The discussion focused primarily on the first point, and I have to say the group's answers…were awesome. Take a look!
Everyone seemed to be in agreement that (1) the community is one of R's biggest strengths and (2) a lot within the R community happens on twitter. During discussion, Julia Lowndes mentioned she joined twitter because she heard that people asked and answered questions about R there, and others echoed this sentiment. Simply, the R community is not just for 'power users' or developers. It's a place for users and people interested in learning more about R. So, if you want to get involved in the community and you are not already, consider getting a twitter account and check out the #rstats hashtag. We expect you'll be surprised by how responsive, welcoming, and inclusive the community is.
In addition to twitter, there are many resources available within the R community where you can learn more about all things R. Below is a brief list of resources mentioned during our discussion that had helped us feel more included in the community. Feel free to suggest more!
No community is perfect, and being willing to consider our shortcomings and think about ways in which we can improve is so important. The group came up with a lot of great suggestions, including many I had not previously thought of personally.
Alice Daish did a great job capturing the conversation and allowing for more discussion online:
Join the lunchtime #runconf17 discussion about the #rstats communities – what do we need to do to improve? pic.twitter.com/ztbXxNfqU7
— Alice Data (@alice_data) May 26, 2017
To summarize here:
And, when times get tough, look to your community. Get out there. Be active. Communicate with one another. Tim Phan brilliantly summarized the importance of action and community in this thread:
Dear #runconf17, bye for now!Thanks to the organizers for all you do. Here's an incoming tweet storm on R, community, and open science 1/6 pic.twitter.com/7DpkceOUC8
— Timothy Phan (@timothy_phan) May 26, 2017
Thank you to all who participated in this conversation and all who contribute to the community to make R such a fun language in which to work and develop! Thank you to rOpenSci for hosting and giving us all the opportunity to get to know one another and work together. I'm excited to see where this community goes moving forward!
OpenCV is an incredibly powerful tool to have in your toolbox. I have had a lot of success using it in Python but very little success in R. I haven’t done too much other than searching Google but it seems as if “imager” and “videoplayR” provide a lot of the functionality but not all of it.
I have never actually called Python functions from R before. Initially I tried the “rPython” library – it has a lot of advantages, but it was completely unnecessary for me, so system() worked absolutely fine. While this example is extremely simple, it should help to illustrate how easy it is to utilize the power of Python from within R. I need to give credit to Harrison Kinsley for all of his efforts and work at PythonProgramming.net – I used a lot of his code and ideas for this post (especially the Python portion).
Using videoplayR, I created a function which takes a picture with my webcam and saves it as “originalWebcamShot.png”.
Note: saving images and then loading them isn’t very efficient but works in this case and is extremely easy to implement. It saves us from passing variables, functions, objects, and/or methods between R and Python in this case.
I’ll trace my steps backward through this post (I think it’s easier to understand what’s going on in this case).
source('imageFunctions.R')
library("videoplayR")

# Take a picture and save it
img = webcamImage(rollFrames = 10,
                  showImage = FALSE,
                  saveImageToWD = 'originalWebcamShot.png')

# Run Python script to detect faces, draw rectangles, return new image
system('python3 facialRecognition.py')

# Read in new image
img.face = readImg("modifiedWebcamShot.png")

# Display images
imshow(img)
imshow(img.face)
The user-defined function:
library("videoplayR")

webcamImage = function(rollFrames = 4, showImage = FALSE, saveImageToWD = NA){
  # rollFrames runs through multiple pictures - allows camera to adjust
  # showImage allows opportunity to display image within function

  # Turn on webcam
  stream = readStream(0)

  # Take pictures
  print("Video stream initiated.")
  for(i in seq(rollFrames)){
    img = nextFrame(stream)
  }

  # Turn off camera
  release(stream)

  # Display image if requested
  if(showImage == TRUE){
    imshow(img)
  }

  # Save image if a file name was given
  if(!is.na(saveImageToWD)){
    fileName = paste(getwd(), "/", saveImageToWD, sep = '')
    print(paste("Saving Image To: ", fileName, sep = ''))
    writeImg(fileName, img)
  }

  return(img)
}
The Python script:
import numpy as np
import cv2

def main():
    # I followed Harrison Kinsley's work for this; much of the source code is from
    # https://pythonprogramming.net/haar-cascade-face-eye-detection-python-opencv-tutorial/
    face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
    eye_cascade = cv2.CascadeClassifier('haarcascade_eye.xml')

    img = cv2.imread('originalWebcamShot.png')
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Detect faces, then look for eyes within each face region
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)
        roi_gray = gray[y:y + h, x:x + w]
        roi_color = img[y:y + h, x:x + w]
        eyes = eye_cascade.detectMultiScale(roi_gray)
        for (ex, ey, ew, eh) in eyes:
            cv2.rectangle(roi_color, (ex, ey), (ex + ew, ey + eh), (0, 255, 0), 2)

    cv2.imwrite('modifiedWebcamShot.png', img)

if __name__ == '__main__':
    main()
The Python code was entirely based off of Harrison Kinsley’s work:
Thanks for reading. As always, the code is on my GitHub
angstroms v0.0.1: Provides helper functions for working with Regional Ocean Modeling System (ROMS) output.
bikedata v0.0.1: Download and aggregate data from public bicycle systems around the world. There is a vignette.
datasauRus v0.1.2: The Datasaurus Dozen is a set of datasets that have the same summary statistics, despite having radically different distributions. As well as being an engaging variant on Anscombe's Quartet, the data is generated in a novel way through a simulated annealing process. Look here for details, and in the vignette for examples.
dwapi v0.1.1: Provides a set of wrapper functions for data.world’s REST API. There is a quickstart guide.
HURDAT v0.1.0: Provides datasets from the Hurricane Research Division’s Hurricane Re-Analysis Project, giving details for most known hurricanes and tropical storms for the Atlantic and northeastern Pacific ocean (northwestern hemisphere). The vignette describes the datasets.
neurohcp v0.6: Implements an interface to the Human Connectome Project. The vignette shows how it works.
osmdata v0.0.3: Provides functions to download and import OpenStreetMap data as ‘sf’ or ‘sp’ objects. There is an Introduction and a vignette describing Translation to Simple Features.
parlitools v0.0.4: Provides various tools for analyzing UK political data, including creating political cartograms and retrieving data. There is an Introduction, and vignettes on the British Election Study, Mapping Local Authorities, and Using Cartograms.
rerddap v0.4.2: Implements an R client to NOAA’s ERDDAP data servers. There is an Introduction.
soilcarbon v1.0.0: Provides tools for analyzing the Soil Carbon Database created by Powell Center Working Group. The vignette launches a local Shiny App.
suncalc v0.1: Implements an R interface to the ‘suncalc.js’ library, part of the SunCalc.net project, for calculating sun position, sunlight phases, moon position, and lunar phase for a given location and time.
EventStudy v0.3.1: Provides an interface to the EventStudy API. There is an Introduction, and vignettes on Preparing EventStudy, parameters, and the RStudio Addin.
kmcudaR v1.0.0: Provides a fast, drop-in replacement for the classic K-means algorithm based on Yinyang K-means. Look here for details.
openEBGM v0.1.0: Provides an implementation of DuMouchel’s Bayesian data mining method for the market basket problem. There is an Introduction, and vignettes for Processing Raw Data, Hyperparameter Estimation, Empirical Bayes Metrics, and Objects and Class Functions.
spacyr v0.9.0: Provides a wrapper for the Python spaCy Natural Language Processing library. Look here for help with installation and use.
learnr v0.9: Provides functions to create interactive tutorials for learning about R and R packages using R Markdown, using a combination of narrative, figures, videos, exercises, and quizzes. Look here to get started.
olsrr v0.2.0: Provides tools for teaching and learning ordinary least squares regression. There is an Introduction and vignettes on Heteroscedascitity, Measures of Influence, Collinearity Diagnostics, Residual Diagnostics and Variable Selection Methods.
rODE v0.99.4: Contains functions to show students how an ODE solver is made and how classes can be effective for constructing equations that describe natural phenomena. Have a look at the free book Computer Simulations in Physics. There are several vignettes providing brief examples, including one on the Pendulum and another on Planets.
atlantistools v0.4.2: Provides access to the Atlantis framework for end-to-end marine ecosystem modelling. There is a package demo and vignettes for model preprocessing, model calibration, species calibration, and model comparison.
phylodyn v0.9.0: Provides statistical tools for reconstructing population size from genetic sequence data. There are several vignettes including a Coalescent simulation of genealogies and a case study using New York Influenza data.
adaptiveGPCA v0.1: Implements the adaptive gPCA algorithm described in Fukuyama. The vignette shows an example using data stored in a phyloseq object.
BayesNetBP v1.2.1: Implements belief propagation methods for Bayesian Networks based on the paper by Cowell. There is a function to invoke a Shiny App.
RPEXE.RPEXT v0.0.1: Implements the likelihood ration test and backward elimination procedure for the reduced piecewise exponential survival analysis technique described in described in Han et al. 2012 and 2016. The vignette provides examples.
sfdct v0.0.3: Provides functions to construct a constrained ‘Delaunay’ triangulation from simple features objects. There is a vignette.
simglm v0.5.0: Provides functions to simulate linear and generalized linear models with up to three levels of nesting. There is an Introduction and vignettes for simulating GLMs and Missing Data performing Power Analysis and dealing with Unbalanced Data.
checkarg v0.1.0: Provides utility functions that allow checking the basic validity of a function argument or any other value, including generating an error and assigning a default in a single line of code.
CodeDepends v0.5-3: Provides tools for analyzing R expressions or blocks of code and determining the dependencies between them. The vignette shows how to use them.
desctable v0.1.0: Provides functions to create descriptive and comparative tables that are ready to be saved as csv, or piped to DT::datatable()
or pander::pander()
to integrate into reports. There is a vignette to get you started.
lifelogr v0.1.0: Provides a framework for combining self-data from multiple sources, including fitbit and Apple Health. There is a general introduction as well as an introduction for visualization functions.
processx v2.0.0: Portable tools to run system processes in the background.
printr v0.1: Extends knitr generic function knit_print()
to automatically print objects using an appropriate format such as Markdown or LaTeX. The vignette provides an introduction.
RHPCBenchmark v0.1.0: Provides microbenchmarks for determining the run-time performance of aspects of the R programming environment, and packages that are relevant to high-performance computation. There is an Introduction.
rlang v0.1.1: Provides a toolbox of functions for working with base types, core R features like the condition system, and core ‘Tidyverse’ features like tidy evaluation. The vignette explains R’s capabilities for creating Domain Specific Languages.
readtext v0.50: Provides functions for importing and handling text files and formatted text files with additional meta-data, including ‘.csv’, ‘.tab’, ‘.json’, ‘.xml’, ‘.pdf’, ‘.doc’, ‘.docx’, ‘.xls’, ‘.xlsx’ and other file types. There is a vignette
tangram v0.2.6: Provides an extensible formula system to implements a grammar of tables for creating production-quality tables using a three-step process that involves a formula parser, statistical content generation from data, and rendering. There is a vignette introducing the Grammar, a Global Style for Rmd, and duplicating SAS PROC Tabulate.
tatoo v1.0.6: Provides functions to combine data.frames and to add metadata that can be used for printing and xlsx export. The vignette shows some examples.
ContourFunctions v0.1.0: Provides functions for making contour plots. A vignette introduces the package.
mbgraphic v1.0.0: Implements a two-step process for describing univariate and bivariate behavior similar to the cognostics measures proposed by Paul and John Tuke. First, measures describing variables are computed and then plots are selected. The vignette describes the details.
polypoly v0.0.2: Provides tools for reshaping, plotting, and manipulating matrices of orthogonal polynomials. The vignette provides an overview.
RJSplot v2.1: Provides functions to create interactive graphs with ‘R’. It joins the data analysis power of R and the visualization libraries of JavaScript in one package There is a tutorial.
Two hundred and twenty-nine new packages were submitted to CRAN in May. Here are my picks for the “Top 40”, organized into six categories: Data, Data Science and Machine Learning, Education, Miscellaneous, Statistics, and Utilities.
angstroms v0.0.1: Provides helper functions for working with Regional Ocean Modeling System (ROMS) output.
bikedata v0.0.1: Provides functions to download and aggregate data from public bicycle systems around the world. There is a vignette.
datasauRus v0.1.2: The Datasaurus Dozen is a set of datasets that have the same summary statistics, despite having radically different distributions. As well as being an engaging variant on Anscombe’s Quartet, the data are generated in a novel way through a simulated annealing process. Look here for details, and in the vignette for examples.
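The punchline is easy to see for yourself. A minimal sketch, assuming the package exposes its `datasaurus_dozen` data frame with `dataset`, `x`, and `y` columns: grouping by dataset shows near-identical means and standard deviations across wildly different shapes.

```r
library(dplyr)
library(datasauRus)

# Same summary statistics, radically different pictures
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(mean_x = mean(x), mean_y = mean(y),
            sd_x = sd(x), sd_y = sd(y))
```

Plotting `x` against `y` faceted by `dataset` with ggplot2 makes the point even more vividly.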
dwapi v0.1.1: Provides a set of wrapper functions for data.world’s REST API. There is a quickstart guide.
HURDAT v0.1.0: Provides datasets from the Hurricane Research Division’s Hurricane Re-Analysis Project, giving details for most known hurricanes and tropical storms for the Atlantic and northeastern Pacific ocean (northwestern hemisphere). The vignette describes the datasets.
neurohcp v0.6: Implements an interface to the Human Connectome Project. The vignette shows how it works.
osmdata v0.0.3: Provides functions to download and import OpenStreetMap data as ‘sf’ or ‘sp’ objects. There is an Introduction and a vignette describing Translation to Simple Features.
parlitools v0.0.4: Provides various tools for analyzing UK political data, including creating political cartograms and retrieving data. There is an Introduction, and vignettes on the British Election Study, Mapping Local Authorities, and Using Cartograms.
rerddap v0.4.2: Implements an R client to NOAA’s ERDDAP data servers. There is an Introduction.
soilcarbon v1.0.0: Provides tools for analyzing the Soil Carbon Database created by the Powell Center Working Group. The vignette launches a local Shiny App.
suncalc v0.1: Implements an R interface to the ‘suncalc.js’ library, part of the SunCalc.net project, for calculating sun position, sunlight phases, moon position, and lunar phase for a given location and time.
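A minimal sketch of the kind of query the package supports, assuming the `getSunlightTimes()` function with `date`, `lat`, and `lon` arguments (the coordinates below are illustrative):

```r
library(suncalc)

# Sunrise and sunset for Austin, TX today (times are returned in UTC by default)
getSunlightTimes(date = Sys.Date(), lat = 30.27, lon = -97.74,
                 keep = c("sunrise", "sunset"))
```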
EventStudy v0.3.1: Provides an interface to the EventStudy API. There is an Introduction, and vignettes on Preparing EventStudy, parameters, and the RStudio Addin.
kmcudaR v1.0.0: Provides a fast, drop-in replacement for the classic K-means algorithm based on Yinyang K-Means. Look here for details.
openEBGM v0.1.0: Provides an implementation of DuMouchel’s Bayesian data mining method for the market basket problem. There is an Introduction, and vignettes for Processing Raw Data, Hyperparameter Estimation, Empirical Bayes Metrics, and Objects and Class Functions.
spacyr v0.9.0: Provides a wrapper for the Python spaCy Natural Language Processing library. Look here for help with installation and use.
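A sketch of basic usage, assuming a working Python installation with spaCy available, and the package's `spacy_initialize()` / `spacy_parse()` entry points:

```r
library(spacyr)

# Connects to the Python spaCy backend; requires spaCy to be installed
spacy_initialize()

# Tokenize and tag a sentence; returns a tidy data frame of tokens
spacy_parse("The quick brown fox jumps over the lazy dog.")
```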
learnr v0.9: Provides functions to create interactive tutorials for learning about R and R packages using R Markdown, using a combination of narrative, figures, videos, exercises, and quizzes. Look here to get started.
olsrr v0.2.0: Provides tools for teaching and learning ordinary least squares regression. There is an Introduction and vignettes on Heteroskedasticity, Measures of Influence, Collinearity Diagnostics, Residual Diagnostics, and Variable Selection Methods.
rODE v0.99.4: Contains functions to show students how an ODE solver is made and how classes can be effective for constructing equations that describe natural phenomena. Have a look at the free book Computer Simulations in Physics. There are several vignettes providing brief examples, including one on the Pendulum and another on Planets.
atlantistools v0.4.2: Provides access to the Atlantis framework for end-to-end marine ecosystem modelling. There is a package demo and vignettes for model preprocessing, model calibration, species calibration, and model comparison.
phylodyn v0.9.0: Provides statistical tools for reconstructing population size from genetic sequence data. There are several vignettes including a Coalescent simulation of genealogies and a case study using New York Influenza data.
adaptiveGPCA v0.1: Implements the adaptive gPCA algorithm described in Fukuyama. The vignette shows an example using data stored in a phyloseq object.
BayesNetBP v1.2.1: Implements belief propagation methods for Bayesian Networks based on the paper by Cowell. There is a function to invoke a Shiny App.
RPEXE.RPEXT v0.0.1: Implements the likelihood ratio test and backward elimination procedure for the reduced piecewise exponential survival analysis technique described in Han et al. 2012 and 2016. The vignette provides examples.
sfdct v0.0.3: Provides functions to construct a constrained ‘Delaunay’ triangulation from simple features objects. There is a vignette.
simglm v0.5.0: Provides functions to simulate linear and generalized linear models with up to three levels of nesting. There is an Introduction and vignettes on simulating GLMs, handling Missing Data, performing Power Analysis, and dealing with Unbalanced Data.
checkarg v0.1.0: Provides utility functions that allow checking the basic validity of a function argument or any other value, including generating an error and assigning a default in a single line of code.
CodeDepends v0.5-3: Provides tools for analyzing R expressions or blocks of code and determining the dependencies between them. The vignette shows how to use them.
desctable v0.1.0: Provides functions to create descriptive and comparative tables that are ready to be saved as csv, or piped to DT::datatable() or pander::pander() for integration into reports. There is a vignette to get you started.
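A minimal sketch of that pipeline, assuming `desctable()` accepts a data frame directly (as the package's own examples suggest):

```r
library(dplyr)
library(desctable)

# Descriptive statistics for every variable, rendered as a report table
mtcars %>%
  desctable() %>%
  pander::pander()
```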
lifelogr v0.1.0: Provides a framework for combining self-tracking data from multiple sources, including Fitbit and Apple Health. There is a general introduction as well as an introduction for visualization functions.
processx v2.0.0: Portable tools to run system processes in the background.
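A minimal sketch of running a process in the background, assuming the package's R6-style `process$new()` API:

```r
library(processx)

p <- process$new("sleep", "2")   # start the process without blocking R
p$is_alive()                     # TRUE while the process is still running
p$wait()                         # block until it finishes
p$get_exit_status()              # 0 on success
```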
printr v0.1: Extends the knitr generic function knit_print() to automatically print objects using an appropriate format such as Markdown or LaTeX. The vignette provides an introduction.
RHPCBenchmark v0.1.0: Provides microbenchmarks for determining the run-time performance of aspects of the R programming environment, and packages that are relevant to high-performance computation. There is an Introduction.
rlang v0.1.1: Provides a toolbox of functions for working with base types, core R features like the condition system, and core ‘Tidyverse’ features like tidy evaluation. The vignette explains R’s capabilities for creating Domain Specific Languages.
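A small taste of tidy evaluation, assuming rlang's `quo()` and `eval_tidy()` (both part of its advertised toolbox):

```r
library(rlang)

# Capture an expression together with its environment
q <- quo(x + y)

# Evaluate it, supplying data that takes precedence over the environment
eval_tidy(q, data = list(x = 1, y = 2))   # evaluates to 3
```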
readtext v0.50: Provides functions for importing and handling text files and formatted text files with additional meta-data, including ‘.csv’, ‘.tab’, ‘.json’, ‘.xml’, ‘.pdf’, ‘.doc’, ‘.docx’, ‘.xls’, ‘.xlsx’ and other file types. There is a vignette.
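A sketch, assuming `readtext()` accepts file paths or globs and returns a data frame with `doc_id` and `text` columns (the path below is hypothetical):

```r
library(readtext)

# Read every PDF in a directory into one tidy data frame of documents
docs <- readtext("reports/*.pdf")
str(docs)
```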
tangram v0.2.6: Provides an extensible formula system to implement a grammar of tables for creating production-quality tables using a three-step process that involves a formula parser, statistical content generation from data, and rendering. There are vignettes introducing the Grammar, a Global Style for Rmd, and duplicating SAS PROC Tabulate.
tatoo v1.0.6: Provides functions to combine data.frames and to add metadata that can be used for printing and xlsx export. The vignette shows some examples.
ContourFunctions v0.1.0: Provides functions for making contour plots. A vignette introduces the package.
mbgraphic v1.0.0: Implements a two-step process for describing univariate and bivariate behavior similar to the cognostics measures proposed by Paul and John Tukey. First, measures describing variables are computed, and then plots are selected. The vignette describes the details.
polypoly v0.0.2: Provides tools for reshaping, plotting, and manipulating matrices of orthogonal polynomials. The vignette provides an overview.
RJSplot v2.1: Provides functions to create interactive graphs with ‘R’. It joins the data analysis power of R and the visualization libraries of JavaScript in one package. There is a tutorial.