This is the bimonthly R-bloggers post (for 2015-11-30) for new R jobs.
Just visit this link and post a new R job to the R community (it’s free and quick).
Job seekers: please follow the links below to learn more and apply for your job of interest:
(In R-users.com you may see all the R jobs that are currently available)
(you may also look at previous R jobs posts).
DataCamp is offering R-bloggers readers a holiday promotion. For just $9 (instead of $25) you can gain access to their full curriculum of online R courses, videos and interactive coding challenges. With hands-on learning and instruction from leading experts such as Garrett Grolemund (RStudio), Matt Dowle (data.table) and Bob Muenchen (r4stats), DataCamp’s premium courses can help you acquire new R skills.
Here is a list of all the courses that you will get access to (in order to get the discount, you’ll need to register through this link):
In summary, you’re invited to learn new R skills from the comfort of your own browser, all for only $9. You may enjoy these discounts today through Cyber Monday, December 1st, at DataCamp.com.
On DataCamp:
DataCamp is an online learning platform that uses video lessons and interactive in-browser coding challenges to teach you how to perform your own data analysis using R. All our courses can be taken at your own pace. To date, over 230,000 R enthusiasts have already taken one or more courses at DataCamp.
the integrated public use microdata series international (ipumsi) has been my white whale since i started in survey research. non-demographers, perhaps think of this repository as a matryoshka varanasi-kaaba-ark of the covenant: nothing compares. the minnesota population center amassed half a billion person-level records from national statistics offices across the globe. it’s all free and ready for download, so long as you have a project idea and an institutional affiliation. so my turn to talk? because now the software needed for analysis is free as well, and markedly superior to anything that’s available for purchase. 277 censuses later, roll credits. these tutorials maniacally document every step necessary to
click here to get started working with ipums international
notes: unless you plan to make severe edits to my example code, individual extracts must contain a single year and a single country and be formatted as a csv. the actual extract link can simply be copied and pasted into your r script from the url highlighted in the screenshot below. each extract should include the variables “serial”, “strata”, and “perwt” if you plan on calculating statistics to be shared anywhere beyond fingerpainting class. these census files cannot be treated as simple random samples; those three columns contain the information necessary for my scripts to handle everything correctly.
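to see what those three columns actually do, here is a minimal sketch (not the tutorial scripts themselves) of the complex-sample setup using the survey package; the file name is hypothetical and the column names follow the note above:

```r
# a minimal sketch, assuming a single-country, single-year csv extract
# that includes the serial, strata, and perwt columns described above
library(survey)

extract <- read.csv("ipumsi_extract.csv")  # hypothetical file name

des <- svydesign(
    ids = ~serial,     # household clusters
    strata = ~strata,  # sampling strata
    weights = ~perwt,  # person-level weights
    data = extract,
    nest = TRUE
)

# statistics computed on `des` now account for the complex sample design,
# e.g. svymean(~age, des) rather than a naive mean(extract$age)
```

without this design object, standard errors computed on a census extract would be wrong, which is the whole point of requiring those three variables.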
confidential to sas, spss, stata, and sudaan users: neil armstrong would give pogo sticks the same look i’m giving your softwares right now. time to reserve your spot on apollo eleven. time to transition to r.
Determining conditional independence relationships through undirected graphical models is a key component in the statistical analysis of complex observational data in a wide variety of disciplines. In many situations one seeks to estimate the underlying graphical model of a dataset that includes variables of different domains.
As an example, take a typical dataset in the social, behavioral and medical sciences, where one is interested in interactions, for example between gender or country (categorical), frequencies of behaviors or experiences (count) and the dose of a drug (continuous). Other examples are Internet-scale marketing data or high-throughput sequencing data.
There are methods available to estimate mixed graphical models from mixed continuous data; however, these usually have two drawbacks: first, there is possible information loss due to necessary transformations, and second, they cannot incorporate (nominal) categorical variables (for an overview see here). A new method implemented in the R package mgm addresses these limitations.
In the following, we use the mgm package to estimate the conditional independence network in a dataset of questionnaire responses of individuals diagnosed with Autism Spectrum Disorder. This dataset includes variables of different domains, such as age (continuous), type of housing (categorical) and number of treatments (count).
The dataset consists of responses of 3521 individuals to a questionnaire comprising 28 variables from the continuous, count and categorical domains.
dim(data)
## [1] 3521 28
round(data[1:4, 1:5],2)
## sex IQ agediagnosis opennessdiagwp successself
## [1,] 1 6 0 1 1.92
## [2,] 2 6 7 1 5.40
## [3,] 1 5 4 2 5.66
## [4,] 1 6 8 1 8.00
We now use our knowledge about the variables to specify the domain (or type) of each variable and the number of categories for categorical variables (for non-categorical variables we choose 1). “c”, “g” and “p” stand for categorical, Gaussian and Poisson (count), respectively.
type <- c("c", "g", "g", "c", "c", "g", "c", "c", "p", "p",
"p", "p", "p", "p", "c", "p", "c", "g", "p", "p",
"p", "p", "g", "g", "g", "g", "g", "g", "c", "c",
"g")
cat <- c(2, 1, 1, 3, 2, 1, 5, 3, 1, 1, 1, 1, 1, 1, 2, 1, 4,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 1)
The estimation algorithm requires us to make an assumption about the highest order interaction in the true graph. Here we assume that there are at most pairwise interactions in the true graph and set d = 2. The algorithm includes an L1-penalty to obtain a sparse estimate. We can select the regularization parameter lambda using cross validation (CV) or the Extended Bayesian Information Criterion (EBIC). Here, we choose the EBIC, which is known to be a bit more conservative than CV but is computationally faster.
library(mgm)
fit <- mgmfit(data, type, cat, lambda.sel="EBIC", d=2)
The fit function returns all estimated parameters and a weighted and unweighted (binarized) adjacency matrix. Here we use the qgraph package to visualize the graph:
# group variables
group_list <- list("Demographics" = c(1,14,15,28),
                   "Psychological" = c(2,4,5,6,18,20,21),
                   "Social environment" = c(7,16,17,19,26,27),
                   "Medical" = c(3,8,9,10,11,12,13,22,23,24,25))

# define nice colors
group_cols <- c("#E35959", "#8FC45A", "#4B71B3", "#E8ED61")

# plot
library(qgraph)
qgraph(fit$adj,
       vsize = 3, layout = "spring",
       edge.color = rgb(33, 33, 33, 100, maxColorValue = 255),
       color = group_cols,
       border.width = 1.5,
       border.color = "black",
       groups = group_list,
       nodeNames = datalist$colnames,
       legend = TRUE,
       legend.mode = "groups",
       legend.cex = .75)
A reproducible example can be found in the examples of the package, and it is explained more elaborately in the corresponding paper. Here is a paper explaining the theory behind the implemented algorithm.
Computationally efficient methods for Gaussian data are implemented in the huge package and the glasso package. For binary data, there is the IsingFit package.
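For the all-Gaussian special case, the workflow with huge looks similar in spirit; a minimal sketch (the simulated data and the StARS selection criterion are illustrative choices, not part of the analysis above):

```r
# a minimal sketch of a sparse Gaussian graphical model with the huge package
library(huge)

# simulate Gaussian data from a random sparse graph (illustrative only)
sim <- huge.generator(n = 200, d = 20, graph = "random")

# fit a graphical-lasso solution path over a grid of lambdas
out <- huge(sim$data, method = "glasso")

# select the regularization parameter, here via StARS
sel <- huge.select(out, criterion = "stars")

# sel$refit contains the estimated (binary) adjacency matrix,
# analogous to the adjacency matrix returned by mgmfit above
```

The same adjacency matrix can then be passed to qgraph for visualization, exactly as with the mgm fit.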
Great free resources about graphical models are Chapter 17 in the freely available book The Elements of Statistical Learning and the Coursera course Probabilistic Graphical Models.
In a previous post, I wrote about what I use association rules for and mentioned a Shiny application I developed to explore and visualize rules. This post is about that app. The app is mainly a wrapper around the arules and arulesViz packages developed by Michael Hahsler.
Option 1: Copy the code below from the arules_app.R gist
Option 2: Source gist directly.
library('devtools')
library('shiny')
library('arules')
library('arulesViz')
source_gist(id='706a28f832a33e90283b')
Option 3: Download the Rsenal package (my personal R package with a hodgepodge of data science tools) and use the arulesApp
function:
library('devtools')
install_github('brooksandrew/Rsenal')
library('Rsenal')
?Rsenal::arulesApp
arulesApp is intended to be called from the R console for interactive and exploratory use. It calls shinyApp, which spins up a Shiny app without the overhead of having to place server.R and ui.R files. Calling a Shiny app with a function also has the benefit of smoothly passing parameters and data objects as arguments. More on shinyApp here.

arulesApp is currently highly exploratory (and highly unoptimized). Therefore it works best for quickly iterating on rule training and visualization with low- to medium-sized datasets. Check out Michael Hahsler’s arulesViz paper for a thorough description of how to interpret the visualizations. There is a particularly useful table on page 24 which compares and summarizes the visualization techniques.

Simply call arulesApp from the console with a data.frame or transaction set from which rules will be mined:
library('arules') # contains Adult and AdultUCI datasets
data('Adult') # transaction set
arulesApp(Adult, vars=40)
data('AdultUCI') # data.frame
arulesApp(AdultUCI)
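A call that sets the tuning arguments explicitly might look like this (the argument names follow the list below; the specific values are only illustrative):

```r
# illustrative sketch: explicit tuning arguments for arulesApp
library('Rsenal')
library('arules')

data('AdultUCI')  # a data.frame bundled with arules

# bin numeric columns, start with 10 variables, and initialize the
# visualization with conservative support/confidence values so the
# first render stays computationally cheap
arulesApp(AdultUCI, bin = TRUE, vars = 10, supp = 0.1, conf = 0.5)
```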
Here are the arguments:

dataset: data.frame; the dataset from which association rules will be mined. Each row is treated as a transaction. It seems to work OK when the S4 transactions class from arules is used, but this is not thoroughly tested.
bin: logical; if TRUE, numerical data are automatically discretized/binned into categorical features that can be used for association analysis.
vars: integer; how many variables to include in the initial rule mining.
supp: numeric; the support parameter for initializing the visualization. Useful when it is known that a high support is needed to avoid crashing the computation.
conf: numeric; the confidence parameter for initializing the visualization. Similarly useful when it is known that a high confidence is needed to avoid crashing the computation.

I’ve been having discussions with colleagues and university administration about the best way for universities to manage home-grown software.
The traditional business model for software is that we build software and sell it to everyone willing to pay. Very often, that leads to a software company spin-off that has little or nothing to do with the university that nurtured the development. Think MATLAB, S-Plus, Minitab, SAS and SPSS, all of which grew out of universities or research institutions. This model has repeatedly been shown to stifle research development, channel funds away from the institutions where the software was born, and add to research costs for everyone.
I argue that the open-source model is a much better approach both for research development and for university funding. Under the open-source model, we build software, and make it available for anyone to use and adapt under an appropriate licence. This approach has many benefits that are not always appreciated by university administrators.
Sometime earlier last year, I started to help Philippe Massicotte with his gtrendsR package—which was then still "hiding" in relative obscurity on BitBucket. I was able to assist with a few things related to internal data handling as well as package setup and package builds, but the package is really largely Philippe’s. But then we both got busy, and it wasn’t until this summer at the excellent useR! 2015 conference that we met and concluded that we really should finish the package. And we both remained busy…
Lo and behold, following a recent transfer to this GitHub repository, we finalised a number of outstanding issues. And Philippe was even kind enough to label me a co-author. And now the package is on CRAN as of yesterday. So install.packages("gtrendsR") away and enjoy!
Here is a quick demo:
## load the package, and if options() are set appropriately, connect
## alternatively, also run gconnect("someuser", "somepassword")
library(gtrendsR)
## using the default connection, run a query for three terms
res <- gtrends(c("nhl", "nba", "nfl"))
## plot (in default mode) as time series
plot(res)
## plot via googleVis to browser
## highlighting regions (probably countries) and cities
plot(res, type = "region")
plot(res, type = "cities")
The time series (default) plot for this query came out as follows a couple of days ago:
One really nice feature of the package is the rather rich data structure. The result set for the query above is actually stored in the package and can be accessed. It contains a number of components:
R> data(sport_trend)
R> names(sport_trend)
[1] "query" "meta" "trend" "regions" "topmetros"
[6] "cities" "searches" "rising" "headers"
R>
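Each of those components can be pulled out directly; a quick sketch using the bundled example data (component names as shown in the listing above):

```r
# inspect individual components of a gtrendsR result object,
# using the sport_trend example data shipped with the package
library(gtrendsR)
data(sport_trend)

head(sport_trend$trend)    # the time series component
head(sport_trend$regions)  # interest by region
head(sport_trend$cities)   # interest by city
```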
So not only can one look at trends, but also at regions, metropolitan areas, and cities — even plot this easily via package googleVis which is accessed via options in the default plot method. Furthermore, related searches and rising queries may give leads to dynamics within the search.
Please use the standard GitHub issue system for bug reports, suggestions and the like.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
I followed the post, Installing an R kernel for IPython/jupyter notebook 3 on OSX, to install jupyter with python3 and R kernels in my iMac.
I have elementaryOS on my Macbook Pro and also want to have jupyter on it. The installation process is quite similar.
sudo apt-get install python3-pip
sudo pip3 install jupyter
Then we can use the following command to start jupyter:
ipython notebook
To compile IRkernel, we should first have the zmq library installed.
sudo apt-get install libzmq3-dev python-zmq
In R, run the following command to install IRkernel:
install.packages(c('rzmq', 'repr', 'IRkernel', 'IRdisplay'),
                 repos = c('http://irkernel.github.io/', getOption('repos')),
                 type = 'source')
IRkernel::installspec()
Now we can use R in jupyter. Inline images are a great feature, especially for demonstrations.
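For example, any standard R plot run in a notebook cell is rendered inline by IRkernel; a minimal sketch:

```r
# a base-R plot; in a jupyter notebook cell, IRkernel renders
# the output as an inline image below the cell
x <- seq(0, 2 * pi, length.out = 100)
plot(x, sin(x), type = "l", main = "sin(x)")
```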
With many phylogenetic packages available in R and my package ggtree, R in jupyter can be a great environment for phylogenetics.