(This article was first published on **R – Curtis Miller's Personal Website**, and kindly contributed to R-bloggers)

At the University of Utah, I teach the R lab that accompanies MATH 3070, “Applied Statistics I.” None of my students are presumed to have any programming experience, and they never hesitate to remind me of that fact, especially when they are starting out. When I create assignments and pick problems, I often can write a one- or three-line solution in thirty seconds that students will sometimes spend four hours trying to solve. They then see my solution and slap their foreheads at its simplicity. I can be tricky with my solutions. For example, suppose you wish to find the sample proportion for a certain property. A common approach (or at least the one used in the textbook our course uses, *Using R for Introductory Statistics* by John Verzani) looks like this:

    library(MASS)
    x <- Cars93$Origin
    # What proportion of cars in Cars93 are American?
    length(which(x == "USA")) / length(x)

    ## [1] 0.516129

But if you realize that `x == "USA"` produces a vector of boolean values that, when coerced to numeric, become 0’s and 1’s for `FALSE` and `TRUE` respectively, and that the sample proportion is the sample mean of Bernoulli (0/1) random variables, you can get a much simpler and easier-to-read solution:

    mean(x == "USA")

    ## [1] 0.516129
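To see why the two approaches agree, it may help to watch the coercion happen step by step. The sketch below uses a small made-up vector (`origin` is hypothetical toy data, not the `Cars93` data set) so that every intermediate value is easy to check by hand.

```r
# Hypothetical toy data: five cars, three of them American
origin <- c("USA", "non-USA", "USA", "USA", "non-USA")

is_usa <- origin == "USA"   # logical vector: TRUE FALSE TRUE TRUE FALSE
as.numeric(is_usa)          # coerced to numeric: 1 0 1 1 0

sum(is_usa)                 # TRUE counts as 1, so this is the number of matches: 3
mean(is_usa)                # average of the 0's and 1's, i.e. the proportion: 0.6

# The longer textbook-style computation gives the same number
prop_long <- length(which(is_usa)) / length(is_usa)
prop_long == mean(is_usa)   # TRUE
```

The same idea works for any condition that returns `TRUE`/`FALSE`, which is why `mean()` of a comparison is such a handy shortcut for proportions.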

This is one of those tricks that students do understand at some level, yet it still blows their minds when they see it, and there are many other things they see that make them somewhat envious. So I do have a lot of students ask me, “How did you get so good at R?”

As some background, I fiddled around with programming as early as when I was ten years old; my dad was a computer programmer and he introduced me to QBASIC to get me started. I also fiddled with Visual Basic, and while I never really could program well, I did understand some basic concepts and I could make simple text programs with QBASIC. In high school, I found my dad’s old C textbook and I went through it, chapter by chapter, working half the problems in each on my Linux laptop, thus giving me an even more solid understanding of programming. That said, I was not one of those teenagers hacking into the high school’s network and releasing a homebrew virus that played a fart sound every time someone clicked the mouse; I never did any projects and mostly just let any skills I developed languish. I did not have any structure or purpose, and truth be told, I did not want to be a programmer.

The class I teach is also the first class where I learned R. (It’s even in the exact same room, and I like to harass my students by saying “I’ve been in the exact same seats as you; I sat in that general area over there.”) I did well; I already understood many of the ideas basic to all programming. After the class ended, I tried fiddling around with some real data sets, looking at the correlation between homicides and guns per capita (then stopped when I realized I did not know what I was doing). Following that I worked as an intern in Washington, DC, for a semester, where I did absolutely nothing with R, and I would not use it again until I took MATH 3080 (the second part of MATH 3070) and the R lab accompanying that class. Those two classes were sufficient for me to use R for future class projects and a real-world project (that got me media attention). (You can find my lousy first R scripts for that project here.)

December 8th is the last day of the R Lab for MATH 3070 in the Fall 2016 semester. Some students will continue on to take MATH 3080; others will go elsewhere, and many will need to use R sooner rather than later. I’m aware that they are not yet R pros, so I’ve written this blog post to give advice for not only learning R and programming from here on out, but also contributing to the R community. This is largely based on my own experience, and I invite other R users to share their own advice and stories in the comments.

One way to learn R well is to take more classes. For University of Utah students, you may have already taken MATH 3070 (perhaps from me), and you may take the MATH 3080 R lab to learn more about using R for statistics (naturally, you will be learning to use R to solve problems discussed in that course, such as ANOVA, linear regression, goodness-of-fit analysis, nonparametric tests, and so on).

To my knowledge, that’s the extent of undergraduate courses for learning R. Furthermore, the only course I am aware of that focuses on *programming* in particular, and covers R programming, is STAT 6003, entitled “Survey of Statistical Computer Packages,” which covers R in addition to SAS, SPSS, and STATA. Otherwise, you would be learning R by taking MSTAT courses such as MATH 6010 (linear models), MATH 6020 (multivariate models), or MATH 5075 (time series). These courses don’t teach R so much as use it for applications of the mathematical topics of interest (though in the MATH 5075 and 6020 textbooks there are R examples). Additionally, I am not aware of classes in the computer science department that teach R beyond what you would learn in the R labs from the Mathematics department; CS 6190, “Probabilistic Models”, uses R for Bayesian analysis, but that class is **not at all** for undergraduates.

Thus, for University of Utah students, the MATH 3070 labs may be the last courses where you learn R programming; you are expected to learn R from here on out on a need-to-know basis. That said, you could look for free online courses to sharpen your R skills. Perhaps look at Coursera and the R programming course provided by Johns Hopkins University if you want a structured class. You could also look at DataCamp for a less structured approach; you can find video lectures with problems where you can learn more about some R topics.

If you’re willing to shell out money, perhaps look into R workshops. Some big names in R, such as Hadley Wickham, offer R workshops on various topics throughout the year. That said, be prepared to spend a considerable amount; there’s a good chance that you would have to travel a distance to even get to the workshop.

I like the textbook the Mathematics department uses for the R lab, John Verzani’s *Using R for Introductory Statistics*. I wrote my lecture notes using his book, and even going through it for a second time I discovered functions and techniques that I was not familiar with and improved my own abilities. For students continuing to MATH 3080, keep the book; we will be using the same one. Otherwise, if you really want the money for the book back, you could use Verzani’s original R notes, referred to as *simpleR*, upon which his book was based, but the second edition of the book represents a significant improvement over even the first. If you don’t need the money, maybe consider keeping the book.

While Prof. Verzani’s book is good for R in the context of introductory statistics, it does not say much about R’s inner workings. It still treats R from the perspective of a humble user, not a programmer or power user. If you want to learn R as a programming language, I would highly recommend Hadley Wickham’s book *Advanced R*, available for free online or in print. Hadley Wickham is seen as a major authority in the R community; he’s a prolific package author, wrote many of the best R packages in existence (in my opinion), and clearly has a deep understanding of the R *language*. Even reading two chapters of Prof. Wickham’s book improved my R skills tremendously.

There are many, *many* publishers and authors writing about R, and you can find books from O’Reilly or even R for Dummies (though I would hope my students would be beyond the *for Dummies* level). That said, one publisher I would like to highlight is Springer. Their Use R! series includes many great books on topics specific to R, such as certain R packages or common R tasks (such as Hadley Wickham’s book, **ggplot2**, which can be obtained for free from the University of Utah library and which I highly recommend everyone read). In addition, many Springer books on more general topics include R examples, allowing readers to learn both the theory and the R application (some examples include the MATH 5075 textbook,

Another publisher with an impressive collection of R books is CRC Press. CRC Press published the textbook used in the R lab, in addition to Hadley Wickham’s *Advanced R*. CRC Press also appears to be the publisher of choice for Yihui Xie, the second-most-prolific package author, the creator of **knitr** and responsible for the growth of literate programming in R (specifically, R Markdown).

While taking courses and reading books do help people learn R, I see these as developing a programming foundation. The only way to learn to program is to write code. My understanding of R did not take off until I actually had to use R on my own for both academic and “real-world” projects. The more you code, the more you encounter problems… and solve them, one way or another. Coding is a skill like any other. No one is born being good at math, despite popular myth. The same is true for coding, or any skill. It’s always painful starting out, but that’s true for any worthwhile skill; the only way to learn is to practice. And in this era of increasing mechanization, coding skills are becoming even more valuable. (And having R skills pays well.)

When you code, you learn what you need to learn. These range from basic skills (writing functions, looping, creating graphics) to the use of packages particularly useful for your application. You also develop programming style, which is how to write source code and documentation in a way that allows people to understand what is being done (yes, programming involves style, just like in English class). You learn how much commenting is too much (or how much is not enough).

If you are a student, use R for your projects. Otherwise, get an idea for a project and use R to complete it. Perhaps there is some data set you are curious about, or you want to develop a predictive model, or maybe even scrape a web page. Maybe think of the most arduous task in your job and think about how you could use R to pass off the work to a computer. Just find something to do, and do it. This will reinforce your skills.

You should expect to encounter something new when coding that you have never done before. For example, you will likely use some package that I have not covered. In April, there were over 8,000 packages published on CRAN, and that number will only grow as R grows in popularity. The only way to make sense of any of them is to read their documentation.

Fortunately, I usually find the documentation for a package to be very helpful, and in RStudio, it’s extremely easy to pull up the documentation for any function you are unfamiliar with; just type `?newFunc` or `help("newFunc")` in the console (where `newFunc` is the object you want to look up documentation for), and the documentation will be pulled up in a side window. If the documentation is especially good, it may include examples towards the end.
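As a concrete illustration of looking up documentation, here is a minimal sketch using `median()` from the **stats** package as a stand-in for `newFunc` (the choice of function is arbitrary). The interactive lookups are shown as comments because they open a help window; the remaining calls run anywhere.

```r
# Interactive use: type either of these in the console to open the help page
#   ?median
#   help("median")

# help() also returns an object, so you can check that a page exists
# without opening it
h <- help("median")
length(h) > 0                        # TRUE when a help page was found

# Run the examples listed at the end of the help page
example("median", echo = FALSE)

# If you only remember part of a name, search for it
read_funs <- apropos("^read\\.")     # functions whose names start with "read."
head(read_funs)
```

`apropos()` only searches what is currently attached, so it is most useful after you have loaded the package you are exploring.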

Package authors often don’t stop at just documenting functions. They may write vignettes that give more detail about the package’s purpose and common use, with examples or theory. A lot of packages are published with a journal article in the *Journal of Statistical Software* (*J.Stat.Soft*) or the *R Journal*, both of which are peer-reviewed journals dedicated to publishing about statistical software packages (the latter for R in particular, though most articles in *J.Stat.Soft* are about R packages). These articles often include a lot of information about the package. Some packages even have books devoted to them!

Some packages include demonstrations, or demos, that you can access to see common usage. To see all available demos, type `demo()` in the console; this will list all the demos in all packages loaded into the working environment. Then type `demo("demo_name")` to see the demo `"demo_name"`. Try this out by typing `demo("error.catching")` in the console to see what a demo is like.

To find further documentation for a package, including vignettes and other information, try looking at the package’s page on CRAN; for example, here is the CRAN page for the package **magrittr**. On the CRAN page, you can find a basic description of the package, any vignettes, and the reference manual (which is a PDF file that holds essentially the same information as that found by looking up documentation from the command line, such as the usage of all functions included in the package). Third-party sites may also include documentation for particularly popular packages.
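The demo and vignette lookups described above can also be done programmatically, which is handy when you want to see what a single package offers. This is only a sketch: it looks at the **base** and **grid** packages because they ship with R, and whether any vignettes are found depends on how your copy of R was installed.

```r
# List the demos that come with the base package (without opening a viewer)
base_demos <- demo(package = "base")
base_demos$results[, "Item"]         # demo names, including "error.catching"

# To actually run one, type this in an interactive session:
#   demo("error.catching")

# List the vignettes of one package; grid is used here only as an example
grid_vignettes <- vignette(package = "grid")
nrow(grid_vignettes$results)         # how many vignettes were found (may be 0)

# In an interactive session, this opens a browsable index of vignettes:
#   browseVignettes("grid")
```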

Many people today, after getting a basic understanding of a programming language, forgo classes and books and rely on only two tools to learn how to program: Google and StackOverflow. The use of Google is obvious; if you don’t know how to use Google, you have bigger problems you must address before learning how to program with R. Jokes aside, though, usually you are not the first person to encounter a problem or need to accomplish a task, and a good Google search will help you solve a problem or direct you to the packages you need for a project to solve a certain problem.

Frequently, your Google search will direct you to StackOverflow, a website where programmers ask questions and other programmers answer them. Usually your question has already been asked, especially if you are new to programming, and you should thoroughly check to see if you can find the question already. In the rare case where your question is genuinely new, feel free to post it on StackOverflow. Most of the questions I have posted there have been answered, and there is a good chance your question will be answered as well.

If you want to stay on top of R news, learn tips and tricks, discover new techniques and new packages, and just stay in touch with the extensive R community in general, look no further than R-Bloggers. R-Bloggers is a blog aggregator; bloggers request to have their blog added to the site, and whenever they publish a new R-related post, it gets copied and posted to the site, where it is then stored and distributed. You can follow R-Bloggers on Facebook and Twitter, via RSS, or via e-mail.

I get an e-mail from R-Bloggers daily, and sometimes it includes fascinating articles that expose me to something I had not known before. I am aware of **dplyr**, **magrittr**, parallel programming, Hadley Wickham, **bookdown**, the tidyverse, and many other things thanks to R-Bloggers. You can also stay on top of industry trends, which is always an important survival skill for anything remotely related to computer science. Bloggers are a great source of tutorials, as well; they will include source code with all of their nifty analyses, code that you can learn from to see best practices and techniques.

I also invite you to follow my blog. While I do post about non-R topics (Python programming, game programming, economics, and politics, and maybe something more personal now and then), I do try to post about data analysis and R programming, and I include source code whenever I can. I am also a contributor to R-Bloggers, so if you subscribe to R-Bloggers, you will be subscribing to my R content as well.

I am not a member of an R user group, nor have I attended any meetings. (I don’t own a car and I don’t have a lot of money, so I’m not very mobile; besides, I probably would rather spend my time doing something else.) That said, many R users like to attend R user group meetings and conferences to connect with the community and learn more about R. There is an R user group at the University of Utah that students here can join, and there are likely others nearby you can find. You can use this list to find a user group near you.

Some R users prefer the good-ol’-fashioned mailing list to stay on top of news and to communicate with other R users, perhaps to get help. CRAN has some official mailing lists you can consider subscribing to. The University of Utah Mathematics Department has its own mailing list as well, to which I am a subscriber. Sadly, few use this mailing list, but I hope that more students join the list so that they can keep in contact with one another, not only to share news and ask for help but also to build a professional community through which connections can be made (perhaps job tips).

I also would like to mention that the website R-users is a site where employers can seek out R programmers (there is also a feed for R-users on this blog). In future job searches, perhaps check there for R-related jobs.

R is open source software, and in that spirit, many contribute to R and its development. The language itself is free (in terms of both speech and beer), and the overwhelming majority of packages useful for application are free. Documentation is free, and thanks to websites such as StackOverflow, users can even get high quality assistance for free. Many learning resources are provided for free. While you can take advantage of how cheaply R can be used ($0 is really cheap), I hope you consider how you could possibly “pay back” the community that provides all of these excellent resources for you. (And while this could take the form of much-needed monetary support, that is not what I have in mind.)

Granted, this article is written for those with minimal R experience, and thus unlikely to believe they have anything valuable to contribute. However, not only can even R beginners “pay back” the community, but if you use R enough in your career, you will eventually develop some level of expertise that can help others who are starting out just like you. In fact, another “beginner” may be more helpful to others than another “expert”, who may take some knowledge for granted.

Also keep in mind that when I say you should “give back” to the community, this is not only for the sake of the community; it is for *your* sake as well, and *you yourself* may benefit the most from your contributions.

First, helping others and sharing your code improves your own skills. One of the reasons why I am a good R programmer is that I teach the R lab. The process of preparing lectures and helping students on assignments hones my own skills and forces me to address issues I otherwise would take for granted. On top of this, you learn to write better code and functions when that code is being written with a user other than yourself in mind. Your style and documentation improve, and the functions you write will likely be more useful when you need to think about how to make them “general”, or usable in a wide variety of situations.

Second, by contributing to the community, you begin to develop a reputation that can translate to a more successful career. You can become an authority on a topic that others look to for help and thus develop a personal brand. If the resources you provide (be it a package, a blog post, or even answers to StackOverflow questions) become popular, you can develop an audience that you can then exploit for profit or leverage in job interviews. Employers can see your writing and code samples, and if those samples are good, you may be more likely to get the job you want.

So with that in mind, here are some ways to “give back”:

Many programmers have accounts on GitHub, a software repository site that also serves as a social network for programmers. On GitHub, developers store and share their code, and others can download it, report issues, or even issue a pull request, which is a developer’s own modification to the code. By hosting code on GitHub, you may get more eyes looking at and using your code.

In your line of work, you may find yourself writing functions useful to the applications you are working on. If these functions are generally useful and not already in existence, you may want to consider bundling your code together into a package, which you can then host on GitHub or perhaps consider submitting to CRAN. People do benefit from being the authors of popular, useful packages. Ari Lamstein, the author of the mapping package **choroplethr**, reported that his package has made him money in books and consulting.

Even if you don’t have a package for some particular application in mind, perhaps consider organizing functions you use commonly into a personal package for your own use, and sharing this personal library on GitHub (with the disclaimer that you may change the contents of the package at whim and others should use the package at their own risk). This may encourage you to write better functions and classes and to document your work well, all while keeping it together in a single place.
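If you have never built a package, the starting point is less intimidating than it sounds. Below is a minimal sketch using `package.skeleton()` from the **utils** package that ships with R; the package name `mytools` and the helper `prop_true()` are hypothetical names made up for this illustration, and the skeleton is written to a temporary directory so nothing on your system is touched. Many authors use other tooling for this step; this is just one base-R way to begin.

```r
# A hypothetical helper you might want to keep in a personal package
pkg_env <- new.env()
pkg_env$prop_true <- function(x) {
  # Proportion of TRUE values in a logical vector, ignoring missing values
  mean(x, na.rm = TRUE)
}

# Generate the package skeleton (directories, DESCRIPTION, help-file templates)
package.skeleton(name = "mytools",
                 list = "prop_true",
                 environment = pkg_env,
                 path = tempdir(),
                 force = TRUE)

# Inspect what was created
pkg_dir <- file.path(tempdir(), "mytools")
list.files(pkg_dir)
```

The generated files are only templates; you would still need to fill in the `DESCRIPTION` file and the documentation before sharing the package.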

Additionally, if you find bugs or weak features/functionality in others’ packages, don’t hesitate to report them. Perhaps you can even read the source code yourself and add the fix; the authors and the community would certainly appreciate the help. (But be sure you know what you are doing.) This holds for packages hosted on CRAN as well, though the process for GitHub packages is simpler.

It doesn’t take much to ask questions on StackOverflow, while answering them takes more effort. Answering questions, though, may benefit your own career. The website does a good job of tracking users’ contributions and scoring their usefulness, and this may make for a nice line on a resume. Selfish motivations aside, though, I believe that if you are going to ask questions, you should try to answer others’ questions when you can, since others have donated their time to help you (and it’s not easy or quick to write good answers).

In my career, I have found that nothing teaches me more than being forced to write about a topic. Yihui Xie has written a few books on R topics (usually related to his packages), and in the announcement blog post on RStudio’s website, he had the following to say about writing books:

Writing books can be highly addictive: it helps you organize your (random) thoughts and content into chapters and sections, and it is very rewarding to see the number of pages grow each day like a little baby. You can do things that you normally cannot/won’t do in journal papers. … Choose a fresh and crispy font, and you simply cannot stop writing!

I too have found that writing forces me to process topics more deeply than when I don’t write, and I learn a lot just from the act of writing. (Again, this is one reason why I like teaching.)

You are never too inexperienced to write a blog. In fact, beginners are often great authors since they take less for granted and can explain things very clearly once they gain an understanding of a process. They also are better at demonstrating the process that leads to a result. Other beginners would greatly value your contributions.

It does not take too much to develop a meager audience, and a few great blog posts can earn you a reasonable baseline readership. I don’t promote my blog beyond announcing new posts on Reddit and other social networking sites, and having my posts distributed by R-Bloggers. While I have not had any of my R posts “take off”, my series on finance data using Python does earn my blog at least a few hundred views daily and is a popular introductory guide to the topic; considering my goals, that’s not bad. I get e-mails from readers on a regular basis asking for insight or proposing opportunities. That said, don’t be disappointed if your blog does not have a huge following (unless you’re trying to make a living blogging); those who found your insights useful will appreciate your efforts, and again, the one who learned the most from your article is *you*.

As you develop more expertise and an audience, you may eventually want to consider writing a book, as Yihui Xie suggests. Writing book-length material is not as difficult as you may think, especially when you have a book-length’s worth of material to write about. Even if you don’t sell your book, it can still be a useful part of your portfolio.

As I said earlier, this blog post represents my own experience with learning R, and it also reflects my own learning style. I also am not *that* experienced with R; I’ve been using it since 2012, while others have likely used it much more than I have, and I would invite their thoughts in the comments. That said, I believe that the suggestions made here may help not only those who want to learn R but also those who want to learn programming in general.

If I would offer one final tip, though, it would be to never grow complacent. The world is a rapidly changing place, and that’s especially true in the industries where R is commonly used. You should be prepared to always be learning in order to not just stay on top, but stay *employed!* Being able to learn independently may be the most valuable skill to have in the machine age, so I suggest you start practicing now.

To **leave a comment** for the author, please follow the link and comment on their blog: **R – Curtis Miller's Personal Website**.

(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

The R Consortium has already funded 8 projects (and 3 more just in July) proposed by the R community, and the call for proposals for yet more projects is now open. If you have an idea for a project that would advance R or the R Community, get your submission in by February 10, 2017.

Meanwhile, the already-funded projects are making good progress. R-hub, the build service for R packages, has been running a successful public beta for a couple of months now. The SatRDays mini-conferences project has already had one very successful sold-out meeting in Budapest (follow that link for recordings of the talks), with another scheduled in Cape Town on February 18, 2017. R-ladies has rapidly expanded to 25 chapters around the world. And two other projects have recently reached interim milestones.

RL10N, the project to translate R into other spoken languages, has achieved its first milestone with the release of the poio package on CRAN. This package allows translators to create simple files with translations of messages, warnings, and errors. Next, the project plans to add tools for managing and updating translations, and finding translators to create the files in various languages.

The Improving Database Interfaces project has also made good progress, releasing the RSQLite v1.1 package. This provides a standardized interface to the SQLite database according to the DBI specification (which continues to evolve). This same interface will be extended to other databases, and make working with different databases in R more consistent.
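To give a sense of what a DBI-style interface looks like in practice, here is a minimal sketch that uses an in-memory SQLite database and the built-in `mtcars` data set. It assumes the **DBI** and **RSQLite** packages are installed (the `requireNamespace()` guard simply skips the example if they are not); the table name `cars` is made up for the illustration.

```r
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # Connect to a temporary in-memory SQLite database
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Copy a data frame into the database as a table named "cars"
  DBI::dbWriteTable(con, "cars", mtcars)
  DBI::dbListTables(con)

  # Run a SQL query and get the result back as a data frame
  eight_cyl <- DBI::dbGetQuery(con,
    "SELECT COUNT(*) AS n FROM cars WHERE cyl = 8")
  print(eight_cyl$n)   # number of 8-cylinder cars in mtcars: 14

  # Always close the connection when finished
  DBI::dbDisconnect(con)
}
```

Because the generic functions come from DBI, the same calls should carry over to another database backend by changing only the driver passed to `dbConnect()`.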

The R Consortium is also now sponsoring R user groups around the world, so if you are a member of an established R user group or would like to set one up, follow that link to apply for sponsorship. You can also find a list of local R user groups here on the blog.

Thanks as always to the members of the R Consortium (Microsoft, RStudio and all the others) for providing the funding to support these worthwhile projects!

(This article was first published on **R - Data Science Heroes Blog**, and kindly contributed to R-bloggers)

Hi there! I decided to *almost* re-write the model validation section since it didn’t reflect real case scenarios.

Hopefully, in the two new chapters you will gain a deeper knowledge of the methodological aspects of model validation through classical cross-validation and bootstrapping, going further into the **nature of the error**. You will also see how to take advantage of validation when data is **time dependent**.

There is a lot more to tell about model validation, but it’s a kick start.

Coming soon, there will be an update on methodological aspects in **data preparation**.

*First published at: http://blog.datascienceheroes.com/model-performance-in-data-science-live-book*

(This article was first published on **R-exercises**, and kindly contributed to R-bloggers)

[For this exercise, first write down your answer, without using R. Then, check your answer using R.]

Answers to the exercises are available here.

**Exercise 1**

If `M=matrix(c(1:10),nrow=5,ncol=2,dimnames=list(c('a','b','c','d','e'),c('A','B')))`, what is the value of `M`?

**Exercise 2**

Consider the matrix `M` defined in Exercise 1. What is the value of:

`M[1,]`

`M[,1]`

`M[3,2]`

`M['e','A']`

**Exercise 3**

Consider the matrix `N=matrix(c(1:9),nrow=3,ncol=3,dimnames=list(c('a','b','c'),c('A','B','C')))`. What is the value of `diag(N)`?

**Exercise 4**

What is the value of `diag(4,3,3)`? Is the result a matrix?

**Exercise 5**

If `M=matrix(c(1:9),3,3,byrow=T,dimnames=list(c('a','b','c'),c('d','e','f')))`, what is the value of:

`rownames(M)`

`colnames(M)`

**Exercise 6**

What is the value of:

`upper.tri(M)`

`lower.tri(M)`

`lower.tri(M,diag=T)`

**Exercise 7**

Consider two matrices, `M` and `N`:

`M=matrix(c(1:9),3,3,byrow=T)`

`N=matrix(c(1:9),3,3)`

What is the value of:

`M*N`

**Exercise 8**

Consider two matrices, `M` and `N`:

`M=matrix(c(1:9),3,3,byrow=T)`

`N=matrix(c(1:9),3,3)`

What is the value of:

`M%*%N`

**Exercise 9**

Consider two matrices, `M` and `N`:

`M=matrix(c(1:9),3,3,byrow=T)`

`N=matrix(c(1:9),3,3)`

What is the value of:

`(M+N)^2`

**Exercise 10**

Consider two matrices, `M` and `N`:

`M=matrix(c(1:9),3,3,byrow=T)`

`N=matrix(c(1:9),3,3)`

What is the value of:

`M/N`
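One possible way to carry out the "check your answer using R" step is to type in your handwritten answer and compare it with what R computes. The sketch below uses two small hypothetical matrices, `A` and `B`, chosen to be different from the ones in the exercises so that it does not give any answers away.

```r
# Hypothetical 2x2 matrices (not the ones from the exercises)
A <- matrix(1:4, nrow = 2)                 # filled by column: rows (1, 3) and (2, 4)
B <- matrix(1:4, nrow = 2, byrow = TRUE)   # filled by row:    rows (1, 2) and (3, 4)

# Handwritten answers, entered column by column
my_elementwise <- matrix(c(1, 6, 6, 16), nrow = 2)     # guess for A * B
my_matprod     <- matrix(c(10, 14, 14, 20), nrow = 2)  # guess for A %*% B

# all.equal() compares values while tolerating integer vs. double storage
isTRUE(all.equal(A * B, my_elementwise))   # TRUE: * multiplies entry by entry
isTRUE(all.equal(A %*% B, my_matprod))     # TRUE: %*% is matrix multiplication
```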

**Want to practice matrices a bit more? We have more exercise sets on this topic here.**

(This article was first published on **DataCamp Blog**, and kindly contributed to R-bloggers)

The team here at DataCamp is thrilled to announce that we now offer Italian (thanks to Quantide) and German (thanks to eoda) translations of our most popular course, Introduction to R. Best of all, the courses are free as a part of our open course offering!

By using in-browser coding challenges you will experiment with the different aspects of the R language in real time, and you will receive instant and personalized feedback that guides you to the solution. All of this, now available in Introduzione a R and Einführung in R.

Want to create your own translation of Introduction to R? With DataCamp Teach, you can easily create and host your own interactive courses for free. Use the same system DataCamp course creators use to develop their courses, and share your R knowledge with the rest of the world. With DataCamp Teach, you just write your interactive exercises in simple markdown files, and DataCamp Teach uploads the content to DataCamp for you. This makes creating a DataCamp course hassle-free.

(This article was first published on **R – biologyforfun**, and kindly contributed to R-bloggers)

Below I will expand on previous posts on Bayesian regression modelling using STAN (see previous instalments here, here, and here). The topic of the day is modelling crossed and nested designs in hierarchical models using STAN in R.

Crossed designs appear when we have more than one grouping variable and when data are recorded for each combination of the grouping variables. For example, say that we measured the growth of a fungus on different Petri dishes and that we took several samples from each dish. In this example we have two grouping variables: the Petri dish and the sample. Since we have observations for each combination of the two grouping variables, we are in a crossed design. We can model this using a hierarchical model with an intercept representing the average growth, a parameter representing the deviation from this average for each Petri dish, and an additional parameter representing the deviation from the average for each sample. Below is the corresponding model in STAN code:

    /* A simple example of a crossed hierarchical model
     * based on the Penicillin data from the lme4 package
     */
    data {
      int<lower=0> N; // number of observations
      int<lower=0> n_sample; // number of samples
      int<lower=0> n_plate; // number of plates
      int<lower=1,upper=n_sample> sample_id[N]; // vector of sample indices
      int<lower=1,upper=n_plate> plate_id[N]; // vector of plate indices
      vector[N] y;
    }
    parameters {
      vector[n_sample] gamma; // vector of sample deviation from the average
      vector[n_plate] delta; // vector of plate deviation from the average
      real<lower=0> mu; // average diameter value
      real<lower=0> sigma_gamma; // standard deviation of the gamma coeffs
      real<lower=0> sigma_delta; // standard deviation of the delta coeffs
      real<lower=0> sigma_y; // standard deviation of the observations
    }
    transformed parameters {
      vector[N] y_hat;
      for (i in 1:N)
        y_hat[i] = mu + gamma[sample_id[i]] + delta[plate_id[i]];
    }
    model {
      // prior on the scale coefficient
      // weakly informative priors, see section 6.9 in STAN user guide
      sigma_gamma ~ cauchy(0,2.5);
      sigma_delta ~ cauchy(0,2.5);
      sigma_y ~ gamma(2,0.1);
      // get sample and plate level deviation
      gamma ~ normal(0, sigma_gamma);
      delta ~ normal(0, sigma_delta);
      // likelihood
      y ~ normal(y_hat, sigma_y);
    }
    generated quantities {
      // sample predicted values from the model for posterior predictive checks
      real y_rep[N];
      for (n in 1:N)
        y_rep[n] = normal_rng(y_hat[n], sigma_y);
    }
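For readers who prefer notation, the Stan program above can be written out as follows. This is only a restatement of the code, where $s[i]$ and $p[i]$ are shorthand (introduced here, not in the original) for `sample_id[i]` and `plate_id[i]`, and the second argument of each normal distribution is a standard deviation, as in Stan:

$$
\begin{aligned}
y_i &\sim \mathcal{N}\left(\mu + \gamma_{s[i]} + \delta_{p[i]},\ \sigma_y\right), & i &= 1, \dots, N \\
\gamma_s &\sim \mathcal{N}(0, \sigma_\gamma), & s &= 1, \dots, n_{\text{sample}} \\
\delta_p &\sim \mathcal{N}(0, \sigma_\delta), & p &= 1, \dots, n_{\text{plate}} \\
\sigma_\gamma, \sigma_\delta &\sim \text{Cauchy}^{+}(0, 2.5), & \sigma_y &\sim \text{Gamma}(2, 0.1)
\end{aligned}
$$

The Cauchy priors are half-Cauchy in practice because the scale parameters are declared with `lower=0`.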

After pasting and saving this code into a .stan file, we turn to R, using the Penicillin dataset from the lme4 package as a (real-life) example:

    library(lme4)
    library(rstan)
    library(shinystan) # for great model viz
    library(ggplot2) # for great viz in general
    data(Penicillin)
    # look if we have sample for each combination
    xtabs(~plate+sample, Penicillin)
    # create the plate and sample index
    plate_id <- as.numeric(Penicillin$plate)
    sample_id <- as.numeric(Penicillin$sample)
    # the model matrix (just an intercept in this case)
    X <- matrix(rep(1, dim(Penicillin)[1]), ncol=1)
    # fit the model
    m_peni <- stan(file = "crossed_penicillin.stan",
                   data = list(N=dim(Penicillin)[1], n_sample=length(unique(sample_id)),
                               n_plate=length(unique(plate_id)), sample_id=sample_id,
                               plate_id=plate_id, y=Penicillin$diameter))
    # launch_shinystan(m_peni)

The model seems to fit pretty nicely: all chains converged for all parameters (Rhat around 1), we have decent posterior distributions (top panel in the figure below), and there is good correlation between observed and fitted data (bottom panel in the figure below).

In a next step we can look at the deviation from the average diameter for each sample and each plate (Petri dish):

    # make caterpillar plot
    mcmc_peni <- extract(m_peni)
    # 2.5%, 50% and 97.5% posterior quantiles of each sample deviation
    sample_eff <- apply(mcmc_peni$gamma, 2, quantile, probs=c(0.025,0.5,0.975))
    # levels() matches the order of the numeric IDs built with as.numeric()
    df_sample <- data.frame(ID=levels(Penicillin$sample), Group="Sample",
                            LI=sample_eff[1,], Median=sample_eff[2,], HI=sample_eff[3,])
    plate_eff <- apply(mcmc_peni$delta, 2, quantile, probs=c(0.025,0.5,0.975))
    df_plate <- data.frame(ID=levels(Penicillin$plate), Group="Plate",
                           LI=plate_eff[1,], Median=plate_eff[2,], HI=plate_eff[3,])
    df_all <- rbind(df_sample, df_plate)
    ggplot(df_all, aes(x=ID, y=Median)) + geom_point() +
      geom_linerange(aes(ymin=LI, ymax=HI)) + facet_wrap(~Group, scales="free") +
      geom_hline(aes(yintercept=0), color="blue", linetype="dashed") +
      labs(y="Regression parameters")

We can compare this figure to Figure 2.2 here, where the same model was fitted to the data using lmer.

I now turn to nested designs. Nested designs occur when there is more than one grouping variable and when there is a hierarchy in these variables, with categories from lower variables only being present at one level of the higher variables. For example, if we measured student scores within classes within schools, we would have a nested hierarchical design. In the following I will use the Arabidopsis dataset from the lme4 package. Arabidopsis plants from different regions (Netherlands, Spain and Sweden) and from different populations within these regions (nested design) were collected, and the researchers looked at the effect of herbivory and nutrient addition on the number of fruits produced per plant. Below is the corresponding STAN code:

    /* Nested regression example
     * Three-levels with varying-intercept
     * based on: https://rpubs.com/kaz_yos/stan-multi-2
     * and applied to the Arabidopsis data from lme4
     */
    data {
      int<lower=1> N; // number of observations
      int<lower=1> P; // number of populations
      int<lower=1> R; // number of regions
      // population ID
      int<lower=1,upper=P> PopID[N];
      // index giving the region each population belongs to
      int<lower=1,upper=R> PopWithinReg[P];
      int<lower=0> Fruit[N]; // the response variable
      real AMD[N]; // predictor variable, whether the apical meristem was unclipped (0) or clipped (1)
      real nutrient[N]; // predictor variable, whether nutrient level was control (0) or higher (1)
    }
    parameters {
      // regression slopes
      real beta_0; // intercept
      real beta_1; // effect of clipping apical meristem on number of fruits
      real beta_2; // effect of increasing nutrient level on number of fruits
      // the deviation from the intercept at the different levels
      real dev_pop[P]; // deviation between the populations within a region
      real dev_reg[R]; // deviation between the regions
      // the standard deviation for the deviations
      real<lower=0> sigma_pop;
      real<lower=0> sigma_reg;
    }
    transformed parameters {
      // varying intercepts
      real beta_0pop[P];
      real beta_0reg[R];
      // the linear predictor for the observations, on the log scale
      // (left unconstrained: a log-rate passed to poisson_log may be negative)
      real lambda[N];
      // compute the varying intercept at the region level
      for (r in 1:R) {
        beta_0reg[r] = beta_0 + dev_reg[r];
      }
      // compute varying intercept at the population within region level
      for (p in 1:P) {
        beta_0pop[p] = beta_0reg[PopWithinReg[p]] + dev_pop[p];
      }
      // the linear predictor
      for (n in 1:N) {
        lambda[n] = beta_0pop[PopID[n]] + beta_1 * AMD[n] + beta_2 * nutrient[n];
      }
    }
    model {
      // weakly informative priors on the slopes
      beta_0 ~ cauchy(0,5);
      beta_1 ~ cauchy(0,5);
      beta_2 ~ cauchy(0,5);
      // weakly informative prior on the standard deviation
      sigma_pop ~ cauchy(0,2.5);
      sigma_reg ~ cauchy(0,2.5);
      // distribution of the varying intercept
      dev_pop ~ normal(0, sigma_pop);
      dev_reg ~ normal(0, sigma_reg);
      // likelihood
      Fruit ~ poisson_log(lambda);
    }
    generated quantities {
      // sample predicted values from the model for posterior predictive checks
      int<lower=0> fruit_rep[N];
      for (n in 1:N)
        fruit_rep[n] = poisson_log_rng(lambda[n]);
    }
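Written out in notation, the Stan program above corresponds to the following model. This is a restatement of the code only; $p[n]$ and $r[p]$ are shorthand introduced here for `PopID[n]` and `PopWithinReg[p]`:

$$
\begin{aligned}
\text{Fruit}_n &\sim \text{Poisson}\left(\exp(\lambda_n)\right) \\
\lambda_n &= \beta_{0,p[n]}^{\text{pop}} + \beta_1\,\text{AMD}_n + \beta_2\,\text{nutrient}_n \\
\beta_{0,p}^{\text{pop}} &= \beta_0 + \text{dev}^{\text{reg}}_{r[p]} + \text{dev}^{\text{pop}}_{p} \\
\text{dev}^{\text{pop}}_p &\sim \mathcal{N}(0, \sigma_{\text{pop}}), \qquad \text{dev}^{\text{reg}}_r \sim \mathcal{N}(0, \sigma_{\text{reg}})
\end{aligned}
$$

Note that $\lambda_n$ is the log of the expected count, which is why the likelihood uses `poisson_log` rather than `poisson`.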

I decided to use a Poisson distribution as the response is a count variable. The only “tricky” part is the index linking a particular population to its specific region (PopWithinReg). In this model we assume that variation between populations within regions only affects the average number of fruits but does not affect the plant responses to the simulated herbivory (AMD) and to increased nutrient levels. In other words, population within region is an intercept-only “random effect”. We turn back to R:

    data("Arabidopsis")
    # generate the IDs
    pop.id <- as.numeric(Arabidopsis$popu)
    pop_to_reg <- as.numeric(factor(substr(levels(Arabidopsis$popu),3,4)))
    # create the predictor variables
    amd <- ifelse(Arabidopsis$amd=="unclipped",0,1)
    nutrient <- ifelse(Arabidopsis$nutrient==1,0,1)
    m_arab <- stan("nested_3lvl.stan",
                   data = list(N=625, P=9, R=3, PopID=pop.id,
                               PopWithinReg=pop_to_reg, Fruit=Arabidopsis$total.fruits,
                               AMD=amd, nutrient=nutrient))
    # check model
    # launch_shinystan(m_arab)
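To make the nested index less mysterious, here is a minimal base-R sketch of how such a population-to-region lookup behaves. The four population labels and the `pop_id` vector are made up for illustration (they only mimic the "number.REGION" naming pattern), so treat this as a toy example rather than the actual Arabidopsis indexing:

```r
# Hypothetical population labels following a "<number>.<REGION>" pattern
pop_levels <- c("1.SP", "1.SW", "2.SW", "3.NL")

# Region code is taken from characters 3-4; factor levels sort alphabetically
region <- factor(substr(pop_levels, 3, 4))
levels(region)

# One region index per population (this plays the role of PopWithinReg)
pop_to_reg <- as.integer(region)

# Hypothetical population index per observation (this plays the role of PopID)
pop_id <- c(1L, 1L, 2L, 3L, 4L, 4L)

# Chaining the two lookups gives the region of each observation,
# just like beta_0reg[PopWithinReg[p]] does inside the Stan loop
region_of_obs <- pop_to_reg[pop_id]
data.frame(pop = pop_levels[pop_id], region = levels(region)[region_of_obs])
```

The point of the design is that the region is never passed per observation: each observation only knows its population, and each population only knows its region.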

Rstan warns us that there were some divergent iterations; we could correct this using a non-centered re-parametrization (see this post and the STAN user guide). More worrisome is the discrepancy between the posterior predictive data and the observed data:

We can explore these errors for each population within regions:

    mcmc_arab <- extract(m_arab)
    # plot obs vs fitted data across groups
    fit_arab <- mcmc_arab$fruit_rep
    # average across MCMC samples
    Arabidopsis$Fit <- apply(fit_arab, 2, mean)
    # plot obs vs fit (the closing parenthesis of aes() was missing)
    ggplot(Arabidopsis, aes(x=total.fruits, y=Fit, color=amd, shape=factor(nutrient))) +
      geom_point() + facet_wrap(~popu, scales="free") +
      geom_abline(aes(intercept=0, slope=1))

The model predicts basically four values, one for each combination of the two treatment variables. The original data are far more dispersed than the fitted values. One could try a negative binomial distribution while also letting the treatment effects vary between populations within regions …

That’s it for this post. A great source of further regression model examples is the STAN wiki.
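A quick way to see why a Poisson likelihood struggles with such data is to compare the variance-to-mean ratio of counts, which should be close to 1 under a Poisson model. The sketch below uses simulated counts only (not the Arabidopsis data), and the mean of 20 and the `size` of 2 are arbitrary values chosen for illustration:

```r
# Simulated counts, purely illustrative
set.seed(42)
n  <- 5000
mu <- 20

y_pois <- rpois(n, lambda = mu)          # variance equals the mean
y_nb   <- rnbinom(n, mu = mu, size = 2)  # variance is mu + mu^2 / size

# Variance-to-mean ratio: about 1 for Poisson, larger when overdispersed
dispersion <- function(y) var(y) / mean(y)

dispersion(y_pois)
dispersion(y_nb)
```

The first ratio comes out near 1 and the second well above it. Applying the same `dispersion()` helper to the observed fruit counts within each treatment combination would be one way to check whether a negative binomial likelihood is worth trying.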


To **leave a comment** for the author, please follow the link and comment on their blog: ** R – biologyforfun**.


(This article was first published on ** rOpenSci Blog - R**, and kindly contributed to R-bloggers)

A few weeks ago we announced the first release of the tesseract package: a high quality OCR engine in R. We have now released an update with extra features.

As explained in the first post, the tesseract system is powered by language-specific training data. By default only English training data is installed. Version 1.3 adds utilities to make it easier to install additional training data.

```
# Download French training data
tesseract_download("fra")
```

Note that this function is not needed on Linux. Here you should install training data via your system package manager instead. For example on Debian/Ubuntu:

```
sudo apt-get install tesseract-ocr-fra
```

And on Fedora/CentOS you use:

```
sudo yum install tesseract-langpack-fra
```

Use `tesseract_info()` to see which training data are currently installed.

Tesseract supports many parameters to fine-tune the OCR engine. For example you can limit the possible characters that can be recognized.

```
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))
ocr("image.png", engine = engine)
```

In the example above, Tesseract will only consider numeric characters. If you know in advance the data is numeric (for example an accounting spreadsheet) such options can tremendously improve the accuracy.

Tesseract now automatically recognizes images from the awesome magick package (our R wrapper to ImageMagick). This can be useful to preprocess images before feeding to tesseract.

```
library(magick)
library(tesseract)
image <- image_read("http://jeroenooms.github.io/files/dog_hq.png")
image <- image_crop(image, "1700x100+50+150")
cat(ocr(image))
```

We plan to add more integration between Magick and Tesseract in future versions.

To **leave a comment** for the author, please follow the link and comment on their blog: ** rOpenSci Blog - R**.


(This article was first published on ** Thinking inside the box **, and kindly contributed to R-bloggers)

A new version of RcppAPT (our interface from R to the C++ library behind the awesome `apt`, `apt-get`, `apt-cache`, … commands and their cache powering Debian, Ubuntu and the like) is now on CRAN.

We changed the package to require C++11 compilation, as newer Debian systems with `g++-6` and the current libapt-pkg-dev library cannot build under the C++98 standard which CRAN imposes (and let’s not get into why …). Once set to C++11 we have no issues. We also added more examples to the manual pages, and turned on code coverage.

A bit more information about the package is available here as well as at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Thinking inside the box **.
