We wrote a short tutorial on contrast coding, covering the common contrast-coding scenarios, among them treatment, Helmert, ANOVA, sum, and sliding (successive-differences) contrasts. The target audience is psychologists and linguists, but really it is for anyone running planned experiments.
The paper has not been submitted anywhere yet. We are keen to get user feedback before we do that. Comments and criticism very welcome. Please post comments on this blog, or email me.
Abstract:
Factorial experiments in research on memory, language, and in other areas are often analyzed using analysis of variance (ANOVA). However, for experimental factors with more than two levels, the ANOVA omnibus F-test is not informative about the source of a main effect or interaction. This is unfortunate, as researchers typically have specific hypotheses about which condition means differ from each other. A priori contrasts (i.e., comparisons planned before the sample means are known) between specific conditions or combinations of conditions are the appropriate way to represent such hypotheses in the statistical model. Many researchers have pointed out that contrasts should be “tested instead of, rather than as a supplement to, the ordinary ‘omnibus’ F test” (Hayes, 1973, p. 601). In this tutorial, we explain the mathematics underlying different kinds of contrasts (i.e., treatment, sum, repeated, Helmert, and polynomial contrasts), discuss their properties, and demonstrate how they are applied in the R System for Statistical Computing (R Core Team, 2018). In this context, we explain the generalized inverse, which is needed to compute the weight coefficients for contrasts that test hypotheses not covered by the default set of contrasts. A detailed understanding of contrast coding is crucial for successful and correct specification in linear models (including linear mixed models). Contrasts defined a priori yield far more precise confirmatory tests of experimental hypotheses than the standard omnibus F-test.
Full paper: https://arxiv.org/abs/1807.10451
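To give a flavor of the R workflows the paper covers, here is a minimal sketch with simulated stand-in data (the factor, its levels, and the hypotheses are invented for illustration, not taken from the paper):

```r
library(MASS)  # for ginv(), the generalized (Moore-Penrose) inverse

# Simulated stand-in data: a single three-level factor
set.seed(1)
d <- data.frame(
  F = factor(rep(c("low", "mid", "high"), each = 10),
             levels = c("low", "mid", "high")),
  y = rnorm(30)
)

# Built-in codings: treatment (R's default) and sum contrasts
contrasts(d$F) <- contr.treatment(3)  # each level vs. the baseline "low"
contrasts(d$F) <- contr.sum(3)        # each level vs. the grand mean

# Custom contrasts: state each hypothesis as a row of weights on the
# condition means, then derive the contrast matrix via the generalized inverse
hypotheses <- rbind(c(-1,  1,  0),    # mid  vs. low
                    c( 0, -1,  1))    # high vs. mid
contrasts(d$F) <- ginv(hypotheses)
summary(lm(y ~ F, data = d))
```

With contrasts derived this way, each regression coefficient directly estimates the corresponding planned comparison (here, the two successive differences between condition means).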
In this exercise set, we continue solving problems from the previous GLM exercise set (here); the exercise numbering therefore starts at 9. Please make sure you have read and worked through the previous set before continuing.
In the previous set, we found that the model suffered from over-dispersion, so we tried quasi-Poisson regression along with step-wise variable-selection algorithms. Please note that here we assume there is no influence from background theory or knowledge about the data. Obviously, that never holds in the real world, but we take this step purely as an exercise.
Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.
Exercise 9
Load the “MASS” package to fit a negative binomial model. Fit the model using all of the explanatory variables.
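A sketch of one possible answer. The real exercise uses the data set loaded in the earlier GLM exercises; since that data set is not reproduced here, the code below simulates a stand-in (`dat` and its column names are placeholders):

```r
library(MASS)  # provides glm.nb() for negative binomial regression

# Simulated stand-in for the exercise data set: a count response
# plus a few explanatory variables
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
dat$y <- rpois(100, lambda = exp(0.5 + 0.4 * dat$x1))

# Exercise 9: fit the negative binomial model on all explanatory variables
model_nb <- glm.nb(y ~ ., data = dat)

# Exercise 10: inspect the summary (coefficients, theta, deviance)
summary(model_nb)
```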
Exercise 10
Check the summary of the model.
Exercise 11
Set the relevant option in base R for handling missing values before running the model-selection step.
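One common reading of this step (our interpretation; the original wording is terse): automated model-selection routines should not silently drop rows with missing values, because candidate models must be compared on identical data. A sketch:

```r
# Fail loudly if missing values are present, so that all candidate models
# in a step-wise search are fitted to exactly the same rows
options(na.action = "na.fail")
```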
Exercise 12
The previous exercise gave the insight that variables 1, 3, 4, 6 or 1, 4, 6 produce the best model performance. Refit the model using those variables.
Exercise 13
Check the diagnostic plots and draw a conclusion about whether the model gives the best performance.
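For the diagnostic plots, base R's `plot()` method on a fitted model draws the standard panels: residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage. A sketch (`model_refit` stands in for the model from Exercise 12; the data here are simulated so the block runs on its own):

```r
# Simulated stand-in for the refitted model from Exercise 12
set.seed(1)
dat <- data.frame(x1 = rnorm(100))
dat$y <- rpois(100, lambda = exp(0.3 + 0.5 * dat$x1))
model_refit <- glm(y ~ x1, data = dat, family = poisson)

# Draw the four standard diagnostic panels in a 2x2 grid
par(mfrow = c(2, 2))
plot(model_refit)
par(mfrow = c(1, 1))
```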
I just came across a nice little post on acquiring and visualizing geodata in R using the Max Planck Institute of Ornithology as an example. It’s by the rOpenSci guys. Some useful code in there by the look of it… Worth a look…
We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R!
Manning Publications has just announced the launch of the MEAP (Manning Early Access Program) for the second edition. The MEAP allows you to subscribe to drafts of chapters as they become available, and to give us feedback before the book goes into print. Currently, drafts of the first three chapters are available.
If you’ve been contemplating buying the first edition, and haven’t yet, never fear! If you subscribe to the MEAP for the second edition, an eBook copy of the previous edition, Practical Data Science with R (First Edition), is included at no additional cost.
In addition to the topics that we covered in the first edition, we plan to add: additional material on using the vtreat package for data preparation; a discussion of LIME for model explanation; and sections on modeling techniques that we didn’t cover in the first edition, such as gradient boosting, regularized regression, and auto-encoders.
Please subscribe to our book; your support now will help us improve it. Please also forward this offer to your friends and colleagues (and please ask them to also subscribe and forward).
Manning is sharing a 50% off promotion code active until August 23, 2018: mlzumel3.
I’ll be giving a talk at the R/Medicine conference on Sept 7th in New Haven CT.
My talk is on modeling in the tidyverse, but there are some other excellent speakers: Rob Tibshirani, Mike Lawrence, Jennifer Thompson, and a bunch of others will be there.
Take a look at the conference website for more details.
Here is the course link.
You’ve taken a survey (or 1000) before, right? Have you ever wondered what goes into designing a survey and how survey responses are turned into actionable insights? Of course you have! In Analyzing Survey Data in R, you will work with surveys from A to Z, starting with common survey design structures, such as clustering and stratification, and will continue through to visualizing and analyzing survey results. You will model survey data from the National Health and Nutrition Examination Survey using R’s survey and tidyverse packages. Following the course, you will be able to successfully interpret survey results and finally find the answers to life’s burning questions!
Our exploration of survey data will begin with survey weights. In this chapter, we will learn what survey weights are and why they are so important in survey data analysis. Another unique feature of survey data is how they are collected: through clustering and stratification. We’ll practice specifying and exploring these sampling features for several survey datasets.
Now that we have a handle on survey weights, we will practice incorporating those weights into our analysis of categorical data in this chapter. We’ll conduct descriptive inference by calculating summary statistics, building summary tables, and constructing bar graphs. For analytic inference, we will learn to run chi-squared tests.
Of course, not all survey data are categorical, so in this chapter we will explore analyzing quantitative survey data. We will learn to compute survey-weighted statistics, such as the mean and quantiles. For data visualization, we’ll construct bar graphs, histograms, and density plots. We will close out the chapter by conducting analytic inference with survey-weighted t-tests.
To model survey data also requires careful consideration of how the data were collected. We will start our modeling chapter by learning how to incorporate survey weights into scatter plots through aesthetics such as size, color, and transparency. We’ll model the survey data with linear regression and will explore how to incorporate categorical predictors and polynomial terms into our models.
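As a taste of how the `survey` package expresses the design features and weighted analyses described above, here is a minimal sketch. The data are simulated stand-ins (the course itself uses NHANES; every column name here is invented):

```r
library(survey)

# Tiny simulated sample with clusters, strata, and weights
set.seed(1)
dat <- data.frame(
  psu     = rep(1:8, each = 25),    # cluster (primary sampling unit) ids
  stratum = rep(1:2, each = 100),   # sampling strata
  wt      = runif(200, 50, 150),    # survey weights
  height  = rnorm(200, 170, 10),
  group   = factor(rep(c("a", "b"), 100))
)

# Declare the design: clusters, strata, and weights in one object
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                 data = dat, nest = TRUE)

# Survey-weighted mean and quantiles of a quantitative variable
svymean(~height, des)
svyquantile(~height, des, quantiles = c(0.25, 0.5, 0.75))

# Survey-weighted t-test and linear regression
svyttest(height ~ group, des)
summary(svyglm(height ~ group, design = des))
```

The key design choice is that every analysis function takes the design object, not the raw data frame, so the weights, clusters, and strata are respected throughout.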
The language used by data scientists can be confusing to anyone encountering it for the first time. Ever-changing best practices and constantly evolving technologies and methodologies have given rise to a range of nuanced terms used throughout casual data conversation. Unfamiliarity with these terms often leads to disconnected expectations across different parts of a business when undertaking projects involving data and analytics. To make the most out of any data science project, it is important that participants have a shared vocabulary and an understanding of key terms at a level that is required of their role.
Mango Solutions is regularly involved in data science projects spanning different levels of a business. Below, we’ve outlined the most common data science terms that act as communication barriers in such projects:
Terms (common examples) | Definition for… a data scientist | … a data science manager | … a business director
--- | --- | --- | ---
Data Science | An interdisciplinary field spanning mathematics, statistics and computer science aimed at delivering insights from data using a variety of technologies and methodologies. | An interdisciplinary business function making use of predictive and prescriptive analytics to make better business decisions. | The proactive use of data and advanced analytics to drive better decision making. |
Descriptive Analytics | Examination of historical data to understand the changes occurring to a business. Used to answer the question “what happened?” | ||
Diagnostic Analytics | Examination of historical data to understand why changes have occurred within a business. Used to answer the question “why did something happen?” | ||
Predictive Analytics | The use of historical data to make predictions about future events. Used to answer the question “what will happen next?” | ||
Prescriptive Analytics | The use of data and above forms of analytics to determine the best course of action for a business. Used to answer the question “what’s the best decision we can make based on the data we have?” | ||
Model | The mathematical relationships describing how a sample of data is generated from other observations. | A data science product where mathematical and statistical relationships are estimated from historical data and later used to make predictions. | The mathematical and statistical relationships used to make predictions about key business metrics (e.g. future sales or probability a customer will make a purchase). |
Artificial Intelligence (AI) | In practice, this term is generally used to refer to “narrow AI” and encompasses the types of problems that can be solved with machine learning. AI usually encompasses topics like machine learning, natural language processing and computer vision among others. | ||
Machine Learning (e.g. random forest, xgboost, neural networks) | A variety of computational methods implementing supervised and unsupervised learning to predict class labels or continuous measures. | Typically, regression and classification algorithms for building models with many open-source implementations. | A broad range of leading predictive modelling methodologies. |
Deep Learning | A generalisation of artificial neural networks that makes use of many intermediate layers of representation to better capture relationships between the observed data and predictions. | A subcategory of machine learning well-suited for complex models and particularly successful in image classification and speech translation. | |
Supervised Learning | Machine learning algorithms for which data exist for both the prediction target and the observations with which the prediction will be made. | Machine learning problems where models are estimated from known examples (e.g. identifying fraudulent credit card transactions from reported cases). |
Unsupervised Learning | A category of machine learning problems where labels or prediction targets are unknown and must be discovered from patterns in the data. | The class of machine learning problems where object groupings need to be discovered (e.g. clusters/labels for pieces of text). | |
Over-fitting | Estimation error where the model fits the noise in the data. This is often the result of using models that are too complex for either the problem or available data. | e.g. A complex image classifier trained using 20 photographs will likely have 100% classification accuracy on those images but otherwise perform poorly on new images. |
Cross-Validation | An iterative approach for splitting data into train and test sets to ensure robust model estimation. | Critical strategy to ensure machine learning models don’t overfit the data and provide misleading predictions. This is needed to ensure models are general enough to be useful for making future predictions. | |
Training/Test Data | A division of data that allows unbiased model validation. Typically, models are estimated on training data and validated on “test” data that is withheld until the end of the analysis. | ||
Classification | A general term for a class of predictive problems where the target of the prediction is a label (e.g. if an observation belongs to one of two categories). | ||
Regression | Statistical and mathematical procedures for estimating the relationship between a set of variables and a target quantity while minimising the prediction errors. | A broad term often used to refer to model estimation where the target variable is a continuous value (e.g. weekly sales). |
Forecasting | The prediction of future events using mathematical or statistical models. | ||
Cloud (AWS, GCP, Azure, Cloudera) | A shared set of computational resources allowing on-demand scaling of infrastructure to meet business or project computational requirements. | A broad term for scalable on demand infrastructure and computing. | A shared set of computational resources that allow businesses to avoid upfront infrastructure costs. |
Version/Source Control (Git, SVN, Github, Gitlab) | A system for tracking, managing, and integrating code changes through a process involving branching and merging code repositories. | A system for tracking, managing, and integrating code changes while ensuring a full history of code changes is preserved along with comments from the individuals making those changes. | Framework for tracking code changes and allowing for the roll back to previous versions of software. |
Unit Testing | The automation of code validation through tests designed to ensure the correct functioning of small components of code. | An often time-consuming step during development that helps programmers test code functionality and protect against future bugs. The benefit in unit testing is often realised in the long term. | Development practice that helps ensure correct code functionality. |
Continuous Integration | A development practice where code changes are committed to a shared repository and validated by an automated build and testing process. | A practice used by a team of developers that helps protect against code integration failures and code changes that break existing or expected functionality. |
Mango Solutions can help you build a shared language around data science in your organisation. Based on our experience working with the world’s leading companies, we have developed 3 workshops to build a common language.
Find out which of the three workshops would be valuable to your organisation:
Last month, I was delighted to be invited to speak, along with Hadley Wickham, at the seventy-first meeting of the TokyoR user group in Tokyo, Japan. This day-long mini-conference attracted more than 200 attendees and featured 16 talks that covered a wide range of topics, including two near-real-time analyses of World Cup Soccer games (here and here) and an analysis of wind direction with circular data and autoregressive processes (here). The tone of the talks ranged from light-hearted to business-serious. The slides for most of the presentations can be found here. If you scan through these slides, I think you will enjoy the contemporary Japanese aesthetics evident in the color palettes and playful composition of many of the presentations as well as the technical content.
In addition to the technical content, TokyoR was informative in at least three other areas. First of all, conference talks provided some insight into the country-wide R community. I have been doing my best to follow R user groups around the world for several years now, and have been tracking groups in Japan. Nevertheless, it is difficult to get a feel for what is really happening on the ground remotely. In his presentation, the Landscape with R – Japanese Rnd. Community, TokyoR organizer Koki Mimura did a great job of presenting the big picture of R in Japan. His talk indicated the breadth of established R user groups in Japan and described something of the evolution of the TokyoR group.
A second surprise was to discover that quite a few R books have been published in Japanese. “Recommendation of Reproducibility – Data Analysis and Making a Report using RStudio” by Ishida Motohiro（石田 基広） and Kohske Takahashi（高橋 康介） is one prominent recent example.
You can find this and several more Japanese R and data science books by entering Ishida-san’s name, 石田 基広, into Amazon.co.jp.
(As good as Google Translate is, the consequences of having to rely on it are frustrating and occasionally amusing. Entering the book information とある弁当屋の統計技師(データサイエンティスト) ―データ分析のはじめかた― 単行本 – 2013/9/25 into Google Translate indicates that a literal translation of the Japanese phrase used to render the concept “data scientist” is “ceremonial statistical technician”.)
I was also very pleased to see that the TokyoR attendees seemed to reflect the diverse background and occupations of R users that one sees world-wide. The business cards I collected included several entrepreneurs, data scientists, software developers, management consultants, a marketing executive from the daily newspaper Asahi Shimbun, an editor from O’Reilly, a scientist from the National Museum of Nature and Science, and at least one researcher from the Department of Musical Creativity and Environment at Tokyo University of the Arts. These diverse roles and backgrounds indicate the great strength and flexibility of the R Community, and, I believe, ensure the continued growth of the R language.
Finally, I would like to thank the TokyoR organizers and participants alike for their gracious hospitality. This was just a fine group of people to hang out with.
Note, for more insight into the technical content discussed at TokyoR, Koki Mimura has made the slides of presentations he has delivered over the years available here. Many of these are not only technically compelling, but splendid in their presentation.