The post Visualizing APA 6 Citations: qdapRegex 0.2.0 & qdapTools 1.1.0 appeared first on All About Statistics.
qdapRegex 0.2.0 & qdapTools 1.1.0 have been released to CRAN. This post will describe some of the packages’ updates/features and provide an integrated demonstration of extracting and viewing in-text APA 6 style citations from an MS Word (.docx) document.
The qdapRegex package is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R’s own regular expression functions, or with add-on string manipulation packages such as stringr and stringi. The qdapRegex package serves a dual purpose of being both functional and educational.
Here are a select few new features. For a complete list of changes CLICK HERE:
- is.regex added as a logical check of a regular expression’s validity (i.e., whether it conforms to R’s regular expression rules).
- TC (title case), U (upper case), and L (lower case) added for convenient case manipulation.
- rm_citation_tex added to remove/extract/replace bibkey citations from a .tex (LaTeX) file.
- regex_cheat data set and cheat function added to act as a quick reference for common regex task operations such as lookaheads.
- explain added to view a visual representation of a regular expression using http://www.regexper.com and http://rick.measham.id.au/paste/explain. Also takes named regular expressions from the regex_usa or other supplied dictionary.

The last two functions, regex_cheat & explain, provide educational regex tools. regex_cheat provides a cheatsheet of common regex elements. explain interfaces with http://www.regexper.com & http://rick.measham.id.au/paste/explain.
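To make the idea of a canned pattern concrete, here is a heavily simplified sketch in base R of the kind of regex such a package wraps for citations. The pattern and text below are illustrative assumptions only; qdapRegex’s real APA 6 pattern is far more complete.

```r
## A toy "canned" citation pattern (NOT qdapRegex's actual regex):
## matches either "Name (Year)" or "(Name, Year)"
pat <- "([A-Z][a-z]+ \\(\\d{4}\\))|(\\([A-Z][a-z]+, \\d{4}\\))"

txt <- "Early work (Smith, 2001) was replicated by Jones (2005)."

## Base R extraction; qdapRegex wraps this kind of call for you
regmatches(txt, gregexpr(pat, txt))[[1]]
## [1] "(Smith, 2001)" "Jones (2005)"
```

The point of the package is that patterns like this (only battle-tested) come pre-built, so you don’t have to write them yourself.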
qdapTools is an R package that contains tools associated with the qdap package that may be useful outside of the context of text analysis.
- loc_split added to split data forms (list, vector, data.frame, matrix) on a vector of integer locations.
- matrix2long makes a long format data.frame. It takes a matrix object, stacks all columns and adds identifying columns by repeating row and column names accordingly.
- read_docx added to read in .docx documents.
- split_vector picks up a regex argument to allow for regular expression search of break locations.

In this demonstration we will use url_dl to grab a .docx file from the Internet. We’ll then read this document in with read_docx. We’ll use split_vector to split the text from the .docx into a main body and a references section. rm_citation will be utilized to extract in-text APA 6 style citations. Last, we will view frequencies and a visualization of the distribution of the citations using ggplot2. For a complete script of the R code used in this blog post CLICK HERE.
First we’ll make sure we have the correct versions of the packages, install them if necessary, and load the required packages for the demonstration:
Map(function(x, y) {
    if (!x %in% list.files(.libPaths())){
        install.packages(x)
    } else {
        if (packageVersion(x) < y) {
            install.packages(x)
        } else {
            message(sprintf("Version of %s is suitable for demonstration", x))
        }
    }
}, c("qdapRegex", "qdapTools"), c("0.2.0", "1.1.0"))

lapply(c("qdapRegex", "qdapTools", "ggplot2", "qdap"), require, character.only=TRUE)
Now let’s grab the .docx document, read it in, and split into body/references sections:
## Download .docx
url_dl("http://umlreading.weebly.com/uploads/2/5/2/5/25253346/whole_language_timelineupdated.docx")

## Read in .docx
txt <- read_docx("whole_language_timelineupdated.docx")

## Remove non-ASCII characters
txt <- rm_non_ascii(txt)

## Split into body/references sections
parts <- split_vector(txt, split = "References", include = TRUE, regex=TRUE)

## View body
parts[[1]]

## View references
parts[[2]]
Now we can extract the intext APA 6 citations and view frequencies:
## Extract citations in order of appearance
rm_citation(unbag(parts[[1]]), extract=TRUE)[[1]]

## Extract citations by section
rm_citation(parts[[1]], extract=TRUE)

## Frequency
left_just(cites <- list2df(sort(table(rm_citation(unbag(parts[[1]]),
    extract=TRUE)), TRUE), "freq", "citation")[2:1])
## citation freq
## 1 Walker, 2008 14
## 2 Flesch (1955) 2
## 3 Adams (1990) 1
## 4 Anderson, Hiebert, Scott, and Wilkinson (1985) 1
## 5 Baumann & Hoffman, 1998 1
## 6 Baumann, 1998 1
## 7 Bond and Dykstra (1967) 1
## 8 Chall (1967) 1
## 9 Clay (1979) 1
## 10 Goodman and Goodman (1979) 1
## 11 McCormick & Braithwaite, 2008 1
## 12 Read Adams (1990) 1
## 13 Stahl and Miller (1989) 1
## 14 Stahl and Millers (1989) 1
## 15 Word Perception Intrinsic Phonics Instruction Gates (1951) 1
Now we can find the locations of the citations in the text and plot a distribution of the intext citations throughout the text:
## Distribution of citations (find locations)
cite_locs <- do.call(rbind, lapply(cites[[1]], function(x){
    m <- gregexpr(x, unbag(parts[[1]]), fixed=TRUE)
    data.frame(
        citation = x,
        start = m[[1]] - 5,
        end = m[[1]] + 5 + attributes(m[[1]])[["match.length"]]
    )
}))

## Plot the distribution
ggplot(cite_locs) +
    geom_segment(aes(x=start, xend=end, y=citation, yend=citation),
        size=3, color="yellow") +
    xlab("Duration") +
    scale_x_continuous(expand = c(0,0),
        limits = c(0, nchar(unbag(parts[[1]])) + 25)) +
    theme_grey() +
    theme(
        panel.grid.major=element_line(color="grey20"),
        panel.grid.minor=element_line(color="grey20"),
        plot.background = element_rect(fill="black"),
        panel.background = element_rect(fill="black"),
        panel.border = element_rect(colour = "grey50", fill=NA, size=1),
        axis.text=element_text(color="grey50"),
        axis.title=element_text(color="grey50")
    )
Please comment on the article here: TRinker's R Blog
The post Trajectories of Achievement Within Race/Ethnicity: “Catching Up” in Achievement Across Time appeared first on Statistical Modeling, Causal Inference, and Social Science.
The post Trajectories of Achievement Within Race/Ethnicity: “Catching Up” in Achievement Across Time appeared first on All About Statistics.
Just in time for Christmas, here’s some good news for kids, from Pamela Davis-Kean and Justin Jager:
The achievement gap has long been the focus of educational research, policy, and intervention. The authors took a new approach to examining the achievement gap by examining achievement trajectories within each racial group. To identify these trajectories they used the Early Childhood Longitudinal Study–Kindergarten Cohort, which is a nationally representative sample of students in kindergarten through Grade 5. In the analyses, the authors found heterogeneity within each racial group in mathematics and reading achievement, suggesting that there are in fact achievement gaps within each race/ethnicity group. The authors also found that there are groups that catch up to the highest achieving groups by Grade 5, suggesting a positive impact of schooling on particular subgroups of children. The authors discuss the various trajectories that have been found in each racial group and the implications this has for future research on the achievement gap.
Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science
The post Di Cook is moving to Monash appeared first on All About Statistics.
I’m delighted that Professor Dianne Cook will be joining Monash University in July 2015 as a Professor of Business Analytics. Di is an Australian who has worked in the US for the past 25 years, mostly at Iowa State University. She is moving back to Australia and joining the Department of Econometrics and Business Statistics in the Monash Business School, as part of our initiative in Business Analytics.
Di is a world leader in data visualization, and is well-known for her work on interactive graphics. She is also the academic supervisor of several leading data scientists including Hadley Wickham and Yihui Xie, both of whom work for RStudio.
Di has a great deal of energy and enthusiasm for computational statistics and data visualization, and will play a key role in developing and teaching our new subjects in business analytics.
The Monash Business School is already exceptionally strong in econometrics (ranked 7th in the world on RePEc), and forecasting (ranked 11th on RePEc), and we have recently expanded into actuarial science. With Di joining the department, we will be extending our expertise in the area of data visualization as well.
Please comment on the article here: Hyndsight
The post All I want for Chrismukkah is that critics & “reformers” quit howlers of testing (after 3 yrs of blogging)! So here’s Aris Spanos “Talking Back!” appeared first on All About Statistics.
This was initially posted as slides from our joint Spring 2014 seminar: “Talking Back to the Critics Using Error Statistics”. (You can enlarge them.) Related reading is Mayo and Spanos (2011)
Please comment on the article here: Error Statistics Philosophy » Statistics
The post Using statistics to make the world a better place? appeared first on Statistical Modeling, Causal Inference, and Social Science.
The post Using statistics to make the world a better place? appeared first on All About Statistics.
In a recent discussion involving our frustration with crap research, Daniel Lakeland wrote:
I [Lakeland] really do worry about a world in which social and institutional and similar effects keep us plugging away at a certain kind of cargo-cult science that produces lots of publishable papers and makes it easier to get funding for projects that don’t really promise to give us fundamental and predictive models that can drive real improvements in people’s lives.
It’s sort of a “it’s 2014, where’s my flying car?” attitude I know, but I’d be satisfied with a lot of things other than flying cars, such as:
1) real, effective solutions to antibiotic resistant organisms
2) cures for cystic fibrosis
3) reducing the effect of heart disease on people under age 75 by 30%
4) understanding major causes of “the obesity epidemic” in a real detailed way and finding effective ways to reverse it.
5) Being able to regenerate organic replacement joint components instead of titanium hip implants etc
6) Growing replacement kidneys
7) A significantly more effective and long-lasting pertussis vaccine

Is the way we are doing science today going to provide any or all of these things in the next 30 years? What are some similar order-of-magnitude things that it has provided since 1980 using current “modern” methods and funding priorities, publication priorities, tenure systems, and so forth?
I want to consider this question, difficult as it is. It does seem that slow progress is being made in our conceptual understanding, at least in the sense that new paradigms are developing in different sciences. In sociology there’s an increased interest in network effects, in medicine it seems that a lot of things are being now understood in terms of the microbiome, in large-scale biology there’s a focus on the fetal environment, in microbiology we’ve been hearing a lot about the rapid evolution of viruses, etc. It seems to me, as an outsider in these fields, that the concepts I’ve mentioned have been around for a while, but it’s only recently that they’ve been moved to the center of discussion. And this, in turn, seems to be the result of millions of research studies on many thousands of topics, which collectively ruled out earlier favored explanations. By revealing areas where the old paradigms failed, all this research made room for new, more difficult, paradigms to be taken seriously.
It’s posterior predictive checking, on a larger scale. A chicken is an egg’s way of constructing another egg, and empirical research is a scientific theory’s way of uncovering the theory’s flaws.
It’s harder to make such clear statements about statistics or engineering or computer science, as these are essentially tools in the service of science rather than being the object of study themselves.
And there are fields where new paradigms don’t seem so apparent. Consider three areas with which I’m somewhat familiar: political science, psychology, and economics. In political science, I see persistent difficulties in integrating different perspectives coming from the studies of public opinion, institutions, and political maneuvering. It really feels like we’re not seeing the whole elephant at once. And I include my own research as an example of this incomplete perspective. Psychology seems to be undergoing a reforming process, in which various unsuccessful paradigms such as embodied cognition are being rejected, with no clear unification of the cognitive and behavioral approaches. Similarly in economics, although there it seems worse in that various incomplete perspectives are taken by their proponents as being all-encompassing.
The point of that previous paragraph is not to pass judgment on these different social science fields; it’s just my impression that they are in various stages of reconstruction and reform. From my outsider’s perspective, biology and medicine seem to be in better shape, philosophically speaking, in that there seems to be more of a recognition that traditionally dominant paradigms miss a lot.
How does statistics fit into all this? Statistics can (potentially) do a lot:
– Guidance in data collection and the assessment of measurements. And recall that “data collection” is not just about how to collect a random sample or assign treatments in an experiment; it also includes considerations of what to measure and how to measure it. For example, if you are interested in measuring the most fecund days of a woman’s monthly cycle, it makes sense to check out the medical literature on the topic.
– Methods for calibrating variation by comparing to models of randomness. This is where I think that statistical significance and p-values fit in: not as a way to make scientific discoveries (“p less than .05 so we get published in the tabloids!”) but as a measuring stick when interpreting observed comparisons and variation.
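As a toy illustration of that measuring-stick idea, here is a base R randomization check that compares an observed difference in means against a model of pure randomness. The numbers are invented for illustration and are not from any study discussed here.

```r
## Two made-up groups of measurements
a <- c(5.1, 4.8, 5.6, 5.0, 5.3)
b <- c(4.6, 4.9, 4.4, 5.0, 4.5)
obs <- mean(a) - mean(b)  # observed difference: 0.48

## Model of randomness: shuffle the group labels many times
set.seed(42)
pooled <- c(a, b)
null <- replicate(10000, {
    idx <- sample(length(pooled))
    mean(pooled[idx[1:5]]) - mean(pooled[idx[6:10]])
})

## The p-value as a measuring stick, not a discovery machine:
## how unusual is the observed difference under pure shuffling?
mean(abs(null) >= abs(obs))
```

The resulting proportion calibrates the observed comparison against what randomness alone would produce, which is exactly the modest role described above.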
– Tools for combining information. That to me is the most general way to think of “inference,” and it encompasses all sorts of things, from classical “iid” models to more complicated approaches including all sorts of time series and spatial analyses, multilevel models for partial pooling, regularization methods that allow users to ramp up the amount of data they can include in their models, and Bayesian inference for balancing uncertainties.
– Methods for checking fit, for revealing the aspects of data that are not well explained by our models. To me this includes all of exploratory data analysis, which is about learning the unexpected. Or, to step back slightly, exploratory data analysis is about putting us in a situation in which we are able to learn the unexpected. And, remember, “unexpected” = “not expected” = “something not predicted by our model,” where this “model” may be implicit.
Again, statistics is in the service of science, and I see statistics as a way of organizing science rather than as a way of making scientific discovery.
Sometimes we can perform a statistical analysis that seems to take us all the way from data to model to scientific conclusion—for example, this cool age-period-cohort model with Yair—but, even there, I think most of the real scientific heavy lifting is coming from existing substantive theories; the statistics is more of a way of rearranging the data or, as Dan Kahan would put it, of adjudicating between competing hypotheses or underlying models of reality.
Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science
The post Review: Wainer, Picturing the Uncertain World appeared first on All About Statistics.
Picturing the Uncertain World by Howard Wainer is a book about statistics and statistical thinking, aided by visual depictions of data. Each article in the collection starts by stating a question or phenomenon, which is then investigated further using some clever statistics.
I bought the book after Scott Murray pointed me to it as the source of his assertion that the best way to show uncertainty is to use blurry dots. I was surprised by that, since my own work had shown people to be pretty bad at judging blurriness, so blur didn’t seem to be a particularly good choice (at least if you want people to be able to judge the amount of uncertainty).
I had never heard of Howard Wainer before reading this book. It turns out that he has been an outspoken critic of bad charts for a long time, much longer than blogs have been around to do that. In fact, Wainer wrote an article for American Statistician in 1984 that could have been the blueprint for blogs like junk charts.
And it turns out that there is even a connection between Wainer and Kaiser Fung, who runs junk charts.
@eagereyes Howard introduced me to Tufte principles in my first stats course almost 20 yr ago!
— Kaiser Fung (@junkcharts) December 9, 2014
This is also interesting because the book reminded me of Kaiser’s Numbers Rule Your World and Numbersense. It all makes sense.
After Scott pointed it out, the book immediately intrigued me: had somebody figured out how to show uncertainty well? How did I not know about this? Well, it turns out he hasn’t. But there is a lot of other good stuff in this book that makes it very worthwhile.
Wainer’s idea of uncertainty is much broader than the usual error metrics (though he addresses those as well). In fact, he describes statistics as the science of uncertainty. That makes a lot of sense, and he makes the case repeatedly about how statistics provides means of dealing with uncertainty about facts and observations.
As a consequence, the book is really about statistical thinking, aided by visual depictions of the data. In several chapters, Wainer takes data and either redraws an existing chart, or argues that by simply looking at the data the right way, it becomes much easier to understand what is going on.
The key chapter from my perspective was chapter 13, Depicting Error. Wainer shows a number of ways to depict error, from tables to a number of charts. Some of these are well-known, others not. They are all interesting, though there isn’t much that is surprising (especially after having seen the Error Bars Considered Harmful paper by Michael Correll and Michael Gleicher at InfoVis earlier this year).
There is a lot of other good stuff in the book too, though. Chapter 16, Galton’s Normal, talks about the way the normal distribution drops to very, very small probabilities in the tails. It’s a short chapter, but it really drove home a point for me about how hard it is to intuitively understand distributions, even the ubiquitous normal distribution.
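That point is easy to appreciate with a quick base R check of how fast the normal tails vanish:

```r
## Probability of falling more than k standard deviations below the mean
pnorm(-2)   # ~0.023   -- unremarkable
pnorm(-5)   # ~2.9e-07 -- roughly 1 in 3.5 million
pnorm(-10)  # ~7.6e-24 -- essentially never
```

Each step of a few standard deviations shrinks the tail probability by many orders of magnitude, which is precisely what makes these quantities so hard to intuit.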
The final chapter, The Remembrance of Things Past, is probably the best. It’s the deepest, most human, and I think it has the best writing. It describes the statistical graphics produced by the population of the Jewish ghetto in Kovno, Lithuania, during the Holocaust. It’s chilling and fascinating, and the charts they created are incredible. Wainer does an admirable job of framing the entire chapter and navigating between becoming overly sentimental and being too sterile in his descriptions.
The book is really a collection of articles Wainer wrote for Chance Magazine and American Statistician in the mid-2000s (with one exception from 1996). As a result, it isn’t really more than the sum of its parts: it doesn’t have any cohesion between the chapters. But on the other hand, each chapter is a nicely self-contained piece, easy to read, and it’s easy to pick the book up to read a chapter or two. Wainer also writes very well, and his explanations of statistical phenomena and procedures are very good and easy to follow even if you don’t know much about statistics.
Ultimately, my question about the blurry dots was not answered, because Wainer points to Alan MacEachren’s book How Maps Work as the source of the blurriness argument. I can’t find my copy of that book at the moment though, so following this lead further will have to wait for another day.
Please comment on the article here: eagereyes
The post Cloudy and red appeared first on All About Statistics.
Note: I'm traveling during the holidays so updates will be infrequent.
Reader Daniel L. pointed me to a blog post discussing the following weather map:
The author claimed that many readers misinterpreted the red color as meaning high temperatures when he intended to show higher-than-normal temperatures. In other words, the readers did not recognize that a relative scale was in play.
That is a minor issue that can be fixed by placing a label on the map.
There are several more irritants, starting with the abundance of what Ed Tufte calls chartjunk. The county boundaries do not serve a purpose, nor is it necessary to include so many place names. The state boundaries are also too imposing. The legend fails to explain what the patch of green in Florida means.
The article itself links to a different view of this data on a newly launched site called Climate Prediction Center, by the National Oceanic and Atmospheric Administration (link). Here is a screenshot of the continental U.S.
This chart is the other extreme, bordering on too simple.
I'd suggest adding a little bit of interactivity to this chart, such as:
This is a Type V chart.
Please comment on the article here: Junk Charts
]]>The post What does Flatland have to do with Haskell? appeared first on All About Statistics.
Edwin Abbott's 1884 novella, Flatland, recounts the misadventures of a square that lives in a two-dimensional world called "Flatland". In this story, the square has a dream where he visits a one-dimensional world (Lineland) and unsuccessfully tries to educate the populace about Flatland's existence. Shortly thereafter, a sphere visits Flatland to introduce our protagonist to his own home, Spaceland. The sphere looks like a circle to the square, because the square can only see the part of the sphere that intersects Flatland's plane. The square can't fathom Spaceland until the sphere actually brings him into the third dimension.
Having his view of reality sufficiently rocked, the square postulates the existence of still higher dimensional lands. The sphere denies this possibility and returns the square to Flatland on bad terms.
The irony here is that, in spite of being aware of the square's previous insistence that Flatland was the only reality there is, the square's corresponding limitation of thought, and the square's subsequent enlightenment, the sphere arrogantly asserted that his own land represented the limits of dimensionality and that there could be no more dimensions.
I begin with this summary not as an impromptu book report, but because this is a perfect analogy for why I've decided to learn Haskell.

It isn't hard to imagine the inveterate C programmer being hesitant to embrace higher-level languages like Python and Perl. "Imagine the whole world that opens up to you when you don't have to worry about memory management and resolving pointers," requests the proselytizing Python programmer.
But the C programmer got on fine without knowing Python all this time. Besides, the (standard) Python interpreter is written in C.
"Why? So I can change types on a whim?" retorts the C programmer. "...so I can pass functions as arguments to other functions? Big deal! I can do that with function pointers in C."
While it is technically true that, in C, functions can be passed as arguments to other functions, and can be returned by functions, this is only a small part of the functional programming techniques that Python allows. But it is the only part that most programmers trained only in C can understand. This is like the square believing he had seen the sphere when he had really only seen the circle where the sphere intersected Flatland. There's a bigger picture (lambda functions, currying, closures) that only appears when the constraints imposed by a programming paradigm are pushed back.
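For readers unfamiliar with those terms, they can be sketched in a few lines of R, the language used in the other posts here, which also treats functions as first-class values. This is illustrative only, not an example from the essay:

```r
## A lambda (anonymous function) passed to a higher-order function
unlist(Map(function(x) x * 2, 1:3))  # 2 4 6

## A closure: the returned function "remembers" n
make_adder <- function(n) function(x) x + n
add5 <- make_adder(5)
add5(2)  # 7

## Currying: a two-argument idea expressed as a chain of
## one-argument functions
curried_pow <- function(base) function(exp) base^exp
curried_pow(4)(2)  # 16
```

None of these can be expressed with C's function pointers alone, which is the bigger picture the paragraph alludes to.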
Shifting tides in industry eventually compel the C programmer to give Python a shot. Once she is fluent in Python and its idioms (becomes a "pythonista", as they say), things begin to change. The OOP of Python allows the programmer to gain a new perspective on programming that had been, up to this point, unavailable to her simply because the concept doesn't exist in C.
Resolved to follow the road to higher abstraction, the (former) C programmer asks the Python programmer if there are other languages that will provoke a similar shift in thought relative to even Python. "Haskell, perhaps?" she offers.
The Python programmer scoffs, "Nah, Python is as good as it gets. Besides, purely functional programmers are a bunch of weirdos."
Haskell is a relatively obscure programming language (20th place or lower on most popularity indices), but accounts for a disproportionate number of the "this-language-will-change-the-way-you-look-at-programming" claims. By the accounts of Haskell aficionados, finally understanding Haskell is a transformative experience.
Up until recently, I found myself mirroring the thoughts and behavior of the Python programmer in this story; I kept hearing about how great Haskell is, and how it would facilitate an enormous shift in how I view computer programming, but I largely dismissed these claims as the rantings of eccentrics who are more concerned with mathematical elegance than with practical concerns.
Is it any wonder that I didn't believe the Haskell programmers? I can't see the benefits of Haskell within the constraints of how I view computer programming, constraints subtly imposed by my language of choice. (Lisp programmer Paul Graham describes this as the "Blub Paradox" in his essay Beating the Averages.)
But, like the sphere and the arrogant Python programmer, I have to remain open to the idea that there's a whole world out there that I'm not availing myself of just because I'm too obstinate to try it.
Any language that is Turing-complete will let you do anything. It's trivially true that there is nothing you can do in one language that you can't do in another. What sets languages apart (from most trivial to least) is the elegance of the syntax, its community and third-party packages, the ease with which you can perform certain tasks, and whether you'll achieve enlightenment as a result of using it. I don't know if Haskell will do this for me, but I think I ought to give it a shot.

If the ideas of relativity when applied to the domain of programming languages interests you, you might also be interested in the Sapir–Whorf hypothesis, which applies similar ideas mentioned in this post to the realm of natural languages.
Please comment on the article here: On the lambda
The post Contextual Measurement Is a Game Changer appeared first on All About Statistics.
Item        | Want to and Do Scold | Store Closing | Want to and Do Shout | Want to Curse | Do Curse
S2DoScold   | 1.00 | 0.19 | 0.00 | 0.00 | 0.00
S4WantScold | 0.96 | 0.00 | 0.00 | 0.08 | 0.00
S4DoScold   | 0.95 | 0.00 | 0.00 | 0.00 | 0.11
S1DoScold   | 0.79 | 0.37 | 0.02 | 0.05 | 0.15
S3WantScold | 0.00 | 1.00 | 0.00 | 0.08 | 0.00
S3DoScold   | 0.00 | 0.79 | 0.00 | 0.00 | 0.00
S3DoShout   | 0.00 | 0.15 | 0.14 | 0.00 | 0.00
S2WantShout | 0.00 | 0.00 | 1.00 | 0.13 | 0.02
S1WantShout | 0.00 | 0.05 | 0.91 | 0.17 | 0.04
S4WantShout | 0.00 | 0.00 | 0.76 | 0.00 | 0.00
S1DoShout   | 0.00 | 0.12 | 0.74 | 0.00 | 0.00
S2DoShout   | 0.08 | 0.00 | 0.59 | 0.00 | 0.00
S4DoShout   | 0.10 | 0.00 | 0.39 | 0.00 | 0.00
S3WantShout | 0.00 | 0.34 | 0.36 | 0.00 | 0.00
S1wantCurse | 0.13 | 0.18 | 0.03 | 1.00 | 0.09
S2WantCurse | 0.34 | 0.00 | 0.08 | 0.92 | 0.20
S3WantCurse | 0.00 | 0.41 | 0.00 | 0.85 | 0.02
S2WantScold | 0.59 | 0.00 | 0.00 | 0.73 | 0.00
S1WantScold | 0.40 | 0.22 | 0.01 | 0.69 | 0.00
S4WantCurse | 0.31 | 0.00 | 0.00 | 0.62 | 0.48
S1DoCurse   | 0.24 | 0.16 | 0.01 | 0.17 | 1.00
S2DoCurse   | 0.47 | 0.00 | 0.00 | 0.00 | 0.99
S4DoCurse   | 0.46 | 0.00 | 0.02 | 0.00 | 0.95
S3DoCurse   | 0.00 | 0.54 | 0.00 | 0.00 | 0.69
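The table above is the transposed, scaled coefficient matrix produced by the code below: each latent feature’s coefficients are divided by that feature’s maximum, so every column tops out at 1.00. As a toy base R sketch (with invented numbers) of the factorization structure V ≈ WH that NMF estimates:

```r
## NMF approximates the data matrix V by nonnegative factors:
## W (respondents x features) and H (features x items)
W <- matrix(c(1, 0,
              0, 1,
              1, 1), nrow = 3, byrow = TRUE)
H <- matrix(c(2, 0, 1,
              0, 3, 1), nrow = 2, byrow = TRUE)
V <- W %*% H  # reconstructed "data" (3 respondents x 3 items)

## Scaling each row of H by its row maximum, as in the h_scaled
## step of the real code, makes loadings comparable across features
H_scaled <- H / apply(H, 1, max)
round(H_scaled, 3)
```

In the real analysis W holds the respondents’ feature weights (used for the hard clustering) and H holds the item loadings shown, scaled and sorted, in the table.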
# access the verbal data from difR
library(difR)
data(verbal)

# extract the 24 items
test<-verbal[,1:24]
apply(test,2,table)

# remove rows with all 0s
none<-apply(test,1,sum)
table(none)
test<-test[none>0,]

library(NMF)

# set seed for nmf replication
set.seed(1219)

# 5 latent features chosen after
# examining several different solutions
fit<-nmf(test, 5, method="lee", nrun=20)
summary(fit)
basismap(fit)
coefmap(fit)

# scales coefficients and sorts
library(psych)
h<-coef(fit)
max_h<-apply(h,1,function(x) max(x))
h_scaled<-h/max_h
fa.sort(t(round(h_scaled,3)))

# hard clusters based on max value
W<-basis(fit)
W2<-max.col(W)

# profile clusters
table(W2)
t(aggregate(test, by=list(W2), mean))
Please comment on the article here: Engaging Market Research
The post One-way ANOVA with fixed and random effects from a Bayesian perspective appeared first on All About Statistics.
This blog post is derived from a computer practical session that I ran as part of my new course on Statistics for Big Data, previously discussed. This course covered a lot of material very quickly. In particular, I deferred introducing notions of hierarchical modelling until the Bayesian part of the course, where I feel it is more natural and powerful. However, some of the terminology associated with hierarchical statistical modelling probably seems a bit mysterious to those without a strong background in classical statistical modelling, and so this practical session was intended to clear up some potential confusion. I will analyse a simple one-way Analysis of Variance (ANOVA) model from a Bayesian perspective, making sure to highlight the difference between fixed and random effects in a Bayesian context where everything is random, as well as emphasising the associated identifiability issues. R code is used to illustrate the ideas.
We will consider the body mass index (BMI) of new male undergraduate students at a selection of UK Universities. Let us suppose that our data consist of measurements of (log) BMI for a random sample of 1,000 males at each of 8 Universities. We are interested to know if there are any differences between the Universities. Again, we want to model the process as we would simulate it, so thinking about how we would simulate such data is instructive. We start by assuming that the log BMI is a normal random quantity, and that the variance is common across the Universities in question (this is quite a big assumption, and it is easy to relax). We assume that the mean of this normal distribution is Universityspecific, but that we do not have strong prior opinions regarding the way in which the Universities differ. That said, we expect that the Universities would not be very different from one another.
A simple simulation of the data with some plausible parameters can be carried out as follows.
set.seed(1)
Z=matrix(rnorm(1000*8,3.1,0.1),nrow=8)  # one row per University, 1000 log-BMI values each
RE=rnorm(8,0,0.01)                      # small University-specific effects
X=t(Z+RE)                               # add effect i to row i, then transpose to 1000x8
colnames(X)=paste("Uni",1:8,sep="")
Data=stack(data.frame(X))               # long format: values plus University indicator
boxplot(exp(values)~ind,data=Data,notch=TRUE)
Make sure that you understand exactly what this code is doing before proceeding. The boxplot showing the simulated data is given below.
We will start with a frequentist analysis of the data. The model we would like to fit is
y_ij = mu + theta_i + eps_ij,   eps_ij ~ N(0, sigma^2),

where i is an indicator for the University and j for the individual within a particular University. The “effect”, theta_i, represents how the ith University differs from the overall mean. We know that this model is not actually identifiable when the model parameters are all treated as “fixed effects”, but R will handle this for us.
> mod=lm(values~ind,data=Data)
> summary(mod)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.36846 -0.06778  0.00069  0.06910  0.38219 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.101068   0.003223 962.244  < 2e-16 ***
indUni2     -0.006516   0.004558  -1.430 0.152826    
indUni3     -0.017168   0.004558  -3.767 0.000166 ***
indUni4      0.017916   0.004558   3.931 8.53e-05 ***
indUni5     -0.022838   0.004558  -5.011 5.53e-07 ***
indUni6     -0.001651   0.004558  -0.362 0.717143    
indUni7      0.007935   0.004558   1.741 0.081707 .  
indUni8      0.003373   0.004558   0.740 0.459300    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared: 0.01439,  Adjusted R-squared: 0.01353 
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16
We see that R has handled the identifiability problem using “treatment contrasts”, dropping the fixed effect for the first University, so that the intercept actually represents the mean value for the first University, and the effects for the other Universities represent the differences from the first University. If we would prefer to impose a sum constraint, then we can switch to sum contrasts with
options(contrasts=rep("contr.sum",2))
and then refit the model.
> mods=lm(values~ind,data=Data)
> summary(mods)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.36846 -0.06778  0.00069  0.06910  0.38219 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  3.0986991  0.0011394 2719.558  < 2e-16 ***
ind1         0.0023687  0.0030146    0.786 0.432048    
ind2        -0.0041477  0.0030146   -1.376 0.168905    
ind3        -0.0147997  0.0030146   -4.909 9.32e-07 ***
ind4         0.0202851  0.0030146    6.729 1.83e-11 ***
ind5        -0.0204693  0.0030146   -6.790 1.20e-11 ***
ind6         0.0007175  0.0030146    0.238 0.811889    
ind7         0.0103039  0.0030146    3.418 0.000634 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared: 0.01439,  Adjusted R-squared: 0.01353 
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16
This has 7 degrees of freedom for the effects, as before, but ensures that the 8 effects sum to precisely zero. This is arguably more interpretable in this case.
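Note that summary() reports only 7 effects under sum contrasts; the eighth is implied by the constraint. A small sketch (using the mods fit above) of how to recover the full set of 8 effects:

```r
effects=coef(mods)[-1]           # ind1..ind7 under contr.sum
theta=c(effects,-sum(effects))   # the 8th effect is minus the sum of the rest
names(theta)=paste("Uni",1:8,sep="")
theta
sum(theta)  # zero, up to floating point error
```

Base R’s dummy.coef(mods) will also expand the full set of coefficients for you.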
We will now analyse the simulated data from a Bayesian perspective, using JAGS.
All parameters in Bayesian models are uncertain, and therefore random, so there is much confusion regarding the difference between “fixed” and “random” effects in a Bayesian context. For “fixed” effects, our prior captures the idea that we sample the effects independently from a “fixed” (typically vague) prior distribution. We could simply code this up and fit it in JAGS as follows.
require(rjags)

n=dim(X)[1]
p=dim(X)[2]
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      theta[j]~dnorm(0,0.0001)
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)
On running the code we can clearly see that this naive approach leads to high posterior correlation between the mean and the effects, due to the fundamental lack of identifiability of the model. This also leads to MCMC mixing problems, but it is important to understand that this computational issue is conceptually entirely separate from the fundamental statistical identifiability issue. Even if we could avoid MCMC entirely, the identifiability issue would remain.
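The lack of identifiability is easy to check directly: shifting mu by any constant and subtracting that constant from every effect leaves the likelihood unchanged. A quick numerical sketch, using the X, n and p objects defined above (loglik is just an illustrative helper, not part of the original analysis):

```r
## log-likelihood of the oneway ANOVA model as a function of (mu, theta, tau);
## dnorm treats X column-major, so rep(...,each=n) lines up mean j with column j
loglik=function(mu,theta,tau)
  sum(dnorm(X,rep(mu+theta,each=n),1/sqrt(tau),log=TRUE))

theta=rnorm(p,0,0.01)
shift=0.5
loglik(3.1,theta,100)
loglik(3.1+shift,theta-shift,100)  # identical: the shift cancels in mu+theta
```

No amount of data can distinguish between these two parameter settings, which is exactly why a constraint (or an informative prior) is needed.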
A quick fix for the identifiability issue is to use “treatment contrasts”, just as for the frequentist model. We can implement that as follows.
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    theta[1]<-0
    for (j in 2:p) {
      theta[j]~dnorm(0,0.0001)
    }
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)
Running this we see that the model now works perfectly well, mixes nicely, and gives sensible inferences for the treatment effects.
Another source of confusion for models of this type is data formatting and indexing in JAGS models. For our balanced data there was no problem passing the data into JAGS as a matrix and specifying the model using nested loops. However, for unbalanced designs this is not necessarily so convenient, and so then it can be helpful to specify the model based on two-column data, as we would use for fitting with lm(). This is illustrated with the following model specification, which is exactly equivalent to the previous model, and should give identical (up to Monte Carlo error) results.
N=n*p
data=list(y=Data$values,g=Data$ind,N=N,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (i in 1:N) {
      y[i]~dnorm(mu+theta[g[i]],tau)
    }
    theta[1]<-0
    for (j in 2:p) {
      theta[j]~dnorm(0,0.0001)
    }
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
As suggested above, this indexing scheme is much more convenient for unbalanced data, and hence widely used. However, since our data is balanced here, we will revert to the matrix approach for the remainder of the post.
One final thing to consider before moving on to random effects is the sum-contrast model. We can implement this in various ways, but I’ve tried to encode it for maximum clarity below, imposing the sum-to-zero constraint via the final effect.
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    for (j in 1:(p-1)) {
      theta[j]~dnorm(0,0.0001)
    }
    theta[p] <- -sum(theta[1:(p-1)])
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
Again, this works perfectly well and gives similar results to the frequentist analysis.
The key difference between fixed and random effects in a Bayesian framework is that random effects are not independent, being drawn from a distribution with parameters which are not fixed. Essentially, there is another level of hierarchy involved in the specification of the random effects. This is best illustrated by example. A random effects model for this problem is given below.
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      theta[j]~dnorm(0,taut)
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
    taut~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","taut","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
The only difference between this and our first naive attempt at a Bayesian fixed effects model is that we have put a gamma prior on the precision of the effects. Note that this model now runs and fits perfectly well, with reasonable mixing, and gives sensible parameter inferences. Although the effects here are not constrained to sum to zero, as in the case of sum contrasts for a fixed effects model, the prior encourages shrinkage towards zero, and so the random effect distribution can be thought of as a kind of soft version of a hard sum-to-zero constraint. From a predictive perspective, this model is much more powerful. In particular, using a random effects model, we can make strong predictions for unobserved groups (e.g. a ninth University), with sensible prediction intervals based on our inferred understanding of how similar different Universities are. Using a fixed effects model this isn’t really possible. Even for a Bayesian version of a fixed effects model using proper (but vague) priors, prediction intervals for unobserved groups are not really sensible.
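To make the predictive point concrete, here is one way (a sketch, not part of the original analysis; the node names thetanew and Xnew are illustrative) to extend the random effects model so that JAGS also simulates the predictive log-BMI distribution for an unobserved ninth University:

```r
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      theta[j]~dnorm(0,taut)
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    # predictive draws for an unobserved ninth University:
    # first a new effect from the random effect distribution,
    # then a new individual observation given that effect
    thetanew~dnorm(0,taut)
    Xnew~dnorm(mu+thetanew,tau)
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
    taut~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","thetanew","Xnew"),n.iter=100000,thin=10)
print(summary(output))
```

The posterior quantiles of Xnew then give a prediction interval for the log BMI of a new individual at a new University, properly accounting for uncertainty in both the new University effect and the individual-level variation.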
Since we have used simulated data here, we can compare the estimated random effects with the true effects generated during the simulation.
> apply(as.matrix(output),2,mean)
           mu           tau          taut      theta[1]      theta[2] 
 3.098813e+00  9.627110e+01  7.015976e+03  2.086581e-03 -3.935511e-03 
     theta[3]      theta[4]      theta[5]      theta[6]      theta[7] 
-1.389099e-02  1.881528e-02 -1.921854e-02  5.640306e-04  9.529532e-03 
     theta[8] 
 5.227518e-03 
> RE
[1]  0.002637034 -0.008294518 -0.014616348  0.016839902 -0.015443243
[6] -0.001908871  0.010162117  0.005471262
We see that the Bayesian random effects model has done an excellent job of estimation. If we wished, we could relax the assumption of common variance across the groups by making tau a vector indexed by j, though there is not much point in pursuing this here, since we know that the groups do all have the same variance.
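For completeness, relaxing the common-variance assumption is just a matter of giving each group its own precision. A sketch (not run in the original post; note that the scalar initial value for tau must be dropped now that tau is a vector node):

```r
data=list(X=X,n=n,p=p)
init=list(mu=2)  # tau is now a vector, so let JAGS initialise it
modelstring="
  model {
    for (j in 1:p) {
      theta[j]~dnorm(0,taut)
      tau[j]~dgamma(1,0.0001)  # group-specific observation precision
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau[j])
      }
    }
    mu~dnorm(0,0.0001)
    taut~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","taut","theta"),n.iter=100000,thin=10)
print(summary(output))
```

With our simulated data the posterior means of the tau[j] should all be similar, since the groups really do share a common variance.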
The above is the usual story regarding fixed and random effects in Bayesian inference. I hope this is reasonably clear, so really I should quit while I’m ahead… However, the issues are really a bit more subtle than I’ve suggested. The inferred precision of the random effects was around 7,000, so now let’s re-run the original, naive, “fixed effects” model with a strong subjective Bayesian prior on the distribution of the effects.
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      theta[j]~dnorm(0,7000)
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
This model also runs perfectly well and gives sensible inferences, despite the fact that the effects are iid from a fixed distribution and there is no hard constraint on the effects. Similarly, we can make sensible predictions, together with appropriate prediction intervals, for an unobserved group. So it isn’t so much the fact that the effects are coupled via an extra level of hierarchy that makes things work. It’s really the fact that the effects are sensibly distributed and not just sampled directly from a vague prior. So for “real” subjective Bayesians the line between fixed and random effects is actually very blurred indeed…
Please comment on the article here: Darren Wilkinson's research blog