Last week I read an interesting paper by Bob Rodriguez: "Statistical Model Building for Large, Complex Data: Five New Directions in SAS/STAT Software." In it, Rodriguez summarizes five modern techniques for building predictive models and highlights recent SAS/STAT procedures that implement those techniques. The paper discusses the following high-performance (HP) procedures in SAS/STAT software:
If you are unfamiliar with these newer procedures, I encourage you to read the Rodriguez paper. Although they are designed for distributed computations, you can also run these procedures in single-machine mode.
The GLMSELECT, HPGENSELECT, and (HP)QUANTSELECT procedures support choosing a small number of effects from among a large list of possible effects. Some statisticians call these "variable selection" procedures, but "effect selection" is a more appropriate term because an important use case is selecting important interaction effects from the list of all second-order interactions. Regardless of what you call them, the procedures automatically select a small number of effects that provide a good predictive model from among hundreds or thousands of possible effects. You can either accept the selected model or use traditional modeling techniques to refine the model from among the selected candidate effects.
While thinking about the use of the word "SELECT" in the names of these procedures, it occurred to me that there is another SAS/STAT procedure that contains the word SELECT, and that is the SURVEYSELECT procedure. However, the SURVEYSELECT procedure is different in that it randomly selects observations (rows in the design matrix) whereas the previous procedures select variables (columns in the design matrix).
This is not the only example of this observation-variable dichotomy in SAS/STAT. The cluster procedures all have "CLUS" as part of their names. The ACECLUS, CLUSTER, FASTCLUS, and MODECLUS procedures all attempt to group observations into clusters. However, the VARCLUS procedure is a dimension reduction technique that groups variables together.
The post Statistical model building and the SELECT procedures in SAS appeared first on The DO Loop.
Please comment on the article here: The DO Loop
The post Statistical model building and the SELECT procedures in SAS appeared first on All About Statistics.
My life in education has included being a high school maths teacher, then teaching at university for 20 years. I then made resources and gave professional development workshops for secondary school teachers. It was exciting to see the new statistics curriculum being implemented into the New Zealand schools. And now we are making resources and participating in the primary school sector. It is wonderful to learn from each level of teaching. We would all benefit from more discussion across the levels.
My father used to say (and the sexism has not escaped me) “Never run after a woman, a bus or an educational theory, as there will be another one along soon.” Education theories have lifespans, and some theories are more useful than others. I am not a fan of “learning styles” and fear they have served many students ill. However, there are some current ideas and idea-promoters in the teaching of mathematics that I find very attractive. I will begin with Jo Boaler, and intend to introduce you over the next few weeks to Dan Meyer, Carol Dweck and the authors of “Make It Stick.”
My first contact with Jo Boaler was reading “The Elephant in the Classroom.” In this Jo points out how society is complicit in the idea of a “maths brain”. Somehow it is socially acceptable to admit or be almost defensively proud of being “no good at maths”. A major problem with this is that her research suggests that later success in life is connected to attainment in mathematics. In order to address this, Jo explores a less procedural approach to teaching mathematics, including greater communication and collaboration.
It is interesting to see the effect Jo Boaler’s recent book, “Mathematical Mindsets”, is having on colleagues in the teaching profession. The maths advisors based in Canterbury NZ are strong proponents of her idea of “rich tasks”. Here are some tweets about the book:
“I am loving Mathematical Mindsets by @joboaler – seriously – everyone needs to read this”
“Even if you don’t teach maths this book will change how you teach for ever.”
“Hands down the most important thing I have ever read in my life”
What I get from Jo Boaler’s work is that we need to rethink how we teach mathematics. The methods that worked for mathematics teachers are not the methods we need to be using for everyone. The defence “The old ways worked for me” is not defensible in terms of inclusion and equity. I will not even try to boil down her approach in this post, but rather suggest readers visit her website and read the book!
At Statistics Learning Centre we are committed to producing materials that fit with sound pedagogical methods. Our Dragonistics data cards are perfect for use in a number of rich tasks. We are constantly thinking of ways to embed mathematics and statistics tasks into the curriculum of other subjects.
I am aware that many of you readers are not primary or secondary teachers. There are so many barriers to getting mathematics taught in a more exciting, integrated and effective way. Primary teachers are not mathematics specialists, and may well feel less confident in their maths ability. Secondary mathematics teachers may feel constrained by the curriculum and the constant assessment in the last three years of schooling in New Zealand. And tertiary teachers have little incentive to improve their teaching, as it takes time from the more valued work of research.
Though it would be exciting if Jo Boaler’s ideas and methods were espoused in their entirety at all levels of mathematics teaching, I am aware that this is unlikely – as in a probability of zero. However, I believe that all teachers at all levels can improve, even a little at a time. We at Statistics Learning Centre are committed to this vision. Through our blog, our resources, our games, our videos, our lessons and our professional development we aim to empower all teachers to teach statistics – better! We espouse the theories and teachings explained in Mathematical Mindsets, and hope that you also will learn about them, and endeavour to put them into place, whatever level you teach at.
Do tell us if Jo Boaler’s work has had an impact on what you do. How can the ideas apply at all levels of teaching? Do teachers need to have a growth mindset about their own ability to improve their teaching?
Here are some quotes to leave you with:
“Many parents have asked me: What is the point of my child explaining their work if they can get the answer right? My answer is always the same: Explaining your work is what, in mathematics, we call reasoning, and reasoning is central to the discipline of mathematics.”
“Numerous research studies (Silver, 1994) have shown that when students are given opportunities to pose mathematics problems, to consider a situation and think of a mathematics question to ask of it—which is the essence of real mathematics—they become more deeply engaged and perform at higher levels.”
“The researchers found that when students were given problems to solve, and they did not know methods to solve them, but they were given opportunity to explore the problems, they became curious, and their brains were primed to learn new methods, so that when teachers taught the methods, students paid greater attention to them and were more motivated to learn them. The researchers published their results with the title “A Time for Telling,” and they argued that the question is not “Should we tell or explain methods?” but “When is the best time to do this?””
“five suggestions that can work to open mathematics tasks and increase their potential for learning: Open up the task so that there are multiple methods, pathways, and representations. Include inquiry opportunities. Ask the problem before teaching the method. Add a visual component and ask students how they see the mathematics. Extend the task to make it lower floor and higher ceiling. Ask students to convince and reason; be skeptical.”
All quotes from
Jo Boaler, Mathematical Mindsets: Unleashing Students’ Potential through Creative Math, Inspiring Messages and Innovative Teaching
Please comment on the article here: Learn and Teach Statistics and Operations Research
Recently in the sister blog:
An object’s mental representation includes not just visible attributes but also its nonvisible history. The present studies tested whether preschoolers seek subtle indicators of an object’s history, such as a mark acquired during its handling. Five studies with 169 children 3–5 years of age and 97 college students found that children (like adults) searched for concealed traces of object history, invisible traces of object history, and the absence of traces of object history, to successfully identify an owned object. Controls demonstrated that children (like adults) appropriately limit their search for hidden indicators when an owned object is visibly distinct. Altogether, these results demonstrate that concealed and invisible indicators of history are an important component of preschool children’s object concepts.
The post “Children seek historical traces of owned objects” appeared first on Statistical Modeling, Causal Inference, and Social Science.
Shravan points us to this post from Jay Van Bavel a couple years ago. It’s an interesting example because Van Bavel expresses skepticism about the “power pose” hype but makes the same general mistake as Carney, Cuddy, Yap, and other researchers in this area: he overreacts to every bit of noise that’s been p-hacked and published.
Here’s Van Bavel:
Some of the new studies used different analysis strategies than the original paper . . . but they did find that the effects of power posing were replicable, if troubling. People who assume high-power poses were more likely to steal money, cheat on a test and commit traffic violations in a driving simulation. In one study, they even took to the streets of New York City and found that automobiles with more expansive driver’s seats were more likely to be illegally parked. . . .
Dr. Brinol [sic] and his colleagues found that power posing increased self-confidence, but only among participants who already had positive self-thoughts. In contrast, power posing had exactly the opposite effect on people who had negative self-thoughts. . . .
In two studies, Joe Cesario and Melissa McDonald found that power poses only increased power when they were made in a context that indicated dominance. Whereas people who held a power pose while they imagined standing at an executive desk overlooking a worksite engaged in powerful behavior, those who held a power pose while they imagined being frisked by the police actually engaged in less powerful behavior. . . .
In a way I like all this because it shows how the capitalize-on-noise strategy which worked so well for the original power pose authors can also be used to dismantle the whole idea. So that’s cool. But from a scientific point of view, I think there’s so much noise here that any of these interactions could well go in the opposite direction. Not to mention all the unstudied interactions and all the interactions that happened not to be statistically significant in these particular small samples.
I’m not trying to slam Bavel here. The above-linked post was published in 2013, before we were all fully aware of how easy it was for researchers to get statistical significance from noise, even without having to try. Now we know better: just cos some correlation or interaction appears in a sample, we don’t have to think it represents anything in the larger population.
The post “The Dark Side of Power Posing” appeared first on Statistical Modeling, Causal Inference, and Social Science.
In our last article on the algebra of classifier measures we encouraged readers to work through Nina Zumel’s original “Statistics to English Translation” series. This series has become slightly harder to find, as we have since used the original category designation “statistics to English translation” for additional work.
To make things easier, here are links to the original three articles, which work through scores and significance, and include a glossary.
A lot of what Nina is presenting can be summed up in the diagram below (also by her). If in the diagram the first row is truth (say red disks are infected), which classifier is the better initial screen for infection? Should you prefer the model 1 80% accurate row or the model 2 70% accurate row? This example helps break dependence on “accuracy as the only true measure” and promotes discussion of additional measures.
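To make the trade-off concrete, here is a small hypothetical example in Python (the counts are invented for illustration and are not Nina's actual diagram): a model can win on accuracy while being the worse screen.

```python
# Hypothetical screening example (counts invented for illustration):
# accuracy alone can favor the worse initial screen.
def metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (recall), and precision from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, sensitivity, precision

# Model 1: 80% accurate overall, but misses 8 of 10 infections.
m1 = metrics(tp=2, fp=12, fn=8, tn=78)
# Model 2: only 70% accurate, but catches all 10 infections.
m2 = metrics(tp=10, fp=30, fn=0, tn=60)

print(m1)  # high accuracy, poor sensitivity: a bad screen
print(m2)  # lower accuracy, perfect sensitivity: a better screen
```

For an initial screen, the second model's perfect sensitivity usually matters more than the first model's higher accuracy, since missed positives never reach follow-up testing.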
Please comment on the article here: Statistics – Win-Vector Blog
Someone writes in:
I have MS and take a disease-modifying drug called Copaxone. Sandoz developed a generic version of Copaxone and filed for FDA approval. Teva, the manufacturer of Copaxone, filed a petition opposing that approval (surprise!). FDA rejected Teva’s petitions and approved the generic.
My insurance company encouraged me to switch to the generic. Specifically, they increased the copay for the non-generic from $50 to $950 per month. That got my attention. My neurologist recommended against switching to the generic.
Consequently, I decided to try to review the FDA decision to see if I could get any insight into the basis for my neurologist’s recommendation.
What appeared on first glance to be a telling criticism of the Teva submission was a reference by the FDA to “non-standard statistical criteria,” together with the FDA’s statement that reanalysis with standard practices found different results than those found by Teva. So I looked back at the Teva filing to identify the non-standard statistical criteria they used. If I found the right part of the Teva filing, they used R packages named ComBat and LIMMA—both empirical Bayes tools.
Now, it is possible that I have made a mistake and have not properly identified the statistical criteria that the FDA found wanting. I was unable to find any specific statement w.r.t. the “non-standard” statistics.
But, if empirical Bayes works better than older methods, then falling back to older methods would result in weaker inferences—and the rejection of the data from Teva.
It seems to me that this case raises interesting questions about the adoption and use of empirical Bayes. How should the FDA have treated the “non-standard statistical criteria”? More generally, is there a problem with getting regulatory agencies to accept Bayesian models? Maybe there is some issue here that would be appropriate for a masters student in public policy.
My correspondent included some relevant documentation:
The FDA docket files are available at http://www.regulations.gov/#!docketBrowser;rpp=25;po=0;dct=SR;D=FDA-2015-P-1050
The text below is from the FDA Denial Letter to Teva, April 15, 2015, at pp. 41-42:
Specifically, we concluded that the mouse splenocyte studies were poorly designed, contained a high level of residual batch bias, and used non-standard statistical criteria for assessing the presence of differentially expressed genes. When FDA reanalyzed the microarray data from one Teva study using industry standard practices and criteria, Copaxone and the comparator (Natco) product were found to have very similar effects on the efficacy-related pathways proposed for glatiramer acetate’s mechanism of action.
The image below is from the Teva Petition, July 2, 2014 at p. 60
And he adds:
My interest in this topic arose only because of my MS treatment—I have had no contact with Teva, Sandoz, or the FDA. And I approve of the insurance company’s action—that is, I think that creating incentives to encourage consumers to switch to generic medicines is usually a good idea.
I have no knowledge of any of this stuff, but the interaction of statistics and policy seems generally relevant so I thought I would share this with all of you.
The post When do statistical rules affect drug approval? appeared first on Statistical Modeling, Causal Inference, and Social Science.
Beginning analysts and data scientists often ask: “how does one remember and master the seemingly endless number of classifier metrics?”
My concrete advice is:
That being said, there always seems to be a bit of gamesmanship: somebody always brings up yet another score, often apparently in the hope you may not have heard of it. Choice of measure can signal pedigree (precision/recall implies a data mining background, sensitivity/specificity a medical science background) while befuddling others.
The rest of this note is some help in dealing with this menagerie of common competing classifier evaluation scores.
Let’s define our terms. We are going to work with “binary classification” problems. These are problems where we have example instances (also called rows) that are either “in the class” (we will call these instances “true”) or not (and we will call these instances “false”). A classifier is a function that, given the description of an instance, tries to determine if the instance is in the class or not. The classifier may either return a decision of “positive”/“negative” (indicating the classifier thinks the instance is in or out of the class) or a probability score denoting the estimated probability of being in the class.
For decision based (or “hard”) classifiers (those returning only a positive/negative determination) the “confusion matrix” is a sufficient statistic in the sense it contains all of the information summarizing classifier quality. All other classification measures can be derived from it.
For a decision classifier (one that returns “positive” and “negative”, and not probabilities) the classifier’s performance is completely determined by four counts: TruePositives, FalsePositives, FalseNegatives, and TrueNegatives.
Notice true and false are being used to indicate if the classifier is correct (and not the actual category of each item) in these terms. This is traditional nomenclature. The first two quantities are where the classifier is correct (positive corresponding to true and negative corresponding to false) and the second two quantities count instances where the classifier is incorrect.
It is traditional to arrange these quantities into a 2 by 2 table called the confusion matrix. If we define:
library('ggplot2')
library('caret')
## Loading required package: lattice
library('rSymPy')
## Loading required package: rJython
## Loading required package: rJava
## Loading required package: rjson
A = Var('TruePositives')
B = Var('FalsePositives')
C = Var('FalseNegatives')
D = Var('TrueNegatives')
(Note all code shared here.)
Then the caret R package defines the confusion matrix (see help("confusionMatrix")) as follows:
Reference
Predicted Event No Event
Event A B
No Event C D
Reference is “ground truth” or actual outcome. We will call examples that have true ground truth “true examples” (again, please don’t confuse this with “TrueNegatives”, which are “false examples” correctly scored as being false). We would prefer to have the classifier indicate columns instead of rows, but we will use the caret notation for consistency.
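As a concrete sketch of this bookkeeping (in Python rather than the R used below, with invented example labels), the four counts can be tallied directly from paired predictions and ground truth:

```python
# Tally the four confusion-matrix counts from predictions and ground truth.
def confusion_counts(predicted, actual):
    """predicted/actual are sequences of booleans: True = in the class."""
    tp = sum(p and a for p, a in zip(predicted, actual))          # A: TruePositives
    fp = sum(p and not a for p, a in zip(predicted, actual))      # B: FalsePositives
    fn = sum(not p and a for p, a in zip(predicted, actual))      # C: FalseNegatives
    tn = sum(not p and not a for p, a in zip(predicted, actual))  # D: TrueNegatives
    return tp, fp, fn, tn

pred   = [True, True, False, False, True]
actual = [True, False, False, True, True]
print(confusion_counts(pred, actual))  # (2, 1, 1, 1)
```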
We can encode what we have written about these confusion matrix summaries as algebraic statements. Caret’s help("confusionMatrix") then gives us definitions of a number of common classifier scores:
# (A+C) and (B+D) are facts about the data, independent of classifier.
Sensitivity = A/(A+C)
Specificity = D/(B+D)
Prevalence = (A+C)/(A+B+C+D)
PPV = (Sensitivity * Prevalence)/((Sensitivity*Prevalence) + ((1-Specificity)*(1-Prevalence)))
NPV = (Specificity * (1-Prevalence))/(((1-Sensitivity)*Prevalence) + ((Specificity)*(1-Prevalence)))
DetectionRate = A/(A+B+C+D)
DetectionPrevalence = (A+B)/(A+B+C+D)
BalancedAccuracy = (Sensitivity+Specificity)/2
We can (from our notes) also define some more common metrics:
TPR = A/(A+C) # True Positive Rate
FPR = B/(B+D) # False Positive Rate
FNR = C/(A+C) # False Negative Rate
TNR = D/(B+D) # True Negative Rate
Recall = A/(A+C)
Precision = A/(A+B)
Accuracy = (A+D)/(A+B+C+D)
By writing everything down it becomes obvious that Sensitivity == TPR == Recall. That won’t stop somebody from complaining if you say “recall” when they prefer “sensitivity”, but that is how things are.
By declaring all of these quantities as sympy variables and expressions we can now check much more. We confirm formal equality of various measures by checking that their difference algebraically simplifies to zero.
# Confirm TPR == 1 - FNR
sympy(paste("simplify(",TPR-(1-FNR),")"))
## [1] "0"
# Confirm Recall == Sensitivity
sympy(paste("simplify(",Recall-Sensitivity,")"))
## [1] "0"
# Confirm PPV == Precision
sympy(paste("simplify(",PPV-Precision,")"))
## [1] "0"
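The same identities can be cross-checked outside R; here is a sketch in Python using only the standard library, sweeping a small grid of strictly positive counts with exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

# Exact-arithmetic cross-check (all counts kept positive to avoid 0/0):
# PPV == Precision and Sensitivity == Recall on a grid of confusion matrices.
for a, b, c, d in product(range(1, 5), repeat=4):
    A, B, C, D = (Fraction(x) for x in (a, b, c, d))
    n = A + B + C + D
    sensitivity = A / (A + C)
    specificity = D / (B + D)
    prevalence = (A + C) / n
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    assert ppv == A / (A + B)          # PPV == Precision
    assert sensitivity == A / (A + C)  # Sensitivity == Recall == TPR
print("identities hold exactly on the grid")
```

Exact fractions avoid the floating-point wiggle room that can mask (or fake) an identity in a numeric check.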
We can also confirm non-identity by simplifying and checking an instance:
# Confirm Precision != Specificity
expr <- sympy(paste("simplify(",Precision-Specificity,")"))
print(expr)
## [1] "(FalsePositives*TruePositives - FalsePositives*TrueNegatives)/(FalsePositives*TrueNegatives + FalsePositives*TruePositives + TrueNegatives*TruePositives + FalsePositives**2)"
sub <- function(expr,
TruePositives,FalsePositives,FalseNegatives,TrueNegatives) {
eval(expr)
}
sub(parse(text=expr),
TruePositives=0,FalsePositives=1,FalseNegatives=0,TrueNegatives=1)
## [1] -0.5
If we write the probability of a true (in-class) instance scoring higher than a false (not in class) instance, with a 1/2 point for ties, as Prob[score(true)>score(false)], we can then confirm Prob[score(true)>score(false)] == BalancedAccuracy for hard or decision classifiers by scoring each true/false pair of instances as follows:
A D : True Positive and True Negative: Correct sorting 1 point
A B : True Positive and False Positive (same prediction "Positive", different outcomes): 1/2 point
C D : False Negative and True Negative (same prediction "Negative", different outcomes): 1/2 point
C B : False Negative and False Positive: Wrong order 0 points
Then ScoreTrueGTFalse == Prob[score(true)>score(false)] (with 1/2 point for ties) is:
ScoreTrueGTFalse = (1*A*D + 0.5*A*B + 0.5*C*D + 0*C*B)/((A+C)*(B+D))
Which we can confirm is equal to balanced accuracy.
sympy(paste("simplify(",ScoreTrueGTFalse-BalancedAccuracy,")"))
## [1] "0"
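As an independent sanity check of the same identity (a Python sketch with exact fractions, separate from the sympy derivation): the pairwise ordering score equals balanced accuracy for every hard classifier on a small grid of counts.

```python
from fractions import Fraction
from itertools import product

# Check that the pairwise ordering score (half point for ties) equals
# balanced accuracy for every confusion matrix on a small grid of counts.
half = Fraction(1, 2)
for a, b, c, d in product(range(1, 5), repeat=4):
    A, B, C, D = (Fraction(x) for x in (a, b, c, d))
    score = (A * D + half * A * B + half * C * D) / ((A + C) * (B + D))
    balanced = (A / (A + C) + D / (B + D)) / 2
    assert score == balanced
print("ScoreTrueGTFalse == BalancedAccuracy on the grid")
```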
We can also confirm Prob[score(true)>score(false)] (with half point on ties) == AUC. We can compute the AUC (the area under the drawn curve) of the above confusion matrix by referring to the following diagram.
Then we can check for general equality:
AUC = (1/2)*FPR*TPR + (1/2)*(1-FPR)*(1-TPR) + (1-FPR)*TPR
sympy(paste("simplify(",ScoreTrueGTFalse-AUC,")"))
## [1] "0"
This AUC score (with half point credit on ties) equivalence holds in general (see also More on ROC/AUC, though I got this wrong the first time).
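The trapezoid decomposition can likewise be checked exactly; this short Python sketch confirms that the two-segment ROC area from the formula above reduces to (TPR + TNR)/2, i.e. balanced accuracy:

```python
from fractions import Fraction

# The two-segment ROC area of a hard classifier,
#   AUC = (1/2)*FPR*TPR + (1/2)*(1-FPR)*(1-TPR) + (1-FPR)*TPR,
# reduces algebraically to (TPR + TNR)/2, i.e. balanced accuracy.
for i in range(11):
    for j in range(11):
        tpr, fpr = Fraction(i, 10), Fraction(j, 10)
        auc = (Fraction(1, 2) * fpr * tpr
               + Fraction(1, 2) * (1 - fpr) * (1 - tpr)
               + (1 - fpr) * tpr)
        assert auc == (tpr + (1 - fpr)) / 2
print("AUC == BalancedAccuracy for hard classifiers")
```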
We can show F1 is different from BalancedAccuracy by plotting results where they differ:
# Wikipedia https://en.wikipedia.org/wiki/F1_score
F1 = 2*Precision*Recall/(Precision+Recall)
F1 = sympy(paste("simplify(",F1,")"))
print(F1)
## [1] "2*TruePositives/(FalseNegatives + FalsePositives + 2*TruePositives)"
print(BalancedAccuracy)
## [1] "TrueNegatives/(2*(FalsePositives + TrueNegatives)) + TruePositives/(2*(FalseNegatives + TruePositives))"
# Show F1 and BalancedAccuracy do not always vary together (even for hard classifiers)
F1formula = parse(text=F1)
BAformula = parse(text=BalancedAccuracy)
frm = c()
for(TotTrue in 1:5) {
for(TotFalse in 1:5) {
for(TruePositives in 0:TotTrue) {
for(TrueNegatives in 0:TotFalse) {
FalsePositives = TotFalse-TrueNegatives
FalseNegatives = TotTrue-TruePositives
F1a <- sub(F1formula,
TruePositives=TruePositives,FalsePositives=FalsePositives,
FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives)
BAa <- sub(BAformula,
TruePositives=TruePositives,FalsePositives=FalsePositives,
FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives)
if((F1a<=0)&&(BAa>0.5)) {
stop()
}
fi = data.frame(
TotTrue=TotTrue,
TotFalse=TotFalse,
TruePositives=TruePositives,FalsePositives=FalsePositives,
FalseNegatives=FalseNegatives,TrueNegatives=TrueNegatives,
F1=F1a,BalancedAccuracy=BAa,
stringsAsFactors = FALSE)
frm = rbind(frm,fi) # bad n^2 accumulation
}
}
}
}
ggplot(data=frm,aes(x=F1,y=BalancedAccuracy)) +
geom_point() +
ggtitle("F1 versus BalancedAccuracy/AUC")
F1 versus BalancedAccuracy/AUC
In various sciences over the years, more than 20 measures of “scoring correspondence” have been introduced, playing games with publication priority, symmetry, and incorporating significance (“chance adjustments”) directly into the measure.
Each measure presumably exists because it avoids flaws of all of the others. However, the sheer number of them (in my opinion) triggers what I call “De Morgan’s objection”:
If I had before me a fly and an elephant, having never seen more than one such magnitude of either kind; and if the fly were to endeavor to persuade me that he was larger than the elephant, I might by possibility be placed in a difficulty. The apparently little creature might use such arguments about the effect of distance, and might appeal to such laws of sight and hearing as I, if unlearned in those things, might be unable wholly to reject. But if there were a thousand flies, all buzzing, to appearance, about the great creature; and, to a fly, declaring, each one for himself, that he was bigger than the quadruped; and all giving different and frequently contradictory reasons; and each one despising and opposing the reasons of the others—I should feel quite at my ease. I should certainly say, My little friends, the case of each one of you is destroyed by the rest.
(Augustus De Morgan, “A Budget of Paradoxes” 1872)
There is actually an excellent literature stream investigating which of these measures are roughly equivalent (say arbitrary monotone functions of each other) and which are different (leave aside which are even useful).
Two excellent guides to this rat hole include:
Ackerman, M., & Ben-David, S. (2008). “Measures of clustering quality: A working set of axioms for clustering.”" Advances in Neural Information Processing Systems: Proceedings of the 2008 Conference.
Warrens, M. (2008). “On similarity coefficients for 2 × 2 tables and correction for chance.” Psychometrika, 73(3), 487–502.
The point is: you not only can get a publication trying to sort this mess, you can actually do truly interesting work trying to relate these measures.
One can take finding relations and invariants much further as in “Lectures on Algebraic Statistics” Mathias Drton, Bernd Sturmfels, Seth Sullivant, 2008.
It is a bit much to hope to only need to know “one best measure” or to claim to be familiar (let alone expert) in all plausible measures. Instead, find a few common evaluation measures that work well and stick with them.
Please comment on the article here: Statistics – Win-Vector Blog
Many modern algorithms in computational Bayesian statistics have at their heart a particle filter or some other sequential Monte Carlo (SMC) procedure. In this blog I’ve discussed particle MCMC algorithms which use a particle filter in the inner-loop in order to compute a (noisy, unbiased) estimate of the marginal likelihood of the data. These algorithms are often very computationally intensive, either because the forward model used to propagate the particles is expensive, or because the likelihood associated with each particle/observation is expensive (or both). In this case it is desirable to parallelise the particle filter to run on all available cores of a machine, or in some cases, it would even be desirable to distribute the particle filter computation across a cluster of machines.
Parallelisation is difficult when using the conventional imperative programming languages typically used in scientific and statistical computing, but is much easier using modern functional languages such as Scala. In fact, in languages such as Scala it is possible to describe algorithms at a higher level of abstraction, so that exactly the same algorithm can run in serial, run in parallel across all available cores on a single machine, or run in parallel across a cluster of machines, all without changing any code. Doing so renders parallelisation a non-issue. In this post I’ll talk through how to do this for a simple bootstrap particle filter, but the same principle applies for a large range of statistical computing algorithms.
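As a language-neutral illustration of the idea (a sketch in Python rather than the Scala used in this post, with an invented toy model), the resampling step of a bootstrap particle filter can be written against an abstract mapper, so serial and parallel execution share one implementation:

```python
import random
from multiprocessing.dummy import Pool  # thread pool, standing in for a parallel collection

def pf_step(particles, y, propagate, likelihood, mapper=map):
    """One bootstrap-filter step: propagate, weight, resample.

    `mapper` abstracts the collection: pass the builtin `map` for serial
    execution or a pool's `.map` for parallel execution; the algorithm
    itself is unchanged.
    """
    moved = list(mapper(propagate, particles))                # forward-simulate each particle
    weights = list(mapper(lambda x: likelihood(y, x), moved))  # weight against the observation
    total = sum(weights)
    probs = [w / total for w in weights]
    resampled = random.choices(moved, weights=probs, k=len(moved))
    return resampled, total / len(moved)  # particles + marginal-likelihood increment

# Toy state-space model (invented): x' = 0.9 x + noise, y ~ N(x, 1).
random.seed(1)
propagate = lambda x: 0.9 * x + random.gauss(0.0, 0.5)
likelihood = lambda y, x: 2.718281828 ** (-0.5 * (y - x) ** 2)

particles = [random.gauss(0.0, 1.0) for _ in range(1000)]
serial, ml_serial = pf_step(particles, 0.3, propagate, likelihood)  # builtin map: serial
with Pool(4) as pool:                                                # pool.map: parallel
    parallel, ml_par = pf_step(particles, 0.3, propagate, likelihood, mapper=pool.map)
print(len(serial), len(parallel))
```

This is only a sketch of the design principle; the post's Scala typeclass approach achieves the same swap at compile time, and extends to distributed collections.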
In the previous post I gave a quick introduction to the monad concept, and to monadic collections in particular. Many computational tasks in statistics can be accomplished using a sequence of operations on monadic collections. We would like to write code that is independent of any particular implementation of a monadic collection, so that we can switch to a different implementation without changing the code of our algorithm (for example, switching from a serial to a parallel collection). But in strongly typed languages we need to know at compile time that the collection we use has the methods that we require. Typeclasses provide a nice solution to this problem. I don’t want to get bogged down in a big discussion about Scala typeclasses here, but suffice to say that they describe a family of types conforming to a particular interface in an ad hoc loosely coupled way (they are said to provide ad hoc polymorphism). They are not the same as classes in traditional O-O languages, but they do solve a similar problem to the adaptor design pattern, in a much cleaner way. We can describe a simple typeclass for our monadic collection as follows:
trait GenericColl[C[_]] {
  def map[A, B](ca: C[A])(f: A => B): C[B]
  def reduce[A](ca: C[A])(f: (A, A) => A): A
  def flatMap[A, B, D[B] <: GenTraversable[B]](ca: C[A])(f: A => D[B]): C[B]
  def zip[A, B](ca: C[A])(cb: C[B]): C[(A, B)]
  def length[A](ca: C[A]): Int
}
In the typeclass we just list the methods that we expect our generic collection to provide, but do not say anything about how they are implemented. For example, we know that operations such as map and reduce can be executed in parallel, but this is a separate concern. We can now write code that can be used for any collection conforming to the requirements of this typeclass. The full code for this example is provided in the associated github repo for this blog, and includes the obvious syntax for this typeclass, and typeclass instances for the Scala collections Vector and ParVector, that we will exploit later in the example.
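For concreteness, a typeclass instance for the serial Vector collection might look something like the following sketch (the instances in the github repo may differ in detail; the ParVector instance is analogous, differing only in the collection type):

```scala
import scala.collection.GenTraversable

// The GenericColl trait from above, repeated here for self-containment
trait GenericColl[C[_]] {
  def map[A, B](ca: C[A])(f: A => B): C[B]
  def reduce[A](ca: C[A])(f: (A, A) => A): A
  def flatMap[A, B, D[B] <: GenTraversable[B]](ca: C[A])(f: A => D[B]): C[B]
  def zip[A, B](ca: C[A])(cb: C[B]): C[(A, B)]
  def length[A](ca: C[A]): Int
}

// A serial Vector instance: each method simply delegates to the
// corresponding method on the underlying Scala collection
implicit val vectorColl: GenericColl[Vector] = new GenericColl[Vector] {
  def map[A, B](ca: Vector[A])(f: A => B): Vector[B] = ca map f
  def reduce[A](ca: Vector[A])(f: (A, A) => A): A = ca reduce f
  def flatMap[A, B, D[B] <: GenTraversable[B]](ca: Vector[A])(f: A => D[B]): Vector[B] =
    ca flatMap (f(_))
  def zip[A, B](ca: Vector[A])(cb: Vector[B]): Vector[(A, B)] = ca zip cb
  def length[A](ca: Vector[A]): Int = ca.length
}
```

Since the instance just forwards to the collection's own methods, swapping in a parallel collection changes the execution strategy without touching any algorithm code.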
We can now write some code for a single observation update of a bootstrap particle filter.
def update[S: State, O: Observation, C[_]: GenericColl](
  dataLik: (S, O) => LogLik, stepFun: S => S
)(x: C[S], o: O): (LogLik, C[S]) = {
  val xp = x map (stepFun(_))
  val lw = xp map (dataLik(_, o))
  val max = lw reduce (math.max(_, _))
  val rw = lw map (lwi => math.exp(lwi - max))
  val srw = rw reduce (_ + _)
  val l = rw.length
  val z = rw zip xp
  val rx = z flatMap (p => Vector.fill(Poisson(p._1 * l / srw).draw)(p._2))
  (max + math.log(srw / l), rx)
}
This is a very simple bootstrap filter, using Poisson resampling for simplicity and data locality, but does include use of the log-sum-exp trick to prevent over/underflow of raw weight calculations, and tracks the marginal (log-)likelihood of the observation. With this function we can now pass in a “prior” particle distribution in any collection conforming to our typeclass, together with a propagator function, an observation (log-)likelihood, and an observation, and it will return back a new collection of particles of exactly the same type that was provided for input. Note that all of the operations we require can be accomplished with the standard monadic collection operations declared in our typeclass.
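To make the log-sum-exp trick concrete, here is a tiny stand-alone illustration (not part of the filter code): subtracting the maximum log-weight before exponentiating pins the largest term at exp(0) = 1, so nothing underflows, and the subtracted constant is restored on the log scale.

```scala
// Naive exponentiation of very negative log-weights underflows to zero
val lw = Vector(-1000.0, -1001.0, -1002.0)
val naive = lw.map(math.exp)                     // all 0.0: information lost

// Log-sum-exp: shift by the max before exponentiating
val m = lw.max
val shifted = lw.map(lwi => math.exp(lwi - m))   // largest term is exactly 1.0
val logSum = m + math.log(shifted.sum)           // finite log of the weight sum
```

The same shift appears in the update function above, where max is added back into the returned (log-)likelihood increment.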
Once we have a function for executing one step of a particle filter, we can produce a function for particle filtering as a functional fold over a sequence of observations:
def pFilter[S: State, O: Observation, C[_]: GenericColl, D[O] <: GenTraversable[O]](
  x0: C[S], data: D[O], dataLik: (S, O) => LogLik, stepFun: S => S
): (LogLik, C[S]) = {
  val updater = update[S, O, C](dataLik, stepFun) _
  data.foldLeft((0.0, x0))((prev, o) => {
    val next = updater(prev._2, o)
    (prev._1 + next._1, next._2)
  })
}
Folding data structures is a fundamental concept in functional programming, and is exactly what is required for any kind of filtering problem. Note that Brian Beckman has recently written a series of articles on Kalman filtering as a functional fold.
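To see the fold pattern in miniature outside the particle filter context, here is exponential smoothing (a much simpler filter) written as exactly the same kind of foldLeft, threading its state through the accumulator just as pFilter threads the (log-likelihood, particles) pair:

```scala
// Exponential smoothing as a functional fold over observations.
// The accumulator carries the filter state (here, the smoothed history).
val obs = List(1.0, 2.0, 4.0, 3.0)
val alpha = 0.5
val smoothed = obs.tail.foldLeft(List(obs.head)) { (acc, o) =>
  (alpha * o + (1 - alpha) * acc.head) :: acc
}.reverse
// smoothed: List(1.0, 1.5, 2.75, 2.875)
```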
So far we haven’t said anything about parameters or parameter estimation, but this is appropriate, since parametrisation is a separate concern from filtering. However, once we have a function for particle filtering, we can produce a function concerned with evaluating marginal likelihoods trivially:
def pfMll[S: State, P: Parameter, O: Observation, C[_]: GenericColl, D[O] <: GenTraversable[O]](
  simX0: P => C[S], stepFun: P => S => S,
  dataLik: P => (S, O) => LogLik, data: D[O]
): (P => LogLik) =
  (th: P) => pFilter(simX0(th), data, dataLik(th), stepFun(th))._1
Note that this higher-order function does not return a likelihood value directly, but instead a function which will accept a parameter as input and return a (log-)likelihood as output. This can then be used for parameter estimation purposes, perhaps being used in a PMMH pMCMC algorithm, or something else. Again, this is a separate concern.
Here I’ll just give a completely trivial toy example, purely to show how the functions work. For avoidance of doubt, I know that there are many better/simpler/easier ways to tackle this problem! Here we will just look at inferring the auto-regression parameter of a linear Gaussian AR(1)-plus-noise model using the functions we have developed.
First we can simulate some synthetic data from this model, using a value of 0.8 for the auto-regression parameter:
val inNoise = Gaussian(0.0, 1.0).sample(99)
val state = DenseVector(inNoise.scanLeft(0.0)((s, i) => 0.8 * s + i).toArray)
val noise = DenseVector(Gaussian(0.0, 2.0).sample(100).toArray)
val data = (state + noise).toArray.toList
Now assuming that we don’t know the auto-regression parameter, we can construct a function to evaluate the likelihood of different parameter values as follows:
val mll = pfMll(
  (th: Double) => Gaussian(0.0, 10.0).sample(10000).toVector.par,
  (th: Double) => (s: Double) => Gaussian(th * s, 1.0).draw,
  (th: Double) => (s: Double, o: Double) => Gaussian(s, 2.0).logPdf(o),
  data
)
Note that the 4 characters “.par” at the end of line 2 are the only difference between running this code serially or in parallel! Now we can run this code by calling the returned function with different values. So, hopefully mll(0.8) will return a larger log-likelihood than (say) mll(0.6) or mll(0.9). The example code in the github repo plots the results of calling mll() for a range of values (note that if that were the genuine use case, then it would be much better to parallelise the parameter range than the particle filter, due to providing better parallelisation granularity, but many other examples require parallelisation of the particle filter itself). In this particular example, both the forward model and the likelihood are very cheap operations, so there is little to be gained from parallelisation. Nevertheless, I still get a speedup of more than a factor of two using the parallel version on my laptop.
In this post we have shown how typeclasses can be used in Scala to write code that is parallelisation-agnostic. Code written in this way can be run on one or many cores as desired. We’ve illustrated the concept with a scalable particle filter, but nothing about the approach is specific to that application. It would be easy to build up a library of statistical routines this way, all of which can effectively exploit available parallel hardware. Further, although we haven’t demonstrated it here, it is trivial to extend this idea to allow code to be distributed over a cluster of parallel machines if necessary. For example, if an Apache Spark cluster is available, it is easy to make a Spark RDD instance for our generic collection typeclass, that will then allow us to run our (unmodified) particle filter code over a Spark cluster. This emphasises the fact that Spark can be useful for distributing computation as well as just processing “big data”. I’ll say more about Spark in subsequent posts.
Please comment on the article here: Darren Wilkinson's research blog
The post Bayesian predicted slopes with interaction in multiple linear regression appeared first on All About Statistics.
[Figure: after Figure 18.9, p. 529, of DBDA2E]
Please comment on the article here: Doing Bayesian Data Analysis
The post What if the RNC assigned seating randomly appeared first on All About Statistics.
The punditry has spoken: the most important data question at the Republican Convention is where different states are located. Here is the FiveThirtyEight take on the matter:
They crunched some numbers and argue that Trump's margin of victory in the state primaries is the best indicator of how close to the front that state's delegation is situated.
Others have put this type of information on a map:
The scatter plot with the added "trendline" is often misleading. Your eyes are drawn to the line, and distracted from the points that are far away from the line. In fact, the R-squared of the regression line is only about 20%. This is quite obvious from the distribution of green shades in the map below.
***
So, I wanted to investigate the question of how robust this regression line is. The way statisticians address this question is as follows: imagine that the seating has been assigned completely at random - how likely would the actual seating plan have arisen from random assignment?
Take the seating assignments from the scatter plot. Then randomly shuffle the assignments to create simulated random seating plans. We keep the same slots: for example, four states were given #1 positions in the actual arrangement, so in every simulation four states get #1 positions - it's just that which four states was decided by flipping coins.
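In code, the shuffling procedure might be sketched like this (the margin and seat numbers below are made-up placeholders, not the real data, which I don't have - see the PS at the end):

```scala
import scala.util.Random

// Ordinary least-squares slope of y on x
def slope(x: Seq[Double], y: Seq[Double]): Double = {
  val (mx, my) = (x.sum / x.length, y.sum / y.length)
  val cov = (x zip y).map { case (xi, yi) => (xi - mx) * (yi - my) }.sum
  val varx = x.map(xi => (xi - mx) * (xi - mx)).sum
  cov / varx
}

// Hypothetical placeholders for the real spreadsheet
val margins = Vector(0.05, 0.20, 0.35, 0.10, 0.25)  // Trump margin by state
val seats   = Vector(40.0, 10.0, 5.0, 30.0, 15.0)   // seating position by state

// Shuffling seats keeps the same slots but randomises which state gets each;
// refitting the slope each time gives the null distribution of slopes
val simSlopes = Vector.fill(100) {
  slope(margins, Random.shuffle(seats))
}
```

Comparing the actual plan's slope against the mean and standard deviation of simSlopes is what gives the "standard errors" measure of rarity discussed below.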
I did one hundred simulated seating plans at a time. For each plan, I created the scatter plot of seating position versus Trump margin (mirror image of the FiveThirtyEight chart), and fitted a regression line. The following shows the slopes of the first 200 simulations:
The more negative the slope, the more power Trump margin has in explaining the seating arrangement.
Notice that even though all these plans were created at random, the magnitudes of the slopes range widely. In fact, one randomly created plan sits right below the actual RNC plan shown in red. So it is possible--but very unlikely--that the RNC plan was drawn up at random.
Another view of this phenomenon is the histogram of the slopes:
This again shows that the actual seating plan is very unlikely to be produced by a random number generator. (I plotted 500 simulations here.)
In statistics, we measure rarity by "standard errors". The actual plan is almost but not quite three standard errors away from the average random plan. A rule of thumb is that 3 standard errors or more is rare. (This corresponds to over 99% confidence.)
PS. Does anyone have the data corresponding to the original scatter plot? There are other things I want to do with the data but I'd need to find (a) the seating position by state and (b) the primary results nicely set in a spreadsheet.
Please comment on the article here: Junk Charts