
The post PhD positions in Probabilistic Machine Learning at #AaltoPML group Finland appeared first on Statistical Modeling, Causal Inference, and Social Science.


There are PhD positions in our Probabilistic Machine Learning group at Aalto, Finland, and altogether 15 positions in the Helsinki ICT network. Apply here.

The most interesting topic in the call is supervised by Prof. Samuel Kaski at AaltoPML (and you may collaborate with me too :)

We are looking for PhD candidates interested in probabilistic modeling and machine learning, both theory and applications. Main keywords include Bayesian inference and multiple data sources. Strong application areas with excellent collaboration opportunities are: personalized medicine, bioinformatics, user interaction, brain signal analysis, information visualization and intelligent information access. The group has several excellent postdocs who participate in supervision. We belong to the Finnish Center of Excellence in Computational Inference Research COIN.

Although this description doesn’t mention it, the research may also be related to Stan.

And before Andrew comments, I'll just say that right now, in the winter, southern Finland is warmer than New York or Iceland!





The post Primed to lose appeared first on Statistical Modeling, Causal Inference, and Social Science.


David Hogg points me to a recent paper, “A Social Priming Data Set With Troubling Oddities” by Hal Pashler, Doug Rohrer, Ian Abramson, Tanya Wolfson, and Christine Harris, which begins:

Chatterjee, Rose, and Sinha (2013) presented results from three experiments investigating social priming—specifically, priming effects induced by incidental exposure to concepts relating to cash or credit cards. They reported that exposing people to cash concepts made them less generous with their time and money, whereas exposing them to credit card concepts made them more generous.

The effects reported in the Chatterjee et al. paper were large—suspiciously large.

Last year, I wrote about a study whose results were stunningly large. It was only after I learned the data had been faked—it was the notorious LaCour and Green voter canvassing paper—that I ruefully wrote that sometimes a claim that is too good to be true, isn’t.

Pashler et al. skipped my first step and went straight to the data. After some statistical detective work, they conclude:

We are not in a position to determine exactly what series of actions and events could have resulted in this pattern of seemingly corrupted data. In our view, given the results just described, possibilities that would need to be considered would include (a) human error, (b) computer error, and (c) deliberate data fabrication.

And:

In our opinion based solely on the analyses just described, the findings do seem potentially consistent with the disturbing third possibility: that the data records that contributed most to the priming effect were injected into the data set by means of copy-and-paste steps followed by some alteration of the pasted strings in order to mask the abnormal provenance of these data records that were driving the key effect.

Oof!

**No coincidence that we see fraud (or extreme sloppiness) in priming studies**

How did we get to this point?

Do you think Chatterjee et al. wanted to fabricate data (if that’s what they did) or do incredibly sloppy data processing (if that’s what happened)? Do you think that, when Chatterjee, Rose, and Sinha were in grad school studying psychology or organizational behavior or whatever, they thought, When I grow up I want to be running my data through the washing machine?

No, of course not.

They were *driven* to cheat, or to show disrespect for their data, because there was nothing there for them to find (or, to be precise, that any effects that *were* there, were too small and too variable for them to have any chance of detecting; click on above kangaroo image for a fuller explanation of this point).

Nobody wants to starve. If there’s no fruit on the trees, people will forage through the weeds looking for vegetables. If there’s nothing there, they’ll start to eat dirt. The low quality of research in these subfields of social psychology is a direct consequence of there being nothing there to study. Or, to be precise, it’s a direct consequence of effects being small and highly variable across people and situations.

I’m sure these researchers would’ve loved to secure business-school teaching positions by studying large and real effects. But, to continue my analogy, they got stuck in a barren patch of the forest, eating dirt and tree bark in a desperate attempt to stay viable. It’s not a pretty sight. But I can see how it can happen. I blame them, sure (just as I blame myself for the sloppiness that led to my two erroneous published papers). But I also blame the system, the advisors and peers and journal editors and Ted talk impresarios who misled them into thinking that they were working in a productive area of science, when they weren’t. They were blindfolded and taken into some area of the outback that had nothing to eat.

Outback, huh? I just realized what I wrote. It was unintentional, and I think I was primed by the kangaroo picture.

In all seriousness, I have no doubt that priming occurs—I see it all the time in my own life. My skepticism is with the claim of huge indirect priming effects. As Wagenmakers et al. put it, quoting Hal Pashler, “disbelief does in fact remain an option.” Especially because, as discussed in the present post, if these effects *were* really present, they’d be interfering with each other all over the place, and these sorts of crude experiments wouldn’t work anyway.

**It’s all about the incentives**

So . . . you take a research area with small and highly variable effects, but where this is not well understood, so you can get publications in top journals with statistically significant results . . . this creates very little incentive to do careful research. I mean, what’s the point? If there’s essentially nothing going on and you’re gonna have to p-hack your data anyway, why not just jump straight to the finish line? Chatterjee et al. could’ve spent 3 years collecting data on 1000 people, and they still probably would’ve had to twist the data to get what they needed for publication.

And that’s the other side of the coin. Very little incentive to do careful research, but a very big incentive to cheat or to be so sloppy with your data that maybe you can happen upon a statistically significant finding.

Bad bad incentives + Researchers in a tough position with their careers = Bad situation.




The post New Judea Pearl Causal Inference "Primer" appeared first on All About Statistics.

Should be a fun and informative read. Check out the contents and various chapters here.

"Causal Inference in Statistics - A Primer"

by J. Pearl, M. Glymour and N. Jewell

Available now on Kindle.

Available in print Feb. 26, 2016.

http://www.amazon.com/Causality-A-Primer-Judea-Pearl/dp/1119186846

http://www.amazon.com/Causal-Inference-Statistics-Judea-Pearl-ebook/dp/B01B3P6NJM/ref=mt_kindle?_encoding=UTF8&me=

**Please comment on the article here:** **No Hesitations**


The post Using SVG graphics in blog posts appeared first on All About Statistics.

My traditional workflow for embedding R graphics into a blog post has been via PNG files that I upload online. However, when I created a 'simple' graphic with only basic curves and triangles for a recent post, I noticed that the PNG output didn't look as crisp as I expected it to. So, eventually I used an SVG (scalable vector graphic) instead.

Creating an SVG file with R couldn't be easier; e.g. use the `svg()` function in the same way as `png()`. Next, make the file available online and embed it into your page. There are many ways to do this; in the example here I placed the file into a public GitHub repository.

To embed the figure into my page I could use either the traditional `<img>` tag, or perhaps better the `<object>` tag. Paul Murrell provides further details on his blog. With `<object>` my code looks like this:

```html
<object data="https://rawgithub.com/mages/diesunddas/master/Blog/transitionPlot.svg"
        type="image/svg+xml" width="400"> </object>
```

There is a little trick required to display a graphic file hosted on GitHub. By default, when I look for the raw URL, GitHub will provide an address starting with `https://raw.githubusercontent.com/...`, which needs to be replaced with `https://rawgithub.com/...`.

Ok, let's look at the output. As a nice example plot I use a `transitionPlot` by Max Gordon, something I wanted to do for a long time.

Yet, I don't think that SVG is always a good answer. The file size of an SVG file can grow quite quickly if there are many points to be plotted. As an example, check the difference in file size for two identical plots with 10,000 points.

```r
x <- rnorm(10000)

png()
plot(x)
dev.off()
file.size("Rplot001.png")/1000
# [1] 118.071

svg()
plot(x)
dev.off()
file.size("Rplot001.svg")/1000
# [1] 3099.181
```

That's 3.1 MB vs 118 kB, a factor of 26! Even compressed to a .svgz file, the SVG file is still 317 kB.

```r
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets
[7] methods   base

other attached packages:
[1] RColorBrewer_1.1-2 Gmisc_1.3          htmlTable_1.5
[4] Rcpp_0.12.3

loaded via a namespace (and not attached):
 [1] Formula_1.2-1       knitr_1.12.3
 [3] cluster_2.0.3       magrittr_1.5
 [5] splines_3.2.3       munsell_0.4.2
 [7] colorspace_1.2-6    lattice_0.20-33
 [9] stringr_1.0.0       plyr_1.8.3
[11] tools_3.2.3         nnet_7.3-12
[13] gtable_0.1.2        latticeExtra_0.6-26
[15] htmltools_0.3       digest_0.6.9
[17] forestplot_1.4      survival_2.38-3
[19] abind_1.4-3         gridExtra_2.0.0
[21] ggplot2_2.0.0       acepack_1.3-3.3
[23] rsconnect_0.3.79    rpart_4.1-10
[25] rmarkdown_0.9.2     stringi_1.0-1
[27] scales_0.3.0        Hmisc_3.17-1
[29] XML_3.98-1.3        foreign_0.8-66
```

**Please comment on the article here:** **mages' blog**


The post The State of Information Visualization, 2016 appeared first on All About Statistics.

Oh hello, new year! I almost didn’t see you there! Lots of interesting things happened last year: *Dear Data*, deceptive visualization, storytelling research, new tools and ideas, etc. And this year is already shaping up to be quite strong, too.

Perhaps the most exciting project of 2015 was *Dear Data* by Giorgia Lupi and Stefanie Posavec. They are both designers, and they decided to collect data and send each other postcards with hand-drawn visualizations based on that every week. The topic is also a different one every week, and they’re often very personal. It’s a unique and very different project, with a lot of creativity in the ways data is displayed.

Somewhat related, there was a great paper at EuroVis last year on data sketching, drawing data by hand. Tools are clearly helpful when dealing with data, but they also tend to shape the things people do with them – they make some things easier than others, and obviously always have limitations. Sketching allows for more thinking outside the box and more creativity.

Academic visualization research can be trapped inside a bubble and not deal with issues people actually encounter out in the world. That is why I really liked the work on deceptive visualizations. It put some science behind issues that some people are aware of, but that so far have mostly been based on assumptions and hearsay. Do cropped bars mislead? Does inverting an axis make a difference? Is aspect ratio important?

The point was not so much that the results were surprising (for the most part, they weren’t), but that these things were actually tested rather than just stated as fact. It still amazes me how many things we simply take for granted in visualization without questioning them – and when we finally do, we find that they’re not based on actual research.

Along similar lines, Drew Skau and I looked at bar chart embellishments common in infographics and found that some of them aren’t that problematic – though some clearly are. Again, the point here being actual science rather than just assumptions.

One of the big issues in data visualization is cleaning data and wrangling it into a shape that can then be used in a visualization tool. Trifacta Wrangler is a great tool for that, and it’s free to use (with some size limitations, though they’re quite generous).

I recently heard somebody describe his work as “Living in the Hadleyverse” – a reference to Hadley Wickham and his untiring efforts to create better tools for both data analysis and visualization in R. Between ggplot, dplyr, and the up-and-coming ggvis, R is getting very powerful support to deal with large datasets, talk directly to databases, and create interactive visualizations for the web.

Sadly, last year also saw the death of *Many Eyes*. While not exactly a surprise after years of neglect, it did mean the end of the first really successful and widely used web-based visualization platforms. Many Eyes was not just a collection of tools, they were also ambitious about doing research and pushing the envelope on things like text visualization and figuring out user preferences. Alas, IBM did not seem to see the value and finally folded the project into Watson Analytics late last year.

In the process, they did release *Brunel*, a language for creating visualizations on the web based on the Grammar of Graphics. This had originally been developed as the new technology to power Many Eyes, under the name *RAVE*. I’m not sure if Brunel has any chance of catching on, given the popularity of D3. But it appears to be an interesting piece of technology.

I’m actually writing this while attending a seminar on *Data-Driven Storytelling* at Schloss Dagstuhl. There are 40 people here, with a good number of journalists and designers mixed into the usual group of academics. That such a seminar can happen is a sign that storytelling in visualization is here to stay.

This isn’t quite reflected in the papers at IEEE VIS or EuroVis yet, but I expect that to change this year. Oddly, the conference that had an entire session on storytelling last year was CHI – even though that is not a core visualization conference. The entire visualization track there was pretty strong.

I was one of the authors of the paper on ISOTYPE at CHI, and also the almost-published one on the connected scatterplot. I also wrote about presentation-oriented visualization techniques.

On the academic side, I expect to see a lot more work on storytelling at the conferences, hopefully enough to finally get entire sessions. There is a lot of energy here at Dagstuhl right now, and many topics and issues to tackle. My hope is also that we can involve practitioners in this work more than we usually do.

A big driver of data visualization in the news will be the elections in the U.S. in November. There will be polls, predictions, lots of data-centric news stories, and just generally a fever pitch of data presentation. Exciting times!

**Please comment on the article here:** **eagereyes**


The post Neglected optimization topic: set diversity appeared first on All About Statistics.

The mathematical concept of set diversity is a somewhat neglected topic in current applied decision sciences and optimization. We take this opportunity to discuss the issue.

Consider the following problem: for a number of items `U = {x_1, ..., x_n}` pick a small set of them `X = {x_i1, x_i2, ..., x_ik}` such that there is a high probability one of the `x in X` is a “success.” By success I mean some standard business outcome such as making a sale (in the sense of any of: propensity, appetency, up-selling, and uplift modeling), clicking an advertisement, adding an account, finding a new medicine, or learning something useful.

This is common in:

- Search engines. The user is presented with a page consisting of “top results” with the hope that one of the results is what the user wanted.
- Online advertising. The user is presented with a number of advertisements and enticements in the hope that one of them matches user taste.
- Science. A number of molecules are simultaneously presented to biological assay hoping that at least one of them is a new drug candidate, or that the simultaneous set of measurements shows us where to experiment further.
- Sensor/guard placement. Overlapping areas of coverage don’t make up for uncovered areas.
- Machine learning method design. The random forest algorithm requires diversity among its sub-trees to work well. It tries to ensure this by both per-tree variable selection and re-sampling (some of these issues discussed here).

In this note we will touch on key applications and some of the theory involved. While our group specializes in practical data science implementations, applications, and training, our researchers experience great joy when they can re-formulate a common problem using known theory/math and the reformulation is game changing (as it is in the case of set-scoring).

Minimal spanning trees, the basis of one set diversity metric.

Typically a first step towards solving this problem is to build a model `z()` such that `z(x_i)` is a good estimate of the probability of a single item `x_i` “being a success.”

The usual ways to evaluate a score `z()` (precision, recall, accuracy, sensitivity, specificity, deviance, ROC/AUC) deliberately (for the sake of simplicity) ignore any intended business use or application of the model score `z()`. What we mean is: using a score like `z()` to rank or sort individual items does not necessarily solve the original problem of building a good small *set* of results. Good sorting may not be enough to ensure a good set.

To explain: suppose we present a small number of candidates (say 5) and consider the entire process a success if at least one of them is good. We assume our application doesn’t particularly value having multiple competing successes, as we expect successes to be mutually exclusive for any number of reasons including:

- The user might pursue only one opportunity no matter how many seem to match.
- The probabilities may be modeling a mixture of disjoint mechanisms. The user may have only one (unknown) taste, and we increase our odds of hitting it through diverse selection.

What we have to keep in mind: different sets of 5 candidate matches, where each individual match has a 20% chance of working, can have an overall chance of containing any match at all (our goal) anywhere from 100% (when the matching mechanisms are completely disjoint and complementary) down to 20% (when the items are near duplicates of each other). The ability to at least attempt to bias our selection toward a much higher “joint value” set can make a *huge* operational difference.
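To make the range concrete, here is a minimal Python sketch (ours, not from the original post) computing the set-level success probability for 5 candidates, each individually 20% likely to succeed, under three dependence structures:

```python
# Illustrative sketch: probability a 5-item set yields at least one success,
# for three dependence structures among the items.

p = 0.20  # individual success probability
k = 5     # set size

# Disjoint/complementary mechanisms: successes never overlap,
# so probabilities add (capped at 1): 5 * 0.20 = 1.00.
disjoint = min(1.0, k * p)

# Independent items: 1 - P(all fail) = 1 - 0.8^5.
independent = 1 - (1 - p) ** k

# Near-duplicate items: all five succeed or fail together.
duplicates = p

print(disjoint, round(independent, 5), duplicates)
```

The set-level success probability spans the whole range from 0.20 (duplicates) up to 1.00 (disjoint), with independence in between, even though every item looks identical to an item-level score.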

Of course, in discussing the implications of how we are going to apply or use a model we are moving from machine learning into the field of operations research, but let us continue.

We are trying to pick a small set `X` so that the probability there is an `x in X` such that `x` is a “success” is maximized. This is different than picking `X` such that the expected number of successes in `X` is maximized, and different than picking `X` such that the sum of `z(x)` is maximized.

The key observation is we are working against a situation of diminishing returns. For example: two perfectly good opportunities don’t double our chances when combined if they are in fact two versions of the same opportunity. The common heuristic that the value of a set is approximately the sum of the value of the elements breaks down.

The exact objective function we are interested in is:

`f(X) = P[number_successes(X)>0]`

where `X` is a set and `P[]` is probability calculated over some (usually unknown) joint probability distribution. Even if we assume our score `z(x)` is a perfect estimate of `P[x is a success]`, we can’t yet estimate `P[number_successes(X)>0]` because we still would not know the probability of joint events (example: knowing `P[x1 succeeds]` and `P[x2 succeeds]` does not mean we necessarily know `P[x1 succeeds and x2 succeeds]`).

One thing `P[]` tells us is whether set-success in our selected set behaves like coin flips (independent and sub-additive), shuffled cards (disjoint and additive), or useless duplicates (strongly sub-additive). The expected number of successes is always additive, but the probability of “one or more successes” tends to be sub-additive (because two successes get the same credit as one).

We are saying the following usual additive measure:

`N(X) := sum({z(x) | x in X})`

may be convenient for the data scientist, but it is not necessarily a good stand-in for our actual application goal (such as maximizing `P[number_successes(X)>0]`), and therefore may not always pick an actual high-utility set (even when there is one).

**`P[number_successes(X)>0]`**

We usually do not know `P[number_successes(X)>0]`. We might get access to training data about it (or what a computer scientist would call “a `P[number_successes(X)>0]` oracle”) by submitting sets `X` to users and recording which sets have a success (instead of submitting items and recording which items are successful). However, set based learning can introduce its own difficulties (for example see “A Note on Learning from Multiple-Instance Examples”, Avrim Blum, Adam Kalai).

What we do is introduce a stand-in set-valued function that we hope behaves somewhat like `P[number_successes(X)>0]`, allowing us to make good choices. Suppose we have models for both `z(x)` (the probability of success, or expected value, of individual items `x`) and pairwise dissimilarity `d(x1,x2)` (presume `d(,)` is designed to be in the range zero to one, with zero indicating perfect similarity).

One might hope that with access to `z()` and `d(,)` one could appeal to something like inclusion/exclusion and define a useful stand-in function that is a sum of the `z()`s plus all the pair `d(,)`s. Because there are so many more `d(,)`s than `z()`s, such a function does not always point us to reasonable set selections. Later on we show a principled method to control the number of `d(,)`s and get a very useful stand-in function.

More commonly researchers use covering or variance heuristics as stand-ins for the unknown set-valued function `f(X) = P[number_successes(X)>0]`.

A good stand-in function is a coverage measure such as `Z()`:

`Z(X) := sum({z(v) | exists x in X and v in V such that d(x,v) ≤ a})`

where `V` is a “universe” of items we are trying to simultaneously be near to or cover. We can take `V = U` (the set we are choosing from), or (better, though it introduces a circular appeal to solving the coverage/diversity/discrepancy problem) take `V` as a low discrepancy sample with respect to the unknown set-distribution of successes `P[]`.

The idea is: if we cover all good opportunities we should have a good chance of having a hit. This has been used in the literature for biological sampling, Adwords sorting (for example see: “Revisiting the greedy approach to submodular set function maximization”, PR Goundan, AS Schulz – Optimization online, 2007), and other applications.
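As an illustration, here is a small Python sketch of this coverage measure together with a greedy selection that maximizes it (the names `Z` and `greedy_cover` and the toy data are ours, not from the original post):

```python
# Sketch of the coverage stand-in Z(X): total z-value of universe items v
# within distance a of some chosen x, plus a greedy maximizing selection.

def Z(X, V, z, d, a):
    """Coverage score: sum of z(v) over universe items covered by X."""
    return sum(z[v] for v in V if any(d(x, v) <= a for x in X))

def greedy_cover(U, V, z, d, a, k):
    """Greedily add the candidate with the largest marginal coverage gain."""
    X = []
    for _ in range(k):
        best = max((u for u in U if u not in X),
                   key=lambda u: Z(X + [u], V, z, d, a))
        X.append(best)
    return X

# Toy example: items are points on a line, z is uniform, d is absolute distance.
V = U = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
z = {v: 1.0 for v in V}
d = lambda x, y: abs(x - y)
print(greedy_cover(U, V, z, d, a=0.5, k=3))  # [0.0, 5.0, 10.0]
```

Note how greedy coverage spreads the picks out: after taking one point near 0 it gains nothing from the near-duplicates 0.1 and 0.2, so it jumps to the other clusters.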

Coverage style measures yield rich combinatorial optimization problems, especially when you value sets of items in terms of what composite systems can be assembled from them (such as drink recipes from ingredients, as cleverly illustrated by Jordan Meyer).

One desire is for the measure to be *intrinsic*, or only depend on the items selected (and be immune to changes in the set of items that can be selected from, the idea is the distribution of where we can look may be a misleading estimate of where we want to look).

A common intrinsic stand-in function is the variation measure `Z'()`:

`Z'(X) := sum({d(x,y) | x,y in X; z(x) ≥ a, z(y) ≥ a})`

This one has been used in molecular diversity work, and is what you would get from an inclusion/exclusion style argument.
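A quick Python sketch of this variation measure (notation and toy data ours; we sum over unordered pairs of above-threshold items, matching the set notation above):

```python
# Sketch of the total-variation stand-in Z'(X): sum of pairwise
# dissimilarities among items whose score clears the threshold a.
from itertools import combinations

def Z_prime(X, z, d, a):
    good = [x for x in X if z[x] >= a]
    return sum(d(x, y) for x, y in combinations(good, 2))

z = {"A": 0.9, "B": 0.8, "C": 0.1}            # "C" falls below threshold
d = lambda x, y: 0.0 if x == y else 1.0        # toy 0/1 dissimilarity
print(Z_prime(["A", "B", "C"], z, d, a=0.5))   # only the A-B pair counts: 1.0
```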

*In theory* one could try a number of geometrically inspired measures based on the following (though most are impractical to actually implement).

Pick `d`, a dimension to work in, and for a set of items `X` let `X'` be `{ x in X | z(x) ≥ a }`. Use distance geometry methods to define a function `g()` from `X'` to `R^d` such that `||g(x)-g(y)|| ~ d(x,y)` for all `x,y in X'`. With such a coordinatization in hand we can use various geometric quantities as diversity measures:

- Determinant of the inertial ellipsoid of the `g(x)`.
- Volume of the convex hull of the `g(x)`.
- Total volume of the union of unit-radius spheres centered at each `g(x)`.

One should prefer a function like `Z()` over many of the others because `Z()` is efficiently implementable and captures the idea of diminishing returns: each point added tends to contribute less of an improvement than the previous one. The unknown true objective `f(X) = P[number_successes(X)>0]` has the diminishing returns property, so we would expect a good stand-in function to also have it. Many of the other common measures (such as total variation) lack this feature. The theory of submodular functions was invented to study optimization under diminishing returns and is a key concept in the literature (see also: “Submodular Function Maximization”, Andreas Krause (ETH Zurich), Daniel Golovin (Google)).

Let’s look at set-valued stand-in functions `M()` that are:

- **Monotone.** That is, `M(Y) ≥ M(X)` when `X` is contained in `Y`.
- **Submodular.** That is, `M(X) + M(Y) ≥ M(X union Y) + M(X intersect Y)`.

This may seem a bit abstract, but it is designed to model situations of diminishing returns. The following common functions are all provably monotone and submodular:

- Set cardinality.
- Weighted set cover (`Z()`, and other functions often used in the literature).
- Volumes of unions of objects.
- `P[number_successes(X)>0]`, the (unknown) probability of a set containing a successful item.

The ideas are:

- Maybe all of these functions behave similarly (so optimizing one informs us about the optimum of another).
- While fully maximizing a non-negative monotone submodular function is often intractable, it is a theorem that the so-called “greedy algorithm” achieves a value that is at least `(1-1/e)` of the optimum. Only getting within 63% of the optimum utility may seem like giving up a lot, but it is a lot better than only being within 20% of the optimal utility (as in our earlier 5 element example). In fact, if success mechanisms are disjoint (which is plausible) and one class is over-represented (also plausible), we would with very high probability see 20% utility for an item-oriented selection. In this situation the greedy set coverage selection *is likely over three times as effective as the naive selection* (as `(1-1/e)/0.2 > 3`)! For the math: Nemhauser, George L., Laurence A. Wolsey, and Marshall L. Fisher. “An analysis of approximations for maximizing submodular set functions—I.” Mathematical Programming 14.1 (1978): 265-294.
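The greedy algorithm itself is short enough to sketch. This is our illustrative Python, using weighted set cover as the monotone submodular function; `greedy_max` and the toy subsets are assumptions for the example, not code from the post:

```python
# Greedy maximization of a monotone submodular set function under a
# cardinality constraint (the algorithm analyzed by Nemhauser et al.).

def greedy_max(ground, f, k):
    """Pick k elements, each time adding the one with largest marginal gain."""
    S = set()
    for _ in range(k):
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(ground - S), key=lambda e: f(S | {e}) - f(S))
        S.add(best)
    return S

# f(S) = number of universe points covered by the chosen subsets (set cover).
subsets = {"a": {1, 2, 3, 4}, "b": {4, 5, 6}, "c": {6}}
f = lambda S: len(set().union(*(subsets[s] for s in S)))
print(sorted(greedy_max(set(subsets), f, 2)))  # ['a', 'b']
```

Here greedy takes "a" first (gain 4), then "b" (marginal gain 2, since point 4 is already covered), illustrating how marginal gains shrink as coverage grows.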

In 1999 I published an additional heuristic set-value stand-in function that I had been using and publicly speaking on for some years. It was inspired by clustering algorithms and the desire to control which terms enter into an inclusion/exclusion style argument (via something like a dependency tree).

This function is unfortunately not monotone or submodular, even with Euclidean distances. However, it has a number of desirable properties complementary to common coverage or variance measures (see: “IcePick: A Flexible Surface-Based System for Molecular Diversity”, John Mount, Jim Ruppert, Will Welch, and Ajay N. Jain, J. Med. Chem., 1999, 42 (1), pp 60–66). This stand-in function works very well, and could be profitably applied in many applications.

Abstractly the function is defined as:

`S(X)` is the total length of the edges in the minimum weight spanning tree formed on the graph of nodes `{x in X | z(x) ≥ a}` with edge-weights given by `d(,)`.

An obvious variation of `S()` is to re-process the `d(,)` into something like `d'(x,y) = (z(x) + z(y)) d(x,y)/2` to lean a bit more towards individual values. We can also add `max_x z(x)` to `S()` so that `S()` has a meaningful value on single item sets.
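For concreteness, here is a short Python sketch (ours, not the published IcePick implementation) of the spanning-tree score `S()`, computed with Prim's algorithm:

```python
# Sketch of the spanning-tree diversity score S(X): total edge length of a
# minimum spanning tree over the items of X clearing the score threshold a.

def S(X, z, d, a):
    nodes = [x for x in X if z[x] >= a]
    if len(nodes) < 2:
        return 0.0
    in_tree = {nodes[0]}
    total = 0.0
    while len(in_tree) < len(nodes):
        # Prim's step: cheapest edge crossing from the tree to the rest.
        w, nxt = min((d(u, v), v) for u in in_tree
                     for v in nodes if v not in in_tree)
        total += w
        in_tree.add(nxt)
    return total

# Toy example: items are points on a line; the MST of {0, 1, 3} has edges 1 and 2.
z = {0.0: 1.0, 1.0: 1.0, 3.0: 1.0, 9.0: 0.1}   # 9.0 falls below threshold
d = lambda u, v: abs(u - v)
print(S([0.0, 1.0, 3.0, 9.0], z, d, a=0.5))    # 3.0
```

Note the duplicate-immunity property from the list below falls out automatically: a point at distance zero from an existing node adds a zero-length edge to the tree.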

I designed this stand-in diversity function `S()` to have the following properties:

- Inexpensive to compute.
- Valuing distance (or aperture): items/points further away tend to be worth more, as they are more extreme measurements (*not* a property of coverage measures such as `f(X) = P[number_successes(X)>0]`).
- Intrinsic. `S(X)` should be a function of only `X` (the set chosen to score), and not of a larger ambient cover target set `U` (where we are choosing from). This was critical in our application as `U` was varying as chemists proposed more molecules.
- No credit for presumed bad items (those that have `z(x) < a`). We note that `f(X) = P[number_successes(X)>0]` itself is not an intrinsic score, but we are assuming that the distribution of available items `U` is likely a *very* misleading estimate of the measure `P[]` we are trying to cover.
- No credit for presumed duplicates: `S(X) = S(X union {y})` when there is an `x in X` such that `d(x,y) = 0`.
- Consistent across selection sizes. Roughly: `S(X)` should be linear in `|X|` for good diverse targets (that is, `X` where `z(x) ≥ a` and `d(x,y) = 1` for all `x,y in X`). This allows us to use the scores to help pick selection size.
- Near additivity. Adding a new good item that is at least distance `d` from all other items should add about `d` units of score. This is not a feature shared by the probability problem we motivated this note with (or by `Z()` or `Z'()`). This is because `S()` was designed for scientific experiment design, where extreme examples were thought to be more valuable in learning the underlying mechanism. So we were not hoping for a biological “hit” on the first screen, but instead a screen that returned a lot of information due to having a lot of useful variation.
- Diminishing returns (though not in the exact same sense as submodular functions): `S(X union Y) ≤ S(X) + S(Y) + min_{x in X, y in Y} d(x,y)`.

Total variation measures were in use at the time, and it was widely thought that in this field coverage-style measures would be overwhelmed by the streetlight effect (the distribution of what we could most easily inspect being a very bad, even misleading, stand-in for the distribution induced by `P[]`), especially when optimizing over molecular libraries produced by combinatorial chemistry methods (once derided as “looking for a needle in a haystack by building more haystacks”).

To find a good set I used a 2-opt optimization heuristic over the score `S()`. This dominates the greedy algorithm as long as you use a greedy solution as one of your optimization starting points (though we did not prove an approximation bound on optimality, as the greedy/submodular argument does not immediately apply here).

I feel the spanning tree measure is complementary to commonly used coverage and variance set measures and serves well in many set-valuation projects.

It is our conjecture that there may not be one efficient intrinsic diversity or discrepancy measure “to rule them all”. Your choice of diversity measure should depend on your problem domain and how good an estimate you have of the unknown joint distribution you can acquire. It is likely something like Arrow’s impossibility theorem applies and there may be no measure simultaneously modeling all reasonable demands.

Instead we suggest trying a few examples (both notional and real) and seeing which heuristic tends to steer towards expected “right” answers. Each of the three main efficiently implementable measures (coverage, variation, and spanning measure) has a diagram it clearly does poorly on. But keep in mind: even a reasonable attempt to score set value as a sub-additive function can greatly outperform simply adding individual item scores.

We illustrate a few instructive situations below. In each case we have a number of items we can select (the circles), some of which are already selected (the filled circles). Distances are as viewed in the diagram. The problem is to choose the next item to select (perhaps as part of a greedy allocation). In each case we show a plausible, intuitively good pick, and the *contrary* item the measure will likely pick (a bad choice avoided by one of the competing measures).

Example 1: coverage measure distracted from useful diversity by over-coverage of one region. This was the motivating problem in our original biotech application, as combinatorial chemistry can produce huge numbers of related molecules, independent of the value of exploring a particular region of biological behavior space.

Example 2: variance measure distracted from usefully exploring new space by exploiting simultaneous distance to many already selected items. We saw this flaw in actual applications: wastefully picking essentially duplicate molecules for experiment because it *appeared* to represent more chances to be far away from another biological motif.

Example 3: spanning tree measure distracted from useful subdivision by collinearity. The spanning tree measure over-values the boundary. The problem is greatest if your search space is “small” (or elliptic) and less if your search space is “big” (or hyperbolic).

Overall the coverage measure is the most natural. The main issue with it is that the distribution of available items may very much misrepresent the (unknown) measure or density induced by the true valuation. We used the spanning tree measure because this mismatch of density (in the presence of molecular libraries produced by combinatorial chemistry methods) was considered the plausibly biggest issue in our original biotech application.

For set-oriented applications (such as showing a user a page of search results) you must move from individual item scoring or ranking to set-valued scoring. Set-valued optimization can be counterintuitive, complicated, and expensive, but it can make a *massive* difference in the quality of your final application.

The reason the spanning tree measure is not monotone or submodular (even for Euclidean dissimilarity measures) is given by Steiner-style examples such as the following:

In the above diagram define the sets `X = {}` and `Y = {y1, y2, y3}`. We see the minimal spanning tree on `(Y union {a})` is shorter than the minimal spanning tree on `Y` (violating monotonicity).

For our next example consider `X = {x1, x2, x3}` and `Y = {x1, x2, x3, y4}`. Then we have `X contained in Y`, and as “Y has already sprung the trap” we have that `S(X union {a}) - S(X)` is negative while `S(Y union {a}) - S(Y)` is not, violating submodularity (which would require `S(X union {a}) - S(X) ≥ S(Y union {a}) - S(Y)`).
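Both failures can be checked numerically with minimum spanning tree length standing in for `S()` (again with no `z()` weighting). The coordinates below, a unit equilateral triangle with `a` at its center and `y4` a point that has already “sprung the trap”, are illustrative values of my choosing, not the original diagram:

```python
import math

def mst_length(pts):
    """Total edge length of a minimum spanning tree (Prim's algorithm)."""
    if len(pts) <= 1:
        return 0.0
    in_tree, rest = [pts[0]], list(pts[1:])
    total = 0.0
    while rest:
        w, p = min((min(math.dist(a, b) for a in in_tree), b) for b in rest)
        total += w
        in_tree.append(p)
        rest.remove(p)
    return total

# Unit equilateral triangle; `a` is its center, a Steiner point at
# distance 1/sqrt(3) from each vertex.
tri = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
a = (0.5, math.sqrt(3) / 6)

# Monotonicity fails: adding the Steiner point *shortens* the tree
# (sqrt(3) ~ 1.732 versus 2 for two triangle sides).
assert mst_length(tri + [a]) < mst_length(tri)

# Submodularity fails: Y = X union {y4} has already sprung the trap,
# so adding `a` to Y changes the tree length only negligibly, while
# adding `a` to X shortens it substantially.
X = tri
y4 = (0.5, math.sqrt(3) / 6 + 0.01)   # already near the Steiner point
Y = X + [y4]
marginal_X = mst_length(X + [a]) - mst_length(X)
marginal_Y = mst_length(Y + [a]) - mst_length(Y)
assert marginal_X < 0
assert marginal_X < marginal_Y        # submodularity would require >=
```

Running the block raises no assertion errors, confirming the two violations claimed above for this configuration.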

**Please comment on the article here:** **Statistics – Win-Vector Blog**

The post Neglected optimization topic: set diversity appeared first on All About Statistics.


The post Forking paths vs. six quick regression tips appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Forking paths vs. six quick regression tips appeared first on All About Statistics.

Bill Harris writes:

I know you’re on a blog delay, but I’d like to vote to raise the odds that my question in a comment to http://andrewgelman.com/2015/09/15/even-though-its-published-in-a-top-psychology-journal-she-still-doesnt-believe-it/ gets discussed, in case it’s not in your queue.

It’s likely just my simple misunderstanding, but I’ve sensed two bits of contradictory advice in your writing: fit one complete model all at once, and fit models incrementally, starting with the overly small.

For those of us who are working in industry and trying to stay abreast of good, current practice and thinking, this is important.

I realize it may also not be a simple question. Maybe both positions are correct, and we don’t yet have a unifying concept to bring them together.

I am open to a sound compromise. For example, I could imagine the need to start with EDA and small models but hold out a test set for one comprehensive model. I recall you once wrote to me that you don’t worry much about holding out data for testing, since your field produces new data with regularity. Others of us aren’t quite so lucky, either because data is produced parsimoniously or the data we need to use is produced parsimoniously.

Still, building the one big model, even after the discussions on sparsity and on horseshoe priors, can sound a bit like http://andrewgelman.com/2014/06/02/hate-stepwise-regression/, http://andrewgelman.com/2012/10/16/bayesian-analogue-to-stepwise-regression/, and http://andrewgelman.com/2013/02/11/toward-a-framework-for-automatic-model-building/, although I recognize that regularization can make a big difference. Thoughts?

My reply:

I have so many things I really really *must* do, but am too lazy to do. Things to figure out, data to study, books to write. Every once in awhile I do some work and it feels soooo good. Like programming the first version of the GMO algorithm, or doing that simulation the other day that made it clear how the simple Markov model massively underestimates the magnitude of the hot hand (sorry, GVT!), or even buckling down and preparing R and Stan code for my classes. But most of the time I avoid working, and during those times, blogging keeps me sane. It’s now May in blog time, and I’m 1/4 of the way toward being Jones.

So, sure, Bill, I’ll take next Monday’s scheduled post (“Happy talk, meet the Edlin factor”) and bump it to 11 May, to make space for this one.

And now, to get to the topic at hand: Yes, it does seem that I give two sorts of advice but I hope they are complementary, not contradictory.

On one hand, let’s aim for hierarchical models where we study many patterns at once. My model here is Aki’s birthday model (the one with graphs on cover of BDA3) where, instead of analyzing just Valentine’s Day and Halloween, we looked at all 366 days at once, also adjusting for day of week in a way that allows that adjustment to change over time.

On the other hand, we can never quite get to where we want to be, so let’s start simple and build our models up. This happens both *within a project*—start simple, build up, keep going until you don’t see any benefit from complexifying your model further—and *across projects*, where we (statistical researchers and practitioners) gradually get comfortable with methods and can go further.

This is related to the general idea we discussed ~~a few years ago~~ (wow—it was only a year ago, blogtime flies!), that statistical analysis recapitulates the development of statistical methods.

In the old days, many decades ago, one might start by computing correlation measures and then move to regression, adding predictors one at a time. Now we might start with a (multiple) regression, then allow intercepts to vary, then move to varying slopes. In a few years, we may internalize multilevel models (both in our understanding and in our computation) so that they can be our starting point, and once we’ve chunked that, we can walk in what briefly will feel like seven-league boots.

Does that help?


**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**


The post "Using R for Introductory Econometrics" appeared first on All About Statistics.

Recently, I received an email from Florian Heiss, Professor and Chair of Statistics and Econometrics at the Heinrich Heine University of Düsseldorf.

He wrote:

"I'd like to introduce you to a new book I just published that might be of interest to you: *Using R for Introductory Econometrics*. The goal: An introduction to R that makes it as easy as possible for undergrad students to link theory to practice without any hurdles regarding material, notation, or terminology. The approach: Take a popular econometrics textbook (Jeff Wooldridge's *Introductory Econometrics*) and make the whole thing as consistent as possible. I introduce R and show how to implement all methods Wooldridge mentions, mostly using his examples. I also add some Monte Carlo simulation and present tools like R Markdown. The book is self-published, so I can offer the whole text for free online reading and a hard copy is really cheap as well."

The link for the online version of Florian's book is http://www.urfie.net/.

What you'll find there are two versions of his 365-page book (Flash and HTML5) that you can read online, and all of the related R files for easy download.

If you're after a hard copy of the book you can purchase it for the bargain price of US$26.90 directly from CreateSpace, or from Amazon.

Florian has used the CreateSpace publishing platform to produce an extremely professional product.

*Using R for Introductory Econometrics* is a fabulous modern resource. I know I'm going to be using it with my students, and I recommend it to anyone who wants to learn about econometrics and R at the same time.

© 2016, David E. Giles

**Please comment on the article here:** **Econometrics Beat: Dave Giles' Blog**



The post On deck this week appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post On deck this week appeared first on All About Statistics.

**Mon:** Forking paths vs. six quick regression tips

**Tues:** Primed to lose

**Wed:** Point summary of posterior simulations?

**Thurs:** In general, hypothesis testing is overrated and hypothesis generation is underrated, so it’s fine for these data to be collected with exploration in mind.

**Fri:** “Priming Effects Replicate Just Fine, Thanks”

**Sat:** Pooling is relative to the model

**Sun:** Hierarchical models for phylogeny: Here’s what everyone’s talking about

The above image is so great I didn’t want you to have to wait till Tues and Fri to see it.


**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**


The post Connection between hypergeometric distribution and series appeared first on All About Statistics.

What’s the connection between the **hypergeometric distributions**, **hypergeometric functions**, and **hypergeometric series**?

The hypergeometric distribution is a probability distribution with parameters *N*, *M*, and *n*. Suppose you have an urn containing *N* balls, *M* red and the rest, *N* – *M*, blue, and you select *n* balls at a time. The hypergeometric distribution gives the probability of selecting *k* red balls.

The **probability generating function** for a discrete distribution is the series formed by summing, over each outcome *k*, the probability of outcome *k* times *x*^{k}. So the probability generating function for a hypergeometric distribution is given by

*f*(*x*) = Σ_{k} C(*M*, *k*) C(*N* – *M*, *n* – *k*) *x*^{k} / C(*N*, *n*)

where C(*a*, *b*) is the binomial coefficient “*a* choose *b*.” The summation is over all integers, but the terms are only non-zero for *k* between 0 and *M* inclusive. (This may be more general than the definition of binomial coefficients you’ve seen before. If so, see these notes on the general definition of binomial coefficients.)

It turns out that *f* is a **hypergeometric function** of *x* because it can be written as a **hypergeometric series**. (Strictly speaking, *f* is a constant multiple of a hypergeometric function. More on that in a moment.)

A hypergeometric function is defined by a pattern in its power series coefficients. The hypergeometric function *F*(*a*, *b*; *c*; *x*) has the power series

*F*(*a*, *b*; *c*; *x*) = Σ_{k} (*a*)_{k} (*b*)_{k} *x*^{k} / ((*c*)_{k} *k*!)

where (*n*)_{k} is the *k*th rising power of *n*. It’s a sort of opposite of factorial. Start with *n* and multiply consecutive *increasing* integers for *k* terms. (*n*)_{0} is an empty product, so it is 1. (*n*)_{1} = *n*, (*n*)_{2} = *n*(*n*+1), etc.

If the ratio of the (*k*+1)st term to the *k*th term in a power series is a rational function of *k* (a ratio of polynomials in *k*), then the series is a (multiple of a) hypergeometric series, and you can read the parameters of the hypergeometric series off that ratio. This ratio for our probability generating function works out to be

(*M* – *k*)(*n* – *k*) *x* / ((*k* + 1)(*N* – *M* – *n* + *k* + 1))

and so the corresponding hypergeometric function is *F*(-*M*, -*n*; *N* – *M* – *n* + 1; *x*). The constant term of a hypergeometric function is always 1, so evaluating our probability generating function at 0 tells us what constant multiplies *F*(-*M*, -*n*; *N* – *M* – *n* + 1; *x*). Now

*f*(0) = C(*N* – *M*, *n*) / C(*N*, *n*)

and so

*f*(*x*) = C(*N* – *M*, *n*) *F*(-*M*, -*n*; *N* – *M* – *n* + 1; *x*) / C(*N*, *n*)

The hypergeometric series above gives the original hypergeometric function as defined by Gauss, and may be the most common form in application. But the definition has been extended to allow any number of rising powers in the numerator and denominator of the coefficients. The classical hypergeometric function of Gauss is denoted _{2}*F*_{1} because it has two rising powers on top and one on the bottom. In general, the hypergeometric function _{p}*F*_{q} has *p* rising powers in the numerator and *q* rising powers in the denominator.
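The coefficient pattern above is easy to check numerically with exact rational arithmetic; the urn sizes `N = 20, M = 7, n = 5` below are arbitrary illustrative values:

```python
from math import comb
from fractions import Fraction

def hyper_pmf(k, N, M, n):
    """P[k red] when drawing n balls from an urn of N balls, M of them red."""
    return Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))

N, M, n = 20, 7, 5
a, b, c = -M, -n, N - M - n + 1   # parameters of F(-M, -n; N-M-n+1; x)

# The PMF sums to 1, and consecutive coefficients of the probability
# generating function follow the hypergeometric term ratio
# (k + a)(k + b) / ((k + c)(k + 1)).
assert sum(hyper_pmf(k, N, M, n) for k in range(n + 1)) == 1
for k in range(n):
    ratio = hyper_pmf(k + 1, N, M, n) / hyper_pmf(k, N, M, n)
    assert ratio == Fraction((k + a) * (k + b), (k + c) * (k + 1))
```

The constant multiple cancels in the term ratio, which is why the PMF coefficients can be compared directly against the series pattern.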

The CDF of a hypergeometric distribution turns out to involve a more general hypergeometric function, a _{3}*F*_{2} with upper parameters *a*, *b*, *c* and lower parameters *d*, *e*,

where *a* = 1, *b* = *k*+1-*M*, *c* = *k*+1-*n*, *d* = *k*+2, and *e* = *N*+*k*+2-*M*-*n*.

Thanks to Jan Galkowski for suggesting this topic via a comment on an earlier post, Hypergeometric bootstrapping.

**Please comment on the article here:** **Statistics – John D. Cook**

