
The post Prior choice recommendations wiki! appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Prior choice recommendations wiki! appeared first on All About Statistics.

Here’s the wiki, and here’s the background:

Our statistical models are imperfect compared to the true data generating process and our complete state of knowledge (from an informational-Bayesian perspective) or the set of problems over which we wish to average our inferences (from a population-Bayesian or frequentist perspective).

The practical question here is what model to choose, or what class of models. Advice here is not typically so clear. We have a bunch of existing classes of models such as linear regression, logistic regression, along with newer things such as deep learning, and usual practice is to take a model that has been applied on similar problems and keep using it until it is obviously wrong. That’s not such a bad way to go; I’m just pointing out the informality of this aspect of model choice.

What about the choice of prior distribution in a Bayesian model? The traditional approach leads to an awkward choice: either the fully informative prior (wildly unrealistic in most settings) or the noninformative prior, which is supposed to give good answers for any possible parameter values (in general, feasible only in settings where the data happen to be strongly informative about all parameters in your model).

We need something in between. In a world where Bayesian inference has become easier and easier for more and more complicated models (and where approximate Bayesian inference is useful in large and tangled models such as recently celebrated deep learning applications), we need prior distributions that can convey information, regularize, and suitably restrict parameter spaces (using soft rather than hard constraints, for both statistical and computational reasons).

I used to talk about these as weakly informative prior distributions (as in this paper from 2006 on hierarchical variance parameters and this from 2008 on logistic regression coefficients) but now I just call them “prior distributions.” In just about any real problem, there’s no fully informative prior and there’s no noninformative prior; every prior contains some information and does some regularization, while still leaving some information “on the table,” as it were.

That is, *all* priors are weakly informative; we now must figure out where to go from there. We should be more formal about the benefits of including prior information, and the costs of overconfidence (that is, including information that we don’t really have).
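To make the regularization benefit concrete, here is a toy numerical sketch (my own illustration, not from the post; the four data points and the normal(0, 2.5) prior scale are invented): with completely separated data, the logistic maximum-likelihood estimate diverges, while even a mildly informative prior keeps the estimate finite.

```r
# Toy illustration: regularization from a weakly informative prior.
# Four observations with complete separation: the MLE of the slope is infinite.
x <- c(-2, -1, 1, 2)
y <- c(0, 0, 1, 1)

# Negative log-likelihood of a logistic regression with slope b (no intercept)
nll <- function(b) -sum(y * plogis(b * x, log.p = TRUE) +
                        (1 - y) * plogis(-b * x, log.p = TRUE))

# Add the negative log of a normal(0, 2.5) prior on b
penalized <- function(b) nll(b) + b^2 / (2 * 2.5^2)

b_mle <- optimize(nll, c(0, 50))$minimum        # drifts to the search boundary
b_map <- optimize(penalized, c(0, 50))$minimum  # settles at a finite value
```

The unpenalized optimum runs off toward the boundary of the search interval, while the penalized (posterior-mode) estimate stays finite: the prior leaves most of the inference to the data but rules out absurd parameter values, which is exactly the "soft constraint" role described above.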

We (Dan Simpson, Michael Betancourt, Aki Vehtari, and various others) have been thinking about the problem in different ways involving theory, applications, and implementation in Stan.

For now, we have this wiki on Prior Choice Recommendations. I encourage you to take a look, and feel free to add to it if you have something useful to say. You can also post your questions and thoughts in comments below.


**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**


The post a secretary problem with maximum ability appeared first on All About Statistics.

The Riddler of today has a secretary problem, where one measures sequentially N random variables until one deems the current variable to be the largest of the whole sample. The classical secretary problem has a counter-intuitive solution where one first measures N/e random variables without taking any decision and then, and only then, picks the first subsequent outcome larger than the largest in the first group. (For large values of N.) The added information in the current riddle is that the distribution of those iid random variables is set to be uniform on {1,…,M}, which begs for a modification in the algorithm, as for instance when observing M on the current draw.

The approach I devised is certainly suboptimal, as I decided to pick the currently observed value if the (conditional) probability that it is the largest of all N draws exceeds the probability that the largest value appears in a subsequent draw. This translates into the following R code:

```r
M = 100 # maximum value
N = 10  # total number of draws

hyprob = function(m){
  # m is the sequence of draws so far
  n = length(m); mmax = max(m)
  if ((m[n] < mmax) || (mmax - n < N - n)){
    prob = 0
  }else{
    prob = prod(sort((1:mmax)[-m], dec=TRUE)[1:(N-n)] / ((M-n):(M-N+1)))
  }
  return(prob)}

decision = function(draz = sample(1:M, N)){
  i = 0
  keepgoin = TRUE
  while ((keepgoin) & (i < N)){
    i = i + 1
    keepgoin = (hyprob(draz[1:i]) < 0.5)
  }
  return(c(i, draz[i], (draz[i] < max(draz))))}
```

which produces a winning rate of around 62% when N=10 and M=100, hence much better than the expected performance of the (asymptotic) secretary algorithm, with a winning frequency of 1/e. (For N=10 and M=100, that winning frequency is only 27%.)
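For comparison, the classical rule is easy to simulate under the same finite setup (this snippet is mine, not from the post; the function name and simulation size are arbitrary):

```r
# Simulate the classical secretary rule: observe the first round(N/e) draws
# without stopping, then take the first value that beats the best seen so far.
classical_win_rate <- function(N, M, nsim = 1e4) {
  k <- round(N / exp(1))   # length of the observation-only phase
  wins <- replicate(nsim, {
    draz <- sample(1:M, N)
    best_seen <- if (k > 0) max(draz[1:k]) else -Inf
    pick <- which(draz > best_seen & seq_along(draz) > k)[1]
    if (is.na(pick)) pick <- N   # forced to accept the last draw
    draz[pick] == max(draz)
  })
  mean(wins)
}

rate <- classical_win_rate(10, 100)
```

Because N is small and the draws are distinct, the simulated rate differs from the asymptotic 1/e; it provides a baseline against which the hyprob/decision strategy above can be compared.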


**Please comment on the article here:** **R – Xi'an's Og**



The post A whole fleet of Wansinks: is “evidence-based design” a pseudoscience that’s supporting a trillion-dollar industry? appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post A whole fleet of Wansinks: is “evidence-based design” a pseudoscience that’s supporting a trillion-dollar industry? appeared first on All About Statistics.

Following a recent post that mentioned « le Sherlock Holmes de l'alimentation » (the Sherlock Holmes of food), we got this blockbuster comment, which seemed worth its own post, from Ujjval Vyas:

I work in an area where social psychology is considered the gold standard of research and thus the whole area is completely full of Wansink stuff (“people recover from surgery faster if they have a view of nature out the window”, “obesity and diabetes are caused by not enough access to nature for the poor”, biomimicry is a particularly egregious idea in this field). No one even knows how to really read any of the footnotes or cares, since it is all about confirmation bias and the primary professional organizations in the area directly encourage such lack of rigor. Obscure as it may sound, the whole area of “research” into architecture and design is full of this kind of thing. But the really odd part is that the field is made up of people who have no idea what a good study is or could be (architects, designers, interior designers, academic “researchers” at architecture schools or inside furniture manufacturers trying to sell more). They even now have groups that pursue “evidence-based healthcare design” which simply means that some study somewhere says what they need it to say. The field is at such a low level that it is not worth mentioning in many ways except that it is deeply embedded in a $1T industry for building and construction as well as codes and regulations based on this junk. Any idea of replication is simply beyond the kenning in this field because, as one of your other commenters put it, the publication is only a precursor to Ted talks and keynote addresses and sitting on officious committees to help change the world (while getting paid well). Sadly, as you and commenters have indicated, no one thinks they are doing anything wrong at all. I only add this comment to suggest that there are whole fields and sub-fields that suffer from the problems outlined here (much of this research would make Wansink look scrupulous).

Here’s the Wikipedia page on Evidence-based design, including this chilling bit:

As EBD is supported by research, many healthcare organizations are adopting its principles with the guidance of evidence-based designers. The Center for Health Design and InformeDesign (a not-for-profit clearinghouse for design and human-behaviour research) have developed the Pebble Project, a joint research effort by CHD and selected healthcare providers on the effect of building environments on patients and staff.

The Evidence Based Design Accreditation and Certification (EDAC) program was introduced in 2009 by The Center for Health Design to provide internationally recognized accreditation and promote the use of EBD in healthcare building projects, making EBD an accepted and credible approach to improving healthcare outcomes. The EDAC identifies those experienced in EBD and teaches about the research process: identifying, hypothesizing, implementing, gathering and reporting data associated with a healthcare project.

Later on the page is a list of 10 strategies (1. Start with problems. 2. Use an integrated multidisciplinary approach with consistent senior involvement, ensuring that everyone with problem-solving tools is included. etc.). Each of these steps seems reasonable, but put them together and they do read like a recipe for taking hunches, ambitious ideas, and possible scams and making them look like science. So I’m concerned. Maybe it would make sense to collaborate with someone in the field of architecture and design and try to do something about this.

**P.S.** It might seem kinda mean for me to pick on these qualitative types for trying their best to achieve something comparable to quantitative rigor. But . . . if there are really billions of dollars at stake, we shouldn’t sit idly by. Also, I feel like Wansink-style pseudoscience can be destructive of qualitative expertise. I’d rather see some solid qualitative work than bogus number crunching.


**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**


The post Learning by Doing appeared first on All About Statistics.

The New York Times did it after the election, in January 2017: **You Draw It**, learning statistics by drawing and comparing charts.

‘Draw your guesses on the charts below to see if you’re as smart as you think you are.’

And Bayerischer Rundfunk did it before the election, in April 2017.

Presenting information this way is an excellent strategy for fostering insight and for preventing forgetting. And it’s an old tradition in didactics: 360 years ago, Amos Comenius emphasized this technique in his Didactica Magna:

**“Agenda agendo discantur”** (“let what is to be done be learned by doing”)


**Please comment on the article here:** **Blog about Stats**


The post “Data sleaze: Uber and beyond” appeared first on All About Statistics.

Interesting discussion from Kaiser Fung. I don’t have anything to add here; it’s just a good statistics topic.

Scroll through Kaiser’s blog for more:

Dispute over analysis of school quality and home prices shows social science is hard

My pre-existing United boycott, and some musing on randomness and fairness

etc.

The post “Data sleaze: Uber and beyond” appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**



The post Using prior knowledge in frequentist tests appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Using prior knowledge in frequentist tests appeared first on All About Statistics.

Christian Bartels sent along this paper, which he described as an attempt to use informative priors for frequentist test statistics.

I replied:

I’ve not tried to follow the details but this reminds me of our paper on posterior predictive checks. People think of this as very Bayesian but my original idea when doing this research was to construct frequentist tests using Bayesian averaging in order to get p-values. This was motivated by a degrees-of-freedom-correction problem where the model had nonlinear constraints and so one could not simply do a classical correction based on counting parameters.

To which Bartels wrote:

Your work starts from the same point as mine: existing frequentist tests may be inadequate for the problem of interest. Your work also ends where I would like to end, performing tests via integration over (i.e., sampling of) parameters and future observations using likelihood and prior.

In addition, I try to anchor the approach in decision theory (as referenced in my write up). Perhaps this is too ambitious, we’ll see.

Results so far, using the language of your publication:

– The posterior distribution p(theta|y) is a good choice for the deviance D(y,theta). It gives optimal confidence intervals/sets in the sense proposed by Schafer, C.M. and Stark, P.B., 2009. Constructing confidence regions of optimal expected size. Journal of the American Statistical Association, 104(487), pp.1080-1089.

– Using informative priors for the deviance D(y,theta)=p(theta|y) may improve the quality of decisions, e.g., may improve the power of tests.

– For the marginalization, I find it difficult to strike the balance between proposing something that can be argued/shown to give optimal tests, and something that can be calculated with available computational resources. I hope to end up with something like one of the variants shown in your Figure 1.

I noted that you distinguish test statistics, which do not depend on the parameters, from deviances, which do. I’m not aware of anything that prevents you from using deviances with a dependence on parameters for frequentist tests – it is just inconvenient if you are after generic, closed-form solutions for tests. I did not make this differentiation, and refer to tests regardless of whether they depend on the parameters or not.

I don’t really have anything more to say here, as I have not thought about these issues for a while. But I thought Bartels’s paper and this discussion might interest some of you.


**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**



The post I hate R, volume 38942 appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post I hate R, volume 38942 appeared first on All About Statistics.

R doesn’t allow block comments. You have to comment out each line, or you can encapsulate the block in if(0){} which is the world’s biggest hack. Grrrrr.
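For concreteness, here are the two workarounds side by side (a minimal sketch; the variable names are arbitrary). Note that, unlike a true block comment, the skipped code must still parse as valid R:

```r
# Per-line commenting: every line needs its own '#'
# x <- heavy_step_one(data)
# y <- heavy_step_two(x)

# The if(0){} / if(FALSE){} hack: the block is parsed but never evaluated
if (FALSE) {
  x <- 1:10
  y <- cumsum(x)
}
exists("y")   # FALSE in a fresh session: nothing inside the block ran
```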

**P.S.** Just to clarify: I want block commenting *not* because I want to add long explanatory blocks of text to annotate my scripts. I want block commenting because I alter my scripts, and sometimes I want to comment out a block of code.


**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**


The post Data sleaze: Uber and beyond appeared first on All About Statistics.

There has been a barrage of negative publicity related to Uber recently. The latest salvo is a long article in the *New York Times* (link). This piece focuses on Uber's CEO, who was trained as a computer engineer, but my interest lies primarily in several revelations about how Uber collects and uses customer data.

The key episode picked up by various outlets (e.g., TechCrunch, Wired) involves Uber "secretly identifying and tagging iPhones even after its app had been deleted and the devices erased." What Uber engineers did was against Apple rules, and they knew it, because they also implemented an elaborate cover-up operation. But Apple eventually discovered the ruse, and CEO Tim Cook took Uber to task.

Much of the reporting on this episode, whether in the *Times*, *Wired*, or elsewhere, misses the mark. These reporters seem to have received Uber's side of the story from its PR team and just printed it without asking the tough questions. Uber claimed that its rule-breaking code was intended to "prevent fraud," and suggested that it is a standard practice used by many other app developers.

If those assertions were true, then Apple would have summoned a lot of developers to Cupertino! Uber clearly went beyond what other app developers were doing.

Also, it is not at all clear how this alleged fraud detection scheme works. Here are the details offered by the *Times*:

At the time, Uber was dealing with widespread account fraud in places like China, where tricksters bought stolen iPhones that were erased and resold. Some Uber drivers there would then create dozens of fake email addresses to sign up for new Uber rider accounts attached to each phone, and request rides from those phones, which they would then accept. Since Uber was handing out incentives to drivers to take more rides, the drivers could earn more money this way.

To halt the activity, Uber engineers assigned a persistent identity to iPhones with a small piece of code, a practice called “fingerprinting.” Uber could then identify an iPhone and prevent itself from being fooled even after the device was erased of its contents.

As someone with an engineering degree, I don't understand what those words mean. First, it seems that erasing the device does not remove the Uber-added piece of code, which is perhaps more of an uncomfortable question for Apple than for Uber. Second, am I to believe that there are no legitimate, refurbished iPhones in use? That last sentence sounds like every iPhone that switched users is a fraud. Third, there are many other ways to detect fraud - those fake user accounts have to be tied to credit cards, for example. Fourth, if a promotion encourages gaming and fraud, to the extent that a major software development effort, including a cover-up operation, is needed to support it, maybe the promotion should be retired, no?

***

Readers should definitely read between the lines here. The most noteworthy items weren't even mentioned.

First, Uber is after personally identifying data. If we believe that fraud detection was the original intent of such unruly fingerprinting, then Uber wants to know which customer is using a specific phone, which implies that the data being collected must identify individuals (as well as individual phones).

Second, while Uber claimed that the code was developed for fraud detection, nowhere did it deny that the collected data have been used for other purposes, such as marketing. Such data is extremely useful for acquiring new customers. It tells Uber which phones do not have their app installed. It is as if the first owner of the iPhone who installs the Uber app places every future owner of that iPhone onto Uber's prospect list.

The data is also extremely useful for "winback" marketing efforts. This piece of code is installed for every user who deletes the Uber app, as Uber has no way of knowing if the iPhone will change hands at the time that the Uber app is deleted. So everyone who deletes the app can be tracked and presumably sent winback communications.

***

It is debatable whether consumers care about being tracked 24/7, and I don't pretend to speak for everyone. Uber is not the only company that has developed software to follow their customers' every move. What is clear though is that the engineers who write the code to execute these concepts - at Uber or elsewhere - believe that the consumers do not want to be tracked 24/7. However, they choose to do it anyway.

We know they know we don't want to be tracked. That's because the tracking code is typically hidden from view, and there is a general disclosure buried in Privacy Notices or Terms and Conditions that everyone knows nobody reads. If no one cares about tracking, then it can be performed in the open. Similarly, the cover-up operation to hide the illicit code from Apple engineers reveals that the engineers knew they were breaking the rules, and they did it anyway.

The Times article contained another revelation. Uber buys data from a startup called Slice Intelligence, who resells data from Unroll.me. Unroll.me runs a free service that helps people get rid of the clutter of spam in their email boxes - you grant permission to the company to peek into your mailbox, and pull out the unsubscribe links from various email lists. Well, it turns out that this service is a front for corporate espionage. Once inside your mailbox, their code gathers data about your purchases, and sells the data to companies: in this case, Uber buys data from Slice about its competitor, Lyft.

Here is actually one of the unspoken secrets of this "big data" industry. Unroll.me is one of many, many apps that are designed to collect data about our daily lives while fronting to be something else. I am pretty sure that the various receipt scanning apps (for expense reporting) are doing the same. I was told that weather apps are location databases that track all their users everywhere.

Again, it appears that the founders, managers or engineers who work for these outfits assume that their customers do not want to be tracked in such a manner because all such operations are hidden from view, and any disclosure is usually buried inside legalese that almost no one ever reads.

***

Slice Intelligence is hiding behind the weasel word "anonymized," which it explains as "no customer names." Usually, the deniers say they do not provide any "personally identifiable information" (PII), which would include phone numbers, addresses, emails, customer ids, etc. If any of those items are attached, not providing names is meaningless.

It is highly likely that Slice customers want personally identifying data - that is the key to connecting the dots. Analysts want to match John Smith on dataset A to John Smith on dataset B. If one of these datasets is truly anonymized, it will be very challenging, if not impossible, to correlate the data.

The "cover" of anonymized data is archaic, and I am surprised it is still in use. It's been proven a number of times that analysts can recover people's identities easily, even in datasets that are stripped of PII. Say, you are a loyal Uber customer, and pretty much take Uber cars everywhere. If I am given all of your trip information, origins, destinations and time of travel, I can immediately figure out where you live and where you work. From there, I will likely be able to identify you. Then I can look at the other trips to build a profile of your preferences, by analyzing what stores you shop at, what restaurants you eat at, how much you tip, etc.
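The kind of inference described above is simple enough to sketch in a few lines of R (entirely made-up toy data; the "most frequent overnight origin" heuristic is just one illustrative rule):

```r
# Toy re-identification from "anonymized" trip records:
# the most frequent overnight origin is a strong guess at the rider's home.
trips <- data.frame(
  origin = c("OakSt", "OakSt", "Office", "OakSt", "Bar", "OakSt"),
  hour   = c(8, 22, 18, 7, 23, 8)   # hour of day the trip started
)
overnight <- trips$hour >= 20 | trips$hour <= 9
guessed_home <- names(which.max(table(trips$origin[overnight])))
guessed_home   # "OakSt"
```

From a guessed home (and, by the same logic, a guessed workplace), linking the "anonymous" rider to a public address record is often all that remains.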

***

Data sleaze is the data about one's own customers that are obtained secretly by businesses, and then sold to the highest bidders, also in secret transactions. The production of data sleaze is frequently justified by giving services away for "free." However, running a business as a "free service" fronting a profitable espionage operation is a choice made by management teams, not an inevitability. Indeed, many businesses that have a proper revenue model also produce data sleaze.

**Please comment on the article here:** **Big Data, Plainly Spoken (aka Numbers Rule Your World)**



The post Visualize a design matrix appeared first on The DO Loop.

The post Visualize a design matrix appeared first on All About Statistics.

Most SAS regression procedures support a CLASS statement which internally generates dummy variables for categorical variables. I have previously described what dummy variables are and how are they used. I have also written about how to create design matrices that contain dummy variables in SAS, and in particular how to use different parameterizations: GLM, reference, effect, and so forth.

It occurs to me that you can visualize the structure of a design matrix by using the same technique (heat maps) that I used to visualize missing value structures.
In a design matrix, each categorical variable is replaced by several dummy variables. However, there are multiple parameterizations or *encodings* that result in different design matrices.

Heat maps require several pixels for each row and column of the design matrix, so they are limited to small or moderate sized data. The following SAS DATA step extracts the first 150 observations from the Sashelp.Heart data set and renames some variables. It also adds a fake response variable because the regression procedures that generate design matrices (GLMMOD, LOGISTIC, GLMSELECT, TRANSREG, and GLIMMIX) require a response variable even though the goal is to create a design matrix for the explanatory variables. In the following statements, the OUTDESIGN option of the GLMSELECT procedure generates the design matrix. The matrix is then read into PROC IML where the HEATMAPDISC subroutine creates a discrete heat map.

```sas
/* add fake response variable; for convenience, shorten variable names */
data Temp / view=Temp;
   set Sashelp.heart(obs=150 keep=BP_Status Chol_Status Smoking_Status Weight_Status);
   rename BP_Status=BP Chol_Status=Chol Smoking_Status=Smoking Weight_Status=Weight;
   FakeY = 0;
run;

ods exclude all;
/* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=GLM;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;

ods graphics / width=500px height=800px;
proc iml;
/* use HEATMAPDISC call to create heat map of design */
use Design;  read all var _NUM_ into X[c=varNames];  close;
run HeatmapDisc(X) title="GLM Design Matrix"
    xvalues=varNames displayoutlines=0
    colorramp={"White" "Black"};
QUIT;
```

Each row of the design matrix indicates a patient in a research study. If any explanatory variable has a missing value, the corresponding row of the design matrix is missing (shown as gray). In the design matrix for the GLM parameterization, a categorical variable with *k* levels is represented by *k* columns. The black-and-white heat map shows the structure of the design matrix. Black indicates a 1 and white indicates a 0. In particular:

- The first column is all black, which indicates the intercept column.
- Columns 2-4 represent the BP variable. Each row has one black rectangle in one of those columns. You can see that there are few black squares in column 4, which indicates that few patients in the study have optimal blood pressure.
- In a similar way, you can see that there are many nonsmokers (column 11) in the study. There are also many overweight patients (column 14) and few underweight patients (column 15).

The GLM parameterization is called a "singular parameterization" because it contains redundant columns. For example, the BP_Optimal column is redundant because that column contains a 1 only when the BP_High and BP_Normal columns are both 0. Similarly, if either the BP_High or the BP_Normal column is 1, then BP_Optimal is automatically 0. The next section removes the redundant columns.

There is a binary design matrix that contains only the independent columns of the GLM design matrix. It is called a reference parameterization and you can generate it by using PARAM=REF in the CLASS statement, as follows:

```sas
ods exclude all;
/* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP Chol Smoking Weight / param=REF;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;
```

Again, you can use the HEATMAPDISC call in PROC IML to create the heat map. The matrix is similar, but categorical variables that have *k* levels are replaced by *k*–1 dummy variables. Because the reference level was not specified in the CLASS statement, the last level of each category is used as the reference level. Thus the REFERENCE design matrix is similar to the GLM design, except that the last column for each categorical variable has been dropped. For example, there are columns for BP_High and BP_Normal, but no column for BP_Optimal.

The previous design matrices were binary 0/1 matrices. The EFFECT parameterization, which is the default parameterization for PROC LOGISTIC, creates a nonbinary design matrix. In the EFFECT parameterization, the reference level is represented by using a -1 and a nonreference level is represented by 1. Thus there are three values in the design matrix.

If you do not specify the reference levels, the last level for each categorical variable is used, just as for the REFERENCE parameterization. The following statements generate an EFFECT design matrix and use the REF= suboption to specify the reference level. Again, you can use the HEATMAPDISC subroutine to display a heat map for the design. For this visualization, light blue is used to indicate -1, white for 0, and black for 1.

```sas
ods exclude all;
/* use OUTDESIGN= option to write the design matrix to a data set */
proc glmselect data=Temp outdesign(fullmodel)=Design(drop=FakeY);
   class BP(ref='Normal') Chol(ref='Desirable')
         Smoking(ref='Non-smoker') Weight(ref='Normal') / param=EFFECT;
   model FakeY = BP Chol Smoking Weight;
run;
ods exclude none;

proc iml;
/* use HEATMAPDISC call to create heat map of design */
use Design;  read all var _NUM_ into X[c=varNames];  close;
run HeatmapDisc(X) title="Effect Design Matrix"
    xvalues=varNames displayoutlines=0
    colorramp={"LightBlue" "White" "Black"};
QUIT;
```

In the resulting heat map, blue indicates that the patient's value was the reference category. White and black indicate that the patient's value was a nonreference category, with the black rectangle appearing in the column for that nonreference category. For me, this design matrix takes some practice to "read." For example, compared to the GLM matrix, it is harder to determine the most frequent levels of a categorical variable.
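For readers without SAS, the same three encodings can be reproduced in R with `model.matrix` (a rough analogue I am adding for illustration; the factor `bp` and its levels are invented, and note that R's `contr.treatment` drops the *first* level, whereas the SAS default reference level is the last):

```r
bp <- factor(c("High", "Normal", "Optimal", "High", "Normal"))

# GLM-style: one 0/1 column per level (singular design)
glm_like <- model.matrix(~ bp - 1)

# Reference coding: intercept plus k-1 dummy columns
ref <- model.matrix(~ bp, contrasts.arg = list(bp = "contr.treatment"))

# Effect coding: k-1 columns with -1/0/1 entries; rows at the
# reference level get -1 in every column
eff <- model.matrix(~ bp, contrasts.arg = list(bp = "contr.sum"))

ncol(glm_like)                      # 3 columns for a 3-level factor
sort(unique(as.vector(eff[, -1])))  # the three values -1, 0, 1
```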

In the example, I have used the HEATMAPDISC subroutine in SAS/IML to visualize the design matrices. But you can also create heat maps in Base SAS.

If you have SAS 9.4m3, you can use the HEATMAPPARM statement in PROC SGPLOT to create these heat maps. First you have to convert the data from wide form to long form, which you can do by using the following DATA step:

/* convert from wide (matrix) to long (row, col, value) */
data Long;
   set Design;
   array dummy[*] _NUMERIC_;
   do varNum = 1 to dim(dummy);
      rowNum = _N_;
      value = dummy[varNum];
      output;
   end;
   keep varNum rowNum value;
run;

proc sgplot data=Long;
   /* the observation values are in the order {1, 0, -1}; use STYLEATTRS to set colors */
   styleattrs DATACOLORS=(Black White LightBlue);
   heatmapparm x=varNum y=rowNum colorgroup=value / showxbins discretex;
   xaxis type=discrete;   /* values=(1 to 11) valuesdisplay=("A" "B" ... "J" "K"); */
   yaxis reverse;
run;

The heat map is similar to the one in the previous section, except that the columns are labeled 1, 2, 3, and so forth. If you want the columns to contain the variable names, use the VALUESDISPLAY= option, as shown in the comments.
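For comparison, the same wide-to-long reshaping can be sketched in pandas; the tiny design matrix below is a made-up stand-in for the Design data set, and the column names mirror the DATA step's variables:

```python
import pandas as pd

# A small stand-in for the Design data set: rows are observations,
# columns are dummy variables
design = pd.DataFrame({"BP_High": [1, 0, 0], "BP_Normal": [0, 1, 0]})

# Wide (matrix) to long (row, col, value), mirroring the DATA step
long_df = (design
           .reset_index()
           .rename(columns={"index": "rowNum"})
           .melt(id_vars="rowNum", var_name="varName", value_name="value"))

print(long_df.shape)   # (6, 3): 3 rows x 2 columns flattened
```

One difference from the DATA step: `melt` keeps the variable names rather than a numeric `varNum`, which sidesteps the VALUESDISPLAY= relabeling step entirely.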

If you are running an earlier version of SAS, you will need to use the Graph Template Language (GTL) to create a template for the discrete heat maps.

In summary, you can use the OUTDESIGN= option in PROC GLMSELECT to create design matrices that use dummy variables to encode classification variables. If you have SAS/IML, you can use the HEATMAPDISC subroutine to visualize the design matrix. Otherwise, you can use the HEATMAPPARM statement in PROC SGPLOT (SAS 9.4m3) or the GTL to create the heat maps. The visualization is useful for teaching and understanding the different parameterization schemes for classification variables.

The post Visualize a design matrix appeared first on The DO Loop.



The post Snap appeared first on All About Statistics.

In the grand tradition of all recent election times, I've decided to have a go at building a model that could predict the results of the upcoming snap general election in the UK. I'm sure there will be many more people having a go at this, from various perspectives and using different modelling approaches. Also, I will try very hard to

First off: the data. I think that since the announcement of the election, the pollsters have intensified the number of surveys; I have already found 5 national polls (two by YouGov, two by ICM and one by Opinium) $-$ there may be more, and I'm not claiming a systematic review/meta-analysis of the polls.

Arguably, this election will be mostly about Brexit: there will surely be other factors, but because it comes almost exactly a year after the referendum, it is a fair bet that how people felt (and still feel) about its outcome will massively influence the vote. Luckily, all the polls I have found report voting intention broken down by Remain/Leave. So, I'm considering $P=8$ main political parties:

I also have available data on the results of both the 2015 election (by constituency; again, I'm only considering the $C=632$ constituencies in England, Scotland and Wales $-$ this leaves out the 18 Northern Irish constituencies) and the 2016 EU referendum. I had to do some work to align these two datasets, as the referendum was not reported at the usual geographical resolution. I have mapped the voting areas used in 2016 to the constituencies and have recorded the proportion of votes won by each of the $P$ parties in 2015, as well as the proportion of the Remain vote in 2016.

For each observed poll $i=1,\ldots,N_{polls}$, I modelled the observed data among "$L$eavers" as $$y^{L}_{i1},\ldots,y^{L}_{iP} \sim \mbox{Multinomial}\left(\left(\pi^{L}_{1},\ldots,\pi^{L}_{P}\right),n^L_i\right).$$ Similarly, the data observed for "$R$emainers" are modelled as $$y^R_{i1},\ldots,y^R_{iP} \sim \mbox{Multinomial}\left(\left(\pi^R_{1},\ldots,\pi^R_P\right),n^R_i\right).$$

In other words, I'm assuming that within the two groups of voters, there is a vector of underlying probabilities associated with each party ($\pi^L_p$ and $\pi^R_p$) that are pooled across the polls. $n^L_i$ and $n^R_i$ are the sample sizes of each poll for $L$ and $R$.

I used a fairly standard formulation and modelled $$\pi^L_p=\frac{\phi^L_p}{\sum_{p=1}^P \phi^L_p} \qquad \mbox{and} \qquad \pi^R_p=\frac{\phi^R_p}{\sum_{p=1}^P \phi^R_p} $$ and then $$\log \phi^j_p = \alpha_p + \beta_p j$$ with $j=0,1$ to indicate $L$ and $R$, respectively. Again, using fairly standard modelling, I fix $\alpha_1=\beta_1=0$ to ensure identifiability and then model $\alpha_2,\ldots,\alpha_P \sim \mbox{Normal}(0,\sigma_\alpha)$ and $\beta_2,\ldots,\beta_P \sim \mbox{Normal}(0,\sigma_\beta)$.

This essentially fixes the "Tory effect" to 0 (if only I could
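Numerically, the multinomial-logit construction above can be sketched as follows; the $\alpha$ and $\beta$ values are invented purely to show the mechanics of the softmax under the identifiability constraint $\alpha_1=\beta_1=0$ (they are not estimates from the polls):

```python
import numpy as np

P = 4                                      # a small number of parties, for illustration
alpha = np.array([0.0, 0.5, -0.3, 0.2])    # alpha_1 fixed at 0 (the "Tory effect")
beta  = np.array([0.0, -0.4, 0.6, 0.1])    # beta_1 fixed at 0

def shares(j):
    """pi_p^j = softmax(alpha_p + beta_p * j), with j=0 for Leave, j=1 for Remain."""
    phi = np.exp(alpha + beta * j)
    return phi / phi.sum()

pi_L, pi_R = shares(0), shares(1)
# Each vector is a proper probability distribution over the P parties
```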

I then use the estimated party- and EU result-specific probabilities to compute a "relative risk" with respect to the observed overall vote at the 2015 election $$\rho^j_p = \frac{\pi^j_p}{\pi^{15}_p},$$ which essentially estimates how much better (or worse) the parties are doing in comparison to the last election, among leavers and remainers. The reason I want these relative risks is because I can then distribute the information from the current polls and the EU referendum to each constituency $c=1,\ldots,C$ by estimating the predicted share of votes at the next election as the mixture $$\pi^{17}_{cp} = (1-\gamma_c)\pi^{15}_p\rho^L_p + \gamma_c \pi^{15}_p\rho^R_p,$$ where $\gamma_c$ is the observed proportion of remain voters in constituency $c$.

Finally, I can simulate the next election by ensuring that in each constituency the estimated shares sum to 1. I do this by drawing the vote shares as $\hat{\pi}^{17}_{cp} \sim \mbox{Dirichlet}(\pi^{17}_{c1},\ldots,\pi^{17}_{cP})$.
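Putting the mixture and the Dirichlet draw together, here is a sketch of the simulation for a single constituency; the 2015 shares, relative risks and Remain proportion are all invented numbers, used only to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(42)

pi15    = np.array([0.38, 0.31, 0.08, 0.13, 0.05, 0.05])  # 2015 vote shares (made up)
rho_L   = np.array([1.20, 0.85, 0.60, 0.90, 1.10, 0.50])  # relative risks among Leavers
rho_R   = np.array([0.90, 0.95, 1.60, 1.00, 1.20, 0.40])  # ... and among Remainers
gamma_c = 0.55                                            # Remain share in constituency c

# Mixture of the re-proportioned 2015 shares, weighted by the local Remain vote
pi17 = (1 - gamma_c) * pi15 * rho_L + gamma_c * pi15 * rho_R

# Dirichlet draws enforce that each simulated set of shares sums to 1
draws = rng.dirichlet(pi17, size=1000)
```

Each row of `draws` is one simulated election result for the constituency, so summaries (means, quantiles, probability of each party winning) come straight from the sample.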

In the end, for each constituency I have a distribution of election results, which I can use to determine the average outcome, as well as various measures of uncertainty. So in a nutshell, this model is all about i) re-proportioning the 2015 votes to 2017 using the current polls and the referendum result; and ii) propagating uncertainty in the various inputs.

I'll update this model as more polls become available $-$ one extra issue then will be about discounting older polls (something like what Roberto did here and here, but I think I'll keep things easy for this). For now, I've run my model for the 5 polls I mentioned earlier and this is the (rather depressing) result.

From the current data and the modelling assumptions, it looks like the Tories are indeed on course for a landslide victory $-$ my results are also broadly in line with other predictions (eg here). The model may be flattering to the Lib Dems $-$ the polls seem to indicate almost unanimously that they will do very well in areas with a strong Remain persuasion, which means the model predicts they will gain many seats, particularly where the 2015 election was won by a small margin (and often they leapfrog Labour into first place).

The following table shows the predicted "swings" $-$ who's stealing votes from whom. Rows are the winners in 2015, columns are the predicted winners in 2017, and each cell counts constituencies:

| 2015 \ 2017 | Conservative | Green | Labour | Lib Dem | PCY | SNP |
|---|---:|---:|---:|---:|---:|---:|
| Conservative | 325 | 0 | 0 | 5 | 0 | 0 |
| Green | 0 | 1 | 0 | 0 | 0 | 0 |
| Labour | 64 | 0 | 160 | 6 | 1 | 1 |
| Liberal Democrat | 0 | 0 | 0 | 9 | 0 | 0 |
| Plaid Cymru | 0 | 0 | 0 | 0 | 3 | 0 |
| Scottish National Party | 1 | 0 | 0 | 5 | 0 | 50 |
| UKIP | 1 | 0 | 0 | 0 | 0 | 0 |

Again,

**Please comment on the article here:** **Gianluca Baio's blog**

