
There continues to be a lot of discussion on the purported increase in mortality rates among middle-aged white people in America.

Actually, the increase is among women, with not much change among men, but you don’t hear so much about this because it contradicts the “struggling white men” story that we hear so much about in the news media.

**A big fat pile of graphs**

To move things along, Jonathan Auerbach and I prepared a massive document (zipped file here; still huge) with 60 pages of graphs, showing raw data and smoothed trends in mortality rates from 1999-2014 for:

– 50 states

– men and women

– non-hispanic whites, blacks, and hispanics

– age categories 0-1, 1-4, 5-14, 15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84.

It’s amazing how much you can learn by staring at these graphs.

For example, these trends are pretty much the same in all 50 states:

But look at these:

Flat in some states, sharp increases in others, and steady decreases in other states.

The patterns are even clearer here:

**Starting point**

To get back to the question that got everything started,

here’s the story for non-Hispanic white men and women, aged 45-54:

Different things are happening in different regions—in particular, things have been getting worse for women in the South and Midwest, whereas the death rate for men in this age group has been declining during the past few years—but overall there has been little change since 1999. In contrast, as Anne Case and Angus Deaton noticed a bit over a year ago, other countries and U.S. nonwhites have seen large declines in death rates, something like 20%.

**Breaking down trends by education: it’s tricky**

In a forthcoming paper, “Mortality and morbidity in the 21st century,” Case and Deaton report big differences in trends among whites with high and low levels of education: “mortality is rising for those without, and falling for those with, a college degree.”

But the comparison of death rates by education is tricky because average education levels have been increasing over time. There’s a paper from 2015 on this topic, “Measuring Recent Apparent Declines In Longevity: The Role Of Increasing Educational Attainment,” by John Bound, Arline Geronimus, Javier Rodriguez, and Timothy Waidmann, who write:

Independent researchers have reported an alarming decline in life expectancy after 1990 among US non-Hispanic whites with less than a high school education. However, US educational attainment rose dramatically during the twentieth century; thus, focusing on changes in mortality rates of those not completing high school means looking at a different, shrinking, and increasingly vulnerable segment of the population in each year.
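The selection problem Bound et al. describe is easy to demonstrate with a toy simulation (my own illustration, with entirely hypothetical numbers, not from any of the papers above): hold every individual’s mortality risk fixed and shrink only the share of the population in the low-education group; that group’s average mortality still appears to rise, purely through selection.

```python
import math
import random

random.seed(0)
# Hypothetical setup: a fixed population with a latent "disadvantage" score
# that drives mortality risk; nobody's individual risk changes over time.
risk = sorted(random.gauss(0, 1) for _ in range(100_000))
mortality = [0.01 * math.exp(0.5 * r) for r in risk]  # toy risk model

# Only the *share* of the population without a high-school diploma shrinks,
# and the least-educated group is assumed to be the most disadvantaged.
means = []
for share in (0.40, 0.25, 0.10):
    k = int(share * len(risk))
    group = mortality[-k:]            # top-k most disadvantaged = "no diploma"
    means.append(sum(group) / k)
    print(f"no-HS share {share:.0%}: mean group mortality {sum(group) / k:.4f}")
```

The group-level rate rises across the three snapshots even though no individual’s risk changed; only the composition of the group did.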

**Breaking down trends by state**

In my paper with Jonathan Auerbach, we found big differences across regions of the country. We followed up and estimated mortality rates by state and by age group, and there are tons of interesting patterns. Again, our latest graph dump is here (zipped file here), and you can look through the graphs yourself to see what you see. The next step is to build some sort of open-ended tool that uses Stan to do smoothing for arbitrary slices of these data. Also, there are selection issues as people move between states, which is similar but not identical to the selection issues regarding education.
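As a throwaway illustration of what smoothing a raw yearly series means (this is a toy centered moving average, not the Stan-based model the post refers to, and the rates are made up):

```python
def smooth(rates, window=3):
    """Centered moving average over a yearly series; shrinks the window
    at the endpoints so the output has the same length as the input."""
    half = window // 2
    out = []
    for i in range(len(rates)):
        lo, hi = max(0, i - half), min(len(rates), i + half + 1)
        out.append(sum(rates[lo:hi]) / (hi - lo))
    return out

raw = [391.2, 388.9, 395.4, 401.0, 399.1, 405.7]  # hypothetical deaths per 100,000
print(smooth(raw))
```

A model-based smoother would also propagate uncertainty, which is the point of doing it in Stan rather than with an ad hoc filter like this.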

The post Mortality rate trends by age, ethnicity, sex, and state (link fixed) appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Mortality rate trends by age, ethnicity, sex, and state (link fixed) appeared first on All About Statistics.


I’ve just finished a major overhaul to my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants to my workshops, and how those complaints can often be mitigated. Here’s the only new section:

**The Tidyverse Curse**

There’s a common theme in many of the sections above: a task that is hard to perform using a base R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that Rdocumentation.org shows that it is the single most popular R package (as of 3/23/2017). As much of a blessing as these commands are, they’re also a curse to beginners because they add more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly.

As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (i.e., you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.

A demonstration of the mental overhead required to use tidyverse functions involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let’s look at an example using the built-in mtcars data set and R’s built-in print function:

```
> print(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...
```

We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I’m taking mtcars and sending it using the pipe operator “%>%” into dplyr’s as_data_frame function to convert it to a special type of tidyverse data frame called a “tibble” which prints better. From there I send it to the print function (that’s R’s default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R’s default print method. It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision (<dbl>).

```
> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   print()
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
*  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
```

The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_column() function:

```
> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
Error in function_list[[i]](value) :
  could not find function "rownames_to_column"
```

Oops, I got an error message: the function wasn’t found. I remembered the right command, and using the dplyr package did cause the car names to vanish, but the solution is in the tibble package, which I “forgot” to load. So let’s load that too (dplyr is already loaded, but I’m listing it again here just to make each example stand alone):

```
> library("dplyr")
> library("tibble")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
# A tibble: 32 × 12
             rowname   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
               <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1          Mazda RX4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2      Mazda RX4 Wag  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3         Datsun 710  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4     Hornet 4 Drive  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5  Hornet Sportabout  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6            Valiant  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7         Duster 360  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8          Merc 240D  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9           Merc 230  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10          Merc 280  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
```

Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.

In the above output, the row names are back! What if we now decided to save the data for use with a function that would automatically display row names? It would not find them, because they’re now stored in a *variable* called rowname, not in the row names position! Therefore, we would need to use either the built-in row.names function or the tibble package’s column_to_rownames function to restore the names to their previous position.

Most other data science software requires row names to be stored in a standard variable, e.g. rowname. You then supply its name to procedures with something like SAS’ “ID rowname;” statement. That’s less to learn.
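For readers coming from Python, pandas draws the same line between an index (its version of row names) and ordinary columns, and the round trip looks much like rownames_to_column/column_to_rownames. A small sketch (the tiny data frame here is made up to stand in for mtcars):

```python
import pandas as pd

# "Row names" live in the index in pandas.
df = pd.DataFrame({"mpg": [21.0, 22.8]}, index=["Mazda RX4", "Datsun 710"])

# reset_index() plays the role of rownames_to_column(): the index
# becomes an ordinary column (renamed here to "rowname")...
flat = df.reset_index().rename(columns={"index": "rowname"})

# ...and set_index() plays the role of column_to_rownames(): the
# column goes back to being the index.
back = flat.set_index("rowname")
```

The same mental overhead applies: you have to remember which position the names are currently in before handing the data to the next function.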

This isn’t a defect of the tidyverse; it’s the result of an architectural decision on the part of the original language designers, and it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.

Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:

```
> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   print()
# A tibble: 1 × 1
                                                          lyrics
                                                           <chr>
1 a long long time ago i can still remember how that music used
```

The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the as.data.frame function:

```
> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   as.data.frame() %>%
+   print()
                                                          lyrics
1 a long long time ago i can still remember how that music used to make me smile and i knew if i had my chance that i could make those people dance and maybe theyd be happy for a while but february made me shiver with every paper id deliver bad news on the doorstep i couldnt take one more step i cant remember if i cried ...
```

These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort.

**Please comment on the article here:** **r4stats.com**

The post The Tidyverse Curse appeared first on All About Statistics.


Michael Bales and his associates at Cornell are working on a new visual tool for citations data. This is an area that is ripe for some innovation. There is a lot of data available, but it seems difficult to gain insights from it. The prototypical question is how authoritative a particular researcher or research group is, judging from his or her or their publications.

A proxy for "quality" is the number of times a paper is cited by others. More sophisticated metrics take into account the quality of the researchers who cite one's work. There are various summary statistics, e.g. the h-index, that attempt to capture the data distribution, but reducing it to a single number may remove too much context.
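For concreteness, the h-index mentioned above is the largest h such that the researcher has h papers each cited at least h times. A minimal implementation (the citation counts below are made up):

```python
def h_index(citations):
    """Largest h such that the author has h papers with >= h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(cites, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # the 4th-most-cited paper has 4 citations
```

Note how much the single number throws away: an author with citations [100, 100, 4, 4] and one with [5, 5, 4, 4] both get h = 4, which is exactly the loss of context the post is worried about.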

Contextual information is very important for interpretation: certain disciplines might enjoy higher average numbers of citations because researchers tend to list more references, or because papers typically have large numbers of co-authors; individual researchers may have a few influential papers, a lot of rarely cited papers, or anything in between.

A good tool should be able to address a number of such problems.

Michael is a former student who attended the Data Visualization workshop at NYU (syllabus here), and the class spent some time discussing his citation impact tool. He contacted me to let me know that what we did during the workshop has now reached the research conferences.

Here is a wireframe of the visual form we developed:

This particular chart shows the evolution of the citations data over three time periods for a specific sub-field of study. The vertical scale is a percentile ranking based on some standard used in the citations industry. We grouped the data into deciles (and within each decile, into thirds) to facilitate understanding. The median rank is highlighted: we can see that in this sub-field, the publications have increased in both quantity and quality, with the median rank showing improvement over the three time periods. Because "review articles" are interpreted differently by some, those are highlighted in purple.

One of the key strengths of this design is the filter mechanism shown on the right. The citations researcher can customize comparisons. This is really important because the citations data are meaningless by themselves; they only attain meaning when compared to peer groups.

Here is an even rougher sketch of the design:

For a single researcher, this view will list all of his or her papers, ordered by each paper's percentile rank, with review papers given a purple color.

The entire VIVO dashboard project by Weill Cornell Medicine has a GitHub page, but the citation impact tool does not seem to be there at the moment.

Michael tells me the citation impact tool is found here.

**Please comment on the article here:** **Junk Charts**

The post Visualizing citation impact appeared first on All About Statistics.


I recently got back from Strata West 2017 (where I ran a very well-received workshop on `R` and `Spark`). One thing that really stood out for me at the exhibition hall was `Bokeh` plus `datashader` from Continuum Analytics. I had the privilege of having Peter Wang himself demonstrate `datashader` for me and answer a few of my questions.

I am so excited about `datashader`'s capabilities that I literally *will not wait* for the functionality to be exposed in `R` through `rbokeh`. I am going to leave my usual `knitr`/`rmarkdown` world and dust off `Jupyter Notebook` just to use `datashader` plotting. This is worth trying, even for diehard `R` users.

Every plotting system has two important ends: the *grammar* where you specify the plot, and the *rendering pipeline* that executes the presentation. Switching plotting systems means switching how you specify plots and can be unpleasant (this is one of the reasons we wrap our most re-used plots in WVPlots to hide or decouple how the plots are specified from the results you get). Given the convenience of the ggplot2 grammar, I am always reluctant to move to other plotting systems unless they bring me something big (and even then sometimes you don’t have to leave: for example, the absolutely amazing adapter `plotly::ggplotly`).

Currently, to use `datashader` you must talk directly to `Python` and `Bokeh` (i.e., learn a different language). But what that buys you is massive: in-pixel analytics. Let me clarify that.

`datashader` makes points and pixels first class entities in the graphics rendering pipeline. It admits they exist (many plotting systems render to an imaginary infinite-resolution abstract plane) and allows the user to specify scale-dependent calculations and re-calculations over them. It is easiest to show by example.

Please take a look at these stills from the `datashader` US Census example. We can ask pixels to be colored by the majority race in the region of Lake Michigan:

If we were to use the interactive version of this graph we could zoom in on Chicago and the majorities are re-calculated based on the new scale:

What is important to understand is that is this is vastly more powerful than zooming in on a low-resolution rendering:

and even more powerful than zooming out on a static high-resolution rendering:

`datashader` can redo aggregations and analytics on the fly. It can recompute histograms and renormalize them relative to what is visible to maintain contrast. It can find patterns that emerge as we change scale: think of zooming in on a grey pixel that resolves into a black and white checkerboard.
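To make "re-calculations over pixels" concrete, here is a toy sketch of the core aggregation step (my own illustration, not the `datashader` API): bin the visible points into a pixel grid and count per pixel, then re-bin from scratch whenever the viewport changes.

```python
from collections import Counter

def aggregate(points, x0, x1, y0, y1, nx, ny):
    """Count points per pixel over the visible window (a toy per-pixel pass)."""
    counts = Counter()
    for x, y in points:
        if x0 <= x < x1 and y0 <= y < y1:
            px = int((x - x0) / (x1 - x0) * nx)
            py = int((y - y0) / (y1 - y0) * ny)
            counts[(px, py)] += 1
    return counts

points = [(0.50, 0.50), (0.52, 0.52), (0.90, 0.90)]
coarse = aggregate(points, 0.0, 1.0, 0.0, 1.0, 4, 4)      # zoomed out
fine = aggregate(points, 0.45, 0.55, 0.45, 0.55, 10, 10)  # zoomed in: re-aggregate
```

Zoomed out, the first two points land in the same pixel and its count is 2 (the "grey pixel"); zoomed in, re-aggregation over the new window separates them into two pixels of count 1 each (the "checkerboard"). That re-computation at the new scale is what a static image can never do.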

I am going to share a simple `datashader` example here. Again, to see the full effect you would have to copy it into a `Jupyter` notebook and run it. But I will use it to show my point.

After going through the steps to install `Anaconda` and `Jupyter Notebook` (plus some more `conda install` steps to include necessary packages) we can make a plot of the `ggplot2` example data set `diamonds`.

`ggplot2` renderings of `diamonds` typically look like the following (and show off the power and convenience of the grammar):

A `datashader` rendering looks like the following:

If we use the interactive rectangle selector to zoom in on the apparently isolated point around $18300 and 3.025 carats we get the following dynamic re-render:

Notice the points shrunk (and didn’t subdivide) and there are some extremely faint points. There is something wrong with that as a presentation; but it isn’t because of `datashader`! It is something unexpected in the data which is now jumping out at us.

`datashader` is shading proportional to aggregated count. So the small point staying very dark (and being so dark it causes other points to render near transparent) means there are multiple observations in this tiny neighborhood. Going back to `R` we can look directly at the data:

```
> library("dplyr")
> diamonds %>% filter(carat>=3, carat<=3.05, price>=18200, price<=18400)
# A tibble: 5 × 10
  carat     cut color clarity depth table price     x     y     z
  <dbl>   <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  3.01 Premium     I     SI2  60.2    59 18242  9.36  9.31  5.62
2  3.01    Fair     I     SI2  65.8    56 18242  8.99  8.94  5.90
3  3.01    Fair     I     SI2  65.8    56 18242  8.99  8.94  5.90
4  3.01    Good     I     SI2  63.9    60 18242  9.06  9.01  5.77
5  3.01    Good     I     SI2  63.9    60 18242  9.06  9.01  5.77
```

There are actually 5 rows with the exact carat and pricing indicated by the chosen point. The point stood out at fine scale because it indicated something subtle in the data (repetitions) that the analyst may not have known about or expected. The “ugly” presentation was an important warning. This is hands-on work with the data: the quickest path to correct results.

For some web browsers, you don’t always see proper scaling, yielding artifacts like the following:

The `Jupyter` notebooks always work, and web browsers usually work (I am assuming it is security or ad-blocking that is causing the effect, not a `datashader` issue).

`datashader` brings resolution-dependent, per-pixel analytics to production. This is a very powerful style of interaction that is going to appear in more and more places. This is something that the Continuum Analytics team has written about before, and it requires some interesting cross-compiling (Numba) to implement at scale. Now that analysts have seen this in action, they are going to want it and ask for it.

**Please comment on the article here:** **Statistics – Win-Vector Blog**

The post Datashader is a big deal appeared first on All About Statistics.


I received the following email, entitled “A research lead (potentially bigger than the opioid epidemic),” from someone who wishes to remain anonymous:

My research lead is related to the use of psychotropic medications in Alzheimer’s patients. I should note that strong cautions have already been issued with respect to the use of these medications in the elderly (e.g. https://www.mind.uci.edu/alzheimers-disease/articles-of-interest/medications-to-avoid-for-patients/). As a practical matter, however, at present agitated and aggressive behaviors are considered a common symptom of advanced Alzheimer’s and therefore the use of these drugs as chemical restraints is common even by the most conservative physicians. Furthermore, as the medical profession is very diverse many physicians have not updated their practice and are far from cautious in prescribing these medications to the elderly. Indeed, chemical restraints are the norm in nursing home practice, so it is my belief that psychotropic medications are ubiquitous in nursing homes.

A family member is an active medical professional (a physician assistant who works as a front-line health care provider in a family practice office in rural America), so I have some insight into how a conservative medical practitioner will behave. I have observed two things: first, when a symptom is in the list of potential symptoms for a disease that has been diagnosed in the patient, there is a strong presumption that the symptom is caused by the disease. Second, when a symptom is a side effect of a medication (particularly one prescribed by another doctor), there is a presumption that the need for the medication outweighs the side effect.

A combination of close familiarity with my mother’s symptoms/behavior on and off a variety of drugs and my knowledge of the ubiquitous bank gaming of regulatory controls in financial markets leads me to wonder whether the drug companies aren’t playing the same kind of game. In particular the addition of “behavioral and psychological symptoms of dementia” (BPSD) to the criteria for the diagnosis of late-stage Alzheimer’s is as far as I can tell of very recent vintage. (I believe they were introduced with the DSM-5 in 2013.) BPSD are symptoms treated by psychotropic medications. These same medications are also commonly used to treat mild sleep and anxiety disorders in the general population.

The problem with all the psychotropic medications is that they are used to treat the same behaviors that they can also cause as side effects, including irritability, anxiety, “disinhibition” which maps into a willingness to hit out at or behave abusively to others, aggressiveness, and self-harm. In sufficiently high doses, however, the patient is heavily sedated and they are very effective chemical restraints.

My suspicion is that the introduction of BPSD into the definition of common symptoms of Alzheimer’s has developed as a result of the ubiquitous use of psychotropic medications in this population. That is, as far as I can tell the studies that have found BPSD to be common in Alzheimer’s are population-based studies that did not control for the use of medications that have as side effects BPS behaviors. Successfully bringing BPSD into the clinical definition of Alzheimer’s is hugely profitable for the drug companies that now have physicians’ biases – to attribute symptoms to the disease that has already been diagnosed, rather than to the drug that may cause it as a side effect – working on their side.

In short, what I would really like to see is a careful statistician’s review of the studies that find that BPSD are common symptoms of Alzheimer’s and an analysis of whether sufficient controls for the use of medicines that have as side effects the same symptoms have been implemented. (With psychotropic medications, the length of use is an important factor because they build up in the system. Thus long-term use has different effects from short-term use. Long-term prescriptions are not a good proxy for long-term use, since refill/renewal of shorter term prescriptions is common practice.)

I don’t know anything about this, but I ran it by an expert on eldercare who said that, yes, this is a big deal. So I’m posting this in the hope that someone will look into it more carefully. As you can see from the discussion above, there are statistical entry points to this one, and it touches on some interactions between causal inference and decision analysis.

The post This could be a big deal: the overuse of psychotropic medications for advanced Alzheimer’s patients appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post This could be a big deal: the overuse of psychotropic medications for advanced Alzheimer’s patients appeared first on All About Statistics.


Prior to SAS/IML 14.2, every variable in the Interactive Matrix Language (IML) represented a matrix. That changed when SAS/IML 14.2 introduced two new data structures: data tables and lists. This article gives an overview of data tables. I will blog about lists in a separate article.

A matrix is a rectangular array that contains all numerical or all character values. Numerical matrices are incredibly useful for computations because linear algebra provides a powerful set of tools for implementing analytical algorithms. However, a matrix is somewhat limiting as a data structure. Matrices are two-dimensional, rectangular, and cannot contain mixed-type data (numeric AND character). Consequently, you can't use one single matrix to pass numeric and character data to a function.

Data tables in SAS/IML are in-memory versions of a data set. They contain columns that can be numeric or character, as well as column attributes such as names, formats, and labels. A data table is associated with a single symbol and can be passed to modules or returned from a module. You can create a data table from a SAS data set by using the TableCreateFromDataSet function, as shown:

```
proc iml;
tClass = TableCreateFromDataSet("Sashelp", "Class");   /* SAS/IML 14.2 */
```

The function reads the data from the Sashelp.Class data set and creates an in-memory copy. You can use the `tClass` symbol to access properties of the table. For example, if you want to obtain the names of the columns in the table, you can use the TableGetVarName function:

```
varNames = TableGetVarName(tClass);
print varNames;
```

Data tables are not matrices. You cannot add, subtract, or multiply with tables. When you want to compute something, you need to extract the data into matrices. For example, if you want to compute the body-mass index (BMI) of the students in Sashelp.Class, you can use the TableAddVar function to add the BMI as a new column in the table:

```
Y = TableGetVarData(tClass, {"Weight" "Height"});
wt = Y[,1];  ht = Y[,2];               /* get Weight and Height variables */
BMI = wt / ht##2 * 703;                /* BMI formula */
call TableAddVar(tClass, "BMI", BMI);  /* add new "BMI" column to table */
```

As indicated earlier, you can use data tables to pass mixed-type data into a user-defined function. For example, the following statements define a module whose argument is a data table. The module prints the mean value of the numeric columns in the table, and it prints the number of unique levels for character columns. To do so, it first extracts the numeric data into a matrix, then later extracts the character data into a matrix.

```
start QuickSummary(tbl);
   type = TableIsVarNumeric(tbl);               /* 0/1 vector */
   /* for numeric columns, print mean */
   idx = loc(type=1);                           /* numeric cols */
   if ncol(idx)>0 then do;                      /* there is a numeric col */
      varNames = TableGetVarName(tbl, idx);     /* get names */
      m = TableGetVarData(tbl, idx);            /* extract numeric data */
      mean = mean(m);
      print mean[colname=varNames L="Mean of Numeric Variables"];
   end;
   /* for character columns, print number of levels */
   idx = loc(type=0);                           /* character cols */
   if ncol(idx)>0 then do;                      /* there is a character col */
      varNames = TableGetVarName(tbl, idx);     /* get names */
      m = TableGetVarData(tbl, idx);            /* extract character data */
      levels = countunique(m, "col");
      print levels[colname=varNames L="Levels of Character Variables"];
   end;
finish;

run QuickSummary(tClass);
```
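For comparison, the same quick-summary idea written against a pandas DataFrame (this is an analogy I am drawing, not part of SAS/IML; the tiny data set is made up):

```python
import pandas as pd

def quick_summary(df):
    """Mean of numeric columns and number of unique levels of the
    character (object) columns -- mirroring the QuickSummary module above."""
    num = df.select_dtypes(include="number")
    char = df.select_dtypes(exclude="number")
    return num.mean(), char.nunique()

df = pd.DataFrame({"Name": ["Al", "Bo", "Al"], "Age": [12, 13, 14]})
means, levels = quick_summary(df)
print(means)
print(levels)
```

The point of the comparison: a mixed-type table lets one function handle both kinds of columns, which is exactly what a single SAS/IML matrix cannot do.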

SAS/IML 14.2 supports data tables, which are rectangular arrays of mixed-type data. You can use built-in functions to extract columns from a table, add columns to a table, and query the table for attributes of columns. For more information about data tables, see the chapter

The post Data tables: Nonmatrix data structures in SAS/IML appeared first on The DO Loop.

**Please comment on the article here:** **The DO Loop**

The post Data tables: Nonmatrix data structures in SAS/IML appeared first on All About Statistics.

The post Sorting correlation coefficients by their magnitudes in a SAS macro appeared first on All About Statistics.

Many statisticians and data scientists use the **correlation coefficient** to study the relationship between 2 variables. For 2 random variables, X and Y, the correlation coefficient between them is defined as their covariance scaled by the product of their standard deviations. Algebraically, this can be expressed as

ρ(X, Y) = Cov(X, Y) / (σ_X · σ_Y).

In real life, you can never know what the true correlation coefficient is, but you can estimate it from data. The most common estimator for ρ is the **Pearson correlation coefficient**, which is defined as the sample covariance between X and Y divided by the product of their sample standard deviations. Since there is a common factor of 1/(n − 1) in the numerator and the denominator, they cancel each other out, so the formula simplifies to

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² ).

In predictive modelling, you may want to find the covariates that are most correlated with the response variable before building a regression model. You can do this by

- computing the correlation coefficients
- obtaining their absolute values
- sorting them by their absolute values.
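Outside of SAS, those three steps (compute the coefficients, take absolute values, sort) can be sketched in plain Python; the helper names `pearson_r` and `corr_sort` and the toy data below are my own, for illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = (math.sqrt(sum((xi - mx) ** 2 for xi in x))
           * math.sqrt(sum((yi - my) ** 2 for yi in y)))
    return num / den

def corr_sort(target, covariates):
    """Return (name, r, |r|) tuples sorted by |r|, descending."""
    rows = [(name, pearson_r(target, col)) for name, col in covariates.items()]
    return sorted(((name, r, abs(r)) for name, r in rows),
                  key=lambda row: row[2], reverse=True)

# Toy data: y moves perfectly with x1 and loosely (negatively) with x2
y  = [1, 2, 3, 4, 5]
x1 = [2, 4, 6, 8, 10]
x2 = [5, 3, 4, 2, 3]
ranked = corr_sort(y, {"x1": x1, "x2": x2})   # x1 ranks first
```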

Since starting my new job at Environics Analytics last year, I have noticed that my co-workers like to perform univariate correlational analysis as the first phase in regression modelling, and that is a sensible thing to do. I have written the following SAS macro, “corr_sort()”, to facilitate this process.

There are 4 arguments in this macro:

- The name of the data set with the variables for correlational analysis
- __One__ target variable
  - PROC CORR can handle multiple target variables, but I have to sort the absolute values of the correlation coefficients for one chosen variable, so I have written the macro to handle only one target variable.
- The covariates
  - Multiple covariates can be used, as demonstrated below in the example. Just separate them by a space.
- Your chosen name for the output data set
**It’s important to emphasize that there is no distinction between “target” and “covariate” in correlational analysis – there is no directionality between the 2 sets of variables to be correlated.** However, in PROC CORR, there is a need to specify one set of variables in the VAR statement and another set of variables in the WITH statement, so I have chosen the names “target” and “covariates” in my macro.

Here is the code for the macro. If you wish to use it in your own SAS script, you can invoke it by using the “%include” statement.

```
%macro corr_sort(ds, target, covariates, output_name);
   * get the Pearson correlation coefficients between the target and the covariates;
   * export them into a data set called "c1";
   * suppress the printing of any output;
   ods select none;
   proc corr data = &ds;
      var &target;
      with &covariates;
      ods output pearsoncorr = c1;
   run;
   ods select all;

   * give proper labels to the results;
   * drop the number of cases for each target;
   data c2;
      set c1;
      abs_corr = abs(&target);
      label abs_corr = 'Absolute Value of Correlation';
      label &target = 'Pearson Correlation';
      rename &target = corr;
      label p&target = 'P-Value';
      rename p&target = PValue;
   run;

   * sort the results by the absolute value of the correlation coefficients;
   proc sort data = c2 out = &output_name;
      by descending abs_corr;
   run;
%mend;
```

Here is an example of using this macro with the CLASS data set in the SASHELP library that all SAS users can access.

```
%corr_sort(sashelp.class, age, height weight, corrs);

proc print data = corrs;
run;
```

Here is the output as seen in the Results Viewer in SAS.

**Correlation with age**

| Obs | Variable | corr    | PValue | abs_corr |
|-----|----------|---------|--------|----------|
| 1   | Height   | 0.81143 | <.0001 | 0.81143  |
| 2   | Weight   | 0.74089 | 0.0003 | 0.74089  |

Filed under: Applied Statistics, Data Analysis, Descriptive Statistics, Mathematical Statistics, Predictive Modelling, SAS Programming, Statistics, Tutorials Tagged: correlation, macro, pearson correlation, pearson correlation coefficient, predictive modelling, PROC CORR, regression, regression modelling, SAS, sas macro

**Please comment on the article here:** **Statistics – The Chemical Statistician**

The post Sorting correlation coefficients by their magnitudes in a SAS macro appeared first on All About Statistics.

Check out the fascinating and creative new paper, "Myopia and Discounting", by Xavier Gabaix and David Laibson.

From their abstract (slightly edited):

We assume that perfectly patient agents estimate the value of future events by generating noisy, unbiased simulations and combining those signals with priors to form posteriors. These posterior expectations exhibit as-if discounting: agents make choices as if they were maximizing a stream of known utils weighted by a discount function. This as-if discount function reflects the fact that estimated utils are a combination of signals and priors, so average expectations are optimally shaded toward the mean of the prior distribution, generating behavior that partially mimics the properties of classical time preferences. When the simulation noise has variance that is linear in the event's horizon, the as-if discount function is hyperbolic.

Among other things, then, they provide a rational foundation for the "myopia" associated with hyperbolic discounting.

Note that in the Gabaix-Laibson environment everything depends on how forecast error variance behaves as a function of forecast horizon \(h\). But we know a lot about that. For example, in linear covariance-stationary \(I(0)\) environments, optimal forecast error variance grows with \(h\) at a decreasing rate, approaching the unconditional variance from below. Hence it cannot grow linearly with \(h\), which is what produces hyperbolic as-if discounting. In contrast, in non-stationary \(I(1)\) environments, optimal forecast error variance does grow linearly with \(h\), delivering hyperbolic as-if discounting.
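To make the contrast concrete, here is a small numerical check, a Python sketch of my own (not from the paper), using an AR(1) process for the stationary case and a random walk for the \(I(1)\) case:

```python
# h-step-ahead forecast error variance (FEV), unit innovation variance:
#   AR(1) with |phi| < 1:  FEV(h) = sum_{j=0}^{h-1} phi^(2j)  (bounded, concave in h)
#   random walk, I(1):     FEV(h) = h                          (exactly linear in h)
def ar1_fev(phi, h):
    return sum(phi ** (2 * j) for j in range(h))

def rw_fev(h):
    return h

phi = 0.8
ar1 = [ar1_fev(phi, h) for h in range(1, 21)]
rw  = [rw_fev(h) for h in range(1, 21)]

# Stationary case: increments shrink (growth at a decreasing rate) ...
incr = [b - a for a, b in zip(ar1, ar1[1:])]
assert all(later < earlier for earlier, later in zip(incr, incr[1:]))
# ... and the variance stays below its limit 1 / (1 - phi**2),
assert all(v < 1 / (1 - phi ** 2) for v in ar1)
# while the random-walk variance grows exactly linearly without bound.
assert all(b - a == 1 for a, b in zip(rw, rw[1:]))
```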

**Please comment on the article here:** **No Hesitations**

The post Forecasting and "As-If" Discounting appeared first on All About Statistics.

The post Some natural solutions to the p-value communication problem—and why they won’t work appeared first on Statistical Modeling, Causal Inference, and Social Science.

The post Some natural solutions to the p-value communication problem—and why they won’t work appeared first on All About Statistics.

Blake McShane and David Gal recently wrote two articles (“Blinding us to the obvious? The effect of statistical training on the evaluation of evidence” and “Statistical significance and the dichotomization of evidence”) on the misunderstandings of p-values that are common even among supposed experts in statistics and applied social research.

The key misconception has nothing to do with tail-area probabilities or likelihoods or anything technical at all, but rather with the use of significance testing to finesse real uncertainty.

As John Carlin and I write in our discussion of McShane and Gal’s second paper (to appear in the Journal of the American Statistical Association):

Even authors of published articles in a top statistics journal are often confused about the meaning of p-values, especially by treating 0.05, or the range 0.05–0.15, as the location of a threshold. The underlying problem seems to be deterministic thinking. To put it another way, applied researchers and also statisticians are in the habit of demanding more certainty than their data can legitimately supply. The problem is not just that 0.05 is an arbitrary convention; rather, even a seemingly wide range of p-values such as 0.01–0.10 cannot serve to classify evidence in the desired way.

In our article, John and I discuss some natural solutions that won’t, on their own, work:

– Listen to the statisticians, or clarity in exposition

– Confidence intervals instead of hypothesis tests

– Bayesian interpretation of one-sided p-values

– Focusing on “practical significance” instead of “statistical significance”

– Bayes factors

You can read our article for the reasons why we think the above proposed solutions won’t work.

From our summary:

We recommend saying No to binary conclusions . . . resist giving clean answers when that is not warranted by the data. . . . It will be difficult to resolve the many problems with p-values and “statistical significance” without addressing the mistaken goal of certainty which such methods have been used to pursue.

**P.S.** Along similar lines, Stephen Jenkins sends along the similarly-themed article, “‘Sing Me a Song with Social Significance’: The (Mis)Use of Statistical Significance Testing in European Sociological Research,” by Fabrizio Bernardi, Lela Chakhaia, and Liliya Leopold.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**

The post Some natural solutions to the p-value communication problem—and why they won’t work appeared first on All About Statistics.

Via Twitter, I received this chart:

My readers are nailing it when it comes to finding charts that deserve close study. On Twitter, the conversation revolved around the inversion of the horizontal axis. Favorability is associated with positive numbers, and unfavorability with negative numbers, and so it seems the natural ordering should be to place Favorable on the right and Unfavorable on the left.

Ordinarily, I'd have a problem with the inversion but here, the designer used the red-orange color scheme to overcome the potential misconception. It's hard to imagine that orange would be the color of disapproval, and red, of approval!

I am more concerned about a different source of confusion. Take a look at the following excerpt:

If you had to guess, what are the four levels of favorability? Using the same positive-negative scale discussed above, most of us will assume that, going left to right, we are looking at Strongly Unfavorable, Unfavorable, Favorable, Strongly Favorable. The people in the middle are neutrals and the people on the edges are extremists.

But we'd be mistaken. The order going left to right is Favorable, Strongly Favorable, Strongly Unfavorable, Unfavorable. The designer again used tints and shades to counter our preconception. This is less successful because the order defies logic. It is a double inversion.

The other part of the chart I'd draw attention to is the column of data printed on the right. Each such column is an act of giving up - the designer admits he or she couldn't find a way to incorporate that data into the chart itself. It's like a footnote in a book. The problem arises because such a column frequently contains very important information. On this chart, the data are "net favorable" ratings, the proportion of Favorables minus the proportion of Unfavorables, or visually, the length of the orange bar minus the length of the red bar.

The net rating is a succinct way to summarize the average sentiment of the population. But it's been banished to a footnote.
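That net rating is simple arithmetic: the sum of the favorable shares minus the sum of the unfavorable shares. A trivial sketch (the percentages below are invented for illustration, not taken from the poll):

```python
def net_favorable(strongly_fav, fav, unfav, strongly_unfav):
    """Net favorable rating in percentage points:
    all favorables minus all unfavorables."""
    return (strongly_fav + fav) - (unfav + strongly_unfav)

# Hypothetical politician: 25% strongly favorable, 32% favorable,
# 20% unfavorable, 15% strongly unfavorable (8% no opinion)
rating = net_favorable(25, 32, 20, 15)   # 57 - 35 = 22 points
```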

***

Anyone who has followed American politics even a little in recent years recognizes the worsening polarization of opinions. A chart showing the population average is thus rather meaningless. I'd like to see the above chart broken up by party affiliation (Republican, Independent, Democrat).

This led me to the original source of the chart. It turns out that the data came from a Fox News poll but the chart was not produced by Fox News - it accompanied this *Washington Post* article. Further, the article contains three other charts, broken out by party affiliation, as I hoped. The headline of the article was "Bernie Sanders remains one of the most popular politicians..."

But reading three charts, printed vertically, is not the simplest matter. One way to make it easier is to gift the chart a purpose. It turns out there are no surprises among the Republican and Democratic voters - they are as polarized as one can imagine. So the real interesting question in this data is the orientation of the Independent voters - are they more likely to side with Democrats or Republicans?

Good housekeeping means that when you acquire stuff, you must remove other stuff. After adding the party dimension, it makes more sense to collapse the favorability dimension - precisely by using the net favorable rating column:

**Please comment on the article here:** **Junk Charts**

The post Your charts need the gift of purpose appeared first on All About Statistics.
