This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
North Carolina is a very diverse state – especially when it comes to outdoor recreation opportunities. This weekend you could go hiking or kayaking in the mountains, watch a hot air balloon festival near Raleigh, and go wind surfing or fishing at the coast. And if you’ve got your SCUBA […]
The post Where to SCUBA dive in North Carolina appeared first on SAS Learning Post.
Elizabeth is courageous. Scoliosis since birth, corrective spinal surgery replaced her spine with steel, tripping on stairs permanently broke her right ankle. Then she decided to come take yoga with me. To help ease back pain & reduce hip stress, I offered options like bent legs not cross. In class […]
The post Take that SAS option appeared first on SAS Learning Post.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
According to Hyndman and Fan (“Sample Quantiles in Statistical Packages,” TAS, 1996), there are nine definitions of sample quantiles that commonly appear in statistical software packages. Hyndman and Fan identify three definitions that are based on rounding and six methods that are based on linear interpolation. This blog post shows how to use SAS to visualize and compare the nine common definitions of sample quantiles. It also compares the default definitions of sample quantiles in SAS and R.
Suppose that a sample has N observations that are sorted so that x[1] ≤ x[2] ≤ … ≤ x[N], and suppose that you are interested in estimating the p_th quantile (0 ≤ p ≤ 1) for the population. Intuitively, the data values near x[j], where j = floor(Np), are reasonable values to use to estimate the quantile. For example, if N=10 and you want to estimate the quantile for p=0.64, then j = floor(Np) = 6, so you can use the sixth ordered value (x[6]) and maybe other nearby values to estimate the quantile.
Hyndman and Fan (henceforth H&F) note that the quantile definitions in statistical software have three properties in common. In particular, each definition estimates a quantile by using at most two adjacent order statistics. Thus a general formula for quantile estimates is
q = (1 – λ) x[j] + λ x[j+1], where λ and j depend on the values of p, N, and a method-specific parameter m.
You can read Hyndman and Fan (1996) for details or see the Wikipedia article about quantiles for a summary.
The Wikipedia article points out a practical consideration: for values of p that are very close to 0 or 1, some definitions need to be slightly modified. For example, if p < 1/N, the quantity Np < 1, and so j = floor(Np) equals 0, which is an invalid index. The convention is to return x[1] when p is very small and return x[N] when p is very close to 1.
SAS has built-in support for five of the quantile definitions, notably in PROC UNIVARIATE, PROC MEANS, and in the QNTL subroutine in SAS/IML. You can use the QNTLDEF= option to choose from the five definitions. The following table associates the five QNTLDEF= definitions in SAS with the corresponding definitions from H&F, which are also used by R. In R you choose the definition by using the type parameter in the quantile function.

QNTLDEF= (SAS)     Type (H&F and R)
1                  4
2                  3
3                  1
4                  6
5 (default)        2
It is straightforward to write a SAS/IML function to compute the other four definitions in H&F. In fact, H&F present the quantile interpolation functions as specific instances of one general formula that contains a parameter, which they call m. As mentioned above, you can also define a small value c (which depends on the method) such that the method returns x[1] if p < c, and the method returns x[N] if p ≥ 1 – c.
H&F also give the parameter values for computing the four sample quantile definitions that are not natively supported in SAS, which are types 5, 7, 8, and 9 in their numbering.
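To make the general formula concrete, here is a short sketch in Python rather than SAS/IML (my own cross-check, not code from the post; `hf_quantile` is an invented name, and the m values follow the summary in the Wikipedia article mentioned above):

```python
import math

# method-specific offset m for the interpolation-based H&F definitions:
# h = Np + m,  j = floor(h),  lambda = h - j
M = {
    4: lambda p: 0.0,              # h = Np
    5: lambda p: 0.5,              # h = Np + 1/2
    6: lambda p: p,                # h = (N+1)p
    7: lambda p: 1 - p,            # h = (N-1)p + 1 (the R default)
    8: lambda p: (p + 1) / 3,      # approximately median-unbiased
    9: lambda p: p / 4 + 3 / 8,    # approximately unbiased for normal data
}

def hf_quantile(x, p, qtype):
    """General H&F interpolation formula for types 4-9."""
    x = sorted(x)
    n = len(x)
    h = n * p + M[qtype](p)
    j = math.floor(h)
    lam = h - j
    if j < 1:                      # p too small: return the minimum
        return x[0]
    if j >= n:                     # p too large: return the maximum
        return x[-1]
    return (1 - lam) * x[j - 1] + lam * x[j]   # 1-based x[j] and x[j+1]

data = [0, 1, 1, 1, 2, 2, 2, 4, 5, 8]
print(round(hf_quantile(data, 0.9, 7), 10))   # -> 5.3
```

For the ten-point example used later in this post, the type 7 (R default) estimate of the 0.9 quantile interpolates between x[9]=5 and x[10]=8 to give 5.3.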
You can download the SAS program that shows how to compute sample quantiles and graphs for any of the nine definitions in H&F.
The differences between the definitions are most evident for small data sets and when there is a large “gap” between adjacent data values. The following panel of graphs shows the nine sample quantile methods for a data set that has 10 observations, {0 1 1 1 2 2 2 4 5 8}. Each cell in the panel shows the quantiles for p = 0.001, 0.002, …, 0.999. The bottom of each cell is a fringe plot that shows the six unique data values.
In these graphs, the horizontal axis represents the data and quantiles. For any value of x, the graph estimates the cumulative proportion of the population that is less than or equal to x. Notice that if you turn your head sideways, you can see the quantile function, which is the inverse function that estimates the quantile for each value of the cumulative probability.
You can see that although the nine quantile functions have the same basic shape, the first three methods estimate quantiles by using a discrete rounding scheme, whereas the other methods use a continuous interpolation scheme.
You can use the same data to compare methods. Instead of plotting each quantile definition in its own cell, you can overlay two or more methods. For example, by default, SAS computes sample quantiles by using the type=2 method (QNTLDEF=5), whereas R uses type=7 by default. The following graph overlays the sample quantiles to compare the default methods in SAS and R on this tiny data set. The default method in SAS always returns a data value or the average of adjacent data values; the default method in R can return any value in the range of the data.
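If you want to verify the two defaults by hand, the following short Python sketch (my own cross-check, not code from the post; the function names are invented) implements H&F type 2 (the SAS default, QNTLDEF=5) and type 7 (the R default):

```python
import math

def quantile_type2(x, p):
    """H&F type 2 (SAS default, QNTLDEF=5): inverse ECDF,
    averaging the two adjacent values when Np is an integer."""
    x = sorted(x)
    n = len(x)
    h = n * p
    j = math.floor(h)
    if j <= 0:
        return x[0]
    if j >= n:
        return x[-1]
    if h == j:                    # Np is an integer: average adjacent values
        return (x[j - 1] + x[j]) / 2
    return x[j]                   # otherwise the next-largest data value

def quantile_type7(x, p):
    """H&F type 7 (R default): linear interpolation with h = (n-1)p + 1."""
    x = sorted(x)
    n = len(x)
    h = (n - 1) * p
    j = math.floor(h)
    g = h - j
    if j + 1 >= n:
        return x[-1]
    return (1 - g) * x[j] + g * x[j + 1]

data = [0, 1, 1, 1, 2, 2, 2, 4, 5, 8]
print(quantile_type2(data, 0.9))             # -> 6.5 (average of 5 and 8)
print(round(quantile_type7(data, 0.9), 10))  # -> 5.3 (between 5 and 8)
```

For the 0.9 quantile of these data, the SAS default averages two data values (6.5), whereas the R default interpolates between them (5.3), which illustrates the discrete-versus-continuous distinction in the graph.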
As shown above, different software packages use different defaults for sample quantiles. Consequently, when you report quantiles for a small data set, it is important to report how the quantiles were computed.
However, in practice analysts don’t worry too much about which definition they are using because the difference between methods is typically small for larger data sets (100 or more observations). The biggest differences are often between the discrete methods, which always report a data value or the average between two adjacent data values, and the interpolation methods, which can return any value in the range of the data. Extreme quantiles can also differ between the methods because the tails of the data often have fewer observations and wider gaps.
The following graph shows the sample quantiles for 100 observations that were generated from a random uniform distribution. As before, the two sample quantile definitions are type=2 (the SAS default) and type=7 (the R default). At this scale, you can barely detect any differences between the estimates. The red dots (type=7) are on top of the corresponding blue dots (type=2), so few blue dots are visible.
So does the definition of the sample quantile matter? Yes and no. Theoretically, the different methods compute different estimates and have different properties. If you want to use an estimator that is unbiased or one that is based on distribution-free computations, feel free to read Hyndman and Fan and choose the definition that suits your needs. The differences are evident for tiny data sets. On the other hand, the previous graph shows that there is little difference between the methods for moderately sized samples and for quantiles that are not near gaps. In practice, most data analysts just accept the default method for whichever software they are using.
In closing, I will mention that there are other quantile estimation methods that are not simple formulas. In SAS, the QUANTREG procedure solves a minimization problem to estimate the quantiles. The QUANTREG procedure enables you to not only estimate quantiles, but also estimate confidence intervals, weighted quantiles, the difference between quantiles, conditional quantiles, and more.
Download the SAS program to compute the nine sample quantiles.
The post Sample quantiles: A comparison of 9 definitions appeared first on The DO Loop.
These days, more and more people move to where the work is. And for many people in Europe, that ‘where’ is Germany. I recently saw a map of Germany that showed which country had the most foreigners living in each area. It was an interesting map, but I thought I might […]
The post Foreigners in Germany – let’s map it! appeared first on SAS Learning Post.
In last week’s article about the Flint water crisis, I computed the 90th percentile of a small data set. Although I didn’t mention it, the value that I reported is different from the 90th percentile that is reported in Significance magazine.
That is not unusual. The data only had 20 unique values, and there are many different formulas that you can use to compute sample percentiles (generally called quantiles). Because different software packages use different default formulas for sample quantiles, it is not uncommon for researchers to report different quantiles for small data sets. This article discusses the five percentile definitions that are supported in SAS software.
You might wonder why there are multiple definitions. Recall that a sample quantile is an estimate of a population quantile. Statisticians have proposed many quantile estimators, some of which are based on the empirical cumulative distribution function (ECDF) of the sample, which approximates the cumulative distribution function (CDF) for the population.
The ECDF is a step function that has a jump discontinuity at each unique data value. Consequently, the inverse ECDF does not exist and the quantiles are not uniquely defined.
In SAS, you can use the PCTLDEF= option in PROC UNIVARIATE or the QNTLDEF= option in other procedures to control the method used to estimate quantiles. A sample quantile does not have to be an observed data value because you are trying to estimate an unknown population parameter.
For convenience, assume that the sample data are listed in sorted order.
In high school, you probably learned that if a sorted sample has an even number of observations, then the median value is the average of the middle observations. The default quantile definition in SAS (QNTLDEF=5) extends this familiar rule to other quantiles. Specifically, if the sample size is N and you ask for the q_th quantile, then when Nq is an integer, the quantile is the average of the data values x[Nq] and x[Nq+1]. However, when Nq is not an integer, the quantile is the next-largest data value x[j+1], where j = floor(Nq). For example, if N=10 and you want the q=0.17 quantile, then Nq=1.7, so j=1 and the 17th percentile is reported as the second ordered value, x[2].
Averaging and taking the next-largest value are not the only choices you can make. The other percentile definitions correspond to making different choices. For example, the empirical distribution function (QNTLDEF=3) always rounds Nq up to the next index but never averages, whereas QNTLDEF=2 rounds Nq to the nearest integer. Or you could use linear interpolation (QNTLDEF=1 and QNTLDEF=4) between the data values whose (sorted) indices are closest to Nq. In the example where N=10 and q=0.17, the QNTLDEF=1 interpolated quantile is 0.3 x[1] + 0.7 x[2].
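The five definitions are easy to prototype outside of SAS, too. The following Python sketch is my own illustration (the name `sas_quantile` is invented, and the formulas follow the descriptions above), applied to the tiny data set used in the rest of this post:

```python
import math

def sas_quantile(x, p, qntldef=5):
    """Sketch of the five QNTLDEF= definitions for 0 < p < 1.
    x must be sorted; formulas follow the descriptions above."""
    n = len(x)
    h = n * p                          # Nq
    j = math.floor(h)
    g = h - j                          # fractional part of Nq

    def val(i):                        # 1-based lookup, clamped to [1, n]
        return x[min(max(i, 1), n) - 1]

    if qntldef == 1:                   # weighted average at Nq
        return (1 - g) * val(j) + g * val(j + 1) if j >= 1 else val(1)
    if qntldef == 2:                   # closest observation (ties -> even index)
        i = (j if j % 2 == 0 else j + 1) if g == 0.5 else math.floor(h + 0.5)
        return val(i)
    if qntldef == 3:                   # empirical CDF ("round up")
        return val(j) if g == 0 else val(j + 1)
    if qntldef == 4:                   # weighted average at (N+1)q
        h = (n + 1) * p
        j = math.floor(h)
        g = h - j
        return (1 - g) * val(j) + g * val(j + 1) if j >= 1 else val(1)
    # qntldef == 5: empirical CDF with averaging (the SAS default)
    return (val(j) + val(j + 1)) / 2 if g == 0 else val(j + 1)

x = sorted([0, 1, 1, 1, 2, 2, 2, 4, 5, 8])
print([round(sas_quantile(x, 0.45, d), 10) for d in (1, 2, 3, 4, 5)])
# -> [1.5, 1, 2, 1.95, 2]
```

The five answers for the 0.45 quantile of these data differ across definitions, which is the point of the comparisons that follow.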
The SAS documentation contains the formulas used for the five percentile definitions, but sometimes a visual comparison is easier than slogging through mathematical equations.
The differences between the definitions are most apparent on small data sets that contain integer values, so let’s create a tiny data set and apply the five definitions to it.
The following example has 10 observations and six unique values.
data Q;
input x @@;
datalines;
0 1 1 1 2 2 2 4 5 8
;
You can use PROC UNIVARIATE or other methods to plot the empirical cumulative proportions, as shown. Because the ECDF is a step function, most cumulative proportion values (such as 0.45) are “in a gap.” By this I mean that there is no observation t in the data for which the cumulative proportion P(X ≤ t) equals 0.45. Depending on how you define the sample quantiles, the 0.45 quantile might be reported as 1, 1.5, 1.95, or 2.
Since the default definition is QNTLDEF=5, let’s visualize the sample quantiles for that definition. You can use the PCTLPTS= option on the OUTPUT statement in PROC UNIVARIATE to specify the percentiles that you want to compute. Equivalently, you can use the QNTL subroutine in PROC IML, as below. Either way, you can ask SAS to find the quantiles for a set of probabilities on a fine grid of points such as {0.001, 0.002, …, 0.998, 0.999}. You can then graph the probabilities against the quantiles to visualize how the percentile definition computes quantiles for the sample data.
proc iml;
use Q;  read all var "x";  close;        /* read data */
prob = T(1:999) / 1000;                  /* fine grid of prob values */
call qntl(quantile, x, prob, 5);         /* use method=5 definition */
create Pctls var {"quantile" "prob" "x"};  append;  close;
quit;

title "Sample Percentiles";
title2 "QNTLDEF = 5";
proc sgplot data=Pctls noautolegend;
   scatter x=quantile y=prob / markerattrs=(size=5 symbol=CircleFilled);
   fringe x / lineattrs=GraphData2(thickness=3);
   xaxis display=(nolabel) values=(0 to 8);
   yaxis offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportions";
   refline 0 1 / axis=y;
run;
For each probability value (Y axis), the graph shows the corresponding sample quantile (X axis) for the default definition in SAS, which is QNTLDEF=5. The X axis also displays red tick marks at the location of the data. You can use this graph to find any quantile. For example, to find the 0.45 quantile, you start at 0.45 on the Y axis, move to the right until you hit a blue marker, and then drop down to the X axis to discover that the 0.45 quantile estimate is 2.
If you prefer to think of the quantiles (the X values) as a function of the probabilities, just interchange the X= and Y= arguments in the SCATTER statement (or turn your head sideways!). Then the quantile function is a step function.
It is easy to put a loop around the SAS/IML computation to compute the sample quantiles for the five different definitions that are supported in SAS. The following SAS/IML program writes a data set that contains the sample quantiles. You can use the WHERE statement in PROC PRINT to compare the same quantile across the different definitions. For example, the following displays the 0.45 quantile (45th percentile) for the five definitions:
/* Compare all SAS methods */
proc iml;
use Q;  read all var "x";  close;        /* read data */
prob = T(1:999) / 1000;                  /* fine grid of prob values */
create Pctls var {"Qntldef" "quantile" "prob" "x"};
do def = 1 to 5;
   call qntl(quantile, x, prob, def);    /* qntldef=1,2,3,4,5 */
   Qntldef = j(nrow(prob), 1, def);      /* ID variable */
   append;
end;
close;
quit;

proc print data=Pctls noobs;
   where prob = 0.45;    /* compare 0.45 quantile for different definitions */
   var Qntldef quantile;
run;
You can see that the different definitions lead to different sample quantiles. How do the quantile functions compare? Let’s plot them and see:
ods graphics / antialiasmax=10000;
title "Sample Percentiles in SAS";
proc sgpanel data=Pctls noautolegend;
   panelby Qntldef / onepanel rows=2;
   scatter x=quantile y=prob / markerattrs=(size=3 symbol=CircleFilled);
   fringe x;
   rowaxis offsetmax=0 offsetmin=0.05 grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
   refline 0 1 / axis=y;
   colaxis display=(nolabel);
run;
The graphs (click to enlarge) show that QNTLDEF=1 and QNTLDEF=4 are piecewise-linear interpolation methods, whereas QNTLDEF=2, 3, and 5 are discrete rounding methods. The default method (QNTLDEF=5) is similar to QNTLDEF=2 except for certain averaged values. For the discrete definitions, SAS returns either a data value or the average of adjacent data values. The interpolation methods do not have that property: the methods will return quantile values that can be any value between observed data values.
If you have a small data set, as in this blog post, it is easy to see how the percentile definitions are different. For larger data sets (say, 100 or more unique values), the five quantile functions look quite similar.
The differences between definitions are most apparent when there are large gaps between adjacent data values. For example, the sample data has a large gap between the ninth and tenth observations, which have the values 5 and 8, respectively.
If you compute the 0.901 quantile, you will discover that the “nearest observation” method (QNTLDEF=2) gives 5 as the sample quantile, whereas the empirical CDF method (QNTLDEF=3), which rounds up, gives the value 8. Similarly, the interpolation method based on Nq (QNTLDEF=1) gives 5.03, whereas the interpolation method based on (N+1)q (QNTLDEF=4) gives 7.733.
In summary, this article shows how the (somewhat obscure) QNTLDEF= option results in different quantile estimates. Most people just accept the default definition (QNTLDEF=5), but if you are looking for a method that interpolates between data values, rather than a method that rounds and averages, I recommend QNTLDEF=1, which performs linear interpolations of the ECDF. The differences between the definitions are most apparent for small samples and when there are large gaps between adjacent data values.
For more information about sample quantiles, including a mathematical discussion of the various formulas, see Hyndman, R. J. and Fan, Y. (1996), “Sample Quantiles in Statistical Packages,” The American Statistician, 50, 361–365.
The post Quantile definitions in SAS appeared first on The DO Loop.
In recent years, solar panels have become much more economical, and therefore more popular. But because of the curvature of the Earth, the angle at which you need to install the panels varies, depending on where you live. In this example, I demonstrate how to visualize this kind of data […]
The post Using a peekaboo map to align your solar panels appeared first on SAS Learning Post.
The US unemployment rate was down to 4.4% in April, which is the lowest we’ve seen since before the big recession (about 10 years ago). But a single number seldom tells the whole story, so let’s look at unemployment data in several different ways, to get a more complete picture… […]
The post Several ways to look at US unemployment data appeared first on SAS Learning Post.
The April 2017 issue of Significance magazine features a cover story by Robert Langkjaer-Bain about the Flint (Michigan) water crisis. For those who don’t know, the Flint water crisis started in 2014 when the impoverished city began using the Flint River as a source of city water. The water was not properly treated, which led to unhealthy (even toxic) levels of lead and other contaminants in the city’s water supply. You can read an overview of the Flint Water Crisis on Wikipedia.
The crisis was compounded because someone excluded two data points before computing a quantile of a small data set. This seemingly small transgression had tragic ramifications. This blog post examines the Flint water quality data and shows why excluding those two points changed the way that the city responded. You can download the SAS program that analyzes these data.
The federal Lead and Copper Rule of 1991 specifies a statistical method for determining when the concentration of lead in a water supply is too high.
First, you sample from a number of “worst case” homes (such as those served by lead pipes), then compute the 90th percentile of the lead levels from those homes. If the 90th percentile exceeds 15 parts per billion (ppb), then the water is unsafe and action must be taken to correct the problem.
In spring 2015, this data collection and analysis was carried out in Flint by the Michigan Department of Environmental Quality (MDEQ), but according to the Significance article, the collection process was flawed. For example, the MDEQ was supposed to collect 100 measurements, but only 71 samples were obtained, and they were not from the “worst case” homes.
The 71 lead measurements that they collected are reproduced below, where I have used ‘0’ for “not detectable.” A call to PROC MEANS computes the 90th percentile (P90) of the complete sample:
/* values of lead concentration in Flint water samples.
   Use 0 for "not detectable" */
data FlintObs;
label Lead = "Lead Concentration (ppb)";
input lead @@;
Exclude = (Lead=20 | Lead=104);   /* MDEQ excluded these two large values */
datalines;
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
4 4 5 5 5 5 5 5 5 5 6 6 6 6 7 7 7
8 8 9 10 10 11 13 18 20 21 22 29 43 43 104
;

proc means data=FlintObs N p90;
   var lead;
run;
In this analysis of the full data, the 90th percentile of the sample is 18 ppb, which exceeds the federal limit of 15 ppb. Consequently, the Flint water fails the safety test, and the city must take action to improve the water.
But that is not what happened. Allegedly, “the MDEQ told the city’s water quality supervisors to remove two of the samples” (Langkjaer-Bain, p. 19) that were over 15 ppb. The MDEQ claimed that these data were improperly collected. The two data points that were excluded have the values 20 and 104. Because these values are both higher than the 90th percentile, excluding these observations lowered the 90th percentile of the modified sample, which has 69 observations. The following call to PROC MEANS computes the 90th percentile for the modified sample:
proc means data=FlintObs N p90;
   where Exclude=0;   /* exclude two observations */
   var lead;
run;
The second table shows that the 90th percentile of the modified sample is 13 ppb.
“The exclusion of these two samples nudged the 90th percentile reading…below the all-important limit of 15 ppb.” (Langkjaer-Bain, p. 20)
The modified conclusion is that the city does not need to undertake expensive corrective action to render the water safe.
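You can reproduce both numbers outside of SAS. The following Python sketch is my own cross-check (the name `p90_default` is invented); it applies the SAS default percentile definition (QNTLDEF=5, the empirical CDF with averaging) to the full and modified samples:

```python
import math

def p90_default(x):
    """90th percentile using the SAS default definition (QNTLDEF=5)."""
    x = sorted(x)
    n = len(x)
    h = 0.9 * n                       # Nq for q = 0.9
    j = math.floor(h)
    if h == j:                        # integer: average two adjacent values
        return (x[j - 1] + x[j]) / 2
    return x[j]                       # otherwise take the next-largest value

# the 71 Flint measurements, in run-length form to match the datalines above
flint = ([0] * 13 + [1] * 4 + [2] * 11 + [3] * 11 + [4] * 2 + [5] * 8
         + [6] * 4 + [7] * 3
         + [8, 8, 9, 10, 10, 11, 13, 18, 20, 21, 22, 29, 43, 43, 104])
assert len(flint) == 71

print(p90_default(flint))                                     # -> 18
print(p90_default([v for v in flint if v not in (20, 104)]))  # -> 13
```

Dropping the two disputed values moves the 90th percentile from 18 ppb (above the limit) to 13 ppb (below it), exactly as the two PROC MEANS tables show.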
The following histogram (click to enlarge) is similar to Figure 1 in Langkjaer-Bain’s article. The red bars indicate observations that the MDEQ excluded from the analysis. The graph clearly shows that the distribution of lead values has a long tail. A broken axis is used to indicate that the distance to the 104 ppb reading has been shortened to reduce horizontal space.
The huge gap near 15 ppb indicates a lack of data near that important value. Therefore the quantiles near that point will be extremely sensitive to deleting extreme values. To me, the graph indicates that more data should be collected so that policy makers can be more confident in their conclusions.
Clearly the point estimate for the 90th percentile depends on whether or not those two measurements are excluded. But can statistics provide any additional insight into the 90th percentile of lead levels in the Flint water supply?
When someone reports a statistic for a small sample, I like to ask “how confident are you in that statistic?” A standard error is one way to express the accuracy of a point estimate; a 95% confidence interval (CI) is another. The width of a confidence interval gives us information about the accuracy of the point estimate. As you might expect, the standard error for an extreme quantile (such as 0.9) is typically much bigger than for a quantile near 0.5, especially when there isn’t much data near the quantile.
Let’s use SAS procedures to construct a 95% CI for the 90th percentile. PROC UNIVARIATE supports the CIPCTLDF option, which produces distribution-free confidence intervals. I’ll give the Flint officials the benefit of the doubt and compute a confidence interval for the modified data that excluded the two “outliers”:
proc univariate data=FlintObs CIPctlDF;
   where Exclude=0;   /* exclude two observations */
   var Lead;
   ods select quantiles;
run;
The 95% confidence interval for P90 is [8, 43], which is very wide and includes the critical value 15 ppb in its interior. If someone asks, “how confident are you that the 90th percentile does not exceed 15 ppb,” you should respond, “based on these data, I am not confident at all.”
As I’ve written before, you can also use the QUANTREG procedure in SAS to provide an alternative method to compute confidence intervals for percentiles. Furthermore, the QUANTREG procedure supports the ESTIMATE statement, which you can use to perform a one-sided test for the hypothesis “does the 90th percentile exceed 15 ppb?” The following call to PROC QUANTREG performs this analysis and uses 10,000 bootstrap resamples to estimate the confidence interval:
proc quantreg data=FlintObs CI=resampling(nrep=10000);
   where Exclude=0;   /* exclude two observations */
   model lead = / quantile=0.9 seed=12345;
   estimate 'P90 > 15' Intercept 1 / upper CL testvalue=15;
   ods select ParameterEstimates Estimates;
run;
The standard error for the 90th percentile is about 5.3. Based on bootstrap resampling methods, the 95% CI for P90 is approximately [2.4, 23.6]. (Other methods for estimating the CI give similar or wider intervals.) The ESTIMATE statement is used to test the null hypothesis that P90 is greater than 15. The p-value is large, which means that even if you delete the two large lead measurements, the data do not provide evidence to reject the null hypothesis.
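If you don’t have SAS at hand, the flavor of this computation is easy to reproduce with a naive bootstrap. The sketch below is my own illustration, not the QUANTREG algorithm; the seed and the simple percentile interval are arbitrary choices. It resamples the 69 retained measurements and collects the bootstrap distribution of P90:

```python
import math
import random

def p90(x):
    """90th percentile, SAS default definition (QNTLDEF=5)."""
    x = sorted(x)
    h = 0.9 * len(x)
    j = math.floor(h)
    return (x[j - 1] + x[j]) / 2 if h == j else x[j]

# the 69 measurements that the MDEQ retained (20 and 104 excluded)
retained = ([0] * 13 + [1] * 4 + [2] * 11 + [3] * 11 + [4] * 2 + [5] * 8
            + [6] * 4 + [7] * 3
            + [8, 8, 9, 10, 10, 11, 13, 18, 21, 22, 29, 43, 43])

random.seed(12345)                        # arbitrary seed for reproducibility
boots = sorted(p90(random.choices(retained, k=len(retained)))
               for _ in range(10000))
lower, upper = boots[249], boots[9749]    # crude 95% percentile interval
print(lower, upper)
assert lower < 15 < upper                 # 15 ppb lies inside the interval
```

However you compute it, the interval straddles 15 ppb, which is the point of the hypothesis test above.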
There is not enough evidence to reject the hypothesis that P90 is greater than the legal limit of 15 ppb. Two different 95% confidence intervals for P90 include 15 ppb in their interiors.
In fact, the confidence intervals include 15 ppb whether you use all 71 observations or just the 69 observations that MDEQ used.
So you can argue about whether the MDEQ should have excluded the controversial measurements, but the hypothesis test gives the same conclusion regardless. By using these data, you cannot rule out the possibility that the 90th percentile of the Flint water supply is greater than 15 ppb.
What do you have to say? Share your comments.
The post Quantiles and the Flint water crisis appeared first on The DO Loop.
This post was kindly contributed by SAS – r4stats.com - go there to comment and to read the full post.
What tools do we use most for data science, machine learning, or analytics? Python, R, SAS, KNIME, RapidMiner,…? How do we use them? We are about to find out as the two most popular surveys on data science tools have both just gone live. Please chip in and help us all get a better understanding of the tools of our trade.
For 18 consecutive years, Gregory Piatetsky has been asking people what software they have actually used in the past twelve months on the KDnuggets Poll. Since this poll contains just one question, it’s very quick to take and you’ll get the latest results immediately. You can take the KDnuggets poll here.
Every other year since 2007 Rexer Analytics has surveyed data science professionals, students, and academics regarding the software they use. It is a more detailed survey which also asks about goals, algorithms, challenges, and a variety of other factors. You can take the Rexer Analytics survey here (use Access Code M7UY4). Summary reports from the seven previous Rexer surveys are FREE and can be downloaded from their Data Science Survey page.
As always, as soon as the results from either survey are available, I’ll post them on this blog, then update the main results in The Popularity of Data Science Software, and finally send out an announcement on Twitter (follow me as @BobMuenchen).
I guess a coding dinosaur is someone who uses an old/legacy computer language, or at least a language that isn’t en vogue these days. Coding dinosaurs are still around (and probably will be for a while), whereas the real dinosaurs that lived millions of years ago are extinct. What caused […]
The post What happened to the dinosaurs? (not the coding dinosaurs – the real ones!) appeared first on SAS Learning Post.