The post Parameter estimates for different parameterizations appeared first on The DO Loop.
A canonical example of a probability distribution that has multiple parameterizations is the gamma distribution. You can parameterize the gamma distribution in terms of a scale parameter or in terms of a rate parameter, where the rate parameter is the reciprocal of the scale parameter. The density functions for the gamma distribution with scale parameter β or rate parameter c = 1 / β are as follows:
Shape α, scale β: f(x; α, β) = 1 / (β^{α} Γ(α)) x^{α-1} exp(-x/β)
Shape α, rate c: f(x; α, c) = c^{α} / Γ(α) x^{α-1} exp(-c x)
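Because the two parameterizations describe the same distribution, the two density formulas must agree whenever c = 1/β. The following Python sketch (an illustrative analogue, not part of the SAS program) evaluates both formulas at the same point and confirms they match:

```python
from math import gamma, exp

def gamma_pdf_scale(x, alpha, beta):
    """Gamma density parameterized by shape alpha and scale beta."""
    return x**(alpha - 1) * exp(-x / beta) / (beta**alpha * gamma(alpha))

def gamma_pdf_rate(x, alpha, c):
    """Gamma density parameterized by shape alpha and rate c = 1/beta."""
    return c**alpha * x**(alpha - 1) * exp(-c * x) / gamma(alpha)

alpha, beta = 2.5, 0.1      # the parameter values used in the simulation below
x = 0.25
p1 = gamma_pdf_scale(x, alpha, beta)
p2 = gamma_pdf_rate(x, alpha, 1 / beta)
print(abs(p1 - p2) < 1e-12)  # the two parameterizations agree
```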
You can use PROC NLMIXED to fit either set of parameters and use the ESTIMATE statement to obtain estimates for the other set. In general, you can do this whenever you can explicitly write down an analytical formula that expresses one set of parameters in terms of another set.
Let's implement this idea on some simulated data. The following SAS DATA step simulates 100 observations from a gamma distribution with shape parameter α = 2.5 and scale parameter β = 1 / 10. A call to PROC UNIVARIATE estimates the parameters from the data and overlays a gamma density on the histogram of the data:
/* The gamma model has two different parameterizations: a rate parameter
   and a scale parameter. Generate gamma-distributed data and fit a model
   for each parameterization. Use the ESTIMATE stmt to estimate the
   parameter in the model that was not fit. */
%let N = 100;                /* N = sample size */
data Gamma;
aParm = 2.5;  scaleParm = 1/10;  rateParm = 1/scaleParm;
do i = 1 to &N;
   x = rand("gamma", aParm, scaleParm);
   output;
end;
run;

proc univariate data=Gamma;
   var x;
   histogram x / gamma;
run;
The parameter estimates are (2.99, 0.09), which are close to the parameter values that were used to generate the data. Notice that PROC UNIVARIATE does not provide standard errors or confidence intervals for these point estimates. However, you can get those statistics from PROC NLMIXED. PROC NLMIXED also supports the ESTIMATE statement, which can estimate the rate parameter:
/* Use PROC NLMIXED to fit the shape and scale parameters in a gamma model */
title "Fit Gamma Model, SCALE Parameter";
proc nlmixed data=Gamma;
   parms a 1 scale 1;        * initial value for parameters;
   bounds 0 < a scale;
   model x ~ gamma(a, scale);
   rate = 1/scale;
   estimate 'rate' rate;
   ods output ParameterEstimates=PE;
   ods select ParameterEstimates ConvergenceStatus AdditionalEstimates;
run;
The parameter estimates are the same as from PROC UNIVARIATE. However, PROC NLMIXED also provides standard errors and confidence intervals for the parameters. More importantly, the ESTIMATE statement enables you to obtain an estimate of the rate parameter without needing to refit the model. The next section explains how the rate parameter is estimated.
You might wonder how the ESTIMATE statement computes the estimate for the rate parameter from the estimate of the scale parameter. The PROC NLMIXED documentation states that the procedure "computes approximate standard errors for the estimates by using the delta method (Billingsley 1986)." I have not studied the delta method in depth, but it appears to be a way to approximate a nonlinear transformation by using the first two terms of its Taylor series, otherwise known as a linearization.
In the ESTIMATE statement, you supply a transformation, which is c = 1 / β for the gamma distribution. The derivative of that transformation is -1 / β^{2}. The estimate for β is 0.09. Plugging that estimate into the transformation gives 1/0.09 as the point estimate for c and 1/0.09^{2} as a linear multiplier for the size of the standard error. Geometrically, the linearized transformation scales the standard error for the scale parameter into the standard error for the rate parameter. The following SAS/IML statements read in the parameter estimates for β. The program then uses the transformation to produce a point estimate for the rate parameter, c, and uses the linearization of the transformation to estimate standard errors and confidence intervals for c.
proc iml;
start Xform(p);
   return( 1 / p );          /* estimate rate from scale parameter */
finish;
start JacXform(p);
   return abs( -1 / p##2 );  /* Jacobian of transformation */
finish;

use PE where (Parameter='scale');
read all var {Estimate StandardError Lower Upper};
close;

xformEst = Xform(Estimate);
xformStdErr = StandardError * JacXform(Estimate);
CI = Lower || Upper;
xformCI = CI * JacXform(Estimate);
AdditionalParamEst = xformEst || xformStdErr || xformCI;
print AdditionalParamEst[c={"Est" "StdErr" "LowerCL" "UpperCL"} F=D8.4];
quit;
The estimates and other statistics are identical to those produced by the ESTIMATE statement in PROC NLMIXED. For a general discussion of changing variables in probability and statistics, see these Penn State course notes. For a discussion about two different parameterizations of the lognormal distribution, see the blog post "Geometry, sensitivity, and parameters of the lognormal distribution."
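The delta-method arithmetic itself takes only a few lines in any language. The following Python sketch mirrors the SAS/IML computation; the standard error and confidence limits below are hypothetical values chosen for illustration (only the point estimate 0.09 comes from the example):

```python
def delta_method_rate(est, stderr, lower, upper):
    """Delta-method transform of a scale-parameter estimate into a
    rate-parameter estimate: c = 1/beta, with Jacobian |dc/dbeta| = 1/beta**2."""
    jac = 1.0 / est**2
    return 1.0 / est, stderr * jac, lower * jac, upper * jac

# hypothetical (estimate, SE, lower CL, upper CL) for the scale parameter beta
est, se, lo, hi = 0.09, 0.009, 0.072, 0.108
rate, rate_se, rate_lo, rate_hi = delta_method_rate(est, se, lo, hi)
print(round(rate, 3))   # the point estimate 1/0.09 = 11.111
```

Note that multiplying the CI endpoints by the Jacobian reproduces the SAS/IML statement `xformCI = CI * JacXform(Estimate)`.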
In case you are wondering, you obtain exactly the same parameter estimates and standard errors if you fit the rate parameter directly. That is, the following call to PROC NLMIXED produces the same parameter estimates for c.
/* Use PROC NLMIXED to fit the gamma model using the rate parameter */
title "Fit Gamma Model, RATE Parameter";
proc nlmixed data=Gamma;
   parms a 1 rate 1;         * initial value for parameters;
   bounds 0 < a rate;
   model x ~ gamma(a, 1/rate);
   ods select ParameterEstimates;
run;
For this example, fitting the scale parameter is neither more nor less difficult than fitting the rate parameter. If I were faced with a distribution that has two different parameterizations, one simple and one more complicated, I would try to fit the simple parameters to data and use this technique to estimate the more complicated parameters.
The post Get the unique values of a variable in data order appeared first on The DO Loop.
/* Three ways to get unique values of a variable in sorted order */
proc freq data=sashelp.cars;
   tables Type / missing;    /* MISSING option treats missing values as a valid category */
run;

proc sql;
   select distinct(Type) as uType from sashelp.cars;
quit;

proc iml;
use sashelp.cars;
read all var "Type";
close;
uType = unique(Type);
print uType;
The FREQ procedure has a nice option that I use FREQuently: you can use the ORDER=DATA option to obtain the unique categories in the order in which they appear in a data set. Over the years, I've used the ORDER=DATA option for many purposes, including a recent post that shows how to create a "Top 10" chart. Another useful option is ORDER=FREQ, which orders the categories in descending order by frequency.
Recently a colleague asked how to obtain the unique values of a SAS/IML vector in the order in which they appear in the vector. I thought it might also be useful to support an option to return the unique values in (descending) order of frequency, so I wrote the following function. The first argument to the UniqueOrder function is the data vector. The second (optional) argument is a string that determines the order of the unique values. If the second argument has the value "DATA", then the unique values are returned in data order. If the second argument has the value "FREQ", then the unique values are returned in descending order by frequency. The function is shown below. I test the function on a character vector (TYPE) and a numeric vector (CYLINDERS) that contains some missing values.
proc iml;
/* Return the unique values in a row vector. The second parameter
   determines the sort order of the result.
   "INTERNAL": Order by alphanumeric values (default)
   "DATA"    : Order in which elements appear in vector
   "FREQ"    : Order by frequency */
start UniqueOrder(x, order="INTERNAL");
   if upcase(order)="DATA" then do;
      u = unique(x);              /* unique values (sorted) */
      idx = j(ncol(u),1,0);       /* vector to store indices in data order */
      do i = 1 to ncol(u);
         idx[i] = loc(x=u[i])[1]; /* 1st location of i_th unique value */
      end;
      call sort(idx);             /* sort the indices ==> data order */
      return ( T(x[idx]) );       /* put the values in data order */
   end;
   else if upcase(order)="FREQ" then do;
      call tabulate(u, freq, x, "missing"); /* compute freqs; missing is a valid level */
      call sortndx(idx, freq`, 1, 1);       /* order descending by freq */
      return ( T(u[idx]) );
   end;
   else return unique(x);
finish;

/* test the function by calling it on a character and a numeric vector */
use sashelp.cars;
read all var "Type";  read all var "Cylinders";
close;

uData = UniqueOrder(Type, "data"); /* unique values in the order they appear in data */
uFreq = UniqueOrder(Type, "freq"); /* unique values in descending frequency order */
print uData, uFreq;

uData = UniqueOrder(Cylinders, "data");
uFreq = UniqueOrder(Cylinders, "freq");
print uData, uFreq;
The results of the function are similar to the results of PROC FREQ when you use the ORDER= option. There is a small difference when the data contain missing values. PROC FREQ always lists the missing value first, regardless of where the missing values appear in the data or how many missing values there are. In contrast, the SAS/IML function handles missing and nonmissing values equivalently.
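For comparison, the three orderings have compact analogues in other languages. This Python sketch (not part of the SAS example) returns unique values in sorted order, data order, or descending frequency order:

```python
from collections import Counter

def unique_order(values, order="INTERNAL"):
    """Unique values of a sequence in sorted order (default),
    data order, or descending order of frequency."""
    if order.upper() == "DATA":
        return list(dict.fromkeys(values))   # dicts preserve insertion order
    if order.upper() == "FREQ":
        return [v for v, _ in Counter(values).most_common()]
    return sorted(set(values))

x = ["Wagon", "SUV", "Sedan", "SUV", "Sedan", "SUV"]
print(unique_order(x, "DATA"))   # ['Wagon', 'SUV', 'Sedan']
print(unique_order(x, "FREQ"))   # ['SUV', 'Sedan', 'Wagon'] (counts 3, 2, 1)
```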
In summary, whether you are implementing an analysis in Base SAS or SAS/IML, this article shows how to obtain the unique values of data in sorted order, in order of frequency, or in the order that the values appear in the data.
The post Fit a growth curve in SAS appeared first on The DO Loop.
The SAS DATA step specifies the mean height (in centimeters) of 58 sunflowers at 7, 14, ..., 84 days after planting. The American naturalist H. S. Reed studied these sunflowers in 1919 and used the mean height values to formulate a model for growth. Unfortunately, I do not have access to the original data for the 58 plants, but using the mean values will demonstrate the main ideas of fitting a growth model to data.
/* Mean heights of 58 sunflowers:
   Reed, H. S. and Holland, R. H. (1919), "Growth of sunflower seeds,"
   Proceedings of the National Academy of Sciences, volume 5, p. 140.
   http://www.pnas.org/content/pnas/5/4/135.full.pdf */
data Sunflower;
input Time Height;
label Time = "Time (days)"  Height = "Sunflower Height (cm)";
datalines;
7  17.93
14 36.36
21 67.76
28 98.1
35 131
42 169.5
49 205.5
56 228.3
63 247.1
70 250.5
77 253.8
84 254.5
;
A simple mathematical model for population growth that is constrained by resources is the logistic growth model, which is also known as the Verhulst growth model. (This should not be confused with logistic regression, which predicts the probability of a binary event.) The Verhulst equation can be parameterized in several ways, but a popular parameterization is
Y(t) = K / (1 + exp(-r*(t – b)))
where
- K is the carrying capacity, which is the limiting value of Y(t) as t → ∞,
- r is the maximum growth rate, and
- b is the time at which the growth rate is greatest (the inflection point of the curve, where Y = K/2).
The model has three parameters, K, r, and b. When you use PROC NLIN in SAS to fit a model, you need to specify the parametric form of the model and provide an initial guess for the parameters, as shown below:
proc nlin data=Sunflower list noitprint;
   parms K 250  r 1  b 40;                      /* initial guess */
   model Height = K / (1 + exp(-r*(Time - b))); /* model to fit; Height and Time are variables in data */
   output out=ModelOut predicted=Pred lclm=Lower95 uclm=Upper95;
   estimate 'Dt' log(81) / r;                   /* optional: estimate function of parameters */
run;

title "Logistic Growth Curve Model of a Sunflower";
proc sgplot data=ModelOut noautolegend;
   band x=Time lower=Lower95 upper=Upper95;     /* confidence band for mean */
   scatter x=Time y=Height;                     /* raw observations */
   series x=Time y=Pred;                        /* fitted model curve */
   inset ('K' = '261'  'r' = '0.088'  'b' = '34.3') / border opaque; /* parameter estimates */
   xaxis grid;  yaxis grid;
run;
The output from PROC NLIN includes an ANOVA table (not shown), a parameter estimates table (shown below), and an estimate for the correlation of the parameters. The parameter estimates include estimates, standard errors, and 95% confidence intervals for the parameters. The OUTPUT statement creates a SAS data set that contains the original data, the predicted values from the model, and a confidence interval for the predicted mean. The output data is used to create a "fit plot" that shows the model and the original data.
Articles about the Verhulst model often mention that the "maximum growth rate" parameter, r, is sometimes replaced by a parameter that specifies the time required for the population to grow from 10% to 90% of the carrying capacity, K. This time period is called the "characteristic duration" and is denoted Δt. You can show that Δt = log(81)/r. The ESTIMATE statement in PROC NLIN produces a table (shown below) that estimates the characteristic duration.
The value 50.1 tells you that, on average, it takes about 50 days for a sunflower to grow from 10 percent of its maximum height to 90 percent of its maximum height. By looking at the graph, you can see that most growth occurs during the 50 days between Day 10 and Day 60.
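You can verify the characteristic-duration formula numerically: in the logistic model, the curve reaches 10% of K at t = b − ln(9)/r and 90% of K at t = b + ln(9)/r, so the elapsed time is 2·ln(9)/r = ln(81)/r. A quick Python check (an illustrative sketch, using the rounded parameter estimates) confirms this:

```python
from math import log, exp

def logistic(t, K, r, b):
    """Verhulst logistic growth curve."""
    return K / (1 + exp(-r * (t - b)))

K, r, b = 261, 0.088, 34.3   # rounded parameter estimates from PROC NLIN
t10 = b - log(9) / r         # time at which Y = 0.1*K
t90 = b + log(9) / r         # time at which Y = 0.9*K
dt = log(81) / r             # characteristic duration

print(round(dt, 1))          # about 50 days, matching the ESTIMATE statement
```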
This use of the ESTIMATE statement can be very useful. Some models have more than one popular parameterization. You can often fit the model for one parameterization and use the ESTIMATE statement to estimate the parameters for a different parameterization.
The post The intersection of multiple sets appeared first on The DO Loop.
The idea for this topic came from reading a blog post by 'mikefc' at the "Cool but Useless" blog about the relative performance of various ways to intersect vectors in R. Mikefc used vectors that contained between 10,000 and 100,000 elements and whose intersection was only a few dozen elements. In this article, I increase the sizes of the vectors by an order of magnitude and time the performance of the intersection function in SAS/IML. I also ensure that the sets share a large set of common elements so that the intersection is sizeable.
In general, finding the intersection between sets is a fast operation. In languages such as R, MATLAB, and SAS/IML, you can store the sets in vectors and use built-in functions to form the intersection. Because the intersection of two sets is always smaller than (or equal to) the size of the original sets, computing the "intersection of intersections" is usually faster than computing the intersection of the original (larger) sets.
The following SAS/IML program creates eight SAS/IML vectors that have between 200,000 and 550,000 elements. The elements in the vectors are positive integers, although they could also be character strings or non-integers. The vectors contain a large subset of common elements so that the intersection is sizeable. The vectors A, B, ..., H are created as follows:
proc iml;
call randseed(123);
NIntersect = 5e4;
common = sample(1:NIntersect, NIntersect, "replace"); /* elements common to all sets */

source = 1:1e6;   /* elements are positive integers less than 1 million */
BaseN = 5e5;      /* approx magnitude of vectors (sets) */
A = sample(source,     BaseN, "replace") || common;   /* include common elements */
B = sample(source, 0.9*BaseN, "replace") || common;
C = sample(source, 0.8*BaseN, "replace") || common;
D = sample(source, 0.7*BaseN, "replace") || common;
E = sample(source, 0.6*BaseN, "replace") || common;
F = sample(source, 0.5*BaseN, "replace") || common;
G = sample(source, 0.4*BaseN, "replace") || common;
H = sample(source, 0.3*BaseN, "replace") || common;
The intersection of these vectors contains about 31,600 elements. You can find the intersection of these vectors in at least three ways:
1. Make a single call to the built-in XSECT function, passing all eight sets as arguments.
2. Loop over the sets and form pairwise intersections: intersect the first two sets, then intersect the result with the third set, and so on.
3. Sort the sets by size, then form the pairwise intersections from smallest to largest so that the intermediate results stay small.
The following SAS/IML statements implement these three methods and compute the time required to intersect each:
/* 1. one call to the built-in method */
t0 = time();
w = xsect(a,b,c,d,e,f,g,h);
tBuiltin = time() - t0;

/* 2. loop over variable names and form pairwise intersections */
varName = "a":"h";
t0 = time();
w = value(varName[1]);
do i = 2 to ncol(varName);
   w = xsect(w, value(varName[i]));  /* compute pairwise intersection */
end;
tPairwise = time() - t0;

/* 3. sort by size of sets, then loop */
varName = "a":"h";
t0 = time();
len = j(ncol(varName), 1);
do i = 1 to ncol(varName);
   len[i] = ncol(value(varName[i])); /* number of elements in each set */
end;
call sortndx(idx, len);              /* sort smallest to largest */
sortName = varName[,idx];
w = value(sortName[1]);
do i = 2 to ncol(sortName);
   w = xsect(w, value(sortName[i])); /* compute pairwise intersection */
end;
tSort = time() - t0;

print tBuiltin tPairwise tSort;
For this example data, each method takes about 0.5 seconds to find the intersection of eight sets that contain a total of 3,000,000 elements. If you rerun the analysis, the times will vary by a few hundredths of a second, so the times are essentially equal. ('mikefc' also discusses a few other methods, including a "tally method." The tally method is fast, but it relies on the elements being positive integers.)
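If you want to experiment with the "sort by size, then intersect pairwise" strategy outside of SAS, the idea translates directly into Python, where the built-in set type provides the intersection operator. This is a toy-sized sketch, not a benchmark:

```python
from functools import reduce

def intersect_all(sets):
    """Pairwise intersection of a list of sets, smallest first.
    Sorting by size keeps the intermediate results small."""
    ordered = sorted(sets, key=len)   # smallest set first
    return reduce(lambda acc, s: acc & s, ordered[1:], set(ordered[0]))

A = set(range(0, 100, 2))   # even numbers < 100
B = set(range(0, 100, 3))   # multiples of 3
C = set(range(0, 100, 5))   # multiples of 5
print(sorted(intersect_all([A, B, C])))   # multiples of 30: [0, 30, 60, 90]
```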
There is a quote that I like: "In theory, there is no difference between theory and practice. But, in practice, there is." (Published by Walter J. Savitch, who claims to have overheard it at a computer science conference.) For the intersection problem, you can argue theoretically that pre-sorting by length is an inexpensive way to potentially obtain a boost in performance. However, in practice (for this example data), pre-sorting improves the performance by less than 1%. Furthermore, the absolute times for this operation are short, so saving 1% of 0.5 seconds might not be worth the effort.
Even if you never compute the intersection of sets in SAS, there are two take-aways from this exercise: built-in set operations are fast even for large sets, and theoretical optimizations (such as pre-sorting by size) do not always produce a measurable improvement in practice.
The post Fast simulation of multivariate normal data with an AR(1) correlation structure appeared first on The DO Loop.
The AR(1) correlation structure is used in statistics to model observations that have correlated errors. (For example, see the documentation of PROC MIXED in SAS.) If Σ is an AR(1) correlation matrix, then its elements are constant along diagonals. The (i,j)th element of an AR(1) correlation matrix has the form Σ[i,j] = ρ^{|i – j|}, where ρ is a constant that determines the geometric rate at which correlations decay between successive time intervals. The exponent for each term is the L_{1} distance between the row number and the column number. As I have shown in a previous article, you can use the DISTANCE function in SAS/IML 14.3 to quickly evaluate functions that depend on the distance between two sets of points. Consequently, the following SAS/IML function computes the correlation matrix for a p x p AR(1) matrix:
proc iml;
/* return p x p matrix whose (i,j)th element is rho^|i - j| */
start AR1Corr(rho, p);
   return rho##distance(T(1:p), T(1:p), "L1");
finish;

/* test on 10 x 10 matrix with rho = 0.8 */
rho = 0.8;  p = 10;
Sigma = AR1Corr(rho, p);
print Sigma[format=Best7.];
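The construction does not depend on SAS/IML: the defining formula Σ[i,j] = ρ^{|i − j|} translates directly into a few lines of Python (shown here only to illustrate the banded structure):

```python
def ar1_corr(rho, p):
    """p x p AR(1) correlation matrix: Sigma[i][j] = rho**|i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]

Sigma = ar1_corr(0.8, 4)
for row in Sigma:
    print([round(v, 3) for v in row])
# Each diagonal is constant: 1 on the main diagonal, then 0.8, 0.64, 0.512.
```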
Every covariance matrix has a Cholesky decomposition, which represents the matrix as the crossproduct of a triangular matrix with itself: Σ = R^{T}R, where R is upper triangular. In SAS/IML, you can use the ROOT function to compute the Cholesky root of an arbitrary positive definite matrix. However, the AR(1) correlation matrix has an explicit formula for the Cholesky root in terms of ρ. This explicit formula does not appear to be well known by statisticians, but it is a special case of a general formula developed by V. Madar (Section 5.1, 2016), who presented a poster at a Southeast SAS Users Group (SESUG) meeting a few years ago. An explicit formula means that you can compute the Cholesky root matrix, R, in a direct and efficient manner, as follows:
/* direct computation of the Cholesky root of an AR(1) matrix */
start AR1Root(rho, p);
   R = j(p,p,0);           /* allocate p x p matrix */
   R[1,] = rho##(0:p-1);   /* formula for 1st row */
   c = sqrt(1 - rho**2);   /* scaling factor: c^2 + rho^2 = 1 */
   R2 = c * R[1,];         /* formula for 2nd row */
   do j = 2 to p;          /* shift elements in 2nd row for remaining rows */
      R[j, j:p] = R2[,1:p-j+1];
   end;
   return R;
finish;

R = AR1Root(rho, p);   /* compare to R = root(Sigma), which requires forming Sigma */
print R[L="Cholesky Root" format=Best7.];
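You can convince yourself that the explicit formula really produces the Cholesky root by forming RᵀR and comparing it with the AR(1) matrix. The following Python sketch mirrors the row-shifting construction of the SAS/IML function:

```python
from math import sqrt

def ar1_root(rho, p):
    """Upper-triangular Cholesky root of the p x p AR(1) matrix:
    first row is 1, rho, rho^2, ...; each later row is a shifted copy
    of the first row scaled by c = sqrt(1 - rho**2)."""
    c = sqrt(1 - rho**2)
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        R[0][j] = rho ** j
    for i in range(1, p):
        for j in range(i, p):
            R[i][j] = c * rho ** (j - i)
    return R

rho, p = 0.8, 5
R = ar1_root(rho, p)
# verify that R^T R reproduces the AR(1) correlation matrix
RtR = [[sum(R[k][i] * R[k][j] for k in range(p)) for j in range(p)]
       for i in range(p)]
ok = all(abs(RtR[i][j] - rho ** abs(i - j)) < 1e-12
         for i in range(p) for j in range(p))
print(ok)   # True
```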
You can compute an AR(1) covariance matrix from the correlation matrix by multiplying the correlation matrix by a positive scalar, σ^{2}.
An efficient way to simulate data from a multivariate normal population with covariance Σ is to use the Cholesky decomposition to induce correlation among a set of uncorrelated normal variates. This is the technique used by the RandNormal function in SAS/IML software. Internally, the RandNormal function calls the ROOT function, which can compute the Cholesky root of an arbitrary positive definite matrix.
When there are thousands of variables, the Cholesky decomposition might take a second or more. If you call the RandNormal function thousands of times during a simulation study, you pay that one-second penalty during each call. For the AR(1) covariance structure, you can use the explicit formula for the Cholesky root to save a considerable amount of time. You also do not need to explicitly form the p x p matrix, Σ, which saves RAM. The following SAS/IML function is an efficient way to simulate N observations from a p-dimensional multivariate normal distribution that has an AR(1) correlation structure with parameter ρ:
/* simulate multivariate normal data from a population with AR(1) correlation */
start RandNormalAR1( N, Mean, rho );
   mMean = rowvec(Mean);
   p = ncol(mMean);
   U = AR1Root(rho, p);   /* use explicit formula instead of ROOT(Sigma) */
   Z = j(N,p);
   call randgen(Z, 'NORMAL');
   return (mMean + Z*U);
finish;

call randseed(12345);
p = 1000;                 /* big matrix */
mean = j(1, p, 0);        /* mean of MVN distribution */

/* simulate data from MVN distribs with different values of rho */
v = do(0.01, 0.99, 0.01); /* choose rho from list 0.01, 0.02, ..., 0.99 */
t0 = time();              /* time it! */
do i = 1 to ncol(v);
   rho = v[i];
   X = randnormalAR1(500, mean, rho); /* simulate 500 obs from MVN with p vars */
end;
t_SimMVN = time() - t0;   /* total time to simulate all data */
print t_SimMVN;
The previous loop generates samples that contain N=500 observations and p=1000 variables. Each sample is from a multivariate normal distribution that has an AR(1) correlation, but each sample is generated for a different value of ρ, where ρ = 0.01, 0.02, ..., 0.99. On my desktop computer, this simulation of 99 correlated samples takes about 4 seconds. This is about 25% of the time for the same simulation that explicitly forms the AR(1) correlation matrix and calls RandNormal.
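The simulation technique itself (multiply iid standard normal variates by the Cholesky root) is language-agnostic. Here is a small pure-Python sketch that uses the explicit AR(1) root and checks that the sample correlation of adjacent variables is close to ρ; the sizes are deliberately tiny and this is not a benchmark:

```python
import random
from math import sqrt

def ar1_root_rows(rho, p):
    """Rows of the upper-triangular Cholesky root of a p x p AR(1) matrix."""
    c = sqrt(1 - rho**2)
    rows = [[rho ** j for j in range(p)]]               # first row
    for i in range(1, p):                               # shifted, scaled copies
        rows.append([0.0] * i + [c * rho ** (j - i) for j in range(i, p)])
    return rows

def rand_normal_ar1(n, p, rho, rng):
    """Simulate n draws from N(0, Sigma), Sigma with AR(1) structure,
    by multiplying iid normals by the explicit Cholesky root."""
    U = ar1_root_rows(rho, p)
    return [[sum(z[k] * U[k][j] for k in range(p)) for j in range(p)]
            for z in ([rng.gauss(0, 1) for _ in range(p)] for _ in range(n))]

rng = random.Random(12345)
X = rand_normal_ar1(4000, 3, 0.7, rng)

# sample correlation of columns 0 and 1 should be near rho = 0.7
n = len(X)
m0 = sum(r[0] for r in X) / n
m1 = sum(r[1] for r in X) / n
cov = sum((r[0] - m0) * (r[1] - m1) for r in X) / n
v0 = sum((r[0] - m0) ** 2 for r in X) / n
v1 = sum((r[1] - m1) ** 2 for r in X) / n
corr01 = cov / sqrt(v0 * v1)
print(abs(corr01 - 0.7) < 0.1)   # sample correlation is close to rho
```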
In summary, the AR(1) correlation matrix is an easy way to generate a symmetric positive definite matrix. You can use the DISTANCE function in SAS/IML 14.3 to create such a matrix, but for some applications you might only require the Cholesky root of the matrix. The AR(1) correlation matrix has an explicit Cholesky root that you can use to speed up simulation studies such as generating samples from a multivariate normal distribution that has an AR(1) correlation.
The post Chi-square tests for proportions in one-way tables appeared first on The DO Loop.
To make these questions concrete, let's look at some example data. According to a 2016 Pew research study, the party affiliation of registered voters in the US in 2016 was as follows: 33% of voters registered as Democrats, 29% registered as Republicans, 34% were Independents, and 4% registered as some other party. If you have a sample of registered voters, you might want to ask whether the observed proportion of affiliations matches the national averages. The following SAS data step defines the observed frequencies for a hypothetical sample of 300 voters:
data Politics;
length Party $5;
input Party $ Count;
datalines;
Dem   125
Repub  79
Indep  86
Other  10
;
You can use the TESTP= option on the TABLES statement in PROC FREQ to compare the observed proportions with the national averages for US voters. You might assume that the following statements perform the test, but there is a potential pitfall. The following statements contain a subtle error:
proc freq data=Politics;
   /* National Pct: D=33%, R=29%, I=34%, Other=04% */
   tables Party / TestP=(0.33 0.29 0.34 0.04) nocum; /* WARNING: Contains an error! */
   weight Count;
run;
If you look carefully at the OneWayFreqs table that is produced, you will see that the test proportions that appear in the fourth column are not the proportions that we intended to specify! The problem is that the order of the categories in the table is alphabetical, whereas the proportions in the TESTP= option correspond to the order in which the categories appear in the data. In an effort to prevent this mistake, the documentation for the TESTP= option warns you to "order the values to match the order in which the corresponding variable levels appear in the one-way frequency table." The order of categories is important in many SAS procedures, so always think about the order! (The ESTIMATE and CONTRAST statements in linear regression procedures are other statements for which order is important.)
To specify the correct order, you have two options: (1) list the proportions for the TESTP= option according to the alphabetical order of the categories, or (2) use the ORDER=DATA option on the PROC FREQ statement to tell the procedure to use the order of the categories as they appear in the data. The following statement uses the ORDER=DATA option to specify the proportions:
proc freq data=Politics ORDER=DATA;   /* list proportions in DATA order */
   /* D=33%, R=29%, I=34%, Other=04% */
   tables Party / TestP=(0.33 0.29 0.34 0.04); /* cellchi2 not available for one-way tables */
   weight Count;
   ods output OneWayFreqs=FreqOut;
   output out=FreqStats N ChiSq;
run;
The analysis is now correctly specified. The chi-square table indicates that the observed proportions are significantly different from the national averages at the α = 0.05 significance level.
A SAS programmer posted a similar analysis on a discussion forum and asked whether it was possible to determine which categories were the most different from the specified proportions. The analysis shows that the chi-square test rejects the null hypothesis, but it does not indicate whether only one category differs from its expected value or whether many categories differ.
Interestingly, PROC FREQ supports such an option for two-way tables when the null hypothesis is the independence of the two variables. Recall that the chi-square statistic is a sum of squares, where each cell in the table contributes one squared value to the sum. The CELLCHI2 option on the TABLES statement "displays each table cell’s contribution to the Pearson chi-square statistic.... The cell chi-square is computed as
(frequency – expected)^{2} / expected
where frequency is the table cell frequency (count) and expected is the expected cell frequency" under the null hypothesis.
Although the option is not supported for one-way tables, it is straightforward to use the DATA step to compute each cell's contribution. The previous call to PROC FREQ used the ODS OUTPUT statement to write the OneWayFreqs table to a SAS data set. It also wrote a data set that contains the sample size and the chi-square statistic. You can use these statistics as follows:
/* create macro variables for the sample size and chi-square statistic */
data _NULL_;
set FreqStats;
call symputx("NumObs", N);
call symputx("TotalChiSq", _PCHI_);
run;

/* compute the proportion of the chi-square statistic that is
   contributed by each cell in the one-way table */
data Chi2;
set FreqOut;
ExpectedFreq = &NumObs * TestPercent / 100;
Deviation = Frequency - ExpectedFreq;
ChiSqContrib = Deviation**2 / ExpectedFreq;  /* (O - E)^2 / E */
ChiSqPropor = ChiSqContrib / &TotalChiSq;    /* proportion of chi-square contributed by this cell */
format ChiSqPropor 5.3;
run;

proc print data=Chi2;
   var Party Frequency TestPercent ExpectedFreq Deviation ChiSqContrib ChiSqPropor;
run;
The table shows the numbers used to compute the chi-square statistic. For each category of the PARTY variable, the table shows the expected frequencies, the deviations from the expected frequencies, and the chi-square term for each category. The last column is the proportion of the total chi-square statistic for each category. You can see that the 'Dem' category contributes the greatest proportion. The interpretation is that the observed count of the 'Dem' group is much greater than expected and this is the primary reason why the null hypothesis is rejected.
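Because the cell contributions involve only the formula (O − E)² / E, you can verify the table by hand or in a few lines of any language. This Python sketch (not part of the SAS program) reproduces the computation for the voter data:

```python
parties = ["Dem", "Repub", "Indep", "Other"]
observed = [125, 79, 86, 10]            # counts from the Politics data set
test_prop = [0.33, 0.29, 0.34, 0.04]    # national proportions (null hypothesis)
n = sum(observed)                       # 300 voters

expected = [n * p for p in test_prop]   # expected counts: 99, 87, 102, 12
contrib = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(contrib)                   # Pearson chi-square statistic
propor = [c / chi_sq for c in contrib]  # each cell's share of the statistic

for party, c, pr in zip(parties, contrib, propor):
    print(f"{party:6s} chi2 term = {c:6.3f}  proportion = {pr:5.3f}")
# The 'Dem' cell contributes the largest share of the chi-square statistic.
```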
You can also create a bar chart that shows the contributions to the chi-square statistic. You can create the "chi-square contribution plot" by using the following statements:
title "Proportion of Chi-Square Statistic for Each Category";
proc sgplot data=Chi2;
   vbar Party / response=ChiSqPropor datalabel=ChiSqPropor;
   xaxis discreteorder=data;
   yaxis label="Proportion of Chi-Square Statistic" grid;
run;
The bar chart makes it clear that the frequency of the 'Dem' group is the primary factor in the size of the chi-square statistic. The "chi-square contribution plot" is a visual companion to the Deviation Plot, which is produced automatically by PROC FREQ when you specify the PLOTS=DEVIATIONPLOT option. The Deviation Plot shows whether the counts for each category are more than expected or less than expected. When you combine the two plots, you can make pronouncements like Goldilocks: this count is too big, this count is too small, and this count is just right.
In summary, this article addresses three topics related to testing the proportions of counts in a one-way frequency table. You can use the TESTP= option to specify the proportions for the null hypothesis. Be sure that you specify the proportions in the same order that they appear in the OneWayFreqs table. (The ORDER=DATA option is sometimes useful for this.) If the data proportions do not fit the null hypothesis, you might want to know why. One way to answer this question is to compute the contributions of each category to the total chi-square computation. This article shows how to display that information in a table or in a bar chart.
The post Radial basis functions and Gaussian kernels in SAS appeared first on The DO Loop.
One of the many useful features of the SAS/IML language is its ability to compactly represent matrix and vector expressions. The expression ||x – c|| looks like the distance between two vectors, and in the SAS/IML language the DISTANCE function can evaluate such expressions for multiple sets of vectors at once.
The following SAS/IML statements define a Gaussian kernel function. Notice that the function is very compact! To test the function, define one center at C = (2.3, 3.2). Because SAS/IML is a matrix language, you can evaluate the Gaussian kernel on a grid of integer coordinates (x,y) where x is an integer in the range [1,5] and y is in the range [1,8]. Let Z be the matrix of the 40 ordered pairs. The following call evaluates the Gaussian kernel at the grid of points:
proc iml;
/* Radial basis function (Gaussian kernel). If z is m x d and c is n x d,
   this function returns the m x n matrix of values
   exp( -||z[i,] - c[j,]||**2 / (2*sigma**2) ) */
start GaussKernel(z, c, sigma=1);
   return exp( -distance(z,c)##2 / (2*sigma**2) );
finish;

/* test on small data: Z is a 5 x 8 grid and C = {2.3 3.2} */
xPts = 1:5;  yPts = 1:8;
Z = expandgrid(xPts, yPts);   /* expand into (8*5) x 2 matrix */
C = {2.3 3.2};                /* the center */
phi = GaussKernel(Z, C);      /* phi is 40 x 1 vector */
print Z phi;                  /* print in expanded form */

phi_Grid = shapecol(phi, ncol(yPts)); /* reshape into grid (optional) */
print phi_Grid[c=(char(xPts)) r=(char(yPts)) F=4.2];
The table shows the Gaussian kernel evaluated at the grid points. The columns represent the values at the X locations and the rows indicate the Y locations. The function is largest at the value (x,y)=(2,3) because (2,3) is the grid point closest to the center (2.3, 3.2). The largest value is 0.94. Notice that the function is essentially zero at points that are more than 3 units from the center, which you would expect from a Gaussian distribution with σ = 1.
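You can verify the largest value by evaluating the kernel formula directly. The following Python sketch (an analogue of the GaussKernel function for a single point and center, not part of the original program) evaluates the kernel at the grid point nearest the center:

```python
from math import exp, dist

def gauss_kernel(z, c, sigma=1.0):
    """Gaussian kernel exp(-||z - c||^2 / (2*sigma^2)) for points z and c."""
    return exp(-dist(z, c) ** 2 / (2 * sigma ** 2))

center = (2.3, 3.2)
print(round(gauss_kernel((2, 3), center), 2))   # 0.94, the largest grid value
print(gauss_kernel((5, 8), center) < 0.001)     # essentially zero far from the center
```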
You can use the HEATMAPCONT subroutine to make a heat map of the function values. However, notice that in the matrix the rows increase in the downward direction, whereas in the usual Cartesian coordinate system the Y direction increases upward. Consequently, you need to reverse the rows and the Y-axis labels when you create a heat map:
start PlotValues( v, xPts, yPts );
   G = shapecol(v, ncol(yPts));      /* reshape vector into grid */
   M = G[nrow(G):1, ];               /* flip Y axis (rows) */
   yRev = yPts[, ncol(yPts):1];      /* reverse the Y-axis labels */
   call heatmapcont(M) xvalues=xPts yValues=yRev;
finish;
run PlotValues(phi, xPts, yPts);
Often the "centers" are the locations of some resource such as a warehouse, a hospital, or an ATM. Let's use the locations of 86 large US cities, which I used in a previous article about spatial data analysis. A graph of the locations of the cities is shown to the right. (Click to enlarge.) The locations are in a standardized coordinate system, so they do not directly correspond to longitudes and latitudes.
If there are multiple centers, the GaussKernel function returns a column for every center. Many applications require a weighted sum of the columns. You can achieve a weighted sum by using a matrix-vector product A*w, where w is a column vector of weights. If you want an unweighted sum, you can use the SAS/IML subscript reduction operator to sum across the columns: A[,+].
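If it helps to see the bookkeeping spelled out, here is the same idea as a pure-Python sketch (the names are mine; the SAS/IML matrix expressions are far more compact). Each row of the kernel matrix corresponds to one evaluation point and each column to one center, so the weighted and unweighted sums are row-wise reductions:

```python
import math

def gauss_kernel_matrix(Z, C, sigma=1.0):
    """Return the len(Z) x len(C) matrix phi[i][j] = exp(-||Z[i]-C[j]||^2 / (2*sigma^2))."""
    return [[math.exp(-sum((zi - ci)**2 for zi, ci in zip(z, c)) / (2 * sigma**2))
             for c in C]
            for z in Z]

Z = [(x/10, y/10) for x in range(5) for y in range(5)]   # a small grid of evaluation points
C = [(0.1, 0.1), (0.3, 0.3)]                             # two centers
w = [2.0, 0.5]                                           # a weight for each center

phi = gauss_kernel_matrix(Z, C, sigma=0.1)
unweighted = [sum(row) for row in phi]                            # analogue of phi[,+]
weighted = [sum(p * wj for p, wj in zip(row, w)) for row in phi]  # analogue of phi * w
```

In SAS/IML, the matrix-vector product and the subscript reduction operator do all of this in one expression each.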
For example, the following statements evaluate the Gaussian kernel function at each value in a grid (the Z matrix) and for each of 86 cities (the C matrix). The result is a 3726 x 86 matrix of values. You can use the subscript reduction operator to sum the kernel evaluations over the cities, as shown:
use BigCities;
   read all var {x y} into C;          /* C = (x,y) locations of centers */
   read all var "City";
close;

/* Z = a regular grid in (x,y) coordinates that contains the data */
XGridPts = round( do(-0.4, 0.4, 0.01), 0.001);
YGridPts = round( do(-0.2, 0.25, 0.01), 0.001);
Z = expandgrid( XGridPts, YGridPts );  /* 3,726 points on an 81 x 46 grid */
phi = GaussKernel(Z, C, 0.025);        /* use a smaller bandwidth */
sumPhi = phi[,+];                      /* for each grid point, sum the kernel evaluations */
The resulting heat map shows blobs centered at each large city in the data. Locations near isolated cities (such as Oklahoma City) are lighter in color than locations near multiple nearby cities (such as southern California and the New York area) because the image shows the superposition of the kernel functions. At points that are far from any large city, the sum of the Gaussian kernel functions is essentially zero.
In summary, if you work with algorithms that use radial basis functions such as Gaussian kernels, you can use the SAS/IML language to evaluate these functions. By using the matrix features of the language and the fact that the DISTANCE function supports matrices as arguments, you can quickly and efficiently evaluate weighted sums of these kernel functions.
The post How many perfect riffle shuffles are required to restore a deck to its initial order? appeared first on The DO Loop.
After being reminded of this interesting fact, I wondered how the result generalizes to decks of various sizes. That is, if you use a deck with N cards, what is the minimum number of perfect riffle shuffles (call it P(N)) that you need to restore the deck to its original order? I decided to run a SAS program to discover the answer. The result is summarized in the following graph, which plots P(N) versus the number of cards in a deck. All points are below the identity line, which implies that at most N – 1 shuffles are required for a deck that contains N cards. If you want to learn more about the graph and its interesting patterns, read on.
The perfect riffle shuffle is easier to program than the Gilbert-Shannon-Reeds (GSR) model, which uses randomness to model how real people shuffle real cards. In a perfect riffle shuffle, you cut the deck into exactly two equal parts. Then you interlace the cards from the top stack with the cards from the bottom stack. The new deck ordering is TBTBTB..., where T indicates a card from the top half of the original deck and B indicates a card from the bottom half.
There are actually two perfect riffle shuffles for an even number of cards: you can interlace the cards TBTBTB..., which is called an "out shuffle," or you can interlace the cards BTBTBT..., which is called an "in shuffle." For odd-numbered decks, there is also a choice of where to cut the deck: does the top part of the deck get the extra card or does the bottom part? My program always uses the "out shuffle" convention (the top card stays on top) and gives the top stack the extra card when there is an odd number of cards. Therefore, the shuffled deck looks like TBTB...TB for even-numbered decks and TBTB...TBT for odd-numbered decks. The following SAS/IML function takes a row vector that represents a card deck and performs a perfect riffle shuffle to reorder the cards in the deck:
proc iml;
start PerfectRiffle(deck);
   N = ncol(deck);                 /* send in deck as a row vector */
   shuffle = j(1,N,.);             /* allocate new deck */
   nT = ceil(N/2);                 /* cut into two stacks; "Top" gets extra card when N odd */
   nB = N - nT;                    /* nT = size of Top; nB = size of Bottom */
   shuffle[, do(1,N,2)] = deck[,1:nT];     /* top cards go into odd positions */
   shuffle[, do(2,N,2)] = deck[,nT+1:N];   /* bottom cards go into even positions */
   return(shuffle);
finish;

/* test the algorithm on a deck that has N = 10 cards */
origDeck = 1:10;
deck = PerfectRiffle(origDeck);
print (origDeck//deck)[r={"Orig Deck" "Perfect Riffle"}];
The output shows one perfect riffle of a deck that has 10 cards that are originally in the order 1, 2, ..., 10.
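If you want to experiment outside of SAS/IML, the same out-shuffle convention is easy to express as a stand-alone Python sketch (the function name is mine). It reproduces the 10-card example:

```python
def perfect_riffle(deck):
    """Perfect out-shuffle: the top card stays on top, and the top
    stack gets the extra card when the deck has an odd number of cards."""
    n_top = (len(deck) + 1) // 2           # same as ceil(N/2)
    top, bottom = deck[:n_top], deck[n_top:]
    shuffled = []
    for i in range(n_top):
        shuffled.append(top[i])            # T: card from the top stack
        if i < len(bottom):
            shuffled.append(bottom[i])     # B: card from the bottom stack
    return shuffled

print(perfect_riffle(list(range(1, 11))))  # [1, 6, 2, 7, 3, 8, 4, 9, 5, 10]
```

For an odd deck, the pattern ends with a T, as described earlier: a 5-card deck becomes [1, 4, 2, 5, 3].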
If you call the PerfectRiffle function repeatedly on the same deck, you can simulate repeated perfect riffle shuffles. After each shuffle, you can test whether the order of the deck matches the original order. If so, you can stop shuffling. If not, you can shuffle again. The following SAS/IML loops test decks that contain up to 1,000 cards. The vector nUntilRestore keeps track of how many perfect riffle shuffles were done for each deck size.
decksize = 1:1000;                       /* N = 1, 2, 3, ..., 1000 */
nUntilRestore = j(1, ncol(decksize), .); /* the results vector */
do k = 2 to ncol(decksize);              /* for the deck of size N */
   origDeck = 1:decksize[k];             /* original order is 1:N */
   deck = origDeck;                      /* set order of deck */
   origOrder = 0;                        /* flag variable */
   do nShuffles = 1 to decksize[k] until (origOrder);  /* repeat shuffling */
      deck = PerfectRiffle( deck );      /* shuffle deck */
      origOrder = all(deck = origDeck);  /* until the order of the cards is restored */
   end;
   nUntilRestore[k] = nShuffles;         /* store the number of shuffles */
end;

/* print the results for N=1,...,99 */
x = j(10, 10, .);
x[2:100] = nUntilRestore[1:99];          /* present in 10x10 array */
rr = putn( do(0, 90, 10), "Z2.");        /* each row has 10 elements */
cc = ("0":"9");
print x[r=rr c=cc L="Number of Perfect Shuffles to Restore Order"];

title "Number of Perfect Shuffles to Restore Deck Order";
call scatter(deckSize, nUntilRestore) grid={x y}
     other="lineparm x=0 y=0 slope=1;"
     label={"N: Number of Cards in Deck" "P(N): Number of Perfect Shuffles to Restore Order"}
     procopt="noautolegend";
The table shows the number of perfect riffle shuffles required for decks that have at most 99 cards. The results are in a 10x10 grid. The first row shows decks that have fewer than 10 cards, the second row shows decks with 10-19 cards, and so on. For example, to find the result for a 52-card deck, move down to the row labeled "50" and over to the column labeled "2". The result in that cell is 8, which is the number of perfect shuffles required to restore a 52-card deck to its original order.
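As a cross-check on the 52-card result, the brute-force search is easy to replicate in a few lines of Python (a stand-alone sketch, not the SAS/IML program above):

```python
def perfect_riffle(deck):
    """Perfect out-shuffle; the top stack gets the extra card for odd N."""
    n_top = (len(deck) + 1) // 2
    top, bottom = deck[:n_top], deck[n_top:]
    out = []
    for i in range(n_top):
        out.append(top[i])
        if i < len(bottom):
            out.append(bottom[i])
    return out

def shuffles_to_restore(n):
    """Smallest P(N) such that P(N) perfect out-shuffles restore an N-card deck."""
    orig = list(range(n))
    deck, count = perfect_riffle(orig), 1
    while deck != orig:
        deck, count = perfect_riffle(deck), count + 1
    return count

print(shuffles_to_restore(52))   # 8
```

The loop confirms the well-known fact that eight perfect out-shuffles restore a standard 52-card deck.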
The graph at the top of this article shows the number of shuffles for decks that contain up to 1,000 cards. To me, the most surprising feature of the graph is the number of points that fall near diagonal lines of various rational slopes. The most apparent diagonal features are the lines whose slopes are approximately equal to 1, 1/2, and 1/3.
A complete mathematical description of this problem is beyond the scope of this blog article, but here are a few hints and links to whet your appetite.
There is much more to be said about this topic, but I'll stop here. If you have something to add, please leave a comment.
The post Shuffling smackdown: Overhand shuffle versus riffle shuffle appeared first on The DO Loop.
The most popular way to mix a deck of cards is the riffle shuffle, which separates the deck into two pieces and interleaves the cards from each piece. Besides being popular with card players, the riffle shuffle is well-known among statisticians because Bayer and Diaconis (1992) showed that it takes seven riffle shuffles to randomize a 52-card deck. This result annoyed many recreational card players because shuffling seven times "slows down the game".
The second most popular shuffling technique is the overhand shuffle. The overhand shuffle is much less efficient at mixing the cards than a riffle shuffle. A result by Pemantle (1989) indicates that at least 50 overhand shuffles are required to adequately mix the deck, but computer experiments (Pemantle, 1989, p. 49) indicate that in many situations 1,000 or more overhand shuffles are required to randomize a deck. Talk about slowing down the game! This article shows how to simulate an overhand shuffle in SAS and visually compares the mixing properties of the riffle and overhand shuffles.
An overhand shuffle transfers a few cards at a time from the dealer's right hand to his left. The dealer slides a few cards from the top of the deck in his right hand into his left hand. The process is repeated until all cards in his right hand are transferred. Consequently, cards that started near the top of the deck end up near the bottom, and vice versa.
Mathematically, you can model an overhand shuffle by randomly choosing k cut points that separate the deck into k+1 "packets" of contiguous cards. The size of a packet can vary, as can the number of packets. The overhand shuffle reverses the order of the packets while preserving the order of the cards within each packet. When the packet sizes are small, the cards are mixed better than when the packets are bigger.
For example, a deck that contains 10 cards (enumerated 1-10) might be split into five packets such as (1)(234)(5)(678)(9 10), where the parentheses show the cards in each of the packets. In this example, the cut points are the locations after cards 1, 4, 5, and 8. The packets contain 1, 3, 1, 3, and 2 cards. The overhand shuffle reverses the order of the packets, so the deck is reordered as (9 10)(678)(5)(234)(1). This is shown in the following graphic.
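The packet-reversal step is easy to express outside of SAS. This stand-alone Python sketch (the function name is mine) reproduces the 10-card example for cut points after cards 1, 4, 5, and 8:

```python
def overhand_from_cuts(deck, cut_points):
    """Split the deck after the given 1-based positions and reverse the
    order of the packets, preserving the order within each packet."""
    bounds = [0] + list(cut_points) + [len(deck)]
    packets = [deck[bounds[i]:bounds[i+1]] for i in range(len(bounds) - 1)]
    return [card for packet in reversed(packets) for card in packet]

deck = list(range(1, 11))                      # cards 1-10
print(overhand_from_cuts(deck, [1, 4, 5, 8]))  # [9, 10, 6, 7, 8, 5, 2, 3, 4, 1]
```

With no cut points at all, the whole deck is a single packet and the shuffle leaves the order unchanged.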
In a statistical model of the overhand shuffle, the cut points are assumed to be randomly chosen with probability p from among the N-1 possible locations between adjacent cards. An ideal overhand shuffle uses p = 1/2, which results in packets that typically contain 1, 2, or 3 cards (the packet size is geometrically distributed). However, among amateurs, the probability is much lower and an overhand shuffle reorders only a few large packets, which can lead to poor mixing of the cards.
I previously showed how to simulate a riffle shuffle in the SAS/IML language. The following program shows how to simulate an overhand shuffle. The deck of cards (represented as a vector of integers with the values 1 to N) is passed in, as is the probability of choosing a cut point. To test the program, I pass in a deck that has 10 cards enumerated by the integers 1–10. The new order of the deck was shown in the previous section.
proc iml;
call randseed(1234);

/* Simulate overhand shuffle. Randomly split deck into k packets.
   Reverse the order of the packets. Each position between cards is
   chosen to be a cut point with probability p. (Default: p=0.5) */
start Overhand(_deck, p=0.5);
   deck = colvec(_deck);              /* make sure input is column vector */
   N = nrow(deck);                    /* number of cards in deck */
   b = randfun(N-1, "Bernoulli", p);  /* randomly choose cut points */
   if all(b=0) then return _deck;     /* no cut points were chosen */
   cutpt = loc(b=1);                  /* cut points in range 1:N-1 */
   cutpt = 0 || cutpt || N;           /* append 0 and N */
   rev = N - cutpt + 1;               /* reversed positions */
   newDeck = j(N, 1, .);
   do i = 1 to ncol(cutpt)-1;         /* for each packet... */
      R = (1 + cutpt[i]):cutpt[i+1];  /* positions of cards in right hand */
      L = rev[i+1]:(rev[i]-1);        /* new positions in left hand */
      newDeck[L] = deck[R];           /* transfer packet from right to left hand */
   end;
   return shape(newDeck, nrow(_deck), ncol(_deck));  /* return same shape as input */
finish;

deck = 1:10;
OverhandDeck = Overhand(deck, 0.5);
print deck;
print OverhandDeck;
Let's now consider the standard 52-card deck, represented by the integers 1–52. The following loop performs 10 overhand shuffles (p = 0.5) and uses a heat map to show the order of the cards after each shuffle:
N = 52;               /* standard 52-card deck */
deck = 1:N;           /* initial order */
numShuffles = 10;     /* number of times to shuffle cards */
OverDeck = j(numShuffles+1, N);
OverDeck[1,] = deck;
do i = 2 to numShuffles + 1;
   OverDeck[i,] = Overhand(OverDeck[i-1,]);
end;
ramp = palette("YLORBR", 9);
call heatmapcont(OverDeck) colorramp=ramp yvalues=0:numShuffles
     title="Positions of 52 Cards After Overhand Shuffles";
The cell colors in the heat map correspond to the initial position of the cards: light colors indicate cards that are initially near the top of the deck and dark colors represent cards that are initially near the bottom of the deck. The first row shows the initial positions of the cards in the deck. The second row shows the cards after one overhand shuffle. The third row shows the cards after two overhand shuffles, and so on. It is interesting to see how the pattern changes if you use p=0.25 or p=0.1. Try it!
One feature that is apparent in the image is that the relative positions of cards depend on whether you have performed an even or an odd number of overhand shuffles. After an odd number of shuffles, the top of the deck contains many cards that were initially at the bottom, and the bottom of the deck contains cards that were initially at the top. After an even number of shuffles, the top of the deck contains many cards that were initially at the top. Another apparent feature is that this process does not mix the cards very well. The last row in the heat map indicates that after 10 overhand shuffles, many cards are not very far from their initial positions. Furthermore, many cards that were initially next to each other are still within a few positions of each other.
Maybe additional overhand shuffles will address these deficiencies? The following program simulates 100 overhand shuffles. To aid in the visualization, the heat map shows only the results after 2, 4, 6, ..., and 100 shuffles (the even numbers):
deck = 1:N;             /* reset initial order */
numShuffles = 100;      /* number of times to shuffle cards */
numEvenShuffles = numShuffles / 2;
OverDeck = j(numEvenShuffles+1, N);
OverDeck[1,] = deck;
/* save only the EVEN shuffles */
do i = 2 to numEvenShuffles + 1;
   OverDeck[i,] = Overhand(Overhand( OverDeck[i-1,] ));  /* 2 calls */
end;
call heatmapcont(OverDeck) colorramp=ramp yvalues=0:numEvenShuffles
     title="Positions of 52 Cards After EVEN Overhand Shuffles";
The heat map indicates that the deck is not well-mixed even after 100 overhand shuffles. Look at the last row in the heat map, which shows the positions of the cards in the deck colored by their initial position. There are a small number of light-colored cells on the right side of the heat map (cards that were initially on top; now near the bottom), and a small number of dark-colored cells on the left side of the heat map (cards that were initially on the bottom; now near the top). However, in general, the left side of the heat map contains mostly light-colored cells and the right side contains mostly dark-colored cells, which indicates that most cards have not moved far from their initial positions.
If you use smaller values of p (that is, the average packet size is larger), you will observe different patterns. The theory says that large packets cause nearby cards to tend to stay together for many shuffles. Jonasson (2006) was able to prove that if a deck has N cards, then O(N^{2} log(N)) overhand shuffles are required to randomize the deck. This compares with (3/2) log_{2}(N) for the riffle shuffle, which is presented in the next section.
Compare the overhand shuffle with the riffle shuffle, which randomizes a deck after about seven shuffles. You can simulate a riffle shuffle in SAS and create a similar heat map. The following heat map shows the order of a 52-card deck after each of 10 riffle shuffles:
The heat map shows how quickly a riffle shuffle randomizes a deck. The last row (after 10 riffles) shows that light-colored cells are distributed throughout the deck, as are dark-colored cells and medium-colored cells. Furthermore, this behavior seems to occur after six or seven rows (at least to my untrained eye). I cannot see much difference between the "randomness" in row 6 and row 10. However, theoretically, a card in an earlier row (say, row 5) is more likely to be near cards that were initially close to it. For example, a dark-colored cell in row 5 is more likely to be near another dark-colored cell than would be expected in a randomized deck. However, by the time you perform 7 or more riffle shuffles, two cards that started near each other might be located far apart or might be close to each other. You cannot predict which will occur because the cards have been randomized.
In summary, the overhand shuffle does not mix cards very well. The simulation indicates that an overhand shuffle is not effective at separating cards that begin near each other. Even after 100 overhand shuffles (with p=0.5), the positions of many cards are only slightly different from their initial positions. In contrast, a riffle shuffle is an effective way to randomize a deck in approximately seven shuffles.
The post Linearly spaced vectors in SAS appeared first on The DO Loop.
Syntactically, the main difference between the DO function in SAS/IML and the linspace function in MATLAB is that the third argument to the DO function is a step size (an increment), whereas the third argument to the linspace function is the number of points to generate in an interval. But that's no problem: to generate n evenly spaced points on the interval [a, b], you can use a step size of (b – a)/(n – 1). Therefore, the following SAS/IML function is a drop-in replacement for the MATLAB linspace function:
proc iml;
/* generate n evenly spaced points (a linearly spaced vector) in the interval [a,b] */
start linspace(a, b, numPts=100);
   n = floor(numPts);               /* if n is not an integer, truncate */
   if n < 1 then return( {} );      /* return empty matrix */
   else if n=1 then return( b );    /* return upper endpoint */
   return( do(a, b, (b-a)/(n-1)) ); /* return n equally spaced points */
finish;
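For comparison, here is the same logic as a stand-alone Python sketch (not the MATLAB or SAS implementation), using the step size (b – a)/(n – 1) and handling the same edge cases:

```python
def linspace(a, b, num_pts=100):
    """Return num_pts evenly spaced points in [a, b], inclusive of both endpoints."""
    n = int(num_pts)         # if n is not an integer, truncate
    if n < 1:
        return []            # empty result
    if n == 1:
        return [b]           # match the SAS function: return the upper endpoint
    step = (b - a) / (n - 1)
    return [a + i * step for i in range(n)]

print(linspace(-3, 3, 5))    # [-3.0, -1.5, 0.0, 1.5, 3.0]
```

Because the step size here (1.5) is exactly representable in binary, the five values come out exact; the next section shows what happens when the step size is not exactly representable.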
A typical use for the linspace function is to generate points in the domain of a function so that you can quickly visualize the function on an interval. For example, the following statements visualize the function exp( -x^{2} ) on the domain [-3, 3]:
x = linspace(-3, 3);         /* by default, 100 points in [-3,3] */
title "y = exp( -x^2 )";
call series(x, exp(-x##2));  /* graph the function */
This is a good time to remind everyone of the programmer's maxim (from Kernighan and Plauger, 1974, The Elements of Programming Style) that "10.0 times 0.1 is hardly ever 1.0." Similarly, "5 times 0.2 is hardly ever 1.0." The maxim holds because many finite decimal values in base 10 have a binary representation that is infinite and repeating. For example, 0.1 and 0.2 have repeating expansions in base 2. Specifically, 0.2_{10} = 0.00110011..._{2}. Thus, just as 3 * (0.3333333) is not equal to 1 in base 10, so too is 5 * 0.00110011..._{2} not equal to 1 in base 2.
A consequence of this fact is that you should avoid testing floating-point values for equality. For example, if you generate evenly spaced points in the interval [-1, 1] with a step size of 0.2, do not expect that 0.0 is one of the points that are generated, as shown by the following statements:
z = do(-1, 1, 0.2);
/* find all points that are integers */
idx = loc( z = int(z) );       /* test for equality (bad idea) */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];  /* oops! 0.0 is not there! */
print z;                       /* show that 0.0 is not one of the points */
When you query for all values for which z = int(z), only the values -1 and +1 are found. If you print out the values in the vector, you'll see that the middle value is an extremely tiny but nonzero value (-5.55E-17). This is not a bug but is a consequence of the fact that 0.2 is represented as a repeating value in binary.
So how can you find the points in a vector that "should be" integers (in exact arithmetic) but might be slightly different than an integer in floating-point arithmetic? The standard approach is to choose a small distance (such as 1e-12 or 1e-14) and look for floating-point numbers that are within that distance from an integer. In SAS, you can use the ROUND function or check the absolute value of the difference, as follows:
eps = 1e-12;
w = round(z, eps);             /* round to nearest eps */
idx = loc( int(w) = w );       /* find points that are within eps of an integer */
print idx;

idx = loc( abs(int(z) - z) < eps );  /* find points whose distance to an integer is less than eps */
print (idx // z[,idx])[r={'idx', 'z[idx]'}];
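The same behavior, and the same tolerance-based fix, appears in any language that uses IEEE double-precision arithmetic. The following stand-alone Python sketch (the variable names are mine) builds an analogous sequence by repeatedly adding 0.2, then compares the equality test with the tolerance test:

```python
z = []
val = -1.0
for _ in range(11):        # intended values: -1.0, -0.8, ..., 0.8, 1.0
    z.append(val)
    val += 0.2             # 0.2 is not exactly representable in binary

exact = [i for i, v in enumerate(z) if v == int(v)]                # equality (bad idea)
close = [i for i, v in enumerate(z) if abs(v - round(v)) < 1e-12]  # tolerance (robust)

print(z[5])     # a tiny nonzero value such as -5.55e-17, not 0.0
print(exact)    # the equality test misses the middle point
print(close)    # the tolerance test finds all three near-integer points
```

The middle element is a tiny nonzero number rather than 0.0, so the equality test misses it, exactly as in the SAS output above.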
In summary, this article shows how to define a SAS/IML function that is equivalent to the MATLAB linspace function. It also reminds us that some finite decimal values (such as 0.1 and 0.2) do not have finite binary representations. When these values are used to generate an arithmetic sequence, the resulting vector of values might be different from what you expect. A wise practice is to never test a floating-point value for equality, but instead to test whether a floating-point value is within a small distance from a target value.