The post Compute with combinations: Maximize a function over combinations of variables appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
About once a month I see a question on the SAS Support Communities that involves what I like to call “computations with combinations.” A typical question asks how to find k values (from a set of p values) that maximize or minimize some function, such as “I have 5 variables, and for each observation I want to find the largest product among any 3 values.”
These types of problems are specific examples of a single abstract problem: generate all subsets of size k from a set of p values, evaluate a function on each subset, and return an extreme value (such as the maximum or minimum) over all subsets.
This is an “exhaustive” method that explicitly generates all subsets, so clearly this technique is impractical for large values of p.
The examples that I’ve seen on discussion forums often use p ≤ 10 and small values of k (often 2, 3, or 4). For parameters in this range, an exhaustive solution is feasible.
This general problem includes “leave-one-out” or jackknife estimates as a special case (k = p – 1), so clearly this formulation is both general and powerful.
This formulation also includes the knapsack problem in discrete optimization. In the knapsack problem, you have p items and a knapsack that can hold k items. You want to choose the items so that the knapsack holds as much value as possible. The knapsack problem maximizes the sum of the values whereas the general problem in this article can handle nonlinear functions of the values.
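To make the exhaustive approach concrete outside of SAS, here is a minimal Python sketch (the data values are hypothetical): for one observation of p values, it evaluates a function (here, the product) on every k-subset and returns the maximum.

```python
from itertools import combinations
from math import prod

# Exhaustive approach: evaluate a function (here, the product) on every
# k-subset of one observation's p values, then take the maximum.
def max_k_product(values, k):
    return max(prod(c) for c in combinations(values, k))

row = [1, -2, 3, 4, -5]       # hypothetical observation with p = 5 values
best = max_k_product(row, 3)  # examines all C(5,3) = 10 triples
```

For this row the maximum triple product is 40, from the triple (-2, 4, -5); the two negative values multiply to a positive.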
You can use the following DATA step to simulate integer data with a specified number of columns and rows. I use the relatively new “Integer” distribution to generate uniformly distributed integers in the range [-3, 9].
%let p = 5;      /* number of variables */
%let NObs = 6;   /* number of observations */

data have(drop=i j);
call streaminit(123);
array x[&p];
do i = 1 to &NObs;
   do j = 1 to dim(x);
      x[j] = rand("Integer", -3, 9);   /* SAS 9.4M4 */
   end;
   output;
end;
run;

proc print data=have; run;
For p = 5 and k = 3, the problem is: “For each observation of the 5 variables, find the largest product among any 3 values.”
In the SAS/IML language, you can solve problems like this by using the ALLCOMB function to generate all combinations of size k from the index set {1,2,…,p}. These values are indices that you can use to reference each combination of values. You can evaluate your function on each combination and then compute the max, min, mean, etc.
For example, the following SAS/IML statements generate all combinations of 3 values from the set {1, 2, 3, 4, 5}:
proc iml;
p = 5;  k = 3;
c = allcomb(p, k);   /* combinations of p items taken k at a time */
print c;
A cool feature of the SAS/IML language is that you can use these values as column subscripts! In particular, the expression X[i, c] generates all 3-fold combinations of values in the i_th row. You can then use the SHAPE function to reshape the values into a matrix that has 3 columns, as follows:
/* Example: find all combinations of elements in the first row */
varNames = "x1":"x5";
use have;  read all var varNames into X;  close;

Y = X[1, c];               /* all combinations of columns for 1st row */
M = shape(Y, nrow(c), k);  /* reshape so each row has k elements */
prod = M[ , #];            /* product of elements across columns */
print M prod;
Notice that each row of the matrix M contains k = 3 elements of Y. There are “5 choose 3” = 10 possible ways to choose 3 items from a set of 5, so the M matrix has 10 rows. Notice that you can use a subscript reduction operator (#) to compute the product of elements for each combination of elements. The maximum three-value product for the first row of data is 24.
The following loop performs this computation for each observation. The result is a vector that contains the maximum three-value product of each row. The original data and the results are then displayed side by side:
/* for each row of X, find the maximum product of three elements */
result = j(nrow(X), 1);
do i = 1 to nrow(X);
   Y = X[i, c];               /* i_th row and all combinations of columns */
   M = shape(Y, nrow(c), k);  /* reshape so each row has k elements */
   result[i] = max( M[ ,#] ); /* max product over all combinations */
end;
print X[colname=varNames] result[L="maxMul"];
Of course, if the computation for each observation is more complicated than in this example, you can define a SAS/IML module that computes the result and then call it like this: result[i] = MyFunc(M);
You can perform a similar computation in the DATA step, but it requires more loops. You can use the ALLCOMB function (or the LEXCOMBI function) to generate all k-fold combinations of the indices {1, 2, …, p}.
You should call the ALLCOMB function inside a loop from 1 to NCOMB(p, k).
Inside the loop, you can evaluate the objective function on each combination of data values.
Many DATA step functions such as MAX, MIN, SMALLEST, and LARGEST accept arrays of variables, so you probably want to store the variables and the indices in arrays. The following DATA step contains comments that describe each step of the program:
%let p = 5;
%let k = 3;
%let NChooseK = %sysfunc(comb(&p,&k));   /* N choose k */

data Want(keep=x1-x&p maxProd);
set have;
array x[&p] x1-x&p;     /* array of data */
array c[&k];            /* array of indices */
array r[&NChooseK];     /* array of results for each combination */
ncomb = comb(&p, &k);   /* number of combinations */
do i = 1 to &k;  c[i] = 0;  end;   /* zero the index array before the first call to ALLCOMBI */

do j = 1 to ncomb;
   call allcombi(&p, &k, of c[*]);  /* generate j_th combination of indices */
   /* evaluate function of the array {x[c[1]], x[c[2]], ..., x[c[k]]} */
   r[j] = 1;                        /* initialize product to 1 */
   do i = 1 to &k;  r[j] = r[j] * x[c[i]];  end;  /* product of j_th combination */
end;
maxProd = max(of r[*]);   /* max of products */
output;
run;

proc print data=Want; run;
The DATA step uses an array (R) of values to store the result of the function evaluated on each subset. For a MAX or MIN computation, this array is not necessary because you can keep track of the current MAX or MIN inside the loop over combinations. However, for more general problems (for example, find the median value), an array might be necessary.
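The same point can be sketched in Python (with hypothetical data values): a running maximum needs no storage, but a statistic such as the median must see all C(p, k) results, which is why the DATA step stores them in an array.

```python
from itertools import combinations
from math import prod
from statistics import median

row = [2, 1, 4, 3, 5]   # hypothetical observation with p = 5 values
prods = [prod(c) for c in combinations(row, 3)]

mx = max(prods)         # could be tracked on the fly; no storage needed
med = median(prods)     # requires the full list of C(5,3) = 10 products
```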
In summary, this article shows how to solve a general class of problems. The general problem generates all subsets of size k from a set of size p. For each subset, you evaluate a function and produce a statistic. From among the “p choose k” statistics, you then choose the max, min, or some other measure. This article shows how to solve these problems efficiently in the SAS/IML language or in the SAS DATA step. Because this is a “brute force” technique, it is limited to small values of p. I suggest p ≤ 25.
The post Visualize repetition in song lyrics appeared first on The DO Loop.
One of my favorite magazines, Significance, printed an intriguing image of a symmetric matrix that shows repetition in a song’s lyrics. The image was created by Colin Morris, who has created many similar images.
When I saw these images, I knew that I wanted to duplicate the analysis in SAS!
The analysis is easy. Suppose that a song (or any text source) contains N words. Define the repetition matrix to be the N x N matrix where the (i,j)th cell has the value 1 if the i_th word is the same as the j_th word. Otherwise, the (i,j)th cell equals 0. Now visualize the matrix by using a heat map: Black indicates cells where the matrix is 1 and white for 0. A SAS program that performs this analysis is available at the end of this article.
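Outside of SAS, the construction can be sketched in a few lines of Python (the word list here is illustrative):

```python
# Repetition matrix: M[i][j] = 1 when the i_th word equals the j_th word
# (case-insensitive), and 0 otherwise.  The matrix is symmetric and has
# 1s on the diagonal.
def repetition_matrix(words):
    w = [s.lower() for s in words]
    n = len(w)
    return [[1 if w[i] == w[j] else 0 for j in range(n)] for i in range(n)]

M = repetition_matrix(["Row", "row", "your", "boat"])
```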
To illustrate this algorithm, consider the nursery rhyme “Row, Row, Row Your Boat”:
Row, row, row your boat
Gently down the stream
Merrily, merrily, merrily, merrily
Life is but a dream.
There are 18 words in this song. Words 1–3 (“row”) are repeated, as are words 10–13 (“merrily”).
You can use the SAS DATA step to read the words of the song into a variable and use other SAS functions to strip out any punctuation. You can then use SAS/IML software to construct and visualize the repetition matrix. The details are shown at the end of this article.
The repetition matrix for the song “Row, Row, Row Your Boat” is shown to the right. For this example I could have put the actual words along the side and bottom of the matrix, but that is not feasible for songs that have hundreds of words. Instead, the matrix has a numerical axis where the number indicates the position of each word in the song.
Every repetition matrix has 1s on the diagonal. In this song,
the words “row” and “merrily” are repeated. Consequently, there is a 3 x 3 block of 1s at the top left and a 4 x 4 block of 1s in the middle of the matrix.
As mentioned, this song has very little repetition. One way to quantify the amount of repetition is to compute the proportion of 1s in the upper triangular portion of the repetition matrix. The upper triangular portion of an
N x N matrix has N(N–1)/2 elements. For this song, N=18, so there are 153 cells and 9 of them are 1s. Therefore the “repetition score” is 9 / 153 = 0.059.
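As a cross-check, the score can be computed directly from the word list in Python (the punctuation handling here is a simplification of the COMPRESS step used later):

```python
# Repetition score: proportion of 1s in the strict upper triangle of the
# repetition matrix, i.e., the fraction of word pairs (i < j) that match.
def repetition_score(words):
    w = [s.lower().strip(".,!?") for s in words]
    n = len(w)
    hits = sum(1 for i in range(n) for j in range(i + 1, n) if w[i] == w[j])
    return hits / (n * (n - 1) // 2)

lyric = ("Row row row your boat gently down the stream "
         "merrily merrily merrily merrily life is but a dream").split()
score = repetition_score(lyric)   # 9 matching pairs out of 153
```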
I wrote a SAS/IML function that creates and visualizes the repetition matrix and returns the repetition score. In order to visualize songs that might have hundreds of words, I suppress the outlines (the grid) in the heat map. To illustrate the output of the function, the following image visualizes the words of the song “Here We Go Round the Mulberry Bush”:
Here we go round the mulberry bush,
The mulberry bush,
The mulberry bush.
Here we go round the mulberry bush
So early in the morning.
The repetition score for this song is 0.087. You can see diagonal “stripes” that correspond to the repeating phrases “here we go round” and “the mulberry bush”.
In fact, if you study only the first seven rows, you can “see” almost the entire structure of the song. The first seven words contain all lyrics except for four words (“so”, “early”, “in”, “morning”).
Let’s visualize the repetitions in the lyrics of several classic songs.
When I saw Morris’s examples, the first song I wanted to visualize was “Hey Jude” by the Beatles. Not only does the title phrase repeat throughout the song, but the final chorus (“Nah nah nah nah nah nah, nah nah nah, hey Jude”) repeats more than a dozen times. This results in a very dense block in the lower right corner of the repetition matrix and a very high repetition score of 0.183. The following image visualizes “Hey Jude”:
The second song that I wanted to visualize was “Love Shack” by The B-52s. In addition to a title that repeats almost 40 times, the song contains a sequence near the end in which the phrase “Bang bang bang on the door baby” is alternated with various interjections. The following visualization of the repetition matrix indicates that there is a lot of variation interspersed with regular repetition. The repetition score is 0.035.
Lastly, I wanted to visualize the song “Call Me” by Blondie. This classic song has only 241 words, yet the title is repeated 41 times! In other words, about 1/3 of the song consists of those two words! Furthermore, there is a bridge in the middle of the song in which the phrase “oh oh oh oh oh” is alternated with other phrases (some in Italian and French) that appear only once in the song. The repetition score is 0.077. The song is visualized below:
If you think this is a fun topic, you can construct these images yourself by using SAS.
If you discover a song that has an interesting repetition matrix, post a comment!
Here’s the basic idea of how to construct and visualize a repetition matrix.
First, use the DATA step to read each word, use the COMPRESS function to remove any punctuation, and standardize the input by transforming all words to lowercase:
data Lyrics;
length word $20;
input word @@;
word = lowcase( compress(word, ,'ps') );   /* remove punctuation and spaces */
datalines;
Here we go round the mulberry bush,
The mulberry bush,
The mulberry bush.
Here we go round the mulberry bush
So early in the morning.
;
In SAS/IML software you can use the ELEMENT function to find the locations in the i_th row that have the value 1. After you construct a repetition matrix, you can use the HEATMAPDISC subroutine to display it. For example, the following SAS/IML program reads the words of the song into a vector and visualizes the repetition matrix. It also returns the repetition score, which is the proportion of 1s in the upper triangular portion of the matrix.
ods graphics / width=500 height=500;   /* for 9.4M5 you might need NXYBINSMAX=1000000 */
proc iml;
/* define a function that creates and visualizes the repetition matrix */
start VizLyrics(DSName, Title);
   use (DSName);  read all var _CHAR_ into Word;  close;
   N = nrow(Word);
   M = j(N, N, 0);                      /* allocate N x N matrix */
   do i = 1 to N;
      M[ ,i] = element(Word, Word[i]);  /* construct i_th column (= i_th row, by symmetry) */
   end;
   run heatmapdisc(M) title=Title colorramp={white black}
       displayoutlines=0 showlegend=0;
   /* compute the proportion of 1s in the upper triangular portion of the matrix */
   upperIdx = loc(col(M) > row(M));
   return ( M[upperIdx][:] );           /* proportion of words that are repeated */
finish;

score = VizLyrics("Lyrics", "Here We Go Round the Mulberry Bush");
print score;
If you want to reproduce the images in this post, you can
download the SAS program for this article.
In addition, the program creates repetition matrices for “We Didn’t Start the Fire” (Billy Joel) and a portion of Martin Luther King Jr.’s “I Have a Dream” speech.
You can modify the program and enter lyrics for your favorite songs.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
My buddy Rick Wicklin recently pointed me towards an animation of some opioid prescription rate data for Illinois. And, of course, I decided we needed a similar animation for North Carolina (with a few improvements…) Here’s the original, and here are the problems that jump out at me: Counties with […]
The post Where are opioids prescribed most, in North Carolina? appeared first on SAS Learning Post.
The post Pi, special functions, and distributions appeared first on The DO Loop.
Welcome to my annual Pi Day post. Every year on March 14th (written 3/14 in the US), geeky mathematicians and their friends celebrate “all things pi-related” because 3.14 is the three-decimal approximation to pi.
Pi is a mathematical constant that never changes. Pi is the same value today as it was in ancient Babylon and Greece. The timeless constancy of pi is a comforting presence in a world of rapid change.
But even though the value of pi does not change, our knowledge about pi does change and grow.
I was reminded of this recently when I opened my worn copy of
the Handbook of Mathematical Functions (more commonly known as “Abramowitz and Stegun,” the names of its editors). When the 1,046-page Handbook was published in 1964, it was the premier reference volume for applied mathematicians and mathematical scientists.
Interestingly, pi is not even listed in the index! It does appear on p. 3 under “Mathematical Constants,” which gives 25-digit approximations of many mathematical constants such as pi, e, and sqrt(2).
Fast forward to the age of the internet.
In 2010, the Handbook was transformed into an expanded online, searchable, interactive web site. The new Handbook is called The NIST Digital Library of Mathematical Functions.
This is very exciting because the Handbook is now available (for free!) to everyone!
If you search for pi in the online Digital Library, you find that the editors chose to define pi as the value of the integral
π = 4 ∫_{0}^{1} 1/(1+t^{2}) dt
This seems to be a strange way to define pi. Pi is the ratio of the circumference and diameter of a circle, and upon first glance that formula doesn’t seem related to a circle.
A more geometric choice would be an integrand such as
sqrt(1 – t^{2}), which connects pi to the area under the unit circle.
Of course, the integral in the Digital Library is equal to pi, but it is not obvious.
You might recall from calculus that the antiderivative of 1/(1+t^{2}) is arctan(t). Therefore the expression is just a complicated way to write 4 arctan(1). Ah! This makes more sense because arctan(1) is equal to π/4. In fact, before SAS introduced the CONSTANT function, SAS programmers used to define pi by using the computation pi = 4*ATAN(1). Nevertheless, I think expressing arctan(1) as an integral is unnecessarily obtuse.
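A quick numeric check of this identity (in Python, purely for illustration): 4·arctan(1) equals π to machine precision, and a simple midpoint-rule approximation of 4 ∫_{0}^{1} dt/(1+t^{2}) converges to the same value.

```python
import math

# 4 * arctan(1) is pi (up to floating-point rounding)
four_atan1 = 4 * math.atan(1)

# Midpoint-rule approximation of 4 * integral_0^1 dt/(1+t^2)
n = 100_000
h = 1.0 / n
approx_pi = 4 * h * sum(1.0 / (1.0 + ((i + 0.5) * h) ** 2) for i in range(n))
```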
I am not enamored with the editors’ choice of an integral to define pi, but if I were to use that integrand to define pi, I would use a variation that has applications in probability and statistics. Statisticians sometimes use the Cauchy distribution, which is a fat-tailed distribution that has the interesting mathematical property that the distribution has no mean (expected) value! (Mathematicians say that “the first moment does not exist.”) Researchers in robust statistical methods sometimes use Cauchy-distributed errors to generate extreme outliers in simulated data.
The Cauchy probability density function (PDF) is 1/(π(1+t^{2})), which means that
the integral of the PDF on the interval (-∞, ∞) is 1. Equivalently, the integral of
1/(1+t^{2}) on the interval (-∞, ∞) is π:
∫_{-∞}^{∞} 1/(1+t^{2}) dt = π
This definition of pi seems more natural than the integral on [0, 1].
I could make other suggestions (such as the integral of arccos on [-1, 1]), but I think I’ll stop here.
The purpose of this post is to celebrate pi, which is so ubiquitous and important that it can be defined in numerous ways.
A secondary purpose is to highlight the availability of the
NIST Digital Library of Mathematical Functions, which is an online successor of the venerable Handbook of Mathematical Functions. I am thrilled with the availability of this amazing resource, regardless of how they define pi!
To complete this Pi Day post, I leave you with a pi-ku. A pi-ku is like a haiku, but the number of syllables in each line follows the decimal expansion of pi. A common structure for a pi-ku is 3-1-4. The following pi-ku celebrates the new Digital Library:
Handbook of
Math
Functions? Online!
The post Fit a distribution from quantiles appeared first on The DO Loop.
Data analysts often fit a probability distribution to data. When you have access to the data, a common technique is to use maximum likelihood estimation (MLE) to compute the parameters of a distribution that are “most likely” to have produced the observed data. However, how can you fit a distribution if you do not have access to the data?
This question was asked by
a SAS programmer who wanted to fit a gamma distribution by using sample quantiles of the data.
In particular, the programmer said,
“we have the 50th and 90th percentile” of the data and “want to find the parameters for the gamma distribution [that fit] our data.”
This is an interesting question. Recall that the method of moments uses sample moments (mean, variance, skewness,…) to estimate parameters in a distribution. When you use the method of moments, you express the moments of the distribution in terms of the parameters, set the distribution’s moments equal to the sample moments, and solve for the parameter values for which the equation is true.
In a similar way, you can fit a distribution by matching quantiles: equate the sample and distributional quantiles and solve for the parameters of the distribution. This is sometimes called quantile-matching estimation (QME). Because the quantiles involve the cumulative distribution function (CDF), the equation does not usually have a closed-form solution and must be solved numerically.
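For a one-parameter family, quantile matching can even have a closed form. As an illustration (using an exponential distribution rather than the gamma distribution that the question concerns), matching a single quantile determines the rate parameter exactly:

```python
import math

# Quantile-matching for Exp(lam): F(x; lam) = 1 - exp(-lam*x).
# Setting F(x; lam) = p and solving gives lam = -ln(1-p)/x.
def exp_rate_from_quantile(x, p):
    """Rate lam such that the p-th quantile of Exp(lam) equals x."""
    return -math.log(1 - p) / x

lam = exp_rate_from_quantile(4, 0.5)   # median 4  =>  lam = ln(2)/4
check = 1 - math.exp(-lam * 4)         # CDF at x = 4 recovers p = 0.5
```

For two parameters and two quantiles (the gamma case below), no such closed form exists, which is why a numerical root finder is needed.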
To answer the programmer’s question,
suppose you do not have the original data, but you are told that the 50th percentile (median) of the data is x = 4 and the 90th percentile is x = 8. You suspect that the data are distributed according to a gamma distribution, which has a shape parameter (α) and a scale parameter (β).
To use quantile-matching estimation, set F(4; α, β) = 0.5 and
F(8; α, β) = 0.9, where F is the cumulative distribution of the Gamma(α, β) distribution. You can then solve for the values of (α, β) that satisfy the equations. You will get a CDF that matches the quantiles of the data, as shown to the right.
I have previously written about four ways to solve nonlinear equations in SAS. One way is to use PROC MODEL, as shown below:
data initial;
alpha=1; beta=1;   /* initial guess for finding root */
p1=0.5; X1 = 4;    /* eqn for 1st quantile: F(X1; alpha, beta) = p1 */
p2=0.9; X2 = 8;    /* eqn for 2nd quantile: F(X2; alpha, beta) = p2 */
run;

proc model data=initial;
   eq.one = cdf("Gamma", X1, alpha, beta) - p1;   /* find root of eq1 */
   eq.two = cdf("Gamma", X2, alpha, beta) - p2;   /* and eq2 */
   solve alpha beta / solveprint out=solved outpredict;
run; quit;

proc print data=solved noobs; var alpha beta; run;
The output indicates that the parameters (α, β) = (2.96, 1.52) are the values for which the Gamma(α, β) quantiles match the sample quantiles. You can see this by graphing the CDF function and adding reference lines at the 50th and 90th percentiles, as shown at the beginning of this section. The following SAS code creates the graph:
/* Graph the CDF function to verify that the solution makes sense */
data Check;
set solved;   /* estimates of (alpha, beta) from solving eqns */
do x = 0 to 12 by 0.2;
   CDF = cdf("gamma", x, alpha, beta);
   output;
end;
run;

title "CDF of Gamma Distribution";
title2 "Showing 50th and 90th Percentiles";
proc sgplot data=Check;
   series x=x y=CDF / curvelabel;
   dropline y=0.5 x=4 / dropto=both;   /* first percentile */
   dropline y=0.9 x=8 / dropto=both;   /* second percentile */
   yaxis values=(0 to 1 by 0.1) label="Cumulative Probability";
   xaxis values=(0 to 12 by 2);
run;
The previous section is relevant when you have as many sample quantiles as parameters. If you have more sample quantiles than parameters, then the system is overconstrained and you probably want to compute a least squares solution. If there are m sample quantiles,
the least squares solution is the set of parameters that minimizes the sum of squares Σ_{i=1}^{m} (p_{i} – F(x_{i}; α, β))^{2}.
For example, the following DATA step contains five sample quantiles. The observation (p,q) = (0.1, 1.48) indicates that the 10th percentile is x=1.48. The second observation indicates that the 25th percentile is x=2.50. The last observation indicates that the 90th percentile is x=7.99. You can use PROC NLIN to find a least squares solution to the quantile-matching problem, as follows:
data SampleQntls;
input p q;   /* p is cumul probability; q is p_th sample quantile */
datalines;
0.1  1.48
0.25 2.50
0.5  4.25
0.75 6.00
0.9  7.99
;

/* least squares fit of parameters */
proc nlin data=SampleQntls              /* sometimes the NOHALVE option is useful */
     outest=PE(where=(_TYPE_="FINAL"));
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   model p = cdf("Gamma", q, alpha, beta);
run;

proc print data=PE noobs; var alpha beta; run;
The solution indicates the parameter values (α, β) = (2.72, 1.70) minimize the sum of squares between the observed and theoretical quantiles. The following graph shows the observed quantiles overlaid on the CDF of the fitted Gamma(α, β) distribution. Alternatively, you can graph the quantile-quantile plot of the observed and fitted quantiles.
For small samples, quantiles in the tail of a distribution have a large standard error, which means that the observed quantile might not be close to the theoretical quantile. One way to handle that uncertainty is to compute a weighted regression analysis where each sample quantile is weighted by the inverse of its variance.
According to Stuart and Ord (Kendall’s Advanced Theory of Statistics, 1994, section 10.10), the variance of the p_th sample quantile in a sample of size n is σ^{2} = p(1–p) / (n f(ξ_{p})^{2}), where
ξ_{p} is the p_th quantile of the distribution and f is the probability density function.
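A small Monte Carlo sketch (in Python, using an exponential distribution with unit rate as an assumed example) suggests how this formula behaves: across many simulated samples, the empirical variance of the sample median should be close to the asymptotic value p(1–p)/(n f(ξ_{p})^{2}).

```python
import math
import random
import statistics

# Asymptotic variance of the p_th sample quantile for Exp(1), where the
# p_th quantile is xi_p = -ln(1-p) and the density is f(x) = exp(-x).
def quantile_variance(p, n):
    xi = -math.log(1 - p)
    f = math.exp(-xi)
    return p * (1 - p) / (n * f ** 2)

random.seed(1)
n, reps = 80, 2000
medians = [statistics.median(random.expovariate(1) for _ in range(n))
           for _ in range(reps)]
emp = statistics.variance(medians)   # empirical variance of the median
asym = quantile_variance(0.5, n)     # = 0.0125 for n = 80
```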
In PROC NLIN, you can perform weighted analysis by using the automatic variable _WEIGHT_.
The following statements define the variance of the p_th sample quantile and define weights equal to the inverse variance. Notice the NOHALVE option, which can be useful for iteratively reweighted least squares problems. The option eliminates the requirement that the weighted sum of squares must decrease at every iteration.
/* weighted least squares fit where w[i] = 1/variance[i] */
proc nlin data=SampleQntls NOHALVE
     outest=WPE(where=(_TYPE_="FINAL"));
   parms alpha 2 beta 2;
   bounds 0 < alpha beta;
   N = 80;                                   /* sample size */
   xi = quantile("gamma", p, alpha, beta);   /* p_th quantile of distribution */
   f = pdf("Gamma", xi, alpha, beta);        /* density at quantile */
   variance = p*(1-p) / (N * f**2);          /* variance of sample quantile */
   _weight_ = 1 / variance;                  /* weight for each observation */
   model p = cdf("Gamma", q, alpha, beta);
run;
The parameter estimates for the weighted analysis are slightly different than for the unweighted analysis. The following graph shows the CDF for the weighted estimates, which does not pass as close to the 75th and 90th percentiles as does the CDF for the unweighted estimates. This is because the PDF of the gamma distribution is relatively small for those quantiles, which causes the regression to underweight those sample quantiles.
In summary, this article shows how to use SAS to fit distribution parameters to observed quantiles by using quantile-matching estimation (QME). If the number of quantiles is the same as the number of parameters, you can numerically solve for the parameters for which the quantiles of the distribution equal the sample quantiles. If you have more quantiles than parameters, you can compute a least squares estimate of the parameters. Because quantile estimates in the tail of a distribution have larger uncertainty, you might want to underweight those quantiles. One way to do that is to run a weighted least squares regression where the weights are inversely proportional to the variance of the sample quantiles.
The R SWAT package (SAS Wrapper for Analytics Transfer) enables you to upload big data into an in-memory distributed environment to manage data and create predictive models using familiar R syntax. In the SAS Viya Integration with Open Source Languages: R course, you learn the syntax and methodology required to […]
The post Use R to interface with SAS Cloud Analytics Services appeared first on SAS Learning Post.
Are you going to Denver, Colorado, and wondering what fun/interesting/eclectic things you can do there? Then this is the map for you! For the past couple of years, I’ve made maps of the city SAS Global Forum is in, pointing out some of the attractions that conference attendees might want […]
The post What to do in Denver, during SAS Global Forum! appeared first on SAS Learning Post.
Do you procrastinate and find excuses to delay doing certain things, even when you know they really ought to be done? And you probably realize that starting those projects is usually the hardest part, eh? Well, I finally started converting hundreds of my SAS examples to use the new GfK maps, […]
The post Converting your old SAS jobs to use the new GfK maps appeared first on SAS Learning Post.
The SAS Academy for Data Science has been producing high-quality data scientists for quite a few years, thereby filling the gap between demand and supply that exists in this area. It enables career seekers to enhance their skills and become known worldwide by obtaining SAS Credentials and SAS Digital Badges […]
The post SAS Data Science Academy creates modern Data Scientists : Experience Sharing appeared first on SAS Learning Post.
The post The probability of a saddle point in a matrix appeared first on The DO Loop.
Many people know that a surface can contain a saddle point, but did you know that you can define the saddle point of a matrix? Saddle points in matrices are somewhat rare, which means that if you choose a random matrix you are unlikely to choose one that has a saddle point. This article shows how to find the location of a saddle point in a matrix and how to compute the probability that a random matrix contains one.
You might remember from multivariable calculus that a critical point (x0, y0) is a saddle point of a function f if it is a local minimum of the surface in one direction and a local maximum in another direction. The canonical example is f(x,y) = x^{2} – y^{2} at the point (x0, y0) = (0, 0). Along the horizontal line y=0, the point (0, 0) is a local minimum. Along the vertical line x=0, the point (0, 0) is a local maximum.
The definition of a saddle point of a matrix is similar. For an n x p matrix M, the cell (i, j) is a saddle point of the matrix if M[i, j] is simultaneously the minimum value of the i_th row and the maximum value of the j_th column.
Consider the following 3 x 3 matrices. The minimum value for each row is highlighted in blue. The maximum value for each column is highlighted in red. Two of the matrices have saddle points, which are highlighted in purple.
In a matrix language such as SAS/IML, you can use row and column operators to find the location of a saddle point, if it exists. The SAS/IML language supports subscript reduction operators, which enable you to compute certain statistics for all rows or all columns without writing a loop. For this problem, you can use the “index of the minimum operator” (>:<) to find the elements that are the minimum for each row and the
“index of the maximum operator” (<:>) to find the elements that are the maximum for each column. To find whether there is an element in common, you can use the XSECT function to find the intersection of the two sets. (SAS/IML supports several functions that perform set operations.)
A function that computes the location of the saddle point follows. It uses the SUB2NDX function to convert subscripts to indices.
proc iml;
start LocSaddlePt(M);
   dim = dimension(M);  n = dim[1];  p = dim[2];
   minRow = M[ , >:<];                   /* for each row, find column that contains min */
   minSubscripts = T(1:n) || minRow;     /* location as (i,j) pair */
   minIdx = sub2ndx(dim, minSubscripts); /* convert location to index in [1, np] */

   maxCol = T(M[<:>, ]);                 /* for each column, find row that contains max */
   maxSubscripts = maxCol || T(1:p);     /* location as (i,j) pair */
   maxIdx = sub2ndx(dim, maxSubscripts); /* convert location to index in [1, np] */

   xsect = xsect(minIdx, maxIdx);        /* intersection; might be empty matrix */
   if ncol(xsect) > 0 then return ( xsect );
   else return ( . );
finish;

M1 = {1 2 3, 4 5 6, 7 8 9};   idx1 = LocSaddlePt(M1);
M2 = {9 1 2, 8 5 7, 3 4 6};   idx2 = LocSaddlePt(M2);
M3 = {8 1 9, 7 2 6, 3 4 5};   idx3 = LocSaddlePt(M3);
print idx1 idx2 idx3;
The indices of the saddle points are shown for the three example matrices. The 7th element of the M1 matrix is a saddle point, which corresponds to the (3, 1) cell for a 3 x 3 matrix in row-major order.
The 5th element (cell (2,2)) of the M2 matrix is a saddle point. The M3 matrix does not contain a saddle point.
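For readers who prefer another language, the same find-row-minima, find-column-maxima, intersect strategy can be sketched in Python (the function name is mine; indices are 1-based and row-major to match the SAS/IML convention):

```python
# Sketch of the LocSaddlePt idea in Python: intersect the set of row-minimum
# positions with the set of column-maximum positions.
def loc_saddle_pt(M):
    n, p = len(M), len(M[0])
    # 1-based row-major index of each row's minimum (like M[ ,>:<])
    min_idx = {i * p + min(range(p), key=lambda j: M[i][j]) + 1 for i in range(n)}
    # 1-based row-major index of each column's maximum (like M[<:>, ])
    max_idx = {max(range(n), key=lambda i: M[i][j]) * p + j + 1 for j in range(p)}
    common = min_idx & max_idx            # like the XSECT function
    return sorted(common) if common else None

M1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
M2 = [[9, 1, 2], [8, 5, 7], [3, 4, 6]]
M3 = [[8, 1, 9], [7, 2, 6], [3, 4, 5]]
print(loc_saddle_pt(M1), loc_saddle_pt(M2), loc_saddle_pt(M3))  # [7] [5] None
```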
A matrix contains either zero or one saddle point: if two cells were both saddle points, the row-minimum and column-maximum conditions would force their values to be equal, which cannot happen when the elements are distinct.
There is a theorem (Thorp, 1979) that says that the probability of finding a saddle point in a random n x p matrix is (n! p!) / (n+p-1)! when the elements of the matrix are randomly chosen from any continuous probability distribution.
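The formula is easy to evaluate; here is a short Python sketch:

```python
# Evaluate the saddle-point probability n! p! / (n+p-1)! for a few sizes.
from math import factorial

def saddle_prob(n, p):
    return factorial(n) * factorial(p) / factorial(n + p - 1)

print(saddle_prob(3, 3))   # 0.3
print(saddle_prob(4, 4))   # about 0.1143
```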
Although the theorem holds for any continuous probability distribution, it suffices to prove it for the uniform distribution. Notice that the existence and location of a saddle point is invariant under any monotonic transformation because such transformations do not change the relative order of elements. In particular, if F is the distribution function of the data, the existence and location of the saddle point is invariant under the monotonic transformation F, which transforms the numbers into a sample from the uniform distribution. In fact, you can replace the numbers by their ranks, which converts the random matrix into a matrix that contains the consecutive integers 1, 2, …, np. Thus the previous integer matrices represent the ranks of a random matrix whose elements are drawn from any continuous distribution.
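This invariance is easy to demonstrate numerically. The following Python sketch (my own illustration, not from the post) draws a matrix from a normal distribution, replaces the values by their ranks, and verifies that the saddle-point locations agree:

```python
import random
random.seed(1)

def saddle_cells(M):
    """Return the (row, col) cells that satisfy the saddle-point definition."""
    n, p = len(M), len(M[0])
    return [(i, j) for i in range(n) for j in range(p)
            if M[i][j] == min(M[i]) == max(M[k][j] for k in range(n))]

vals = [random.gauss(0, 1) for _ in range(9)]    # any continuous distribution
M = [vals[0:3], vals[3:6], vals[6:9]]
order = sorted(range(9), key=lambda k: vals[k])
rank = [0] * 9
for r, k in enumerate(order, start=1):
    rank[k] = r                                   # ranks 1, 2, ..., 9
R = [rank[0:3], rank[3:6], rank[6:9]]
# The locations agree whether or not a saddle point exists.
print(saddle_cells(M) == saddle_cells(R))         # True
```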
For small values of n and p, you can use the ALLPERM function to enumerate all possible matrices that have integers in the range [1, np].
You can reshape the permutation into an n x p matrix and compute whether the matrix contains a saddle point. If you use a 0/1 indicator array to encode the result, then the exact probability of finding a saddle point in a random n x p matrix is found by computing the mean of the indicator array. The following example computes the exact probability of finding a saddle point in a random 3 x 3 matrix:
A = allperm(9);             /* all permutations of 1:9 */
saddle = j(nrow(A), 1);     /* allocate vector for results */
do i = 1 to nrow(A);        /* for each permutation (row) ... */
   M = shape(A[i,], 3);     /* reshape row into 3x3 matrix */
   idx = LocSaddlePt( M );  /* find location of saddle */
   saddle[i] = (idx > 0);   /* 0/1 indicator variable */
end;
prob = saddle[:];           /* (sum of 1s) / (number of matrices) */
print prob;
As you can see, the computational enumeration shows that the probability of a saddle point is 0.3 for 3 x 3 matrices. This agrees with the theoretical formula (3! 3!) / 5! = 36 / 120 = 3/10.
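The same enumeration is quick to reproduce in other languages for small sizes. As an illustration (not from the original post), the following Python sketch enumerates all 6! = 720 matrices of shape 2 x 3 and recovers the value that the formula predicts, (2! 3!) / 4! = 1/2:

```python
from itertools import permutations

def has_saddle(M):
    """True if some cell is both a row minimum and a column maximum."""
    n, p = len(M), len(M[0])
    min_cells = {(i, min(range(p), key=lambda j: M[i][j])) for i in range(n)}
    max_cells = {(max(range(n), key=lambda i: M[i][j]), j) for j in range(p)}
    return bool(min_cells & max_cells)

# Enumerate every 2 x 3 matrix whose entries are a permutation of 1..6.
count = sum(has_saddle([list(perm[0:3]), list(perm[3:6])])
            for perm in permutations(range(1, 7)))
print(count / 720)   # 0.5
```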
Of course, when n and p are larger it becomes infeasible to completely enumerate all n x p matrices. However, you can adopt a Monte Carlo approach: generate a large number of random matrices and compute the proportion that contain a saddle point. The following program simulates one million random 4 x 4 matrices from the uniform distribution and computes the proportion that have a saddle point. This is a Monte Carlo estimate of the true probability.
NSim = 1e6;  n = 4;  p = 4;          /* random 4x4 matrices from uniform distribution */
A = j(NSim, n*p);
call randgen(A, "Uniform");
saddleLoc = j(NSim, 1, .);           /* store location of the saddle point */
saddle = j(NSim, 1);                 /* binary indicator variable */
do i = 1 to NSim;                    /* Monte Carlo estimation of probability */
   M = shape(A[i,], n, p);
   saddleLoc[i] = LocSaddlePt( M );  /* save the location 1-16 */
   saddle[i] = (saddleLoc[i] > 0);   /* 0/1 binary variable */
end;
estProb = saddle[:];
Prob = fact(n)*fact(p) / fact(n+p-1);
print estProb Prob;
The output shows that the estimated probability agrees with the exact probability to four decimal places.
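A Monte Carlo version is just as easy outside of SAS. This Python sketch (an illustration with my own helper, not code from the post) estimates the 4 x 4 probability from a smaller number of simulated matrices:

```python
import random
from math import factorial
random.seed(123)

def has_saddle(M):
    """True if some cell is both a row minimum and a column maximum."""
    n, p = len(M), len(M[0])
    min_cells = {(i, min(range(p), key=lambda j: M[i][j])) for i in range(n)}
    max_cells = {(max(range(n), key=lambda i: M[i][j]), j) for j in range(p)}
    return bool(min_cells & max_cells)

n = p = 4
nsim = 50_000
hits = sum(has_saddle([[random.random() for _ in range(p)] for _ in range(n)])
           for _ in range(nsim))
est = hits / nsim
exact = factorial(n) * factorial(p) / factorial(n + p - 1)
print(est, exact)   # the estimate is close to the exact value, about 0.1143
```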
The saddle point can occur in any cell in the matrix. Consider, for example, the integer matrix M1 from the earlier 3 x 3 example. If you swap the first and third rows of M1, you obtain a new matrix that has a saddle point in the (1,1) cell. If you swap the second and third rows of M1, you obtain a saddle point in the (2,1) cell. Similarly, you can swap columns to move a saddle point to a different column. In general, you can move the saddle point to any cell in the matrix by transpositions of rows and columns. Thus each cell in the matrix has an equal probability of containing a saddle point, and the probability that the saddle point is in cell (i,j) is 1/(np) times the probability that a saddle point exists.
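A quick Python check (my own illustration) shows the effect of one such row swap on the 3 x 3 example matrix:

```python
def saddle_cells(M):
    """Return 1-based (row, col) cells that satisfy the saddle-point definition."""
    n, p = len(M), len(M[0])
    return [(i + 1, j + 1) for i in range(n) for j in range(p)
            if M[i][j] == min(M[i]) == max(M[k][j] for k in range(n))]

M1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
swapped = [M1[2], M1[1], M1[0]]       # swap the first and third rows
print(saddle_cells(M1))               # [(3, 1)]
print(saddle_cells(swapped))          # [(1, 1)]
```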
In the preceding SAS/IML simulation, the saddleLoc variable contains the location of the saddle point in the 4 x 4 matrix. The following statements compute the empirical proportion of times that a saddle point appeared in each of the 16 cells of the matrix:
call tabulate(location, freq, saddleLoc);  /* frequencies in cells 1-16 */
counts = shape(freq, n, p);                /* shape into matrix */
EstProbLoc = counts / NSim;                /* empirical proportions */
print EstProbLoc[r=('1':'4') c=('1':'4') F=7.5];
The output indicates that each cell has an estimated probability of approximately 0.007 (the “James Bond” probability!).
Theoretically, you can prove that the probability in each cell is given by the value of the complete beta function B(n, p), which for n=p=4 yields B(4,4) = 0.0071429 (Hofri, 2006).
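You can verify numerically that the complete beta function value B(4, 4) = 3! 3! / 7! matches the overall saddle-point probability divided by the 16 cells; a quick Python check:

```python
from math import factorial

def beta_complete(n, p):
    """Complete beta function B(n, p) = (n-1)! (p-1)! / (n+p-1)!"""
    return factorial(n - 1) * factorial(p - 1) / factorial(n + p - 1)

per_cell = beta_complete(4, 4)
overall = factorial(4) * factorial(4) / factorial(7)   # saddle-point probability
print(per_cell)       # about 0.0071429
print(overall / 16)   # same value, as the transposition argument predicts
```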
In summary, this article has explored three topics that are dear to my heart: probability, simulation, and properties of matrices. Along the way, we encountered factorials, permutations, the complete beta function, and got to write some fun SAS/IML programs.
For more information about the probability of a saddle point in random matrices, see “The probability that a matrix has a saddle point” (Thorp, 1979) and “On the distribution of a saddle point value in a random matrix” (Hofri, 2006).
The post The probability of a saddle point in a matrix appeared first on The DO Loop.