The post Grids and linear subspaces appeared first on The DO Loop.
A grid is a set of evenly spaced points. You can use SAS to create a grid of points on an interval, in a rectangular region in the plane, or even in higher-dimensional regions like the parallelepiped shown at the left, which is generated by three vectors. You can use vectors, matrices, and functions in the SAS/IML language to obtain a grid in a parallelepiped.
In one dimension, you can use a DO loop in the SAS DATA step to construct evenly spaced points in an interval. In the SAS/IML language, you can construct a vector of evenly spaced values without writing a DO loop. You can use the DO function, or you can use the colon operator (which SAS/IML documentation calls the "index creation operator"). The results are equivalent, as shown in the following program:
proc iml;
a = 2;  b = 5;  N = 5;
delta = (b-a) / N;
y = do(a, b, delta);     /* Method 1: DO function: start=a, stop=b, step=delta */
x = a + delta*(0:N);     /* Method 2: Use colon operator and vector sum */
print y, x;
The second method generalizes to any linear subspace. For example, suppose you are given a vector u with magnitude α = ||u||. Then the following SAS/IML statements create evenly spaced points along the vector u. The magnitudes of the N+1 vectors are 0, α/N, 2α/N, and so forth up to α.
/* linear spacing along a vector */
u = {3 2 1};
delta = u / N;           /* Note: delta is vector */
w = delta @ T(0:N);      /* i_th row is (i-1)*delta = (i-1)*u/N */
print w;
It's always exciting to use the Kronecker product operator (@). The Kronecker product operator multiplies delta by 0, 1, 2, ..., N and stacks the results. Notice that the ith column of the w matrix is a linear progression from 0 to the ith component of u. Each row is a vector in the same direction as u.
It is only slightly harder to generate a grid of points in two or higher dimensions. In the DATA step, you can use nested DO loops to generate a grid of points in SAS. In the SAS/IML language, you can use the EXPANDGRID function.
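The two approaches can be sketched side by side for the unit square. This is a minimal illustration, not a definitive implementation: the macro variable N and the data set name Grid2D are assumptions made for the example. The DATA step loops over integer indices and divides by N, which avoids accumulating floating-point error in a fractional BY increment.

```sas
%let N = 5;                      /* number of subintervals (assumed) */

/* DATA step: nested DO loops over integer indices */
data Grid2D;                     /* hypothetical data set name */
do i = 0 to &N;
   do j = 0 to &N;
      x = i / &N;                /* evenly spaced in [0,1] */
      y = j / &N;
      output;
   end;
end;
drop i j;
run;

/* SAS/IML: the same grid from the EXPANDGRID function */
proc iml;
N = &N;
G = ExpandGrid( (0:N)/N, (0:N)/N );   /* one grid point per row */
print (nrow(G))[L="Number of grid points"];
quit;
```

Both versions produce (N+1)*(N+1) = 36 points when N = 5.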
In fact, with the help of linear algebra, the EXPANDGRID function enables you to construct a grid in any linear subspace. Recall that a linear subspace is the "span," or set of all linear combinations, of k basis vectors.
From a linear algebra perspective, a rectangular grid is a set of discrete, evenly spaced, linear combinations of the standard Cartesian basis vectors e_{1}, e_{2}, ... e_{k}. In a linear algebra course you learn that a matrix multiplication enables you to change to another set of basis vectors.
To give a concrete example, consider a rectangular grid of points in the square [0,1]x[0,1]. In the following SAS/IML program, the rows of the matrix G contain the grid points. You can think of each row as being a set of coefficients for a linear combination of the basis vectors e_{1} and e_{2}. You can use those same coefficients to form linear combinations of any other basis vectors. The program shows how to form a grid in the two-dimensional linear subspace that is spanned by the vectors u = { 3 2 1} and v = {-3 5 2}:
h = 1/N;
/* generate linear combinations of basis vectors e1 and e2 */
G = ExpandGrid( do(0,1,h), do(0,1,h) );
u = { 3 2 1};
v = {-3 5 2};
M = u // v;              /* create matrix of new basis vectors */
grid = G * M;            /* grid = linear combinations of basis vectors u and v */
In the graph, the three-dimensional vectors u and v are indicated. The grid of points is a set of linear combinations of the form (i/N) u + (j/N) v, where 0 ≤ i, j ≤ N. The grid lies on a two-dimensional subspace spanned by u and v. The points in the grid are colored according to the values of the third component: blue points (lower left corner) correspond to low z-values whereas red points (upper right corner) correspond to high z-values.
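One way to convince yourself that every grid point lies in the plane spanned by u and v is to check that each point is orthogonal to a normal vector of that plane. The following self-contained sketch does that. The normal vector (-1, -9, 21) is the cross product u x v worked out by hand, and the name maxDev is chosen only for illustration.

```sas
proc iml;
N = 5;
G = ExpandGrid( (0:N)/N, (0:N)/N );   /* coefficients (i/N, j/N) */
u = { 3 2 1};
v = {-3 5 2};
grid = G * (u // v);                  /* points (i/N)*u + (j/N)*v */

normal = {-1, -9, 21};                /* u x v, computed by hand */
maxDev = max(abs(grid * normal));     /* zero, up to rounding, if coplanar */
print maxDev;
quit;
```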
Grids are important for visualizing multivariate functions. The function in this example is linear, but grids also enable you to visualize and understand nonlinear transformations.
The post Compute the square root matrix appeared first on The DO Loop.
Did you know that matrices can also have square roots? For certain matrices S, you can find another matrix X such that X*X = S. To give a very simple example, if S = a*I is a multiple of the identity matrix and a > 0, then X = ±sqrt(a)*I is a square root matrix.
I'm going to restrict this article to real numbers, so from now on when I say "a number" I mean a real number and when I say "a matrix" I mean a real matrix. Negative numbers do not have square roots, so it is not surprising that not every matrix has a square root. For example, when the dimension is odd, the negative identity matrix (-I) does not have a square root matrix.
All positive numbers have square roots, and mathematicians, who love to generalize everything, have defined a class of matrices with properties that are reminiscent of positive numbers. They are called positive definite matrices, and they arise often in statistics because every covariance and correlation matrix is symmetric and positive definite (SPD).
It turns out that if S is a symmetric positive definite matrix, then there exists a unique SPD matrix X (called the square root matrix) such that X^{2} = S. For a proof, see Golub and van Loan, 3rd edition, 1996, p. 149. Furthermore, the following iterative algorithm converges quadratically to the square root matrix (ibid, p. 571):
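Stated in the form that the SAS/IML program later in this post implements (the identity matrix as the starting value is that program's choice), the iteration is:

```latex
X_0 = I, \qquad
X_{k+1} = \tfrac{1}{2}\left( X_k + S\,X_k^{-1} \right), \quad k = 0, 1, 2, \ldots
```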
The astute reader will recognize this algorithm as the matrix version of the Babylonian method for computing a square root. As I explained last week, this iterative method implements Newton's method to find the roots of the (matrix) function f(X) = X*X - S.
In SAS, the SAS/IML matrix language is used to carry out matrix computations. To illustrate the square root algorithm, let S be the 7x7 Toeplitz matrix that is generated by the vector {4 3 2 1 0 -1 -2}. I have previously shown that this Toeplitz matrix (and others of this general form) are SPD. The following SAS/IML program implements the iterative procedure for computing the square root of an SPD matrix:
proc iml;
/* Given an SPD matrix S, this function computes the square root matrix
   X such that X*X = S */
start sqrtm(S, maxIters=100, epsilon=1e-6);
   X = I(nrow(S));               /* initial starting matrix */
   do iter = 1 to maxIters while( norm(X*X - S) > epsilon );
      X = 0.5*(X + S*inv(X));    /* Newton's method converges to square root of S */
   end;
   if norm(X*X - S) <= epsilon then return( X );
   else return(.);
finish;

S = toeplitz(4:-2);              /* 7x7 SPD example */
X = sqrtm(S);                    /* square root matrix */
print X[L="sqrtm(S)" format=7.4];
The output shows the square root matrix. If you multiply this matrix with itself, you get the original Toeplitz matrix. Notice that the original matrix and the square root matrix can contain negative elements, which shows that "positive definite" is different from "has all positive entries."
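As an independent check, you can compute a square root of the same Toeplitz matrix by a different technique: the spectral decomposition of a symmetric matrix. The sketch below is not the Newton iteration from this post. It uses CALL EIGEN to factor S, takes square roots of the eigenvalues, and reassembles the matrix. The name maxDiff is illustrative.

```sas
proc iml;
S = toeplitz(4:-2);                  /* same 7x7 SPD example */
call eigen(D, U, S);                 /* S = U*diag(D)*U` for symmetric S */
X = U * diag(sqrt(D)) * T(U);        /* symmetric square root of S */
maxDiff = max(abs(X*X - S));         /* largest elementwise discrepancy */
print maxDiff;
quit;
```

If S is SPD, all eigenvalues in D are positive, so sqrt(D) is well defined and maxDiff should be close to machine precision.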
The square root algorithm can be thought of as a mapping that takes an SPD matrix and produces the square root matrix. Therefore it is an example of a function of a matrix. Nick Higham (2008) has written a book, Functions of Matrices: Theory and Computation, which contains many other functions of matrices (exp(), log(), cos(), sin(), ...) and how to compute the functions accurately. SAS/IML supports the EXPMATRIX function, which computes the exponential function of a matrix.
The square root algorithm in this post is a simplified version of a more robust algorithm that has better numerical properties. Higham (1986), "Newton's Method for the Matrix Square Root" cautions that Newton's iteration can be numerically unstable for certain matrices. (For details, see p. 543, Eqns 3.11 and 3.12.) Higham suggests an alternate (but similar) routine (p. 544) that is only slightly more expensive but has improved stability properties.
I think it is very cool that the ancient Babylonian algorithm for computing square roots of numbers can be generalized to compute the square root of a matrix. However, notice that there is an interesting difference. In the Babylonian algorithm, you are permitted to choose any positive number to begin the square root algorithm. For matrices, the initial guess must commute with the matrix S (Higham, 1986, p. 540). Therefore a multiple of the identity matrix is the safest choice for an initial guess.
The post How to fit a variety of logistic regression models in SAS appeared first on The DO Loop.
There are many types of logistic regression models:
If you know the procedure that you want to use, you can read the procedure documentation and follow the examples to perform the analyses. In practice, however, you might know the reverse information: You know the analysis that you want to perform, but you do not know which SAS procedure implements that regression model.
I was therefore pleased to discover a short SAS Knowledge Base article titled "Types of logistic (or logit) models that can be fit using SAS." This article provides a "reverse look-up": Given the type of logistic regression model that you want to fit, it provides the name of the SAS procedures that support that analysis!
Use this resource the next time you need to know which SAS procedure can conduct a certain variant of logistic regression. Bookmark it, print it out, tattoo it on your forearm, or come to my blog and type "HOW TO FIT A LOGISTIC MODEL" into the search box. But don't forget it! This resource is a treasure for the SAS statistical programmer.
The post All I really need to know about Newton's method I learned in primary school appeared first on The DO Loop.
No, I didn't go to a school for geniuses. I didn't even know it was Newton's method until decades later. However, in sixth grade I learned an iterative algorithm that taught me (almost) everything I need to know about Newton's method for finding the roots of functions.
The algorithm I learned in the sixth grade is an iterative procedure for computing square roots by hand. It seems like magic because it estimates a square root from an arbitrary guess. Given a positive number S, you can guess any positive number x_{0} and then apply the "magic formula" x_{n+1} = (x_{n} + S / x_{n})/2 until the iterates converge to the square root of S. For details, see my article about the Babylonian algorithm.
It turns out that this Babylonian square-root algorithm is a special case of Newton's general method for finding the roots of functions. To see this, define the function f(x) = x^{2} - S. Notice that the square root of S is the positive root of f. Newton's method says that you can find roots by forming the function Q(x) = x - f(x) / f′(x), where f′ is the derivative of f, which is 2x. With a little bit of algebra, you can show that Q(x) = (x + S / x)/2, which is exactly the "magic formula" in the Babylonian algorithm.
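The "little bit of algebra" takes only a few steps. With f(x) = x^{2} - S and f′(x) = 2x, the Newton map simplifies as follows:

```latex
Q(x) = x - \frac{f(x)}{f'(x)}
     = x - \frac{x^{2} - S}{2x}
     = \frac{2x^{2} - x^{2} + S}{2x}
     = \frac{x^{2} + S}{2x}
     = \frac{1}{2}\left( x + \frac{S}{x} \right)
```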
Therefore, the square root algorithm is exactly Newton's method applied to a special quadratic polynomial whose root is the square root of S. The Newton iteration process is visualized by the following graph, which shows the iteration history of the initial guess 10 when S = 20.
It turns out that almost everything I need to know about Newton's method, I learned in sixth grade. The Babylonian method for square roots provides practical tips about how to use Newton's method for finding roots:
In fact, I'll go further. The Babylonian method teaches important lessons in numerical analysis:
So you see, all I really needed to know about Newton's method I learned in sixth grade!
The post The Babylonian method for finding square roots by hand appeared first on The DO Loop.
I still remember being amazed when I first saw the iterative square root algorithm. It was the first time that I thought math is magical. The algorithm required that I make an initial guess for the square root. I then applied a "magic formula" a few times. The magic formula improved my guess and estimated the square root that I sought.
The iterative method is called the Babylonian method for finding square roots, or sometimes Hero's method. It was known to the ancient Babylonians (1500 BC) and Greeks (100 AD) long before Newton invented his general procedure.
Here's how it works. Suppose you are given any positive number S. To find the square root of S, do the following:
Let's use this algorithm to compute the square root of S = 20 to at least two decimal places.
Because x_{3} and x_{4} agree to two decimal places, the algorithm ends after four iterations. An estimate for sqrt(20) is 4.47214.
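The computation can be reproduced with a short DATA step. This is a minimal sketch under stated assumptions: the initial guess x_0 = 10 is an assumption (it matches the iteration history described for S = 20 in the companion post on Newton's method), and the data set name Babylonian is made up for illustration.

```sas
data Babylonian;                 /* hypothetical data set name */
S = 20;
x = 10;                          /* initial guess x_0 (assumed) */
do n = 1 to 4;
   x = (x + S/x) / 2;            /* x_n = (x_{n-1} + S/x_{n-1}) / 2 */
   output;
end;
format x 8.5;
run;

proc print data=Babylonian noobs;
   var n x;
run;
```

With this starting value the iterates are 6, 4.66667, 4.47619, and 4.47214, so the third and fourth iterates share the leading digits 4.47.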
You can choose any positive value as an initial guess. However, when I was a kid I used to race my friends to see who could find the square root the fastest. I discovered that if you choose a good guess, then you have to compute only a few iterations. I invented a strategy for finding a good guess, which I call "The Rule of Twos and Sevens."
The Rule of Twos and Sevens chooses an initial guess from among the candidates {2, 7, 20, 70, 200, 700, ...}. The Rule uses the fact that a square root has about half as many integer digits as the number itself. The Rule follows and is illustrated by using S = 3249.
In other words, a good guess starts with a 2 or a 7 and has about half as many digits as are in the whole-number part of S. My experience is that The Rule of Twos and Sevens usually converges to a solution (within two decimal places) in four or fewer iterations.
For small numbers, you can choose the initial guess more precisely. If S is less than 225 (=15x15), you can bracket the square root by using the perfect squares {1, 4, 9, ..., 196, 225} and then use the corresponding integer in the range [1, 15] as an initial guess.
In tribute to all students who ever struggled to perform this algorithm by hand, I implemented the Babylonian square-root algorithm in SAS. You can implement the algorithm directly as a DATA step program, but I chose to use PROC FCMP to define new DATA step functions.
proc fcmp outlib=work.funcs.MathFuncs;
/* Rule of 2s and 7s: Count the number of digits in S.
   Choose initial guess to have half as many digits (rounded up)
   and start with a 2 or 7. */
function BabylonianGuess(S);           /* provide initial guess */
   str = put(floor(S), 16.);           /* convert [S] to string */
   L = length(strip(str));             /* count how many digits */
   d = ceil(L/2);                      /* about half as many digits (round up) */
   guess2 = 2*10**(d-1);
   guess7 = 7*10**(d-1);
   if abs(guess2**2 - S) < abs(guess7**2 - S) then
      return( guess2 );
   else
      return( guess7 );
endsub;

/* the Babylonian method (aka, Hero's method) for finding square roots */
function BabylonianSqrt(S);
   epsilon = 100*constant("maceps");   /* convergence criterion */
   if S < 0 then x = .;                /* handle negative numbers */
   else if S=0 then x = 0;             /* handle zero */
   else do;
      x = BabylonianGuess(S);          /* initial guess */
      xPrev = 0;
      do while (abs(x - xPrev) > epsilon);
         xPrev = x;
         x = (xPrev + S/xPrev)/2;      /* iterate to improve guess */
      end;
   end;
   return( x );
endsub;
quit;

/* Functions defined. Compare Babylonian algorithm to modern SQRT function */
options cmplib=work.funcs;             /* define location of functions */
data Compare;
   input S @@;
   BabySqrt = BabylonianSqrt(S);       /* Babylonian algorithm */
   Sqrt = sqrt(S);                     /* modern computation */
   Diff = abs(BabySqrt - Sqrt);        /* compare values */
datalines;
-3 0 1 2 4 10 16 30 100 3249 125348
;

proc print label;
   label BabySqrt = "Babylonian Sqrt";
run;
Notice that the Babylonian method produces essentially the same numerical value as the built-in SQRT function in SAS. Of course, the iterative BabylonianSqrt function is not nearly as efficient as the built-in SQRT function.
For more information about the Babylonian algorithm and other algorithms for computing square roots, see Brown (1999) Coll. Math J.
Do you remember learning a square root algorithm in school? Do you still remember it? Are there other "ancient" algorithms that you have learned that are now unnecessary because of modern technology? Leave a comment.
The post Ten tips before you run an optimization appeared first on The DO Loop.
Over the years I have seen many questions about optimization (which is also called nonlinear programming (NLP)) posted to discussion forums such as the SAS/IML Support Community. A typical question describes a program that is producing an error, will not converge, or is running exceedingly slowly. Such problems can be exceedingly frustrating, and some programmers express their frustration by punctuating their questions with multiple exclamation points:
To paraphrase Shakespeare, in most cases the fault lies not in our optimization routines, but in ourselves. The following checklist describes 10 common SAS/IML issues that lead to errors or lack of convergence in nonlinear optimization. If you check this list before attempting to optimize a function, then you increase the chances that the SAS/IML optimization routines will produce an optimal solution quickly and efficiently.
Optimization can be challenging, but you can increase the chance of success if you follow the tips and strategies in this checklist. These tips help you to avoid common pitfalls. By following these suggestions, your SAS/IML optimization will be more efficient, more accurate, and less frustrating.
The post What is a DATA step view and why is it important? appeared first on The DO Loop.
A novice SAS programmer told me that he has never heard of a "DATA step view." He asked, "What is a DATA step view?"
Simply put, a "view" is a SAS DATA step program that is stored for later execution. When the SAS language processor encounters the RUN statement, the program is compiled and saved, but not executed. When a procedure uses a data view, the program runs and serves data to the procedure as if the procedure were reading a regular SAS data set. Thus you can use a view to manipulate data "on the fly."
I like to create views when I want to construct a new variable in a huge data set, but I don't want to physically copy the data. When I analyze the data view by using another procedure, the constructed variable is computed on the fly.
Here's an example. Suppose that you have a large data set that includes heights and weights for millions of patients. Some (but perhaps not all) analysts at your company need to analyze the body-mass index (BMI) for these patients. You have two options. The first option is to create a new data set that has the new column in it. This requires that you duplicate the original data and add a new column. The second option is to keep the original data unchanged, but create a view that computes the BMI. Any analyst that needs the BMI can access the data by using the view.
Let's see how this works by using a small data set. The Sashelp.Class data set contains height and weight (in English measurements) for 19 children. The BMI formula for children is slightly more complicated than for adults, but for simplicity the following SAS program simply uses the adult formula. The following DATA step creates a data view by using the VIEW= option on the DATA statement. The MEANS procedure then analyzes the newly calculated BMI variable:
data BMI / view=BMI;                  /* define DATA step view */
   set Sashelp.Class;
   BMI = weight / height**2 * 703;    /* BMI formula for adults (pounds and inches) */
run;

proc means data=BMI;
   var BMI;
run;
As you can see, the syntax for the MEANS procedure does not change. In fact, the only syntax that changes is the VIEW= option in the DATA step.
When SAS encounters the RUN statement in the DATA step, it saves the program. The program is not executed until it is used in the DATA= option in PROC MEANS. At that point, SAS executes the program and computes the BMI variable, which the procedure consumes and analyzes.
There are three main advantages of a data view: it reduces storage space, it does not clutter a data set with extra variables, and, if the view uses data that is updated regularly (for example, nightly sales data), the view always reads the current data.
The main disadvantage to DATA step views is that the computed columns must be recomputed every time that a procedure uses the view. If you specify DATA=BMI for additional procedures, the BMI variable is recomputed each time, which is slower than reading pre-computed data.
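If you plan to run several procedures on the same computed column, one possible workaround is to materialize the view once and analyze the copy. The following sketch repeats the view definition so that it runs on its own. The data set name BMICopy is hypothetical.

```sas
data BMI / view=BMI;                  /* the view: stored program, no data */
   set Sashelp.Class;
   BMI = weight / height**2 * 703;
run;

data BMICopy;                         /* hypothetical name: physical copy */
   set BMI;                           /* the view executes once, here */
run;

proc means data=BMICopy;              /* reads pre-computed BMI values */
   var BMI;
run;
proc univariate data=BMICopy;         /* no recomputation on second use */
   var BMI;
run;
```

The trade-off is the one described above: the copy uses storage and does not reflect later changes to the source data, whereas the view always recomputes from the current data.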
For more information about SAS data views, see
Do you use DATA step views in your company? Leave a comment.
The post Create a package in SAS/IML appeared first on The DO Loop.
Do you have some awesome SAS/IML functions to share? Are you an expert in some subject area? This article describes how to create a SAS/IML package for others to use.
The following list shows the main steps to create a SAS/IML package. Each step is demonstrated by using the polygon package that I created for the paper "Writing Packages: A New Way to Distribute and Use SAS/IML Programs." You can download the polygon package and examine its files and structure.
The first step is to think of a name for your SAS/IML package. The name must be a valid SAS name, which means it must contain 32 characters or less, begin with a letter or underscore, and contain only letters, underscores, and digits. Create a directory that has this name in all lowercase letters. For example, the root-level directory for the polygon package is named polygon.
The info.txt file is a manifest. It provides information about the names of the source files in the package. It consists of a series of keyword-value pairs. The SAS/IML User's Guide provides complete details about the form of the info.txt file, but an example file follows:
# SAS/IML Package Information File Format 1.0
Name: polygon
Description: Computational geometry for polygons
Author: Rick Wicklin <Rick.Wicklin@sas.com>
Version: 1.0
RequiresIML: 14.1
SourceFiles: PolyArea.iml
             PolyCentroid.iml
             PolyDraw.iml
             <...list additional files...>
The first line of the info.txt file specifies the file format. Use exactly the line shown here. The value of the Name keyword specifies the name of the package. When you create a ZIP file (Step 6), be sure to match the case exactly.
For most SAS/IML packages, the functionality is provided by a series of related functions. Put all source files in the source subdirectory.
A source file usually contains one or more module definitions. The source files cannot contain the PROC IML statement or the QUIT statement because those statements would cause an IML session to exit when the source file is read. Source files are read in the order in which they are specified in the info.txt file.
I like to put each major function in its own source file, although sometimes I include associated helper functions in the same file. I begin function names with a short prefix that identifies the package to which the functions belong. For example, most functions in the polygon package begin with the prefix "POLY". Internal-only functions (that is, those that are not publicly documented) in the package begin with the prefix "_POLY".
I like to use ".iml" as a file extension to remind me that these are snippets of IML programs. However, you can use ".sas" as an extension if you prefer.
Even if your package is very useful, people won't want to use it if there isn't sufficient documentation about how to use it. You can provide two kinds of documentation. The main documentation is usually a PDF file in the help subdirectory. You might also include slideshows, Word documents, or any other useful files. The secondary documentation is a plain (ASCII) text file that has the same name as your package. For example, the polygon package contains the file polygon.txt in the help subdirectory. This file is echoed to the SAS log when a user installs the package and then submits the PACKAGE HELP polygon statement.
Sometimes the best way to show someone how to use a package is to write a driver program that loads the package, calls the functions, and displays the results. Driver programs and other sample programs should be put in the programs subdirectory.
For the polygon package, I included the file Example.sas, which shows how to load the package and call each function in the package. I also included other programs that test the functions. For example, the drawing function (POLYDRAW) has many options, so I wanted to show how to use each option. I also show how to define and use non-simple polygons, which are polygons whose edges intersect.
Another sure-fire way to make sure that users know how to use your functions is to include sample data. If you have data sets to distribute, create a subdirectory named data. You can put SAS data sets, CSV files, and other sources of data into that directory.
If you have other files to distribute, create additional subdirectories. For example, you might have a C directory if you include C files or an R subdirectory if you include R files.
At this point, the directories and files for the polygon package look like the following:
C:%HOMEPATH%\My Documents\My SAS Files\polygon
|   info.txt
|
+---data
|       simple.sas7bdat, states48.sas7bdat
|
+---help
|       polygon.docx, polygon.pdf, polygon.txt
|
+---programs
|       Example.sas, TestNonSimplePoly.sas, TestPolyDraw.sas, TestPolyPtInside.sas
|
\---source
        PolyArea.iml, PolyBoundingBox.iml, PolyCentroid.iml, ..., PolyRegular.iml
The last step is to run a compression utility to create a ZIP file that contains the root-level directory and all subdirectories. The name of the ZIP file should match the package name, including the case. For example, the ZIP file for the polygon package is named polygon.zip. The following image shows creating a ZIP file by using the WinZip utility. Other popular (and free) alternatives are 7-Zip, PeaZip, and PKZip.
Your SAS/IML package is now ready to be tested. Make sure that you can successfully use the PACKAGE INSTALL, PACKAGE LOAD, and PACKAGE HELP statements on your package. If you are distributing data sets, test the PACKAGE LIBNAME statement.
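A test session might look like the following sketch. The location of the ZIP file is a hypothetical path; substitute the place where you saved your own package. The statements require SAS/IML 14.1 or later.

```sas
proc iml;
package install "C:\Packages\polygon.zip";   /* hypothetical path; install once */
package load polygon;                        /* read the source files */
package help polygon;                        /* echo polygon.txt to the log */
quit;
```

You need to install a package only once, but you must load it in every PROC IML session that uses its functions.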
When you are satisfied that the package installs and loads correctly, you can share the package with others. To share your work in-house, ask your system administrator to load the package into the PUBLIC collection so that everyone in your workgroup can use it. If you want to make the package available to others outside of your company, upload the package to the SAS/IML File Exchange. You can follow the directions in the article "How to contribute to the SAS/IML File Exchange."
Creating a package is a convenient way to share your work with others in your company or around the world. It enables other SAS/IML programmers to leverage your expertise and programming skills. And who knows, it just might make you famous!
For complete details about how to create a package, see the "Packages" chapter of the SAS/IML User's Guide.
The post How much do New Yorkers tip taxi drivers? appeared first on The DO Loop.
When I read Robert Allison's article about the cost of a taxi ride in New York City, I was struck by the scatter plot (shown at right) that plots the tip amount against the total bill for 12 million taxi rides. The graph clearly reveals diagonal and horizontal trends in the data. The apparent "lines" in the scatter plot correspond to three kinds of riders:
The graph made me wonder how much a typical New Yorker tips as a percentage of the bill.
You can download the data from Robert Allison's post. This article analyzes the same 12 million records (from January, 2015) that Robert featured.
The first step is simply to use a SAS DATA step to compute the percentage of the tip. The following program uses a data VIEW to save disk space. You can then use PROC MEANS in Base SAS to compute percentiles for the tip amounts:
libname taxi "<path to data>";
data tax / view=tax;
   label tip_pct = "Tip Percent";
   set taxi.ny_taxi_data_2015_01;
   where 0 < fare_amount <= 100 & 0 <= tip_amount <= 100;
   total_bill = total_amount - tip_amount;
   tip_pct = tip_amount / total_bill;
   keep fare_amount tip_amount tip_pct total_bill;
run;

title "Tip as Percentage of Total Fare";
proc means data=tax P40 P50 P75 P90 P95 P99;
   var tip_pct;
run;
The output shows some surprising facts. According to these data on NYC Yellow Taxis:
Edit: (03MAY2016) A reader notes in the comments that 40% of riders not tipping seems excessively high. He suggests that the data could be biased by a systematic underreporting of tips and suggests we need a better understanding of the data collection process. I went to the "data dictionary" for this data and discovered that the tip_amount value "is automatically populated for credit card tips. Cash tips are not included." Ah-hah! The reader is correct! There is a systematic bias in the data.
You can use a histogram to show the distribution of tip percentages. The distribution has a long tail, so the following call to PROC UNIVARIATE truncates the distribution at 100%.
proc univariate data=tax(obs=1000000);
   format tip_pct PERCENT7.0;
   where tip_pct <= 1;
   var tip_pct;
   histogram tip_pct / endpoints=(0 to 1 by 0.05)
             statref=Q1 Median Q3
             odstitle="Percent of Fare Left as Tip for NYC Yellow Taxis";
run;
The histogram appears to be a mixture distribution. The first bar (about 41%) represents the riders who leave either no tip or a negligible tip. [Edit: Many of these are cash transactions for which the tip is not reported.] Then there is a "hump," which appears to be normally distributed, that represents riders who leave a tip that is between 5% and 40% of the fare. As you might expect, the mode of the hump is between 15% and 20%, which is the cultural norm in the US. The last part of the distribution is the long tail, which represents generous individuals whose tip is a substantial percentage of the fare.
The vertical lines show the 25th percentile (at 0%), the 50th percentile (at 13%), and the 75th percentile (at 20%).
The distribution of tips is dominated by the large proportion of taxi riders who do not tip. (Or, to be more cynical, the proportion of rides for which the reported tip is zero.) This section excludes the non-tippers and analyzes only those riders who leave a non-zero tip.
Let's repeat the analyses of the previous sections, but this time exclude the non-tippers. First, look at the quantiles. Notice the WHERE clause to exclude the non-tippers.
title "Tip Percentages Conditional on Tipping";
proc means data=tax Q1 Median Q3 P90 P95 P99;
   where tip_amount > 0;     /* exclude non-tippers */
   var tip_pct;
run;
Ah! That's more like what I expected to see! For riders who leave a tip, the median and 75th percentiles are about 20%. About one in ten New Yorkers that leave a tip leave 25% or more.
For a final graph, let's visualize the distribution of non-zero tip amounts. I could draw the same histogram as before, but for fun I will use the three-panel visualization in SAS that I created a few years ago. In the three-panel visualization, a histogram is stacked in a column with a box plot and a normal quantile-quantile plot (Q-Q plot). Each panel reveals slightly different aspects of the distribution. You can download the %ThreePanel macro from my previous article. The following graph shows the three-panel visualization of one million tippers:
%ThreePanel(tax(obs=1000000 where=(tip_amount > 0 and tip_pct <= 1)), tip_pct); |
As compared to the previous histogram, this histogram uses a smaller bin width and therefore shows small-scale structure that was not previously apparent. The bumps in the histogram and kernel density estimate show that many riders tip 20%, 25%, and 30%. The box plot shows a relatively narrow interquartile range (the box) and many riders whose tip percentages are well above or well below the median (the outliers).
Lastly, consider the Q-Q plot. If the points fall near the diagonal line, then the distribution is approximately normal. That is not the case here. The middle of the distribution falls somewhat near the diagonal line, but the tails do not. By the usual interpretation of a Q-Q plot, the lower and upper tails of the distribution are decidedly nonnormal. The middle of the distribution is approximately normal, but the staircase structure of the Q-Q plot means that certain values are repeated many times (20%, 25%, and 30%).
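The staircase pattern is easy to reproduce with simulated data. The following sketch is not part of the taxi analysis: the data set name Rounded, the variable names, and the parameters (mean 0.2, standard deviation 0.05, rounding to the nearest 0.05) are all assumptions chosen for illustration. It draws normal variates and then rounds a copy of them, which forces many repeated values.

```sas
/* Simulated illustration: rounding creates repeated values */
data Rounded;
   call streaminit(1);
   do i = 1 to 1000;
      x      = rand("Normal", 0.2, 0.05);   /* continuous values */
      xRound = round(x, 0.05);              /* rounded to nearest 5% */
      output;
   end;
run;

/* Q-Q plots of the continuous and the rounded variables */
proc univariate data=Rounded noprint;
   qqplot x xRound / normal(mu=est sigma=est);
run;
```

The Q-Q plot for x should follow the reference line closely, whereas the plot for xRound should show horizontal steps, one for each repeated value.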
I conjecture that the lower tail is from the "keep the change" riders who rounded the fare to a whole-dollar amount that was a few pennies more than the fare. The upper tail is likely from generous riders who substantially rounded up the fare, perhaps paying with a $20 bill to cover a $13 fare.
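A quick calculation shows how these two kinds of riders land in opposite tails. The sketch below assumes that the tip percentage is the amount paid beyond the fare divided by the fare. The $13 fare and $20 bill come from the text; the $9.70 fare is a hypothetical "keep the change" example. The first rider tips 7/13, or about 54%, and the second tips 0.30/9.70, or about 3%.

```sas
data _null_;
   /* generous rider from the text: $20 bill for a $13 fare */
   fare = 13;  paid = 20;
   tip_pct = (paid - fare) / fare;          /* 7/13 */
   put "Large round-up:  " tip_pct= percent8.1;

   /* hypothetical "keep the change" rider: $10 for a $9.70 fare */
   fare = 9.70;  paid = 10;
   tip_pct = (paid - fare) / fare;          /* 0.30/9.70 */
   put "Keep the change: " tip_pct= percent8.1;
run;
```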
In summary, a short analysis of 12 million New York taxi rides indicates that slightly more than 40% of riders do not leave any tip. [EDIT: This large percentage appears to be from cash transactions for which the tip amount is not collected.] A small percentage leave a tiny tip, but the bulk of the tippers leave tips in the 16% to 20% range. About 10% of riders leave as much as 25% or 30% for a tip, and a small number of riders leave tips that correspond to larger percentages.
The post How much do New Yorkers tip taxi drivers? appeared first on The DO Loop.
]]>The post Packages: A new way to share SAS/IML programs appeared first on The DO Loop.
]]>As of SAS/IML 14.1, there is an easy way to share SAS/IML functions with others. The IML language now supports packages through the PACKAGE statement. A package is a collection of files that contains source code, documentation, example programs, and sample data. Packages are distributed as ZIP files.
This article shows how to install and use a package. A subsequent article will discuss how to author a package. For more information, see my 2016 SAS Global Forum paper, "Writing Packages: A New Way to Distribute and Use SAS/IML Programs".
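The main steps to use a package are as follows: download the ZIP file, install the package, read the documentation, load the package into an IML session, and call its functions.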
In the following sections, each step is illustrated by using the polygon package, which contains functions that compute geometric properties of planar polygons. You can download the package from the SAS/IML File Exchange.
The SAS/IML File Exchange is the recommended location to obtain packages. However, some authors might choose to post packages on GitHub or on their own web site.
Download the polygon package and save it to a location that SAS can access. This article assumes that the ZIP file is saved to the location C:\Packages\polygon.zip on a Windows PC.
To install the package, run the following statements in SAS/IML 14.1 or later:
proc iml; package install "C:\Packages\polygon.zip"; quit; |
You only need to install a package one time. You can then call it in future PROC IML sessions. The package remains installed even after you exit from SAS and reboot your computer.
The help directory in the ZIP file contains the full documentation for the package, often in the form of a PDF file. You can read that documentation by using a PDF reader outside of SAS. The PDF documentation might contain diagrams, equations, graphs, and other non-text information.
Authors should also provide a brief text-only version of the syntax. While running SAS, you can display the text-only version in the SAS log, as follows:
proc iml;
package help polygon; |
For the polygon package, the SAS log contains the following overview of the package and the syntax for every public module.
Polygon Package Description: Computational geometry for polygons A polygon is represented as an n x 2 matrix, where the rows represent adjacent vertices for the polygon. The polygon should be "open," which means that the last row does not equal the first row. <...More text...> Module Syntax: PolyArea(P); returns the areas of simple polygons. <...documentation for nine other functions...> PolyWindingNumber(P, R); returns the winding number of the polygon P around the point R. |
The purpose of the PACKAGE HELP syntax is to remind the user of the syntax of the functions in the package while inside of SAS.
Another way to learn to use a package is to examine and run sample programs that the package author provides. For the polygon package, you can look in the programs directory. The file Example.sas demonstrates calling functions in the package.
To read the functions in the package into the current IML session, use the PACKAGE LOAD statement, as follows:
package load polygon; |
The SAS log will display a NOTE for each module that is loaded:
NOTE: Module _POLYAREA defined. NOTE: Module POLYAREA defined. <...More text...> NOTE: Module POLYSTACK defined. |
Notice that all functions begin with a common prefix, in this case "POLY" (or "_POLY," for internal-only functions). This is a good programming practice because it reduces the likelihood that functions in one package conflict with functions in another package. For example, if you load two packages that each have a function named "COMPUTE" (or an equally generic name), then there will be a conflict. By using "POLY" as a prefix, there is less likely to be a name collision.
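The following sketch imitates such a collision inside a single session. The module names Compute, PkgA_Compute, and PkgB_Compute are hypothetical and do not belong to any real package. Defining a module whose name is already in use replaces the earlier definition, so whichever "package" is loaded last wins. The prefixed versions can coexist.

```sas
proc iml;
/* hypothetical module from "package A" */
start Compute(x);
   return( sum(x) );
finish;

/* hypothetical module from "package B": same name, different meaning */
start Compute(x);
   return( max(x) );
finish;

v = {1 2 3 4};
y = Compute(v);          /* calls the most recent definition (max) */

/* prefixed names avoid the collision */
start PkgA_Compute(x);
   return( sum(x) );
finish;
start PkgB_Compute(x);
   return( max(x) );
finish;

a = PkgA_Compute(v);     /* sum */
b = PkgB_Compute(v);     /* max */
print y a b;
quit;
```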
After the functions are loaded, you can call them from your program. For example, the following statements define the four vertices of a rectangle and then call functions from the package to compute the perimeter, area, and whether the polygon is a convex region.
P = {0 0, 1 0, 1 2, 0 2}; /* vertices of rectangle */ Perimeter = PolyPerimeter(P); Area = PolyArea(P); IsConvex = PolyIsConvex(P); print Perimeter Area IsConvex; |
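You can check the perimeter and area by hand without loading the package. The following sketch is not the package's implementation; it applies the standard shoelace formula and sums the edge lengths, and the helper names Q and D are chosen only for illustration. For the 1 x 2 rectangle, the perimeter is 6 and the area is 2.

```sas
proc iml;
P = {0 0, 1 0, 1 2, 0 2};           /* vertices of rectangle ("open" polygon) */
n = nrow(P);
Q = P[2:n, ] // P[1, ];             /* next vertex for each row; wraps around */

/* shoelace formula: half the absolute sum of cross products */
Area = 0.5 * abs( sum( P[,1]#Q[,2] - Q[,1]#P[,2] ) );

/* perimeter: sum of Euclidean lengths of the edges */
D = Q - P;                          /* edge vectors */
Perimeter = sum( sqrt(D[ ,##]) );   /* ## = sum of squares across each row */

print Perimeter Area;
quit;
```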
Some packages include sample data sets. You can use the PACKAGE LIBNAME statement to create a SAS libref that points to the data subdirectory of an installed package. The following statements read in sample data for polygons that are defined in the Simple data set. After the vertices of the polygons are read, the PolyDraw subroutine is called to visualize the polygons in the data set.
package libname PolyData polygon; /* define libref */ use PolyData.Simple; /* sample data in package */ read all var {u v ID} into P; /* vertices; 3rd col is ID variable */ close PolyData.Simple; run PolyDraw(P); /* visualize the polygons */ |
You can use the PACKAGE LIST statement to print the name and version of all installed packages. You can use the PACKAGE INFO statement to obtain details about a specific package. The output for these statements is not shown.
package list; /* list all installed packages */ package info polygon; /* details about polygon package */ |
This article briefly describes how to install and use a SAS/IML package. Packages were introduced in SAS/IML 14.1, which was released with SAS 9.4m3 in July 2015. This article shows how to download, install, load, and use packages that are written by others.
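For reference, the statements from the previous sections can be collected into one session. This is a sketch that assumes the ZIP file is saved at C:\Packages\polygon.zip, as in the example above. The installation step is needed only once, so you would omit it in later sessions.

```sas
proc iml;
/* one-time installation; omit if the package is already installed */
package install "C:\Packages\polygon.zip";

package help polygon;               /* display syntax overview in the log */
package load polygon;               /* define the modules in this session */

P = {0 0, 1 0, 1 2, 0 2};           /* vertices of rectangle */
Perimeter = PolyPerimeter(P);
Area      = PolyArea(P);
IsConvex  = PolyIsConvex(P);
print Perimeter Area IsConvex;
quit;
```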
For additional details about packages, see Wicklin (2016). The official documentation of packages is contained in the "Packages" chapter in the SAS/IML User's Guide.
The post Packages: A new way to share SAS/IML programs appeared first on The DO Loop.
]]>