This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Any of you who are even slightly into politics, or have followed any news lately, have probably seen something about the DNA test Elizabeth Warren took to prove/disprove her Native American ancestry. The test indicates she might have had a Native American ancestor 6 to 10 generations ago. That’s a […]
The post DNA and your family tree … 6-10 generations back appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
The post Parameter estimates for different parameterizations appeared first on The DO Loop.
]]>This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
In a recent article about nonlinear least squares, I wrote, “you can often fit one model and use the ESTIMATE statement to estimate the parameters in a different parameterization.”
This article expands on that statement. It shows how to fit a model for one set of parameters and use the ESTIMATE statement to obtain estimates for a different set of parameters.
A canonical example of a probability distribution that has multiple parameterizations is the gamma distribution. You can parameterize the gamma distribution in terms of a scale parameter or in terms of a rate parameter, where the rate parameter is the reciprocal of the scale parameter. The density functions for the gamma distribution with scale parameter β or rate parameter c = 1 / β are as follows:
Shape α, scale β: f(x; α, β) = 1 / (β^{α} Γ(α)) x^{α-1} exp(-x/β)
Shape α, rate c: f(x; α, c) = c^{α} / Γ(α) x^{α-1} exp(-c x)
You can use PROC NLMIXED to fit either set of parameters and use the ESTIMATE statement to obtain estimates for the other set. In general, you can do this whenever you can explicitly write down an analytical formula that expresses one set of parameters in terms of another set.
Let’s implement this idea on some simulated data. The following SAS DATA step simulates 100 observations from a gamma distribution with shape parameter α = 2.5 and scale parameter β = 1 / 10. A call to PROC UNIVARIATE estimates the parameters from the data and overlays a gamma density on the histogram of the data:
/* The gamma model has two different parameterizations: a rate parameter and a scale parameter. Generate gamma-distributed data and fit a model for each parameterization. Use the ESTIMATE stmt to estimate the parameter in the model that was not fit. */ %let N = 100; /* N = sample size */ data Gamma; aParm = 2.5; scaleParm = 1/10; rateParm = 1/scaleParm; do i = 1 to &N; x = rand("gamma", aParm, scaleParm); output; end; run; proc univariate data=Gamma; var x; histogram x / gamma; run; |
The parameter estimates are (2.99, 0.09), which are close to the parameter values that were used to generate the data. Notice that PROC UNIVARIATE does not provide standard errors or confidence intervals for these point estimates. However, you can get those statistics from PROC NLMIXED. PROC NLMIXED also supports the ESTIMATE statement, which can estimate the rate parameter:
/*Use PROC NLMIXED to fit the shape and scale parameters in a gamma model */ title "Fit Gamma Model, SCALE Parameter"; proc nlmixed data=Gamma; parms a 1 scale 1; * initial value for parameter; bounds 0 < a scale; model x ~ gamma(a, scale); rate = 1/scale; estimate 'rate' rate; ods output ParameterEstimates=PE; ods select ParameterEstimates ConvergenceStatus AdditionalEstimates; run; |
The parameter estimates are the same as from PROC UNIVARIATE. However, PROC NLMIXED also provides standard errors and confidence intervals for the parameters. More importantly, the ESTIMATE statement enables you to obtain an estimate of the rate parameter without needing to refit the model. The next section explains how the rate parameter is estimated.
You might wonder how the ESTIMATE statement computes the estimate for the rate parameter from the estimate of the scale parameter. The PROC NLMIXED documentation states that the procedure “computes approximate standard errors for the estimates by using the delta method (Billingsley 1986).”
I have not studied the “delta method,” but it seems to a way to approximate a nonlinear transformation by using the first two terms of its Taylor series, otherwise known as a linearization.
In the ESTIMATE statement, you supply a transformation, which is c = 1 / β for the gamma distribution. The derivative of that transformation is -1 / β^{2}. The estimate for β is 0.09. Plugging that estimate into the transformations give 1/0.09 as the point estimate for c and 1/0.09^{2} as a linear multiplier for the size of the standard error. Geometrically, the linearized transformation scales the standard error for the scale parameter into the standard error for the rate parameter. The following SAS/IML statements read in the parameter estimates for β. It then uses the transformation to produce a point estimate for the rate parameter, c, and uses the linearization of the transformation to estimate standard errors and confidence intervals for c.
proc iml; start Xform(p); return( 1 / p ); /* estimate rate from scale parameter */ finish; start JacXform(p); return abs( -1 / p##2 ); /* Jacobian of transformation */ finish; use PE where (Parameter='scale'); read all var {Estimate StandardError Lower Upper}; close; xformEst = Xform(Estimate); xformStdErr = StandardError * JacXform(Estimate); CI = Lower || Upper; xformCI = CI * JacXform(Estimate); AdditionalParamEst = xformEst || xformStdErr || xformCI; print AdditionalParamEst[c={"Est" "StdErr" "LowerCL" "UpperCL"} F=D8.4]; quit; |
The estimates and other statistics are identical to those produced by the ESTIMATE statement in PROC NLMIXED. For a general discussion of changing variables in probability and statistics, see these Penn State course notes.
For a discussion about two different parameterizations of the lognormal distribution, see the blog post “Geometry, sensitivity, and parameters of the lognormal distribution.”
In case you are wondering, you obtain exactly the same parameter estimates and standard errors if you fit the rate parameter directly. That is, the following call to PROC NLMIXED produces the same parameter estimates for c.
/*Use PROC NLMIXED to fit gamma model using rate parameter */ title "Fit Gamma Model, RATE Parameter"; proc nlmixed data=Gamma; parms a 1 rate 1; * initial value for parameter; bounds 0 < a rate; model x ~ gamma(a, 1/rate); ods select ParameterEstimates; run; |
For this example, fitting the scale parameter is neither more nor less difficult than fitting the rate parameter. If I were faced with a distribution that has two different parameterizations, one simple and one more complicated, I would try to fit the simple parameters to data and use this technique to estimate the more complicated parameters.
The post Parameter estimates for different parameterizations appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
The post Get the unique values of a variable in data order appeared first on The DO Loop.
]]>This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
There are several ways to use SAS to get the unique values for a data variable. In Base SAS, you can use the TABLES statement in PROC FREQ to generate a table of unique values (and the counts). You can also use the DISTINCT function in PROC SQL to get the same information. In the SAS/IML language, you can use the UNIQUE function to find the unique values of a vector. In all these cases, the values are returned in sorted (alphanumeric) order. For example, here are three ways to obtain the unique values of the TYPE variable in the Sashelp.Cars data set. Only the result from PROC IML is shown:
/* Three ways to get unique values of a variable in sorted order */ proc freq data=sashelp.cars; tables Type / missing; /* MISSING option treats mising values as a valid category */ run; proc sql; select distinct(Type) as uType from sashelp.cars; quit; proc iml; use sashelp.cars; read all var "Type"; close; uType = unique(Type); print uType; |
The FREQ procedure has a nice option that I use FREQently: you can use the ORDER=DATA option to obtain the unique categories in the order in which they appear in a data set.
Over the years, I’ve used the ORDER=DATA option for many purposes, including a recent post that shows how to create a “Top 10” chart. Another useful option is ORDER=FREQ, which orders the categories in descending order by frequency.
Recently a colleague asked how to obtain the unique values of a SAS/IML vector in the order in which they appear in the vector.
I thought it might also be useful to support an option to return the unique values in (descending) order of frequency, so I wrote the following function. The first argument to the
UniqueOrder function is the data vector. The second (optional) argument is a string that determines the order of the unique values. If the second argument has the value “DATA”, then the unique values are returned in data order. If the second argument has the value “FREQ”, then the unique values are returned in descending order by frequency. The function is shown below. I test the function on a character vector (TYPE) and a numeric vector (CYLINDERS) that contains some missing values.
proc iml; /* Return the unique values in a row vector. The second parameter determines the sort order of the result. "INTERNAL": Order by alphanumeric values (default) "DATA" : Order in which elements appear in vector "FREQ" : Order by frequency */ start UniqueOrder(x, order="INTERNAL"); if upcase(order)="DATA" then do; u = unique(x); /* unique values (sorted) */ idx = j(ncol(u),1,0); /* vector to store indices in data order */ do i = 1 to ncol(u); idx[i] = loc(x=u[i])[1]; /* 1st location of i_th unique value */ end; call sort(idx); /* sort the indices ==> data order */ return ( T(x[idx]) ); /* put the values in data order */ end; else if upcase(order)="FREQ" then do; call tabulate(u, freq, x, "missing"); /* compute freqs; missing is valid level */ call sortndx(idx, freq`) descend=1; /* order descending by freq */ return ( T(u[idx]) ); end; else return unique(x); finish; /* test the function by calling it on a character and numeric vectors */ use sashelp.cars; read all var "Type"; read all var "Cylinders"; close; uData = UniqueOrder(Type, "data"); /* unique values in the order they appear in data */ uFreq = UniqueOrder(Type, "freq"); /* unique values in descending frequency order */ print uData, uFreq; uData = UniqueOrder(Cylinders, "data"); uFreq = UniqueOrder(Cylinders, "freq"); print uData, uFreq; |
The results of the function are similar to the results of PROC FREQ when you use the ORDER= option. There is a small difference when the data contain missing values. PROC FREQ always lists the missing value first, regardless of where the missing values appear in the data or how many missing values there are. In contrast, the SAS/IML function handles missing and nonmissing values equivalently.
In summary, whether you are implementing an analysis in Base SAS or SAS/IML, this article shows how to obtain the unique values of data in sorted order, in order of frequency, or in the order that the values appear in the data.
The post Get the unique values of a variable in data order appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
Here in the US, we recently had a new Supreme Court Justice appointed. This is a very important position, and judges usually serve until they retire or die. My buddy Rick gave me a heads-up that there was a table of all the past & present justices on Wikipedia, and […]
The post Timeline of US Supreme Court Justices appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
SAS Press is changing to meet the needs of customers worldwide.
The post New technologies and new direction for SAS Press appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
The post Fit a growth curve in SAS appeared first on The DO Loop.
]]>This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
This article shows how to use SAS to fit a growth curve to data. Growth curves model the evolution of a quantity over time. Examples include population growth, the height of a child, and the growth of a tumor cell.
This article focuses on using PROC NLIN to estimate the parameters in a nonlinear least squares model.
PROC NLIN is my first choice for fitting nonlinear parametric models to data.
Other ways to model growth curves include using splines, mixed models (PROC MIXED or NLMIXED), and nonparametric methods such as loess.
The SAS DATA step specifies the mean height (in centimeters) of 58 sunflowers at 7, 14, …, 84 days after planting.
The American naturalist H. S. Reed studied these sunflowers in 1919 and used the mean height values to formulate a model for growth. Unfortunately, I do not have access to the original data for the 58 plants, but using the mean values will demonstrate the main ideas of fitting a growth model to data.
/* Mean heights of 58 sunflowers: Reed, H. S. and Holland, R. H. (1919), "Growth of sunflower seeds" Proceedings of the National Academy of Sciences, volume 5, p. 140. http://www.pnas.org/content/pnas/5/4/135.full.pdf */ data Sunflower; input Time Height; label Time = "Time (days)" Height="Sunflower Height (cm)"; datalines; 7 17.93 14 36.36 21 67.76 28 98.1 35 131 42 169.5 49 205.5 56 228.3 63 247.1 70 250.5 77 253.8 84 254.5 ; |
A simple mathematical model for population growth that is constrained by resources is the logistic growth model, which is also known as the Verhulst growth model. (This should not be confused with logistic regression, which predicts the probability of a binary event.) The Verhulst equation can be parameterized in several ways, but a popular parameterization is
Y(t) = K / (1 + exp(-r*(t – b)))
where
The model has three parameters, K, r, and b. When you use PROC NLIN in SAS to fit a model, you need to specify the parametric form of the model and provide an initial guess for the parameters, as shown below:
proc nlin data=Sunflower list noitprint; parms K 250 r 1 b 40; /* initial guess */ model Height = K / (1 + exp(-r*(Time - b))); /* model to fit; Height and Time are variables in data */ output out=ModelOut predicted=Pred lclm=Lower95 uclm=Upper95; estimate 'Dt' log(81) / r; /* optional: estimate function of parameters */ run; title "Logistic Growth Curve Model of a Sunflower"; proc sgplot data=ModelOut noautolegend; band x=Time lower=Lower95 upper=Upper95; /* confidence band for mean */ scatter x=Time y=Height; /* raw observations */ series x=Time y=Pred; /* fitted model curve */ inset ('K' = '261' 'r' = '0.088' 'b' = '34.3') / border opaque; /* parameter estimates */ xaxis grid; yaxis grid; run; |
The output from PROC NLIN includes an ANOVA table (not shown), a parameter estimates table (shown below), and an estimate for the correlation of the parameters. The parameter estimates include estimates, standard errors, and 95% confidence intervals for the parameters. The OUTPUT statement creates a SAS data set that contains the original data, the predicted values from the model, and a confidence interval for the predicted mean. The output data is used to create a “fit plot” that shows the model and the original data.
Articles about the Verhulst model often mention that the “maximum growth rate” parameter, r, is sometimes replaced by a parameter that specifies the time required for the population to grow from 10% to 90% of the
carrying capacity, K. This time period is called
the “characteristic duration” and denoted as Δt.
You can show that Δt = log(81)r.
The ESTIMATE statement in PROC NLIN produces a table (shown below) that estimates the characteristic duration.
The value 50.1 tells you that, on average, it takes about 50 days for a sunflower to grow from 10 percent of its maximum height to 90 percent of its maximum height. By looking at the graph, you can see that most growth occurs during the 50 days between Day 10 and Day 60.
This use of the ESTIMATE statement can be very useful. Some models have more than one popular parameterization. You can often fit the model for one parameterization and use the ESTIMATE statement to estimate the parameters for a different parameterization.
The post Fit a growth curve in SAS appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
This post was kindly contributed by platformadmin.com - go there to comment and to read the full post. |
I had noticed recently that when I searched for SAS documentation on Google I wasn’t finding the SAS 9.4 documentation pages I wanted. I tweeted about it, found others had the same issue, and that SAS are working to address it. In the interim, the answer was simple of course … search using support.sas.com instead. … Continue reading “Keyword Bookmark Search with SAS Support”
This post was kindly contributed by platformadmin.com - go there to comment and to read the full post. |
The post How to read multiple text files in SAS appeared first on The SAS Dummy.
]]>This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post. |
As part of my research for a different article, I recently collected data about my driving commute home via an accelerometer recorder app on my phone. The app generates a simple TSV file. (A TSV file is like a CSV file, but instead of a comma separator, it uses a TAB character to separate the values.) The raw data looks like this:
With SAS, it’s simple to import the file into a data set. Here’s my DATA step code that uses the INFILE statement to identify the file and how to read it. Note that the DLM= option references the hexadecimal value for the TAB character in ASCII (09x), the delimiter for fields in this data.
data drive; infile "/home/chris.hemedinger/tsv/drivehome.tsv" dlm='09'x; length counter 8 timestamp 8 x 8 y 8 z 8 filename $ 25; input counter timestamp x y z filename; run; |
In my research, I didn’t stop with just my drive home. In addition to my commute, I collected data about 4 other activities, and thus accumulated a collection of TSV files. Here’s my file directory in my SAS OnDemand for Academics account:
To import each of these data files into SAS, I could simply copy and paste my code 4 times and then replace the name of the file for each case that I collected. After all, copy-and-paste is a tried and true method for writing large volumes of code. But as the number of code lines grows, so does the maintenance work. If I want to add any additional logic into my DATA step, that change would need to be applied 5 times. And if I later come back and add more files to my TSV collection, I’ll need to copy-and-paste the same code blocks for my additional cases.
I can read all of my TSV files in a single step by using the wildcard notation in my INFILE statement. I replaced the “drivehome.tsv” filename with *.tsv, which tells SAS to match on all of the TSV files in the folder and process each of them in turn. I also changed the name of the data set from “drive” to the more generic “accel”.
data accel; infile "/home/chris.hemedinger/tsv/*.tsv" dlm='09'x; length counter 8 timestamp 8 x 8 y 8 z 8 filename $ 25; input counter timestamp x y z filename; run; |
The SAS log shows which files have been processed and added into my data set.
With a single data set that has all of my accelerometer readings, I can easily segment these with a WHERE clause in later processing. It’s convenient that my accelerometer app also captured the name of each TSV file so that I can keep these cases distinct. A quick PROC FREQ shows the allocation of records for each case that I collected.
If your input data files don’t contain an eponymous field name, then you will need to use a different method to keep track of which records come from which files. The DATA step INFILE statement supports a special option for this — it’s the FILENAME= option. The FILENAME= option allows you to specify an automatic variable to store the name of the current input file. Like all automatic variables in the DATA step, this variable won’t be written to the output data set by default. So, you need to include code that assigns the value to a permanent variable that you can save. Here’s my code, refactored a little bit, now with the options for reading/storing the file names we’re processing:
filename tsvs "/home/chris.hemedinger/tsv/*.tsv"; data accel; length casefile $ 100 /* to write to data set */ counter 8 timestamp 8 x 8 y 8 z 8 filename $ 25 tsvfile $ 100 /* to hold the value */ ; /* store the name of the current infile */ infile tsvs filename=tsvfile dlm='09'x ; casefile=tsvfile; input counter timestamp x y z filename; run; |
In the output, you’ll notice that we now have the fully qualified file name that SAS processed using INFILE.
Because we started this task with 5 distinct input files, it might be tempting to store the records in separate tables: one for each accelerometer case. While there might be good reasons to do that for some types of data, I believe that we have more flexibility when we keep all of these records together in a single data set. (But if you must split a single data set into many, here’s a method to do it.)
In this single data set, we still have the information that keeps the records distinct (the name of the original files), so we haven’t lost anything. SAS procedures support CLASS and BY statements that allow us to simplify our code when reporting across different groups of data. We’ll have fewer blocks of repetitive code, and we can accomplish more across all of these cases before we have to resort to SAS macro logic to repeat operations for each file.
As a simple example, I can create a simple visualization with a single PROC SGPANEL step.
ods graphics / width=1600 height=400; proc sgpanel data=accel; panelby filename / columns=5 noheader; series x=counter y=x; series x=counter y=y; series x=counter y=z; colaxis display=none minor; rowaxis label="m/s**2" grid; where counter<11000; run; |
Take a look at these 5 series plots. Using just what you know of the file names and these plots, can you guess which panel represents which accelerometer case?
Leave your guess in the comments section. I’ll explore these data further in a future blog post!
The post How to read multiple text files in SAS appeared first on The SAS Dummy.
This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post. |
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
I attended the Scottish Highland Games this past weekend … nearby in Scotland County, North Carolina! They put on a great event, with kilt-wearing Scotsmen throwing things, bands playing bagpipes, kids dancing, and clans sharing their family history. And to get into the mood for this event, I decided to […]
The post Mapping the clans of Scotland appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post. |
The post The intersection of multiple sets appeared first on The DO Loop.
]]>This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
This article compares several ways to find the elements that are common to multiple sets. I test which method is the fastest in the SAS/IML language. However, all algorithms are intrinsically fast, which raises an important question:
when is it worth the time and effort to optimize an algorithm?
The idea for this topic came from reading a blog post by ‘mikefc’ at the “Cool but Useless” blog about the relative performance of various ways to intersect vectors in R. Mikefc used vectors that contained between 10,000 and 100,000 elements and whose intersection was only a few dozen elements.
In this article, I increase the sizes of the vectors by an order of magnitude and time the performance of the intersection function in SAS/IML. I also ensure that the sets share a large set of common elements so that the intersection is sizeable.
In general, finding the intersection between sets is a fast operation. In languages such as R, MATLAB, and SAS/IML, you can store the sets in vectors and use built-in functions to form the intersection. Because the intersection of two sets is always smaller than (or equal to) the size of the original sets, computing the “intersection of intersections” is usually faster than computing the intersection of the original (larger) sets.
The following SAS/IML program creates eight SAS/IML vectors that have between 200,000 and 550,000 elements. The elements in the vectors are positive integers, although they could also be character strings or non-integers. The vectors contain a large subset of common elements so that the intersection is sizeable.
The vectors A, B, …, H are created as follows:
proc iml; call randseed(123); NIntersect = 5e4; common = sample(1:NIntersect, NIntersect, "replace"); /* elements common to all sets */ source = 1:1e6; /* elements are positive integers less than 1 million */ BaseN = 5e5; /* approx magnitude of vectors (sets) */ A = sample(source, BaseN, "replace") || common; /* include common elements */ B = sample(source, 0.9*BaseN, "replace") || common; C = sample(source, 0.8*BaseN, "replace") || common; D = sample(source, 0.7*BaseN, "replace") || common; E = sample(source, 0.6*BaseN, "replace") || common; F = sample(source, 0.5*BaseN, "replace") || common; G = sample(source, 0.4*BaseN, "replace") || common; H = sample(source, 0.3*BaseN, "replace") || common; |
The intersection of these vectors contains about 31,600 elements.
You can implement the following methods that find the intersection of these vectors:
The following SAS/IML statements implement these three methods and compute the time required to intersect each:
/* 1. one call to the built-in method */ t0 = time(); w = xsect(a,b,c,d,e,f,g,h); tBuiltin = time() - t0; /* 2. loop over variable names and form pairwise intersections */ varName = "a":"h"; t0 = time(); w = value(varName[1]); do i = 2 to ncol(varName); w = xsect(w, value(varName[i])); /* compute pairwise intersection */ end; tPairwise = time() - t0; /* 3. Sort by size of sets, then loop */ varName = "a":"h"; t0 = time(); len = j(ncol(varName), 1); do i = 1 to ncol(varName); len[i] = ncol(value(varName[i])); /* number of elements in each set */ end; call sortndx(idx, len); /* sort smallest to largest */ sortName = varName[,idx]; w = value(sortName[1]); do i = 2 to ncol(sortName); w = xsect(w, value(sortName[i])); /* compute pairwise intersection */ end; tSort = time() - t0; print tBuiltin tPairwise tSort; |
For this example data, each method takes about 0.5 seconds to find the intersection of eight sets that contain a total of 3,000,000 elements. If you rerun the analysis, the times will vary by a few hundredths of a second, so the times are essentially equal.
(‘mikefc’ also discusses a few other methods, including a “tally method.” The tally method is fast, but it relies on the elements being positive integers.)
There is a quote that I like: “In theory, there is no difference between theory and practice.
But, in practice, there is.” (Published by Walter J. Savitch, who claims to have overheard it at a computer science conference.)
For the intersection problem, you can argue theoretically that pre-sorting by length is an inexpensive way to potentially obtain a boost in performance. However, in practice (for this example data), pre-sorting improves the performance by less than 1%. Furthermore, the absolute times for this operation are short, so saving a 1% of 0.5 seconds might not be worth the effort.
Even if you never compute the intersection of sets in SAS, there are two take-aways from this exercise:
The post The intersection of multiple sets appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |