A previous entry (http://sas-and-r.blogspot.com/2017/07/options-for-teaching-r-to-beginners.html) describes an approach to teaching graphics in R that also “get[s] students doing powerful things quickly”, as David Robinson suggested.
In this guest blog entry, Randall Pruim offers an alternative way based on a different formula interface. Here’s Randall:
For a number of years I and several of my colleagues have been teaching R to beginners using an approach that includes a combination of
- the lattice package for graphics,
- the stats package for modeling (e.g., lm(), t.test()), and
- the mosaic package for numerical summaries and for smoothing over edge cases and inconsistencies in the other two components.
Important in this approach is the syntactic similarity that the following “formula template” brings to all of these operations.
lattice has some drawbacks. While basic graphs like histograms, boxplots, scatterplots, and quantile-quantile plots are simple to make with lattice, it is challenging to combine these simple plots into more complex plots or to plot data from multiple data sources. Splitting data into subgroups and either overlaying with multiple colors or separating into subplots (facets) is easy, but the labeling of such plots is less convenient (and takes more space) than for the equivalent plots made with ggplot2. And in our experience, students generally find the look of ggplot2 graphics more appealing.

But introducing ggplot2 into a first course is challenging. The syntax tends to be more verbose, so it takes up more of the limited space on projected images and course handouts. More importantly, the syntax is entirely unrelated to the syntax used for other aspects of the course. For those adopting a “Less Volume, More Creativity” approach, ggplot2 is tough to justify.

Enter ggformula, an R package that provides a formula interface to ggplot2 graphics. Our hope is that this provides the best aspects of lattice (the formula interface and lighter syntax) and ggplot2 (modularity, layering, and better visual aesthetics). The graphing functions in ggformula begin with the prefix gf_. Here are two examples, either of which could replace the side-by-side boxplots made with lattice in the previous post.

Plots with multiple layers are created by chaining (with %>%, also commonly called a pipe) between the two layers; adjusting the transparency lets us see both layers where they overlap.

The ggformula package provides two ways to create facets. The first uses a condition in the formula, very much like lattice does. Notice that the gf_lm() layer inherits information from the gf_point() layer in these plots, saving some typing when the information is the same in multiple layers. The second uses gf_facet_wrap() or gf_facet_grid() and can be more convenient for complex plots or when customization of facets is desired.

ggformula also fits into a tidyverse-style workflow (arguably better than ggplot2 itself does). Data can be piped into the initial call to a ggformula function, and there is no need to switch between %>% and + when moving from data transformations to plot operations.

ggformula strengthens this approach by bringing a richer graphical system into reach for beginners without introducing new syntactical structures. The full range of ggplot2 features and customizations remains available, and the ggformula package vignettes and tutorials describe these in more detail.

— Randall Pruim
This post was kindly contributed by SAS and R  go there to comment and to read the full post. 
This post was kindly contributed by SAS Learning Post  go there to comment and to read the full post. 
How do the North American amusement parks compare in popularity? If this question was to come up during a lunch discussion, I bet someone would pull out their smartphone and go to Wikipedia for the answer. But is Wikipedia the definitive answer – how can we tell if Wikipedia is wrong? […]
The post Amusement park attendance (could Wikipedia be wrong?!?) appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post  go there to comment and to read the full post. 
This post was kindly contributed by The DO Loop  go there to comment and to read the full post. 
Pearson’s correlation measures the linear association between two variables.
Because the correlation is bounded in the interval [-1, 1], the sampling distribution for highly correlated variables is highly skewed. Even for bivariate normal data, the skewness makes it challenging to estimate confidence intervals for the correlation, to run one-sample hypothesis tests (“Is the correlation equal to 0.5?”), and to run two-sample hypothesis tests (“Do these two samples have the same correlation?”).
In 1921, R. A. Fisher studied the correlation of bivariate normal data and discovered a wonderful transformation (shown to the right) that converts the skewed distribution of the sample correlation (r) into a distribution that is approximately normal.
Furthermore, whereas the variance of the sampling distribution of r depends on the correlation, the variance of the transformed distribution is independent of the correlation.
The transformation is called Fisher’s z transformation.
This article describes Fisher’s z transformation and shows how it transforms a skewed distribution into a normal distribution.
The following graph (click to enlarge) shows the sampling distribution of the correlation coefficient for bivariate normal samples of size 20 for four values of the population correlation, rho (ρ). You can see that the distributions are very skewed when the correlation is large in magnitude.
The graph was created by using simulated bivariate normal data as follows:
The histograms approximate the sampling distribution of the correlation coefficient (for bivariate normal samples of size 20) for the various values of the population correlation. The distributions are not simple. Notice that the variance and the skewness of the distributions depend on the value of the underlying correlation (ρ) in the population.
Fisher sought to transform these distributions into normal distributions. He proposed the transformation f(r) = arctanh(r), which is the inverse hyperbolic tangent function. The graph of arctanh is shown at the top of this article. Fisher’s transformation can also be written as (1/2)log( (1+r)/(1-r) ). This transformation is sometimes called Fisher’s “z transformation” because the letter z is used to represent the transformed correlation: z = arctanh(r).
How he came up with that transformation is a mystery to me, but he was able to show that arctanh is a normalizing and variance-stabilizing transformation. That is, when r is the sample correlation for bivariate normal data and z = arctanh(r), then the following statements are true (see Fisher, Statistical Methods for Research Workers, 6th Ed., pp. 199–203):
The graph to the right demonstrates these statements. The graph is similar to the preceding panel, except these histograms show the distributions of the transformed correlations z = arctanh(r). In each cell, the vertical line is drawn at the value arctanh(ρ). The curves are normal density estimates with σ = 1/sqrt(N-3), where N=20.
The two features of the transformed variables are apparent. First, the distributions are normally distributed, or, to quote Fisher, “come so close to it, even for a small sample…,
that the eye cannot detect the difference” (p. 202).
Second, the variance of these distributions is constant and is independent of the underlying correlation.
From the graph of the transformed variables, it is clear why Fisher’s transformation is important. If you want to test some hypothesis about the correlation, the test can be conducted in the z coordinates where all distributions are normal with a known variance. Similarly, if you want to compute a confidence interval, the computation can be made in the z coordinates and the results “back transformed” by using the inverse transformation, which is r = tanh(z).
You can perform the calculations by applying the standard formulas for normal distributions (see p. 34 of Shen and Lu (2006)), but most statistical software provides an option to use the Fisher transformation to compute confidence intervals and to test hypotheses. In SAS, the CORR procedure supports the FISHER option to compute confidence intervals and to test hypotheses for the correlation coefficient.
The following call to PROC CORR computes a sample correlation between the length and width of petals for 50 Iris versicolor flowers. The FISHER option specifies that the output should include confidence intervals based on Fisher’s transformation. The RHO0= suboption tests the null hypothesis that the correlation in the population is 0.75. (The BIASADJ= suboption turns off a bias adjustment; a discussion of the bias in the Pearson estimate will have to wait for another article.)
proc corr data=sashelp.iris fisher(rho0=0.75 biasadj=no);
   where Species='Versicolor';
   var PetalLength PetalWidth;
run;
The output shows that the Pearson estimate is r=0.787. A 95% confidence interval for the correlation is [0.651, 0.874]. Notice that r is not the midpoint of that interval. In the transformed coordinates, z = arctanh(0.787) = 1.06 is the center of a symmetric confidence interval (based on a normal distribution with standard error 1/sqrt(N-3)). However, the inverse transformation (tanh) is nonlinear, and the right half-interval gets compressed more than the left half-interval.
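It is easy to check this arithmetic outside of SAS. The following Python snippet (an illustrative sketch, not part of the original SAS workflow) applies the z transformation, builds the symmetric interval on the z scale, and back-transforms with tanh; it reproduces the interval reported by PROC CORR:

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """95% confidence interval for a correlation via Fisher's z transformation."""
    z = math.atanh(r)                 # z = arctanh(r) = 0.5*log((1+r)/(1-r))
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)   # back-transform to the r scale

lo, hi = fisher_ci(0.787, 50)
print(round(lo, 3), round(hi, 3))     # 0.651 0.874
```

Note the asymmetry: r = 0.787 is not the midpoint of [0.651, 0.874], because tanh compresses the right half-interval more than the left.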
For the hypothesis test of ρ = 0.75, the output shows that the p-value is 0.574. The data do not provide evidence to reject the hypothesis that ρ = 0.75 at the 0.05 significance level. The computations for the hypothesis test use only the transformed (z) coordinates.
This article shows that Fisher’s “z transformation,” which is z = arctanh(r), is a normalizing transformation for the Pearson correlation of bivariate normal samples of size N. The transformation converts the skewed and bounded sampling distribution of r into a normal distribution for z. The standard error of the transformed distribution is 1/sqrt(N-3), which does not depend on the correlation. You can perform hypothesis tests in the z coordinates. You can also form confidence intervals in the z coordinates and use the inverse transformation (r=tanh(z)) to obtain a confidence interval for ρ.
The Fisher transformation is exceptionally useful for small sample sizes because, as shown in this article, the sampling distribution of the Pearson correlation is highly skewed for small N.
When N is large, the sampling distribution of the Pearson correlation is approximately normal except for extreme correlations.
Although the theory behind the Fisher transformation assumes that the data are bivariate normal, in practice the Fisher transformation is useful as long as the data are not too skewed and do not contain extreme outliers.
You can
download the SAS program that creates all the graphs in this article.
The post Fisher’s transformation of the correlation coefficient appeared first on The DO Loop.
This post was kindly contributed by The DO Loop  go there to comment and to read the full post. 
This post was kindly contributed by The DO Loop  go there to comment and to read the full post. 
Toe bone connected to the foot bone,
Foot bone connected to the leg bone,
Leg bone connected to the knee bone,…
— American Spiritual, “Dem Bones”
Last week I read an interesting article on Robert Kosara’s data visualization blog. Kosara connected the geographic centers of the US zip codes in the lower 48 states, in order, from 01001 (Agawam, MA) to 99403 (Clarkston, WA). Since the SasHelp.zipcode data set is one of the sample data sets that is distributed with SAS, it is a simple matter to recreate the graph with SAS. The following SAS statements sort the data and exclude certain American territories before graphing the locations of the US zip codes in the contiguous US. (Click on a graph to enlarge it.)
proc sort data=sashelp.zipcode(where=(StateCode NOT IN ("PR", "FM", "GU", "MH", "MP", "PW", "VI")  /* exclude territories */
                                      AND ZIP_Class = " "))        /* exclude "special" zip codes */
          out=zipcode(keep=ZIP x y City StateCode);
   by zip;
run;

title "Path of US Zip Codes in Numerical Order";
proc sgplot data=zipcode noborder;
   where StateCode NOT IN ("AK", "HI");   /* contiguous US */
   series x=X y=Y / group=StateCode;
   xaxis display=none;
   yaxis display=none;
run;
Note that the coordinates are “raw” longitudes and latitudes; no projection is used. The graph is interesting. On this scale, the western states clearly show that the zip codes form small clusters within the states. This fact is also true for the Eastern states, but it is harder to see because of the greater density of zip codes in those states. Kosara includes an interactive map that enables you to zoom in on regions of interest. To create a static “zoomed-in” version in SAS, you can use WHERE processing to subset the data. You can also add state boundaries and labels to obtain a close-up map, such as this map of Utah (UT), Colorado (CO), Arizona (AZ), and New Mexico (NM):
This graph shows that there are about a dozen zip-code clusters within each of these states. This reflects the hierarchical nature of zip codes. By design, the first digit in a zip code specifies a region of the country, such as the Southeast or the Midwest. The next two digits determine the Sectional Center Facility (SCF) code, which is a region within a state. From looking at the map, I conjecture that those clusters of jagged line segments are approximations of the SCFs.
You can use the following SAS code to generate the SCF from the zip code. The subsequent call to PROC FREQ counts the number of zip codes in each SCF in Utah. Some SCFs have many postal delivery zones within them (for example, 840xx) whereas others have relatively few (for example, 844xx). Note that the SCFs are not necessarily contiguous: Utah does not (yet) have zip codes of the form 842xx.
data Zip1;
   length SCF $5.;
   set zipcode;
   FloorZip = floor(zip/100);                /* round down to nearest 100 */
   SCF = putn(FloorZip, "Z3.") || "xx";      /* Sectional Center Facility, e.g., 841xx */
   keep x y zip StateCode City SCF;
run;

proc freq data=Zip1;
   where StateCode='UT';
   tables SCF / nocum;
run;
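The floor-and-format logic in the DATA step generalizes beyond SAS. As a quick sketch in Python (standing in for the DATA step, not part of the original program), the SCF label is just the zip code floored to the nearest 100, zero-padded to three digits, with “xx” appended:

```python
def scf(zip_code: int) -> str:
    """Sectional Center Facility label: first 3 digits of the zip code, then 'xx'."""
    return f"{zip_code // 100:03d}xx"   # floor to nearest 100, zero-padded

print(scf(84101))   # Salt Lake City, UT -> 841xx
print(scf(1001))    # Agawam, MA         -> 010xx
```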
If you choose a point within each Sectional Center Facility and connect those points in order, you can obtain a much less cluttered diagram that shows the basic progression of the hierarchy of zip codes. The SCFs can zigzag across a state and do not necessarily follow a geographical progression such as north-south or east-west. The following image connects the location of the first zip code in each SCF region in Utah. The individual zip-code centers are shown as markers that are colored by the SCF.
For states that have more than a dozen SCFs, the five-character labels can obscure the path of the SCFs. If you don’t care about the actual zip-code prefixes but you just want to visualize the progression, you can label positions along the path by integers. For example, there are 25 SCFs in Florida. The following graph visualizes the regions. The first SCF (320xx) is labeled ‘1’ and the last SCF (349xx) is labeled ’25’.
Lastly, the following graph shows the progression of Sectional Center Facilities at the national level. You can see certain large “jumps” across multiple states. These are present in the original map of zip codes but are obscured by the complexity of thousands of crisscrossing line segments. Two large jumps that are apparent are a diagonal line from Montana in the Pacific Northwest (prefix ’59’) down to Illinois (prefix ’60’). Another big jump is from Nebraska in the Midwest (prefix ’69’) down to Louisiana (prefix ’70’) in the South-Central region.
In summary, this article uses SAS to reproduce the fascinating image on Robert Kosara’s blog. Kosara’s image is the path obtained by connecting the geographic centers of each zip code. By zooming in on individual states, you can identify clusters of zip codes, which correspond to Sectional Center Facilities (SCFs). By choosing one zip code from each SCF and connecting those locations in order, you obtain a simplified version of the path that connects major postal zones.
If you would like to explore these data yourself, you can
download the SAS program that analyzes the zip code data.
The post The path of zip codes appeared first on The DO Loop.
This post was kindly contributed by The DO Loop  go there to comment and to read the full post. 
This post was kindly contributed by Avocet Solutions  go there to comment and to read the full post. 
The Western Users of SAS Software 2017 conference is coming to Long Beach, CA, September 20–22. I have been to a lot of SAS conferences, but WUSS is always my favorite because it is big enough for me to learn a lot, but small enough to be really friendly.
If you come I hope you will catch my presentations. If you want a preview or if you can’t come, click the links below to download the papers.
On Wednesday, I will once again present SAS Essentials, a whirlwind introduction to SAS programming in just three hours specially designed for people who are new to SAS.
Introduction to DATA Step Programming: SAS Basics II
Introduction to SAS Procedures: SAS Basics III
Then on Friday Lora Delwiche will present a Hands-On Workshop about SAS Studio, a new SAS interface that runs in a web browser.
SAS Studio: A New Way to Program in SAS
I hope to see you there!
This post was kindly contributed by Avocet Solutions  go there to comment and to read the full post. 
This post was kindly contributed by Avocet Solutions  go there to comment and to read the full post. 
A while back, I wrote about the proliferation of interfaces for writing SAS programs. I am reposting that blog here (with a few changes) because a lot of SAS users still don’t understand that they have a choice.
These days SAS programmers have more choices than ever before about how to run SAS. They can use the old SAS windowing environment (often called Display Manager because, face it, SAS windowing environment is way too vague), or SAS Enterprise Guide, or the new kid on the block: SAS Studio. All of these are included with Base SAS.
I recently asked a SAS user, “Which interface do you use for SAS programming?”
She replied, “Interface? I just install SAS and use it.”
“You’re using Display Manager,” I explained, but she had no idea what I was talking about.
Trust me. This person is an extremely sophisticated SAS user who does a lot of leading-edge mathematical programming, but she didn’t realize that Display Manager is not SAS. It is just an interface to SAS.
This is where old timers like me have an advantage. If you can remember running SAS in batch, then you know that Display Manager, SAS Enterprise Guide, and SAS Studio are just interfaces to SAS–wonderful, manna from heaven–but still just interfaces. They are optional. It is possible to write SAS programs in an editor such as Word or Notepad++, and copy-and-paste into one of the interfaces or submit them in batch. In fact, here is a great blog by Leonid Batkhan describing how to use your web browser as a SAS code editor.
Each of these interfaces has advantages and disadvantages. I’m not going to list them all here, because this is a blog not an encyclopedia, but the tweet would be
“DM is the simplest, EG has projects, SS runs in browsers.”
I have heard rumors that SAS Institute is trying to develop an interface that combines the best features of all three. So someday maybe one of these will displace the others, but at least for the near future, all three of these interfaces will continue to be used.
So what’s your SAS interface?
This post was kindly contributed by Avocet Solutions  go there to comment and to read the full post. 
This post was kindly contributed by SAS Learning Post  go there to comment and to read the full post. 
You might not know it by looking at me (I’m rounding up when I tell people I’m 5’8”) but I’m a huge basketball fan. I’ve been following the sport since I was 10, coaching it for the last decade and playing on teams throughout my life, still dedicating my winters […]
The post How to learn from The Dream Team of experts at Analytics Experience 2017…even if you’re not going appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post  go there to comment and to read the full post. 
This post was kindly contributed by The DO Loop  go there to comment and to read the full post. 
This article shows how to simulate data from a mixture of multivariate normal distributions, which is also called a Gaussian mixture.
You can use this simulation to generate clustered data. The adjacent graph shows three clusters, each simulated from a four-dimensional normal distribution. Each cluster has its own within-cluster covariance, which controls the spread of the cluster and the amount of overlap between clusters.
This article is based on an example in Simulating Data with SAS (Wicklin, 2013, p. 138). Previous articles have explained how to simulate from a multivariate normal distribution and
how to generate a random sample from a univariate mixture distribution.
The graph at the top of this article shows 100 random observations from a Gaussian mixture distribution.
A mixture distribution consists of K component distributions and a set of mixing probabilities that determine the probability that a random observation belongs to each component. For example, if π = {0.35, 0.5, 0.15} is a vector of mixing probabilities, then in large random samples about 35% of the observations are drawn from the first component, about 50% from the second component, and about 15% from the third component.
For a mixture of Gaussian components, you need to specify the mean vector and the covariance matrix for each component. For this example, the means and covariances are approximately the estimates from Fisher’s famous iris data set, so the scatter plot matrix might look familiar to statisticians who have previously analyzed the iris data.
The means of the three component distributions are
μ_{1} = {50, 34, 15, 2},
μ_{2} = {59, 28, 43, 13}, and
μ_{3} = {66, 30, 56, 20}.
The covariance matrices for the component distributions are shown later.
The SAS/IML language is the easiest way to simulate multivariate data in SAS. To simulate from a mixture of K Gaussian distributions, do the following:
A technical issue is how to pass the mean vectors and covariance matrices into a module that simulates the data. If you are using SAS/IML 14.2 or beyond, you can use a list of lists (Wicklin, 2017, pp. 12–14). For compatibility with older versions of SAS, the implementation in this article uses a different technique: each mean and covariance are stored as rows of a matrix. Because the covariance matrices are symmetric, Wicklin (2013) stores them in lower-triangular form. For simplicity, this article stores each 4 x 4 covariance matrix as a row in a 3 x 16 matrix.
proc iml;
/* Simulate from a mixture of K multivariate normal distributions in dimension d.
   OUTPUT variables:
   X   : a SampleSize x d matrix of simulated observations
   ID  : a column vector that contains the component from which each obs is drawn
   INPUT variables:
   pi  : 1 x K vector of mixing probabilities, sum(pi)=1
   mu  : K x d matrix whose i_th row contains the mean vector for the i_th component
   Cov : K x (d**2) matrix whose i_th row contains the covariance matrix for the i_th component
*/
start SimMVNMixture(X, ID,                        /* output arguments */
                    SampleSize, pi, mu, Cov);     /* input arguments */
   K = ncol(pi);                    /* number of components */
   d = ncol(mu);                    /* number of variables */
   X = j(SampleSize, d);            /* output: each row is an observation */
   ID = j(SampleSize, 1);           /* ID variable */
   N = RandMultinomial(1, SampleSize, pi);  /* vector of sample sizes for components */
   b = 1;                           /* b = beginning index for group i */
   do i = 1 to K;
      e = b + N[i] - 1;             /* e = ending index for group i */
      ID[b:e] = i;                  /* set ID variable */
      c = shape(Cov[i,], d, d);     /* reshape Cov to square matrix */
      X[b:e, ] = RandNormal(N[i], mu[i,], c);  /* simulate i_th MVN sample */
      b = e + 1;                    /* next group starts at this index */
   end;
finish;
The SimMVNMixture routine allocates a data matrix (X) that is large enough to hold the results. It generates a vector N ={N1, N2,…, NK} to determine the number of observations that will be drawn from each component and calls the RANDNORMAL function to simulate from each Gaussian component. The scalar values b and e keep track of the beginning and ending rows of each sample from each component.
After the module is defined, you can call it for a specific set of parameters.
Assume you want to generate K=3 clusters of four-dimensional data (d=4).
The following statements specify the mixing probabilities for a three-component model.
The mu matrix is a K x d matrix whose rows are the mean vectors for the components.
The Cov matrix is a K x (d**2) matrix whose rows are the covariance matrices for the components.
The following statements generate a total of 100 observations from the three-component mixture distribution:
/* specify input args; means/cov correspond to sashelp.iris data for species */
pi = {0.35 0.5 0.15};               /* mixing probs for K=3 groups */
mu = {50 34 15  2,                  /* means of Group 1 */
      59 28 43 13,                  /* means of Group 2 */
      66 30 56 20};                 /* means of Group 3 */
/* specify within-group covariances */
Cov = {12 10  2 1  10 14 1 1   2 1  3 1  1 1 1 1,    /* cov of Group 1 */
       27  9 18 6   9 10 8 4  18 8 22 7  6 4 7 4,    /* cov of Group 2 */
       40  9 30 5   9 10 7 5  30 7 30 5  5 5 5 8};   /* cov of Group 3 */
/* run the simulation to generate 100 observations */
call randseed(12345);
run SimMVNMixture(X, Group, 100, pi, mu, Cov);
The call to the SimMVNMixture routine returned the simulated random sample in X, which is a 100 x d matrix. The module also returns an ID vector that identifies the component from which each observation was drawn.
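The same bookkeeping (draw the component sample sizes from a multinomial distribution, then fill contiguous blocks of rows from each component) can be sketched outside SAS/IML. The following Python version is an illustration only, not a translation of the original program: it restricts itself to bivariate components (using the first two coordinates of the article's means), and a hand-rolled 2x2 Cholesky factor stands in for the RANDNORMAL call. All function names here are my own.

```python
import math, random

def multinomial(n, probs, rng):
    """Draw component sample sizes, in the spirit of RandMultinomial(1, n, pi)."""
    counts = [0] * len(probs)
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for k, p in enumerate(probs):
            acc += p
            if u < acc:
                counts[k] += 1
                break
        else:                      # guard against floating-point shortfall
            counts[-1] += 1
    return counts

def chol2(cov):
    """Cholesky factor L (lower-triangular) of a 2x2 covariance matrix."""
    a = math.sqrt(cov[0][0])
    b = cov[0][1] / a
    c = math.sqrt(cov[1][1] - b * b)
    return [[a, 0.0], [b, c]]

def sim_mvn_mixture(n, probs, mus, covs, seed=12345):
    """Return (X, ID): n bivariate draws from a K-component Gaussian mixture."""
    rng = random.Random(seed)
    X, ID = [], []
    for k, nk in enumerate(multinomial(n, probs, rng)):
        L = chol2(covs[k])
        for _ in range(nk):        # x = mu + L*z, where z ~ N(0, I)
            z = [rng.gauss(0, 1), rng.gauss(0, 1)]
            X.append([mus[k][0] + L[0][0] * z[0],
                      mus[k][1] + L[1][0] * z[0] + L[1][1] * z[1]])
            ID.append(k)
    return X, ID

X, ID = sim_mvn_mixture(100, [0.35, 0.5, 0.15],
                        [[50, 34], [59, 28], [66, 30]],
                        [[[12, 10], [10, 14]],
                         [[27, 9], [9, 10]],
                         [[40, 9], [9, 10]]])
print(len(X))   # 100 rows
```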
You can visualize the random sample by writing the data to a SAS data set and using the SGSCATTER procedure to create a paneled scatter plot, as follows:
/* save simulated data to a data set */
varNames = "x1":"x4";
Y = Group || X;
create Clusters from Y[c=("Group" || varNames)];
append from Y;
close;
quit;

ods graphics / attrpriority=none;
title "Multivariate Normal Components";
title2 "(*ESC*){unicode pi} = {0.35, 0.5, 0.15}";
proc sgscatter data=Clusters;
   compare y=(x1 x2) x=(x3 x4) / group=Group markerattrs=(Size=12);
run;
The graph is shown at the top of this article.
Although each component in this example is multivariate normal, the same technique will work for any component distributions. For example, one cluster could be multivariate normal, another multivariate t, and a third multivariate uniform.
In summary, you can create a function module in the SAS/IML language to simulate data from a mixture of Gaussian components. The RandMultinomial function returns a random vector that determines the number of observations to draw from each component. The RandNormal function generates the observations. A technical detail involves how to pass the parameters to the module. This implementation packs the parameters into matrices, but other options, such as a list of lists, are possible.
The post Simulate multivariate clusters in SAS appeared first on The DO Loop.
This post was kindly contributed by The DO Loop  go there to comment and to read the full post. 
This post was kindly contributed by SAS Learning Post  go there to comment and to read the full post. 
With Hurricane Irma recently pummeling pretty much the entire state of Florida, I got to wondering where past hurricanes have hit the state. Let’s get some data, and figure out how to best analyze it using SAS software! I did a bit of web searching, and found the following map […]
The post Where do hurricanes strike Florida? (110 years of data) appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post  go there to comment and to read the full post. 
This post was kindly contributed by Software & Service  go there to comment and to read the full post. 
Algorithmic pricing is an exciting new area that combines engineering and mathematics. Chen’s paper introduced algorithmic pricing on Amazon Marketplace. This post discusses the implementation of an algorithmic pricing engine based on Redis from the perspective of the sellers.
The overall infrastructure may include three parts.
Amazon most of the time tolerates a crawler running at 1 QPS per IP. But all the data centers and all ASINs from the seller and his competitors have to be closely watched, so a distributed approach is safer. The response time is crucial in deciding which seller is the winner of the game. In the common chasing diagram below, Seller 1 clearly has the upper hand and Seller 2 is losing the game, since the crawler from Seller 1 is faster.
The embedded pricing algorithm has two purposes.
The computed price will enter Amazon and become effective. Amazon will rank sellers lower when it thinks they are doing algorithmic pricing; sometimes Amazon even bans the seller. So don’t let Amazon find out that a computer is manipulating its APIs. To emulate human behavior in a browser, the two options are PhantomJS and headless Chrome.
Redis is an in-memory data store that supports persistence and sharding. The first use of Redis in this algorithmic pricing engine is as a centralized task scheduler for the crawlers. Beyond that, some other interesting aspects of algorithmic pricing can be explored and exploited with Redis.
Redis as a cache supports TTL (time to live). As data accumulates, the times at which competitors change their prices can be predicted. In the publisher-subscriber model, the predicted duration until the next price change can be entered each time as an expiring key with a TTL. Once the key expires, the publisher dispatches a crawling task and makes the price adjustment. The benefit of this approach is that the crawlers do not need to tap a single Amazon web page every second, which brings the risk of being banned.
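As a concrete sketch of what the calculateNextTTL() step might look like (the averaging rule and the names below are my own illustration, not from the original post), the publisher can predict the gap until a competitor's next price change from the observed history and use that prediction as the expiring key's TTL:

```python
def calculate_next_ttl(change_times, floor_s=60):
    """Predict seconds until the competitor's next price change.

    change_times: ascending Unix timestamps of observed price changes.
    A naive predictor: the mean gap between recent changes, with a floor
    so the crawler never polls more often than once per floor_s seconds.
    """
    if len(change_times) < 2:
        return floor_s                    # no history yet; poll conservatively
    gaps = [b - a for a, b in zip(change_times, change_times[1:])]
    return max(floor_s, int(sum(gaps) / len(gaps)))

# The publisher would then set an expiring key and, on the keyspace-expiry
# notification, dispatch a crawler and reprice; e.g., with redis-py (sketch):
#   r.set(f"watch:{asin}", 1, ex=calculate_next_ttl(history[asin]))

print(calculate_next_ttl([0, 600, 1500, 1800]))   # mean gap is 600 seconds
```

Keyspace-expiry notifications must be enabled on the Redis server (notify-keyspace-events) for the subscriber to receive the expiration events.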
+------------------------------+
|          Subscriber          |
+------------------------------+
| + psubscribe():void          |
+------------------------------+
               |
            PubSub
               ^
               |
+------------------------------+
|          Publisher           |
+------------------------------+
| + calculateNextTTL():int     |
| + onPMessage():void          |
| + dispatchCrawler():boolean  |
+------------------------------+
The synchronization mechanism across the data centers takes time, sometimes hours. That is the reason why we see different prices at different Amazon IPs at the same time. People also have different purchasing behavior patterns at different times. Since a seller has the option to change a price at a specified data center by IP instead of by domain name, it is an interesting topic to exploit the cool-down time it takes for a price to spread across the whole network and make a hedge. Redis’ capacity to keep all prices as hashes in memory is helpful for spotting those valuable occasions.
This post was kindly contributed by Software & Service  go there to comment and to read the full post. 