This post was kindly contributed by platformadmin.com - go there to comment and to read the full post.
One of the new additions in the recently released Metacoda Plug-ins 6.1 R2 is a Compare Metadata Objects feature. This is something that several customers have requested, particularly for the comparison of SAS metadata security objects like ACTs, Users, Groups, and Roles. One of the most common requests was to be able to compare … Continue reading “Metacoda Plug-ins Tip: Compare Metadata Objects”
This post was kindly contributed by Avocet Solutions - go there to comment and to read the full post.
This year I’ve had the honor of helping to recruit speakers for the Career Development area at SAS Global Forum. We have some fantastic presentations that everyone can benefit from whether you are a student, a new graduate, or a mid-career professional.
I particularly recommend the panel discussion (Career Advice We’d Give to Our Kids) Tuesday April 30, 3:00-4:00 in Level 2, Ballroom C4. The panelists (Shelley Blozis, AnnMaria De Mars, Paul LaBrec) are all great, so this should be both informative and entertaining.
The following presentations are listed in order by day and time. As you scroll through this list, you may notice that most (but not all!) of these presentations are in Level 1 Room D168.
Poster (available every day)
Tips to Ensure Success in Your New SAS Project
Flora Fang Liu
Tuesday, April 30, 2019
10:00-11:00 Level 1, D168
Don’t Just Survive, Thrive! A Learning-Based Strategy for a Modern Organization Centered Around SAS
Jovan Marjanovic
11:00-12:00 Level 1, D168
The Power of Know-How: Pump Up Your Professional Value by Refining Your SAS Skills
Gina Huff
1:00-1:15 Level 2, Exhibit Hall D, Super Demo 12
SAS Programming Exam Moves to Performance-Based Format
Mark Stevens
1:30-2:00 Level 1, D168
The Why and How of Teaching SAS to High School Students
Jennifer Richards
2:00-2:30 Level 1, D168
Puzzle Me, Puzzle You: How a Thought Experiment Became a Rubik’s Cube Among a Set of Fun Puzzles
Amit Patel, Lewis Mitchell
2:30-3:00 Level 1, D168
How to Land Work as a SAS Professional
Charu Shankar
3:00-3:15 Level 2, Exhibit Hall D, Super Demo 12
Take SAS Certification Exams from Home Online Proctored
Terry Barham
3:00-4:00 Level 2, Ballroom C4
Panel Discussion: Career Advice We’d Give to Our Kids
Shelley Blozis, AnnMaria De Mars, Paul LaBrec
3:00-4:00 Level 1, D168
How To Be an Effective Statistician
Alexander Schacht
4:00-5:00 Level 1, D168
Stories from the Trenches: Tips and Techniques for Career Advancement from a SAS Industry Recruiter
Molly Hall
5:00-5:30 Level 1, D168
How to HOW: Hands-On Workshops Made Easy
Chuck Kincaid
Wednesday, May 1, 2019
10:00-11:00 Level 2, Ballroom C3
Tell Me a Data Story
Kat Greenbrook
10:00-11:00 Level 2, Ballroom C4
The Good, The Bad, and The Creepy: Why Data Scientists Need to Understand Ethics
Jennifer Priestley
11:30-12:00 Level 1, D168
New to SAS? Helpful Hints for Developing Your Own Professional Development Plan
Kelly Smith
This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post.
Do you have a favorite television show? Or a favorite movie franchise that you follow? If you call yourself a “fan,” just how much of a fan are you? Are you merely a spectator, or do you take your fanaticism to the next level by creating something new?
When it comes to fandom for franchises like Game of Thrones, the Marvel movies, or Stranger Things, there’s a new kind of nerd in town. And this nerd brings data science skills. You’ve heard of the “second screen” experience for watching television, right? That’s where fans watch a show (or sporting event or awards ceremony), but also keep up with Twitter or Facebook so they can commune with other fans of the show on social media. These fan-data-scientists bring a third screen: their favorite data workbench IDE.
I was recently lured into a rabbit hole of Game of Thrones data by a tweet. The Twitter user was reacting to a data visualization of character screen time during the show. The visualization was built in a different tool, but the person was wondering whether it could be done in SAS. I knew the answer was Yes…as long as we could get the data. That turned out to be the easiest part.
WARNING: While this blog post does not reveal any plot points from the show, the data does contain spoilers! No spoilers in what I’m showing here, but if you run my code examples there might be data points that you cannot “unsee.” I was personally conflicted about this, since I’m a fan of the show but I’m not yet keeping up with the latest episodes. I had to avert my eyes for the most recent data.
A GitHub user named Jeffrey Lancaster has shared a repository for all aspects of data around Game of Thrones. He also has similar repos for Stranger Things and the Marvel universe. Inside that repo there’s a JSON file with episode-level data for all episodes and seasons of the show. With a few lines of code, I was able to read the data directly from the repo into SAS:
filename eps temp;

/* Big thanks to this GoT data nerd for assembling this data */
proc http
   url="https://raw.githubusercontent.com/jeffreylancaster/game-of-thrones/master/data/episodes.json"
   out=eps
   method="GET";
run;

/* slurp this in with the JSON engine */
libname episode JSON fileref=eps;
Note that I’ve shared all of my code for my steps in my own GitHub repo (just trying to pay it forward). Everything should work in Base SAS, including in SAS University Edition.
The JSON library reads the data into a series of related tables that show all of the important things that can happen to characters within a scene. Game of Thrones fans know that death, sex, and marriage (in that order) make up the inflection points in the show.
With a little bit of data prep using SQL, I was able to show the details of the on-screen time per character, per scene. These are the basis of the visualization I was trying to create.
/* Build details of scenes and characters who appear in them */
PROC SQL;
   CREATE TABLE WORK.character_scenes AS
   SELECT t1.seasonNum, t1.episodeNum,
          t2.ordinal_scenes as scene_id,
          input(t2.sceneStart,time.) as time_start format=time.,
          input(t2.sceneEnd,time.) as time_end format=time.,
          (calculated time_end) - (calculated time_start) as duration format=time.,
          t3.name
   FROM EPISODE.EPISODES t1,
        EPISODE.EPISODES_SCENES t2,
        EPISODE.SCENES_CHARACTERS t3
   WHERE (t1.ordinal_episodes = t2.ordinal_episodes AND
          t2.ordinal_scenes = t3.ordinal_scenes);
QUIT;
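For readers who want to experiment outside of SAS, the same per-scene duration logic can be sketched in a few lines of Python. The `scenes` list below is a tiny hypothetical stand-in for the scene records that the JSON engine reads (the field names mirror the repo's episodes.json, but the values are made up):

```python
from datetime import timedelta

def to_seconds(hms):
    """Convert an 'H:MM:SS' timestamp into seconds."""
    h, m, s = (int(p) for p in hms.split(":"))
    return int(timedelta(hours=h, minutes=m, seconds=s).total_seconds())

# Miniature stand-in for the scene records (hypothetical values)
scenes = [
    {"sceneStart": "0:00:10", "sceneEnd": "0:01:40",
     "characters": ["Jon Snow", "Sansa Stark"]},
    {"sceneStart": "0:01:40", "sceneEnd": "0:02:10",
     "characters": ["Jon Snow"]},
]

# Accumulate each character's on-screen seconds across all scenes
screen_time = {}
for scene in scenes:
    duration = to_seconds(scene["sceneEnd"]) - to_seconds(scene["sceneStart"])
    for name in scene["characters"]:
        screen_time[name] = screen_time.get(name, 0) + duration

print(screen_time)  # Jon Snow appears in both scenes, Sansa in one
```

The real data would be loaded with `json.load()` from the downloaded file; the aggregation step is identical.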
With a few more data prep steps (see my code on GitHub), I was able to summarize the screen time for scene locations:
You can see that The Crownlands dominate as a location. In the show that’s a big region and a sort of headquarters for The Seven Kingdoms, and the show data actually includes “sub-locations” that can help us to break that down. Here’s the makeup of that 18+ hours of time in The Crownlands:
My goal is to show how much screen time each of the major characters receives, and how that changes over time. I began by creating a series of charts using PROC SGPLOT. These were created using a single SGPLOT step using a BY group, segmented by show episode. They appear in a grid because I used ODS LAYOUT GRIDDED to arrange them.
Here’s the code segment that creates these dozens of charts. Again, see my GitHub for the intermediate data prep work.
/* Create a gridded presentation of Episode graphs with CUMULATIVE timings */
ods graphics / width=500 height=300 imagefmt=svg noborder;
ods layout gridded columns=3 advance=bygroup;

proc sgplot data=all_times noautolegend;
   hbar name / response=cumulative categoryorder=respdesc
               colorresponse=total_screen_time dataskin=crisp
               datalabel=name datalabelpos=right datalabelattrs=(size=10pt)
               seglabel seglabelattrs=(weight=bold size=10pt color=white);
   by epLabel notsorted;
   format cumulative time.;
   label epLabel="Ep";
   where rank<=10;
   xaxis display=(nolabel) grid;
   yaxis display=none grid;
run;

ods layout end;
ods html5 close;
The example shared on Twitter showed an animation of screen time, per character, over the complete series of episodes. So instead of a huge grid with many plots, we need to produce a single file with layers for each episode. In SAS we can produce an animated GIF or an animated SVG (scalable vector graphics) file. The SVG is a much smaller file format, but you need a browser or a special viewer to “play” it. Still, that’s the path I followed:
/* Create a single animated SVG file for all episodes */
options printerpath=svg animate=start animduration=1
        svgfadein=.25 svgfadeout=.25 svgfademode=overlap
        nodate nonumber;

/* change this file path to something that works for you */
ODS PRINTER file="c:\temp\got_cumulative.svg" style=daisy;
/* For SAS University Edition:
   ODS PRINTER file="/folders/myfolders/got_cumulative.svg" style=daisy;
*/
proc sgplot data=all_times noautolegend;
   hbar name / response=cumulative categoryorder=respdesc
               colorresponse=total_screen_time dataskin=crisp
               datalabel=name datalabelpos=right datalabelattrs=(size=10pt)
               seglabel seglabelattrs=(weight=bold size=10pt color=white);
   by epLabel notsorted;
   format cumulative time.;
   label epLabel="Ep";
   where rank<=10;
   xaxis label="Cumulative screen time (HH:MM:SS)" grid;
   yaxis display=none grid;
run;
options animation=stop;
ods printer close;
Here’s the result (hosted on my GitHub repo — but as a GIF for compatibility.)
Like the Game of Thrones characters, my visualization is imperfect in many ways. As I was just reviewing it, I discovered a few data prep missteps that I should correct. I used some features of PROC SGPLOT that I’ve learned only a little about, so others might suggest improvements. And my next mission should be to bring this data into SAS Visual Analytics, where the real “data viz maesters” who work with me can work their magic. I’m just hoping that I can stay ahead of the spoilers.
The post Deeper enjoyment of your favorite shows — through data appeared first on The SAS Dummy.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
I think every course in exploratory data analysis should begin by studying Anscombe’s quartet. Anscombe’s quartet is a set of four data sets (N=11) that have nearly identical descriptive statistics but very different graphical properties. They are a great reminder of why you should graph your data.
You can read about Anscombe’s quartet on Wikipedia, or check out a quick visual summary by my colleague Robert Allison. Anscombe’s first two examples are shown below:
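You can verify the claim with a few lines of code. The sketch below (plain Python, using Anscombe's published values for the first two data sets) computes the shared summary statistics:

```python
# Anscombe's first two data sets: same x values, very different y values
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summarize(xs, ys):
    """Return (mean_y, slope, intercept, correlation) of the least-squares fit."""
    n = len(xs)
    mx, my = sum(xs)/n, sum(ys)/n
    sxx = sum((u - mx)**2 for u in xs)
    syy = sum((v - my)**2 for v in ys)
    sxy = sum((u - mx)*(v - my) for u, v in zip(xs, ys))
    slope = sxy / sxx
    return my, slope, my - slope*mx, sxy / (sxx*syy)**0.5

for ys in (y1, y2):
    mean_y, slope, intercept, corr = summarize(x, ys)
    print(f"mean={mean_y:.2f}  fit: y = {intercept:.2f} + {slope:.3f}x  corr={corr:.3f}")
# both data sets print: mean=7.50  fit: y = 3.00 + 0.500x  corr=0.816
```

Both data sets yield a mean of 7.50, the fitted line y = 3.00 + 0.500x, and a correlation of 0.816, despite their very different shapes.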
The Wikipedia article states,
“It is not known how Anscombe created his data sets.”
Really? Creating different data sets that have the same statistics sounds like a fun challenge!
As a tribute to Anscombe, I decided to generate my own versions of the two data sets shown in the previous scatter plots. The first data set is linear with normal errors. The second is quadratic (without errors) and has the exact same linear fit and correlation coefficient as the first data.
The Wikipedia article notes that there are “several methods to generate similar data sets with identical statistics and dissimilar graphics,” but I did not look at the modern papers. I wanted to try it on my own. If you want to solve the problem on your own, stop reading now!
I used the following approach to construct the first two data sets:
From geometric reasoning, there are three different solutions for the β parameter: one with β_{2} > 0 (a parabola that opens up), one with β_{2} = 0 (a straight line), and one with β_{2} < 0 (a parabola that opens down). Since Anscombe used a downward-pointing parabola, I will make the same choice.
You can construct the first data set in a number of ways, but I chose to construct it randomly. The following SAS/IML statements construct the data and define a helper function (LinearReg). The program computes the target values, which are the parameter estimates for a linear regression and the sample correlation for the data:
proc iml;
call randseed(12345);
x = T( do(4, 14, 0.2) );                              /* evenly spaced X */
eps = round( randfun(nrow(x), "Normal", 0, 1), 0.01); /* normal error */
y = 3 + 0.5*x + eps;                                  /* linear Y + error */

/* Helper function. Return parameter estimates for linear regression.
   Args are col vectors */
start LinearReg(Y, tX);
   X = j(nrow(tX), 1, 1) || tX;
   b = solve(X`*X, X`*Y);       /* solve normal equations */
   return b;
finish;

targetB = LinearReg(y, x);      /* compute regression estimates */
targetCorr = corr(y||x)[2];     /* compute sample correlation */
print (targetB`||targetCorr)[c={'b0' 'b1' 'corr'} F=5.3 L="Target"];
You can use these values as the target values. The next step is to find a parameter vector β such that Y_{2} = β_{0} + β_{1} X + β_{2} X^{2} has the same regression line and the same sample correlation with X. For uniqueness, require β_{2} < 0.
You can formulate the problem as a system of equations and use the NLPHQN subroutine in SAS/IML to solve it. (SAS supports multiple ways to solve a system of equations.)
The following SAS/IML statements define two functions. Given any value for the β parameter, the first function returns the regression estimates and sample correlation between Y_{2} and X. The second function is the objective function for an optimization. It subtracts the target values from the estimates. The NLPHQN subroutine implements a hybrid quasi-Newton optimization routine that uses least squares techniques to find the β parameter that generates quadratic data that tries to match the target statistics.
/* Define system of simultaneous equations:
   https://blogs.sas.com/content/iml/2018/02/28/solve-system-nonlinear-equations-sas.html */
/* This function returns linear regression estimates (b0, b1) and
   correlation for a choice of beta */
start LinearFitCorr(beta) global(x);
   y2 = beta[1] + beta[2]*x + beta[3]*x##2;  /* construct quadratic Y */
   b = LinearReg(y2, x);                     /* linear fit */
   corr = corr(y2||x)[2];                    /* sample corr */
   return ( b` || corr );                    /* return row vector */
finish;

/* This function returns the vector quantity (estimates - target).
   Find the value that minimizes Sum | F_i(beta) - Target_i |^2 */
start Func(beta) global(targetB, targetCorr);
   target = rowvec(targetB) || targetCorr;
   G = LinearFitCorr(beta) - target;
   return( G );                              /* return row vector */
finish;

/* now solve for quadratic parameters so that the same linear fit
   and correlation occur */
beta0 = {-5 1 -0.1};        /* initial guess */
con = {.  .  . ,            /* constraint matrix */
       0  .  0 };           /* quadratic term is negative */
optn = ncol(beta0) || 0;    /* LS with 3 components || amount of printing */
/* minimize sum( F_i(beta) - target_i )**2 */
call nlphqn(rc, beta, "Func", beta0, optn) blc=con;  /* LS solution */
print beta[L="Optimal beta"];

/* How nearly does the solution solve the problem? Did we match the target values? */
Y2Stats = LinearFitCorr(beta);
print Y2Stats[c={'b0' 'b1' 'corr'} F=5.3];
The first output shows that the linear fit and correlation statistics for the linear and quadratic data are identical (to 3 decimal places). Anscombe would be proud! The second output shows the parameters for the quadratic response: Y_{2} = 4.955 + 2.566*X – 0.118*X^{2}.
The following statements create scatter plots of the new Anscombe-like data:
y2 = beta[1] + beta[2]*x + beta[3]*x##2;
create Anscombe2 var {x y y2};
append;
close;
QUIT;

ods layout gridded columns=2 advance=table;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=x y=y;
   lineparm x=0 y=3.561 slope=0.447 / clip;
run;
proc sgplot data=Anscombe2 noautolegend;
   scatter x=x y=y2;
   lineparm x=0 y=3.561 slope=0.447 / clip;
run;
ods layout end;
Notice that the construction of the second data set depends only on the statistics for the first data set. If you modify the first data set, the second will automatically adapt. For example, you could choose the errors manually instead of randomly, and the statistics for the second data set should still match.
You can create the other data sets similarly.
In summary, you can solve a system of equations to construct data similar to Anscombe’s quartet. By using this technique, you can create your own data sets that share descriptive statistics but look very different graphically.

To be fair, the technique I’ve presented does not enable you to reproduce Anscombe’s quartet in its entirety. My data share a linear fit and sample correlation, whereas Anscombe’s data share seven statistics!
Anscombe was a pioneer (along with Tukey) in using computation to assist statistical analysis. He was also enamored with the APL language. He introduced computers into the Yale statistics department in the mid-1960s. Since he published his quartet in 1973, it is possible that he used computers to assist his creation of the Anscombe quartet.
However he did it, Anscombe’s quartet is truly remarkable!
You can download the SAS program that creates the results in this article.
The post Create your own version of Anscombe's quartet: Dissimilar data that have similar statistics appeared first on The DO Loop.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
This blog post is based on the Code Snippets tutorial video in the free SAS^{®} Viya^{®} Enablement course from SAS Education. Keep reading to learn more about code snippets or check out the video to follow along with the tutorial in real-time.
Has there ever been a block of code that you use so infrequently that you always seem to forget the options that you need? Conversely, has there ever been a block of code that you use so frequently that you grow tired of typing it all the time? Code snippets can greatly assist with both of these scenarios. In this blog post, we discuss using pre-installed code snippets and creating new code snippets within SAS Viya.
SAS Viya comes with several code snippets pre-installed, including snippets to connect to CAS. To access these snippets, expand the Snippets area on the left navigation panel of SAS Studio as shown in Figure 1. You can see that the snippets are divided into categories, making it easier to find them.
If you double-click a pre-installed code snippet, or if you click and drag the snippet into the code editor panel, then the snippet will appear in the panel.
Snippets can range from very simple to very complex. Some contain comments. Some contain macro variables. Some might be only a couple of lines of code. That is the advantage of snippets. They can be anything that you want them to be.
Now, let’s create a snippet of our own. Figure 2 shows an example of code that calls PROC CARDINALITY. This code is complete and fully executable. When you have the code the way that you want it in your code window, click the Add to My Snippets shortcut button above the code. The button is outlined in a box in Figure 2.
A window will appear that asks you to name the snippet. Naming the snippet then saves it into the My Snippets area in the left navigation panel for future use.
Remember that snippets are extremely flexible. The code that you save does not have to be fully executable. Instead of supplying the data source in your code, you may instead include notes or comments about what needs to be added, which makes the code more general, but it is still a very useful snippet.
To use one of your saved snippets, simply navigate to the My Snippets area, then double-click on your snippet or drag it into the code window.
Want to learn more about SAS Viya? Download the free e-book Exploring SAS^{®} Viya^{®}: Programming and Data Management. The content in this e-book is based on SAS^{®} Viya^{®} Enablement, a free course available from SAS Education.
Using code snippets in SAS® Viya® was published on SAS Users.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
A quadratic form is a second-degree polynomial that does not have any linear or constant terms.
For multivariate polynomials, you can quickly evaluate a quadratic form by using the matrix expression x` A x. This computation is straightforward in a matrix language such as SAS/IML. However, some computations in statistics require that you evaluate a quadratic form that looks like x` A^{-1} x, where A is symmetric and positive definite. Typically, you know A, but not the inverse of A. This article shows how to compute both kinds of quadratic forms efficiently.
For multivariate polynomials, you can represent the purely quadratic terms by a symmetric matrix, A. The polynomial is q(x) = x` A x, where x is a d x 1 vector and A is a d x d symmetric matrix. For example, if A is the matrix {9 -2, -2 5} and x = {x1, x2} is a column vector, the expression x` A x equals the second-degree polynomial q(x1, x2) = 9*x1^{2} – 4*x1 x2 + 5*x2^{2}. A contour plot of this polynomial is shown below.
Probably the most familiar quadratic form is the squared Euclidean distance. If you let A be the d-dimensional identity matrix, then the squared Euclidean distance of a vector x from the origin is x` I_{d} x = x` x = Σ_{i} x_{i}^{2}. You can obtain a weighted squared distance by using a diagonal matrix that has positive values. For example, if W = diag({2, 4}), then x` W x = 2*x1^{2} + 4*x2^{2}. If you add in off-diagonal elements, you get cross terms in the polynomial.
If the matrix A is dense, then you can use matrix multiplication to evaluate the quadratic form. The following symmetric 3 x 3 matrix defines a quadratic polynomial in 3 variables. The multiplication evaluates the polynomial at (x1, x2, x3) = (-1, 2, 0.5).
proc iml;
/* q(x1, x2, x3) = 4*x1**2 + 6*x2**2 + 9*x3**2
                   + 2*3*x1*x2 + 2*2*x1*x3 + 2*1*x2*x3 */
A = {4 3 2,
     3 6 1,
     2 1 9};
x = {-1, 2, 0.5};
q1 = x`*A*x;
print q1;
When you are dealing with large matrices, always remember that you should never explicitly form a large diagonal matrix. Multiplying with a large diagonal matrix is a waste of time and memory.
Instead, you can use elementwise multiplication to evaluate the quadratic form more efficiently:
w = {4, 6, 9};
q2 = x`*(w#x);   /* more efficient than q = x`*diag(w)*x; */
print q2;
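The same shortcut applies in any language; here is the idea in Python, reusing the weights and coordinates from the example above:

```python
# Weighted quadratic form x` diag(w) x without forming the diagonal matrix:
# multiply elementwise and sum, which is O(n) in both work and memory.
w = [4, 6, 9]
x = [-1, 2, 0.5]

q = sum(wi * xi * xi for wi, xi in zip(w, x))
print(q)  # 4*1 + 6*4 + 9*0.25 = 30.25
```

Building the full n x n diagonal matrix would cost O(n^2) memory and an O(n^2) multiplication for the same answer.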
In statistics, the matrix is often symmetric positive definite (SPD). The matrix A might be a covariance matrix (for a nondegenerate system), the inverse of a covariance matrix, or the Hessian evaluated at the minimum of a function. (Recall that the inverse of an SPD matrix is SPD.)

An important example is the squared Mahalanobis distance x` Σ^{-1} x, which is a quadratic form. As I have previously written, you can use a trick in linear algebra to efficiently compute the Mahalanobis distance. The trick is to compute the Cholesky decomposition of the SPD matrix.
I’ll use a large Toeplitz matrix, which is guaranteed to be symmetric and positive definite, to demonstrate the technique. The function EvalSPDQuadForm evaluates a quadratic form defined by the SPD matrix A at the coordinates given by x:
/* Evaluate quadratic form q = x`*A*x, where A is symmetric positive definite.
   Let A = U`*U be the Cholesky decomposition of A.
   Then q = x`*(U`*U)*x = (U*x)`*(U*x).
   So define z = U*x and compute q = z`*z */
start EvalSPDQuadForm(A, x);
   U = root(A);       /* Cholesky root */
   z = U*x;
   return (z` * z);   /* dot product of vectors */
finish;

/* Run on a large example */
call randseed(123);
N = 1000;
A = toeplitz(N:1);
x = randfun(N, "Normal");

q3 = EvalSPDQuadForm(A, x);  /* efficient */
qCheck = x`*A*x;             /* check computation by using a slower method */
print q3 qCheck;
You can use a similar trick to evaluate the quadratic form x` A^{-1} x. I previously used this trick to evaluate the Mahalanobis distance efficiently. It combines a Cholesky decomposition (the ROOT function in SAS/IML) with the TRISOLV function for solving triangular systems.
/* Evaluate quadratic form x`*inv(A)*x, where A is symmetric positive definite.
   Let w be the solution of A*w = x and let A = U`*U be the Cholesky decomposition.
   To compute q = x` * inv(A) * x = x` * w, you need to solve for w.
   Because (U`*U)*w = x,
   first solve the triangular system U` z = x for z,
   then solve the triangular system U w = z for w */
start EvalInvQuadForm(A, x);
   U = root(A);            /* Cholesky root */
   z = trisolv(2, U, x);   /* solve triangular system U` z = x */
   w = trisolv(1, U, z);   /* solve triangular system U w = z */
   return (x` * w);        /* dot product of vectors */
finish;

/* test the function */
q4 = EvalInvQuadForm(A, x);  /* efficient */
qCheck = x`*Solve(A,x);      /* check computation by using a slower method */
print q4 qCheck;
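To make the factor-and-solve idea concrete outside of SAS/IML, here is a small pure-Python sketch (toy code for illustration, not production linear algebra). One detail worth noting: if you only need the quadratic form, a single triangular solve suffices, because x` inv(A) x = z`z, where z solves U` z = x.

```python
import math

def cholesky_upper(A):
    """Return upper-triangular U with A = U^T U (A must be SPD)."""
    n = len(A)
    U = [[0.0]*n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = A[i][j] - sum(U[k][i]*U[k][j] for k in range(i))
            U[i][j] = math.sqrt(s) if i == j else s / U[i][i]
    return U

def inv_quad_form(A, x):
    """Evaluate x^T A^{-1} x via a Cholesky factor and one triangular solve."""
    n = len(A)
    U = cholesky_upper(A)
    z = [0.0]*n                     # forward-solve U^T z = x
    for i in range(n):
        z[i] = (x[i] - sum(U[k][i]*z[k] for k in range(i))) / U[i][i]
    return sum(zi*zi for zi in z)   # q = z^T z = x^T A^{-1} x

A = [[4.0, 2.0], [2.0, 3.0]]        # SPD; inverse is (1/8)*[[3,-2],[-2,4]]
x = [1.0, 1.0]
print(inv_quad_form(A, x))          # exact value: (3 - 2 - 2 + 4)/8 = 0.375
```

The SAS/IML version above performs the second triangular solve because it returns the full solution vector w; for the scalar quadratic form alone, the single solve shown here is enough.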
You might wonder how much time is saved by using the Cholesky root and triangular solvers, rather than by computing the general solution by using the SOLVE function. The following graph compares the average time to evaluate the same quadratic form by using both methods. You can see that for large matrices, the ROOT-TRISOLV method is many times faster than the straightforward method that uses SOLVE.
In summary, you can use several tricks in numerical linear algebra to efficiently evaluate a quadratic form, which is a multivariate quadratic polynomial that contains only second-degree terms. These techniques can greatly improve the speed of your computational programs.
The post Efficient evaluation of a quadratic form appeared first on The DO Loop.
This post was kindly contributed by SAS Users - go there to comment and to read the full post.
At the end of my SAS Users blog post explaining how to install SAS Viya on the Azure Cloud for a SAS Hackathon in the Nordics, I promised to provide some technical background. I ended up with only one manual step by launching a shell script from a Linux machine and from there the whole process kicked off. In this post, I explain how we managed to automate this process as much as possible. Read on to discover the details of the script.
The script uses the Azure command-line interface (CLI) heavily. The CLI is Microsoft’s cross-platform command-line experience for managing Azure resources. Make sure the CLI is installed, otherwise you cannot use the script.
The process contains three steps: start the VM that hosts the mirror repository, create a new Azure VM from the custom image, and install the software (OpenLDAP, SAS Viya, and JupyterHub).
Let’s examine the details of each step.
When deploying software in the cloud, Red Hat Enterprise Linux recommends using a mirror repository. Since the SAS Viya package allows for this installation method, we decided to use the mirror for the hackathon images. This is optional but recommended, for example, if your deployment does not have access to the Internet or if you must always deploy the same version of software (such as for regulatory reasons or for testing/production purposes).
In our Azure Subscription we created an Azure Resource group with the name ‘Nordics Hackathon.’ Within that resource group, there is an Azure VM running a web server hosting the downloaded SAS Viya repository.
Of course, we cannot start the SAS Viya installation before being sure this VM – hosting all rpms to install SAS Viya – is running.
To validate that the VM is running, we issue the start command from the CLI:
az vm start -g [Azure Resource Group] -n [AZ VM name]
Something like:
az vm start -g my_resourcegroup -n my_viyarepo34
If the server is already running, nothing happens. If not, the command starts the VM. We can also check the Azure console:
The second part of the script launches a new Azure VM. We use the custom Azure image we created earlier. The SAS Viya image creation is explained in the first blog post.
The Azure image used for the Nordics hackathon was the template for all other SAS Viya installations. On this Azure image we completed several valuable tasks:
Every time we launch our script, an exact copy of a new Azure Virtual machine launches, fully customized according to our needs for the Hackathon.
Below is the Azure CLI command used in the script which creates a new Azure VM.
az vm create --resource-group [Azure Resource Group] --name $NAME --image viya_Base \
  --admin-username azureuser --admin-password [your_pw] --subnet [subnet_id] \
  --nsg [optional existing network security group] --public-ip-address-allocation static \
  --size [any Azure size] --tags name=$NAME
After the creation of the VM, we install SAS Viya in the third step of the process.
After running the script three times (using a different value for $NAME), we end up with the following high-level infrastructure:
After the launch of the Azure VM, the script invokes the viya-install.sh install script, which was staged in the /opt/sas/install/ location on the original image.
In the last step of the deployment process, the script installs OpenLdap, SAS Viya and JupyterHub. The following command runs the script:
az vm run-command invoke -g [Azure Resource Group] -n $NAME \
  --command-id RunShellScript \
  --scripts "sudo /opt/sas/install/viya-install.sh &"
The steps in the script should be familiar to those with experience installing SAS Viya and/or Ansible playbooks. Below is the script in its entirety.
#!/bin/bash
touch /start
####################################################################
echo "Starting with the installation of OPENLDAP. Check the openldap.log in the playbook directory for more information" > /var/log/myScriptLog.txt
####################################################################
# install openldap
cd /opt/sas/install/OpenLDAP
ansible-playbook openldapsetup.yml
if [ $? -ne 0 ]; then { echo "Failed the openldap setup, aborting." ; exit 1; } fi
cp ./sitedefault.yml /opt/sas/install/sas_viya_playbook/roles/consul/files/sitedefault.yml
if [ $? -ne 0 ]; then { echo "Failed to copy file, aborting." ; exit 1; } fi
####################################################################
echo "Starting Viya installation" >> /var/log/myScriptLog.txt
####################################################################
# install viya
cd /opt/sas/install/sas_viya_playbook
ansible-playbook site.yml
if [ $? -ne 0 ]; then { echo "Failed to install sas viya, aborting." ; exit 1; } fi
####################################################################
echo "Starting jupyterhub installation" >> /var/log/myScriptLog.txt
####################################################################
# install jupyterhub
cd /opt/sas/install/jupy-azure
ansible-playbook deploy_jupyter.yml
if [ $? -ne 0 ]; then { echo "Failed to install jupyterhub, aborting." ; exit 1; } fi
####################################################################
touch /finish
####################################################################
In a future blog, I hope to show you how get up and running with SAS Viya Azure Quick Start. For now, the details I provided in this and the previous blog post is enough to get you started deploying your own SAS Viya environments in the cloud.
Script for a SAS Viya installation on Azure in just one click was published on SAS Users.
The post Read RSS feeds with SAS using XML or JSON appeared first on The SAS Dummy.
This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post.
This blog post could be subtitled “To Catch a Thief” or maybe “Go ahead. Steal this blog. I dare you.”* That’s because I’ve used this technique several times to catch and report other websites that lift blog content from blogs.sas.com and present it as their own.
Syndicating blog content is an honorable practice, made possible by RSS feeds that virtually every blog platform supports. With syndicated feeds, your blog content can appear on someone else’s site, but this always includes attribution and a link back to the original source. However, if you copy content from another website and publish it natively on your own web property, without attribution or citations…well, that’s called plagiarism. And the Digital Millennium Copyright Act (DMCA) provides authors with recourse to have stolen content removed from infringing sites — if you can establish that you’re the rightful copyright owner.
Establishing ownership is a tedious task, especially when someone steals dozens or hundreds of articles. You must provide links to each example of infringing content, along with links to the original authorized content. Fortunately, as I’ve discussed before, I have ready access to the data about all 17,000+ blog posts that we’ve published at SAS (see How SAS uses SAS to Analyze SAS Blogs). In this article, I’ll show you how I gathered that same information from the infringing websites so that I could file the DMCA “paperwork” almost automatically.
The complete programs from this article are available on GitHub.
In my experience, the people who steal our blog content don’t splurge on fancy custom web sites. They tend to use free or low-cost web site platforms, and the most popular of these include WordPress (operated by Automattic) and Blogspot (operated by Google). Both of these platforms support API-like syndication using feeds.
Blogspot sites can generate article feeds in either XML or JSON. I prefer JSON when it’s available, as I find that the JSON libname engine in SAS requires fewer “clues” in order to generate useful tables. (See documentation for the JSON engine.) While you can supply a JSON map file that tells SAS how to assemble your tables and data types, I find it just as easy to read the data as-is and post-process it to join the fields I need and convert data fields. (For an example that uses a JSON map, see Reading data with the SAS JSON libname engine.)
Since I don’t want to draw attention to the specific infringing sites, I’ll use an example of a popular (legitimate!) Blogspot site named “Maps Mania”. If you’re into data and maps (who isn’t?) you might like their content. In this code I use PROC HTTP to fetch the RSS feed, using “alt=json” to request JSON format and “max-results=100” to retrieve a larger-than-default batch of published posts.
/* Read JSON feed into a local file. */
/* Use Blogspot parameters to get 100 posts at a time */
filename resp temp;
proc http
  url='https://googlemapsmania.blogspot.com/feeds/posts/default?alt=json&max-results=100'
  method="get"
  out=resp;
run;

libname rss json fileref=resp;
This JSON libname breaks the data into a series of tables that relate to each other via common keys.
With a little bit of exploration in SAS Enterprise Guide and the Query Builder, I was able to design a PROC SQL step to assemble just the fields and records I needed: post title and post URL.
/* Join the relevant feed entry items to make a single table */
/* with post titles and URLs */
proc sql;
  create table work.blogspot as
  select t2._t as rss_title,
         t1.href as rss_href
  from rss.entry_link t1
       inner join rss.entry_title t2
       on (t1.ordinal_entry = t2.ordinal_entry)
  where t1.type = 'text/html' and t1.rel = 'alternate';
quit;

libname rss clear;
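If you prefer to see the same join logic outside of SAS, here is a minimal Python sketch (my own illustration, not part of the original post). It walks a dictionary shaped like a Blogspot "alt=json" feed — entries under `feed.entry`, titles in a `"$t"` field, and a list of links where `rel="alternate"` carries the post URL — and pairs each title with its URL, just as the PROC SQL step does.

```python
# Illustrative only: a tiny in-memory structure shaped like a Blogspot
# JSON feed (alt=json). Not live data from any real feed.
feed = {
    "feed": {
        "entry": [
            {
                "title": {"$t": "First post"},
                "link": [
                    {"rel": "self", "type": "application/atom+xml",
                     "href": "https://example.blogspot.com/feeds/1"},
                    {"rel": "alternate", "type": "text/html",
                     "href": "https://example.blogspot.com/2019/04/first.html"},
                ],
            },
            {
                "title": {"$t": "Second post"},
                "link": [
                    {"rel": "alternate", "type": "text/html",
                     "href": "https://example.blogspot.com/2019/04/second.html"},
                ],
            },
        ]
    }
}

def titles_and_urls(feed):
    """Pair each entry's title with its rel='alternate' HTML link."""
    rows = []
    for entry in feed["feed"]["entry"]:
        url = next(link["href"] for link in entry["link"]
                   if link.get("rel") == "alternate"
                   and link.get("type") == "text/html")
        rows.append((entry["title"]["$t"], url))
    return rows

rows = titles_and_urls(feed)
```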
WordPress sites generate XML-based feeds by default. Site owners can install a WordPress plugin to generate JSON feeds as well, but most sites don’t bother with that. Like the JSON feeds, the XML feed can contain many fields that relate to each other. I find that with XML, the best approach is to use the SAS XML Mapper application to explore the XML and “design” the final data tables that you need. You use SAS XML Mapper to create a map file, which you can then feed into the SAS XMLv2 engine to instruct SAS how to read the data. (See documentation for the XMLv2 engine.)
SAS XML Mapper is available as a free download from support.sas.com. Download it as a ZIP file (on Windows), and extract the ZIP file to a temporary folder. Then run setup.exe in the root of that folder to install the app on your system.
To design the map, I use an example of the XML feed from the blog that I want to examine. Once again, I’ll choose a popular WordPress blog instead of the actual infringing sites. In this case, let’s look at the Star Wars News site. I point my browser at the feed address, https://www.starwars.com/news/feed, and save the result as an XML file. Then, I use SAS XML Mapper to Open XML (File menu), and examine the result.
I found everything that I needed in the “item” subset of the feed. I dragged that group over to the right pane to include it in the map. That creates a data set container named “item.” Then I dragged just the title, link, and pubDate fields into that data set to include them in the final result.
The SAS XML Mapper generates a SAS program that you can include to define the map, and that’s what I’ve done with the following code. It uses a DATA step to create the map file just as I need it.
filename rssmap temp;
data _null_;
  infile datalines;
  file rssmap;
  input;
  put _infile_;
datalines;
<?xml version="1.0" encoding="windows-1252"?>
<SXLEMAP name="RSSMAP" version="2.1">
  <NAMESPACES count="0"/>
  <!-- ############################################################ -->
  <TABLE name="item">
    <TABLE-PATH syntax="XPath">/rss/channel/item</TABLE-PATH>
    <COLUMN name="title">
      <PATH syntax="XPath">/rss/channel/item/title</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>250</LENGTH>
    </COLUMN>
    <COLUMN name="link">
      <PATH syntax="XPath">/rss/channel/item/link</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>200</LENGTH>
    </COLUMN>
    <COLUMN name="pubDate">
      <PATH syntax="XPath">/rss/channel/item/pubDate</PATH>
      <TYPE>character</TYPE>
      <DATATYPE>string</DATATYPE>
      <LENGTH>40</LENGTH>
    </COLUMN>
  </TABLE>
</SXLEMAP>
;
run;
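To see what the map actually extracts, here is a small Python sketch (my illustration, not the post's code) that pulls the same three fields — title, link, pubDate — out of a minimal RSS 2.0 snippet using the same /rss/channel/item paths:

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 snippet with the same /rss/channel/item fields
# that the SXLEMAP above extracts. Values are made up for illustration.
rss = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item>
    <title>Post one</title>
    <link>https://example.com/post-one</link>
    <pubDate>Wed, 10 Apr 2019 17:36:27 +0000</pubDate>
  </item>
</channel></rss>"""

root = ET.fromstring(rss)
items = [
    {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "pubDate": item.findtext("pubDate"),
    }
    for item in root.findall("./channel/item")  # same path as the XPath map
]
```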
Because WordPress feeds return just the most recent 25 items by default, I need to use the “paged=” directive to go deeper into the archive and return older items. I used a simple SAS macro loop to iterate through 5 pages (125 items) in this example. Note how I specified the XMLv2 libname with the XMLMAP= option to include my custom map. That ensures that SAS will read the XML and build the table as I’ve designed it.
My final DATA step in this part is to recast the pubDate field (a text field by default) into a proper SAS date.
/* WordPress feeds return data in pages, 25 entries at a time */
/* So using a short macro to loop through past 5 pages, or 125 items */
%macro getItems;
  %do i = 1 %to 5;
    filename feed temp;
    proc http
      method="get"
      url="https://www.starwars.com/news/feed?paged=&i."
      out=feed;
    run;
    libname result XMLv2 xmlfileref=feed xmlmap=rssmap;
    data posts_&i.;
      set result.item;
    run;
  %end;
%mend;

%getItems;

/* Assemble all pages of entries */
/* Cast the date field into a proper SAS date */
/* Have to strip out the default day name abbreviation */
/* "Wed, 10 Apr 2019 17:36:27 +0000" -> 10APR2019 */
data allPosts;
  set posts_:;
  length sasPubdate 8;
  sasPubdate = input( substr(pubDate,4), anydtdtm.);
  format sasPubdate dtdate9.;
  drop pubDate;
run;
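The two jobs here — building the paged request URLs and casting the RFC 822 pubDate string into a real date — translate directly to other languages. A hedged Python sketch (no network calls; the URLs are only constructed, and the standard library does the date parsing that ANYDTDTM does in SAS):

```python
from email.utils import parsedate_to_datetime

# Build the five paged request URLs (25 items per page ~ 125 items total).
# In a real run, each URL would be fetched with an HTTP client.
base = "https://www.starwars.com/news/feed"
urls = [f"{base}?paged={i}" for i in range(1, 6)]

# Cast the RSS pubDate string into a real datetime -- the same job the
# final DATA step does with the ANYDTDTM informat.
stamp = parsedate_to_datetime("Wed, 10 Apr 2019 17:36:27 +0000")
```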
After gathering the data I need from RSS feeds, I use SAS to match that with the WordPress data that I have about our blogs. I can then generate a table that I can easily submit in a DMCA form.
Usually, matching by “article title” is the easiest method. However, sometimes the infringing site will alter the titles a little bit or even make small adjustments to the body of the article. (This reminds me of my college days in computer science, when struggling students would resort to cheating by copying someone else’s program, but change just the variable names. It’s a weak effort.) With the data in SAS, I’ve used several other techniques to detect the “distance” of a potentially infringing post from the original work.
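The author keeps his exact matching code to himself, but one generic way to score the “distance” between two titles (my illustration, not his method) is a string-similarity ratio — near 1.0 flags a lightly edited copy, near 0 an unrelated post:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity score between two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical titles for illustration: a copier often swaps a few words.
original  = "How SAS uses SAS to analyze SAS blogs"
copied    = "How XYZ uses SAS to analyze XYZ blogs"
unrelated = "Ten tips for better data visualization"
```

A high score for the tweaked title and a low score for the unrelated one lets you rank candidate matches automatically.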
Maybe you want to see that code. But you can’t expect me to reveal all of my secrets, can you?
* props to Robert “I dare you to knock it off” Conrad.
The post Read RSS feeds with SAS using XML or JSON appeared first on The SAS Dummy.
This post was kindly contributed by The SAS Dummy - go there to comment and to read the full post. |
In numerical linear algebra, there are often multiple ways to solve a problem, and each way is useful in various contexts. In fact, one of the challenges in matrix computations is choosing from among different algorithms, which often vary in their use of memory, data access, and speed. This article describes four ways to compute the sum of squares and crossproducts matrix, which is usually abbreviated as the SSCP matrix. The SSCP matrix is an essential matrix in ordinary least squares (OLS) regression. The normal equations for OLS are written as (X`*X)*b = X`*Y, where X is a design matrix, Y is the vector of observed responses, and b is the vector of parameter estimates, which must be computed. The X`*X matrix (pronounced “X-prime-X”) is the SSCP matrix and the topic of this article.
If you are performing a least squares regression of “long” data (many observations but few variables), forming the SSCP matrix consumes most of the computational effort. In fact, the PROC REG documentation points out that for long data “you can save 99% of the CPU time by reusing the SSCP matrix rather than recomputing it.”
That is because the SSCP matrix for an n x p data matrix is a symmetric p x p matrix. When n ≫ p, forming the SSCP matrix requires computing with a lot of rows. After it is formed, it is relatively simple to solve a p x p linear system.
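To make the normal-equations step concrete, here is a minimal sketch in Python/NumPy (my own illustration — the post itself uses SAS/IML): form the p x p SSCP matrix from long data and solve (X`*X)*b = X`*Y. With noiseless simulated data, the solve recovers the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 3                       # "long" data: n >> p
X = rng.normal(size=(n, p))          # design matrix
b_true = np.array([2.0, -1.0, 0.5])
Y = X @ b_true                       # noiseless responses

SSCP = X.T @ X                       # p x p: cheap to solve once formed
b = np.linalg.solve(SSCP, X.T @ Y)   # normal equations (X'X) b = X'Y
```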
Conceptually, the simplest way to compute the SSCP matrix is by multiplying the matrices in an efficient manner. This is what the SAS/IML matrix language does. It recognizes that X`*X is a special kind of matrix multiplication and uses an efficient algorithm to form the product. However, this approach requires that you be able to hold the entire data matrix in RAM, which might not be possible when you have billions of rows.
The following SAS DATA Step creates a data matrix that contains 426 rows and 10 variables. The PROC IML step reads the data into a matrix and forms the SSCP matrix:
/* remove any rows that contain a missing value:
   https://blogs.sas.com/content/iml/2015/02/23/complete-cases.html */
data cars;
  set sashelp.cars;
  if not nmiss(of _numeric_);
run;

proc iml;
use cars;
read all var _NUM_ into X[c=varNames];
close;

/* If you want, you can add an intercept column:
   X = j(nrow(X),1,1) || X; */
n = nrow(X);
p = ncol(X);
SSCP = X`*X;   /* 1. Matrix multiplication */
print n p, SSCP;
Although the data has 426 rows, it only has 10 variables. Consequently, the SSCP matrix is a small 10 x 10 symmetric matrix. (To simplify the code, I did not add an intercept column, so technically this SSCP matrix would be used for a no-intercept regression.)
The (i,j)th element of the SSCP matrix is the inner product of the i_th column and the j_th column.
In general, Level 1 BLAS computations (inner products) are not as efficient as Level 3 computations (matrix products). However, if you have the data variables (columns) in an array of vectors, you can compute the p(p+1)/2 elements of the SSCP matrix by using the following loops over columns. This computation assumes that you can hold at least two columns in memory at the same time:
/* 2. Compute SSCP[i,j] as an inner-product of i_th and j_th columns */
Y = j(p, p, .);
do i = 1 to p;
  u = X[,i];             /* u = i_th column */
  do j = 1 to i;
    v = X[,j];           /* v = j_th column */
    Y[i,j] = u` * v;
    Y[j,i] = Y[i,j];     /* assign symmetric element */
  end;
end;
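The same inner-product scheme in Python/NumPy (again my illustration, not the post's code): fill the lower triangle with column inner products and mirror each element, touching only two columns at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))

# Fill the lower triangle with column inner products, then mirror.
Y = np.empty((p, p))
for i in range(p):
    u = X[:, i]              # i-th column
    for j in range(i + 1):
        v = X[:, j]          # j-th column
        Y[i, j] = u @ v      # inner product
        Y[j, i] = Y[i, j]    # symmetric element
```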
The third approach is to compute the SSCP matrix as a sum of outer products of rows.
Before I came to SAS, I considered the outer-product method to be inferior to the other two. After all, you need to form n matrices (each p x p) and add these matrices together. This did not seem like an efficient scheme. However, when I came to SAS I learned that this method is extremely efficient for dealing with Big Data because you never need to keep more than one row of data in memory! A SAS procedure like PROC REG has to read the data anyway, so as it reads each row, it also forms the outer product and updates the SSCP. When it finishes reading the data, the SSCP is fully formed and ready to solve!
I’ve recently been working on parallel processing, and the outer-product SSCP is ideally suited for reading and processing data in parallel. Suppose you have a grid of G computational nodes, each holding part of a massive data set. If you want to perform a linear regression on the data, each node can read its local data and form the corresponding SSCP matrix. To get the full SSCP matrix, you merely need to add the G SSCP matrices together, which are relatively small and thus cheap to pass between nodes. Consequently, any algorithm that uses the SSCP matrix can greatly benefit from a parallel environment when operating on Big Data. You can also use this scheme for streaming data.
For completeness, here is what the outer-product method looks like in SAS/IML:
/* 3. Compute SSCP as a sum of rank-1 outer products of rows */
Z = j(p, p, 0);
do i = 1 to n;
  w = X[i,];   /* Note: you could read the i_th row; no need to have all rows in memory */
  Z = Z + w` * w;
end;
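The row-at-a-time accumulation in Python/NumPy (my illustration): each iteration adds one rank-1 update, so only the current row needs to be "in memory."

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))

# Accumulate one rank-1 outer-product update per row.
Z = np.zeros((p, p))
for i in range(n):
    w = X[i, :]              # only this row is needed
    Z += np.outer(w, w)
```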
For simplicity, the previous algorithm works on one row at a time. However, it can be more efficient to process multiple rows. You can easily buffer a block of k rows and perform an outer product of the partial data matrix. The value of k depends on the number of variables in the data, but typically the block size, k, is dozens or hundreds.
In a procedure that reads a data set (either serially or in parallel), each operation would read k observations except, possibly, the last block, which would read the remaining observations.
The following SAS/IML statements loop over blocks of k=10 observations at a time.
You can use the FLOOR-MOD trick to find the total number of blocks to process, assuming you know the total number of observations:
/* 4. Compute SSCP as the sum of rank-k outer products of k rows */
/* number of blocks: https://blogs.sas.com/content/iml/2019/04/08/floor-mod-trick-items-to-groups.html */
k = 10;                                      /* block size */
numBlocks = floor(n / k) + (mod(n, k) > 0);  /* FLOOR-MOD trick */
W = j(p, p, 0);
do i = 1 to numBlocks;
  idx = (1 + k*(i-1)) : min(k*i, n);  /* indices of the next block of rows to process */
  A = X[idx,];                        /* partial data matrix: k x p */
  W = W + A` * A;
end;
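The blocked version in Python/NumPy (my illustration), including the FLOOR-MOD trick for counting blocks when n is not a multiple of k; the final block simply takes whatever rows remain.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 53, 4, 10            # n deliberately not a multiple of k
X = rng.normal(size=(n, p))

num_blocks = n // k + (n % k > 0)     # FLOOR-MOD trick
W = np.zeros((p, p))
for b in range(num_blocks):
    A = X[b*k : min((b+1)*k, n), :]   # next block of up to k rows
    W += A.T @ A                      # rank-k update
```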
All computations result in the same SSCP matrix. The following statements compute the sum of squares of the differences between elements of X`*X (as computed by using matrix multiplication) and the other methods. The differences are zero, up to machine precision.
diff = ssq(SSCP - Y) || ssq(SSCP - Z) || ssq(SSCP - W);
if max(diff) < 1e-12 then
  print "The SSCP matrices are equivalent.";
print diff[c={Y Z W}];
In summary, there are several ways to compute a sum of squares and crossproducts (SSCP) matrix. If you can hold the entire data in memory, a matrix multiplication is very efficient. If you can hold two variables of the data at a time, you can use the inner-product method to compute individual cells of the SSCP. Lastly, you can process one row at a time (or a block of rows) and use outer products to form the SSCP matrix without ever having to hold large amounts of data in RAM. This last method is good for Big Data, streaming data, and parallel processing of data.
The post 4 ways to compute an SSCP matrix appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post. |
Natural language understanding (NLU) is a subfield of natural language processing (NLP) that enables machine reading comprehension. While both deal with human language, NLU goes beyond the structural understanding of language to interpret intent, resolve context and word ambiguity, and even generate human language on its own. NLU is designed for communicating with non-programmers – to understand their intent and act on it. NLU algorithms tackle the extremely complex problem of semantic interpretation – that is, understanding the intended meaning of spoken or written language, with all the subtleties of human error, such as mispronunciations or fragmented sentences.
After your data has been analyzed by NLP to identify parts of speech, etc., NLU utilizes context to discern the meaning of fragmented and run-on sentences and to execute intent. For example, imagine a voice command to Siri or Alexa:
Siri / Alexa play me a …. um song by … um …. oh I don’t know …. that band I like …. the one you played yesterday …. The Beach Boys … no the bass player … Dick something …
What are the chances of Siri / Alexa playing a song by Dick Dale? That’s where NLU comes in.
NLU reduces the human speech (or text) into a structured ontology – a data model comprising a formal, explicit definition of the semantics (meaning) and pragmatics (purpose or goal). The algorithms pull out such things as intent, timing, location and sentiment.
The above example might break down into:
Play song [intent] / yesterday [timing] / Beach Boys [artist] / bass player [artist] / Dick [artist]
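As a toy illustration of that slot-filling idea (entirely my own sketch — real NLU systems use trained statistical models, not keyword lookups), a few hand-written rules can map the rambling utterance onto the same intent/timing/artist structure:

```python
# Toy keyword rules standing in for a trained NLU model. The point is the
# structured output (slots), not the matching method.
SLOT_RULES = {
    "intent": {"play": "play song"},
    "timing": {"yesterday": "yesterday"},
    "artist": {"beach boys": "Beach Boys",
               "bass player": "bass player",
               "dick": "Dick"},
}

def fill_slots(utterance):
    """Map an utterance onto {slot: [values]} using simple keyword rules."""
    text = utterance.lower()
    slots = {}
    for slot, rules in SLOT_RULES.items():
        hits = [value for key, value in rules.items() if key in text]
        if hits:
            slots[slot] = hits
    return slots

slots = fill_slots("play me a song by that band I like, the one you played "
                   "yesterday, The Beach Boys, no the bass player, Dick something")
```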
By piecing together this information you might just get the song you want!
NLU has many important implications for businesses and consumers alike. Here are some common applications:
Imagine the power of an algorithm that can understand the meaning and nuance of human language in many contexts, from medicine to law to the classroom. As the volumes of unstructured information continue to grow exponentially, we will benefit from computers’ tireless ability to help us make sense of it all.
Further Resources:
Natural Language Processing: What it is and why it matters
White paper: Text Analytics for Executives: What Can Text Analytics Do for Your Organization?
SAS® Text Analytics for Business Applications: Concept Rules for Information Extraction Models, by Teresa Jade, Biljana Belamaric Wilsey, and Michael Wallis
Unstructured Data Analysis: Entity Resolution and Regular Expressions in SAS®, by Matthew Windham
So, you’ve figured out NLP but what’s NLU? was published on SAS Users.
This post was kindly contributed by SAS Users - go there to comment and to read the full post. |