“SAS is pleased to announce that we are the first software vendor to achieve the distinction of ODPi Interoperability,” said Craig Rubendall, VP of Platform R&D at SAS. Craig, who was recently installed as chair of the ODPi board, continued: “By declaring that SAS interfaces with Apache Hadoop in demonstrable, standard ways, we can reduce our customers' risk, simplify testing complexity, and speed time to value for anyone building or deploying SAS applications.”

To learn more about SAS and how it has embraced Hadoop, download the e-book *An Early Adopter's Guide to Hadoop*.

- **Consistency.** There is a lack of consistency in the way outcomes are presented.
- **Lack of relevance.** Often the business doesn’t see the relevance of analytics in its decision making, which may be linked to the style in which the outcomes are communicated.
- **Imbalance.** We see an imbalance in the level of detail presented; it’s either too much or too little.
- **Competency.** Today’s analytical professionals lack competency in business storytelling, with most of their time spent on analysis rather than communication.

- How do you use storytelling capabilities (like journalism) to present new knowledge in a format that your audience can understand for effective decision making?
- Should organisations provide a journalism course for all their data scientists?

The course is targeted at experienced R users who have little to no experience with SAS programming. You will begin with basic SAS programming skills such as reading in external data, creating new variables and functions, and generating statistical graphics. After mastering the basics, you will gain experience working with descriptive and inferential procedures such as linear models, generalized linear models, and mixed models. Finally, the course ends with an in-depth look at the Interactive Matrix Language and statistical simulation. Other topics sprinkled throughout the course include the Output Delivery System for customizing results and macro programming for unsupervised scripting.

First, SAS University Edition is FREE for students, teachers, and independent learners. SAS University Edition gives users access to Base SAS (the foundation of all SAS software, which allows users to easily manage data), SAS/ACCESS (several tools for easily accessing external data), SAS/STAT (a wide variety of statistical methods and techniques), SAS/IML (a matrix language for more specialized analyses), and SAS/ETS (a suite of time series forecasting procedures). Second, SAS/IML is an extensive matrix language that can be used to customize analyses, create complex functions, and conduct a variety of simulations.

At first glance, SAS programming looks very different from R. When it comes to preparing your data, which is most of the effort involved in any analysis, the SAS DATA step and procedures can make the process very easy. R users who come to SAS will find SAS/IML (the matrix language designed for statistical programmers) very familiar and easy to transition into. In addition, SAS offers a powerful development environment, award-winning documentation, and a robust user community with members who are eager to help.

SAS/IML provides a way to reuse R packages within your SAS programs. Even though the SAS language has procedures and functions that cover the features that most R programmers rely on, sometimes you might find a particular algorithm or technique that you want to try from an R package. In that case, SAS/IML allows you to "submit" programs to your instance of R and bring the results back into SAS for further analysis and reporting.
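As a sketch of what that interoperability can look like (this example is illustrative, not from the course materials, and assumes SAS was started with the RLANG system option and that R is installed and configured), a PROC IML session can pass a data set to R, run R code, and pull the result back:

```
proc iml;
   /* send a SAS data set to R as a data frame named "cars" */
   call ExportDataSetToR("Sashelp.Cars", "cars");

   submit / R;            /* everything until ENDSUBMIT runs in R */
      fit <- lm(MPG_City ~ Weight, data = cars)
      b <- coef(fit)      # R vector of regression coefficients
   endsubmit;

   /* bring the R result back into a SAS/IML matrix */
   call ImportMatrixFromR(coef, "b");
   print coef;
quit;
```

The `lm` model here is just a placeholder; the same SUBMIT/ENDSUBMIT pattern works for any R package you load inside the R block.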

This course is taught entirely in SAS Studio, the newest SAS interface and the interface used by SAS University Edition. The course will be structured in a seminar style via live web. The instructor will lecture from 1 p.m. to approximately 4 p.m. ET (with breaks and short polls/quizzes) across five days, presenting new ideas and programming concepts and conducting demonstrations to familiarize you with syntax and results. From approximately 4 p.m. to 5 p.m. you will be able to practice the day's SAS programming concepts with a variety of exercises.

A recent student, Harry Fuller said, “This was an incredibly useful course. It showed me how to do things both in SAS and in R that I have always wanted to learn how to code. Even if you don't know R, this is a crash course in how to do everything that a graduate student or young professional could ever need.”

The upcoming Live Web course dates are:

- Oct. 31-Nov. 4 (1-4:30 p.m. EDT)
- Nov. 28-Dec. 2 (1-4:30 p.m. EDT)
- Dec. 19-23 (1-4:30 p.m. EDT)

You can learn more and register online. Seats are limited. Did we mention it’s free?!

- Dashboard Support
- More Profilers
- Reference Ranges
- Value Labels
- Value Ordering
- Pinned Tooltips
- Hover Pictures

*#StatWisdom: How to create an ogive (rhymes with 'slow jive')*

An ogive is also called a cumulative histogram. You can create an ogive from a histogram by accumulating the frequencies (or relative frequencies) in each histogram bin. The height of an ogive curve at *x* is found by summing the heights of the histogram bins to the left of *x*.

A histogram estimates the density of a distribution; the ogive estimates the cumulative distribution. Both are easy to construct by hand. Both are coarse estimates that depend on your choice of bin width and anchor position.
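As a minimal illustration with made-up numbers: if three bins have relative frequencies 0.2, 0.5, and 0.3, the ogive heights at the right-hand endpoints are the running totals 0.2, 0.7, and 1.0. A sum statement in a short DATA step performs exactly this accumulation:

```
/* hypothetical bins: right endpoint and relative frequency of each */
data MiniOgive;
   input rightEnd prop;
   cumProp + prop;     /* sum statement: running total = ogive height */
   datalines;
10 0.2
20 0.5
30 0.3
;

proc print data=MiniOgive noobs;
run;
```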

Ogives are not used much by professional statisticians because modern computers make it easy to compute and visualize the exact empirical cumulative distribution function (ECDF). However, if you are a student learning to analyze data by hand, an ogive is an easy way to approximate the ECDF from binned data. Ogives are also useful when you do not have access to the original data but do have a histogram that appeared in a published report. (See "How to approximate a distribution from published quantiles.")

To demonstrate the construction of an ogive, let's consider the distribution of the MPG_CITY variable in the Sashelp.Cars data set. This variable contains the reported fuel efficiency (in miles per gallon) for 428 vehicle models. The following call to PROC UNIVARIATE in Base SAS uses the OUTHIST= option in the HISTOGRAM statement to create a data set that contains the frequencies and relative frequencies of each bin. By default, the frequencies are reported for the midpoints of the intervals. To create an ogive you need the *endpoints* of each bin, so use the ENDPOINTS option as follows:

```
proc univariate data=Sashelp.Cars;
   var MPG_City;
   histogram MPG_City / grid vscale=proportion ENDPOINTS OUTHIST=OutHist;
   /* cdfplot MPG_City / vscale=proportion; */  /* optional: create an ECDF plot */
run;
```

The histogram shows that most vehicles get between 15 and 25 mpg in the city. The distribution is skewed to the right, with a few vehicles getting as much as 59 or 60 mpg. A few gas-guzzling vehicles get less than 15 mpg.

You can construct an ogive from the relative frequencies in the 11 histogram bins.
The height of the ogive at *x*=10 (the leftmost endpoint in the histogram) is zero. The height at *x*=15 is the height of the first bar. The height at *x*=20 is the sum of the heights of the first two histogram bars, and so on.

Each row in the OutHist data set contains a left-hand endpoint and the relative frequency (height) of the bar.
However, to construct an ogive, you need to associate the bar height with the *right-hand* endpoints. This is because at the left-hand endpoint none of the density for the bin has accumulated, and for the right-hand endpoint all of the density has accumulated.

Consequently, to construct an ogive from the OUTHIST= data set, you can do the following:

- Associate zero with the leftmost endpoint of the bins.
- Adjust the counts and proportions in the OutHist data so that they are associated with the right-hand endpoint of each bin. You can use the LAG function to do this.
- Accumulate the relative frequencies in each bin to form the cumulative frequencies.
- Add a new observation to the OutHist data that contains the rightmost endpoint of the bins.

The following SAS DATA step carries out these adjustments:

```
data Ogive;
   set OutHist end=EOF;
   ogiveX = _MinPt_;            /* left endpoint of bin */
   dx = dif(ogiveX);            /* compute bin width */
   prop = lag(_OBSPCT_);        /* move relative frequency to RIGHT endpoint */
   if _N_ = 1 then prop = 0;    /* replace missing value by 0 for first obs */
   cumProp + prop/100;          /* accumulate proportions */
   output;
   if EOF then do;              /* append RIGHT endpoint of final bin */
      ogiveX = ogiveX + dx;
      cumProp = 1;
      output;
   end;
   drop dx _:;                  /* drop variables that begin with underscore */
run;
```

The Ogive data set contains all the information that you need to graph an ogive. The following call to PROC SGPLOT uses a VLINE statement, which treats the endpoints of the bins as discrete values. You could also use the SERIES statement, which treats the endpoints as a continuous variable, but might not put a tick mark at each bin endpoint.

```
title "Cumulative Distribution of Binned Values (Ogive)";
proc sgplot data=Ogive;
   vline ogiveX / response=cumProp markers;
   /* series x=ogiveX y=cumProp / markers; */  /* alternative: continuous axis */
   xaxis grid label="Miles per Gallon (City)";
   yaxis grid values=(0 to 1 by 0.1) label="Cumulative Proportion";
run;
```

You can use the graph to estimate the percentiles of the data. For example:

- The 20th percentile is approximately 17 because the curve appears to pass through the point (17, 0.20). In other words, about 20% of the vehicles get 17 mpg or less.
- The 50th percentile is approximately 19 because the curve appears to pass through the point (19, 0.50).
- The 90th percentile is approximately 27 because the curve appears to pass through the point (27, 0.90). Only 10% of the vehicles have a fuel efficiency greater than 27 mpg.
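One way to check these visual estimates (a quick sketch, not part of the original analysis) is to compute the exact sample percentiles with the OUTPUT statement of PROC UNIVARIATE, which accepts arbitrary percentile points via PCTLPTS=:

```
proc univariate data=Sashelp.Cars noprint;
   var MPG_City;
   output out=Pctls pctlpts=20 50 90 pctlpre=P;  /* creates P20, P50, P90 */
run;

proc print data=Pctls noobs;
run;
```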

You might wonder how well the ogive approximates the empirical CDF. The following graph overlays the ogive and the ECDF for these data. You can see that the two curves agree closely at the ogive values, shown by the markers. However, there is some deviation because the ogive assumes a linear accumulation (a uniform distribution) of data values within each histogram bin. Nevertheless, this coarse piecewise linear curve based on binned data does a good job of showing the basic shape of the empirical cumulative distribution.
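The overlay is easy to reproduce. One way (a sketch that assumes the Ogive data set created earlier) is to compute the ECDF in a DATA step and draw both curves with PROC SGPLOT:

```
/* ECDF: sort the raw values; the ECDF at each value is the
   proportion of observations less than or equal to it */
proc sort data=Sashelp.Cars(keep=MPG_City) out=Sorted;
   by MPG_City;
run;

data ECDF;
   set Sorted nobs=nTotal;
   ecdf = _N_ / nTotal;
run;

data Both;                     /* combine ECDF and ogive for the overlay */
   set ECDF Ogive;
run;

proc sgplot data=Both;
   step x=MPG_City y=ecdf / legendlabel="ECDF";
   series x=ogiveX y=cumProp / markers legendlabel="Ogive";
   xaxis grid label="Miles per Gallon (City)";
   yaxis grid label="Cumulative Proportion";
run;
```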
