- Data preparation removes hand-coding and manual manipulation of data and replaces it with a more user-friendly environment that reduces the reliance on IT.
- Data preparation controls introduce much-needed data quality measures, checks and balances into the big data landscape.
- Data preparation enables a multiphase approach that can encompass data discovery and *mashing* of data to find the optimal sources and formats, while delivering the greater stability of ongoing information chains required for high-quality, frequent analytics and management reporting.


- Smart meters and the data they produce create many opportunities for utilities to enhance consumers' lives in the form of money-saving rate plans (maintain comfort while saving money) and the ability to connect to grid-scale renewables (save the planet!).
- This connectivity and data can also provide a gateway to partnerships with third-party service or device providers. The key is to sell the benefits (comfort, convenience, cost savings), not the technology (“home automation”).

I do not discuss survey data in this article. Survey statisticians use weights to make valid inferences from survey data; see the SAS documentation to learn how to use weights to estimate variances in complex survey designs.


Before we calculate a weighted statistic, let's remember that a weight variable is not the same as a frequency variable. A frequency variable, which associates a positive integer with each observation, specifies that each observation is replicated a certain number of times. There is nothing unintuitive about the statistics that arise from including a frequency variable. They are the same as those you would obtain by duplicating each record according to the value of the frequency variable.
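
A quick sketch (in Python rather than SAS, with made-up values) makes that equivalence concrete:

```python
# A Python (not SAS) sketch: a frequency variable is equivalent to
# physically duplicating each record. Values and frequencies are made up.
values = [10.0, 20.0, 30.0]
freqs  = [2, 3, 1]                 # each value is replicated freq times

# mean computed from the frequency variable
freq_mean = sum(v * f for v, f in zip(values, freqs)) / sum(freqs)

# mean of the explicitly duplicated records
expanded = [v for v, f in zip(values, freqs) for _ in range(f)]
expanded_mean = sum(expanded) / len(expanded)

assert freq_mean == expanded_mean  # identical, as claimed
```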

Weights are not frequencies. Weights can be fractional values. When comparing weighted and unweighted analyses, the key idea is this: an unweighted analysis is equivalent to a weighted analysis in which the weights are all 1. An "unweighted analysis" is really a misnomer; it should be called an "equally weighted" analysis!

In the computational formulas that SAS uses for weighted percentiles, the weights are divided by the sum of the weights. Therefore only the relative weights matter, and the formulas simplify if you choose weights that sum to 1. For the remainder of this article, assume that the weights sum to unity and that an unweighted analysis has weights equal to 1/*n*.

To understand how weights change the computation of percentiles, let's review the standard unweighted computation of the empirical percentiles (or quantiles) of a set of *n* numbers. First, sort the data values from smallest to largest. Then construct the empirical cumulative distribution function (ECDF). Recall that the ECDF is a piecewise-constant step function that increases by 1/*n* at each data point. The quantity 1/*n* reflects the fact that each observation is weighted equally in this analysis.

The quantile function is derived from the CDF function, and
the quantile function for a discrete distribution is also a step function.
You can use the graph of the ECDF to compute the quantiles. For example, suppose your data are

{ 1, 1.9, 2.2, 3, 3.7, 4.1, 5 }

The following graph shows the ECDF for these seven values:

The data values are indicated by tick marks along the horizontal axis. Notice that the ECDF jumps by 1/7 at each data value because there are seven unique values.

I should really show you the graph of the quantile function (an "inverse function" to the CDF), but you can visualize the graph of the quantile function if you rotate your head clockwise by 90 degrees. To find a quantile, start at some value on the Y axis, move across until you hit a vertical line, and then drop down to the X axis to find the datum value. For example, to find the 0.2 quantile (=20th percentile), start at Y=0.2 and move right horizontally until you hit the vertical line over the datum 1.9. Thus 1.9 is the 20th percentile. Similarly, the 0.6 quantile is the data value 3.7. (I omit details about what to do if you hit a horizontal line.)
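
That read-off-the-graph procedure can be sketched in code. The following Python (not SAS) snippet is a minimal illustration of the rule "return the first sorted value whose ECDF exceeds *p*"; note that SAS supports several percentile definitions, so its results can differ at boundary cases:

```python
# Python (not SAS) sketch of reading a percentile off the unweighted ECDF.
def ecdf_quantile(data, p):
    """Return the first sorted value whose ECDF value exceeds p."""
    xs = sorted(data)
    n = len(xs)
    for i, x in enumerate(xs, start=1):
        if i / n > p:              # the ECDF jumps by 1/n at each data point
            return x
    return xs[-1]

data = [1, 1.9, 2.2, 3, 3.7, 4.1, 5]   # the seven values from this article
print(ecdf_quantile(data, 0.2))        # 1.9, as read from the graph
print(ecdf_quantile(data, 0.6))        # 3.7
```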

Of course, SAS can speed up this process. The following call to PROC MEANS displays the 20th, 40th, 60th, and 80th percentiles:

```
data A;
input x wt;
datalines;
1 0.25
1.9 0.05
2.2 0.15
3.0 0.25
3.7 0.15
4.1 0.10
5 0.05
;
proc means data=A p20 p40 p60 p80;
var x; /* unweighted analysis: data only */
run;
```

The previous section shows the relationship between percentile values and the graph of the ECDF. This section describes how the ECDF changes if you specify unequal weights for the data. The change is that the weighted ECDF will jump by the (standardized) weight at each data value. Because the weights sum to unity, the CDF is still a step function that rises from 0 to 1, but now the steps are not uniform in height. Instead, data that have relatively large weights produce a large step in the graph of the ECDF function.

In the previous section, the DATA step defined a weight variable. The weights for this example are

{ 0.25, 0.05, 0.15, 0.25, 0.15, 0.10, 0.05 }

The following graph shows the weighted ECDF for these weights:

By using this weighted ECDF, you can read off the weighted quantiles. For example, to find the 0.2 weighted quantile, start at Y=0.2 and move horizontally until you hit a vertical line, which is over the datum 1.0. Thus 1.0 is the 20th weighted percentile. Similarly, the 0.6 quantile is the data value 3.0. You can confirm this by calling PROC MEANS with a WEIGHT statement, as shown below:

```
proc means data=A p20 p40 p60 p80;
weight wt; /* weighted analysis */
var x;
run;
```

You can use a physical model to intuitively understand weighted percentiles. The model is the same as I used to
visualize a weighted mean. Namely, imagine a point-mass of *w*_{i} concentrated at position *x*_{i} along a massless rod. Finding a weighted percentile *p* is equivalent to finding the first location along the rod (moving from left to right) at which the proportion of the weight is greater than *p*. (I omit how to handle special percentiles for which the proportion is equal to *p*.)
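
As a quick check of this rule, here is a Python (not SAS) sketch that walks the rod from left to right and returns the first data value at which the cumulative weight exceeds *p*, using the data and weights from this article:

```python
# Python (not SAS) sketch of the point-mass rule: the weighted p_th quantile
# is the first position at which the cumulative proportion of weight exceeds p.
def weighted_quantile(xs, ws, p):
    """xs sorted ascending; ws are nonnegative weights that sum to 1."""
    cum = 0.0
    for x, w in zip(xs, ws):
        cum += w                   # accumulate mass from left to right
        if cum > p:
            return x
    return xs[-1]

xs = [1, 1.9, 2.2, 3.0, 3.7, 4.1, 5]
ws = [0.25, 0.05, 0.15, 0.25, 0.15, 0.10, 0.05]
print(weighted_quantile(xs, ws, 0.2))  # 1, as read from the weighted ECDF
print(weighted_quantile(xs, ws, 0.6))  # 3.0
```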

The physical model looks like the following:

From the figure you can see that *x*_{1} is the *p*^{th} quantile for *p* < 0.25. Similarly, *x*_{2} is the *p*^{th} quantile for 0.25 < *p* < 0.30, and so forth.

If you want to apply these concepts to your own data, you can download the SAS program that generates the CDF graphs and computes the weighted percentiles.

- **Database Query Builders.** When your sensor data lives in a structured SAS- or ODBC-accessed database, you can quickly create queries that join multiple tables, apply filters, and preview the results so you always get the right subset to explore. Queries can be saved, shared, and even updated directly from your imported table. You can move more quickly from raw data to insight by adding summary columns and graphics scripts that run automatically after the query.
- **JMP Query Builder.** If you don’t have direct access to your database, you can still take advantage of the query, filtering, summarization and sorting tools in Query Builder by importing a set of flat files into JMP tables. After you create a query once, it can be saved as a script and rerun automatically. Just pull the most recent data, rejoin, and move forward with your analysis.
- **New Formula Column.** Often, sensor data is collected and stored simply as a time stamp, sensor name and sensor value to conserve storage space. But pairing raw data with summarized values can help you answer broader questions, clean up artifacts and discover interesting trends. Right-click any data column to add a new formula column, choosing from Date/Time transformations and summaries like Col Moving Average, which is useful for cleaning up sensor data with high-frequency noise. The formula shortcut menu offers quick access to the Group By functionality, which identifies a column of categories (for example, Day) by which to calculate new measures.
- **Virtual Join.** When working with high-resolution sensor measurements, you may not be able to spare the memory required to merge summary values directly into your raw data table. Use virtual join to link values stored in separate tables and use them together in analysis and graphing platforms. You can link and unlink data sources on demand, which makes it easier to update disparate data sources independently.
- **Process Screening.** Graphing many sensor measures can be tedious when generating control charts is a one-at-a-time process. And once you have your results, sorting through charts manually to find interesting or suspect readings is both painful and error-prone. The Process Screening platform does much of the heavy lifting up front, consolidating control chart statistics and recent shifts into a single table that you can sort interactively to identify important process variables. This helps you focus your efforts and makes it easy to drill down to critical control charts first.
- **Modeling and Cross-Validation.** At some point, you need to build a model to separate signal from noise. JMP includes a full suite of linear and nonlinear modeling tools that build useful models on their own but also link into cross-validation tools, which are crucial when evaluating observational data like sensor streams.
- **Graph Builder.** When exploring your sensor data, you need to quickly create and customize a variety of multi-element graphs to identify patterns and spot potential artifacts. The JMP Graph Builder platform provides an extensive palette of graph types and an intuitive drag-and-drop interface. Plus, you can save and reproduce any graph you create when you retrieve new data.
- **Selection Filters.** Once you have graphed your sensor data, you may want to consider the impact of changes over time, location or another variable. JMP provides traditional list-based filters for exploring custom data slices, but you can also use a summary graph as a filter for one or more graphs. Simply select elements in your filter graph to refocus your report on a category or time frame of interest.
- **Dashboards.** Dashboards help you show patterns in your data in a concise and consistent report and assist in communicating key findings to decision makers. With JMP, you can create a new dashboard from a template and drag and drop tables and reports onto the canvas. You can save and refresh your dashboard when new data is available, or export it as an interactive HTML web report for colleagues who don’t have JMP.
- **Sharing With Scripts and Add-Ins.** When you construct a workflow that gives you the results you need, you can easily share your detailed steps with other JMP users. The software automatically generates scripts to reproduce key data manipulation steps and custom graphs. As you become more experienced, you may want to develop custom menus and add-ins to walk colleagues through more complex workflows.

- At the conclusion of an A/B/n test, there is usually a winner and one or more losers.
- Is there really one superior experience for your entire marketable audience? Is it possible that experiences should vary by segment?
- Performing algorithmic segmentation sounds awesome, but who really has the time to do it? We have so many tests to run.

```
data test;
   value = "abc 123!";  /* example string */
   /* each function returns the position of the first instance, or 0 if none */
   alnum  = anyalnum(value);   /* alphanumeric */
   nalnum = notalnum(value);   /* non-alphanumeric */
   alpha  = anyalpha(value);   /* alphabetic */
   nalpha = notalpha(value);   /* non-alphabetic */
   digit  = anydigit(value);   /* digit */
   ndigit = notdigit(value);   /* non-digit */
   punct  = anypunct(value);   /* punctuation */
   npunct = notpunct(value);   /* non-punctuation */
run;
```
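
If it helps to see the "first instance" behavior outside of SAS, here is a rough Python analog (an illustration, not the SAS implementation): like the SAS functions, it returns the 1-based position of the first matching character, or 0 if there is none. The character classes are approximations of the SAS defaults.

```python
import re

def first_pos(pattern, s):
    """1-based position of the first match, or 0 if none (mimics SAS ANY*/NOT*)."""
    m = re.search(pattern, s)
    return m.start() + 1 if m else 0

value = "abc 123!"                           # made-up example string
print(first_pos(r"[A-Za-z0-9]", value))      # anyalnum analog -> 1 ('a')
print(first_pos(r"[^A-Za-z0-9]", value))     # notalnum analog -> 4 (the blank)
print(first_pos(r"[A-Za-z]", value))         # anyalpha analog -> 1 ('a')
print(first_pos(r"\d", value))               # anydigit analog -> 5 ('1')
```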

Want to learn more about these functions? At MWSUG this year you can see how they can be used along with other common SAS functions to extract numbers from a text string or how to build ISO 8601 dates.
So please join me at the MidWest SAS Users Group Conference October 9 – 11 at the Hyatt Regency in downtown Cincinnati, Ohio. Register now for three days of great educational opportunities, 100+ presentations, training, workshops, networking and more.
Hope to see you there!