The post 2 Easy Steps to Write a Great Book Outline appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
Editor’s note: This series of blogs addresses the questions we are most frequently asked at SAS Press! It is worth spending some time on this. Arguably, this is one of the most important parts of the book. The table of contents and outline provide the blueprint of your book – […]
This post was kindly contributed by SAS – r4stats.com - go there to comment and to read the full post.
In my ongoing quest to track The Popularity of Data Analysis Software, I’ve finally decided to change the title to use the newer term “data science”. The 2017 version of Gartner’s Magic Quadrant for Data Science Platforms was just published, so I have updated my IT Research Firms section, which I repeat here to save you from having to dig through the entire 40+ page tome. If your organization is looking for training in the R language, you might consider my books, R for SAS and SPSS Users or R for Stata Users, or my on-site workshops.
IT research firms study software products and corporate strategies, survey customers about their satisfaction with the products and services, and then sell their analysis of each company in reports to their clients. Each firm has its own rating criteria, so they don’t always agree. However, I find the reports extremely interesting reading. Although these reports are expensive, the companies that receive good ratings often purchase copies to give away to potential customers. An Internet search of the report title will often reveal the companies that are distributing such copies.
Gartner, Inc. is one of the companies that provides such reports. Out of the roughly 100 companies selling data science software, Gartner selected 16 which had either high revenue or lower revenue but high growth (see full report for details). After extensive input from both customers and company representatives, Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Figure 3 shows the resulting plot. Note that purely open source software is not rated by Gartner, but nearly all the software in Figure 3 includes the ability to interact with R and Python.
The Leaders quadrant is the place for companies whose future direction is in line with their customers’ needs and who have the resources to execute that vision. The four companies in the Leaders quadrant have remained the same for the last three reports: IBM, KNIME, RapidMiner, and SAS. Of these, Gartner rates IBM as having slightly greater “completeness of vision” due to the extensive integration it offers to open source software, compared to SAS Institute. KNIME and RapidMiner are quite similar, as both are driven by an easy-to-use workflow interface. Both offer free and open source versions, but RapidMiner’s is limited by a cap on the amount of data that it can analyze. IBM and SAS are market leaders based on revenue and, as we have seen, KNIME and RapidMiner are the ones with high growth.
The companies in the Visionaries quadrant are those that have good future plans but may not have the resources to execute that vision. Of these, Microsoft increased its ability to execute compared to the 2016 report, while Alpine, one of the smallest companies, declined sharply in its ability to execute. The remaining three companies in this quadrant have just been added: H2O.ai, Dataiku, and Domino Data Lab.
Those in the Challengers quadrant have ample resources but less customer confidence in their future plans. MathWorks, the maker of MATLAB, is new to the report. Quest purchased Statistica from Dell, and it appears in roughly the same position as Dell did last year.
The companies in the Niche Players quadrant offer tools that are not as broadly applicable.
In 2017 Gartner dropped coverage of Accenture, Lavastorm, Megaputer, Predixion Software, and Prognoz.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
Have you been using the SAS/Graph Gmap procedure to plot your data on maps for years, but never knew you could add roads to your maps?!? Follow along in this blog post, and I’ll teach you how… But before we get started, here’s a picture of a nice aerial view […]
The post How to add roads to your SAS maps appeared first on SAS Learning Post.
The post Quantile estimates and the difference of medians in SAS appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
Sometimes SAS programmers ask how to analyze quantiles with SAS — for example, how to estimate quantiles and their confidence intervals, or how to compare quantiles across groups.
Historically, PROC UNIVARIATE and PROC NPAR1WAY are the two procedures in SAS that analysts used for univariate analysis. PROC UNIVARIATE performs standard parametric tests. In contrast, PROC NPAR1WAY performs nonparametric tests and distribution-free analyses. An internet search reveals many resources that describe how to use UNIVARIATE and NPAR1WAY for analyzing quantiles.
However, there is an alternative way to analyze univariate quantiles: PROC QUANTREG. Although QUANTREG is designed for quantile regression, the same procedure can easily analyze quantiles of univariate data. All you need to do is omit the regressors from the right side of the MODEL statement, and the procedure will analyze the “response” variable.
Be aware that the QUANTREG procedure uses an optimization algorithm to perform its analysis. This can sometimes result in different estimates than a traditional computation. For example, if the data set has an even number of observations and the middle values are a and b, one estimate for the median is the average of the two middle values (a+b)/2. The QUANTREG procedure might provide a different estimate, which could be any value in [a, b]. This difference is most noticeable in small samples. (Don’t let this bother you too much. There are many definitions for quantile estimates. SAS supports five different definitions for calculating quantiles.)
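To see why any value in [a, b] is a legitimate median estimate, recall that the median minimizes the sum of absolute deviations, which is the criterion that quantile regression optimizes for the 0.5 quantile. A quick stdlib-only Python sketch (not part of the original post; the data values are made up) illustrates that the minimizer is not unique for an even number of observations:

```python
data = [10, 20, 30, 40]  # even sample size; middle values are a = 20 and b = 30

def sum_abs_dev(m):
    """Objective that the median minimizes: sum of |x - m| over the data."""
    return sum(abs(x - m) for x in data)

# The classic estimate is (a + b)/2 = 25, but every value in [20, 30]
# attains the same minimum, so an optimization-based procedure such as
# QUANTREG may legitimately return any point in that interval.
print(sum_abs_dev(20), sum_abs_dev(25), sum_abs_dev(30))  # all equal
print(sum_abs_dev(35))                                    # strictly larger
```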
I have previously shown how to compute confidence intervals for percentiles in SAS by using PROC UNIVARIATE. The following statements compute the 20th, 50th, and 90th percentiles for the cholesterol levels of 5209 patients in a medical study, along with 95% confidence intervals for the quantiles. The computation is shown twice: first with PROC UNIVARIATE, then with PROC QUANTREG.
/* 1. Use PROC UNIVARIATE to get 95% CIs for 20th, 50th, and 90th pctls */
proc univariate data=Sashelp.Heart noprint;
   var Cholesterol;
   output out=pctl pctlpts=20 50 90 pctlpre=p
          cipctldf=(lowerpre=LCL upperpre=UCL);   /* 12.1 options (SAS 9.3m2) */
run;

data QUni;                /* rearrange the statistics into a table */
   set pctl;
   Quantile = 0.2; Estimate = p20; Lower = LCL20; Upper = UCL20; output;
   Quantile = 0.5; Estimate = p50; Lower = LCL50; Upper = UCL50; output;
   Quantile = 0.9; Estimate = p90; Lower = LCL90; Upper = UCL90; output;
   keep Quantile Estimate Lower Upper;
run;

title "UNIVARIATE Results";
proc print noobs; run;

/**************************************/
/* 2. Alternative: Use PROC QUANTREG! */
ods select none;
ods output ParameterEstimates=QReg;
proc quantreg data=Sashelp.Heart;
   model Cholesterol = / quantile=0.2 0.5 0.9;
run;
ods select all;

title "QUANTREG Results";
proc print noobs;
   var Quantile Estimate LowerCL UpperCL;
run;
The output shows that the confidence intervals (CIs) for the quantiles are similar, although the QUANTREG intervals are slightly wider. Although UNIVARIATE can produce CIs for these data, the situation changes if you add a weight variable. The UNIVARIATE procedure supports estimates for weighted quantiles, but does not produce confidence intervals.
However, the QUANTREG procedure can provide CIs even for a weighted analysis.
In general, PROC QUANTREG can compute statistics for quantiles that UNIVARIATE cannot.
For example, you can use the ESTIMATE statement in QUANTREG to get a confidence interval for the difference between medians in two independent samples. If the confidence interval does not contain 0, you can conclude that the medians are significantly different.
The adjacent box plot shows the distribution of diastolic blood pressure for male and female patients in a medical study. Reference lines are drawn at the median values for each gender. You might want to estimate the median difference in diastolic blood pressure between male and female patients and compute a confidence interval for the difference. The following call to PROC QUANTREG estimates those quantities:
ods select ParameterEstimates Estimates;
proc quantreg data=Sashelp.Heart;
   class Sex;
   model Diastolic = Sex / quantile=0.5;
   estimate 'Diff in Medians' Sex 1 -1 / CL;
run;
The syntax should look familiar to programmers who use PROC GLM to compare the means of groups; however, this computation compares the medians of groups. The analysis indicates that female patients have a diastolic blood pressure that is 3 points lower than that of male patients. The 95% confidence interval for the difference does not include 0, so the difference is statistically significant. By changing the value of the QUANTILE= option, you can compare quantiles other than the median. No other SAS procedure provides that level of control over quantile estimation.
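For readers working outside SAS, a rough cross-check of a difference-of-medians interval can be sketched with a percentile bootstrap. This is illustrative only — it is not the method PROC QUANTREG uses — and the toy data below are hypothetical:

```python
import random
import statistics

def median_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=1):
    """Percentile-bootstrap CI for median(a) - median(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # resample each group with replacement and record the median difference
        ra = [rng.choice(a) for _ in a]
        rb = [rng.choice(b) for _ in b]
        diffs.append(statistics.median(ra) - statistics.median(rb))
    diffs.sort()
    lo = diffs[int(n_boot * alpha / 2)]
    hi = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# toy data: the second sample is shifted up by 10, so the true median
# difference is -10; the interval should land in that neighborhood
a = list(range(100))
b = [x + 10 for x in a]
lo, hi = median_diff_ci(a, b)
print(lo, hi)
```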
PROC QUANTREG provides another tool for the SAS programmer who needs to analyze quantiles. Although QUANTREG was written for quantile regression, the procedure can also analyze univariate samples. You can use the ESTIMATE statement to compare quantiles across groups and to obtain confidence intervals for the parameters.
In general, SAS regression procedures enable you to conduct univariate analyses that are not built into any univariate procedure.
The post The Other 27 SAS Numeric Missing Values appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
What?!? You mean a period (.) isn’t the only SAS numeric missing value? Well, there are 27 others: .A, .B, through .Z, and ._ (period-underscore). Your first question might be: “Why would you need more than one missing value?” One situation where multiple missing values are useful involves survey data. Suppose […]
This post was kindly contributed by SAS Programming for Data Mining - go there to comment and to read the full post.
My payment information for this custom URL is expiring, and Google told me to update it. That’s OK, I will do it. However, by switching to G Suite, Google has made paying this $10.00 so difficult that I decided it is not worth continuing the business with Google. Its email contains links that don’t work for me, taking me to web pages that do not answer the question. All the steps are very bumpy, and it is a really bad experience for customers. With this, now I understand why Google is “good at inventing technologies but bad at making products”. I will switch to my WordPress sites eventually. The dying out of sas-programming also parallels the fading of SAS from newly emerging markets (surely SAS will still thrive in the traditional analytics market for a while).
The post The distribution of colors for plain M&M candies appeared first on The DO Loop.
This post was kindly contributed by The DO Loop - go there to comment and to read the full post.
Many introductory courses in probability and statistics encourage students to collect and analyze real data. A popular experiment in categorical data analysis is to give students a bag of M&M® candies and ask them to estimate the proportion of colors in the population from the sample data. In some classes, the students are also asked to perform a chi-square analysis to test whether the colors are uniformly distributed or whether the colors match a hypothetical set of proportions.
M&M’s® have a long history at SAS. SAS is the world’s largest corporate consumer of M&M’s: every Wednesday a SAS employee visits every breakroom on campus and fills two large containers with M&M’s. This article uses SAS software to analyze the classic “distribution of colors” experiment.
The “plain” M&M candies (now called “milk chocolate M&M’s”) are produced by the Mars, Inc. company. The distribution of colors in M&M’s has a long and colorful history.
The colors and proportions occasionally change, and the distribution is different for peanut and other varieties. A few color-related incidents from my lifetime even made the national news.
The breakroom containers at SAS are filled from two-pound bags. So as to not steal all the M&M’s in the breakroom, I conducted this experiment over many weeks in late 2016 and early 2017, taking one scoop of M&M’s each week. The following data set contains the cumulative counts for each of the six colors in a sample of size N = 712:
data MandMs;
input Color $7. Count;
datalines;
Red    108
Orange 133
Yellow 103
Green  139
Blue   133
Brown   96
;
A bar chart that shows the observed distribution of colors in M&M’s is shown at the top of this article.
To estimate the proportion of colors in the population, simply divide each count by the total sample size, or use the FREQ procedure in SAS. PROC FREQ also enables you to run a chi-square test that compares the sample counts to the expected counts under a specified distribution. The most recent published distribution is from 2008, so let’s test those proportions:
proc freq data=MandMs order=data;
   weight Count;
   tables Color / nocum chisq
          /* 2008 proportions: red orange yellow green blue brown */
          testp=(0.13 0.20 0.14 0.16 0.24 0.13);
run;
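For comparison, the same goodness-of-fit test can be sketched in plain Python using only the standard library. The counts are the ones in the data set above; the chi-square critical value for 5 degrees of freedom at α = 0.05 (about 11.07) is hard-coded rather than computed:

```python
# Observed M&M counts from the article (N = 712)
observed = {"Red": 108, "Orange": 133, "Yellow": 103,
            "Green": 139, "Blue": 133, "Brown": 96}
# Published 2008 proportions, in the same order as the TESTP= option
testp = {"Red": 0.13, "Orange": 0.20, "Yellow": 0.14,
         "Green": 0.16, "Blue": 0.24, "Brown": 0.13}

n = sum(observed.values())
# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((observed[c] - n * testp[c]) ** 2 / (n * testp[c]) for c in observed)

# df = 6 - 1 = 5; the 0.05 critical value is about 11.07, so a larger
# statistic rejects the hypothesized 2008 proportions, as in the article
print(round(chi_sq, 2), chi_sq > 11.07)
```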
The observed and expected proportions are shown in the table to the right.
The chi-square test rejects the test hypothesis at the α = 0.05 significance level (95% confidence). In other words, the distribution of colors for M&M’s in this 2017 sample does NOT appear to be the same as the color distribution from 2008! You can see this visually from the bar chart: the red and green bars are too tall and the blue bar is too short compared with the expected values.
You need a large sample to be confident that this empirical deviation is real. After collecting data for a few weeks, I did a preliminary analysis that analyzed about 300 candies. With that smaller sample, the difference between the observed and expected proportions could be attributed to sampling variability and so the chi-square test did not reject the null hypothesis. However, while running that test I noticed that the green and blue colors accounted for the majority of the difference between the observed and theoretical proportions, so I decided to collect more data.
As I explained in a previous article, you can use the sample proportions to construct simultaneous confidence intervals for the population proportions.
The following SAS/IML statements load and call the functions from the previous post:
%include "conint.sas";        /* define the MultCI and MultCIPrint modules */
proc iml;
load module=(MultCI MultCIPrint);
use MandMs;
read all var {"Color" "Count"};
close;

alpha = 0.05;
call MultCIPrint(Color, Count, alpha, 2);  /* construct CIs using Goodman's method */
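If you want a rough cross-check outside SAS/IML, Bonferroni-adjusted Wilson intervals can be computed with the Python standard library. Note this is not Goodman's chi-square-based method, so the numbers will differ slightly from the MultCIPrint output; it is only a sketch using the counts from the data set above:

```python
from math import sqrt
from statistics import NormalDist

counts = {"Red": 108, "Orange": 133, "Yellow": 103,
          "Green": 139, "Blue": 133, "Brown": 96}
n = sum(counts.values())            # 712
k = len(counts)
alpha = 0.05
# Bonferroni adjustment: split alpha across the k simultaneous intervals
z = NormalDist().inv_cdf(1 - alpha / (2 * k))

def wilson(x):
    """Wilson score interval for a binomial proportion x/n."""
    p = x / n
    center = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson(counts["Blue"])
# the published 2008 blue proportion (0.24) falls above this interval
print(round(lo, 3), round(hi, 3), hi < 0.24)
```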
The table indicates that the published 2008 proportion for blue (0.24) is far outside the 95% confidence interval, and the proportion for green (0.16) is just barely inside its interval. That by itself does not prove that the 2008 proportions are no longer valid (we might have gotten unlucky during sampling), but combined with the earlier chi-square test, it seems unlikely that the 2008 proportions are applicable to these data.
The published 2008 proportions for green and blue do not match the sample proportions. For this large sample, the published proportion of blue is too large, whereas the published proportion of green is too small.
From reading previous articles, I know that the Customer Care team at M&M/Mars is very friendly and responsive. Apparently they get asked about the distribution of colors quite often, so I sent them a note. The next day they sent a breakdown of the colors for all M&M candies.
Interestingly, plain (and peanut) M&M’s are now produced at two different factories in the US, and the factories do not use the same mixture of colors! You need to look on the packaging for the manufacturing code, which is usually stamped inside a rectangle. In the middle of the code will be the letters HKP or CLV. For example, the code might read 632GCLV20.
Although I did not know about the manufacturing codes when I collected the data, I think it is clear that the bulk of my data came from the CLV plant.
You can create a graph that shows the sample proportions, the 95% simultaneous confidence intervals, and vertical hash marks to indicate the CLV population parameters, as follows:
The graph shows that the observed proportions are close to the proportions from the CLV plant. All proportions are well within the 95% simultaneous confidence intervals from the data. If you rerun the PROC FREQ chi-square analysis with the CLV proportions, the test does not reject the null hypothesis.
The experimental evidence indicates that the colors of plain M&M’s in 2017 do not match the proportions that were published in 2008.
After contacting the M&M/Mars Customer Care team, I was sent a new set of proportions for 2017.
The color proportions now depend on where the candies were manufactured. My data matches the proportion of colors from the Cleveland plant (manufacturing code CLV).
If you are running this analysis yourself, be sure to record whether your candies came from the HKP or CLV plant. If you want to see my analysis, you can download the complete SAS program that analyzes these data.
Educators who use M&M’s to teach probability and statistics need to record the manufacturing plant, but this is still a fun (and yummy!) experiment. What do you think? Do you prefer the almost-equal orange-blue-green distribution from the CLV plant? Or do you like the orange-blue dominance from the HKP plant? Or do you just enjoy the crunchy shell and melt-in-your-mouth goodness, regardless of what colors the candies are?
This post was kindly contributed by Software & Service - go there to comment and to read the full post.
As a former statistician and a current engineer, I feel that a successful engineering project may require both the mathematician’s ability to find the abstraction and the engineer’s ability to find the implementation.
For a typical engineering problem, the steps are usually –
– 1. Abstract the problem with a formula or some pseudocodes
– 2. Solve the problem with the formula
– 3. Iterate the initial solution until it achieves the optimal time complexity and space complexity
I feel that a mathematician would like dynamic programming (DP) questions most, because they closely resemble the typical deduction questions in math. An engineer may find them challenging, since they require imagination and some sense of math.
The formula is the most important part: without it, trial and error or debugging does not help. Once the formula is figured out, the rest becomes a piece of cake. However, sometimes things are not that straightforward. Good mathematics does not always lead to good engineering.
Let’s see one question from Leetcode.
You are given a list of non-negative integers, a1, a2, ..., an, and a target, S. Now you have 2 symbols + and -. For each integer, you should choose one from + and - as its new symbol.
Find out how many ways to assign symbols to make sum of integers equal to target S.
Example 1:
Input: nums is [1, 1, 1, 1, 1], S is 3.
Output: 5
Explanation:
-1+1+1+1+1 = 3
+1-1+1+1+1 = 3
+1+1-1+1+1 = 3
+1+1+1-1+1 = 3
+1+1+1+1-1 = 3
There are 5 ways to assign symbols to make the sum of nums be target 3.
For each of the element of a list, it has two options: plus or minus. So the question asks how many ways to get a special number by all possible paths. Of course, if the sum of numbers is unrealistic, we just need to return 0.
Sounds exactly like a DP question. With a pencil and paper, we can start to explore the relationship between dp(n) and dp(n-1). For example, suppose our goal is to get a sum of 5 and we are given the list [1, 1, 1, 1, 1]. If some paths through the smaller tuple (1, 1, 1, 1) reach 4, that is exactly what we want, since adding the last 1 makes 5. Similarly, paths that reach 6 are fine as well, since subtracting the last 1 also gives 5. We simply add both path counts together, since these are the only two paths.

The formula is dp(n, s) = dp(n-1, s-x) + dp(n-1, s+x), where n is the size of the list, s is the target sum, and x is the element added to the previous list. OK, the second step is easy.
def findTargetSumWays_1(nums, S):
    """
    :type nums: Tuple[int]
    :type S: int
    :rtype: int
    """
    if not nums:
        if S == 0:
            return 1
        else:
            return 0
    return findTargetSumWays_1(nums[1:], S + nums[0]) + findTargetSumWays_1(nums[1:], S - nums[0])

small_test_nums = (1, 1, 1, 1, 1)
small_test_S = 3
%time findTargetSumWays_1(small_test_nums, small_test_S)
It is theoretically correct and works perfectly with small test cases. But we know that it is going to be a nightmare for an engineering application, because it has a hefty time complexity of O(2^N). So the math part is done, and we have to move to the third step.
So we need to find a data structure to record all the paths. If this were the Fibonacci number problem, a simple linear data structure like a list would slash O(2^N) down to O(N).
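The Fibonacci case mentioned above can be sketched in a few lines — storing previously computed values in a list turns the exponential recursion into a single linear pass:

```python
def fib(n):
    # each value is computed exactly once from the two before it,
    # so the whole run is O(n) instead of the O(2^n) naive recursion
    seq = [0, 1]
    for _ in range(2, n + 1):
        seq.append(seq[-1] + seq[-2])
    return seq[n]

print(fib(10))  # 55
```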
But the hard part is deciding what data structure to use here. Since the get operation on a hash table is O(1), a rolling dictionary will help record the previous states. However, a Python dictionary does not support adding or changing keys while it is being iterated over, so we have to build a new dictionary each round and replace the old one. The overall set of paths forms a tree structure. So the solution will look like –
def findTargetSumWays_2(nums, S):
    if not nums:
        return 0
    dic = {nums[0]: 1, -nums[0]: 1} if nums[0] != 0 else {0: 2}
    for i in range(1, len(nums)):
        tdic = {}
        for d in dic:
            tdic[d + nums[i]] = tdic.get(d + nums[i], 0) + dic.get(d, 0)
            tdic[d - nums[i]] = tdic.get(d - nums[i], 0) + dic.get(d, 0)
        dic = tdic
    return dic.get(S, 0)
big_test_nums = tuple(range(100))
big_test_S = sum(range(88))
%time findTargetSumWays_2(big_test_nums, big_test_S)
The time is exactly what we need. However, the code is not elegant and is hard to understand.
CPU times: user 189 ms, sys: 4.77 ms, total: 194 ms
Wall time: 192 ms
If we don’t want things to get complicated, we really just want a cache, and Python 3 provides the lru_cache decorator. Then adding one line to the first solution quickly solves the problem.
from functools import lru_cache

@lru_cache(maxsize=10000000)   # memoize on (remaining tuple, running sum); tuples are hashable
def findTargetSumWays_3(nums, S):
    if not nums:
        if S == 0:
            return 1
        else:
            return 0
    return findTargetSumWays_3(nums[1:], S + nums[0]) + findTargetSumWays_3(nums[1:], S - nums[0])
%time findTargetSumWays_3(big_test_nums, big_test_S)
CPU times: user 658 ms, sys: 19.7 ms, total: 677 ms
Wall time: 680 ms
Good math cannot solve all engineering problems. It has to be combined with the details of the language, the application, and the system to avoid a bad engineering implementation.
The Jupyter notebook is at Github. If you have any comment, please email me wm@sasanalysis.com.
The post Data wrangling - down the rabbit hole, and back again! appeared first on SAS Learning Post.
This post was kindly contributed by SAS Learning Post - go there to comment and to read the full post.
Are your friends passing around clever memes (supposedly) featuring something your favorite actor said, or sharing news articles that you think might be “fake news”? If there’s even a hint of data analyst in you, then you probably check the actual data, and confirm or disprove the supposed facts yourself. I […]
Nowadays Elasticsearch is more and more popular. Besides its original search functionality, I have found that Elasticsearch can also serve as a general-purpose data store — it is the data store I see every day.
People want to know what is going on with such data. So a business intelligence or an OLAP system is needed to visualize/aggregate the data and its flow. Since Elasticsearch is so easy to scale out, it beats other solutions for big data on the market.
There are many options to implement a batch worker. Ultimately the decision came down to either Spring Data Batch or writing a library from scratch in Python.
@Bean
public Step IndexMySQLJob01() {
    return stepBuilderFactory.get("IndexMySQLJob01")
        .<Data, Data> chunk(10)
        .reader(reader())
        .processor(processor())
        .writer(writer())
        .build();
}
class IndexMySQLJob01(object):
    def __init__(self, reader, processor, writer, listener):
        self.reader = reader
        self.processor = processor
        self.writer = writer
        self.listener = listener
    ...
Eventually Python is picked, because the overall scenario is more algorithm-bound instead of language-bound.
Since the data size is pretty big, time and space always have to be considered. The direct way to decrease the time complexity is to use hash tables, as long as the memory can hold the data. For example, a join between an N-row table and an M-row table can be optimized from O(M*N) to O(M).
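The hash-join idea can be sketched as follows (table and column names are made up for illustration):

```python
def hash_join(left_rows, right_rows, key):
    # build phase: index the right table by key once - O(N)
    lookup = {row[key]: row for row in right_rows}
    # probe phase: one O(1) lookup per left row - O(M) overall,
    # instead of the O(M*N) nested-loop join
    joined = []
    for row in left_rows:
        match = lookup.get(row[key])
        if match is not None:
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined

users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "total": 10}]
print(hash_join(users, orders, "id"))  # → [{'id': 1, 'name': 'a', 'total': 10}]
```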
To save space, a generator chain is used to stream data from start to end, instead of materializing sizable objects.
class JsonTask01(object):
    ...
    def get_json(self, generator1, hashtable1):
        for each_dict in generator1:
            key = each_dict.get('key')
            each_dict.update(hashtable1.get(key))
            yield each_dict
A scheduler is a must: cron is enough for simple tasks, while a bigger system requires a workflow. Airflow is the tool that helps organize and schedule the jobs. It has a web UI and is written in Python, which makes it easy to integrate with the batch worker.
Indexing a large quantity of data imposes a significant load. For mission-critical indexes that need 100% uptime, a zero-downtime algorithm is implemented, and we keep two copies of an index for maximum safety. An alias switches between the two copies once the indexing is finished.
def add_alias(self, idx):
    LOGGER.warn("The alias {} will point to {}.".format(self.index, idx))
    self.es.indices.put_alias(idx, self.index)

def delete_alias(self, idx):
    LOGGER.warn("The alias {} will be removed from {}.".format(self.index, idx))
    self.es.indices.delete_alias(idx, self.index)
An Elasticsearch node can take one of three roles: master node, data node, and ingest node (previously called a client node). It is common to have a single node act as master, ingest, and data all together. For a large system, however, it is helpful to assign the three roles to different machines/VMs; once a node goes down or comes back up, failover and recovery are quicker.
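A sketch of what the role assignment might look like in elasticsearch.yml for a dedicated data node (settings as of the Elasticsearch 5.x line; adjust for your version):

```yaml
# dedicated data node: not master-eligible, no ingest pipelines
node.master: false
node.data: true
node.ingest: false
```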
elasticsearch-head can clearly visualize the data transfer process of the shards once an accident occurs.
With an increased number of cluster nodes, deployment becomes painful. The best tool I have found so far is ansible-elasticsearch. With ansible-playbook -i hosts ./your-playbook.yml -c paramiko, the cluster is up on the fly.
The rules of thumb for Elasticsearch are -
Give (less than) Half Your Memory to Lucene
Don’t Cross 32 GB!
This causes an awkward situation: if a machine has more than 64GB of memory, the additional memory means nothing to Elasticsearch. It is therefore meaningful to run two or more Elasticsearch instances side by side to make full use of the hardware. For example, on a machine with 96GB of memory, we can allocate 31GB for an ingest node, 31GB for a data node, and the rest for the OS. However, two data nodes on a single machine would compete for disk IO, which damages performance, while a master node and a data node together increase the risk of downtime.
The great thing about Elasticsearch is that it provides rich REST APIs, such as http://localhost:9200/_nodes/stats?pretty. We can use X-Pack (paid) or other customized tools to monitor them. I feel the three most important kinds of statistics for the heap, and therefore for performance, are –
The two statistics intertwined together. The high heap usage, such as 75%, will lead to a GC, while GC with high heap usage will take longer time. We have to keep both numbers as low as possible.
"jvm" : {
"mem" : {
"heap_used_percent" : 89,
...
"gc" : {
"collectors" : {
...
"old" : {
"collection_count" : 225835,
"collection_time_in_millis" : 22624857
}
}
}
There are three kinds of thread pools: active, queue, and reject. It is useful to visualize the real time change. Once there are a lot of queued threads or rejected threads, it is good time to think about scale up or scale out.
"thread_pool" : {
"bulk" : {
"threads" : 4,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 4,
"completed" : 53680
},
The segments are the in-memory inverted indexes correponding to the indexes on the hard disk, which are persistent in the physical memory and GC will have no effect on them. The segment will have the footage on every search thread. The size of the segments are important because they will be multiplied by a factor of the number of threads.
"segments" : {
"count" : 215,
"memory_in_bytes" : 15084680,
},
The number of shards actually controls the number of the segments. The shards increase, then the size of the segments decreases and the number of the segments increases. So we cannot increase the number of shards as many as we want. If there are many small segments, the heap usage will turn much higher. The solution is Force merge, which is time-consuming but effecitve.
Kibana integreated DevTools(previously The Sense UI) for free. DevTools has code assistance and is a powerful tool for debugging. If the budget is not an issue, Xpack is also highly recommended. As for Elasticsearch, since 5.0, ingest-geoip
is now a plugin. We will have to write it to Ansible YAML such as -
es_plugins:
- plugin: ingest-geoip
There are quite a few KPIs that need system-wide term aggregations. From 5.0 the request cache will be enabled by default for all requests with size:0
.
For example -
POST /big_data_index/data/_search
{ "size": 0,
"query": {
"bool": {
"must_not": {
"exists": {
"field": "interesting_field"
}
}
}
}
}
The Fore merge
as mentioned above, such asPOST /_forcemerge?max_num_segments=1
, will combine the segments and dramatically increase the aggregation speed.
Nginx is possibly the best proxy as the frontend toward Kibana. There are two advantages: first the proxy can cache the static resources of Kibana; second we can always check the Nginx logs to figure out what causes problem for Kibana.
Elasticsearch and Kibana together provide high availability and high scalability for large BI system.
]]>if you have any comment, please email me wm@sasanalysis.com
This post was kindly contributed by Software & Service - go there to comment and to read the full post. |
Nowadays Elasticsearch is more and more popular. Besides its original search functionality, I found Elasticsearch can also serve as the data store behind a business intelligence system; it is the data store I see every day. People want to know what is going on with such data, so a business intelligence or OLAP system is needed to visualize and aggregate the data and its flow. Since Elasticsearch is so easy to scale out, it beats other big-data solutions on the market.
There are many options to implement a batch worker. Ultimately the decision came down to either Spring Batch or writing a library from scratch in Python.
// Spring Batch version: a chunk-oriented step that reads, processes,
// and writes 10 items per chunk
@Bean
public Step IndexMySQLJob01() {
    return stepBuilderFactory.get("IndexMySQLJob01")
            .<Data, Data> chunk(10)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .build();
}
# The equivalent job skeleton written from scratch in Python
class IndexMySQLJob01(object):
    def __init__(self, reader, processor, writer, listener):
        self.reader = reader
        self.processor = processor
        self.writer = writer
        self.listener = listener
    ...
Eventually Python was picked, because the overall scenario is more algorithm-bound than language-bound.
Since the data size is pretty big, time and space always have to be considered. The most direct way to cut time complexity is to use hash tables, as long as the memory can hold the data. For example, a join between a table with N rows and a table with M rows can be optimized from O(M*N) to O(M+N): build a hash table from one table once, then probe it for each row of the other.
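The hash-join idea can be sketched as follows; the field names and sample rows are hypothetical, not from the original jobs:

```python
def hash_join(left_rows, right_rows, key):
    """Join two lists of dicts on `key` in O(M + N) instead of O(M * N)."""
    # Build phase: index the right table by the join key (O(N)).
    lookup = {row[key]: row for row in right_rows}
    # Probe phase: stream the left table and enrich each matching row (O(M)).
    for row in left_rows:
        match = lookup.get(row[key])
        if match is not None:
            merged = dict(row)
            merged.update(match)
            yield merged

users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "total": 9.5}]
result = list(hash_join(users, orders, "id"))
```

Because `hash_join` is itself a generator, it slots directly into the generator chain described below.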
To save space, a generator chain is used to stream data from start to end, instead of materializing sizable intermediate objects.
class JsonTask01(object):
    ...
    def get_json(self, generator1, hashtable1):
        # Stream each record and enrich it from the hash table; default to an
        # empty dict so a missing key does not break update().
        for each_dict in generator1:
            key = each_dict.get('key')
            each_dict.update(hashtable1.get(key, {}))
            yield each_dict
A scheduler is a must: cron is enough for simple tasking, while a bigger system requires a workflow engine. Airflow is the one that helps organize and schedule the jobs. It has a web UI and is written in Python, which makes it easy to integrate with the batch worker.
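For the simple case, a single crontab entry is all it takes; the paths below are hypothetical:

```
# Hypothetical crontab entry: run the indexing batch worker hourly
0 * * * * /usr/bin/python /opt/batch/run_jobs.py >> /var/log/batch_worker.log 2>&1
```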
Indexing a large quantity of data imposes a significant load. For mission-critical indexes that need 100% uptime, a zero-downtime scheme is implemented: we keep two copies of an index for maximum safety, and the alias is switched from one copy to the other once indexing has finished.
def add_alias(self, idx):
    # Point the public alias at the freshly built index copy.
    LOGGER.warning("The alias {} will point to {}.".format(self.index, idx))
    self.es.indices.put_alias(index=idx, name=self.index)

def delete_alias(self, idx):
    # Detach the alias from the stale index copy.
    LOGGER.warning("The alias {} will be removed from {}.".format(self.index, idx))
    self.es.indices.delete_alias(index=idx, name=self.index)
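One caveat with two separate calls is a brief window where the alias points at both copies, or at neither. Elasticsearch's `update_aliases` API applies a remove and an add in a single atomic request; the helper below only builds the request body (the index and alias names are illustrative):

```python
def build_alias_swap(alias, old_index, new_index):
    """Body for es.indices.update_aliases(): swap `alias` in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = build_alias_swap("prod_data", "prod_data_v1", "prod_data_v2")
# then: es.indices.update_aliases(body=body)
```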
An Elasticsearch node can take one or more of three roles: master node, data node, and ingest node (previously called the client node). It is common to see a single node act as master, ingest, and data all at once. For a large system, however, it always helps to assign the three roles to different machines/VMs; then, whenever a node goes down or comes back, failover and recovery are quicker.
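With Elasticsearch 5.x settings, the roles are assigned in elasticsearch.yml. For example, a dedicated data node would carry:

```yaml
# elasticsearch.yml for a dedicated data node
node.master: false
node.data: true
node.ingest: false
```

A dedicated master or ingest node flips the corresponding flags instead.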
elasticsearch-head can clearly visualize the data transfer process of the shards once an accident occurs.
With the increased number of cluster nodes, deployment becomes painful. I feel the best tool so far is ansible-elasticsearch: with ansible-playbook -i hosts ./your-playbook.yml -c paramiko, the cluster is up on the fly.
The rules of thumb for Elasticsearch heap sizing are:
- Give (less than) half your memory to Lucene
- Don't cross 32 GB!
This leads to an awkward situation: if a machine has more than 64 GB of memory, the additional memory means nothing to a single Elasticsearch instance. It can actually make sense to run two or more Elasticsearch instances side by side to use the hardware. For example, on a machine with 96 GB of memory, we can allocate 31 GB for an ingest node, 31 GB for a data node, and leave the rest to the OS. However, two data nodes on a single machine would compete for disk IO, which damages performance, while colocating a master node with a data node increases the risk of downtime.
The great thing about Elasticsearch is that it provides rich REST APIs, such as http://localhost:9200/_nodes/stats?pretty. We could use X-Pack (paid) or other customized tools to monitor them. I feel the three most important groups of statistics for the heap, and therefore for performance, are the JVM heap/GC numbers, the thread pools, and the segments.
Heap usage and GC are intertwined: high heap usage, such as 75%, will trigger a GC, while a GC under high heap usage takes longer. We have to keep both numbers as low as possible.
"jvm" : {
"mem" : {
"heap_used_percent" : 89,
...
"gc" : {
"collectors" : {
...
"old" : {
"collection_count" : 225835,
"collection_time_in_millis" : 22624857
}
}
}
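From those two counters we can derive an average old-generation GC pause, which is easier to alert on than the raw numbers. A small sketch against the sample stats above:

```python
def avg_old_gc_ms(jvm_stats):
    """Average old-generation GC pause in ms, from a _nodes/stats jvm section."""
    old = jvm_stats["gc"]["collectors"]["old"]
    return old["collection_time_in_millis"] / old["collection_count"]

jvm = {"gc": {"collectors": {"old": {
    "collection_count": 225835,
    "collection_time_in_millis": 22624857,
}}}}
# Roughly a 100 ms pause per old GC for the node above.
print(round(avg_old_gc_ms(jvm), 1))
```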
Each thread pool reports three key counters: active, queue, and rejected. It is useful to visualize their change in real time: once many threads are queued or rejected, it is a good time to think about scaling up or scaling out.
"thread_pool" : {
"bulk" : {
"threads" : 4,
"queue" : 0,
"active" : 0,
"rejected" : 0,
"largest" : 4,
"completed" : 53680
},
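A monitoring script can scan the same stats document and flag saturated pools. A sketch, with a hypothetical queue threshold and sample data:

```python
def saturated_pools(thread_pool_stats, queue_threshold=50):
    """Names of thread pools with any rejections or a deep queue."""
    return [
        name
        for name, s in thread_pool_stats.items()
        if s.get("rejected", 0) > 0 or s.get("queue", 0) > queue_threshold
    ]

stats = {
    "bulk": {"threads": 4, "queue": 0, "active": 0, "rejected": 0},
    "search": {"threads": 7, "queue": 180, "active": 7, "rejected": 12},
}
print(saturated_pools(stats))  # only the overloaded search pool is reported
```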
The segments are the in-memory inverted indexes corresponding to the indexes on the hard disk. They stay resident in physical memory, and GC has no effect on them. Each segment leaves a footprint on every search thread, so the size of the segments matters: it is effectively multiplied by the number of threads.
"segments" : {
"count" : 215,
"memory_in_bytes" : 15084680,
},
The number of shards actually controls the number of segments: as the shards increase, the size of each segment decreases and their number increases. So we cannot increase the number of shards as much as we want. If there are many small segments, heap usage gets much higher. The solution is a force merge, which is time-consuming but effective.
Kibana integrated DevTools (previously the Sense UI) for free. DevTools has code assistance and is a powerful tool for debugging. If budget is not an issue, X-Pack is also highly recommended. As for Elasticsearch, since 5.0 ingest-geoip is now a plugin. We have to declare it in the Ansible YAML, such as:
es_plugins:
- plugin: ingest-geoip
There are quite a few KPIs that need system-wide term aggregations. Since 5.0, the request cache is enabled by default for all requests with size:0.
For example:
POST /big_data_index/data/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "interesting_field"
        }
      }
    }
  }
}
The force merge mentioned above, such as POST /_forcemerge?max_num_segments=1, will combine the segments and dramatically increase aggregation speed.
Nginx is possibly the best proxy to put in front of Kibana. There are two advantages: first, the proxy can cache Kibana's static resources; second, we can always check the Nginx logs to figure out what causes problems for Kibana.
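A minimal sketch of such a proxy, assuming Kibana listens on its default port 5601; the server name and cache path are hypothetical:

```nginx
proxy_cache_path /var/cache/nginx/kibana keys_zone=kibana_static:10m;

server {
    listen 80;
    server_name kibana.example.com;

    # Pass everything through to Kibana, and log each request in Nginx.
    location / {
        proxy_pass http://127.0.0.1:5601;
        proxy_set_header Host $host;
    }

    # Cache Kibana's static bundles to lighten the load on Kibana itself.
    location /bundles/ {
        proxy_pass http://127.0.0.1:5601;
        proxy_cache kibana_static;
        proxy_cache_valid 200 1d;
    }
}
```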
Elasticsearch and Kibana together provide high availability and high scalability for a large BI system.
If you have any comment, please email me at wm@sasanalysis.com.
This post was kindly contributed by Software & Service - go there to comment and to read the full post. |