tag:blogger.com,1999:blog-49038565651279375992017-09-21T19:23:32.707-04:00Numoracle RecipesData Mining, Predictive Analytics, Statistics and Scientific Computing in the Oracle DatabaseRamkumar Krishnan (Ram)http://www.blogger.com/profile/16930865713058805668noreply@blogger.comBlogger11125tag:blogger.com,1999:blog-4903856565127937599.post-26350608100348603182008-01-04T11:38:00.001-05:002008-01-04T15:10:54.341-05:00Talk at WPI - Oracle In-Database MiningMy plan was to finish a full series on Oracle tools for Basic Statistics, before moving on to typical "blog-like" postings - but I realize I'll need at least another month (or two) to get to topics like ANOVA and Multiple Regression. I feel compelled to get a basic feel for a technique/concept before writing up examples of the functions here (which anybody can quickly get from the Oracle SQL Reference Manual).<br /><br />So I am (reluctantly) breaking my rule to sneak in this note about a talk I gave in Nov '07 at the <a href="http://web.cs.wpi.edu/Flyers/2007/Krishnan30Nov07.html">Worcester Polytechnic '07 Colloqium Series</a> - before it loses its relevance. The talk provides the motivation for in-database mining, and the <a href="http://www.box.net/shared/q8wb4d04c4">presentation slides offer a good intro </a>to Oracle Data Mining in the 11gR1 Database release.<br /><br />Oracle has touted the benefits of in-database mining for over 4 years now, with steady improvements to the product and an expanding customer base. <a href="http://www.sas.com/news/preleases/100807/IndatabaseProcessing.html">The recent announcement from market leader SAS</a> on integration with Teradata is a nice vindication of this approach. Details of the integration, whether this engineering effort pans out, and how the market receives the integrated product remains to be seen - but this is good for the industry. SAS still owns the bully pulpit in Analytics, and this move can hopefully lead to greater awareness of the benefits of this approach, and consolidation in this space. <a href="http://oracledmt.blogspot.com/2007/12/recap-post.html">Marcos has promised to write about these developments</a> from his perspective as a scientist/developer - so I will hold off my thoughts in anticipation of that post.<br /><br />Impressions on WPI - good school with active AI and Database research groups. I had the pleasure of meeting Prof. Ruiz and Prof. Heffernan from the <a href="http://web.cs.wpi.edu/Research/airg/">AI research group</a> - they have some interesting projects there, and Prof. Rundentstein and Prof. Mani from the <a href="http://davis.wpi.edu/dsrg/">database research group</a>. Check these programs out if you are an aspiring CS student.<img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/Q0DRs8jQX0I" height="1" width="1" alt=""/>Ramkumar Krishnan (Ram)http://www.blogger.com/profile/16930865713058805668noreply@blogger.com1http://numoraclerecipes.blogspot.com/2008/01/talk-at-wpi-oracle-in-database-mining.htmltag:blogger.com,1999:blog-4903856565127937599.post-82276069189134731962007-12-29T13:22:00.001-05:002008-04-15T00:26:00.129-04:00Student's T-Distribution, Degrees of FreedomIn the <a href="http://numoraclerecipes.blogspot.com/2007/12/students-t-distribution-degrees-of.html">previous post</a>, we showed how to compute the confidence interval when σ - i.e. the population standard deviation, is known. But in a typical situation, σ is rarely known.<br /><br /><b>Computing confidence Interval for μ when σ is Unknown:</b> The solution then is to use the <em>sample</em> standard deviation, and use a variant of the standardized statistic for normal distribution z = (X_bar - μ)/σ.<br /><br /><b>Student's T Distribution:</b><br />For a <em>normally distributed population</em>, the Student's T distribution is given by this standardized statistic: <b>t = (X_bar - μ)/(S/√n), with (n - 1) degrees of freedom (df) for the deviations</b>, where S is the sample standard deviation, and <b>n</b> is the sample size. Key points:<ul><li>The t distribution resembles the bell-shaped z (normal) distribution, but with wider tails than z, with mean zero (the mean of z), and with variance tending to 1 (the variance of z) as df increases. For df > 2, <b>σ² = df/(df - 2)</b>. </li><li>The z (normal) distribution deals with one unknown <b>μ</b> - estimated by the random variable X_bar, while the t distribution deals with two unknowns <b>μ</b> and <b>σ</b> - estimated by random variables X_bar and S respectively. So it tacitly handles greater uncertainty in the data.</li><li>As a consequence, the t distribution has wider confidence intervals than z</li><li>There is a t-distribution for each df=1,..n.</li><li>A good <a href="http://www.statsoft.com/textbook/sttable.html">comparison of the distributions is provided here</a>.</li><li>For a sample (n < 30) taken from normally distributed population, a <b>(1 - α) 100% confidence interval</b> for μ when σ is <u>unknown</u> is <b>X_bar ± t<sub>α/2</sub> s/√n</b>. This is the better distribution to use for small samples - with (n - 1) df, and unknown μ and &sigma.</li><li>But larger samples (n ≥ 30), and/or with larger df, the t distribution can be approximated by a z distribution, and the <b>(1 - α)100% confidence interval</b> for μ is <b>X_bar ± z<sub>α/2</sub> s/√n</b> (Note: Using the sample sd itself).</li></ul>We will pause to understand df in this context (based on Aczel's book). <a href="http://numoraclerecipes.blogspot.com/2007/12/sampling-distributions-introduction.html">We noted earlier</a> that the df helps as compensating factor - here is how. Assume a population of five numbers - 1, 2, 3, 4, 5. The (known) population mean is μ = (1+2+3+4+5)/5 = 7.5. Assume we are asked to sample 5 numbers and find the squared standard deviation (ssd) based on μ:<pre>x | x_bar | deviation | deviation_squared<br />3 | 3 | 0.0 | 0<br />2 | 3 | -1.0 | 1<br />4 | 3 | 1.0 | 1<br />1 | 3 | -2.0 | 4<br />2 | 3 | -1.0 | 1<br />Sum of Squared Deviation = 7</pre>Given the mean, the deviation computation for the <em>random</em> 5 samples effectively retain 5 degrees of freedom. Next, assume we <em>don't</em> know the population mean, and instead are asked to compute the deviation from <em>one</em> random number. Our goal is to choose a number that will minimize the deviation. A readily available number is the s sample mean (3+2+4+1+2)/5 = 2.4 - so we will use it:<pre>x | x_bar | deviation | deviation_squared<br />3 | 2.4 | 0.6 | 0.36<br />2 | 2.4 | -0.4 | 0.16<br />4 | 2.4 | 1.6 | 2.56<br />1 | 2.4 | -1.4 | 1.96<br />2 | 2.4 | -0.4 | 0.16<br />Sum of Squared Deviation = 5.2</pre>The use of sample mean biases the SSD downward from 7 (actual) to 5.2. But given the choice of a mean, the deviation for the same random 5 samples retain df = (5 - 1) = 4 degrees of freedom.<br />Subsequent choices of 2 means - (3+2)/2, (4+1+2)/3 - or 3 means, would reduce the SSD down further; at the same time, reducing the degrees of freedom for the deviation for the 5 random samples: df=(5-2), df=(5-3) and so on. As an extreme case, if we consider the sample mean of each sampled number as itself, then we have:<pre>x | x_bar | deviation | deviation_squared<br />3 | 3 | 0 | 0<br />2 | 3 | 0 | 0<br />4 | 4 | 0 | 0<br />1 | 1 | 0 | 0<br />2 | 2 | 0 | 0<br />Sum of Squared Deviation = 0</pre> which reduces SSD to 0, and the deviation df to (5-5) = 0. So in general,<ul><li>deviations (and hence SSD) for a sample of size n taken from a known population mean μ will have df = n</li><li>deviations for a sample of size n taken from the sample mean X_bar will have df = (n - 1)</li><li> deviations for a sample of size n taken from k ≤ n different numbers (typically mean of sample points) will have df = n - k.</li></ul>Confidence interval using T-distribution applies for the narrow case of n < 30, normal population; the Z distribution covers larger samples. The practical use this distribution appears more to be in comparing two populations using the <b>Student's T-Test</b> - which requires understanding the concepts of Hypothesis Testing. So I will defer the code for an equivalent confidence_interval() routine based on T-distribution for later.<img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/pogd35YRJJM" height="1" width="1" alt=""/>Ramkumar Krishnan (Ram)http://www.blogger.com/profile/16930865713058805668noreply@blogger.com0http://numoraclerecipes.blogspot.com/2007/12/students-t-distribution-degrees-of.htmltag:blogger.com,1999:blog-4903856565127937599.post-46903230765716913282007-12-21T14:08:00.001-05:002008-04-15T00:22:21.752-04:00Sampling Distributions, Confidence IntervalThe key goal of inferential statistics is to make predictions/observations about the population (the whole) as a generalization of observations made on a random sample (the part). In the <a href="http://numoraclerecipes.blogspot.com/2007/11/sampling-and-sampling-distributions.html">previous post</a>, we discussed common techniques to derive samples from a population. In this post, we will discuss <em>sampling distributions</em> - a key building block for the practice of statistical inference. These tools help answer questions such as: "What should be the sample size to make a particular inference about this population?" or "100 random water samples along this river show an average of 50 ppm (parts per million) of this toxin, with standard deviation of 4.5 - how much is the river contaminated on average with 95% confidence interval", and so on.<br /><br />The objective in the next few posts is to discuss the use of Oracle SQL statistical functions for various sampling distributions. But if you are a typical DB developer with novice/intermediate knowledge of statistics (like me), spending some time on these foundational concepts may be worthwhile. I am currently using <a href="http://www.amazon.com/gp/product/007287418X?ie=UTF8&tag=numoreci-20&link_code=as3&camp=211189&creative=373489&creativeASIN=007287418X">Complete Business Statistics</a> and the <a href="http://www.amazon.com/gp/product/1592571999?ie=UTF8&tag=numoreci-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=1592571999">Idiot's Guide to Statistics</a><img border="0" width="1" alt="" style="border:none !important; margin:0px !important;" src="http://www.assoc-amazon.com/e/ir?t=numoreci-20&l=as2&o=1&a=1592571999" height="1"/>as my guides - you may use these or other books and/or the free online references on the right pane of this blog. <ul><li>The various numerical measures - such as mean, variance etc - when applied to a <em>sample</em>, are called <b>sample statistics</b> or simply <b>statistics</b>.</li><li>When these numerical measures are applied to a <em>population</em>, they are called <b>population parameters</b> or simply <b>parameters</b>.</li><li>An <b>estimator</b> of a population parameter is the sample statistic used to estimate the parameter. The sample statistic - mean, X_bar - estimates the population mean μ; the sample statistic - variance, S² - estimates the population variance σ².</li><li>When a single numerical value is the estimate, it is called a <b>point estimate</b>. For example, when we sample a population and obtain a value for X_bar - the statistic - we get a <em>specific</em> sample mean, denoted by x_bar (lower-case), which is the estimate for population mean μ. When the estimate covers a range or an interval, it is called an <b>interval estimate</b> - the unknown population parameter is likely to be found in this interval</li><li>A sample statistic, such as X_bar, is a <em>random variable</em>; the values of this randome variable depend on the values in the random sample from which the statistic is computed; the sample itself depends on the population from which it is drawn. This random variable has probability distribution, which is called the <b>sampling distribution</b></li><li>The principal use of sampling distributions and its related concepts is to help predict how close the estimate is to the population parameter, and with what probability.</li></ul><b>Central Limit theorem</b>:<br />The sample statistic sample mean X_bar exhibits a unique behavior - regardless of the population distribution (uniform, exponential, other), in the limit n → ∞ ("n tends to infinity", where n is the sample size), the sampling distribution of X_bar tends to a normal distribution. The <em>rate</em> at which the sampling distribution approaches normal distribution depends on the population distribution. Now, if the <u>population itself is normally distributed</u>, then X_bar is normally distributed for <em>any</em> sample size. This is the essence of what is called Central Limit theorem.<br /><br />Formally, when a population with mean μ and standard deviation σ is sampled, the sampling distribution of the sample mean X_bar will tend to a normal distribution with (the same) mean <b>μ</b> and standard deviation <b>σ<sub>x_bar</sub> = σ/√n</b>, as the sample size <b>n</b> becomes large. "Large" is empirically defined to be <b>n ≥ 30</b>. The value <b>σ<sub>x_bar</sub></b> is called the <b>standard error of the mean</b>.<br /><br />"Okay,... so what is the big deal?". The big deal is that we can now estimate the population mean (regardless of the population's distribution) using the familiar technique (<a href="http://numoraclerecipes.blogspot.com/2007/09/st004-inferential-statistics-4.html">that we saw in an earlier post</a>) for standard normal distribution.<br /><br />Now, it is not common that one or more of the population parameters (like standard deviation) are always known. The computations have to be modified to accommodate these unknowns - which brings us to two more concepts associated with sampling distributions.<br /><br /><b>Degrees of Freedom (DF):</b><br />If we are asked to choose three random numbers <em>a</em>, <em>b</em> and <em>c</em>, we are free to choose any three numbers without any restrictions - in other words, we have 3 <em>degrees of freedom</em>. But if the three numbers are put together in a model <em>a + b + c = 10</em>, then we have just 2 degrees of freedom - choice of <em>a</em> and <em>b</em> can be arbitrary , but <em>c</em> is constrained to take a specific value that satisfies the model. The use of <em>df</em> appears to be a compensatory mechanism in the computations, specific to the context/situation in which is it applied - so we'll discuss this in the context of the technique we are illustrating.<br /><br /><b>Confidence Interval:</b><br />An interval estimate, with its associated measure of confidence is called <b>confidence interval</b>. It is a range of numbers that probably contains the unknown population parameter, with an adjoining <em>level of confidence</em> that it indeed does. This is better than a point estimate in that it gives some indication of the <em>accuracy</em> of the estimation.<br /><br /><a href="http://numoraclerecipes.blogspot.com/2007/09/st004-inferential-statistics-4.html">In an earlier post</a>, we briefly touched upon the transformation of a normal random variable (X, with arbitrary μ and σ) to a <em>standard</em> normal variable (Z, with μ = 0 and σ = 1). The transformations are X to Z: Z = (X - μ)/σ and Z to X: X = μ + Zσ. Applying the latter transformation to standardized sampling distribution with mean μ and standard deviation σ/√n, the confidence interval for the population mean is μ ± Z σ/√n.<br /><br /><a href="http://4.bp.blogspot.com/_Yr3mSaNeLPo/R3WfXx2p2lI/AAAAAAAAAAM/Vk9G36mEFPY/s1600-h/std_normal.png"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_Yr3mSaNeLPo/R3WfXx2p2lI/AAAAAAAAAAM/Vk9G36mEFPY/s320/std_normal.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5149196979836410450" /></a>A typical question will be "Give me the 95% confidence interval for the population mean". Given the confidence level, and the knowledge that the area under the standard normal curve is 1, we can obtain the value of Z from the <a href="http://www.stat.tamu.edu/~mhasson/normal.pdf">standard normal table</a>. For example, a 95% confidence level translates to an area of 0.95 symmetrically distributed around the mean, leaving 0.025 as areas on the left and right tails. From the table, Z = -1.96 for P=0.025, and Z = 1.96 for P=(0.025+0.95). So the 95% confidence interval for the population mean, when the population standard deviation is known, is given by μ ± 1.96 σ/√n<br /><br />We'll wrap this section reinforcing some concepts for use later:<ul><li>In probability-speak, the statement "95% confidence interval for the population mean" implies that "there is a 95% probability that a given confidence interval from a given random sample from the same population will contain the population mean". It does <u>NOT</u> imply a "95% probability that the population mean is a value in the range of the interval". In the figure, sample mean x for a specific sample falls within the interval - based on this, the confidence interval is considered to contain the population mean μ. If x for another sample falls in the tail region, then that confidence interval cannot assert that it contains μ.</li><li>The quantity Z σ/√n is called <b>sampling error</b> or <b>margin of error</b>.</li><li>The combined area under the curve in the tails (i.e. 1 - 0.95 = 0.05 in the above example) is called <b>level of significance α</b>, and/or <b>error probability</b>.</li><li>The area under the curve excluding the tails under the curve in the tails <b>(1 - α)</b> is called <b>confidence coefficient</b>.</li><li>The confidence coefficient x 100, expressed as a percentage, is the <b>confidence level</b>.</li><li>The Z value that cuts off the area under the right tail (i.e. the area α/2 on the right of the curve, 1.96 in our example) is denoted as <b>z<sub>α/2</sub></b>.</li><li>For a small sample (n < 30), or a sample taken from a normally distributed population, the <b>(1 - α) 100% confidence interval</b> for μ with known σ is <b>X_bar ± z<sub>α/2</sub> σ/√n</b></ul><b>Confidence Interval for Population Mean with Known σ:</b><br />Excel has a CONFIDENCE() function to compute the confidence interval. See a simple equivalent for Oracle SQL below. The function takes in a table name, the column name representing the sampled quantity, and level of significance value of 0.05, 0.01, or 0.1 (that corresponds to the three popular confidence levels - 95%, 99%, and 90% - respectively), and returns an object that contains the sample mean, sample error, the lower and upper bounds of the interval.<pre>CREATE OR REPLACE TYPE conf_interval_t AS OBJECT (<br /> pop_mean NUMBER, sample_err NUMBER, lower NUMBER, upper NUMBER);<br />/<br />CREATE OR REPLACE FUNCTION confidence_interval (<br /> table_name IN VARCHAR2,<br /> column_name IN VARCHAR2,<br /> sample_percent IN NUMBER,<br /> alpha IN NUMBER DEFAULT 0.05,<br /> seed IN NUMBER DEFAULT NULL)<br />RETURN conf_interval_t IS<br /> pop_mean NUMBER;<br /> pop_stddev NUMBER;<br /> sample_sz NUMBER;<br /> z NUMBER;<br /> err NUMBER;<br /> v_stmt VARCHAR2(32767);<br />BEGIN<br /> v_stmt :=<br /> 'SELECT AVG(' || column_name || '), count(*) ' ||<br /> 'FROM (SELECT * ' ||<br /> 'FROM ' || table_name ||<br /> ' SAMPLE(' || sample_percent || ')';<br /> IF (seed IS NOT NULL) THEN<br /> v_stmt := v_stmt || ' SEED(' || seed || ')';<br /> END IF;<br /> v_stmt := v_stmt || ')';<br /> EXECUTE IMMEDIATE v_stmt INTO pop_mean, sample_sz;<br /><br /> v_stmt :=<br /> 'SELECT STDDEV(' || column_name || ') ' ||<br /> 'FROM ' || table_name;<br /> EXECUTE IMMEDIATE v_stmt INTO pop_stddev;<br /> <br /> IF (alpha = 0.05) THEN<br /> z := 1.96;<br /> ELSIF (alpha = 0.01) THEN<br /> z := 2.57;<br /> ELSIF (alpha = 0.1) THEN<br /> z := 1.64;<br /> ELSE<br /> RETURN(NULL);<br /> END IF;<br /><br /> err := z * pop_stddev / SQRT(sample_sz);<br /> RETURN (conf_interval_t(pop_mean, err, (pop_mean - err), (pop_mean + err)));<br />END confidence_interval;<br />/</pre> I used this function to find the 90%, 95%, and 99% confidence interval for the population mean of ORDERS_TOTAL in the ORDERS table, with an approx sample size of 15, with a seed to enable repeatable runs from the SQL sampler. Notice how the interval widens and becomes less precise as the confidence level increases. The true population mean is also shown to be contained in the interval<pre>SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.1, 3) from dual;<br /><br />CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.1,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)<br />-------------------------------------------------------------------------<br />CONF_INTERVAL_T(24310.9188, 21444.6451, 2866.27368, 45755.5638)<br /><br />SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.05, 3) from dual;<br /><br />CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.05,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)<br />-------------------------------------------------------------------------<br />CONF_INTERVAL_T(24310.9188, 25628.9661, -1318.0473, 49939.8848)<br /><br />SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.01, 3) from dual;<br /><br />CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.01,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)<br />-------------------------------------------------------------------------<br />CONF_INTERVAL_T(24310.9188, 33605.3279, -9294.4092, 57916.2467)<br /><br />SQL> select avg(order_total) from orders;<br /><br />AVG(ORDER_TOTAL)<br />----------------<br /> 34933.8543<br /><br />SQL></pre>Fine - but how do we find the confidence interval when σ is unknown (which is the norm in practice)? Enter <b>T (or Student's) Distribution</b> - we will look at this in the <a href="http://numoraclerecipes.blogspot.com/2007/12/students-t-distribution-degrees-of.html">next post</a>.<img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/-giolPmTjKQ" height="1" width="1" alt=""/>Ramkumar (Ram) Krishnanhttp://www.blogger.com/profile/01032962495745306728noreply@blogger.com0http://numoraclerecipes.blogspot.com/2007/12/sampling-distributions-introduction.htmltag:blogger.com,1999:blog-4903856565127937599.post-74683323492018478052007-11-03T23:49:00.000-04:002008-01-04T15:13:38.032-05:00SamplingThe key goal of inferential statistics is to make predictions/observations about the <em>population</em> (the whole) as a generalization of observations made on a random <em>sample</em> (the part). In this post, we will present tools/techniques in Oracle for sampling data.<br /><br /><b>Sampling</b><br />For ensuring high accuracy in the results of a statistical inference (technique), the sample dataset should minimally have these properties:<ul><li>the sample must be drawn randomly from the population</li><li>every segment of the population must be adequately and proportionately represented</li><li>the sample should not be <em>biased</em> - a classic example is the <em>non-response bias</em> seen in survey/poll data - where the respondents ignore/refuse to answer a particular question ("Have you smoked marijuana?" in a health questionnaire).</li></ul>We will use this <a href="http://en.wikipedia.org/wiki/Sampling_%28statistics%29">succinct introduction to sampling techniques</a> (or alternatively, <a href="http://www.statcan.ca/english/edu/power/ch13/first13.htm">this resource</a>) as a basis to discuss probabilistic (a.k.a random) sampling <a href="http://www.ma.utexas.edu/users/parker/sampling/repl.htm">with and without replacement</a>, clustered sampling, and stratified sampling in Oracle.<br /><br />The <a href="http://download.oracle.com/docs/cd/B28359_01/server.111/b28286/statements_10002.htm#i2065646">SAMPLE clause in the SQL SELECT statement</a> supports simple random sampling and clustered sampling. Note that the <em>sampling frame</em> here is simply the set of rows returned by the statement - so you can control how many, and which, of the rows are sampled using filters in the WHERE clause. Here are some examples based on the OE.ORDERS table provided with the <a href="http://st-doc.us.oracle.com/11/111/server.111/b28328/toc.htm">Sample Schema</a> (connect oe/oe in SQL*Plus to see this table).<br /><br /><b>Example: Random Sampling without replacement:</b><br />Get random samples from ORDERS table with each row having a 15% chance of being in the sample<pre>SQL> select order_id,order_status,order_total from orders sample (15) seed (3);<br /><br /> ORDER_ID ORDER_STATUS ORDER_TOTAL<br />---------- ------------ -----------<br /> 2354 0 46257<br /> 2360 4 990.4<br /> 2361 8 120131.3<br /> 2384 3 29249.1<br /> 2389 4 17620<br /> 2400 2 69286.4<br /> 2401 3 969.2<br /> 2402 8 600<br /> 2415 6 310<br /> 2416 6 384<br /> 2417 5 1926.6<br /><br /> ORDER_ID ORDER_STATUS ORDER_TOTAL<br />---------- ------------ -----------<br /> 2419 3 31574<br /> 2423 3 10367.7<br /> 2427 7 9055<br /> 2429 9 50125<br /> 2453 0 129<br /><br />16 rows selected.<br /><br />SQL></pre>The SEED clause in the above statement helps with repeatability of results from one run to the next. Next, achieving <b>clustered sampling</b> requires a BLOCK qualifier in the SAMPLE clause, as in <pre>select order_id,order_status,order_total from orders sample block (15) seed (3);</pre><b>Example: Random Sampling with replacement:</b> can be accomplished with some minimal coding. In the example below, we using a hashing function (<a href="http://st-doc.us.oracle.com/11/111/server.111/b28286/functions109.htm#sthref1754">ORA_HASH</a>) on the sequence of enumerated (using <a href="http://st-doc.us.oracle.com/11/111/server.111/b28286/pseudocolumns009.htm#SQLRF00255">ROWNUM </a>pseudo-column) rows to randomize the selection from the table to be sampled. Let's try this: "Get me random samples with replacement with sample size 10 rows from the ORDERS table".<pre>-- stage a view with row numbers tacked on to original table<br />create view orders_view as<br />select rownum rnum, o.*<br /> from orders o;<br /><br />-- create a sequence<br />create sequence orders_seq10;<br /><br />-- create a mapping table<br />create table orders_map(rnum number, rhsh number);<br /><br />-- the requested sample size is 10<br />begin<br /> for i in 1..10<br /> loop<br /> insert into orders_map (rnum) values (orders_seq10.nextval);<br /> end loop;<br />end;<br />/<br />commit;<br /><br />-- the complete orders table is the sampling frame<br />-- mark random sampled entries in mapping table<br />update orders_map set rhsh = ora_hash(rnum, (select count(*) from orders));<br /><br />-- use the mapping table and orders_view to create a view with sampled rows<br />create view orders_sample10_view as<br />select o.*<br /> from orders_view o, orders_map m<br /> where o.rnum = m.rnum;</pre>This is the result of the above code snippet, when run in the OE schema using SQL*Plus:<pre>SQL> select * from orders_sample10_view;<br /><br />RNUM ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID<br />------ ---------- -------- ----------- ------------ ----------- ------------<br />11 2455 direct 145 7 14087.5 160<br />12 2379 direct 146 8 17848.2 161<br />13 2396 direct 147 8 34930 161<br />14 2406 direct 148 8 2854.2 161<br />15 2434 direct 149 8 268651.8 161<br />16 2436 direct 116 8 6394.8 161<br />17 2446 direct 117 8 103679.3 161<br />18 2447 direct 101 8 33893.6 161<br />19 2432 direct 102 10 10523 163<br />20 2433 direct 103 10 78 163<br /><br />10 rows selected.<br /><br />SQL></pre>For repeatability of sampled results, change all "create view" statements above to "create table". It is also useful if the data to be sampled is persisted in a table, rather than presented to the sampling queries as a view.<br /><br /><b>Stratified Sampling</b> can also be coded using SQL. But rather than provide an example query that is specific to a given table, we will provide a query generator function here. This is adapted from the output of code generated by Oracle Data Miner for stratified sampling. The function accepts a source and result table name, the name of the stratifying column (a.k.a. "variable" in statistics, "attribute" in data mining), the sampling size as a percentage of the table size, a scalar comparison operator ('<', '=', '>'), and an indicator to specify if each strata should roughly have the same number of elements.<br /><br />An important note - Oracle Data Miner - the GUI interface for the Oracle Data Mining platform provides sampling features, <em>and</em> generated SQL/PLSQL code that corresponds to the sampling input. This example has been adapted from this generated code.<pre>CREATE OR REPLACE TYPE TARGET_VALUES_LIST IS TABLE OF VARCHAR2(32);<br />/<br />CREATE OR REPLACE TYPE VALUE_COUNT_LIST IS TABLE OF NUMBER;<br />/<br /><br />CREATE OR REPLACE<br />FUNCTION GENERATE_STRATIFIED_SQL ( <br /> result_table_name IN VARCHAR2,<br /> source_table_name IN VARCHAR2, <br /> strat_attr_name IN VARCHAR2,<br /> percentage IN NUMBER,<br /> op IN VARCHAR2,<br /> equal_sized_strata IN BOOLEAN DEFAULT FALSE)<br />RETURN VARCHAR2 IS<br /> v_stmt VARCHAR2(32767);<br /> tmp_str VARCHAR2(4000);<br /> sample_count PLS_INTEGER;<br /> attr_names TARGET_VALUES_LIST;<br /> attr_values TARGET_VALUES_LIST;<br /> counts VALUE_COUNT_LIST;<br /> counts_sampled VALUE_COUNT_LIST;<br /> v_minvalue NUMBER;<br />BEGIN<br /> v_stmt :=<br /> 'SELECT column_name ' ||<br /> 'FROM user_tab_columns ' ||<br /> 'WHERE table_name = ' || '''' || UPPER(source_table_name) || '''';<br /> EXECUTE IMMEDIATE v_stmt BULK COLLECT INTO attr_names;<br /><br /> v_stmt := <br /> 'SELECT /*+ noparallel(t)*/ ' || strat_attr_name || ', count(*), ' ||<br /> 'ROUND ((count(*) * ' || percentage || ')/100.0) ' ||<br /> 'FROM ' || source_table_name || <br /> ' WHERE ' || strat_attr_name || ' IS NOT NULL ' ||<br /> 'GROUP BY ' || strat_attr_name;<br /> EXECUTE IMMEDIATE v_stmt<br /> BULK COLLECT INTO attr_values, counts, counts_sampled;<br /><br /> IF (equal_sized_strata = TRUE) THEN<br /> FOR i IN counts.FIRST..counts.LAST<br /> LOOP<br /> IF (i = counts.FIRST) THEN<br /> v_minvalue := counts(i);<br /> ELSIF (counts(i) > 0 AND v_minvalue > counts(i)) THEN<br /> v_minvalue := counts(i);<br /> END IF;<br /> END LOOP;<br /> FOR i IN counts.FIRST..counts.LAST<br /> LOOP<br /> counts(i) := v_minvalue;<br /> END LOOP;<br /> END IF;<br /><br /> v_stmt :=<br /> 'CREATE TABLE ' || result_table_name || ' AS ' ||<br /> 'SELECT ';<br /> FOR i IN attr_names.FIRST..attr_names.LAST <br /> LOOP<br /> IF (i != attr_names.FIRST) THEN<br /> v_stmt := v_stmt || ',';<br /> END IF;<br /> v_stmt := v_stmt || attr_names(i);<br /> END LOOP;<br /> <br /> v_stmt := v_stmt ||<br /> ' FROM (SELECT /*+ no_merge */ t.*, ' ||<br /> 'ROW_NUMBER() OVER ' ||<br /> '(PARTITION BY ' || strat_attr_name ||<br /> ' ORDER BY ORA_HASH(ROWNUM)) RNUM ' ||<br /> 'FROM ' || source_table_name || ' t) ' ||<br /> 'WHERE RNUM = 1 OR ';<br /><br /> FOR i IN attr_values.FIRST..attr_values.LAST <br /> LOOP<br /> IF (i != attr_values.FIRST) THEN<br /> tmp_str := ' OR ';<br /> END IF;<br /> IF (counts(i) <= 2) THEN<br /> sample_count := counts(i);<br /> ELSE<br /> sample_count := counts_sampled(i);<br /> END IF;<br /> tmp_str :=<br /> tmp_str || <br /> '(' || strat_attr_name || ' = ''' || attr_values(i) || '''' ||<br /> ' AND ORA_HASH(RNUM, (' || counts(i) || ' -1), 12345) ' ||<br /> op || sample_count || ') ';<br /> v_stmt := v_stmt || tmp_str;<br /> END LOOP;<br /><br /> RETURN(v_stmt);<br />END;<br />/</pre>Now, cut and paste the above code in a SQL*Plus session (for this example, the Sample Schema OE/OE session), and then invoke the function in a SQL*Plus session using this wrapper. The inputs are pretty self-explanatory - we are asking to sample roughly 20% of the ORDERS table, stratified based on values in the ORDERS_STATUS column, and to place the sampled output in the table SAMPLED_ORDERS.<pre>set serveroutput on<br />BEGIN<br /> DBMS_OUTPUT.PUT_LINE(<br /> generate_stratified_sql(<br /> result_table_name => 'sampled_orders',<br /> source_table_name => 'orders',<br /> strat_attr_name => 'order_status',<br /> percentage => 20,<br /> op => ' < '));<br />END;<br />/</pre>this will return the following sampling query (output formatted for easy readability):<pre>CREATE TABLE sampled_orders AS<br />SELECT ORDER_ID,ORDER_MODE,CUSTOMER_ID,ORDER_STATUS,ORDER_TOTAL,<br /> SALES_REP_ID,PROMOTION_ID<br /> FROM (SELECT /*+ no_merge */ t.*,<br /> ROW_NUMBER() OVER<br /> (PARTITION BY order_status<br /> ORDER BY ORA_HASH(ROWNUM)) RNUM<br /> FROM orders t)<br /> WHERE RNUM = 1 OR<br /> (order_status = '1' AND ORA_HASH(RNUM, (7 -1), 12345) < 1) OR<br /> (order_status = '6' AND ORA_HASH(RNUM, (9 -1), 12345) < 2) OR<br /> (order_status = '2' AND ORA_HASH(RNUM, (7 -1), 12345) < 1) OR<br /> (order_status = '5' AND ORA_HASH(RNUM, (15 -1), 12345) < 3) OR<br /> (order_status = '4' AND ORA_HASH(RNUM, (12 -1), 12345) < 2) OR<br /> (order_status = '8' AND ORA_HASH(RNUM, (17 -1), 12345) < 3) OR<br /> (order_status = '3' AND ORA_HASH(RNUM, (9 -1), 12345) < 2) OR<br /> (order_status = '7' AND ORA_HASH(RNUM, (3 -1), 12345) < 1) OR<br /> (order_status = '0' AND ORA_HASH(RNUM, (11 -1), 12345) < 2) OR<br /> (order_status = '10' AND ORA_HASH(RNUM, (5 -1), 12345) < 1) OR<br /> (order_status = '9' AND ORA_HASH(RNUM, (10 -1), 12345) < 2);</pre>If you execute this SQL in the same OE session, the resulting table is:<pre>SQL> select * from sampled_orders;<br /><br /> ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID<br />---------- -------- ----------- ------------ ----------- ------------<br /> 2363 online 144 0 10082.3<br /> 2369 online 116 0 11097.4<br /> 2403 direct 162 0 220 154<br /> 2439 direct 105 1 22150.1 159<br /> 2408 direct 166 1 309 158<br /> 2444 direct 109 1 77727.2 155<br /> 2358 direct 105 2 7826 155<br /> 2400 direct 159 2 69286.4 161<br /> 2375 online 122 2 103834.4<br /> 2450 direct 147 3 1636 159<br /> 2423 direct 145 3 10367.7 160<br /><br /> ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID<br />---------- -------- ----------- ------------ ----------- ------------<br /> 2385 online 147 4 295892<br /> 2437 direct 103 4 13550 163<br /> 2389 online 151 4 17620<br /> 2364 online 145 4 9500<br /> 2377 online 141 5 38017.8<br /> 2425 direct 147 5 1500.8 163<br /> 2394 direct 109 5 21863 158<br /> 2457 direct 118 5 21586.2 159<br /> 2426 direct 148 6 7200<br /> 2410 direct 168 6 45175 156<br /> 2427 direct 149 7 9055 163<br /><br /> ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID<br />---------- -------- ----------- ------------ ----------- ------------<br /> 2446 direct 117 8 103679.3 161<br /> 2402 direct 161 8 600 154<br /> 2434 direct 149 8 268651.8 161<br /> 2447 direct 101 8 33893.6 161<br /> 2365 online 146 9 27455.3<br /> 2372 online 119 9 16447.2<br /> 2359 online 106 9 5543.1<br /> 2368 online 149 10 60065<br /><br />30 rows selected.<br /><br />SQL> select count(*) from orders;<br /><br /> COUNT(*)<br />----------<br /> 105<br /><br />SQL></pre>That is, 30 rows (a bit more than 20%) out of 105 are sampled, using order_status as the stratifying column.<br /><br />Try this example on your own schemas and tables, and send me some feedback. The maximum length of the string returned from GENERATE_STRATFIED_SQL() is 32767. Based on the interest, I can post another function that can handle SQL strings of <= 64K in size.<br /><br />Given the explosive growth in data, sampling continues to remain an active, relevant research area in statistics, data mining, and computer science.<img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/bc1FQFrhs1M" height="1" width="1" alt=""/>Ramkumar (Ram) Krishnanhttp://www.blogger.com/profile/01032962495745306728noreply@blogger.com3http://numoraclerecipes.blogspot.com/2007/11/sampling-and-sampling-distributions.htmltag:blogger.com,1999:blog-4903856565127937599.post-74705365636060599422007-10-21T23:07:00.000-04:002008-01-04T15:13:38.033-05:00Analyzing your data for DistributionsIn the <a href="http://numoraclerecipes.blogspot.com/2007/09/st004-inferential-statistics-4.html">previous post</a>, we looked at continuous probability distributions. You can determine if the data in a given column in a given table follows a particular distribution using routines in the <a href="http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14258/d_stat_f.htm#i1000106">DBMS_STAT_FUNCS package</a>. The example shows a table with four columns - each pre-loaded with data synthesized to fit a particular continuous distribution.<br /><br />You can substitute the table name and column name for your experiment. A minor annoyance in the design of this API - most other Oracle packages will default to the current session name, but this one explicitly expects the schema name as input - which, in my case, is DMUSER - please change this to your schema name accordingly. <pre>set echo on;<br />set serveroutput on;<br />create table disttab (<br /> num1 number, num2 number, num3 number, num4 number, cat1 varchar2(2));<br />rem num1: Numbers generated based on normal distribution<br />rem num2: Numbers generated based on exponential distribution<br />rem num3: Numbers generated based on weibull distribution<br />rem num4: Numbers generated based on uniform distribution<br />insert into disttab values (9.604955, 1.05536, 4.126087, 22.950835, 'AA');<br />insert into disttab values (13.022139, 1.714142, 4.999804, 32.598089, 'AA');<br />insert into disttab values (11.572116, 3.697564, 2.81854, 24.552021, 'AA');<br />insert into disttab values (9.817124, 1.530935, 2.131106, 6.359504, 'AA');<br />insert into disttab values (10.146569, 3.295829, 1.510639, 25.218639, 'AA');<br />insert into disttab values (11.280488, 0.721109, 3.145515, 23.672146, 'BB');<br />insert into disttab values (9.26679, 1.390282, 4.074397, 11.262112, 'BB');<br />insert into disttab values (14.303472, 1.327971, 2.51907, 22.675373, 'BB');<br />insert into disttab values (11.686556, 0.225337, 2.941825, 23.582254, 'BB');<br />insert into disttab values (13.124479, 7.265271, 0.945059, 29.18001, 'BB');<br />insert into disttab values (8.601027, 7.060104, 6.078573, 14.878128, 'BB');<br />insert into disttab values (12.241662, 0.257739, 3.395142, 31.148244, 'CC');<br />insert into disttab values (13.781857, 4.281371, 1.349627, 23.862069, 'CC');<br />insert into disttab values (7.827007, 1.347487, 5.836949, 10.76229, 'CC');<br />insert into disttab values (9.106408, 1.253113, 5.116857, 6.594224, 'CC');<br />insert into disttab values (11.066785, 4.56512, 3.393899, 22.435955, 'CC');<br />insert into disttab values (10.71079, 2.700015, 1.922642, 7.635145, 'DD');<br />insert into disttab values (9.13019, 5.199126, 3.763481, 32.061213, 'DD');<br />insert into disttab values (7.873859, 0.978657, 2.268487, 1.030052, 'DD');<br />insert into disttab values (7.731724, 2.382977, 2.639425, 5.676622, 'DD');<br />insert into disttab values (12.828234, 1.867099, 3.99808, 26.000458, 'DD');<br />insert into disttab values (12.125892, 1.01285, 3.345311, 8.026281, 'DD');<br />insert into disttab values (9.800528, 5.869301, 3.840932, 29.928523, 'EE');<br />insert into disttab values (10.605782, 3.145211, 2.13718, 27.398604, 'EE');<br />insert into disttab values (14.054569, 4.089033, 2.436408, 4.483585, 'EE');<br />insert into disttab values (8.120606, 2.155303, 1.787835, 19.513588, 'EE');<br />insert into disttab values (13.093059, 0.220456, 3.456848, 24.855135, 'EE');<br />insert into disttab values (8.421441, 2.4819, 2.817669, 21.137668, 'FF');<br />insert into disttab values (11.899697, 2.507618, 3.770983, 4.016285, 'FF');<br />insert into disttab values (9.601342, 1.12639, 3.21053, 28.643809, 'FF');<br />insert into disttab values (9.32297, 10.003288,6.890515, 33.67171, 'FF');<br />insert into disttab values (6.896019, 10.76641, 3.123496, 29.077463, 'FF');<br />insert into disttab values (12.542443, 0.228756, 4.081015, 33.542652, 'FF');<br />insert into disttab values (14.038144, 7.326175, 3.53459, 11.731359, 'FF');</pre>Let us first start with <strong>Normal Distribution</strong>. The mean (same as E(x)) and standard deviation are computed for the column <em>num1</em> and provided as input.<pre>declare<br /> mean number;<br /> stdev number;<br /> sig number;<br />begin<br /> select avg(num1) into mean from disttab;<br /> select stddev(num1) into stdev from disttab;<br /> dbms_output.put_line('NORMAL DISTRIBUTION - SHAPIRO_WILKS');<br /> dbms_stat_funcs.normal_dist_fit(<br /> 'dmuser', 'disttab', 'num1', 'SHAPIRO_WILKS', mean, stdev, sig);<br /> dbms_output.put_line('Mean : ' || round(mean, 4));<br /> dbms_output.put_line('Stddev : ' || round(stdev, 4));<br /> dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));<br />end;<br />/</pre>The output of this procedure looks like this:<br />NORMAL DISTRIBUTION - SHAPIRO_WILKS<br />W value : .9601472834631918774434537597940068448938<br />Mean : 10.7426<br />Stddev : 2.1094<br />Sig : .2456<br />The high value for the W returned from the Shapiro-Wilks test (explained here) indicates an agreement with the null hypothesis that the data follows normal distribution. The significance (0.25) is also relatively high for the Shapiro-Wilks test (0.05 being the typical threshold).<br /><br />Next, <strong>Exponential distribution</strong>. For the above dataset, lambda - i.e. the rate of arrival - is provided (Note that lambda is NOT simply 1/mean(num3), it is 1/E(x) and we don't know how the expected value was computed)<pre>declare<br /> lambda number;<br /> mu number;<br /> sig number;<br />begin<br /> lambda := 0.3194;<br /> -- select 1/mean into lambda from (select mean(num2) from disttab);<br /> mu := NULL;<br /> sig := NULL;<br /> dbms_output.put_line('EXPONENTIAL DISTRIBUTION - KOLMOGOROV_SMIRNOV');<br /> dbms_stat_funcs.exponential_dist_fit(<br /> 'dmuser', 'disttab', 'num3', 'KOLMOGOROV_SMIRNOV', lambda, mu, sig);<br /> dbms_output.put_line('Lambda : ' || lambda);<br /> dbms_output.put_line('Mu : ' || mu);<br /> dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));<br />end;<br />/</pre>The output shows:<br />EXPONENTIAL DISTRIBUTION - KOLMOGOROV_SMIRNOV<br />D value : .0919745379005003387429254774811052604723<br />Lambda : .3194<br />Mu : 0<br />Sig : .9237466160<br />The low D value (0.09) and high significance (0.92) suggests that the data fits an exponential distribution well.<br /><br />Next, <strong>Weibull Distribution</strong>. The alpha input parameter to the procedure corresponds to <em>η</em> - the scale, and the mu to <em>γ</em> - the location - in the <a href="http://www.weibull.com/LifeDataWeb/weibull_probability_density_function.htm">Weibull probability density function</a> discussed earlier.<pre>declare<br /> alpha number;<br /> beta number;<br />begin<br /> alpha := 3;<br /> mu := 0;<br /> beta := 4; <br /> sig := NULL;<br /> dbms_output.put_line('.');<br /> dbms_output.put_line('WEIBULL DISTRIBUTION - KOLMOGOROV_SMIRNOV');<br /> dbms_stat_funcs.weibull_dist_fit(<br /> 'dmuser', 'disttab', 'num3', 'KOLMOGOROV_SMIRNOV', alpha, mu, beta, sig);<br /> dbms_output.put_line('Alpha : ' || alpha);<br /> dbms_output.put_line('Mu : ' || mu);<br /> dbms_output.put_line('Beta : ' || beta);<br /> dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));<br />end;<br />/</pre>The output shows:<br />WEIBULL DISTRIBUTION - KOLMOGOROV_SMIRNOV<br />D value : .2575286246007637604103723582952313687414<br />Alpha : 3<br />Mu : 0<br />Beta : 4<br />Sig : .0177026134<br />The Kolmogorov-Smirnov test does not appear emphatic enough, we will try other tests later.<br /><br />Next, <strong>Uniform Distribution</strong>.<pre>declare<br /> A number;<br /> B number;<br />begin<br /> A := 1;<br /> B := 34;<br /> sig := NULL;<br /> dbms_output.put_line('.');<br /> dbms_output.put_line('UNIFORM DISTRIBUTION - KOLMOGOROV_SMIRNOV');<br /> dbms_stat_funcs.uniform_dist_fit(<br /> 'dmuser', 'disttab', 'num4', 'CONTINUOUS', 'KOLMOGOROV_SMIRNOV', A, B, sig);<br /> dbms_output.put_line('A : ' || A);<br /> dbms_output.put_line('B : ' || B);<br /> dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));<br />end;<br />/</pre>The output shows:<br />UNIFORM DISTRIBUTION - KOLMOGOROV_SMIRNOV<br />D value : .2083979233511586452762923351158645276292<br />A : 1<br />B : 34<br />Sig : .0904912415<br />The Kolmogorov-Smirnov test does not appear emphatic enough, we will try other tests later.<br /><br />Finally, let us try fitting a small dataset to <strong>Poisson Distribution</strong>.<pre>create table pdisttab (num1 number);<br />insert into pdisttab values(1);<br />insert into pdisttab values(2);<br />insert into pdisttab values(3);<br />insert into pdisttab values(4);<br />insert into pdisttab values(5);<br />declare<br /> mean number;<br /> stdev number;<br /> sig number;<br />begin<br /> mean := 3.0;<br /> stdev := 1.58114;<br /> dbms_output.put_line('.');<br /> dbms_output.put_line('POISSON DISTRIBUTION - KOLMOGOROV_SMIRNOV');<br /> dbms_stat_funcs.poisson_dist_fit(<br /> 'dmuser', 'pdisttab', 'num1', 'KOLMOGOROV_SMIRNOV', mean, sig);<br /> dbms_output.put_line('Mean : ' || mean);<br /> dbms_output.put_line('Stddev : ' || stdev);<br /> dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));<br />end;<br />/</pre>The output of:<br />POISSON DISTRIBUTION - KOLMOGOROV_SMIRNOV<br />D value : .08391793845975204<br />Mean : 3<br />Stddev : 1.58114<br />Sig : .9999999999<br />emphatically is in agreement with the null hypothesis that the data fits a Poisson distribution.<br /><br />That was fun, was'n't it? Now, we have introduced several new terms and concepts here (esp if you reviewed the DBMS_STAT_FUNCS docs) - viz. <em>goodness-of-fit testing</em> if a given data sample <em>fits</em> a particular distribution based on a <em>null hypothesis</em>, using various <em>test types</em>, the test metrics, and the <em>significance</em> output of a test, and various other parameters. In the upcoming posts, we will try to understand these new concepts. I also presume that the various references in the <a href="http://numoraclerecipes.blogspot.com/2007/09/st004-inferential-statistics-4.html">previous post</a> and this one may have been useful to experienced/intermediate statisticians also.<img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/uIKGDP7zu8Y" height="1" width="1" alt=""/>Ramkumar (Ram) Krishnanhttp://www.blogger.com/profile/01032962495745306728noreply@blogger.com1http://numoraclerecipes.blogspot.com/2007/10/analyzing-your-data-for-distributions.htmltag:blogger.com,1999:blog-4903856565127937599.post-20405810385070700882007-09-21T23:28:00.000-04:002007-12-14T13:49:35.303-05:00Continuous Probability Distributions<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_fegsk-LclSE/RwbdCfrsbGI/AAAAAAAAB6Y/z7iCtKrO76g/s1600-h/cont.png"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_fegsk-LclSE/RwbdCfrsbGI/AAAAAAAAB6Y/z7iCtKrO76g/s320/cont.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5118021061487389794" /></a><br /><strong>Continuous Probability Distributions</strong><br />A <em>continuous probability distribution</em> is a (infinitely large) table that lists the continuous variables (outcomes) of an experiment with the relative frequency (a.k.a probability) of each outcome. Consider a histogram that plots the probability (y axis) that a particular job will get done within a time interval (x axis). As you keep making the interval shorter and more fine-grained, the step-like top of the histogram eventually melds into a curve - called the continuous probability distribution. The total area under this probability curve is 1, the probability that the value of <em>x</em> is between two values <em>a</em> and <em>b</em> is the area under <em>f(x)</em> between <em>a</em> and <em>b</em> and <em>f(x) <= 0</em> for all <em>x</em> For continuous distributions, the probability for any single point in the distribution is 0, you can compute a non-zero probability only for an interval between two values of the continuous variable <em>x</em>.<br /><br />Oracle SQL provides statistical functions to determine if the values in a given column fit a particular distribution. Before we proceed to the examples, let us look at some of the popularly known distributions.<br /><br /><strong>Normal (Gaussian) Probability Distribution</strong><br />The most common continuous probability distribution - to the point of being synonymous with the concept is <a href="http://en.wikipedia.org/wiki/Normal_distribution">Normal Probability Distribution</a>, represented graphically by a bell curve, plotted with the continuous value on the x axis, and the probability along the y axis, that is symmetric about the mean x value, with the two ends tapering off to infinity. The curve has these properties: <ul><li>The mean, median and mode are the same</li><li>the distribution is bell-shaped and symmetrical about the mean</li><li>the area under this curve is always = 1</li></ul>The <em>generic</em> normal distribution can have any mean value and standard deviation. For example, weather info may indicate an annual average rainfall in Boston of 38 inches with standard deviation 3 inches. The smaller the standard deviation (say, 2 inches instead of 3), the steeper the bell curve about the mean. Now if the mean were to shift to, say 40, the symmetric bell curve will shift two places to the right too. The <em>probability density function</em> for normal distribution is given by:<br />f(x) = 1/(σ√(2π))e<sup>-0.5 * ((x - μ)/σ)²</sup><br /><br />The <strong>Standard Normal Distribution</strong> is a special case of normal distribution with σ=0 and μ=1 as shown below (graph not to scale).<a href="http://3.bp.blogspot.com/_fegsk-LclSE/RwgjjPrsbHI/AAAAAAAAB64/-innemrEB0U/s1600-h/stdnormal.png"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://3.bp.blogspot.com/_fegsk-LclSE/RwgjjPrsbHI/AAAAAAAAB64/-innemrEB0U/s320/stdnormal.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5118380064918760562" /></a>The <em><strong>standard z-score</strong></em> is a derivative of the standard normal distribution. It is given by <em>z = (x - μ)/σ</em>. The value of <em>z</em> is then cross-checked against a <em>standard normal table</em> or grid, to arrive at the probability of a required interval - repeat <em>interval</em>. Unlike discrete random variables in a discrete probability distribution, continuous variables can have infinite values for a given event - so the probability can be computed only for an interval or range of values. Continuing with the rainfall example, queried the probability that the annual rainfall next year at Boston will be <= 40.5 inches - we will compute z = (40.5 - 38)/3 = 0.8333. Stated another way, this means that a rainfall of 40.5 inches is 0.8333 standard deviations away from the mean. From <a href="http://www.stat.tamu.edu/~mhasson/normal.pdf">the standard normal table</a>, the probability is 0.7967 - that is, roughly 80%.<br /><br />Microsoft Excel's NORMDIST() function provides this functionality, but I was surprised to find <u>no function in Oracle SQL with equivalent simplicity</u> - I'll file a feature bug after some more research. The <a href="http://www.oracle.com/technology/products/bi/olap/index.html">Oracle OLAP Option </a>provides a NORMAL() function as part of its OLAP DML interface. This Calc-like interface is different from SQL - so we will defer this for later.<br /><br /><strong>Application to Data Mining</strong> A key use of the <em>z-score</em> is as a "normalizing" data transformation for mining applications. <u>Note</u> that this concept is <em>completely unrelated</em> to <a href="http://en.wikipedia.org/wiki/Database_normalization">database normalization</a>. The stolen car example in a <a href="http://numoraclerecipes.blogspot.com/2007/09/st002-inferential-stats-1-probability.html">previous post</a> was a simple example of prediction - we used a few <em>categorical</em> attributes like a car's color, type to predict if a car will be stolen or not.<br /><br />In the business world, the applications are more grown-up and mission-critical. One example is <strong>churn prediction</strong> - i.e. finding out if a (say, wireless) customer would stay loyal with the current provider, or move on ("churn") to competitor (in which case, the current provider could try to entice him/her to stay with appropriate promos). The customer data used for such churn prediction applications contains <em>categorical</em> (e.g. gender, education, occupation) and <em>numerical</em> (e.g. age, salary, fico score, distance of residence from a metro) attributes/columns in a table. The data in these numeric columns will be widely dispersed, across different scales. For e.g. values within salary can be from 10s of thousands to several millions. Two numerical attributes will be in different scales - example salary (30K - 2 mil) vs age (1-100). Such disparity in scales, if left untreated, can throw most mining algorithms out of whack - the attributes with higher range of values will start outweighing those in the lower range during the computation of prediction. For such algorithms, the numerical data is <em>normalized</em> to a smaller range [-1, 1] or [0, 1] using the z-transform, to enable uniform handling of numerical data by the algorithm.<br /><br />Min-max and decimal scaling are other data normalization techniques. <a href="http://ir.iit.edu/~nazli/cs422/CS422-Slides/DM-Preprocessing.pdf">Here is one primer on mining data transformations</a>. We will discuss mining transformations using DBMS_DATA_MINING_TRANSFORM package and Oracle SQL in a separate post.<br /><br /><strong>Uniform Distribution</strong><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_fegsk-LclSE/Rwkas_rsbII/AAAAAAAAB7A/_02Tu09runY/s1600-h/uniform.png"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://3.bp.blogspot.com/_fegsk-LclSE/Rwkas_rsbII/AAAAAAAAB7A/_02Tu09runY/s320/uniform.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5118651811794545794" /></a>Ever waited outside a airport terminal under a sign that says "Rental-Cars/Long-term Parking - Pickup Every 10 minutes"? Your wait time is an example of uniform distribution - assuming a well-run airport, you arrive at the stop and expect to wait between 5 to max 15 minutes for your shuttle. This simplest of continuous distributions has the probability function<br /><em>f(x) = 1/ (b - a) a <= x <= b, f(x) = 0 for all other values of x</em><br />and is graphically represented as shown. The probability that a uniformly distributed random variable X will have values in the range x<sub>1</sub> to x<sub>2</sub> is:<br /><em>P(x<sub>1</sub> <= X <= x<sub>2</sub>) = (x<sub>2</sub> - x<sub>1</sub>)/(b - a), a <= x<sub>1</sub> < x<sub>2</sub> <= b</em>.<br />The mean <em>E(X) = (a+b)/2</em> and variance <em>V(X) = (b - a)²/12</em>.<br />To use the shuttle bus example, probability that the wait time will be 8 to 11 minutes is<br />P(8 <= X <= 11) = (11 - 8)/(15 - 10) = 3/5 = 0.6<br /><br /><strong>Exponential Distribution</strong><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_fegsk-LclSE/Rw7YPfrsbJI/AAAAAAAAB7I/EHoBmMJrRko/s1600-h/exp.png"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_fegsk-LclSE/Rw7YPfrsbJI/AAAAAAAAB7I/EHoBmMJrRko/s320/exp.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5120267587081235602" /></a> Consider that an event occurs with an average frequency (a.k.a. <em>rate</em>) of λ and this average frequency is constant. Consider that, from a given point in time, you wait for the event to occur. This <em>waiting time</em> follows an <em>exponential distribution</em> - depicted in the adjoining figure. The probability density function is given by: <em>f(x) = λ e<sup>-λx</sup></em> where <em>λ</em> is the frequency with which the event occurs - expressed as a particular number of times per time unit. The <em>mean</em>, or more appropriately, the <em>expected value E(X)</em> of the distribution is <em>μ = 1/λ</em>, the <em>variance</em> is <em>σ² = (1/λ)²</em>.<br /><br />Take the same shuttle bus example. If the bus does not stick to any schedule and randomly goes about its business of picking up passengers, then the wait time is exponentially distributed. This sounds a bit weird, but there are several temporal phenomena in nature that exhibit such behavior:<ul><li>The time between failure of the ordinary light bulb (which typically just blows out suddenly), or some electronic component, follows an exponential distribution. The <em>mean time between failure</em>, <em>μ</em> is an important metric in the parts failure/warranty claims domain</li><li>The time between arrivals, i.e. <em>inter-arrival time</em> of customers at any check-in counter is exponentially distributed.</li></ul> A key property of an exponentially distributed phenomenon is that it is <em>memory-less</em> - best explained with an example. Suppose you buy a 4-pack GE light at Walmart with a MTBF of 7000 hours (~ 10 months). Assume a bulb blows out, you plug in a new one, the time to next failure of this bulb will remain exponentially distributed. Say, this second bulb fails, and you change the bulb a month (24x30=720 hours) later, the time to next failure will remain exponentially distributed. The time between failure is independent of <em>when</em> the bulb failed and when (i.e. the passage of time before) it was replaced.<br /><br />The probability functions are best stated in terms of failure (survival). The probability that an item will survive until <em>x</em> units of time, given a MTBF of <em>μ</em> units can be stated as:<br /><em>P(X >= x) = e<sup>-λx</sup> x >= 0</em><br />and conversely, failing before <em>x</em> units of time is given by<br /><em>P(X <= x) = 1 - e<sup>-λx</sup> x >= 0</em><br />where <em>λ = 1/μ</em>.<br /><br />For example - if Dell claims your laptop fails following an exponential distribution with MTBF 60 months, and the warranty period is 90 days (3 months), what percentage of laptops does Dell expect to fail within the warranty period?<br />P(X <= 3 months) = 1 - e<sup>-(1/60)*3)</sup> = 0.048 ~ 5% of laptops.<br /><br /><strong>Weibull Distribution</strong><br /><a href="http://www.weibull.com/LifeDataWeb/weibull_probability_density_function.htm">Weibull distribution</a> is a versatile distribution that can emulate other distributions based on its parameter settings, and is a widely used, important tool for reliability engineering/<a href="http://en.wikipedia.org/wiki/Survival_analysis">survival analysis</a>. An extensive coverage on Reliability analysis is provided <a href="http://www.weibull.com/">here</a>. I am mentioning this distribution here mainly to set up the next post. There are <a href="http://en.wikipedia.org/wiki/Probability_distribution">several other popular probability distributions</a> - we will revisit them in the future on a needs-basis.<br /><br />For now, let us break off and look at some Oracle code in the <a href="http://numoraclerecipes.blogspot.com/2007/10/analyzing-your-data-for-distributions.html">next pos</a>t.<img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/_Bge-N2Ifpo" height="1" width="1" alt=""/>Ramkumar (Ram) Krishnanhttp://www.blogger.com/profile/01032962495745306728noreply@blogger.com0http://numoraclerecipes.blogspot.com/2007/09/st004-inferential-statistics-4.htmltag:blogger.com,1999:blog-4903856565127937599.post-73297180387336037922007-09-21T14:29:00.001-04:002008-01-04T15:13:38.033-05:00Discrete Probability DistributionsA <b>Probability Distribution</b> is a table/graph that depicts the assignment of probabilities to the assumption of specific values by a given random variable.<br /><br />The following concepts are useful to understand probability distributions:<ul><li>If Event A can occur in <em>p</em> possible ways and Event B can occur in <em>q</em> possible ways, then both A and B can occur in <em>p</em> x <em>q</em> ways.</li><li>The number of different ways that a set of objects can be arranged is called <em>Combination</em>. The number of combinations of <em>n</em> objects taken <em>r </em>at a time is given by <em>nCr = n! / (n - r)! r!</em></li><li>The number of different ways that a set of objects can be arranged in order is called <em>Permutation</em>. The number of permutations of <em>n</em> objects taken <em>r</em> at a time is given by<br /><em>nPr = n! / (n - r)!</em></li></ul>Here is a PL/SQL code snippet to compute factorial:<pre>FUNCTION factorial(p_n IN NUMBER) RETURN NUMBER IS<br />BEGIN<br /> IF p_n IS NULL OR p_n < 0 THEN<br /> RAISE_APPLICATION_ERROR(-20000, 'Invalid Input Value');<br /> ELSIF p_n <= 1 THEN<br /> RETURN 1;<br /> ELSE<br /> RETURN factorial(p_n-1) * p_n;<br /> END IF;<br />END;</pre>EX> Compute 9!<pre>select factorial(9) from dual;</pre>I was curious to see how far I can push this function - the maximum value of <em>n</em> was 83.7 with NUMBER types, and 84.7 when I changed the input parameter and return type to BINARY_DOUBLE<pre>SQL> select factorial(83.7) from dual;<br />FACTORIAL(83.7)<br />---------------<br /> 9.642E+125<br />SQL> select factorial(83.71) from dual;<br />FACTORIAL(83.71)<br />---------------<br /> ~<br />SQL> select factorial2(84.7) from dual;<br />FACTORIAL2(84.7)<br />----------------<br /> 8.167E+127<br />SQL> select factorial2(84.71) from dual;<br />FACTORIAL2(84.71)<br />-----------------<br /> Inf</pre> EX> Compute the number of combinations of 9 objects taken 3 at a time.<pre>select factorial(9)/(factorial(9-3) * factorial(3)) from dual;</pre>EX> Compute the number of different ways of arranging 9 objects taken 3 at a time.<pre>select factorial(9)/factorial(9-3) from dual;</pre><strong>Discrete Probability Distributions</strong><ul><li>The <em>discrete probability distribution</em> is a table that lists the discrete variables (outcomes) of an experiment with the relative frequency (a k a probability) of each outcome.<br />Example: Tossing a coin two times gives you the combinations (H,H), (H,T), (T,H), (T,T) and hence, the following tuples for (#Heads, Frequency, Relative_Frequency):<br />(0, 1, 1/4=0.25), (1, 2, 2/4=0.5), (2, 1, 1/4=0.25).<br />This is the probability distribution for # heads after flipping a coin twice.</li><li><em>Mean</em> or <em>Expected value</em> of the discrete probability distribution <em>μ</em> = ∑<sub>i=1_to_n</sub> <em>x<sub>i</sub> * P(x<sub>i</sub>)</em> For the coin example, <em>μ</em> = 0 * 0.25 + 1 * 0.5 + 2 * 0.25 = 1</li><li><em>Variance</em> of the discrete probability distribution <em>σ²</em> = ∑<sub>i=1_to_n</sub> <em>(x<sub>i</sub> - μ)² * P(X<sub>i</sub>)</em></li><li><em>Standard deviation</em> is the square root of the variance</li></ul><strong><em>Binomial Probability Distribution<br /></em></strong>A <em>binomial</em> or <em>Bernoulli</em> experiment is one which consists of a fixed number of trials, each independent of the other, each with only two possible outcomes, with a fixed probability for success or failure representation in each outcome. The Bernoulli process counts the number of successes over a given number of attempts, or in other words, the random variable for a Binomial distribution is the number of successes over given number of attempts.<ul><li>The probability of <em>r</em> successes in <em>n</em> trials with probability of success <em>p</em> and probability of failure <em>q</em> is given by <em>P(r, n) = (n! / (n - r)! r!) p<sup>r</sup> q<sup>(n - r)</sup></em></li><li>The binomial probability distribution is a table of (r, P(r, n)) which can be subsequently graphed, as discussed in this example</li></ul>EX> Over the next 7 days, assume a 40% chance of rain and 60% chance of no rain. The probability that it will rain exactly 2 days over the next 7 days is P(2, 7) = (7! / (7 - 2)! 2!) 0.4<sup>2</sup> 0.6<sup>(7 - 2)</sup>, which can be computed using<pre>select factorial(7) * power(0.4,2) * power(0.6,(7-2))/<br /> (factorial(7-2) * factorial(2)) p_2_7<br /> from dual;</pre>The probability that it will rain <em>at least</em> 6 days over the next 7 days is P(r >= 6) = P(6,7)+P(7,7), computed using<pre>select (factorial(7) * power(0.4,6) * power(0.6,(7-6))/<br /> (factorial(7-6) * factorial(6))) +<br /> (factorial(7) * power(0.4,7) * power(0.6,(7-7))/<br /> (factorial(7-7) * factorial(7))) p_r_ge_6<br /> from dual;</pre>Finally, the probability that it will rain no more than 2 days over the next 7 days is P(r <= 2) = P(0,7) + P(1,7) + P(2,7)<ul><li>The <em>mean</em> of a binomial distribution is <em>μ = np</em></li><li>The <em>standard deviation</em> is <em>σ² = npq</em></li></ul> Excel has a function <em>BINOMDIST(r, n, p, cumulative)</em>. p is the probability of success, set cumulative=TRUE if you want the probability of r or fewer successes, set cumulative=FALSE if you want exactly r successes. Here is the PL/SQL version:<pre>FUNCTION binomdist(r NUMBER, n NUMBER, p NUMBER, cumulative BOOLEAN DEFAULT FALSE) RETURN NUMBER IS<br /> ri NUMBER;<br /> ret NUMBER;<br /> fn NUMBER;</br><br />BEGIN<br /> ret := 0;<br /> fn := factorial(n);<br /> FOR ri IN REVERSE 0..r LOOP<br /> ret := ret + (fn * power(p, ri) * power((1-p),(n - ri)))/<br /> (factorial(n - ri) * factorial(ri));<br /> IF NOT cumulative THEN<br /> EXIT;<br /> END IF;<br /> END LOOP;<br /> RETURN ret;<br />END binomdist;</pre><p><strong><em>Poisson Probability Distribution</em></strong><br />The random variable for Poission distribution is the number of occurrences of the event over a measurable metric (time, space). In a Poisson process, the (measured) <em>mean</em> number of occurences of an event is the same for each interval of measurement, and the number of occurrences in a particular interval are independent of number of occurrences in other intervals.</p><ul><li>The probability of exactly <em>r</em> occurrences over a given interval is given by <em>P(r) = μ<sup>r</sup> * e<sup>(-μ)</sup>/ r!</em></li><li>The <em>variance</em> of the Poisson distribution is the same as the (observed) mean.</li><li>A <em>goodness of fit</em> test helps verify if a given dataset fits the Poisson distribution</li></ul>A simple example of a Poisson process is customer arrival at your favorite coffee shop. Assume that you know that an average of 25 customers walk into a Dunkin Donuts every hour, then the likelihood of exactly 31 customers walking into the customer in the next hour is<pre>select power(25,31) * exp(-25)/factorial(25) p_31<br /> from dual;</pre>Just as we saw in Binomial distribution, the probability that no more than 31 customers will walk into the coffee shop is P(r <= 31) = P(0)+P(1)+..+P(31). Inversely, the probability that <em>at least</em> 31 customers will walk into the coffee shop is P(r >= 31) = 1 - P(r < 31). Obviously, this leads up to the need for a function similar to POISSON(r, μ, cumulative) in Excel - where cumulative = FALSE indicates computation of exactly r occurrences, and cumulative = TRUE indicates r or fewer.<pre>FUNCTION poissondist(r NUMBER, mu NUMBER,<br /> cumulative BOOLEAN DEFAULT FALSE) RETURN NUMBER IS<br /> ri NUMBER;<br /> ret NUMBER;<br />BEGIN<br /> ret := 0;<br /> FOR ri IN REVERSE 0..r LOOP<br /> ret := ret + (power(p, ri) * exp(-mu)/factorial(ri));<br /> IF NOT cumulative THEN<br /> EXIT;<br /> END IF;<br /> END LOOP;<br /> RETURN ret;<br />END poissondist;</pre><p><em>Poisson approximation</em> - a Poisson distribution can be used to approximate a Binomial distribution if the number of trials (in the binomial experiment) is >= 20 and the probability of success <em>p</em> is <= 5%.<img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/HgjTEckogOY" height="1" width="1" alt=""/>Ramkumar (Ram) Krishnanhttp://www.blogger.com/profile/01032962495745306728noreply@blogger.com0http://numoraclerecipes.blogspot.com/2007/09/st002-probability-distributions.htmltag:blogger.com,1999:blog-4903856565127937599.post-10064391548277595942007-09-02T11:56:00.000-04:002008-01-04T15:13:38.033-05:00Random Variables, Probability, Bayes Theorem, Naive Bayes ModelsIn contrast to descriptive statistics, <strong>Inferential statistics</strong> describes the <em>population</em> based on information gleaned from a <em>sample</em> taken from the population. Fundamental to understanding statistical inference is the concept of probability.<br /><br />An <em>experiment</em> is the process of measuring/observing an activity. An <em>outcome</em> is a particular result of the experiment - outcomes are also called <b>Random Variables</b>. Random variables can be <b>discrete</b> - when it can assume a countable number of values (e.g. one of six outcomes from rolling a dice) or <b>Continuous</b> - when the variable can assume uncountably infinite values in a given range of values (time, a person's height). The <b>Sample space</b> is all possible outcomes of the experiment. An <b>event</b> is an outcome of interest.<br /><br />The <strong>Probability</strong> of Event A occurring is: P(A) = # of possible outcomes in which Event A occurs/ Total # outcomes in the sample space.<br /><br /><strong>Basic properties of probability:</strong><ul><li>P(A) = 1 implies Event A will occur with certainty</li><li>P(A) = 0 implies Event A will not occur with certainty</li><li>0 >= P(A) >= 1</li><li>The sum of all probabilities for events in the sample space must be 1</li><li>All outcomes in the sample space that are not part of Event A is called the <em>complement</em> of Event A (named A'). P(A') = 1 - P(A)</li><li>Given two events A and B, P(A) or P(B) - i.e. probabilities of each of the events occuring - without the knowledge of the other events occurrence - is called the <em>prior probability</em>.</li><li>Given two events A and B, the probability of event A occurring given that event B has occurred - denoted by P(A / B) - is called the <em>conditional probability</em> or <em>posterior</em> <em>probability </em>of Event A given that Event B has occurred. On the flip side, if P(A / B) = P(A), then events A and B are termed <em>independent</em>.</li><li>Given two events A and B, the probability of both A and B occurring at the same time is called the <em>joint probability</em> for A and B, computed as P(A and B) = P(A) * P(B)</li><li>Given two events A and B, the probability of either A or B occurring is called the <em>union</em> of events A and B. If events A and B do not occur at the same time (i.e. are <em>mutually </em><em>exclusive</em>), then P(A or B) = P(A) + P(B). If events A and B occur at the same time, i.e. are not mutually exclusive, then P(A or B) = P(A) + P(B) - P(A and B)</li><li>Law of Total Probability: P(A) = P(A / B)P(B) + P(A / B')P(B')</li><li>The <em>Bayes Theorem</em> for probabilities provides the ability to reverse the conditionality of events and compute the outcome:<br />P(A / B) = P(A) * P(B / A) / (P(A) * P(B / A) + P(A') * P(B / A'))</li></ul>Note that the act of finding the probability for given event is tantamount to <em>predicting that a given event will occur with a given level of certainty or chance</em> - quantified by the probability. This is a good segue to look at a real <em>business</em> problem and its solution based on Bayes theorem.<br /><br />A modern, <em>predictive</em>, loan processing application builds analytical models based on millions of historical loan applicant records (<em>training</em> data), and uses these models to <em>predict</em> the credit-worthiness (a k a risk of loan default) of an applicant by <em>classifying</em> the applicant into <em>Low, Medium, High</em> or such risk categories. In data mining lingo, a new applicant record is now <em>scored</em> based on the model. At the time of this writing (Aug-Sep 2007), the sub-prime lending woes and its effect on US and world markets is the main story. The trillions lost in this mess is fodder for Quant skeptics/detractors, but as a BusinessWeek cover story ("Not So Smart" - Sep 3 2007) explains, the problem was not analytics per se - the problems were with how various managements (mis)used analytics or (mis)understood their data.<br /><br />Returning to probability concepts, instead of A and B, the events become A and B<sub>i</sub>, i=1..n. The event A (or more appropriately for this example, the <em>target variable</em> <strong>Risk</strong>) is a dependent variable that assumes one of discrete values (called <em>classes</em> - <strong>low, medium, high</strong>) based on <em>predictor</em> variables B<sub>i</sub> through B<sub>n</sub> (age, salary, gender, occupation, and so on). The probability model for this <em>classifier</em> is P(A / B<sub>1</sub>,..,B<sub>n</sub>). We just shifted the language from statistics into the realm of data mining/predictive analytics. The Bayes theorem intrinsically assumes conditional dependence between B<sub>i</sub> through B<sub>n</sub>. Now if <em>n</em> is large, or if each B<sub>i</sub> takes on a large number of values, computing this model becomes intractable.<br /><br />The <strong>Naive Bayes</strong> probabilistic model greatly simplifies this by making a naive/strong assumption that B<sub>i</sub> through B<sub>n</sub> are <em>conditionally independent</em> - <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">the details are provided here</a>. You can build a Naive Bayes model using the Oracle Data Mining Option, and predict the value for a target variable in new records using SQL Prediction Functions. The following example illustrates the process.<br />EX> Given a small, synthetic dataset about the attributes of stolen cars, predict if a particular car will be stolen - based on its attributes.<pre>create table stolen_cars(<br /> id varchar2(2),<br /> color varchar2(10),<br /> ctype varchar2(10),<br /> corigin varchar2(10),<br /> stolen varchar2(3));</pre>Table created.<pre>insert into stolen_cars values ('1', 'Red','Sports','Domestic','yes');<br />insert into stolen_cars values ('2', 'Red','Sports','Domestic','no');<br />insert into stolen_cars values ('3', 'Red','Sports','Domestic','yes');<br />insert into stolen_cars values ('4', 'Yellow','Sports','Domestic','no');<br />insert into stolen_cars values ('5', 'Yellow','Sports','Imported','yes');<br />insert into stolen_cars values ('6', 'Yellow','SUV','Imported','no');<br />insert into stolen_cars values ('7', 'Yellow','SUV','Imported','yes');<br />insert into stolen_cars values ('8', 'Yellow','SUV','Domestic','no');<br />insert into stolen_cars values ('9', 'Red','SUV','Imported','no');<br />insert into stolen_cars values ('10', 'Red','Sports','Imported','yes');<br />commit;</pre>Commit complete.<pre>begin<br /> dbms_data_mining.create_model(<br /> model_name => 'cars',<br /> mining_function => dbms_data_mining.classification,<br /> data_table_name => 'stolen_cars',<br /> case_id_column_name => 'id',<br /> target_column_name => 'stolen');<br />end;<br />/</pre>PL/SQL procedure successfully completed.<pre>create table new_stolen_cars (<br /> id varchar2(2),<br /> color varchar2(10),<br /> ctype varchar2(10),<br /> corigin varchar2(10));</pre>Table created.<pre>insert into new_stolen_cars values ('1', 'Red','SUV','Domestic');<br />insert into new_stolen_cars values ('2', 'Yellow','SUV','Domestic');<br />insert into new_stolen_cars values ('3', 'Yellow','SUV','Imported');<br />insert into new_stolen_cars values ('4', 'Yellow','Sports','Domestic');<br />insert into new_stolen_cars values ('5', 'Red','Sports','Domestic');<br />commit;</pre>Commit complete.<pre>select prediction(cars using *) pred,<br /> prediction_probability(cars using *) prob<br /> from new_stolen_cars;<br />-- Results<br />PRE PROB<br />--- ----------<br />no .75<br />no .870967746<br />no .75<br />no .529411793<br />yes .666666687</pre>The query scores each row in the new_stolen_cars table, returning the prediction, and the certainty of this predition. This dataset is very small, but a cursory glance at the results indicates that the predictions are correct - based on the training data. For example, the model predicts 'No' for a domestic yellow sports car - the training data has no such instance. The model predicts 'Yes' for a domestic red sports car, with > 50% certainty - the training data does support this prediction. You can obtain the details of this model using:<pre>select *<br /> from table(dbms_data_mining.get_model_details_nb('cars'));</pre> The SQL output is a collection of objects and may not look pretty at first glance. But once you understand the schema of the output type - viz. <pre>dm_nb_details</pre> - you can decipher the output to the following simple format:<pre>STOLEN<br />no<br />.5<br />DM_CONDITIONALS(<br />DM_CONDITIONAL('COLOR', NULL, 'Red', NULL, .4),<br />DM_CONDITIONAL('COLOR', NULL, 'Yellow', NULL, .6),<br />DM_CONDITIONAL('CTYPE', NULL, 'SUV', NULL, .6),<br />DM_CONDITIONAL('CORIGIN', NULL, 'Domestic', NULL, .6),<br />DM_CONDITIONAL('CORIGIN', NULL, 'Imported', NULL, .4),<br />DM_CONDITIONAL('CTYPE', NULL, 'Sports', NULL, .4))<br /><br />STOLEN<br />yes<br />.5<br />DM_CONDITIONALS(<br />DM_CONDITIONAL('COLOR', NULL, 'Red', NULL, .6),<br />DM_CONDITIONAL('CORIGIN', NULL, 'Imported', NULL, .6),<br />DM_CONDITIONAL('CORIGIN', NULL, 'Domestic', NULL, .4),<br />DM_CONDITIONAL('CTYPE', NULL, 'Sports', NULL, .8),<br />DM_CONDITIONAL('CTYPE', NULL, 'SUV', NULL, .2),<br />DM_CONDITIONAL('COLOR', NULL, 'Yellow', NULL, .4))</pre>This shows the target variable (STOLEN), its value ('yes', 'no'), the prior probability (0.5), and the conditional probability contributed by each predictor/predictor value pair towards each target/class value.<br /><br />Such ability to score transactional customer data directly from the database (in other words, deploy the model right at the source of customer data) with such simplicity is a key Oracle differentiator and competitive advantage over standalone data mining tools. For more on ODM, consult the references provided in this blog.<img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/peTp5_kEQSk" height="1" width="1" alt=""/>Ramkumar (Ram) Krishnanhttp://www.blogger.com/profile/01032962495745306728noreply@blogger.com0http://numoraclerecipes.blogspot.com/2007/09/st002-inferential-stats-1-probability.htmltag:blogger.com,1999:blog-4903856565127937599.post-90570908413243587532007-09-01T21:16:00.000-04:002008-01-04T15:13:38.033-05:00Descriptive Statistics<span style="font-style: italic;">Descriptive</span> <a href="http://en.wikipedia.org/wiki/Statistics">statistics</a> summarizes and displays information about just the observations at hand - i.e. the sample and population are the same. Using data in the sample schema, we will quickly run through descriptive statistics concepts.<br /><br /><strong>Displaying descriptive statistics</strong><br /><em>Frequency distribution</em> is a table that organizes a number of values into intervals (classes). A <em>histogram</em> is the visual equivalent of frequency distribution - a bar graph that represents the number of observations in each class as the height of each bar.<br />EX> Compute the frequency distribution of all employee salaries - across 10 classes.<pre>select intvl, count(*) freq<br /> from (select width_bucket(salary,<br /> (select min(salary) from employees),<br /> (select max(salary)+1 from employees), 10) intvl<br /> from HR.employees)<br />group by intvl<br />order by intvl;<br />-- Result from this query<br />INTVL FREQ<br />----- --------<br />1 46<br />2 10<br />3 23<br />4 15<br />5 8<br />6 2<br />7 2<br />10 1<br />8 rows selected.</pre><em>Relative frequency distribution</em> provides the number of observations in each class as a percentage of the total number of observations.<br />EX> Compute the relative frequency distribution of all employee salaries - across 10 classes.<pre>with Q_freq_dist as<br />(select intvl, count(*) freq<br /> from (select width_bucket(salary,<br /> (select min(salary) from employees),<br /> (select max(salary)+1 from employees), 10) intvl<br /> from HR.employees)<br />group by intvl<br />order by intvl)<br />--<br />select intvl, freq/(select count(*) from employees) rel_freq<br /> from Q_freq_dist;<br />-- Result from this query<br />INTVL REL_FREQ<br />----- ----------<br />1 .429906542<br />2 .093457944<br />3 .214953271<br />4 .140186916<br />5 .074766355<br />6 .018691589<br />7 .018691589<br />10 .009345794<br />8 rows selected.</pre><em>Cumulative frequency distribution</em> provides the percentage of observations that are less than or equal to the class of interest.<br />EX> Compute the cumulative frequency distribution of all employee salaries - across 10 classes.<pre>with Q_freq_dist as<br />(select intvl, count(*) freq<br /> from (select width_bucket(salary,<br /> (select min(salary) from employees),<br /> (select max(salary)+1 from employees), 10) intvl<br /> from HR.employees)<br />group by intvl<br />order by intvl),<br />--<br />Q_rel_freq_dist as<br />(select intvl, freq/(select count(*) from employees) rel_freq<br /> from Q_freq_dist)<br />--<br />select intvl,<br /> sum(rel_freq) over (order by intvl<br /> rows between unbounded preceding and current row) cum_freq<br /> from Q_rel_freq_dist;<br />-- Result from this query<br />INTVL CUM_FREQ<br />----- ----------<br />1 .429906542<br />2 .523364486<br />3 .738317757<br />4 .878504673<br />5 .953271028<br />6 .971962617<br />7 .990654206<br />10 1<br />8 rows selected.</pre>The width_bucket function creates equi-width histograms. A complementary function, <em>NTILE</em>, helps you bin values into intervals of equal height - i.e. with same count of values in each bin. This is useful for computing <em>quartiles</em>, <em>quintiles</em> etc.<br />EX> The following query bins the 107 records/observations into approx 10 <em>equi-height</em> bins.<pre>select htbin, count(*) ht<br /> from (select ntile(10) over (order by salary) htbin<br /> from hr.employees)<br />group by htbin<br />order by htbin;<br />HTBIN HT<br />----- ------<br />1 11<br />2 11<br />3 11<br />4 11<br />5 11<br />6 11<br />7 11<br />8 10<br />9 10<br />10 10<br />10 rows selected.</pre>Anyone who has given a competitive exam (SAT thro GRE/GMAT) should be familiar with the term <em>percentile</em>. Given <em>n</em> data points, the <em>P</em>th percentile represents the value which resides above <em>P</em>% of the values, and is given by <em>percentile = (n+1) * P/100</em><br />EX> Assume a student took the <a href="http://www.ets.org/portal/site/ets/menuitem.fab2360b1645a1de9b3a0779f1751509/?vgnextoid=b195e3b5f64f4010VgnVCM10000022f95190RCRD">GRE </a>test in 2004 and received an overall GRE score is 2150 (out of the possible 2400) and ranked at 89th percentile. What does this mean? Based on information provided <a href="http://www.ets.org/Media/Tests/GRE/pdf/04-05_factors.pdf">here</a>, it means, of the worldwide total of 408,948, the student ranked above (408948+1) * 89/100 ~= 363965 examinees, and conversely, 408948-363964= 44982 students ranked above this student. You can compute percentiles in Oracle SQL using CUME_DIST() as follows.<br />EX> Rank all sales reps and their managers by their salary percentile.<pre>select job_id, last_name, salary,<br /> round(cume_dist() over<br /> (partition by job_id order by salary) * 100) as pctile<br /> from employees<br /> where job_id like '%SA_%'<br />order by job_id, pctile desc, salary;<br />JOB_ID LAST_NAME SALARY PCTILE<br />------ -------------- ------ ------<br />SA_MAN Russell 14000 100<br />SA_MAN Partners 13500 80<br />SA_MAN Errazuriz 12000 60<br />SA_MAN Cambrault 11000 40<br />SA_MAN Zlotkey 10500 20<br />SA_REP Ozer 11500 100<br />SA_REP Abel 11000 97<br />SA_REP Vishney 10500 93<br />SA_REP Tucker 10000 90<br />SA_REP Bloom 10000 90<br />SA_REP King 10000 90<br />...</pre>The <em>partition by job_id</em> enables to rank within job_id. You can infer things like Ms. Bloom has 90% of her fellow sales rep earning as much or less than her (<em>order by salary</em> clause), Mr. Zlotkey is in the last quintile for managers, and so on.<br />EX> Suppose a new job candidate names his salary requirements - say 7.5K - you can use the same function - in what Oracle calls its <em>aggregate</em> usage to find out what percentile this salary number would fit in, as follows:<pre>select round(cume_dist(7500) within group order by (salary) * 100, 0) offer_pctile<br /> from employees<br />OFFER_PCTILE<br />------------<br /> 64</pre>If the new candidate joins the sales force (a.k.a when his record is entered into this table), his salary will be in the 64th percentile. The <em>PERCENT_RANK()</em> function is similar to <em>CUME_DIST()</em> function - see the SQL Reference Manual in the Oracle docs for details.<br /><br /><strong>Measures of Central Tendency - Mean, Median, Mode</strong><br /><strong>Mean or Average</strong> = sum(all observations)/ count_of_observations. Note the difference between sample mean and population mean for use later.<br />EX> Compute average salary of employees in the company.<pre><br />select min(salary) min_s, max(salary) max_s,<br /> round(<strong>avg</strong>(salary),2) avg_s<br /> from HR.employees;</pre>EX> Compute average quantity sold of products from all sale transactions.<pre>select prod_id, count(*), sum(quantity_sold) sum_qty,<br /> avg(quantity_sold) avg_qty<br /> from SH.sales<br />group by prod_id<br />order by prod_id;</pre>We can also compute the <strong>Mean of grouped data from a frequency distribution</strong> using<br />x_bar = (Σ<sub>_1_to_m</sub>(freq<sub>i</sub> * x<sub>i</sub>))/Σ<sub>_1_to_m</sub>(freq<sub>i</sub>) where <span style="font-style: italic;">m </span>is the number of classes.<br />EX> Compute the mean of the frequency distribution discussed above.<pre>with Q_freq_dist as (<br />select intvl, count(*) freq<br /> from (select width_bucket(salary,<br /> (select min(salary) from employees),<br /> (select max(salary)+1 from employees), 10) intvl<br /> from HR.employees)<br />group by intvl<br />order by intvl)<br />--<br />select sum(intvl * freq)/sum(freq)<br /> from Q_freq_dist</pre><strong>Median</strong> is the value in a dataset in which half the observations have a higher value and the other half a lower value.<br />EX> Compute the median salary per department.<pre>select department_id, <strong>median</strong>(salary)<br /> from HR.employees<br />group by department_id;</pre><strong>Mode</strong> is the observation in the dataset that occurs most frequently.<br />EX> Compute the mode per department.<pre>select department_id, <strong>stats_mode</strong>(salary)<br /> from HR.employees<br />group by department_id;</pre><strong>Measures of Dispersion</strong><br />These metrics measure how a set of data values are dispersed around (a k a deviate from) the measure of central tendency (typically the mean). <strong>Variance</strong> represents the relative distance between the data points and the mean of a dataset, and is computed as the sum of squared deviation of each data point from the mean. σ<sup>2</sup> = (Σ<sub>1_to_n</sub> (x<sub>i</sub> - x_bar)²)/(n-1)<br />EX> Compute the variance, sample variance, and population variance of employee salaries.<pre>select variance(salary), var_samp(salary), var_pop(salary)<br /> from HR.employees;</pre><strong>Standard deviation</strong> is the square root of variance, and generally more useful than variance because it is in units of the original data - rather than its square.<br />EX> Compute the standard deviation, sample standard deviation, and population standard deviation of employee salaries.<pre>select stddev(salary), stddev_samp(salary), stddev_pop(salary)<br /> from HR.employees;</pre>For several commonly occuring datasets, the values tend to cluster around the mean or median - so that the distribution takes on the shape of a symmetrical bell curve. In such situations, the <em>Chebyshev's Theorem</em> states that <em>at least</em> (1 - 1/k²) x 100%, k > 1, will fall within k-standard-deviations from the mean. If the distribution is more or les symmetrical about the mean (i.e. closer to <em>normal distribution</em>) then the more rigorous <em>empirical</em> rule states that <em>approximately</em> 68% of the data will fall within 1 standard deviation of the mean, and 95% of the data within 1 standard deviations of the mean, and 99.7% of the data within 3 standard deviations of the mean.<br /><br /><strong>Skewness</strong> is a measure of the unevenness or asymmetry of the distribution. For a perfectly symmetric distribution, the mean is the same as mode is same as the median. Generally, for a skewed distribution, it is observed that the mean is to side of the median is to the side of the mode - the side being right (=ve skew) or left (-ve skew) depending on the skew. Skewness of a given population is computed by Skew = Σ<sub>1_to_N</sub> (x<sub>i</sub> - μ)/σ)³/N, where μ is the population mean, or σ is the standard deviation.<br /><strong>Kurtosis</strong> is the measure of peakedness of the distribution of the population. Kurtosis = Σ<sub>1_to_N</sub>(x<sub>i</sub> - μ)/σ)<sup>4</sup>/N.<br />EX> Compute the skewness and kurtosis for the employee salary.<pre>-- Skewness<br />select avg(val3)<br /> from (select power((salary - avg(salary) over ())/<br /> stddev(salary) over (), 3) val3<br /> from employee);<br />--<br />-- Kurtosis<br />select avg(val4)<br /> from (select power((salary - avg(salary) over ())/<br /> stddev(salary) over (), 4) val4<br /> from employee);</pre>This brevity is made possible thanks to Oracle SQL analytical functions (<em>avg(salary) over ()</em> implies compute average over all rows in the employees). Note that you can use the partition and order by clauses inside the () to group data across different employee classes and scope the computation to the group if required. From a performance standpoint, note that the computation intrinsically demands multiple scans (2 at the least) over the employees table (or the tables behind the employees view) - so for a potentially large employees table, the computations may involve use of sort (i.e. on-disk) memory rather than just the PGA (the database' equivalent of a PC's RAM). For more on such performance implications of SQL, begin your education with Oracle Concepts or SQL Tuning Guide, or look up the various references provided here.<img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/n6siKxVBm5o" height="1" width="1" alt=""/>Ramkumar (Ram) Krishnanhttp://www.blogger.com/profile/01032962495745306728noreply@blogger.com0http://numoraclerecipes.blogspot.com/2007/09/st001-descriptive-statistics.htmltag:blogger.com,1999:blog-4903856565127937599.post-7544338099624577822007-09-01T12:51:00.000-04:002007-10-11T22:37:45.524-04:00Dev Essentials<strong>Pre-requisites</strong>:<br />This blog is aimed at an Oracle RDBMS user with a beginner to intermediate level understanding of SQL and PL/SQL languages, and basic Oracle Concepts. If you are an analyst weaned on Excel or other GUI workbenches with some basic programming ability, but know next to nothing about databases, <a href="http://philip.greenspun.com/panda/databases-choosing">read this friendly intro</a> (ignore the author's controversial opinions), kick the tires on a <a href="http://www.oracle.com/technology/products/database/xe/index.html">free edition of Oracle</a>, and return here.<br /><br /><strong>Database</strong>:<br />This blog is based on Oracle10gR2 and higher. Oracle gives away an awful lot of analytics features as part of its free (XE/Express) or low-cost (Standard) editions - but if a highlighted feature is available only in a particular higher-end edition, we will mention it.<br /><br /><span style="font-weight: bold;">Datasets:</span><br />We will use the data in the <a href="http://download.oracle.com/docs/cd/B19306_01/server.102/b14198/toc.htm">Sample Schema </a>that ships with the Oracle database for most of our examples. Enable SELECT privileges on all the tables in the various sample schemas. When this data is inadequate to explain a concept, we will import data from other public sources.<br /><br /><strong>Loading data:</strong><br />For small datasets outside the sample schema, we will use INSERT statements. For larger tables, we'll use SQL*Loader and/or External Tables. <a href="http://www.dbspecialists.com/presentations/load_faster.html">Here </a>is a quick primer on data load (Note: the extract_files_dir directory object should be created by the DBA using the<a href="http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_5007.htm#i2061958"> CREATE DIRECTORY</a> statement). For more details, consult Oracle docs.<br /><br /><span style="font-weight: bold;">DevEnv</span>:<br />The baseline is Xemacs- SQL*Plus - Oracle10gR2 Enterprise Edition - RHEL on a 2-processor Core duo Dell COTS (common off-the-shelf) machine. Over time, we will try the free SQL*Developer tool for all development. Software downloads of the database and related products are available <a href="http://www.oracle.com/technology/software/products/database/index.html">here</a>.<br /><br /><strong>Output, Visualization of results:<br /></strong>One major weakness of Oracle compared to best-of-breed statistical/scientific products is the charting and visualization capability in its disparate GUI products. Oracle has several of them - SQL*Developer, Application Express, Discoverer, Warehouse Builder, Data Miner, Enterprise Manager, and a whole slew of tools from the Siebel BI (a k a BI EE) and Hyperion acquisitions. These tools address specific areas of data or application management. As the #1 database company, it would be fantastic if Oracle makes a strategic investment to build modular, data visualization solutions across its product stack - integrating the best elements from it existing portfolio of GUI-based products, and third party tools like <a href="http://www.tibco.com/software/business_intelligence/default.jsp">Spotfire</a>, <a href="http://www.tableausoftware.com/">Tableau</a>, or <a href="http://www.dundas.com/">Dundas</a> (now part of MSFT). But Visualization for BI is still in its nascent stages, so this may not happen soon. For purposes of this blog, we will embed screenshots from Excel, SQL*Dev, App Express, Data Miner, R and other Oracle GUIs as applicable. We will use <a href="http://www.gnome.org/projects/dia/">Dia </a>for diagramming.<br /><br /><strong>Code Quality:</strong><br />I will test every code fragment I publish here, but I am not the best SQL or PL/SQL developer around - so I welcome comments/corrections on better-expressed and/or more efficient queries for a particular problem. Consult AskTom or search for <a href="http://www.celko.com/articles.htm">Celko's articles</a> for SQL tips, and Steve Feuerstein's articles/books for PL/SQL coding tips.<img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/GkliLUgYduw" height="1" width="1" alt=""/>Ramkumar (Ram) Krishnanhttp://www.blogger.com/profile/01032962495745306728noreply@blogger.com0http://numoraclerecipes.blogspot.com/2007/09/essentials-post-legend-data-load-dev.htmltag:blogger.com,1999:blog-4903856565127937599.post-739330761325957362007-09-01T09:07:00.000-04:002007-10-18T17:56:36.656-04:00PrefaceGreetings.<br /><br />This blog is a personal quest to learn/discover/enable the Oracle RDBMS as an analytical/scientific computing engine. While Oracle does not (yet) match the breadth and depth of a popular, best-of-breed platform like SAS for analytics, or a scientific computing suite like Matlab, it offers enough facilities in Oracle SQL and PL/SQL to build a competent business analytics platform, if not yet a comprehensive platform for scientific research.<br /><br />The answer to the central question of "I am an Oracle database professional ...Why Should I Know Analytics?" is simply "To keep yourself relevant ten years from now". A more lucid, lengthy answer can be found in the books in the Readings section. After a 10-15 year journey through database design, E-R diagrams, Certifications on every database technology, managing terabyte warehouses, you are bound to come to a juncture in your professional life where you ask "We store at least 7-10 years worth of transactional data (as required by Sarbanes-Oxley).. is there anything I can learn about my customers/products/employees <em>implicitly</em> from all this stored data - beyond what I code into my applications?" At this point in time, you begin to become an analyst.<br /><br />The central question of "Why <strong>In-Database</strong> Analytics?" deserves a dedicated post rather than a glib answer in this column. But the basic premise is this: The wider and deeper you <em>deploy</em> intelligence across a company, the smarter the company. Any data-driven intelligence that you <em>compute at the source</em> - i.e. where your enterprise data resides - becomes corporate intelligence consumable across the enterprise. The higher you go in your stack to compute this intelligence, the more insular this intelligence becomes to the rest of the company - SOA and other mid-tier technologies notwithstanding. So the proposition is simple - 'If you are an Oracle DBA/Developer/Analyst/CIO, explore the capabilities of the database as an analytical platform before you think about spending on tiered, best-of-breed products. Save that money for the best business analysis and data analysis talent that you can find'.<br /><br />Over the years, customers/consultants have wanted to incorporate their own algorithms into the database. Oracle has been extensible since 8i, offering a framework to link your C-based Math/Analytics libraries into the database. But unlike Illustra (the competitive Object-relational database of the 8i era), Oracle (wisely) places a user application in its own process sandbox - it does not allow a user to link their "untrusted" C library directly into the Oracle binary - mainly to prevent some buggy/malicious C fragment from crashing a million dollar production system.<br /><br />The consequence is that your application has to marshal the data between your library and the database through an ExtProc interface. Obviously, this critical data pinch-point has not helped spawn a large collection of extensible third-party applications. However, internal to Oracle, the framework has been a great success, enabling an evolutionary development of products that manage unstructured data - Oracle's Spatial Option, Text, XML, Enterprise Search, PL/SQL Table functions and such features directly or indirectly use this framework or its design concepts.<br /><br />There is a secondary motivation to this blog. 47% of the world's corporate relational data is stored in Oracle. If even a fraction of this installed base begins to demand more foundational analytics infrastructure such as high performance matrix computation, concurrent processing, math libraries, and such building blocks for scientific computing - this may nudge Oracle towards becoming a world-class analytics engine.<br /><br />Returning to more humbler pragmatics for the blog, rather than take the classic extensibility route, we'll code techniques using Oracle10gR2 (and later) SQL and PL/SQL. PL/SQL as a language has dramatically matured over the past 5 years, with support for native compilation, IEEE floating point arithmetic, function/rowset caching, and improved scalability and performance. Where expedient/convenient, we will package any open-source Java libraries as Java stored procedures.<br /><br />If you have read thus far, and have the slightest inkling of the breadth and depth of quantitative techniques out there, you will agree that the mission statement for this blog is downright audacious, or Quixotic, or stupid, or pick your adjective... <add>So to quickly bootstrap the blog, and give myself a minimal chance of success, I have afforded myself these concessions:<ul><li>I will scope my efforts to coding a particular technique, with a best-effort attempt to find an appropriate example of its applicability. I will try to provide pointers to relevant articles/case studies from domain experts (CRM/Retail/Supply Chain etc). Comments and discussions will hopefully add/illuminate the various use cases.</li><li>I will refer books, other blogs, websites for various topics, and credit the authors - either as links, or by including their books/publications in the References section below</li><li>Once I stabilize the blog content, I will welcome co-authors - there is a ton of material to cover. So if you are interested, ping me a mail.</li></ul> Wish me luck, and hope you find this site interesting and useful.<br /><br />Best regards, Ram</add></p><img src="http://feeds.feedburner.com/~r/NumoracleRecipes/~4/8vJaCWP6dbU" height="1" width="1" alt=""/>Ramkumar (Ram) Krishnanhttp://www.blogger.com/profile/01032962495745306728noreply@blogger.com0http://numoraclerecipes.blogspot.com/2007/09/foreword.html