tag:blogger.com,1999:blog-15423218867563383482018-05-29T02:10:25.901-05:00GSB420 - Business StatisticsGSB 420 - Notes from Applied Quantitative Analysis - Winter 2008Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.comBlogger91125Gsb420https://feedburner.google.comtag:blogger.com,1999:blog-1542321886756338348.post-26550181976986226582011-11-06T11:58:00.019-06:002012-04-26T15:50:36.082-05:00The New GSB420 BlogAfter several months, even years, of inactivity, I've decided to revive and revise this blog and restructure it in a logical sequence, rather than continue with the reverse chronological order which is the standard format for blogs. My goal is to make it easier for readers to find the material that they're looking for.<br /><br />Therefore, this will be the top-most entry in the blog and it will contain a table of contents with links to the articles in the logical order in which you will probably want to learn them - starting from basic principles and working towards more complex concepts. <strong>It will take me some time to complete this restructuring, so please be patient.</strong><br /><br />I've also decided to update the individual entries in the blog and remove references that are specific to the class when I took it in 2008. Those comments are no longer relevant.<br /><br />I hope you enjoy this blog and benefit from it. 
If you have any questions, please feel free to email me at eliezerappleton at gmail dot com.<br /><br />Lecture 1<br /><a href="http://gsb420.blogspot.com/2008/01/basic-definitions-ch-1.html">Basic Definitions</a><br /><a href="http://gsb420.blogspot.com/2008/01/presentational-statistics-ch-2.html">Presentational Statistics</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-1-ch-3-descriptive-statistics.html">Descriptive Statistics</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-1-pop-quiz-1.html">Quiz #1</a><br /><a href="http://gsb420.blogspot.com/2008/01/post-lecture-1-notes-and-research.html">Lecture 1 - Additional Notes and Research</a><br /><br />Lecture 2<br /><a href="http://gsb420.blogspot.com/2008/01/lecture-2-ch-3.html">Using Standard Deviation</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-2-ch-3-shape-of-distribution.html">Shape of the Distribution</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-2-ch-3-correlating-two-sets-of.html">Correlating Two Sets of Data</a><br /><a href="http://gsb420.blogspot.com/2008/01/more-on-covariance-and-correlation.html">More on Covariance and Correlation Coefficient</a><br /><a href="http://gsb420.blogspot.com/2008/01/sample-vs-population.html">Sample vs. 
Population</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-2-ch-4-probability.html">Basic Probability</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-2-ch-4-conditional-probability.html">Conditional Probability</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-2-ch-4-independency.html">Independency</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-2-pop-quiz-2.html">Pop quiz #2</a><br /><a href="http://gsb420.blogspot.com/2008/01/post-lecture-2-research-and-notes.html">Post-lecture 2 notes and research</a><br /><br />Lecture 3<br /><a href="http://gsb420.blogspot.com/2008/01/lecture-3-bayes-theorem.html">Bayes's Theorem</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-3-counting-rules.html">Counting Rules</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-3-ch-5-discrete-random.html">Discrete Random Variables</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-3-ch-5-binomial-distribution.html">Binomial Distribution</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-3-ch-5-poisson-distribution.html">Poisson Distribution</a><br /><br />Lecture 4<br /><a href="http://gsb420.blogspot.com/2008/01/lecture-4-ch-6-continuous-random.html">Continuous Random Variables</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-4-ch-6-normal-distribution.html">The Normal Distribution</a><br /><a href="http://gsb420.blogspot.com/2008/01/lecture-4-ch-6-standard-normal.html">The Standard Normal Distribution</a><br /><br />Lecture 5<br /><a href="http://gsb420.blogspot.com/2008/02/lecture-5-ch-6b-7-and-8a-central-limit.html">The Standard Normal Distribution (continued)</a><br /><a href="http://gsb420.blogspot.com/2008/02/lecture-5-ch-6b-checking-for-normality.html">Checking for Normality</a><br /><a href="http://gsb420.blogspot.com/2008/02/lecture-5-ch-7-sampling-distributions.html">Sampling Distributions</a><br /><a 
href="http://gsb420.blogspot.com/2008/02/lecture-5-ch-8-confidence-interval.html">Confidence Interval Estimation</a><br /><br />Lecture 6<br /><a href="http://gsb420.blogspot.com/2008/02/lecture-6-ch-8b-confidence-interval-for.html">Confidence Interval for the Mean with Known Std Dev</a><br /><a href="http://gsb420.blogspot.com/2008/02/lecture-6-ch-8b-confidence-interval-for_15.html">Confidence Interval for the Mean with Unknown Std Dev</a><br /><a href="http://gsb420.blogspot.com/2008/02/lecture-6-ch-8b-confidence-interval-for_17.html">Confidence Interval for the Mean - Examples and Minitab</a><br /><a href="http://gsb420.blogspot.com/2008/02/lecture-6-ch-8c-determining-sample-size.html">Determining Sample Size</a><br /><a href="http://gsb420.blogspot.com/2008/02/lecture-6-ch-9-hypothesis-testing.html">Hypothesis Testing</a><br /><a href="http://gsb420.blogspot.com/2008/02/one-tail-hypothesis-testing.html">One-Tail Hypothesis Testing</a><br /><br />Lecture 7 - Linear Regression<br /><a href="http://gsb420.blogspot.com/2008/02/lecture-7-simple-linear-regression.html">Simple Linear Regression</a><br /><a href="http://gsb420.blogspot.com/2008/02/least-squares-method.html">The Least Squares Method</a><br /><a href="http://gsb420.blogspot.com/2008/02/lecture-7-assumptions-in-method-of.html">Assumptions in the Method of Least Squares</a><br /><a href="http://gsb420.blogspot.com/2008/02/lecture-7-coefficient-of-determination.html">Coefficient of Determination</a><br /><br />Lecture 8 - Residual Analysis<br /><a href="http://gsb420.blogspot.com/2008/02/lecture-8-residual-analysis-definition.html">Definition of Residual Analysis</a><br /><a href="http://gsb420.blogspot.com/2008/02/lecture-8-residual-analysis-checking.html">Checking Linearity</a><br /><a href="http://gsb420.blogspot.com/2008/03/lecture-8-residual-analysis-checking.html">Checking the Normality Assumption</a><br /><a 
href="http://gsb420.blogspot.com/2008/03/lecture-8-residual-analysis-checking_02.html">Checking the Equal Variance Assumption</a><br /><a href="http://gsb420.blogspot.com/2008/03/lecture-8-residual-analysis-checking_04.html">Checking Independence of Errors</a><br /><a href="http://gsb420.blogspot.com/2008/03/lecture-8-inferences-about-regression.html">Inferences About the Regression Slope - Part 1</a><br /><a href="http://gsb420.blogspot.com/2008/03/lecture-8-inferences-about-regression_04.html">Inferences About the Regression Slope - Part 2</a><br /><a href="http://gsb420.blogspot.com/2008/03/lecture-8-confidence-interval-for.html">Confidence Interval for Ŷ</a><br /><br /><img src="http://feeds.feedburner.com/~r/Gsb420/~4/4Oc8XAPm4GI" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-70892694689745483062008-03-28T16:36:00.002-05:002008-03-28T16:38:31.564-05:00ECO 509 - Spring Quarter 2008Next quarter (Spring 2008) I'll be taking ECO 509 - Business Conditions Analysis (aka Macroeconomics) with Professor Jaejoon Woo (Wed night section).<br /><br />You can find the blog for ECO 509 at <a href="http://eco509.blogspot.com">http://eco509.blogspot.com</a>.<img src="http://feeds.feedburner.com/~r/Gsb420/~4/DwYHnPnDGrg" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-6134269709347413132008-03-14T08:50:00.002-05:002008-03-14T12:20:07.565-05:00Final Exam RecapWell, it's finally over! Here's a recap of my random thoughts on the final exam:<br /><ul><li>In general, the final was harder than I expected. I'm pretty sure I did well, but it was definitely harder than the midterm and harder than I thought it would be.</li><li>He hit us with 10 straight questions from Chapter 12 right out of the block! I expected to ease into it. 
The previous quarter's final was pretty linear - starting at chapter 7, then chapter 8, etc., and not hitting chapter 12 until the last questions. I knew chapter 12 would be a big chunk of the exam, but I didn't expect him to lead off with it.</li><li>I think we were all pretty stumped by that one question (was it 9 or 10?) that had us calculate the intercept, b<sub>0</sub>. I kept coming up with 100 for an answer, but it wasn't one of the choices. I saw Azzam go up and ask about something and figured it might be that. So I went up and asked also. I think we all breathed a sigh of relief when he made the change to the last two choices. BTW, I think he could have just changed Σ Y to be 100 and that would have made the intercept 40, which was one of the choices.<br /></li><li>Probability of Type I errors? Sheesh! I didn't see that coming. The answer is that it's alpha, α.</li><li>I was surprised at the "regular" question that had us work out the regression from the raw data. I'm almost positive I made some arithmetic mistakes in calculating the Σ(X<sub>i</sub>-Xbar)(Y<sub>i</sub>-Ybar) or Σ(X<sub>i</sub>-Xbar)<sup>2</sup> or one of the other calculations.</li><li>My answer for the "West Nile" question was that we do <span style="font-style: italic;">not</span> reject the null hypothesis, H<sub>0</sub>, that the average # of cases is 3; i.e., there's no evidence that it is different from 3.</li><li>On the last question, part A, you had to assume (or somehow know through ESP or divine vision) that the level of confidence to use is 95%. You can calculate the t-score easily enough (I think I got something like 2.414), but to draw any conclusions, you need to look up the critical value t<sub>α/2, n-1</sub>, which requires the level of confidence.</li><li>I <span style="font-style: italic;">knew</span> there would be a Durbin-Watson question on the exam! 
My answer for that one was that there was <span style="font-style: italic;">no</span> evidence of autocorrelation since the DW stat was greater than d<sub>U</sub> and less than 4-d<sub>U</sub>. I'm not sure I was using the right d<sub>U</sub> because I wasn't sure if I should use α or α/2 on the DW table. I used α (which I think was 0.05).</li><li>As predicted, there were a few questions with Minitab output. No big surprises there.</li><li>In at least 2 of the questions (one multiple choice and one "regular"), he gave us the variance rather than the standard deviation. Tricky! I almost fell for that one.</li><li>One question asked us to determine sample size, given a confidence level, margin of error and standard deviation (or maybe variance). Using the formula, you calculate n=74.3 (something like that - not a round number). You had to know to round up, not truncate the decimal.</li><li>Higher confidence levels need wider intervals. You had to figure he would ask about that.<br /></li><li>If p is low, H<sub>0</sub> must go! You could use that to answer one of the multiple choice questions on the p-value approach in hypothesis testing.</li></ul>All in all, not a terrible test. Just harder than last quarter's final, IMHO. I'm anxious to see how I did. I think he said he may have them graded by Monday. The multiple choice is easy to grade. I think he gives the "regular" questions to a Teaching Assistant. HW5 grades have not been posted to Blackboard yet.Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com2tag:blogger.com,1999:blog-1542321886756338348.post-58940571798143458572008-03-13T13:41:00.003-05:002008-03-13T15:34:59.788-05:00Final Exam Study Guide - Last Minute NotesJust a couple of last minute thoughts:<br /><ul><li>There were no example problems that used the Durbin-Watson statistic. That doesn't mean it won't be on the exam! 
The D-W table is part of the formula sheet, so I'm expecting a question on it.</li><li>Remember that if DW is less than d<sub>L</sub>, there's (positive) autocorrelation. If it's between d<sub>L</sub> and d<sub>U</sub>, it's inconclusive. If it's between d<sub>U</sub> and 4-d<sub>U</sub>, there's no autocorrelation. Values above 2 mirror these ranges (4-d<sub>U</sub> and 4-d<sub>L</sub> are the cutoffs for negative autocorrelation), but I doubt we'll be asked about the range from 2-4.</li><li>Review how to read those Minitab outputs! There's bound to be at least one on the exam. Remember that in Minitab output, SS stands for Sum of Squares. S stands for S<sub>YX</sub>, the standard error of the estimate. The Coefficient of the Intercept is b<sub>0</sub>. The Coefficient of the other (independent) variable is b<sub>1</sub>. SE stands for standard error.</li><li>There weren't any practice problems on the confidence interval for mean/individual Y. I would expect one of those since the formulas are on the sheet. He'll probably give us h<sub>i</sub> and S<sub>YX</sub>. Remember to use n-2 when looking up the value for t in this case.<br /></li><li>Remember that most of the answers can be derived from the data in the question and the formula sheet. You're not really expected to memorize very much. If you can't figure it out, look at the formula sheet.</li><li>Remember to bring a copy of the formula sheet, a calculator and a #2 pencil. Don't laugh! I forgot a pencil for the midterm and ran out to Walgreens a half hour before the exam.<br /></li></ul>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-72340653020064320232008-03-12T17:02:00.010-05:002008-03-13T09:46:59.448-05:00Final Exam Study Guide - Practice Questions - Part 2In this post, I'll go over the answers to the "regular" questions from last quarter's final. 
I'll also note which chapter the question is from.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 1 <span style="color: rgb(0, 0, 153);">(Chapter 12)</span>: </span>You would like to estimate the income of a person based on his age. The following data shows the yearly income (in $1,000) and age of a sample of seven individuals.<pre>Income (in $1,000) Age<br />20 18<br />24 20<br />24 23<br />25 34<br />26 24<br />27 27<br />34 27<br /></pre>a. Develop the least squares regression equation.<br />b. Estimate the yearly income of a 30-year-old individual.</blockquote><span style="font-weight: bold;">Answer:</span><br />a. In order to calculate b<sub>0</sub> and b<sub>1</sub>, we need to first calculate the mean of X (age) and Y (income). For xbar, I calculated 24.71 and for ybar, I got 25.71. To calculate b<sub>1</sub>, we need to calculate x<sub>i</sub>-xbar and y<sub>i</sub>-ybar for each i:<br /><pre>Income Age x<sub>i</sub>-xbar y<sub>i</sub>-ybar (x<sub>i</sub>-xbar)(y<sub>i</sub>-ybar) (x<sub>i</sub>-xbar)<sup>2</sup><br /><br />20 18 -6.71 -5.71 38.31 45.02<br />24 20 -4.71 -1.71 8.05 22.18<br />24 23 -1.71 -1.71 2.92 2.92<br />25 34 9.29 -0.71 -6.60 86.30<br />26 24 -0.71 0.29 -0.21 0.50<br />27 27 2.29 1.29 2.95 5.24<br />34 27 2.29 8.29 18.98 5.24<br /></pre>The sum of the (x<sub>i</sub>-xbar)(y<sub>i</sub>-ybar) is 64.4. The sum of the (x<sub>i</sub>-xbar)<sup>2</sup> is 167.4. Therefore, b<sub>1</sub> is 64.4/167.4 = 0.38.<br />We can also calculate b<sub>0</sub> = ybar - b<sub>1</sub>xbar = 25.71 - (0.38)(24.71) = 16.2.<br />Therefore, the regression equation is y = 16.2 + 0.38x.<br /><br />b. 
Use the equation to estimate y for x=30:<br />y = 16.2 + 0.38(30) = 27.6, which is $27,600 annual income.<br /><blockquote><span style="font-weight: bold;">Question 2 <span style="color: rgb(0, 0, 153);">(Chapter 12)</span>:</span> Below you are given a partial computer output based on a sample of 8 observations, relating an independent variable (x) and a dependent variable (y).<pre> <span style="font-weight: bold;">Coefficient Standard Error</span><br />Intercept 13.251 10.77<br />X 0.803 0.385<br /><br /><span style="font-weight: bold;">Analysis of Variance</span><br /><span style="font-weight: bold;">SOURCE SS</span><br />Regression<br />Error (Residual) 41.674<br />Total 71.875</pre>a. Develop the estimated regression line.<br />b. At α = 0.05, test for the significance of the slope.<br />c. Determine the coefficient of determination (R<sup>2</sup>).</blockquote><span style="font-weight: bold;">Answer:</span><br />a. This one's a lot easier than #1. No calculations necessary, just the ability to pull b<sub>0</sub> and b<sub>1</sub> out of the computer output. They're the coefficients of the intercept and X. So the regression equation becomes:<br />y = 13.251 + 0.803x<br /><br />b. The t score for the slope is t = b<sub>1</sub>/s<sub>b<sub>1</sub></sub>.<br />From part a, we know that b<sub>1</sub> = 0.803.<br />s<sub>b<sub>1</sub></sub> is given in the computer output as the standard error of x = 0.385.<br />Therefore, t = 0.803/0.385 = 2.086.<br />Looking at the t distribution table for n-2=6 and α/2=0.025, we find a critical t value of 2.447. Since the t score of 2.086 is less than 2.447, we do not reject the null hypothesis that there is <span style="font-style: italic;">no</span> linear relationship.<br /><br />c. r<sup>2</sup> = SSR/SST. But SSR was conveniently removed from the computer output. 
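<br />The arithmetic for parts b and c can be double-checked with a short Python sketch. This is only a check under stated assumptions: the variable names are mine, the critical value is the table value quoted above rather than something computed, and the missing SSR is recovered from the identity SST = SSR + SSE.

```python
# Values copied from the partial computer output in Question 2.
b1, se_b1 = 0.803, 0.385      # slope and its standard error
sse, sst = 41.674, 71.875     # error and total sums of squares
t_critical = 2.447            # table value quoted above: 6 df, alpha/2 = 0.025

# Part b: t statistic for the slope and the two-tailed decision.
t_stat = b1 / se_b1
reject_h0 = abs(t_stat) > t_critical

# Part c: recover the missing SSR, then the coefficient of determination.
ssr = sst - sse
r_squared = ssr / sst

print(round(t_stat, 3), reject_h0, round(ssr, 3), round(r_squared, 2))
```

The t statistic stays below the critical value, so the sketch agrees with not rejecting H<sub>0</sub> in part b.<br />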
We need to calculate it from SSR = SST-SSE = 71.875-41.674 = 30.201.<br />Therefore, r<sup>2</sup> = 30.201/71.875 = 0.42.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 3 (<span style="color: rgb(0, 0, 153);">Chapter 9</span>):</span> A sample of 81 account balances of a credit company showed an average balance of $1,200 with a standard deviation of $126.<br />a. Formulate the hypotheses that can be used to determine whether the mean of all account balances is significantly different from $1,150.<br />b. Let α = .05. Using the critical value approach what is your conclusion?</blockquote><span style="font-weight: bold;">Answer:</span><br />a. Since we want to know if the mean is "significantly different" from $1,150, the null hypothesis is that it <span style="font-style: italic;">is</span> $1,150.<br />H<sub>0</sub>: μ = 1150<br />H<sub>1</sub>: μ ≠ 1150<br /><br />b. Since we don't have the population standard deviation, use the t test statistic.<br />t = (xbar-μ<sub>0</sub>)/(s/√n)<br />= (1200-1150)/(126/√81)<br />= 50/14<br />= 3.57<br />The critical value for t for 80 degrees of freedom and α/2=0.025 is 1.990.<br />Since the t-value=3.57 is greater than the critical value of 1.990, we reject H<sub>0</sub> and conclude that the mean is significantly different from $1,150.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 4 (<span style="color: rgb(0, 0, 153);">Chapter 8</span>):</span> A statistician selected a sample of 16 accounts receivable and determined the mean of the sample to be $5,000 with a sample standard deviation of $400. He reported that the sample information indicated the mean of the population ranges from $4,739.80 to $5,260.20. He neglected to report what confidence level (1-α) he had used. 
Based on the above information, determine the confidence level that was used.</blockquote><span style="font-weight: bold;">Answer:</span> The statistician is reporting a confidence interval of 5000 ± 260.20. He only mentions the sample standard deviation (not the population std dev), so he must be using the t-distribution and the formula: xbar ± t<sub>n-1, α/2</sub>(s/√n).<br /><br />So we have:<br />260.2 = t(s/√n)<br />260.2 = t (400/√16)<br />260.2 = 100t<br />t = 2.602<br /><br />We look to the t distribution table and find that t<sub>15, α/2</sub> = 2.602 is true for α/2 = 0.01. So α = 0.02 and the confidence level is 1-0.02 = 0.98 = 98%.<br /><blockquote><span style="font-weight: bold;">Question 5 (<span style="color: rgb(0, 0, 153);">Chapter 12</span>):</span> The director of graduate studies at a college of business would like to predict the grade point index (GPI) of students in an MBA program based on their GMAT scores. A sample of 20 students is selected. The result of the regression is summarized in the following Minitab output.<pre><span style="font-weight: bold;">Regression Analysis: GPI versus GMAT</span><br /><br />The regression equation is<br />GPI = 0.300 + 0.00487 GMAT<br /><br />Predictor Coef SE Coef T<br />Constant 0.3003 0.3616 0.83<br />GMAT 0.0048702 [ N ] [ M ]<br /><br />S = 0.155870 R-Sq = 79.8%<br /><br />Analysis of Variance<br /><br />Source DF SS MS F P<br />Regression 1 1.7257 1.7257 71.03 0.000<br />Residual Error 18 0.4373 0.0243<br />Total 19 2.1631</pre>a) Given that Σ(X<sub>i</sub>-xbar)<sup>2</sup> = 72757.2 , where X = GMAT, compute N.<br />b) Compute M and interpret the result. In particular do we reject the underlying hypothesis (which hypothesis) or not?</blockquote><span style="font-weight: bold;">Answer:</span><br /><span style="font-weight: bold;">a.</span> N is what we usually call the standard error of the slope, s<sub>b<sub>1</sub></sub>. 
(This is the hardest part of the problem - figuring out what's missing in the Minitab output.) From the formula sheet, we know:<br />s<sub>b<sub>1</sub></sub> = S<sub>YX</sub>/√SSX<br /><br />We're given SSX, but we need to calculate S<sub>YX</sub> from the formula:<br />S<sub>YX</sub> = √(SSE/(n-2)).<br /><br />We have SSE from the output: SSE = 0.4373. So,<br />S<sub>YX</sub> = √(0.4373/18) = 0.156, which matches the S = 0.155870 shown in the Minitab output.<br /><br />Therefore,<br />s<sub>b<sub>1</sub></sub> = 0.156/√72757.2 = 0.156/269.7 = 0.00058<br /><br /><span style="font-weight: bold;">b.</span> M is the t-score for the slope, which is given by:<br />t = b<sub>1</sub>/s<sub>b<sub>1</sub></sub><br />= 0.0048702/0.00058<br />= 8.4<br /><br />The question doesn't specify α. Using α = 0.01, the critical value for t for 18 degrees of freedom and α/2=0.005 is 2.878. Therefore, since our t-score is greater than the critical t-value, we would reject the null hypothesis, H<sub>0</sub>: β<sub>1</sub>=0 (i.e., that there is no linear relationship between GMAT and GPI).Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-28410692028589840792008-03-10T12:44:00.021-05:002008-03-13T12:37:44.271-05:00Final Exam Study Guide - Practice Questions<span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 1:</span> A population has a standard deviation of 16. If a sample of size 64 is selected from this population, what is the probability that the sample mean will be within ±2 of the population mean?<br />a. 0.6826<br />b. 0.3413<br />c. -0.6826<br />d. Since the mean is not given, there is no answer to this question.</blockquote><span style="font-weight: bold;">Answer:</span><br />We need to calculate the z-score for the ±2 interval. 
In order to do that, we need the standard error of the mean, σ/√n = 16/√64 = 2.<br />So when we're asked for the probability that the sample mean is within ±2 of the population mean, it's asking for the probability of the mean being within 1 standard error. Even without looking it up in the table, we know that the answer must be A - both from our experience that 68% of the data fall within 1 std dev, and because the other answers are unreasonable.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 2:</span> The fact that the sampling distribution of sample means can be approximated by a normal probability distribution whenever the sample size is large is based on the<br />a. central limit theorem<br />b. fact that we have tables of areas for the normal distribution<br />c. assumption that the population has a normal distribution<br />d. None of these alternatives is correct.</blockquote><span style="font-weight: bold;">Answer:</span> There's not much to say here. The statement is essentially the definition of the Central Limit Theorem (see page 213), so the answer is A. As a rule of thumb, a sample size of about 30 or more is large enough for this to hold for most population distributions.<br /><blockquote><span style="font-weight: bold;">Question 3:</span> A population has a mean of 53 and a standard deviation of 21. A sample of 49 observations will be taken. The probability that the sample mean will be greater than 57.95 is<br />a. 0<br />b. .0495<br />c. .4505<br />d. .9505</blockquote><span style="font-weight: bold;">Answer:</span> Find the z-score of this mean: (57.95-53)/(21/√49) = 4.95/3 = 1.65. So the question becomes: What's the probability of the sample mean being more than 1.65 standard errors above the population mean? You know it can't be much. It's greater than 0. Answer B is the only logical one. 
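<br />If you don't have the table handy, the same number can be checked with a few lines of Python. This is a minimal sketch, not part of the course material: the helper name upper_tail is mine, and it uses the standard library's math.erfc to get the area to the right of z under the standard normal curve.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Numbers from Question 3.
mu, sigma, n, xbar = 53, 21, 49, 57.95

standard_error = sigma / math.sqrt(n)   # 21/7 = 3
z = (xbar - mu) / standard_error        # 4.95/3 = 1.65
p = upper_tail(z)                       # area to the right of z

print(round(z, 2), round(p, 4))
```

The rounded tail area agrees with answer B.<br />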
Of course, when we go to the cumulative normal distribution table, we find that the cumulative area up to z = 1.65 is 0.9505, so the area to the right of 1.65 is 0.0495.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 4:</span> Suppose a sample of n = 50 items is drawn from a population of manufactured products and the weight, X, of each item is recorded. Prior experience has shown that the weight has a probability distribution with mu = 6 ounces and sigma = 2.5 ounces. Which of the following is true about the sampling distribution of the sample mean if a sample of size <span style="font-weight: bold;">50</span> is selected?<br />a) The mean of the sampling distribution is 6 ounces.<br />b) The standard deviation of the sampling distribution is 2.5 ounces.<br />c) The shape of the sample distribution is approximately normal.<br />d) All of the above are correct.</blockquote><span style="font-weight: bold;">Answer:</span><br />A is true. Although the mean of any single sample is not necessarily equal to the population mean, the mean of the sampling distribution (of <span style="font-style: italic;">all samples</span> of size 50) equals the population mean, 6 ounces.<br />B is not true. The standard deviation of the sampling distribution (the standard error of the mean) is smaller than the population standard deviation by a factor of 1/√n: σ/√n = 2.5/√50 ≈ 0.35 ounces.<br />C is not true. The central limit theorem tells us that when the sample size is ≥30, the distribution <span style="font-style: italic;">of the sample mean</span> is approximately normal. 
However, the shape of the sample distribution itself is not necessarily normal.<br />D is clearly not true since B and C are not true.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 5:</span> The owner of a fish market has an assistant who has determined that the weights of catfish are normally distributed, with mean of 3.2 pounds and standard deviation of 0.8 pound. If a sample of 25 fish yields a mean of 3.6 pounds, what is the Z-score for this observation?<br />a) 18.750<br />b) 2.500<br />c) 1.875<br />d) 0.750</blockquote><span style="font-weight: bold;">Answer:</span><br />When evaluating the sample mean,<br />z = (xbar-μ)/(σ/√n) <span style="font-weight: bold;">Note:</span> This formula is <span style="font-style: italic;">not</span> on the sheet.<br />= (3.6-3.2)/(0.8/√25)<br />= 0.4/0.16<br />= 2.5<br />So, answer B is correct.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 6:</span> A 95% confidence interval for a population mean is determined to be 100 to 120. If the confidence coefficient is reduced to 0.90, the interval for mu<br />a. becomes narrower<br />b. becomes wider<br />c. does not change<br />d. becomes 0.1</blockquote><span style="font-weight: bold;">Answer: </span>No calculations are necessary here. It's completely conceptual. The general rule is: A higher level of confidence requires a wider confidence interval. Therefore, if we reduce the level of confidence to 90%, the confidence interval can be narrower. Answer A is the correct answer.<br /><br /><span style="color: rgb(0, 0, 153); font-weight: bold;">Exhibit 8-3</span><br /><span style="color: rgb(0, 0, 153);">The manager of a grocery store has taken a random sample of 100 customers. The average length of time it took these 100 customers to check out was 3.0 minutes. 
It is known that the standard deviation of the population of checkout times is 1 minute.</span><br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 7:</span> Refer to Exhibit 8-3. The standard error of the mean equals<br />a. 0.001<br />b. 0.010<br />c. 0.100<br />d. 1.000</blockquote><span style="font-weight: bold;">Answer:</span> The standard error of the mean is:<br />σ/√n = 1/√100 = 1/10 = 0.1<br />The correct answer is C.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 8: </span>Refer to Exhibit 8-3. With a .95 probability, the sample mean will provide a margin of error of<br />a. 1.96<br />b. 0.10<br />c. 0.196<br />d. 1.64</blockquote><span style="font-weight: bold;">Answer:</span> The margin of error is the plus/minus term in the confidence interval. In this case, since we know the population standard deviation, the margin of error term is:<br />z<sub>α/2</sub>(σ/√n)<br />From the z-table, we find that z<sub>0.025</sub> = 1.96<br />Therefore,<br />margin of error, E = 1.96(1/√100) = 0.196<br />Answer C is correct.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 12:</span> When the following hypotheses are being tested at a level of significance of α<br /><div style="text-align: center;">H<sub>0</sub>: μ ≥ 100 H<sub>a</sub>: μ < 100<br /></div>the null hypothesis will be rejected if the p-value is<br />a. < α<br />b. > α<br />c. > α/2<br />d. < α/2 </blockquote><span style="font-weight: bold;">Answer:</span> First, we notice that this is a one-tailed hypothesis test. The rejection region is entirely to one side of the mean.<br />Our general rule is <span style="font-weight: bold;">If p is low, H<sub>0</sub> must go.</span> So, if p is less than α, we reject the null hypothesis. 
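<br />The rule is simple enough to write as a one-line function. This is just an illustrative Python sketch: the function name reject_null and the example p-values are made up for the illustration and don't come from the exam.

```python
def reject_null(p_value: float, alpha: float) -> bool:
    """Return True when the p-value approach says to reject H0 (p < alpha)."""
    return p_value < alpha

# Hypothetical p-values, both tested at alpha = 0.05.
print(reject_null(0.03, 0.05))  # p is low, so H0 must go
print(reject_null(0.20, 0.05))  # p is not low, so H0 is not rejected
```

For this one-tailed test the p-value is compared against α itself, not α/2.<br />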
Answer A is correct.<br /><blockquote><span style="font-weight: bold;">Question 13:</span> In order to test the following hypotheses at an α level of significance<br /><div style="text-align: center;">H<sub>0</sub>: μ ≤ 100 H<sub>a</sub>: μ > 100<br /></div>the null hypothesis will be rejected if the test statistic Z is<br />a. > Z<sub>α</sub><br />b. < Z<sub>α</sub><br />c. < -Z<sub>α</sub><br />d. > Z<sub>α</sub>/2</blockquote><span style="font-weight: bold;">Answer: </span>We've got a one-tailed hypothesis again. This time, the rejection region is in the right-hand tail. Therefore, we reject H<sub>0</sub> if the test statistic is more extreme (i.e. further to the right) than the Z<sub>α</sub>. So answer A is correct.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 14:</span> Your investment executive claims that the average yearly rate of return on the stocks she recommends is more than 10.0%. She takes a sample to prove her claim. The correct set of hypotheses is<br />a. H<sub>0</sub>: μ = 10.0% H<sub>a</sub>: μ ≠ 10.0%<br />b. H<sub>0</sub>: μ ≤ 10.0% H<sub>a</sub>: μ > 10.0%<br />c. H<sub>0</sub>: μ ≥ 10.0% H<sub>a</sub>: μ < 10.0%</blockquote><span style="font-weight: bold;">Answer:</span> I don't really like this question because it sounds like she's making a claim based on a status quo of the return rate being > 10%. Since the null hypothesis is about the status quo, I'm tempted to pick answer C. Unfortunately, that's not the right way to look at it in this case.<br /><br />Rather, since her claim is that the return is <span style="font-style: italic;">greater than</span> 10%, which does not contain an equal sign, that must be the alternative hypothesis, H<sub>a</sub>. Therefore, the null hypothesis, H<sub>0</sub>, is μ ≤ 10%. 
Answer B is correct.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 15:</span> A soft drink filling machine, when in perfect adjustment, fills the bottles with 12 ounces of soft drink. Any over filling or under filling results in the shutdown and readjustment of the machine. To determine whether or not the machine is properly adjusted, the correct set of hypotheses is<br />a. H<sub>0</sub>: μ > 12 H<sub>a</sub>: μ ≤ 12<br />b. H<sub>0</sub>: μ ≤ 12 H<sub>a</sub>: μ > 12<br />c. H<sub>0</sub>: μ = 12 H<sub>a</sub>: μ ≠ 12</blockquote><span style="font-weight: bold;">Answer:</span> This one's a gimme. The null hypothesis H<sub>0</sub> is that the machine is continuing to work properly and μ = 12. The alternative hypothesis, H<sub>a</sub> is that it is filling with some other mean volume and μ ≠ 12. Correct answer is C.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 16:</span> A two-tailed test is performed at 95% confidence. The p-value is determined to be 0.11.<br />The null hypothesis<br />a. must be rejected<br />b. should not be rejected<br />c. could be rejected, depending on the sample size<br />d. has been designed incorrectly</blockquote><span style="font-weight: bold;">Answer:</span> Since the level of significance is 5%, the combined area of the two-tailed rejection region is 0.05. I.e., 0.025 in either tail. The p-value is 0.11. We remember our mantra: <span style="font-weight: bold;">If p is low, H<sub>0</sub> must go!</span> But p is <span style="font-style: italic;">not</span> lower than 0.05. Therefore, we do not reject H<sub>0</sub> and answer B is correct.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 17:</span> For a one-tailed hypothesis test (upper tail) the p-value is computed to be 0.034. If the test is being conducted at 95% confidence, the null hypothesis<br />a. 
could be rejected or not rejected depending on the sample size<br />b. could be rejected or not rejected depending on the value of the mean of the sample<br />c. is not rejected<br />d. is rejected</blockquote><span style="font-weight: bold;">Answer:</span> Level of significance is 5% = 0.05. p is 0.034. Repeat after me: If p is low, H<sub>0</sub> must go! In this case, yes, p is lower than the level of significance and therefore H<sub>0</sub> is rejected. Answer D is correct.<br /><br /><span style="font-weight: bold;">Note:</span> If this had been a two-tailed test, then the 0.05 rejection region would have been split between the two tails, each having 0.025. In that case, it's not clear whether p = 0.034 is lower than 0.025 unless we know whether p was calculated on one side (as we did in class) or on both sides (as is done in the textbook). I asked Prof. Selcuk about this in an email and he replied that he would avoid such ambiguous cases on the final exam.<br /><br /><span style="color: rgb(0, 0, 153); font-weight: bold;">Exhibit 9-1</span><br /><span style="color: rgb(0, 0, 153);">n = 36<br />xbar = 24.6<br />S = 12<br />H<sub>0</sub>: μ ≤ 20<br />H<sub>a</sub>: μ > 20</span><br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 18:</span> Refer to Exhibit 9-1. The test statistic (t-score of xbar) is<br />a. 2.3<br />b. 0.38<br />c. -2.3<br />d. -0.38</blockquote><span style="font-weight: bold;">Answer: </span>The formula (on the formula sheet) for the t test statistic is:<br />t = (xbar - μ<sub>0</sub>)/(s/√n)<br />= (24.6-20)/(12/√36)<br />= 4.6/2 = 2.3<br />A is the correct answer.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 19:</span> Refer to Exhibit 9-1. If the test is done at 95% confidence, the null hypothesis should<br />a. not be rejected<br />b. be rejected<br />c. Not enough information is given to answer this question.<br />d. 
None of these alternatives is correct.</blockquote><span style="font-weight: bold;">Answer:</span> This question is tricky because we don't know if it's a one-tail or two-tail test. First, assume it's a one-tail test, i.e. the entire rejection region is in one tail. Refer to the t distribution table and look up the t value for 35 degrees of freedom and a 0.05 area in the tail. We find that t value to be approximately 1.69. Our t test statistic is 2.3 which is greater than 1.69, indicating that we should reject the null hypothesis, H<sub>0</sub>.<br /><br />Just to be sure, let's assume that it's a two-tail test, so the rejection region is only 0.025 on each side. Referring to the t distribution table again, we find the t value for 35 degrees of freedom and a 0.025 area is approximately 2.03. Again, our t test statistic is more extreme than the critical t value. Therefore, reject the null hypothesis, H<sub>0</sub>.<br /><br />Answer B is correct.<br /><blockquote><span style="font-weight: bold;">Question 20:</span> In regression analysis if the dependent variable is measured in dollars, the independent variable<br />a. must also be in dollars<br />b. must be in some units of currency<br />c. can be any units<br />d. can not be in dollars</blockquote><span style="font-weight: bold;">Answer:</span> This is entirely conceptual. The units of the dependent and independent variables are entirely independent of each other. Think of the site.mtw example that we were using extensively in class. The dependent variable was store sales (measured in dollars) and the independent variable was the size of the store (measured in square feet). The correct answer is C - the independent variable can be in any units.<br /><blockquote><span style="font-weight: bold;">Question 21:</span> In a regression analysis, if SST=4500 and SSE=1575, then the coefficient of determination (R<sup>2</sup>) is<br />a. 0.35<br />b. 0.65<br />c. 2.85<br />d. 
0.45</blockquote><span style="font-weight: bold;">Answer: </span>Since SST=SSE+SSR, SSR=4500-1575=2925. And R<sup>2</sup>=SSR/SST=2925/4500=0.65. Therefore, answer B is correct.<br /><blockquote><span style="font-weight: bold;">Question 22: </span>Regression analysis was applied between sales (Y in $1,000) and advertising (X in $100), and the following estimated regression equation was obtained.<br />Y-hat = 80 + 6.2 X<br />Based on the above estimated regression line, if advertising is $10,000, then the point estimate for sales (in dollars) is<br />a. $62,080<br />b. $142,000<br />c. $700<br />d. $700,000</blockquote><span style="font-weight: bold;">Answer:</span> When a question is this easy, you know there's some sort of trick. <span style="font-weight: bold; color: rgb(0, 0, 153);">Watch your units!!</span> Since X is in <span style="font-style: italic;">hundreds </span>of dollars, plug in 100 in the regression equation. Y = 80 + 6.2(100) = 700. Y is in <span style="font-style: italic;">thousands </span>of dollars. Therefore, the point estimate for sales in dollars is $700,000 - answer D.<br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 23:</span> If the coefficient of correlation is a positive value, then<br />a. the intercept must also be positive<br />b. the coefficient of determination (R2) can be either negative or positive, depending on the value of the slope<br />c. the regression equation could have either a positive or a negative slope<br />d. the slope of the line must be positive</blockquote><span style="font-weight: bold;">Answer: </span>We learned about the coefficient of correlation way back in Chapter 3. It's a measure of the strength of the linear relationship between x and y. Its values range from -1 to 1. 
Values close to -1 or 1 indicate a strong linear relationship, either negative or positive.<br /><br />Answer A is incorrect because the coefficient of correlation tells us nothing about the intercept.<br />Answer B is incorrect because the coefficient of determination (r<sup>2</sup>) can only be positive. r<sup>2</sup> = SSR/SST and both SSR and SST are positive (since they're both sums of squares), so r<sup>2</sup> must be positive.<br />Answer C is incorrect because a positive coefficient of correlation indicates a positive relationship which would be modeled with a positive slope.<br />Answer D is correct.<br /><br /><span style="color: rgb(0, 0, 153); font-weight: bold;">Exhibit 14-10</span><br /><span style="color: rgb(0, 0, 153);">The following information regarding a dependent variable Y and an independent variable X is</span><br /><span style="color: rgb(0, 0, 153);">provided.</span><br /><span style="color: rgb(0, 0, 153);">∑ X = 16 ∑ (x-xbar)(y-ybar) = -8</span><br /><span style="color: rgb(0, 0, 153);">∑ Y = 28 ∑ (x-xbar)<sup>2</sup> = 8</span><br /><span style="color: rgb(0, 0, 153);">n = 4</span><br /><span style="font-weight: bold;"></span><blockquote><span style="font-weight: bold;">Question 24:</span> Refer to Exhibit 14-10. The slope of the regression function is<br />a. -1<br />b. 1.0<br />c. 11<br />d. 0.0</blockquote><span style="font-weight: bold;">Answer:</span> On the formula sheet we have the formula for the regression slope, b<sub>1</sub>:<br />b<sub>1</sub> = ∑ (x-xbar)(y-ybar) / ∑ (x-xbar)<sup>2</sup> = -8/8 = -1.<br />So answer A is correct.<br /><blockquote><span style="font-weight: bold;">Question 25:</span> Refer to Exhibit 14-10. The intercept of the regression line is<br />a. -1<br />b. 1.0<br />c. 11<br />d. 
0.0</blockquote><span style="font-weight: bold;">Answer: </span>Again, the formula sheet gives us the computation for the intercept, b<sub>0</sub>:<br />b<sub>0</sub> = ybar - b<sub>1</sub>xbar = (28/4) - (-1)(16/4) = 7 + 4 = 11.<br />So answer C is correct.<br /><br />More answers to sample problems to come. (I'm kinda jumping around for now.)<img src="http://feeds.feedburner.com/~r/Gsb420/~4/xxI5HHVWqMM" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-31543551673060037892008-03-10T10:23:00.005-05:002008-11-13T00:06:31.782-06:00Final Exam Study Guide - Analysis of Prior Exam QuestionsLooking at last quarter's exam gives us some insight as to what to expect on our final. The most important piece is that it provides practice questions at the level we'll be expected to perform. It is extremely worthwhile to do these problems on your own and make sure you understand the answer.*<br /><br />Another interesting insight that we gain from the sample exam is the distribution of questions. Here's what I came up with for the number of questions per chapter and the number of points associated with those questions:<br /><pre><img style="margin: 0pt 0pt 10px 10px; float: right; " src="http://1.bp.blogspot.com/_QYuWE-KipG0/R9VVRdv0peI/AAAAAAAAANs/ccEBfPZMpx8/s200/ChapterPoints.JPG" alt="" id="BLOGGER_PHOTO_ID_5176137105263601122" border="0" /><br /><span style="font-family:courier new;"> Mult Short Total<br />Chapter Choice Answer Points<br />7 6 0 12<br />8 5 1 20<br />9 8 1 26<br />12 6 3 42<br /></span></pre><br />We'll probably have a few questions from Chapter 6 thrown in, but those will probably be relatively easy compared to the more advanced material. 
These numbers tell me one thing for sure: Chapter 12 is really important!<br /><br /><span style="font-size:78%;">*For the record: If you noticed that I didn't stick around for the in-class review of the sample final on Thursday, it's <span style="font-style: italic;">not</span> because I think I know all this stuff! Just the opposite. Almost all of this material is new to me and I wanted to work through all the questions <span style="font-style: italic;">on my own</span> without having heard the answer already solved by someone else.</span><img src="http://feeds.feedburner.com/~r/Gsb420/~4/ghddYDxwTdM" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-21098234984959902812008-03-10T09:05:00.011-05:002008-11-13T00:06:31.978-06:00Final Exam Study Guide - Outline<img style="margin: 0pt 0pt 10px 10px; float: right;" src="http://2.bp.blogspot.com/_QYuWE-KipG0/R9VP9tv0pdI/AAAAAAAAANk/_EpIGiVmg80/s200/Studying.JPG" alt="" id="BLOGGER_PHOTO_ID_5176131268403045842" border="0" /><span style="font-weight: bold;">Final Exam Study Guide</span><br />There's a lot of material to review for our final exam. In order to study for the final, I'm going through all the chapters that will be covered (6, 7, 8, 9 and 12) and pulling out the important points from each one. I've basically written them up as "learning objectives" for each chapter. Also since we didn't cover every section of every chapter, I've listed the sections that we <span style="font-style: italic;">did</span> cover.<br /><br />Here's my outline as it stands so far:<br /><br /><span style="font-weight: bold;">Chapter 6 - The Normal Distribution</span><br />6.1<br />Understand the concept of a continuous probability distribution and the difference between continuous and discrete probability distributions. 
<br /><br />6.2<br />Understand the normal and standard normal distributions.<br />Calculate the z score for any given X.<br />Read the standard normal distribution table and answer questions of the form:<br />P(X<a)<br />P(X>a)<br />P(a<X<b)<br /><br />6.3<br />Use the normal probability plot to evaluate normality of data.<br /><br /><span style="font-weight: bold;">Chapter 7 - Sampling Distributions</span><br /><br />7.1<br />Understand the concept of a sampling distribution.<br /><br />7.2<br />Calculate z-scores for xbar using the standard error of the mean: σ/√n<br />Understand the Central Limit Theorem.<br /><br /><span style="font-weight: bold;">Chapter 8 - Confidence Intervals</span><br /><br />8.1<br />Construct a confidence interval for the population mean, given a sample mean, <i style="">population</i> standard deviation, sample size and level of confidence.<br />Know that a high level of confidence requires a wider confidence interval.<br /><br />8.2<br />Construct a confidence interval for the population mean, given the sample mean, <i style="">sample</i> standard deviation, sample size and level of confidence.<br />Know that for the t statistic, the degrees of freedom is n-1.<br />Read the t-table to find the critical value for a given level of confidence and degrees of freedom.<br /><br />8.4<br />Calculate the sample size required for a given margin of error and level of confidence.<br />Know that a smaller margin of error requires a larger sample size.<br />Know that a higher level of confidence requires a larger sample size.<br /><br /><span style="font-weight: bold;">Chapter 9 – Hypothesis Testing</span><br /><br />9.1<br />Understand the concept of the null and alternative hypotheses.<br />Construct null and alternative hypotheses based on a description of the test.<br />Understand the concepts of rejection and non-rejection regions.<br />Understand the level of significance, alpha, of a hypothesis test.<br /><br />9.2<br />Know the difference 
between a one-tailed and two-tailed hypothesis test.<br />Calculate critical values for the rejection and non-rejection regions for both one-tailed and two-tailed tests.<br />Calculate the z test statistic and compare to critical values to make a decision whether or not to reject the null hypothesis.<br />Calculate the p-value and compare to the level of significance to make a decision whether or not to reject the null hypothesis.<br /><br />9.3<br />Create null and alternative hypotheses for one-tailed testing.<br /><br />9.4<br />Use the t test statistic to conduct one and two-tailed hypothesis tests when σ is not known.<br /><br /><span style="font-weight: bold;">Chapter 12 - Simple Linear Regression</span><br /><br />12.1<br />Understand the basic concepts of independent and dependent variables, intercept and slope.<br />Understand the concept of simple linear regression.<br />Regression: modeling a relationship between variables with a curve<br />Linear Regression: the curve in the relationship is a straight line (not some sort of arc)<br />Simple Linear Regression: only consider one independent variable as the predictor of the dependent variable<br />Understand the simple linear regression model formula: Y<sub>i</sub> = β<sub>0</sub> + β<sub>1</sub>X<sub>i</sub> + ε<sub>i</sub><br /><br />12.2<br />Understand the method of least squares.<br />Apply the computation formulas of the least squares method to compute the Y intercept b<sub>0</sub> and the slope b<sub>1</sub>.<br />Know how to read and interpret partial computer output (Minitab) and develop the regression line based on it.<br /><br />12.3<br />Understand the sum of squares terms SST, SSR and SSE for the measures of variation in regression.<br />Calculate any of the sum of squares terms, given the other two.<br />Understand the coefficient of determination, r<sup>2</sup>.<br />Calculate r<sup>2</sup> given any two sum of square terms.<br />Know how to read and interpret partial computer output 
(Minitab) and calculate sum of squares terms and r<sup>2</sup> based on it.<br />Understand the standard error of the estimate and calculate it, given SSE or SST and SSR.<br /><br />12.4<br />Know the four assumptions necessary to use the method of least squares in simple linear regression.<br /><br />12.5<br />Know how to use residual analysis to validate the four assumptions.<br /><br />12.6<br />Understand the Durbin-Watson statistic.<br />Know how to interpret the Durbin-Watson statistic to detect autocorrelation.<br /><br />12.7<br />Calculate the standard error of the slope, S<sub>b<sub>1</sub></sub>.<br />Calculate the t test statistic for the slope and determine whether there is a significant linear relationship.<br />Know that when comparing the t test statistic for the slope to the critical t value, you use n-2 degrees of freedom.<br />Construct a confidence interval for the slope.<br /><br />12.8<br />Construct a prediction interval for an individual response Y.<br />Construct a confidence interval for the mean of Y.<br />(This is as far as I got so far. More to come!)<img src="http://feeds.feedburner.com/~r/Gsb420/~4/-rETvESlics" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-18091820788692237492008-03-04T16:46:00.004-06:002008-11-13T00:06:32.435-06:00Lecture 8 - Confidence Interval for Ŷ<span style="font-weight: bold;">Confidence Interval for Ŷ</span><br />Once we've calculated our regression coefficients, b<sub>0</sub> and b<sub>1</sub>, we can estimate the value of Y at any given X with the formula:<br />Ŷ = b<sub>0</sub> + b<sub>1</sub>X<br /><br />This is known as a <span style="font-weight: bold;">point estimate</span>. It estimates Ŷ to a point. 
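<br /><br />As a quick check of the formula, here is a minimal Python sketch (Python is not part of the course, which uses Minitab). It plugs in the coefficients from the site.mtw Minitab output quoted on this blog (Constant = 0.9645, Square Feet = 1.6699) and assumes a new store of 4 (thousand) square feet; the function name point_estimate is made up for illustration.<br /><br />

```python
# Regression coefficients taken from the site.mtw Minitab output
# (Constant and Square Feet rows of the coefficient table).
b0 = 0.9645  # Y intercept
b1 = 1.6699  # slope


def point_estimate(x):
    """Return the point estimate Y-hat = b0 + b1 * X (hypothetical helper)."""
    return b0 + b1 * x


# X is in thousands of square feet, so 4 means a 4,000 square foot store.
y_hat = point_estimate(4.0)
print(round(y_hat, 3))
```

The result, about 7.644, agrees with the "Fit" value that Minitab reports for a new observation at 4 (thousand) square feet.<br /><br />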
However, since it's just an estimate, it's logical to ask for a confidence interval around Ŷ.<br /><br />In addition to constructing a confidence interval for an <span style="font-style: italic;">individual </span>Ŷ, we can also construct a confidence interval for an <span style="font-style: italic;">average </span>Ŷ at X.<br /><br />The difference is best illustrated with an example. In site.mtw we have data on square footage of stores and their annual sales. In general, sales increase linearly with increasing square footage. We perform a regression analysis and determine the regression coefficients. Now we could ask two questions:<br />1. If I build a single new store with 4,000 square feet, what does the regression predict for its annual sales? The answer can be expressed as a confidence interval for an <span style="font-style: italic;">individual </span>Ŷ, because we're making a prediction for an <span style="font-style: italic;">individual </span>new store.<br />2. If I build 10 new stores, each with 4,000 square feet, what does the regression predict for the average annual sales of those stores? 
The answer to this question can be expressed as a confidence interval for an <span style="font-style: italic;">average </span>Ŷ, since we're making a prediction about the <span style="font-style: italic;">average </span>sales at many new stores.<br /><br /><span style="font-weight: bold;">Confidence Interval for <span style="font-style: italic;">Average </span>Ŷ</span><br />The confidence interval for the <span style="font-style: italic;">average </span>Ŷ (question #2 above) takes the common form:<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_QYuWE-KipG0/R83cDjYoJWI/AAAAAAAAANE/7w8r32lQc6I/s1600-h/AverageConfInterval.JPG"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_QYuWE-KipG0/R83cDjYoJWI/AAAAAAAAANE/7w8r32lQc6I/s320/AverageConfInterval.JPG" alt="" id="BLOGGER_PHOTO_ID_5174033500514821474" border="0" /></a>where S<sub>Ŷ</sub> is the standard error of Ŷ. <span style="font-weight: bold;">Note: </span><span style="font-style: italic; font-weight: bold;">n-2</span> is used in looking up the t value in the t table.<br /><br />We are told that<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_QYuWE-KipG0/R83eCTYoJXI/AAAAAAAAANM/cvC1reNdjW0/s1600-h/StdErrYHat.JPG"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://1.bp.blogspot.com/_QYuWE-KipG0/R83eCTYoJXI/AAAAAAAAANM/cvC1reNdjW0/s320/StdErrYHat.JPG" alt="" id="BLOGGER_PHOTO_ID_5174035678063240562" border="0" /></a>where h<sub>i</sub> is given by:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_QYuWE-KipG0/R83ZUDYoJVI/AAAAAAAAAM8/u9lVLD_5hjM/s1600-h/hi.JPG"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" 
src="http://4.bp.blogspot.com/_QYuWE-KipG0/R83ZUDYoJVI/AAAAAAAAAM8/u9lVLD_5hjM/s320/hi.JPG" alt="" id="BLOGGER_PHOTO_ID_5174030485447779666" border="0" /></a>That's all there is to it! We'll see in a minute that Minitab can calculate the standard error term for us, so constructing the interval is just a matter of looking up the value in the t table and then doing the arithmetic.<br /><br /><span style="font-weight: bold;">Confidence Interval for <span style="font-style: italic;">Individual </span>Ŷ</span><br />If we're constructing the confidence interval for an individual Ŷ (question #1 above), the calculations are very similar except that we use a 1+h<sub>i</sub> term in place of h<sub>i</sub>. So that term becomes:<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_QYuWE-KipG0/R83gcDYoJYI/AAAAAAAAANU/jTmnhG3HPlc/s1600-h/StdErrIndivYHat.JPG"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_QYuWE-KipG0/R83gcDYoJYI/AAAAAAAAANU/jTmnhG3HPlc/s320/StdErrIndivYHat.JPG" alt="" id="BLOGGER_PHOTO_ID_5174038319468127618" border="0" /></a>Other than the standard error term, everything else is the same as calculating for an average Ŷ.<br /><br /><span style="font-weight: bold;">Using Minitab to calculate the confidence interval</span><br />Here's how to get the info we need out of Minitab:<br />1. Load up your data in a worksheet. (We use the site.mtw file as usual.)<br />2. Select Stat-Regression-Regression from the menubar.<br />3. Put the independent variable (square feet) in the Predictor box. Put the dependent variable (annual sales) in the Response box.<br />4. Click the Options button. Enter 4 in the Prediction interval for new observations box. This tells Minitab that we want a prediction of annual sales at 4000 square feet (the units of our data are thousands of feet).<br />5. 
Check the Confidence Limit checkbox if you want a confidence interval for an average Y. Check the Prediction Limit checkbox if you want a confidence interval for an individual Y.<br />6. Click OK in the Options and the main Regression windows. The results appear in the session window. Here's the relevant information:<br /><pre><span style="font-family:courier new;">Predicted Values for New Observations</span><br /><br /><span style="font-family:courier new;">New</span><br /><span style="font-family:courier new;">Obs Fit SE Fit 95% CI 95% PI</span><br /><span style="font-family:courier new;"> 1 7.644 0.309 (6.971, 8.317) (5.433, 9.854)</span><br /></pre>The prediction is 7.64 (the "fit"). The SE Fit term is the standard error for the average Y (the one with just h<sub>i</sub>, not 1+h<sub>i</sub>).<br /><br />I suspect that we'll probably be expected to construct the confidence interval for the average Y, given the Fit and SE Fit output. Don't forget: You still need to look up the t value (at n-2!) 
and multiply the SE Fit value by it.<img src="http://feeds.feedburner.com/~r/Gsb420/~4/kd9Q466BT4A" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-49318475015925904132008-03-04T15:06:00.005-06:002008-03-04T15:39:02.328-06:00Lecture 8 - Inferences About the Regression Slope - Part 2<span style="font-weight: bold;">Confidence Interval for b<sub>1</sub></span><br />The second question that we ask when evaluating the regression is: What is the confidence interval for b<sub>1</sub>?<br /><br />Like any confidence interval, this one will take the form:<br />b<sub>1</sub> ± t<sub>α/2, n-2</sub> S<sub>b<sub>1</sub></sub><br /><br />In the last blog post, we found that S<sub>b<sub>1</sub></sub> is:<br />S<sub>b<sub>1</sub></sub> = S<sub>XY</sub>/SQRT(SSX)<br /><br />Knowing S<sub>b<sub>1</sub></sub>, we can look up t in the t-table and construct the confidence interval relatively easily.<br /><br />Note that for the confidence interval for b<sub>1</sub> we use <span style="font-weight: bold;">n-2</span> in looking up the t-score.<br /><br /><span style="font-weight: bold;">Using Minitab to evaluate the regression</span><br />We won't be expected to calculate S<sub>XY</sub> and S<sub>b<sub>1</sub></sub> by hand for the final (or so we were told). But we will likely be asked to create a confidence interval for b<sub>1</sub> given a snippet of Minitab output. 
So it's worthwhile to take a look at it:<br /><br />We used the site.mtw dataset and ran the standard regression analysis and got this:<br /><pre>Predictor Coef SE Coef T P<br />Constant 0.9645 0.5262 1.83 0.092<br />Square Feet <span style="font-weight: bold; color: rgb(51, 51, 255);">1.6699</span> <span style="font-weight: bold; color: rgb(255, 0, 0);">0.1569</span> <span style="font-weight: bold; color: rgb(0, 153, 0);">10.64</span> <span style="color: rgb(153, 51, 153); font-weight: bold;">0.000</span><br />S = 0.966380 R-Sq = 90.4% R-Sq(adj) = 89.6%<br /></pre>The S<sub>b<sub>1</sub></sub> value is calculated for us, but it's not obvious where it is. It's the SE Coef term that I've highlighted in red. b<sub>1</sub> itself is the Coef term, in blue. With those two numbers and a t-table, you can construct a confidence interval for b<sub>1</sub>. Just remember to use n-2 in the t-table.<br /><br />With these two values you can also determine the t statistic for hypothesis testing β<sub>1</sub>=0 from the previous blog post by dividing b<sub>1</sub>/S<sub>b<sub>1</sub></sub>. But the truth is, you don't have to do that! The t-value for b<sub>1</sub> is right there in the Minitab output also. I've highlighted it in green. The number in the p column (highlighted purple) is the p-value for b<sub>1</sub>. 
So if that number is less than α (Minitab's p-value here already covers both tails), then you can reject the hypothesis that β<sub>1</sub> is 0.<img src="http://feeds.feedburner.com/~r/Gsb420/~4/Hy0J0qfWghI" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-5275155046165092992008-03-04T14:15:00.003-06:002008-03-04T15:29:23.220-06:00Lecture 8 - Inferences About the Regression SlopeAfter we use the method of least-squares to calculate regression coefficients (b<sub>0</sub> and b<sub>1</sub>) and we validate the LINE assumptions, we next turn to evaluating the regression, specifically the slope, b<sub>1</sub>, and ask two questions:<br />1. Is it statistically significant?<br />2. What is the confidence interval for b<sub>1</sub>?<br /><br />The first question (we actually covered this <span style="font-style: italic;">after </span>the second question in class), whether b<sub>1</sub> is statistically significant, is determined by asking: Is it any better than a flat horizontal line through the data?<br /><br />We answer this question by making a hypothesis that the true relationship slope, β<sub>1</sub>, is 0 and using our skills at hypothesis testing to determine whether we should reject that hypothesis.<br /><br />H<sub>0</sub>: β<sub>1</sub> = 0<br />H<sub>1</sub>: β<sub>1</sub> ≠ 0<br /><br />The t statistic that we use to test the hypothesis is:<br />t = (b<sub>1</sub>-β<sub>1</sub>)/S<sub>b1</sub><br />where S<sub>b1</sub> is the standard error of the slope.<br /><br />In our case, β<sub>1</sub> is 0 according to our hypothesis, so t reduces to:<br />t = b<sub>1</sub>/S<sub>b1</sub><br /><br />The standard error of the slope, S<sub>b1</sub>, is defined as:<br />S<sub>b1</sub> = S<sub>XY</sub>/SQRT(SSX)<br />where S<sub>XY</sub> is the standard error of the estimate.<br /><br />The standard error of the estimate, S<sub>XY</sub>, is defined as:<br />S<sub>XY</sub> = SQRT(SSE/(n-2))<br /><br />So, if we have 
our calculations of SSX and SSE, we can do the math and find S<sub>b1</sub> and the t-score for b<sub>1</sub>.<br /><br />We finish our hypothesis testing by comparing the t-score for b<sub>1</sub> to t<sub>α/2, n-2</sub>, where α is our level of significance.<br />If t is beyond t<sub>α/2, n-2</sub> (either on the positive or negative end), we conclude that the hypothesis, H<sub>0</sub>, must be rejected.<br />We could also make the conclusion based on the p-value of the t-score. If the p-value, calculated on one side as we did in class, is less than α/2 (equivalently, if the two-sided p-value is less than α), then we reject H<sub>0</sub>.<br /><br />**Confidence interval for b<sub>1</sub> will be covered in the next blog post.**<img src="http://feeds.feedburner.com/~r/Gsb420/~4/BM7JzAKaRGI" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-25383426368205890282008-03-04T09:59:00.006-06:002008-11-13T00:06:32.952-06:00Lecture 8 - Residual Analysis - Checking Independence of Errors<span style="font-weight: bold;">Checking the Independence of Errors Assumption</span><br />The "I" in the LINE mnemonic stands for Independence of Errors. This means that the distribution of errors is random and not influenced by or correlated to the errors in prior observations. 
The opposite of independence is called <span style="font-weight: bold;">autocorrelation</span>.<br /><br />Clearly, we can only check for independence/autocorrelation when we know the order in which the observations were made and the data points were collected.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_QYuWE-KipG0/R813zA1-Z2I/AAAAAAAAAMU/DU1PNnu-HOM/s1600-h/Autocorr.JPG"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://2.bp.blogspot.com/_QYuWE-KipG0/R813zA1-Z2I/AAAAAAAAAMU/DU1PNnu-HOM/s200/Autocorr.JPG" alt="" id="BLOGGER_PHOTO_ID_5173923265201989474" border="0" /></a>We check for independence/autocorrelation in two ways. First, we can plot the residuals vs. the sequential number of the data point. If we notice a pattern, we say that there is an autocorrelation effect among the residuals and the independence assumption is not valid. The plot at right of residuals vs. observation week shows a clear up and down pattern of the residuals and indicates that the residuals are not independent.<br /><br />The second test of independence/autocorrelation is a more quantitative measure. (All the methods that we've used up to this point for checking assumptions have been graphical/visual.) This test involves calculating the <span style="font-weight: bold;">Durbin-Watson Statistic</span>. 
The D-W statistic is defined as:<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_QYuWE-KipG0/R82Big1-Z4I/AAAAAAAAAMk/xS8_MlnUET8/s1600-h/DWStat.JPG"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://4.bp.blogspot.com/_QYuWE-KipG0/R82Big1-Z4I/AAAAAAAAAMk/xS8_MlnUET8/s200/DWStat.JPG" alt="" id="BLOGGER_PHOTO_ID_5173933976850425730" border="0" /></a><br />It's the sum of the squares of the differences between consecutive errors divided by the sum of the squares of all errors.<br /><br />Another way to look at the Durbin-Watson Statistic is the approximation:<br /><div style="text-align: center;"><span style="font-size:130%;">D ≈ 2(1-ρ)</span><br /></div>where ρ (the Greek letter rho - lower case) = the correlation between consecutive errors.<br /><br />Looking at it that way, there are 3 important values for D:<br />D=0: This means that ρ=1, indicating a perfect positive correlation.<br />D=2: In this case, ρ=0, indicating no correlation.<br />D=4: ρ=-1, indicating a perfect negative correlation<br /><br />In order to assess whether there is independence, we check to see if D is close to 2 (in which case we say there is no correlation and errors are independent) or if it's closer to one of the other extreme values of 0 or 4 (in which case we say that the independence assumption is not valid). 
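<br /><br />As a minimal sketch of the definition above (sum of squared differences between consecutive errors, divided by the sum of squared errors), the statistic can be computed in a few lines of Python. The function name and both residual lists are made up for illustration; they are not from the class data.<br />

```python
def durbin_watson(residuals):
    """Sum of squared differences between consecutive residuals,
    divided by the sum of all squared residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, listed in the order the observations were collected.
trending = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]     # slow drift: D lands near 0
alternating = [2.0, -2.0, 2.0, -2.0, 2.0, -2.0]  # sign flips: D lands near 4

print(durbin_watson(trending))
print(durbin_watson(alternating))
```

The drifting residuals give a D well below 2 and the alternating residuals give a D well above 2, which matches the three reference values listed above.<br /><br />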
There is also some grey area between 0 and 2 and between 2 and 4, in which case we say that the Durbin-Watson statistic does not give us enough information to make a determination; the test is inconclusive.<br /><br />To determine the boundaries for when the Durbin-Watson statistic is relevant and when it's inconclusive, we turn to table E.9, which provides us with lower and upper bounds, d<sub>L</sub> and d<sub>U</sub>.<br /><br /><span style="font-weight: bold;">Reading the Durbin-Watson Critical Values Table</span><br />The critical values are dependent on the sample size, n, the number of independent variables in the regression model, k, and the level of significance, α. In the case of simple linear regression, there's always only 1 independent variable. (That's the <span style="font-style: italic;">simple</span> part.) The level of significance is usually 0.01 (99% confidence) or 0.05 (95% confidence).<br /><br />So, to read the table:<br />1. Locate the large section of the table for your level of significance, α.<br />2. Find the two columns, d<sub>L</sub> and d<sub>U</sub>, for k=1 (assuming it's simple).<br />3. Go down the column to the row with your sample size, n.<br />4. 
Read the two values for d<sub>L</sub> and d<sub>U</sub><br /><br /><span style="font-weight: bold;">Interpreting the Durbin-Watson Statistic</span><br />0 < D < d<sub>L</sub>: There is positive autocorrelation<br />d<sub>L</sub> < D < d<sub>U</sub>: Inconclusive<br />d<sub>U</sub> < D < 4-d<sub>U</sub>: No autocorrelation<br />4-d<sub>U</sub> < D < 4-d<sub>L</sub>: Inconclusive<br />4-d<sub>L</sub> < D < 4: There is negative autocorrelation<br /><br />Graphically, it can be represented like this:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_QYuWE-KipG0/R82MFA1-Z6I/AAAAAAAAAM0/MW2HBm_CcL0/s1600-h/DWLine.JPG"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="http://2.bp.blogspot.com/_QYuWE-KipG0/R82MFA1-Z6I/AAAAAAAAAM0/MW2HBm_CcL0/s400/DWLine.JPG" alt="" id="BLOGGER_PHOTO_ID_5173945564672190370" border="0" /></a><br />Note: Positive autocorrelation is somewhat common. Negative autocorrelation is very uncommon and our book does not deal with it.<img src="http://feeds.feedburner.com/~r/Gsb420/~4/0R9KTrFxgsQ" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-62252679595699354692008-03-02T15:40:00.004-06:002008-11-13T00:06:33.162-06:00Lecture 8 - Residual Analysis - Checking the Equal Variance Assumption<span style="font-weight: bold;">Homoscedasticity</span> <span style="font-size:78%;">(Not that there's anything wrong with that.)</span><br />We now turn to checking the assumption of <span style="font-style: italic;">equal variance of errors</span>, the "E" in our LINE mnemonic. This assumption states that not only is the error at each x-value distributed normally, but the variance in the error is equal at each point.<br /><br />Equal variance of errors is known as <span style="font-style: italic;">homoscedasticity</span>. 
Unequal variance of errors is called <span style="font-style: italic;">heteroscedasticity</span>.<br /><br />For this analysis we turn again to the plot of residuals vs. the independent variable (x) that we used when we validated the linearity assumption. For linearity, we were just looking to see if the residuals were evenly distributed above and below the x-axis. To check for equal variance of errors, we check to see if there's any pattern in the distribution of the residuals around the x-axis.<br /><br /><span style="font-weight: bold;">Running the residual plot versus x in Minitab:</span><br />1. Load up your data.<br />2. Select Stat-Regression-Regression from the menu bar.<br />3. Put Annual Sales in the Response box and Square Feet in the Predictor box.<br />4. Click the Graphs button. Put Square Feet in the Residuals versus the variables box.<br />5. Click OK in both the Graphs and Regression dialogs. The residual plot appears.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_QYuWE-KipG0/R8soGd4NiXI/AAAAAAAAAMM/aOLc-D-icA4/s1600-h/ResVsSqFtPlot.JPG"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://4.bp.blogspot.com/_QYuWE-KipG0/R8soGd4NiXI/AAAAAAAAAMM/aOLc-D-icA4/s200/ResVsSqFtPlot.JPG" alt="" id="BLOGGER_PHOTO_ID_5173272688529869170" border="0" /></a>Review the graph and ask yourself: <span style="font-weight: bold;">Is there any <span style="font-style: italic;">pattern </span>in the residuals?</span> Do they get increasingly larger or smaller as x changes? If so, then you have a case of heteroscedasticity. But if the residuals are distributed evenly and consistently around the x-axis, then you can conclude that the variances are consistent and the assumption of equality of variances is valid.<br /><br />In our example, I'd be somewhat concerned with the fact that the residuals are closer to the x-axis for small values of x, but broaden out for larger values. 
The variance does seem to taper off as x gets very large, which is an indication that the variances are equal for x>2 or so. (Click the graph for a larger view of the plot.)<img src="http://feeds.feedburner.com/~r/Gsb420/~4/jYwVsmKevOw" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-6966163549250586282008-03-02T14:20:00.004-06:002008-11-13T00:06:33.491-06:00Lecture 8 - Residual Analysis - Checking the Normality AssumptionThe next assumption in the LINE mnemonic after <span style="font-style: italic;">Linearity </span>is <span style="font-style: italic;">Independence of Errors</span>. We skipped that one momentarily because it's a bit more complex than the others. So we saved it for last. In the meantime, we looked at the next assumption: <span style="font-style: italic;">Normality of Error</span>.<br /><br /><span style="font-weight: bold;">Checking the Normality Assumption</span><br />This assumption states that the error in the observation is distributed normally at each x-value. A larger error is less likely than a smaller error and the distribution of errors at any x follows the normal distribution.<br /><br />Although we typically only have one observation at each x, if we assume that the distribution of the errors is the same at each x, we can simply plot all the errors (residuals) and check if they follow the normal distribution. We do this by running a normal probability plot of the residuals. Fortunately for us, Minitab has a built-in normal probability plot function.<br /><br /><span style="font-weight: bold;">Checking Normality Using Minitab</span><br />1. Open up your data worksheet. As usual, we'll use the site.mtw file for our example.<br />2. Select Stat-Regression-Regression from the menu bar.<br />3. 
Put Annual Sales in the Response box since it's the dependent (response) variable and put Square Feet in the Predictors box since it's the independent (predictor) variable.<br />4. Click the Graphs button and under Residual Plots, check the Normal plot of residuals checkbox.<br />5. Click OK in the Graphs and the Regression dialogs.<br /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_QYuWE-KipG0/R8seHd4NiWI/AAAAAAAAAME/M7T5vizIn5g/s1600-h/NormPlotResids.JPG"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://4.bp.blogspot.com/_QYuWE-KipG0/R8seHd4NiWI/AAAAAAAAAME/M7T5vizIn5g/s200/NormPlotResids.JPG" alt="" id="BLOGGER_PHOTO_ID_5173261710593460578" border="0" /></a>Minitab creates the normal probability plot of the residuals. The y-axis of this graph is adjusted so that if the data are distributed normally, they will fall on a straight line on the graph. Minitab even draws a line through the residuals for us (presumably using the method of least-squares).<br /><br /><span style="font-weight: bold;">Drawing a conclusion from the graph</span><br />Review this graph and ask yourself: <span style="font-weight: bold;">Do the residual points fall more-or-less on a straight line in the normal probability plot?</span> If they do, you can conclude that the errors are distributed normally and the normality of errors assumption is valid. In our example, the normal probability plot of the residuals is pretty much linear, but I would be concerned about the upward trend at the far right end of the graph. 
(Click the graph to see it in more detail.)<img src="http://feeds.feedburner.com/~r/Gsb420/~4/at9wGQvGj0I" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-38379113909117059752008-02-29T14:42:00.006-06:002008-11-13T00:06:33.711-06:00Lecture 8 - Residual Analysis - Checking Linearity<span style="font-weight: bold;">Checking Linearity</span><br />Our method for checking the first assumption, linearity of the data, is not a precise, quantitative test. Rather, we'll use visual inspection to check for linearity.<br /><br />One quick way to test the linearity of the data is to create an x-y scatter plot and observe whether the data generally follows a straight line (either with positive or negative slope). Plotting the regression line through the data may help visualize this as well.<br /><br /><span style="font-weight: bold;">Using Minitab for the linearity check</span>:<br />1. Bring up your data in a worksheet. We used the site.mtw file in class.<br />2. Select Graph-Scatterplot from the menu bar. Select the "With Regression" option when prompted for the type of scatterplot.<br />3. Put Annual Sales (the dependent variable) in the Y Variables column and Square Feet (the independent variable) in the X Variables column. Remember that the independent variable is the variable that you can control and which you think will be a predictor of the dependent variable. In other words, the annual sales is dependent on the size of the store (in square feet). It's not the other way around. 
The size of the store doesn't grow or shrink depending on the number of sales!<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_QYuWE-KipG0/R8iW9t4NiUI/AAAAAAAAAL0/G7RNGQ-_UW4/s1600-h/LinearRegPlot.JPG"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://2.bp.blogspot.com/_QYuWE-KipG0/R8iW9t4NiUI/AAAAAAAAAL0/G7RNGQ-_UW4/s200/LinearRegPlot.JPG" alt="" id="BLOGGER_PHOTO_ID_5172550159066564930" border="0" /></a>4. Don't change the default options and click OK. You should get a plot of your data with a regression line through it. (If you don't get the regression line, in step 3 click Data View, Regression Tab and make sure Linear is selected.)<br /><br />To interpret the linearity of this graph, "eyeball" the way the points fall above and below the regression line and ask yourself: <span style="font-weight: bold;">Are the data points relatively linear or are they curved or skewed in some way?</span> In our case, the data is relatively linear and not curved, so we conclude that the assumption of linearity is valid.<br /><br />A better way to visually assess the linearity is to plot the residuals versus the independent variable and look to see if the errors are distributed evenly above and below 0 along the entire length of the sample.<br /><br /><span style="font-weight: bold;">Plotting Residuals versus the Independent Variable with Minitab</span><br />1. Select Stat-Regression-Regression from the menu bar.<br />2. Put Annual Sales in the Response box and Square Feet in the Predictors box. In our scenario, we think that the number of square feet will be a predictor of the annual sales of the store. Notice that the predictors box is large. There can be more than one predictor - perhaps advertising, employee training, etc. Many things can influence the response variable - the annual sales. We'll get to that during <span style="font-style: italic;">multiple </span>linear regression. 
Right now, for simple linear regression, we're just looking at <span style="font-style: italic;">a single</span> predictor.<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_QYuWE-KipG0/R8iYId4NiVI/AAAAAAAAAL8/ZkLZ8rSIfYE/s1600-h/ResVsSqFtPlot.JPG"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://1.bp.blogspot.com/_QYuWE-KipG0/R8iYId4NiVI/AAAAAAAAAL8/ZkLZ8rSIfYE/s200/ResVsSqFtPlot.JPG" alt="" id="BLOGGER_PHOTO_ID_5172551443261786450" border="0" /></a>3. Click the Graphs button. Put Square Feet in the Residuals versus the variables box.<br /><br />To interpret this graph, ask yourself: <span style="font-weight: bold;">Do the residual points fall equally above and below 0 along the entire length of the horizontal axis?</span> In our case, the residuals do more or less fall equally above and below 0, so we conclude that the data is linear and the assumption of linearity is valid. Note: We also see that the residuals are closer to 0 for lower values of x (square feet). That may become important later when we talk about equal variance of errors.<img src="http://feeds.feedburner.com/~r/Gsb420/~4/ZPe-IA76vtw" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-40467882874239744912008-02-29T10:40:00.005-06:002008-11-13T00:06:33.909-06:00Lecture 8 - Residual Analysis - Definition<img style="margin: 0pt 0pt 10px 10px; float: right;" src="http://4.bp.blogspot.com/_QYuWE-KipG0/R8hARN4NiTI/AAAAAAAAALs/7L6MXK2oEi8/s320/ICA.JPG" alt="" id="BLOGGER_PHOTO_ID_5172454836562397490" border="0" /><span style="font-weight: bold;">Residual Analysis</span><br />In Lecture 7 we discussed how to use the method of least squares to perform simple linear regression on a set of data. 
We also discussed the four assumptions we make about our data in order to use the method of least-squares for the regression:<br />1. <span style="font-weight: bold;">L</span>inearity<br />2. <span style="font-weight: bold;">I</span>ndependence of errors<br />3. <span style="font-weight: bold;">N</span>ormality of error<br />4. <span style="font-weight: bold;">E</span>qual variance of errors<br /><br />The error is also known as the <span style="font-weight: bold;">residual </span>and is the difference between the observed Y<sub>i</sub> value, for any particular X<sub>i</sub>, and the value for Y<sub>i</sub> predicted by our regression model, which is usually symbolized by Ŷ<sub>i</sub> (read "y hat sub i"). The residual is symbolized by the Greek letter epsilon (lower case) - ε<sub>i</sub>.<br /><br />ε<sub>i</sub> = Y<sub>i</sub> - Ŷ<sub>i</sub><br /><br />We perform a four-part <span style="font-weight: bold;">residual analysis</span> on our data to evaluate whether each of the four assumptions holds and, based on the outcome, we can determine whether our linear regression model is the correct model.<br /><br />It's called a residual analysis because 3 of the 4 assumptions (independence, normality and equality of variance) directly relate to the errors (the residuals) and the other assumption (linearity) is tested by assessing the residuals.<img src="http://feeds.feedburner.com/~r/Gsb420/~4/9EfO00LYte8" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-23714695695062225712008-02-28T15:42:00.007-06:002008-11-13T00:06:34.525-06:00Using the DePaul Online Library for Research<span style="font-weight: bold;">DePaul Online Research Library</span><br />One of the really nice perks that we have as students at DePaul is the online research library. The library subscribes to many databases of academic journals and magazines which are searchable. 
Many of these databases allow you to access and download full text versions of the journal articles, usually available in PDF format. To access the online research library, all you need is your DePaul student ID.<br /><br />Another really nice feature of the DePaul online research library is that it can be integrated with the Google Scholar search engine. I'll show you how to do that in a future blog post.<br /><br />Here's how to access the DePaul online research library:<br /><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://3.bp.blogspot.com/_QYuWE-KipG0/R8dDPHM9ObI/AAAAAAAAALE/ntp5m6WYmTE/s200/ORLHome.JPG" alt="" id="BLOGGER_PHOTO_ID_5172176623968795058" border="0" />1. Bring up the main DePaul web site by browsing to <a href="http://www.depaul.edu/">http://www.depaul.edu</a>.<br />2. At the top of the page, click on the "Libraries" link.<br /><br />3. This will bring you to the Library page. There are lots of links to follow here. Focus on the "Research" section. The way this works is that you need to identify the database in which you want to conduct your search. Once you do that, you can use the database's internal search function to find your article. So, how do you find a database? There are a couple of ways:<br /><br /><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://1.bp.blogspot.com/_QYuWE-KipG0/R8dDZnM9OcI/AAAAAAAAALM/JHQvx9M9qH4/s200/ORLLibPage.JPG" alt="" id="BLOGGER_PHOTO_ID_5172176804357421506" border="0" /><span style="font-weight: bold;">Method 1:</span><br />Use this method if you're just starting out and don't know which database or journal you're going to search<br />4. Click the "Journals and newspaper articles" link. That will bring you to the subject page.<br />5. Since we're studying statistics, a good choice for subject would be Mathematical Sciences. Click that link.<br />6. You get the database list. For mathematical sciences, we subscribe to 7 databases. 
The database list gives you a short description of the database and the dates covered by the database. Some of the databases indicate whether we subscribe to full text of articles with a FT icon:<br /><img style="cursor: pointer;" src="http://2.bp.blogspot.com/_QYuWE-KipG0/R8dEr3M9OeI/AAAAAAAAALc/63vS2WSsLiM/s200/FT.JPG" alt="" id="BLOGGER_PHOTO_ID_5172178217401661922" border="0" /><br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_QYuWE-KipG0/R8eVnN4NiSI/AAAAAAAAALk/YZ3KzCWusFY/s1600-h/DepaulLoginProxy.JPG"><img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer;" src="http://3.bp.blogspot.com/_QYuWE-KipG0/R8eVnN4NiSI/AAAAAAAAALk/YZ3KzCWusFY/s200/DepaulLoginProxy.JPG" alt="" id="BLOGGER_PHOTO_ID_5172267198031169826" border="0" /></a>7. Choose your database and click on its link. You'll be prompted for your DePaul username and password. Enter those and click Login.<br /><br /><span style="font-weight: bold;">Method 2:</span><br />Use this method if you know the database you want to search<br />4. In the Research section of the Library page you can click on the A-Z Database List to see all the databases. If you already know the database that you want to search, you can skip by the "subject" steps 4-5 above in method 1 by just using the A-Z list.<br /><br /><span style="font-weight: bold;">Method 3:</span><br />Use this method if you know the name of the journal that you want to search<br />4. Click the "Journals and newspaper articles" link.<br />5. On the left hand margin, enter the name of the journal and click Search.<br />6. The results page will show you which databases contain that journal and for which years<br /><br />Each database has its own interface and it would be impossible for me to cover all of them, but most of them are self explanatory and user-friendly. You can usually search by author, article title or keyword. 
Several databases also allow you to browse the issues of the journals in the database.<br /><br />More to come...<img src="http://feeds.feedburner.com/~r/Gsb420/~4/-49vQPC8W-8" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-40317356771841390102008-02-26T19:16:00.004-06:002008-11-13T00:06:34.629-06:00Lecture 7 - Coefficient of DeterminationAfter calculating β<sub>0</sub> and β<sub>1</sub> to determine the best line to fit the data, we want to quantify <span style="font-style: italic;">how well</span> the line fits the data. It may be the best line, but how good is it?<br /><br /><img style="margin: 0pt 0pt 10px 10px; float: right;" src="http://4.bp.blogspot.com/_QYuWE-KipG0/R8TP1HM9OaI/AAAAAAAAAK4/bEtCnAMjvtU/s200/squares.JPG" alt="" id="BLOGGER_PHOTO_ID_5171486783501580706" border="0" /><span style="font-weight: bold;">Sum of Squares</span><br />Looking at the graph of the data, we could say that without any modeling or regression at all, we would expect the y-value for any given x to be the mean y, ybar. Most of the observations, of course, would not be equal to the mean. We can measure how far the observations are from the mean by taking the difference between each y<sub>i</sub> and ybar, squaring them, and taking the sum of the squares. We call this the<span style="font-weight: bold;"> total sum of squares </span>or <span style="font-weight: bold;">SST</span>.<br /><br />You probably remember that the <span style="font-weight: bold;">variance </span>that we discussed much earlier in the course is this sum of squares divided by n-1.<br /><br />The total sum of squares is made up of two parts - the part that is explained by the regression (yhat-ybar) and the part that the observation differs from the regression (y<sub>i</sub>-yhat). 
When we square each of these and sum them we compute the <span style="font-weight: bold;">regression sum of squares, SSR</span>, and the <span style="font-weight: bold;">error sum of squares, SSE</span>.<br /><br /><span style="font-weight: bold;">Coefficient of Determination</span><br />The 3 sum of squares terms, SST, SSR and SSE, don't tell us much by themselves. If we're dealing with observations which use large units, these terms may be relatively large even though the variance from a linear relationship is small. On the other hand, if the units of the measurements in our observations are small, the sum of squares terms may be small even when the variance from linearity is great.<br /><br />Therefore, the objective statistic that we use to assess how well the regression fits the data is the ratio of the regression sum of squares, SSR, to the total sum of squares, SST. We call this statistic the <span style="font-weight: bold;">coefficient of determination, r<sup>2</sup></span>.<br /><br />r<sup>2</sup> = SSR / SST<img src="http://feeds.feedburner.com/~r/Gsb420/~4/ldCvijDhJaU" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-44781860040669648562008-02-25T19:51:00.005-06:002008-11-13T00:06:34.768-06:00Lecture 7 - Assumptions in the Method of Least Squares<div style="text-align: center;"><img style="margin: 0px auto 10px; display: block; text-align: center;" src="http://1.bp.blogspot.com/_QYuWE-KipG0/R8N1tnM9OZI/AAAAAAAAAKw/pnqMBteIrEI/s320/LunarEclipse.JPG" alt="" id="BLOGGER_PHOTO_ID_5171106223629351314" border="0" /><span style="font-size:78%;">Photo courtesy of F. 
Espenak at <a href="http://www.mreclipse.com/">MrEclipse.com</a></span><br /></div><span style="font-weight: bold;">Assumptions</span><br />In order to use the Least Squares Method, we must make 4 fundamental assumptions about our data and the underlying relationship between the independent and dependent variables, x and y.<br /><br /><span style="font-weight: bold;">1. Linearity</span> - that the variables are truly related to each other in a linear relationship.<br /><span style="font-weight: bold;">2. Independence</span> - that the errors in the observations are independent from one another.<br /><span style="font-weight: bold;">3. Normality</span> - that the errors in the observations are distributed normally at each x-value. A larger error is less likely than a smaller error and the distribution of errors at any x follows the normal distribution.<br /><span style="font-weight: bold;">4. Equal variance </span>- that the distribution of errors at each x (which is normal as in #3 above) has the identical variance. Errors are not more widely distributed at different x-values.<br /><br />A useful mnemonic device for remembering these assumptions is the word<span style="font-weight: bold;"> LINE</span> - <span style="font-weight: bold;">L</span>inearity, <span style="font-weight: bold;">I</span>ndependence, <span style="font-weight: bold;">N</span>ormality, <span style="font-weight: bold;">E</span>qual variance.<br /><br />Note that the first assumption, linearity, refers to the true relationship between the variables. 
The other three assumptions refer to the nature of the errors in the observed values for the dependent variable.<br /><br />If these assumptions are not true, we need to use a different method to perform the linear regression.<img src="http://feeds.feedburner.com/~r/Gsb420/~4/r5bWyjlUGW0" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-75120451370521394442008-02-25T19:14:00.005-06:002008-02-26T19:00:10.211-06:00The Least-Squares Method<span style="font-weight: bold;">The Method of Least Squares</span><br />As described in the previous post, the least-squares method minimizes the sum of the squares of the error between the y-values estimated by the model and the observed y-values.<br /><br />In mathematical terms, we need to minimize the following:<br />∑ (y<sub>i</sub> - (β<sub>0</sub>+β<sub>1</sub>x<sub>i</sub>))<sup>2</sup><br /><br />All the y<sub>i</sub> and x<sub>i</sub> are known and constant, so this can be looked at as a function of β<sub>0</sub> and β<sub>1</sub>. We need to find the β<sub>0</sub> and β<sub>1</sub> that minimize the total sum.<br /><br />From calculus we remember that to minimize a function, we take the derivative of the function, set it to zero and solve. Since this is a function of two variables, we take two derivatives - the partial derivative with respect to β<sub>0</sub> and the partial derivative with respect to β<sub>1</sub>.<br /><br /><span style="font-weight: bold;">Don't worry! </span>We won't need to do any of this in practice - it's all been done years ago and the generalized solutions are well known.<br /><br />To find b<sub>0</sub> and b<sub>1</sub>:<br />1. Calculate xbar and ybar, the mean values for x and y.<br />2. Calculate the difference between each x and xbar. Call it xdiff.<br />3. Calculate the difference between each y and ybar. Call it ydiff.<br />4. b<sub>1</sub> = [∑(xdiff)(ydiff)] / ∑(xdiff<sup>2</sup>)<br />5. 
b<sub>0</sub> = ybar - b<sub>1</sub>xbar<br /><br />Notice that we switched from using β to using b? That's because β is used for the regression coefficients of the <span style="font-style: italic;">actual</span> linear relationship. b is used to represent our <span style="font-style: italic;">estimate</span> of the coefficients determined by the least squares method. We may or may not be correctly estimating β with our b. We can only hope!<img src="http://feeds.feedburner.com/~r/Gsb420/~4/B0xEgUYlaAk" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-13928674867197018732008-02-25T18:28:00.006-06:002012-04-26T15:27:36.499-05:00Lecture 7 - Simple Linear RegressionLinear Regression essentially means creating a linear model that describes the relationship between two variables.<br /><br />Our type of linear regression is often referred to as <span style="font-style: italic;">simple</span> linear regression. The <span style="font-style: italic;">simple</span> part of the linear regression refers to the fact that we don't consider other factors in the relationship - just the two variables. When we model how several variables may determine another variable, it's called <span style="font-style: italic;">multiple</span> regression - the topic for a more advanced course (or chapter 13 in our text).<br /><br />For example, we may think that the total sales at various stores is proportional to the number of square feet of space in the store. If we collect data from a number of stores and plot them in an XY scatter plot, we would probably find that the data points don't lie on a perfectly straight line. However, they may be "more or less" linear to the naked eye. Linear regression involves finding a single line that approximates the relationship. 
With this line, we can estimate the expected sales at a new store, given the number of square feet it will have.<br /><br />In mathematical terms, linear regression means finding values for β<sub>0</sub> and β<sub>1</sub> in the equation <br /><div style="text-align: center;">y = β<sub>0</sub> + β<sub>1</sub>x</div>such that the resulting equation fits the data points as closely as possible.<br /><br />The equation above may look more familiar to you in this form:<br /><div style="text-align: center;">y = mx + b</div>That's the form we learned in linear algebra. m is the slope and b is the y-intercept. Similarly, in our statistical form, <span style="font-weight: bold;">β<sub>1</sub> is the slope</span> and <span style="font-weight: bold;">β<sub>0</sub> is the y-intercept</span>.<br /><br /><span style="font-weight: bold;">How Close is Close?</span><br />We said that we want to find a line that fits the data "as closely as possible". How close is that? Well, for any given β<sub>0</sub> and β<sub>1</sub>, we can calculate how far off we are by looking at each x-value, calculating what the linear estimate would be according to our regression equation and comparing that to the actual observed y-value. The difference is the error between our regression estimate and the observation. Clearly, we want to find the line that minimizes the total error.<br /><br />Minimizing the total error is done in practice by minimizing the sum of the <span style="font-style: italic;">squares</span> of the errors. If we used the actual error term, and not the square, positive and negative errors would cancel each other out. 
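<br /><br />To see the cancellation concretely, here is a small Python sketch. The four data points and the two candidate lines are invented for illustration, and errors() is just a hypothetical helper name. The flat line is obviously a poor fit, yet its raw errors add up to zero; only the squared errors expose the difference.<br />

```python
def errors(xs, ys, b0, b1):
    """Observed y minus the y estimated by the line b0 + b1*x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Made-up observations that lie exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

flat = errors(xs, ys, 5.0, 0.0)   # poor candidate: horizontal line y = 5
exact = errors(xs, ys, 0.0, 2.0)  # good candidate: y = 2x

# Raw errors cancel for the flat line; squared errors do not.
print(sum(flat), sum(e ** 2 for e in flat))
print(sum(exact), sum(e ** 2 for e in exact))
```

Both candidates have a raw error total of zero, so that total cannot tell them apart, while the sum of squared errors is 20 for the flat line and 0 for the line that passes through every point.<br /><br />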
We don't use the absolute value of the error term because we will need to differentiate it and the absolute value function is not differentiable at 0.<br /><br />Generating <span style="font-weight: bold;">regression coefficients</span> β<sub>0</sub> and β<sub>1</sub> for the linear model by minimizing the sum of the squares of the errors is known as the <span style="font-weight: bold;">least-squares method</span>.<img src="http://feeds.feedburner.com/~r/Gsb420/~4/jufO75GgKtk" height="1" width="1" alt=""/>Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-67952157841445544142008-02-25T12:15:00.005-06:002008-11-13T00:06:34.900-06:00Homework<img style="margin: 0pt 0pt 10px 10px; float: right;" src="http://3.bp.blogspot.com/_QYuWE-KipG0/R8MJFHM9OYI/AAAAAAAAAKo/D-VAXRG_ryE/s200/Lioness.JPG" alt="" id="BLOGGER_PHOTO_ID_5170986780588849538" border="0" /><span style="font-weight: bold;">In case you were wondering...</span><br />According to an email I got from Prof Selcuk, there is no homework due this week. Homework #5 will be assigned this Thursday Feb 28 and due next Thursday March 6. It will be the last homework of the quarter.<br /><br />My comment: This will give us time to absorb the material on linear regression before working on homework exercises. 
Maybe even time to go to the zoo on Sunday instead of doing homework.Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-38994396540524783022008-02-24T18:14:00.019-06:002008-02-24T19:48:37.414-06:00Lecture 7 - Using Minitab to Calculate Hypothesis Testing Statistics<span style="font-weight: bold;">Using Minitab to Calculate Hypothesis Testing Statistics</span><br />Minitab can be used to perform some of the calculations that are required in steps 4 and 5 of the critical value approach and step 4 of the p-value approach to hypothesis testing (see previous 2 posts). You still need to do all the study design in steps 1-3 and use the results as input to Minitab. You will also need to draw your own conclusions from the calculations that Minitab performs.<br /><br /><span style="font-weight: bold;">Here's how:</span><br />1. Load your data into a Minitab worksheet. (In lecture 7, we used the data in the insurance.mtw worksheet from exercise 9.59.)<br />2. Select Stat - Basic Statistics from the menu bar. Since we're doing hypothesis testing of the mean, we have 2 choices from the menu: either "1-sample z" or "1-sample t". Since we don't know the standard deviation of the population, we choose the "1-sample t" test.<br />3. In the dialog box, select the column that has your sample data and click the select button so it appears in the "Samples in columns" box. In the test mean box, enter the historical value for the mean, which in our case is 45.<br />4. Click the Options button, enter the confidence level ((1-α)×100) and select a testing "alternative". The testing alternative is where you specify the testing condition of the alternative hypothesis.<br /><ul><li>If H<sub>1</sub> states that the mean is not equal to the historical value, select <span style="font-style: italic;">not equal</span>. 
Minitab will make calculations for a two-tail test.</li><li>If H<sub>1</sub> states that the mean is strictly less than or strictly greater than the historical value, select <span style="font-style: italic;">less than</span> or <span style="font-style: italic;">greater than</span>. In this case, Minitab will calculate values for a one-tail test.</li></ul>5. Click Ok in the Options dialog and Ok in the main dialog. Minitab displays the calculated values in the Session window. The results from our sample data looked like this:<br /><br /><span style="font-family:courier new;"><span style="font-weight: bold;">One-Sample T: Time<br />Test of mu = 45 vs not = 45<br /><br />Variable N Mean StDev SE Mean 95% CI T P<br />Time 27 43.8889 25.2835 4.8658 (33.8871, 53.8907) -0.23 0.821</span></span><br /><br />Unfortunately, Minitab doesn't take the hypothesis testing all the way to drawing a conclusion about the null hypothesis. We need to do that ourselves in one of two ways: either the critical value or p-value approach.<br /><br /><span style="font-weight: bold;">For the critical value approach</span>, we additionally need to look up the critical t-score, t<sub>0.025,26</sub> = ±2.056. 0.025 is α/2, which we use with this two-tail test. 26 is n-1, the degrees of freedom for this test. We compare t<sub>0.025,26</sub> to the t-score of the sample mean, which Minitab calculated for us as -0.23, and find that the t-score of the sample mean is between the critical values and therefore we do not reject H<sub>0</sub>.<br /><br /><span style="font-weight: bold;">For the p-value approach</span>, we compare the p-value that Minitab calculated, 0.821, to the level of significance, α, which in our case is 0.05 (matching the 95% confidence level and the α/2 = 0.025 used above). 
Since the p-value is larger than α, we do not reject H<sub>0</sub>.Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-79356354626863059682008-02-24T16:38:00.008-06:002008-02-24T17:50:21.407-06:00Hypothesis Testing - p-Value Approach - 5 Step Methodology<span style="font-weight: bold;">The p-Value Approach</span><br />The p-value approach to hypothesis testing is very similar to the critical value approach (see previous post). Rather than deciding whether or not to reject the null hypothesis based on whether the test statistic falls in a rejection region or not, the p-value approach allows us to make the decision based on whether the p-value of the sample data is greater or less than the level of significance, α.<br /><br />The p-value is the probability, assuming the null hypothesis is true, of getting a test statistic equal to or more extreme than the sample result. If the p-value is greater than the level of significance, then a result at least this extreme is not unusual enough under H<sub>0</sub>, and thus we do not reject H<sub>0</sub>.<br /><br />If, on the other hand, the p-value is less than the level of significance, we conclude that a result at least this extreme would be too unlikely if H<sub>0</sub> were true, and thus we reject H<sub>0</sub>.<br /><br />The <span style="font-weight: bold;">five step methodology</span> of the p-value approach to hypothesis testing is as follows:<br /><span style="font-weight: bold;font-size:85%;" >(Note</span><span style="font-size:85%;">: The first three steps are identical to the critical value approach described in the previous post. However, step 4, the calculation of the critical value, is omitted in this method. 
Differences in the final two steps between the critical value approach and the p-value approach are </span><span style="font-style: italic;font-size:85%;" >emphasized</span><span style="font-size:85%;">.</span><span style="font-weight: bold;"><span style="font-size:85%;">)</span><br /><br />State the Hypotheses</span><br />1. State the null hypothesis, H<sub>0</sub>, and the alternative hypothesis, H<sub>1</sub>.<br /><span style="font-weight: bold;">Design the Study</span><br />2. Choose the level of significance, α, according to the importance of the risk of committing Type I errors. Determine the sample size, n, based on the resources available to collect the data.<br />3. Determine the test statistic and sampling distribution. When the hypotheses involve the population mean, μ, the test statistic is z when σ is known and t when σ is not known. These test statistics follow the normal distribution and the t-distribution respectively.<br /><span style="font-weight: bold;">Conduct the Study</span><br />4. Collect the data and compute the test statistic <span style="font-style: italic;">and the p-value</span>.<br /><span style="font-weight: bold;">Draw Conclusions</span><br />5. Evaluate the <span style="font-style: italic;">p-value</span> and determine whether or not to reject the null hypothesis. Summarize the results and state a managerial conclusion in the context of the problem.<br /><br /><span style="font-weight: bold;">Example</span> <span style="font-size:85%;">(we'll look at the same example as the last post, also reviewed at the beginning of Lecture 7)</span>:<br />A phone industry manager thinks that customer monthly cell phone bills have increased and now average over $52 per month. The company asks you to test this claim. 
The population standard deviation, σ, is known to be equal to 10 from historical data.<br /><br /><span style="font-weight: bold;">The Hypotheses</span><br />1. H<sub>0</sub>: μ ≤ 52<br />H<sub>1</sub>: μ > 52<br /><span style="font-weight: bold;">Study Design</span><br />2. After consulting with the manager and discussing error risk, we choose a level of significance, α, of 0.10. Our resources allow us to sample 64 cell phone bills.<br />3. Since our hypothesis involves the population mean and we know the population standard deviation, our test statistic is z and follows the normal distribution.<br /><span style="font-weight: bold;">The Study</span><br />4. We conduct our study and find that the mean of the 64 sample cell phone bills is 53.1. We compute the test statistic, z = (xbar-μ)/(σ/√n) = (53.1-52)/(10/√64) = 0.88. Next, we look up the p-value for z = 0.88. The cumulative normal distribution table tells us that the area to the <span style="font-style: italic;">left</span> of 0.88 is 0.8106. Since this is a right-tail test, the p-value is the area to the right: 1-0.8106 = 0.1894.<br /><span style="font-weight: bold;">Conclusions</span><br />5. Since 0.1894 is greater than the level of significance, α, we do not reject the null hypothesis. 
We report to the company that, based on our testing, there is not sufficient evidence that the mean cell phone bill has increased above $52 per month.Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0tag:blogger.com,1999:blog-1542321886756338348.post-72940080781673523442008-02-24T15:44:00.004-06:002008-02-24T17:44:56.867-06:00Hypothesis Testing - Critical Value Approach - 6 Step MethodologyThe six-step methodology of the Critical Value Approach to hypothesis testing is as follows:<br /><span style="font-size:85%;">(<span style="font-weight: bold;">Note</span>: The methodology below works equally well for both one-tail and two-tail hypothesis testing.)</span><br /><br /><span style="font-weight: bold;">State the Hypotheses</span><br />1. State the null hypothesis, H<sub>0</sub>, and the alternative hypothesis, H<sub>1</sub>.<br /><span style="font-weight: bold;">Design the Study</span><br />2. Choose the level of significance, α, according to the importance of the risk of committing Type I errors. Determine the sample size, n, based on the resources available to collect the data.<br />3. Determine the test statistic and sampling distribution. When the hypotheses involve the population mean, μ, the test statistic is z when σ is known and t when σ is not known. These test statistics follow the normal distribution and the t-distribution respectively.<br />4. Determine the critical values that divide the rejection and non-rejection regions.<br />Note: For ethical reasons, the level of significance and critical values should be determined prior to conducting the test. The test should be designed so that the predetermined values do not influence the test results.<br /><span style="font-weight: bold;">Conduct the Study</span><br />5. Collect the data and compute the test statistic.<br /><span style="font-weight: bold;">Draw Conclusions</span><br />6. 
Evaluate the test statistic and determine whether or not to reject the null hypothesis. Summarize the results and state a managerial conclusion in the context of the problem.<br /><br /><span style="font-weight: bold;">Example</span> (reviewed at the beginning of Lecture 7):<br />A phone industry manager thinks that customer monthly cell phone bills have increased and now average over $52 per month. The company asks you to test this claim. The population standard deviation, σ, is known to be equal to 10 from historical data.<br /><br /><span style="font-weight: bold;">The Hypotheses</span><br />1. H<sub>0</sub>: μ ≤ 52<br />H<sub>1</sub>: μ > 52<br /><span style="font-weight: bold;">Study Design</span><br />2. After consulting with the manager and discussing error risk, we choose a level of significance, α, of 0.10. Our resources allow us to sample 64 cell phone bills.<br />3. Since our hypothesis involves the population mean and we know the population standard deviation, our test statistic is z and follows the normal distribution.<br />4. In determining the critical value, we first recognize this test as a one-tail test since the null hypothesis involves an inequality, ≤. Therefore the rejection region is entirely on the side of the distribution greater than the historic mean, that is, in the right tail.<br />We want to determine a z-value for which the area to the right of that value is 0.10, our α. We can use the cumulative normal distribution table (which gives areas to the <span style="font-style: italic;">left</span> of the z-value) and find the z-value whose cumulative area is 0.90, which is approximately 1.285. This is our critical value.<br /><span style="font-weight: bold;">The Study</span><br />5. We conduct our study and find that the mean of the 64 sample cell phone bills is 53.1. We compute the test statistic, z = (xbar-μ)/(σ/√n) = (53.1-52)/(10/√64) = 0.88.<br /><span style="font-weight: bold;">Conclusions</span><br />6. 
Since 0.88 is less than the critical value of 1.285, we do not reject the null hypothesis. We report to the company that, based on our testing, there is not sufficient evidence that the mean cell phone bill has increased above $52 per month.Eliezerhttp://www.blogger.com/profile/14036531848996147890noreply@blogger.com0
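For readers who would rather check the cell phone bill example with code than with a printed table, here is a minimal sketch in Python. It is not part of the course material: the function name one_tail_z_test is made up for illustration, and it assumes a right-tail z test with a known population standard deviation, as in the example. It uses only the standard library's statistics.NormalDist. Note that the computed critical value is about 1.282, slightly different from the 1.285 read off the table above, because the table only lists z to two decimal places.

```python
from math import sqrt
from statistics import NormalDist


def one_tail_z_test(sample_mean, mu0, sigma, n, alpha):
    """Right-tail z test of H0: mu <= mu0 against H1: mu > mu0.

    Hypothetical helper for illustration; assumes sigma is known.
    """
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Test statistic: z = (xbar - mu) / (sigma / sqrt(n))
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Critical value: the z with area alpha to its right (area 1 - alpha to its left)
    critical_value = std_normal.inv_cdf(1 - alpha)
    # p-value: area to the right of the observed z
    p_value = 1 - std_normal.cdf(z)
    # Both approaches lead to the same decision
    reject = z > critical_value
    return z, critical_value, p_value, reject


# Numbers from the example: xbar = 53.1, mu = 52, sigma = 10, n = 64, alpha = 0.10
z, crit, p, reject = one_tail_z_test(53.1, 52, 10, 64, 0.10)
print(f"z = {z:.2f}, critical value = {crit:.3f}, p-value = {p:.4f}, reject H0: {reject}")
```

With the example's numbers, z is below the critical value and the p-value is above α, so both approaches agree that we do not reject the null hypothesis.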