GSB420 - Business Statistics

The New GSB420 Blog

2011-11-06T11:58:00.019-06:00

After several months, even years, of inactivity, I've decided to revive and revise this blog and restructure it in a logical sequence, rather than continue with the reverse chronological order which is the standard format for blogs. My goal is to make it easier for readers to find the material that they're looking for.

Therefore, this will be the top-most entry in the blog and it will contain a table of contents with links to the articles in the logical order in which you will probably want to learn them - starting from basic principles and working towards more complex concepts. It will take me some time to complete this restructuring, so please be patient.

I've also decided to update the individual entries in the blog and remove references that are specific to the class when I took it in 2008. Those comments are no longer relevant.

I hope you enjoy this blog and benefit from it. If you have any questions, please feel free to email me at eliezerappleton at gmail dot com.

Lecture 1
Basic Definitions
Presentational Statistics
Descriptive Statistics
Quiz #1
Lecture 1 - Additional Notes and Research

Lecture 2
Using Standard Deviation
Shape of the Distribution
Correlating Two Sets of Data
More on Covariance and Correlation Coefficient
Sample vs. Population
Basic Probability
Conditional Probability
Independency
Pop quiz #2
Post-lecture 2 notes and research

Lecture 3
Bayes's Theorem
Counting Rules
Discrete Random Variables
Binomial Distribution
Poisson Distribution

Lecture 4
Continuous Random Variables
The Normal Distribution
The Standard Normal Distribution

Lecture 5
The Standard Normal Distribution (continued)
Checking for Normality
Sampling Distributions
Confidence Interval Estimation

Lecture 6
Confidence Interval for the Mean with Known Std Dev
Confidence Interval for the Mean with Unknown Std Dev
Confidence Interval for the Mean - Examples and Minitab
Determining Sample Size
Hypothesis Testing
One-Tail Hypothesis Testing

Lecture 7 - Linear Regression
Simple Linear Regression
The Least Squares Method
Assumptions in the Method of Least Squares
Coefficient of Determination

Lecture 8 - Residual Analysis
Definition of Residual Analysis
Checking Linearity
Checking the Normality Assumption
Checking the Equal Variance Assumption
Checking Independence of Errors
Inferences About the Regression Slope - Part 1
Inferences About the Regression Slope - Part 2
Confidence Interval for Ŷ

ECO 509 - Spring Quarter 2008

2008-03-28T16:36:00.002-05:00

Next quarter (Spring 2008) I'll be taking ECO 509 - Business Conditions Analysis (aka Macroeconomics) with Professor Jaejoon Woo (Wed night section).

You can find the blog for ECO 509 at http://eco509.blogspot.com.

Final Exam Recap

2008-03-14T08:50:00.002-05:00

Well, it's finally over! Here's a recap of my random thoughts on the final exam:

In general, the final was harder than I expected. I'm pretty sure I did well, but it was definitely harder than the midterm and harder than I thought it would be.
He hit us with 10 straight questions from Chapter 12 right out of the block! I expected to ease into it. The previous quarter's final was pretty linear - starting at chapter 7, then chapter 8, etc and not hitting chapter 12 until the last questions. I knew chapter 12 would be a big chunk of the exam, but I didn't expect him to lead off with it.
I think we were all pretty stumped by that one question (was it 9 or 10?) that had us calculate the intercept, b₀. I kept coming up with 100 for an answer, but it wasn't one of the choices. I saw Azzam go up and ask about something and figured it might be that. So I went up and asked also. I think we all breathed a sigh of relief when he made the change to the last two choices. BTW, I think he could have just changed Σ Y to be 100 and that would have made the intercept 40, which was one of the choices.
Probability of Type I errors? Sheesh! I didn't see that coming. The answer is that it's alpha, α.
I was surprised at the "regular" question that had us work out the regression from the raw data. I'm almost positive I made some arithmetic mistakes in calculating the Σ(X_i-Xbar)(Y_i-Ybar) or Σ(X_i-Xbar)² or one of the other calculations.
My answer for the "west nile" question was that we do not reject the null hypothesis, H₀, so there isn't enough evidence that the average # of cases is different than 3.
On the last question, part A, you had to assume (or somehow know through ESP or divine vision) that the level of confidence to use is 95%. You can calculate the t-score easily enough (I think I got something like 2.414), but to draw any conclusions, you need the critical value t_{α/2, n-1}, which requires the level of confidence.
I knew there would be a Durbin-Watson question on the exam! My answer for that one was that there was not evidence of autocorrelation since the DW stat was greater than d_U and less than 2+d_L. I'm not sure I was using the right d_U because I wasn't sure if I should use α or α/2 on the DW table. I used α (which I think was 0.05).
As predicted, there were a few questions with Minitab output. No big surprises there.
In at least 2 of the questions (one multiple choice and one "regular"), he gave us the variance rather than the standard deviation. Tricky! I almost fell for that one.
One question asked us to determine sample size, given a confidence level, margin of error and standard deviation (or maybe variance). Using the formula, you calculate n=74.3 (something like that - not a round number). You had to know to round up, not truncate the decimal.
Higher confidence levels need wider intervals. You had to figure he would ask about that.
If p is low, H₀ must go! You could use that to answer one of the multiple choice questions on the p-value approach in hypothesis testing.

All in all, not a terrible test. Just harder than last quarter's final, IMHO. I'm anxious to see how I did. I think he said he may have them graded by Monday. The multiple choice is easy to grade. I think he gives the "regular" questions to a Teaching Assistant. HW5 grades have not been posted to Blackboard yet.
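
To make the round-up rule for sample size concrete, here's a minimal Python sketch. I don't remember the actual numbers from the exam, so the inputs below (σ = 22, margin of error = 5, 95% confidence) are made up to land near n = 74.3, and the function name is my own.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, margin_of_error, confidence):
    """Return the exact formula value and the rounded-up sample size."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical z value
    n_exact = (z * sigma / margin_of_error) ** 2
    # Always round UP: truncating would leave the margin of error too large
    return n_exact, math.ceil(n_exact)

# Hypothetical inputs, not the exam's actual numbers
n_exact, n_needed = required_sample_size(sigma=22, margin_of_error=5, confidence=0.95)
print(f"formula gives n = {n_exact:.1f}, so we need n = {n_needed}")
```

Truncating to 74 would give a margin of error slightly larger than the one requested, which is why the answer is 75.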

Final Exam Study Guide - Last Minute Notes

2008-03-13T13:41:00.003-05:00

Just a couple of last minute thoughts:

There were no example problems that used the Durbin-Watson statistic. That doesn't mean it won't be on the exam! The D-W table is part of the formula sheet, so I'm expecting a question on it.
Remember that if DW is less than d_L, there's (positive) autocorrelation. If it's between d_L and d_U, it's inconclusive. If it's between d_U and 4-d_U, there's no evidence of autocorrelation. I doubt we'll be asked about the upper end of the range (values close to 4, which point to negative autocorrelation).
Review how to read those Minitab outputs! There's bound to be at least one on the exam. Remember that in Minitab output, SS stands for Sum of Squares. S stands for S_YX. The Coefficient of the Intercept is b₀. The Coefficient of the other (independent) variable is b₁. SE stands for standard error.
There weren't any practice problems on the confidence interval for mean/individual Y. I would expect one of those since the formulas are on the sheet. He'll probably give us h_i and S_YX. Remember to use n-2 when looking up the value for t in this case.
Remember that most of the answers can be derived from the data in the question and the formula sheet. You're not really expected to memorize very much. If you can't figure it out, look at the formula sheet.
Remember to bring a copy of the formula sheet, a calculator and a #2 pencil. Don't laugh! I forgot a pencil for the midterm and ran out to Walgreen's a half hour before the exam.
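
Since there were no worked Durbin-Watson examples, here's a small Python sketch of the decision rule in its standard textbook form (symmetric around 2, on a scale from 0 to 4). The function name is my own, and the d_L and d_U values in the example are only illustrative stand-ins for whatever you read off the D-W table for your n and α.

```python
def durbin_watson_decision(dw, d_lower, d_upper):
    """Classify a Durbin-Watson statistic (0 <= dw <= 4) using table bounds."""
    if dw < d_lower:
        return "positive autocorrelation"
    if dw <= d_upper:
        return "inconclusive"
    if dw < 4 - d_upper:
        return "no evidence of autocorrelation"
    if dw <= 4 - d_lower:
        return "inconclusive"
    return "negative autocorrelation"

# Illustrative bounds only; look up the real ones in the D-W table
d_lower, d_upper = 1.20, 1.41
for dw in (0.9, 1.3, 2.0, 3.5):
    print(dw, "->", durbin_watson_decision(dw, d_lower, d_upper))
```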

Final Exam Study Guide - Practice Questions - Part 2

2008-03-12T17:02:00.010-05:00

In this post, I'll go over the answers to the "regular" questions from the last quarter's final. I'll also note which chapter the question is from.

Question 1 (Chapter 12): You would like to estimate the income of a person based on his age. The following data shows the yearly income (in $1,000) and age of a sample of seven individuals.
Income (in $1,000) Age
20                 18
24                 20
24                 23
25                 34
26                 24
27                 27
34                 27
a. Develop the least squares regression equation.
b. Estimate the yearly income of a 30-year-old individual.

Answer:
a. In order to calculate b₀ and b₁, we need to first calculate the mean of X (age) and Y (income). For xbar, I calculated 24.71 and for ybar, I got 25.71. To calculate b₁, we need to calculate x_i-xbar and y_i-ybar for each i:

Income  Age  x_i-xbar  y_i-ybar  (x_i-xbar)(y_i-ybar)  (x_i-xbar)²
20      18   -6.71     -5.71     38.31                 45.02
24      20   -4.71     -1.71      8.05                 22.18
24      23   -1.71     -1.71      2.92                  2.92
25      34    9.29     -0.71     -6.60                 86.30
26      24   -0.71      0.29     -0.21                  0.50
27      27    2.29      1.29      2.95                  5.24
34      27    2.29      8.29     18.98                  5.24

The sum of the (x_i-xbar)(y_i-ybar) is 64.4. The sum of the (x_i-xbar)² is 167.4. Therefore, b₁ is 64.4/167.4 = 0.38.
We can also calculate b₀ = ybar - b₁xbar = 25.71 - (0.38)(24.71) = 16.2.
Therefore, the regression equation is y = 16.2 + 0.38x.

b. Use the equation to estimate y for x=30:
y = 16.2 + 0.38(30) = 27.6, which is $27,600 annual income.
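
If you'd rather not do the arithmetic by hand, here's a minimal Python sketch of the same least squares calculation. The variable and function names are my own. It carries full precision rather than rounding b₁ to 0.38, so the prediction comes out slightly higher than the hand calculation (about 27.7 versus 27.6).

```python
# Data from Question 1: age (X) and income in $1,000 (Y)
ages = [18, 20, 23, 34, 24, 27, 27]
incomes = [20, 24, 24, 25, 26, 27, 34]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(incomes) / n

# The two sums used in the slope formula
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, incomes))
ss_x = sum((x - x_bar) ** 2 for x in ages)

b1 = ss_xy / ss_x          # slope
b0 = y_bar - b1 * x_bar    # intercept

def predict_income(age):
    """Predicted income in $1,000 for a given age."""
    return b0 + b1 * age

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"Predicted income at age 30: {predict_income(30):.2f} (in $1,000)")
```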

Question 2 (Chapter 12): Below you are given a partial computer output based on a sample of 8 observations, relating an independent variable (x) and a dependent variable (y).
              Coefficient Standard Error
Intercept     13.251      10.77
X             0.803       0.385

Analysis of Variance
SOURCE            SS
Regression
Error (Residual)  41.674
Total             71.875
a. Develop the estimated regression line.
b. At α = 0.05, test for the significance of the slope.
c. Determine the coefficient of determination (R²).

Answer:
a. This one's a lot easier than #1. No calculations necessary, just the ability to pull b₀ and b₁ out of the computer output. They're the coefficients of the intercept and X. So the regression equation becomes:
y = 13.251 + 0.803x

b. The t score for the slope is t = b₁/s_b₁.
From part a, we know that b₁ = 0.803.
s_b₁ is given in the computer output as the standard error of x = 0.385.
Therefore, t = 0.803/0.385 = 2.086.
Looking at the t distribution table for n-2=6 and α/2=0.025, we find a critical t value of 2.447. Since the t score of 2.086 is less than 2.447, we do not reject the null hypothesis that there is no linear relationship.

c. r² = SSR/SST. But SSR was conveniently removed from the computer output. We need to calculate it from SSR = SST-SSE = 71.875-41.674 = 30.201.
Therefore, r² = 30.201/71.875 = 0.42.
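
The three parts of this answer boil down to a few lines of arithmetic. Here's a minimal Python sketch, using only the numbers from the computer output and the critical value from the t table. The variable names are my own.

```python
# Values read from the computer output in Question 2
b0 = 13.251
b1 = 0.803
se_b1 = 0.385
sse = 41.674
sst = 71.875
t_critical = 2.447  # from the t table: 6 degrees of freedom, alpha/2 = 0.025

# Part b: t score for the slope, compared against the critical value
t_stat = b1 / se_b1
reject_h0 = abs(t_stat) > t_critical

# Part c: recover the missing SSR, then the coefficient of determination
ssr = sst - sse
r_squared = ssr / sst

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
print(f"SSR = {ssr:.3f}, r-squared = {r_squared:.2f}")
```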

Question 3 (Chapter 9): A sample of 81 account balances of a credit company showed an average balance of $1,200 with a standard deviation of $126.
a. Formulate the hypotheses that can be used to determine whether the mean of all account balances is significantly different from $1,150.
b. Let α = .05. Using the critical value approach what is your conclusion?

Answer:
a. Since we want to know if the mean is "significantly different" from $1,150, the null hypothesis is that it is $1,150.
H₀: μ = 1150
H₁: μ ≠ 1150

b. Since we don't have the population standard deviation, use the t test statistic.
t = (xbar-μ₀)/(s/√n)
= (1200-1150)/(126/√81)
= 50/14
= 3.57
The critical value for t for 80 degrees of freedom and α/2=0.025 is 1.990.
Since the t-value=3.57 is greater than the critical value of 1.990, we reject H₀ and conclude that the mean is significantly different from $1,150.
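
Here's the same test as a short Python sketch. The function name is my own, and the critical value is simply the one read from the t table above rather than something the code computes.

```python
import math

def t_statistic(x_bar, mu_0, s, n):
    """t test statistic for a sample mean when sigma is unknown."""
    return (x_bar - mu_0) / (s / math.sqrt(n))

t_stat = t_statistic(x_bar=1200, mu_0=1150, s=126, n=81)
t_critical = 1.990  # from the t table: 80 degrees of freedom, alpha/2 = 0.025

# Two-tailed test: reject H0 if the statistic is beyond either critical value
reject_h0 = abs(t_stat) > t_critical
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```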

Question 4 (Chapter 8): A statistician selected a sample of 16 accounts receivable and determined the mean of the sample to be $5,000 with a sample standard deviation of $400. He reported that the sample information indicated the mean of the population ranges from $4,739.80 to $5,260.20. He neglected to report what confidence level (1-α) he had used. Based on the above information, determine the confidence level that was used.

Answer: The statistician is reporting a confidence interval of 5000 ± 260.20. He only mentions the sample standard deviation (not the population std dev), so he must be using the t-distribution and the formula: xbar ± t_{n-1, α/2}(s/√n).

So we have:
260.2 = t(s/√n)
260.2 = t (400/√16)
260.2 = 100t
t = 2.602

We look to the t distribution table and find that t_{15, α/2} = 2.602 is true for α/2 = 0.01. So α = 0.02 and the confidence level is 1-0.02 = 0.98 = 98%.
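
Working backwards like this is easy to script. The sketch below is just one way to lay it out: it solves for t and then matches it against a few entries from the 15 degrees of freedom row of a standard t table. The variable names are my own.

```python
import math

n, s = 16, 400
lower, upper = 4739.80, 5260.20

margin = (upper - lower) / 2            # the plus/minus term, 260.20
t_value = margin / (s / math.sqrt(n))   # solve margin = t * s / sqrt(n) for t

# Upper-tail areas and critical values from the t table, 15 degrees of freedom
t_table_df15 = {0.05: 1.753, 0.025: 2.131, 0.01: 2.602, 0.005: 2.947}

# Pick the tail area whose critical value is closest to our t
tail_area = min(t_table_df15, key=lambda a: abs(t_table_df15[a] - t_value))
confidence = 1 - 2 * tail_area
print(f"t = {t_value:.3f}, confidence level = {confidence:.0%}")
```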

Question 5 (Chapter 12): The director of graduate studies at a college of business would like to predict the grade point index (GPI) of students in an MBA program based on their GMAT scores. A sample of 20 students is selected. The result of the regression is summarized in the following Minitab output.
Regression Analysis: GPI versus GMAT

The regression equation is
GPI = 0.300 + 0.00487 GMAT

Predictor         Coef         SE Coef         T
Constant        0.3003          0.3616      0.83
GMAT         0.0048702           [ N ]     [ M ]

S = 0.155870 R-Sq = 79.8%

Analysis of Variance

Source             DF         SS         MS         F         P
Regression          1     1.7257     1.7257     71.03     0.000
Residual Error     18     0.4373     0.0243
Total              19     2.1631
a) Given that Σ(X_i-xbar)² = 72757.2 , where X = GMAT, compute N.
b) Compute M and interpret the result. In particular do we reject the underlying hypothesis (which hypothesis) or not?

Answer:
a. N is what we usually call the standard error of the slope, s_b₁. (This is the hardest part of the problem - figuring out what's missing in the Minitab output.) From the formula sheet, we know:
s_b₁ = S_YX/√SSX

We're given SSX, but we need to calculate S_YX from the formula:
S_YX = √(SSE/(n-2)).

We have SSE from the output: SSE = 0.4373. So,
S_YX = √(0.4373/18) = 0.156
(This matches the S = 0.155870 line in the Minitab output.)

Therefore,
s_b₁ = 0.156/√72757.2 = 0.156/269.7 = 0.00058

b. M is the t-score for the slope which is given by:
t = b₁/s_b₁
= 0.0048702/0.00058
= 8.4

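The question doesn't give a level of significance, so I used α = 0.01. The critical value for t for 18 degrees of freedom and α/2=0.005 is 2.878. Therefore, since our t-score is greater than the critical t-value, we would reject the null hypothesis, H₀: β₁ = 0, and conclude that there is a significant linear relationship between GMAT and GPI.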
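
Here's a minimal Python sketch that fills in the two blanks in the Minitab output. The variable names are my own. As a sanity check, in simple linear regression the square of the slope's t score should equal the F value in the ANOVA table (71.03).

```python
import math

# Values from the Minitab output and the question
n = 20
sse = 0.4373        # Residual Error SS
ssx = 72757.2       # sum of (X_i - xbar)^2, given in part a
b1 = 0.0048702      # GMAT coefficient

std_error_estimate = math.sqrt(sse / (n - 2))   # the S line in the output
s_b1 = std_error_estimate / math.sqrt(ssx)      # N: standard error of the slope
t_stat = b1 / s_b1                              # M: t score for the slope

print(f"S = {std_error_estimate:.4f}")
print(f"N = {s_b1:.5f}, M = {t_stat:.2f}, M squared = {t_stat ** 2:.2f}")
```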

Final Exam Study Guide - Practice Questions

2008-03-10T12:44:00.021-05:00

Question 1: A population has a standard deviation of 16. If a sample of size 64 is selected from this population, what is the probability that the sample mean will be within ±2 of the population mean?
a. 0.6826
b. 0.3413
c. -0.6826
d. Since the mean is not given, there is no answer to this question.

Answer:
We need to calculate the z-score for the ±2 interval. In order to do that, we need the standard error of the mean, σ/√n = 16/sqrt(64) = 2.
So when we're asked for the probability that the sample mean is ±2 from the population mean, it's asking for the probability of the mean being within 1 standard error. Even without looking it up in the table, we know that the answer must be A - both from our experience that 68% of the data fall within 1 std dev, and because the other answers are unreasonable.
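
If you want to check this without the table, Python's standard library has a normal distribution class. This is just a sketch with my own variable names. It gives 0.6827 rather than 0.6826 because the printed table rounds its entries.

```python
from math import sqrt
from statistics import NormalDist

sigma, n, tolerance = 16, 64, 2

std_error = sigma / sqrt(n)   # standard error of the mean
z = tolerance / std_error     # how many standard errors wide the +/-2 band is

# Area under the standard normal curve between -z and +z
prob_within = NormalDist().cdf(z) - NormalDist().cdf(-z)
print(f"standard error = {std_error}, z = {z}, P = {prob_within:.4f}")
```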

Question 2: The fact that the sampling distribution of sample means can be approximated by a normal probability distribution whenever the sample size is large is based on the
a. central limit theorem
b. fact that we have tables of areas for the normal distribution
c. assumption that the population has a normal distribution
d. None of these alternatives is correct.

Answer: There's not much to say here. The statement is essentially the definition of the Central Limit Theorem, see page 213. As a rule of thumb, a sample size of about 30 or more is enough for this to hold for most population distributions. Answer A is correct.

Question 3: A population has a mean of 53 and a standard deviation of 21. A sample of 49 observations will be taken. The probability that the sample mean will be greater than 57.95 is
a. 0
b. .0495
c. .4505
d. .9505

Answer: Find the z-score of this sample mean: (57.95-53)/(21/sqrt(49)) = 4.95/3 = 1.65. So the question becomes: What's the probability of a sample mean falling more than 1.65 standard errors above the population mean? You know it can't be much, but it's greater than 0. Answer B is the only logical one. Of course, when we go to the cumulative normal distribution table, we find that 1.65 has 0.9505 area, so the area to the right of 1.65 is 0.0495.
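
Another way to look at it, sketched in Python below, is to model the sampling distribution of the sample mean directly as a normal distribution with mean μ and standard deviation σ/√n, and ask for the area above 57.95. The variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 53, 21, 49
std_error = sigma / sqrt(n)   # 21 / 7 = 3

# Sampling distribution of the sample mean
sampling_dist = NormalDist(mu, std_error)

# Upper-tail area: 1 minus the cumulative area up to 57.95
p_greater = 1 - sampling_dist.cdf(57.95)
print(f"P(xbar > 57.95) = {p_greater:.4f}")
```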

Question 4: Suppose a sample of n = 50 items is drawn from a population of manufactured products and the weight, X, of each item is recorded. Prior experience has shown that the weight has a probability distribution with mu = 6 ounces and sigma = 2.5 ounces. Which of the following is true about the sampling distribution of the sample mean if a sample of size 50 is selected?
a) The mean of the sampling distribution is 6 ounces.
b) The standard deviation of the sampling distribution is 2.5 ounces.
c) The shape of the sample distribution is approximately normal.
d) All of the above are correct.

Answer:
A is true. Although the mean of any single sample is not necessarily equal to the population mean, the mean of the sampling distribution (of all possible samples of size 50) is equal to the population mean, 6 ounces.
B is not true. The standard deviation of the sampling distribution is the standard error of the mean, which is smaller than the population standard deviation by a factor of 1/√n: 2.5/√50 ≈ 0.35 ounces.
C is not true. The central limit theorem tells us that when the sample size is ≥30, the sampling distribution of the sample mean is approximately normal. However, the shape of the sample distribution itself (the 50 individual weights) is not necessarily normal.
D is clearly not true since B and C are not true. So answer A is correct.

Question 5: The owner of a fish market has an assistant who has determined that the weights of catfish are normally distributed, with mean of 3.2 pounds and standard deviation of 0.8 pound. If a sample of 25 fish yields a mean of 3.6 pounds, what is the Z-score for this observation?
a) 18.750
b) 2.500
c) 1.875
d) 0.750

Answer:
When evaluating the sample mean,
z = (xbar-μ)/(σ/√n) Note: This formula is not on the sheet.
= (3.6-3.2)/(0.8/√25)
= 0.4/0.16
= 2.5
So, answer B is correct.

Question 6: A 95% confidence interval for a population mean is determined to be 100 to 120. If the confidence coefficient is reduced to 0.90, the interval for mu
a. becomes narrower
b. becomes wider
c. does not change
d. becomes 0.1

Answer: No calculations are necessary here. It's completely conceptual. The general rule is: A higher level of confidence requires a wider confidence interval. Therefore, if we reduce the level of confidence to 90%, the confidence interval can be narrower. Answer A is the correct answer.
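
Even though no calculation is needed, it can help to see actual numbers. The sketch below assumes the original interval was a z-based interval (the question doesn't say), backs out the standard error from the 95% margin, and rebuilds the interval at 90%. The variable names are my own.

```python
from statistics import NormalDist

lower_95, upper_95 = 100, 120
center = (lower_95 + upper_95) / 2
margin_95 = (upper_95 - lower_95) / 2

# Assumption: z-based interval, so margin = z * standard error
z_95 = NormalDist().inv_cdf(0.975)
z_90 = NormalDist().inv_cdf(0.95)
std_error = margin_95 / z_95

margin_90 = z_90 * std_error
print(f"95% interval: {lower_95} to {upper_95} (margin {margin_95})")
print(f"90% interval: {center - margin_90:.1f} to {center + margin_90:.1f} (margin {margin_90:.2f})")
```

Under that assumption the 90% interval runs from roughly 101.6 to 118.4, which is narrower than 100 to 120.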

Exhibit 8-3
The manager of a grocery store has taken a random sample of 100 customers. The average length of time it took these 100 customers to check out was 3.0 minutes. It is known that the standard deviation of the population of checkout times is 1 minute.

Question 7: Refer to Exhibit 8-3. The standard error of the mean equals
a. 0.001
b. 0.010
c. 0.100
d. 1.000

Answer: The standard error of the mean is:
σ/√n = 1/√100 = 1/10 = 0.1
The correct answer is C.

Question 8: Refer to Exhibit 8-3. With a .95 probability, the sample mean will provide a margin of error of
a. 1.96
b. 0.10
c. 0.196
d. 1.64

Answer: The margin of error is the plus/minus term in the confidence interval. In this case, since we know the population standard deviation, the margin of error term is:
z_α/2(σ/√n)
From the z-table, we find that z_0.025 = 1.96
Therefore,
margin of error, E = 1.96(1/√100) = 0.196
Answer C is correct.
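
Questions 7 and 8 can be checked together with a few lines of Python. This is only a sketch with my own variable names. It computes the critical z value instead of reading 1.96 from the table, so the last digits differ slightly.

```python
from math import sqrt
from statistics import NormalDist

# Exhibit 8-3
n, x_bar, sigma = 100, 3.0, 1.0
confidence = 0.95

std_error = sigma / sqrt(n)                          # Question 7
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96
margin = z * std_error                               # Question 8

print(f"standard error = {std_error}")
print(f"margin of error = {margin:.3f}")
print(f"interval: {x_bar - margin:.3f} to {x_bar + margin:.3f}")
```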

Question 12: When the following hypotheses are being tested at a level of significance of α
H₀: μ ≥ 100 H_a: μ < 100
the null hypothesis will be rejected if the p-value is
a. < α
b. > α
c. > α/2
d. < α/2

Answer: First, we notice that this is a one-tailed hypothesis test. The rejection region is entirely to one side of the mean.
Our general rule is If p is low, H₀ must go. So, if p is less than α, we reject the null hypothesis. Answer A is correct.

Question 13: In order to test the following hypotheses at an α level of significance
H₀: μ ≤ 100 H_a: μ > 100
the null hypothesis will be rejected if the test statistic Z is
a. > Z_α
b. < Z_α
c. < -Z_α
d. > Z_α/2

Answer: We've got a one-tailed hypothesis again. This time, the rejection region is in the right-hand tail. Therefore, we reject H₀ if the test statistic is more extreme (i.e. further to the right) than the Z_α. So answer A is correct.

Question 14: Your investment executive claims that the average yearly rate of return on the stocks she recommends is more than 10.0%. She takes a sample to prove her claim. The correct set of hypotheses is
a. H₀: μ = 10.0% H_a: μ ≠ 10.0%
b. H₀: μ ≤ 10.0% H_a: μ > 10.0%
c. H₀: μ ≥ 10.0% H_a: μ < 10.0%

Answer: I don't really like this question because it sounds like she's making a claim based on a status quo of the return rate being > 10%. Since the null hypothesis is about the status quo, I'm tempted to pick answer C. Unfortunately, that's not the right way to look at it in this case.

Rather, since her claim is that the return is greater than 10%, which does not contain an equal sign, that must be the alternative hypothesis, H_a. Therefore, the null hypothesis, H₀, is μ ≤ 10%. Answer B is correct.

Question 15: A soft drink filling machine, when in perfect adjustment, fills the bottles with 12 ounces of soft drink. Any over filling or under filling results in the shutdown and readjustment of the machine. To determine whether or not the machine is properly adjusted, the correct set of hypotheses is
a. H₀: μ > 12 H_a: μ ≤ 12
b. H₀: μ ≤ 12 H_a: μ > 12
c. H₀: μ = 12 H_a: μ ≠ 12

Answer: This one's a gimme. The null hypothesis H₀ is that the machine is continuing to work properly and μ = 12. The alternative hypothesis, H_a is that it is filling with some other mean volume and μ ≠ 12. Correct answer is C.

Question 16: A two-tailed test is performed at 95% confidence. The p-value is determined to be 0.11.
The null hypothesis
a. must be rejected
b. should not be rejected
c. could be rejected, depending on the sample size
d. has been designed incorrectly

Answer: Since the level of significance is 5%, the combined area of the two-tailed rejection region is 0.05. I.e., 0.025 in either tail. The p-value is 0.11. We remember our mantra: If p is low, H₀ must go! But p is not lower than 0.05. Therefore, we do not reject H₀ and answer B is correct.

Question 17: For a one-tailed hypothesis test (upper tail) the p-value is computed to be 0.034. If the test is being conducted at 95% confidence, the null hypothesis
a. could be rejected or not rejected depending on the sample size
b. could be rejected or not rejected depending on the value of the mean of the sample
c. is not rejected
d. is rejected

Answer: Level of significance is 5% = 0.05. p is 0.034. Repeat after me: If p is low, H₀ must go! In this case, yes, p is lower than the level of significance and therefore H₀ is rejected. Answer D is correct.

Note: If this had been a two-tailed test, then the 0.05 rejection region would have been split between the two tails, each having 0.025. In that case, it's not clear whether p = 0.034 is lower than 0.025 unless we know whether p was calculated on one side (as we did in class) or on both sides (as is done in the textbook). I asked Prof. Selcuk about this in an email and he replied that he would avoid such ambiguous cases on the final exam.

Exhibit 9-1
n = 36
xbar = 24.6
S = 12
H₀: μ ≤ 20
H_a: μ > 20

Question 18: Refer to Exhibit 9-1. The test statistic (t-score of xbar) is
a. 2.3
b. 0.38
c. -2.3
d. -0.38

Answer: The formula (on the formula sheet) for the t test statistic is:
t = (xbar - μ₀)/(s/√n)
= (24.6-20)/(12/√36)
= 4.6/2 = 2.3
A is the correct answer.

Question 19: Refer to Exhibit 9-1. If the test is done at 95% confidence, the null hypothesis should
a. not be rejected
b. be rejected
c. Not enough information is given to answer this question.
d. None of these alternatives is correct.

Answer: Exhibit 9-1 gives H_a: μ > 20, so this is a one-tail test, i.e. the entire rejection region is in the upper tail. Refer to the t distribution table and look up the t value for 35 degrees of freedom and a 0.05 area in the tail. We find that t value to be approximately 1.69. Our t test statistic is 2.3 which is greater than 1.69, indicating that we should reject the null hypothesis, H₀.

Just to be sure, let's see what would happen if it were a two-tail test, so the rejection region is only 0.025 on each side. Referring to the t distribution table again, we find the t value for 35 degrees of freedom and a 0.025 area is approximately 2.03. Again, our t test statistic is more extreme than the critical t value, so we would still reject the null hypothesis, H₀.

Answer B is correct.

Question 20: In regression analysis if the dependent variable is measured in dollars, the independent variable
a. must also be in dollars
b. must be in some units of currency
c. can be any units
d. can not be in dollars

Answer: This is entirely conceptual. The units of the dependent and independent variables have nothing to do with each other. Think of the site.mtw example that we were using extensively in class. The dependent variable was store sales (measured in dollars) and the independent variable was the size of the store (measured in square feet). The correct answer is C - the independent variable can be in any units.

Question 21: In a regression analysis, if SST=4500 and SSE=1575, then the coefficient of determination (R²) is
a. 0.35
b. 0.65
c. 2.85
d. 0.45

Answer: Since SST=SSE+SSR, SSR=4500-1575=2925. And R²=SSR/SST=2925/4500=0.65. Therefore, answer B is correct.

Question 22: Regression analysis was applied between sales (Y in $1,000) and advertising (X in $100), and the following estimated regression equation was obtained.
Y-hat = 80 + 6.2 X
Based on the above estimated regression line, if advertising is $10,000, then the point estimate for sales (in dollars) is
a. $62,080
b. $142,000
c. $700
d. $700,000

Answer: When a question is this easy, you know there's some sort of trick. Watch your units!! Since X is in hundreds of dollars, plug in 100 in the regression equation. Y = 80 + 6.2(100) = 700. Y is in thousands of dollars. Therefore, the point estimate for sales in dollars is $700,000 - answer D.
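
One way to avoid the units trap is to do the conversions explicitly. Here's a tiny Python sketch (the function name is my own) that takes advertising in plain dollars and returns sales in plain dollars.

```python
def predicted_sales_dollars(advertising_dollars):
    """Apply Y-hat = 80 + 6.2 X, where X is in $100 and Y is in $1,000."""
    x = advertising_dollars / 100    # convert dollars to units of $100
    y_hat = 80 + 6.2 * x             # regression output, in units of $1,000
    return y_hat * 1000              # convert back to dollars

sales = predicted_sales_dollars(10_000)
print(f"Point estimate for sales: ${sales:,.0f}")
```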

Question 23: If the coefficient of correlation is a positive value, then
a. the intercept must also be positive
b. the coefficient of determination (R²) can be either negative or positive, depending on the value of the slope
c. the regression equation could have either a positive or a negative slope
d. the slope of the line must be positive

Answer: We learned about the coefficient of correlation way back in Chapter 3. It's a measure of the strength of the linear relationship between x and y. Its values range from -1 to 1. Values close to -1 or 1 indicate a strong linear relationship, either negative or positive.

Answer A is incorrect because the coefficient of correlation tells us nothing about the intercept.
Answer B is incorrect because the coefficient of determination (r²) can never be negative. r² = SSR/SST and neither SSR nor SST can be negative (since they're both sums of squares), so r² can't be negative either.
Answer C is incorrect because a positive coefficient of correlation indicates a positive relationship which would be modeled with a positive slope.
Answer D is correct.

Exhibit 14-10
The following information regarding a dependent variable Y and an independent variable X is provided.
∑ X = 16 ∑ (x-xbar)(y-ybar) = -8
∑ Y = 28 ∑ (x-xbar)² = 8
n = 4

Question 24: Refer to Exhibit 14-10. The slope of the regression function is
a. -1
b. 1.0
c. 11
d. 0.0

Answer: On the formula sheet we have the formula for the regression slope, b₁:
b₁ = ∑ (x-xbar)(y-ybar) / ∑ (x-xbar)² = -8/8 = -1.
So answer A is correct.

Question 25: Refer to Exhibit 14-10. The intercept of the regression line is
a. -1
b. 1.0
c. 11
d. 0.0

Answer: Again, the formula sheet gives us the computation for the intercept, b₀:
b₀ = ybar - b₁xbar = (28/4) - (-1)(16/4) = 7 + 4 = 11.
So answer C is correct.
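
Questions 24 and 25 use only the summary sums from Exhibit 14-10, so they're easy to verify in a few lines. A minimal Python sketch, with my own variable names:

```python
# Exhibit 14-10
n = 4
sum_x, sum_y = 16, 28
ss_xy = -8   # sum of (x - xbar)(y - ybar)
ss_x = 8     # sum of (x - xbar)^2

x_bar = sum_x / n
y_bar = sum_y / n

b1 = ss_xy / ss_x          # slope (Question 24)
b0 = y_bar - b1 * x_bar    # intercept (Question 25)

print(f"slope b1 = {b1}, intercept b0 = {b0}")
```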

More answers to sample problems to come. (I'm kinda jumping around for now.)

Final Exam Study Guide - Analysis of Prior Exam Questions

2008-03-10T10:23:00.005-05:00

Looking at last quarter's exam gives us some insight as to what to expect on our final. The most important piece is that it provides practice questions at the level we'll be expected to perform. It is extremely worthwhile to do these problems on your own and make sure you understand the answer.*

Another interesting insight that we gain from the sample exam is the distribution of questions. Here's what I came up with for the number of questions per chapter and the number of points associated with those questions:


          Mult    Short   Total
Chapter   Choice  Answer  Points
7         6       0       12
8         5       1       20
9         8       1       26
12        6       3       42

We'll probably have a few questions from Chapter 6 thrown in, but those will probably be relatively easy compared to the more advanced material. These numbers tell me one thing for sure: Chapter 12 is really important!

*For the record: If you noticed that I didn't stick around for the in-class review of the sample final on Thursday, it's not because I think I know all this stuff! Just the opposite. Almost all of this material is new to me and I wanted to work through all the questions on my own without having heard the answer already solved by someone else.

Final Exam Study Guide - Outline

2008-03-10T09:05:00.011-05:00

Final Exam Study Guide
There's a lot of material to review for our final exam. In order to study for the final, I'm going through all the chapters that will be covered (6, 7, 8, 9 and 12) and pulling out the important points from each one. I've basically written them up as "learning objectives" for each chapter. Also since we didn't cover every section of every chapter, I've listed the sections that we did cover.

Here's my outline as it stands so far:

Chapter 6 - The Normal Distribution
6.1
Understand the concept of a continuous probability distribution and the difference between continuous and discrete probability distributions.

6.2
Understand the normal and standard normal distributions.
Calculate the z score for any given X.
Read the standard normal distribution table and answer questions of the form:
P(X<a)
P(X>a)
P(a<X<b)

6.3
Use the normal probability plot to evaluate normality of data.

Chapter 7 - Sampling Distributions

7.1
Understand the concept of a sampling distribution.

7.2
Calculate z-scores for xbar using the standard error of the mean: σ/√n
Understand the Central Limit Theorem.

Chapter 8 - Confidence Intervals

8.1
Construct a confidence interval for the population mean, given a sample mean, population standard deviation, sample size and level of confidence.
Know that a high level of confidence requires a wider confidence interval.

8.2
Construct a confidence interval for the population mean, given the sample mean, sample standard deviation, sample size and level of confidence.
Know that for the t statistic, the degrees of freedom is n-1.
Read the t-table to find the critical value for a given level of confidence and degrees of freedom.

8.4
Calculate the sample size required for a given margin of error and level of confidence.
Know that a smaller margin of error requires a larger sample size.
Know that a higher level of confidence requires a larger sample size.

Chapter 9 – Hypothesis Testing

9.1
Understand the concept of the null and alternative hypotheses.
Construct null and alternative hypotheses based on a description of the test.
Understand the concepts of rejection and non-rejection regions.
Understand the level of significance, alpha, of a hypothesis test.

9.2
Know the difference between a one-tailed and two-tailed hypothesis test.
Calculate critical values for the rejection and non-rejection regions for both one-tailed and two-tailed tests.
Calculate the z test statistic and compare to critical values to make a decision whether or not to reject the null hypothesis.
Calculate the p-value and compare to the level of significance to make a decision whether or not to reject the null hypothesis.

9.3
Create null and alternative hypotheses for one-tailed testing.

9.4
Use the t test statistic to conduct one and two-tailed hypothesis tests when σ is not known.

Chapter 12 - Simple Linear Regression

12.1
Understand the basic concepts of independent and dependent variables, intercept and slope.
Understand the concept of simple linear regression.
Regression: modeling a relationship between variables with a curve
Linear Regression: the curve in the relationship is a straight line (not some sort of arc)
Simple Linear Regression: only consider one independent variable as the predictor of the dependent variable
Understand the simple linear regression model formula: Y_i = β₀ + β₁X_i + ε_i

12.2
Understand the method of least squares.
Apply the computation formulas of the least squares method to compute the Y intercept b₀ and the slope b₁.
Know how to read and interpret partial computer output (Minitab) and develop the regression line based on it.

12.3
Understand the sum of squares terms SST, SSR and SSE for the measures of variation in regression.
Calculate any of the sum of squares terms, given the other two.
Understand the coefficient of determination, r².
Calculate r² given any two sum of square terms.
Know how to read and interpret partial computer output (Minitab) and calculate sum of squares terms and r² based on it.
Understand the standard error of the estimate and calculate it, given SSE or SST and SSR.

12.4
Know the four assumptions necessary to use the method of least squares in simple linear regression.

12.5
Know how to use residual analysis to validate the four assumptions.

12.6
Understand the Durbin-Watson statistic.
Know how to interpret the Durbin-Watson statistic to detect autocorrelation.

12.7
Calculate the standard error of the slope, S_b₁.
Calculate the t test statistic for the slope and determine whether there is a significant linear relationship.
Know that when comparing the t test statistic for the slope to the critical t value, you use n-2 degrees of freedom.
Construct a confidence interval for the slope.

12.8
Construct a prediction interval for an individual response Y.
Construct a confidence interval for the mean of Y.
(This is as far as I got so far. More to come!)

Lecture 8 - Confidence Interval for Ŷ

2008-03-04T16:46:00.004-06:00

Confidence Interval for Ŷ
Once we've calculated our regression coefficients, b₀ and b₁, we can estimate the value of Y at any given X with the formula:
Ŷ = b₀ + b₁X

This is known as a point estimate. It estimates Ŷ to a point. However, since it's just an estimate, it's logical to ask for a confidence interval around Ŷ.

In addition to constructing a confidence interval for an individual Ŷ, we can also construct a confidence interval for an average Ŷ at X.

The difference is best illustrated with an example. In site.mtw we have data on square footage of stores and their annual sales. In general, sales increase linearly with increasing square footage. We perform a regression analysis and determine the regression coefficients. Now we could ask two questions:
1. If I build a single new store with 4,000 square feet, what does the regression predict for its annual sales? The answer can be expressed as a confidence interval for an individual Ŷ, because we're making a prediction for an individual new store.
2. If I build 10 new stores, each with 4,000 square feet, what does the regression predict for the average annual sales of those stores? The answer to this question can be expressed as a confidence interval for an average Ŷ, since we're making a prediction about the average sales at many new stores.

Confidence Interval for Average Ŷ
The confidence interval for the average Ŷ (question #2 above) takes the common form:
Ŷ ± t_{α/2, n-2} S_Ŷ
where S_Ŷ is the standard error of Ŷ. Note: n-2 is used in looking up the t value in the t table.

We are told that
S_Ŷ = S_XY × SQRT(h_i)
where S_XY is the standard error of the estimate and h_i is given by:
h_i = 1/n + (X_i - xbar)²/SSX
That's all there is to it! We'll see in a minute that Minitab can calculate the standard error term for us, so constructing the interval is just a matter of looking up the value in the t table and then doing the arithmetic.

Confidence Interval for Individual Ŷ
If we're constructing the confidence interval for an individual Ŷ (question #1 above), the calculations are very similar except that we use a 1+h_i term in place of h_i. So that term becomes:
S_Ŷ = S_XY × SQRT(1+h_i)
Other than the standard error term, everything else is the same as calculating for an average Ŷ.

Using Minitab to calculate the confidence interval
Here's how to get the info we need out of Minitab:
1. Load up your data in a worksheet. (We use the site.mtw file as usual.)
2. Select Stat-Regression-Regression from the menubar.
3. Put the independent variable (square feet) in the Predictor box. Put the dependent variable (annual sales) in the Response box.
4. Click the Options button. Enter 4 in the Prediction interval for new observations box. This tells Minitab that we want a prediction of annual sales at 4000 square feet (the units of our data are thousands of feet).
5. Check the Confidence Limit checkbox if you want a confidence interval for an average Y. Check the Prediction Limit checkbox if you want a confidence interval for an individual Y.
6. Click OK in the Options and the main Regression windows. The results appear in the session window. Here's the relevant information:

Predicted Values for New Observations

New
Obs    Fit  SE Fit      95% CI          95% PI
  1  7.644   0.309  (6.971, 8.317)  (5.433, 9.854)

The prediction is 7.64 (the "fit"). The SE Fit term is the standard error for the average Y (the one with just h_i, not 1+h_i).

I suspect that we'll probably be expected to construct the confidence interval for the average Y, given the Fit and SE Fit output. Don't forget: You still need to look up the t value (at n-2!) and multiply the SE Fit value by it.
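
As a rough check, both intervals in the Minitab output can be rebuilt from Fit and SE Fit. This is only a sketch under two assumptions that are not stated in this post: that the sample has n = 14 stores (so n-2 = 12 degrees of freedom and t = 2.1788 from the t table at 95% confidence), and that S = 0.966380 from the regression output for the same data is the standard error of the estimate. For the individual interval it relies on the 1+h_i term working out to S² + (SE Fit)² under the square root.

```python
import math

fit = 7.644        # "Fit" from the Minitab output
se_fit = 0.309     # "SE Fit": standard error for the average Y
s_xy = 0.966380    # "S" from the regression output (standard error of the estimate)
t_crit = 2.1788    # ASSUMED: t table value, 95% confidence, n - 2 = 12 df

# Confidence interval for the average Y: Fit +/- t * SE Fit
ci_half = t_crit * se_fit
ci = (fit - ci_half, fit + ci_half)

# Interval for an individual Y: the standard error uses 1 + h_i,
# which is the same as sqrt(S^2 + SE Fit^2)
se_individual = math.sqrt(s_xy ** 2 + se_fit ** 2)
pi_half = t_crit * se_individual
pi = (fit - pi_half, fit + pi_half)

print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI: ({pi[0]:.3f}, {pi[1]:.3f})")
```

If the assumptions hold, both results should land within rounding of the 95% CI and 95% PI columns that Minitab printed.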

Lecture 8 - Inferences About the Regression Slope - Part 2

2008-03-04T15:06:00.005-06:00

Confidence Interval for b₁
The second question that we ask when evaluating the regression is: What is the confidence interval for b₁?

Like any confidence interval, this one will take the form:
b₁ ± t_{α/2, n-2} S_b₁

In the last blog post, we found that S_b₁ is:
S_b₁ = S_XY/SQRT(SSX)

Knowing S_b₁, we can look up t in the t-table and construct the confidence interval relatively easily.

Note that for the confidence interval for b₁ we use n-2 in looking up the t-score.

Using Minitab to evaluate the regression
We won't be expected to calculate S_XY and S_b₁ by hand for the final (or so we were told). But we will likely be asked to create a confidence interval for b₁ given a snippet of Minitab output. So it's worthwhile to take a look at it:

We used the site.mtw dataset and ran the standard regression analysis and got this:

Predictor      Coef  SE Coef      T      P
Constant     0.9645   0.5262   1.83  0.092
Square Feet  1.6699   0.1569  10.64  0.000
S = 0.966380   R-Sq = 90.4%   R-Sq(adj) = 89.6%

The S_b₁ value is calculated for us, but it's not obvious where it is. It's the SE Coef term in the Square Feet row (0.1569). b₁ itself is the Coef term in the same row (1.6699). With those two numbers and a t-table, you can construct a confidence interval for b₁. Just remember to use n-2 in the t-table.

With these two values you can also determine the t statistic for hypothesis testing β₁=0 from the previous blog post by dividing b₁/S_b₁. But the truth is, you don't have to do that! The t-value for b₁ is right there in the Minitab output also, in the T column (10.64). The number in the P column (0.000) is the p-value for b₁. Minitab reports a two-tailed p-value here, so if that number is less than α, then you can reject the hypothesis that β₁ is 0.
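
Here's a small sketch of that arithmetic using the Coef and SE Coef values from the output above. The critical value is an assumption on my part: I'm taking n = 14 stores, so 12 degrees of freedom and t = 2.1788 from the t table at 95% confidence. Swap in the right value for your own n and α.

```python
b1 = 1.6699      # Coef for Square Feet
se_b1 = 0.1569   # SE Coef for Square Feet (this is S_b1)
t_crit = 2.1788  # ASSUMED: t table value, 95% confidence, n - 2 = 12 df

# Confidence interval for the slope: b1 +/- t * S_b1
half_width = t_crit * se_b1
ci_b1 = (b1 - half_width, b1 + half_width)

# t statistic for testing beta1 = 0 (compare with the T column)
t_stat = b1 / se_b1

print(f"CI for b1: ({ci_b1[0]:.4f}, {ci_b1[1]:.4f})")
print(f"t statistic: {t_stat:.2f}")
```

The interval does not contain 0, which agrees with the large t statistic: both say the slope is significant.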

Lecture 8 - Inferences About the Regression Slope

2008-03-04T14:15:00.003-06:00

After we use the method of least-squares to calculate regression coefficients (b₀ and b₁) and we validate the LINE assumptions, we next turn to evaluating the regression, specifically the slope, b₁ and ask two questions:
1. Is it statistically significant?
2. What is the confidence interval for b₁?

The first question (we actually covered this after the second question in class), whether b₁ is statistically significant, is determined by asking: Is it any better than a flat horizontal line through the data?

We answer this question by making a hypothesis that the true relationship slope, β₁ is 0 and using our skills at hypothesis testing to determine whether we should reject that hypothesis.

H₀: β₁ = 0
H₁: β₁ ≠ 0

The t statistic that we use to test the hypothesis is:
t = (b₁-β₁)/S_b₁
where S_b₁ is the standard error of the slope.

In our case, β₁ is 0 according to our hypothesis, so t reduces to:
t = b₁/S_b₁

The standard error of the slope, S_b₁, is defined as:
S_b₁ = S_XY/SQRT(SSX)
where S_XY is the standard error of the estimate and SSX is the sum of the squares of the differences between each x and xbar.

The standard error of the estimate, S_XY, is defined as:
S_XY = SQRT(SSE/(n-2))

So, if we have our calculations of SSX and SSE, we can do the math and find S_b₁ and the t-score for b₁.

We finish our hypothesis testing by comparing the t-score for b₁ to t_{α/2, n-2}, where α is our level of significance.
If t is beyond t_{α/2, n-2} (either on the positive or negative end), we conclude that the hypothesis, H₀, must be rejected.
We could also make the conclusion based on the p-value of the t-score. For this two-tailed test, if the two-tailed p-value is less than α, then we reject H₀.
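
The chain of calculations above (SSE to S_XY to the standard error of the slope to t) can be sketched in a few lines. The function name and the input numbers are my own illustrative choices, picked to be roughly in line with the store example and not taken from the dataset. The critical value assumes α = 0.05 and n = 14.

```python
import math

def slope_t_statistic(b1, sse, ssx, n):
    """Return (S_XY, S_b1, t) for testing H0: beta1 = 0."""
    s_xy = math.sqrt(sse / (n - 2))   # standard error of the estimate
    s_b1 = s_xy / math.sqrt(ssx)      # standard error of the slope
    return s_xy, s_b1, b1 / s_b1

# Illustrative (hypothetical) inputs
s_xy, s_b1, t_stat = slope_t_statistic(b1=1.67, sse=11.2, ssx=37.9, n=14)

t_crit = 2.1788  # ASSUMED: t table value for alpha/2 = 0.025, 12 df
reject_h0 = abs(t_stat) > t_crit

print(f"S_XY = {s_xy:.4f}, S_b1 = {s_b1:.4f}, t = {t_stat:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```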

**Confidence interval for b₁ will be covered in the next blog post.**

Lecture 8 - Residual Analysis - Checking Independence of Errors

2008-03-04T09:59:00.006-06:00

Checking the Independence of Errors Assumption
The "I" in the LINE mnemonic stands for Independence of Errors. This means that the distribution of errors is random and not influenced by or correlated to the errors in prior observations. The opposite of independence is called autocorrelation.

Clearly, we can only check for independence/autocorrelation when we know the order in which the observations were made and the data points were collected.

We check for independence/autocorrelation in two ways. First, we can plot the residuals vs. the sequential number of the data point. If we notice a pattern, we say that there is an autocorrelation effect among the residuals and the independence assumption is not valid. The plot at right of residuals vs. observation week shows a clear up and down pattern of the residuals and indicates that the residuals are not independent.

The second test of independence/autocorrelation is a more quantitative measure. (All the methods that we've used up to this point for checking assumptions have been graphical/visual.) This test involves calculating the Durbin-Watson Statistic. The D-W statistic is defined as:
D = ∑(ε_i - ε_{i-1})² / ∑(ε_i²)
where the sum in the numerator runs from i=2 to n and the sum in the denominator runs from i=1 to n. It's the sum of the squares of the differences between consecutive errors divided by the sum of the squares of all errors.
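
To make the definition concrete, here's a minimal sketch that computes D from a list of residuals given in observation order. The function name and both residual lists are made up for illustration.

```python
def durbin_watson(residuals):
    """Sum of squared differences between consecutive residuals,
    divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals: neighbours look alike (positive autocorrelation)
drifting = [1.0, 0.9, 0.8, 0.7, -0.7, -0.8, -0.9, -1.0]
# Hypothetical residuals: sign flips every time (negative autocorrelation)
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

d_drifting = durbin_watson(drifting)
d_alternating = durbin_watson(alternating)
print(f"drifting: D = {d_drifting:.2f}, alternating: D = {d_alternating:.2f}")
```

The drifting residuals give a D well below 2 and the alternating ones give a D well above 2.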

Another way to look at the Durbin-Watson Statistic is through the approximation:

D ≈ 2(1-ρ)

where ρ (the Greek letter rho - lower case) = the correlation between consecutive errors.

Looking at it that way, there are 3 important values for D:
D=0: This means that ρ=1, indicating a positive correlation.
D=2: In this case, ρ=0, indicating no correlation.
D=4: ρ=-1, indicating a negative correlation

In order to assess whether there is independence, we check to see if D is close to 2 (in which case we say there is no correlation and errors are independent) or if it's closer to one of the extreme values of 0 or 4 (in which case we say that the independence assumption is not valid). There is also some grey area between 0 and 2 and between 2 and 4. In that grey area the Durbin-Watson statistic does not give us enough information to make a determination, so the test is inconclusive.

To determine the boundaries for when the Durbin-Watson statistic is relevant and when it's inconclusive, we turn to table E.9, which provides us with lower and upper bounds, d_L and d_U.

Reading the Durbin-Watson Critical Values Table
The critical values are dependent on the sample size, n, the number of independent variables in the regression model, k, and the level of significance, α. In the case of simple linear regression, there's always only 1 independent variable. (That's the simple part.) The level of significance is usually 0.01 (99% confidence) or 0.05 (95% confidence).

So, to read the table:
1. Locate the large section of the table for your level of significance, α.
2. Find the two columns, d_L and d_U, for k=1 (assuming it's simple).
3. Go down the column to the row with your sample size, n.
4. Read the two values for d_L and d_U

Interpreting the Durbin-Watson Statistic
0 < D < d_L: There is positive autocorrelation
d_L < D < d_U: Inconclusive
d_U < D < 4-d_U: No autocorrelation
4-d_U < D < 4-d_L: Inconclusive
4-d_L < D < 4: There is negative autocorrelation
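
The decision rule can be sketched as a small function. It uses the standard symmetric regions, with 4-d_U and 4-d_L as the cutoffs above 2. The function name is mine, and the d_L and d_U in the example are placeholder values for illustration, so read the real ones from table E.9 for your n, k and α. How to treat a D that lands exactly on a boundary is also my own choice here.

```python
def interpret_durbin_watson(d, d_lower, d_upper):
    """Classify a Durbin-Watson statistic given the table bounds d_L and d_U."""
    if d < d_lower:
        return "positive autocorrelation"
    if d < d_upper:
        return "inconclusive"
    if d <= 4 - d_upper:
        return "no autocorrelation"
    if d <= 4 - d_lower:
        return "inconclusive"
    return "negative autocorrelation"

# Placeholder bounds for illustration only
d_l, d_u = 1.08, 1.36

for d in (0.5, 1.2, 2.0, 2.8, 3.5):
    print(d, "->", interpret_durbin_watson(d, d_l, d_u))
```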

Graphically, it can be represented like this:

Note: Positive autocorrelation is somewhat common. Negative autocorrelation is very uncommon and our book does not deal with it.

Lecture 8 - Residual Analysis - Checking the Equal Variance Assumption

2008-03-02T15:40:00.004-06:00

Homoscedasticity (Not that there's anything wrong with that.)
We now turn to checking the assumption of equal variance of errors, the "E" in our LINE mnemonic. This assumption states that not only is the error at each x-value distributed normally, but the variance in the error is equal at each point.

Equal variance of errors is known as homoscedasticity. Unequal variance of errors is called heteroscedasticity.

For this analysis we turn again to the plot of residuals vs. the independent variable (x) that we used when we validated the linearity assumption. For linearity, we were just looking to see if the residuals were evenly distributed above and below the x-axis. To check for equal variance of errors, we check to see if there's any pattern in the spread of the residuals around the x-axis.

Running the residual plot versus x in Minitab:
1. Load up your data.
2. Select Stat-Regression-Regression from the menu bar.
3. Put Annual Sales in the Response box and Square Feet in the Predictor box.
4. Click the Graphs button. Put Square Feet in the Residuals versus the variables box.
5. Click OK in both the Graphs and Regression dialogs. The residual plot appears.

Review the graph and ask yourself: Is there any pattern in the residuals? Do they get increasingly larger or smaller as x changes? If so, then you have a case of heteroscedasticity. But if the residuals are distributed evenly and consistently around the x-axis, then you can conclude that the variances are consistent and the assumption of equality of variances is valid.

In our example, I'd be somewhat concerned with the fact that the residuals are closer to the x-axis for small values of x, but broaden out for larger values. The variance does seem to taper off as x gets very large, which is an indication that the variances are equal for x>2 or so. (Click the graph for a larger view of the plot.)

Lecture 8 - Residual Analysis - Checking the Normality Assumption

2008-03-02T14:20:00.004-06:00

The next assumption in the LINE mnemonic after Linearity is Independence of Errors. We skipped that one momentarily because it's a bit more complex than the others. So we saved it for last. In the meantime, we looked at the next assumption: Normality of Error.

Checking the Normality Assumption
This assumption states that the error in the observation is distributed normally at each x-value. A larger error is less likely than a smaller error and the distribution of errors at any x follows the normal distribution.

Although we typically only have one observation at each x, if we assume that the distribution of the errors is the same at each x, we can simply plot all the errors (residuals) and check if they follow the normal distribution. We do this by running a normal probability plot of the residuals. Fortunately for us, Minitab has a built-in normal probability plot function.

Checking Normality Using Minitab
1. Open up your data worksheet. As usual, we'll use the site.mtw file for our example.
2. Select Stat-Regression-Regression from the menu bar.
3. Put Annual Sales in the Response box since it's the dependent (response) variable and put Square Feet in the Predictors box since it's the independent (predictor) variable.
4. Click the Graphs button and under Residual Plots, check the Normal plot of residuals checkbox.
5. Click OK in the Graphs and the Regression dialogs.

Minitab creates the normal probability plot of the residuals. The y-axis of this graph is adjusted so that if the data are distributed normally, they will fall on a straight line on the graph. Minitab even draws a line through the residuals for us (presumably using the method of least-squares).

Drawing a conclusion from the graph
Review this graph and ask yourself: Do the residual points fall more-or-less on a straight line in the normal probability plot? If they do, you can conclude that the errors are distributed normally and the normality of errors assumption is valid. In our example, the normal probability plot of the residuals is pretty much linear, but I would be concerned about the upward trend at the far right end of the graph. (Click the graph to see it in more detail.)
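
For the curious, here's a rough sketch of what goes into a normal probability plot: sort the residuals and pair each one with the standard normal quantile for its rank. The plotting position (i - 0.5)/n is one common choice and is an assumption on my part, so Minitab may well use a different formula. The residuals are made up.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each sorted residual with a standard normal quantile.

    Uses the plotting position (i - 0.5) / n, which is an assumption
    and not necessarily what Minitab does.
    """
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, r in enumerate(ordered, start=1):
        p = (i - 0.5) / n
        points.append((std_normal.inv_cdf(p), r))
    return points

# Hypothetical residuals
points = normal_plot_points([0.3, -1.2, 0.8, -0.1, 0.2])
for z, r in points:
    print(f"normal quantile {z:6.3f}   residual {r:5.2f}")
```

If the residuals are roughly normal, plotting these pairs gives points that fall close to a straight line.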

Lecture 8 - Residual Analysis - Checking Linearity

2008-02-29T14:42:00.006-06:00

Checking Linearity
Our method for checking the first assumption, linearity of the data, is not a precise, quantitative test. Rather, we'll use visual inspection to check for linearity.

One quick way to test the linearity of the data is to create an x-y scatter plot and observe whether the data generally follows a straight line (either with positive or negative slope). Plotting the regression line through the data may help visualize this as well.

Using Minitab for the linearity check:
1. Bring up your data in a worksheet. We used the site.mtw file in class.
2. Select Graph-Scatterplot from the menu bar. Select the "With Regression" option when prompted for the type of scatterplot.
3. Put Annual Sales (the dependent variable) in the Y Variables column and Square Feet (the independent variable) in the X Variables column. Remember that the independent variable is the variable that you can control and which you think will be a predictor of the dependent variable. In other words, the annual sales is dependent on the size of the store (in square feet). It's not the other way around. The size of the store doesn't grow or shrink depending on the number of sales!
4. Don't change the default options and click OK. You should get a plot of your data with a regression line through it. (If you don't get the regression line, in step 3 click Data View, Regression Tab and make sure Linear is selected.)

To interpret the linearity of this graph, "eyeball" the way the points fall above and below the regression line and ask yourself: Do the data points follow a relatively straight line, or are they curved or skewed in some way? In our case, the data is relatively linear and not curved, so we conclude that the assumption of linearity is valid.

A better way to visually assess the linearity is to plot the residuals versus the independent variable and look to see if the errors are distributed evenly above and below 0 along the entire length of the sample.

Plotting Residuals versus the Independent Variable with Minitab
1. Select Stat-Regression-Regression from the menu bar.
2. Put Annual Sales in the Response box and Square Feet in the Predictors box. In our scenario, we think that the number of square feet will be a predictor of the annual sales of the store. Notice that the predictors box is large. There can be more than one predictor - perhaps advertising, employee training, etc. Many things can influence the response variable - the annual sales. We'll get to that during multiple linear regression. Right now, for simple linear regression, we're just looking at a single predictor.
3. Click the Graphs button. Put Square Feet in the Residuals versus the variables box.

To interpret this graph, ask yourself: Do the residual points fall equally above and below 0 along the entire length of the horizontal axis? In our case, the residuals do more or less fall equally above and below 0, so we conclude that the data is linear and the assumption of linearity is valid. Note: We also see that the residuals are closer to 0 for lower values of x (square feet). That may become important later when we talk about equal variance of errors.

Lecture 8 - Residual Analysis - Definition

2008-02-29T10:40:00.005-06:00

Residual Analysis
In Lecture 7 we discussed how to use the method of least squares to perform simple linear regression on a set of data. We also discussed the four assumptions we make about our data in order to use the method of least squares for the regression:
1. Linearity
2. Independence of errors
3. Normality of error
4. Equal variance of errors

The error is also known as the residual and is the difference between the observed Y_i value, for any particular X_i, and the value for Y_i predicted by our regression model, which is usually symbolized by Ŷ_i (read "y hat sub i"). The residual is symbolized by the Greek letter epsilon (lower case) - ε_i.

ε_i = Y_i - Ŷ_i

We perform a four-part residual analysis on our data to evaluate whether each of the four assumptions holds and, based on the outcome, we can determine whether our linear regression model is the correct model.

It's called a residual analysis because 3 of the 4 assumptions (independence, normality and equality of variance) directly relate to the errors (the residuals) and the other assumption (linearity) is tested by assessing the residuals.
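
Here's a minimal sketch of the residual calculation, ε_i = Y_i - Ŷ_i. The data and the coefficients b0 and b1 are hypothetical and only there to show the arithmetic.

```python
def residuals(x_values, y_values, b0, b1):
    """Observed Y minus the Y predicted by the line b0 + b1 * X."""
    return [y - (b0 + b1 * x) for x, y in zip(x_values, y_values)]

# Hypothetical observations and regression coefficients
x = [1.0, 2.0, 3.0, 4.0]
y = [2.9, 4.2, 6.3, 7.6]
e = residuals(x, y, b0=1.0, b1=1.7)

for xi, ei in zip(x, e):
    print(f"x = {xi}: residual = {ei:+.2f}")
```

These residuals are what all four checks in the residual analysis look at.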

Using the DePaul Online Library for Research

2008-02-28T15:42:00.007-06:00

DePaul Online Research Library
One of the really nice perks that we have as students at DePaul is the online research library. The library subscribes to many searchable databases of academic journals and magazines. Many of these databases allow you to access and download full text versions of the journal articles, usually available in PDF format. To access the online research library, all you need is your DePaul student ID.

Another really nice feature of the DePaul online research library is that it can be integrated with the Google Scholar search engine. I'll show you how to do that in a future blog post.

Here's how to access the DePaul online research library:
1. Bring up the main DePaul web site by browsing to http://www.depaul.edu.
2. At the top of the page, click on the "Libraries" link.

3. This will bring you to the Library page. There are lots of links to follow here. Focus on the "Research" section. The way this works is that you need to identify the database in which you want to conduct your search. Once you do that, you can use the database's internal search function to find your article. So, how do you find a database? There are a couple ways:

Method 1:
Use this method if you're just starting out and don't know which database or journal you're going to search
4. Click the "Journals and newspaper articles" link. That will bring you to the subject page.
5. Since we're studying statistics, a good choice for subject would be Mathematical Sciences. Click that link.
6. You get the database list. For mathematical sciences, we subscribe to 7 databases. The database list gives you a short description of the database and the dates covered by the database. Some of the databases indicate whether we subscribe to full text of articles with a FT icon:

7. Choose your database and click on its link. You'll be prompted for your DePaul username and password. Enter those and click Login.

Method 2:
Use this method if you know the database you want to search
4. In the Research section of the Library page you can click on the A-Z Database List to see all the databases. If you already know the database that you want to search, you can skip by the "subject" steps 4-5 above in method 1 by just using the A-Z list.

Method 3:
Use this method if you know the name of the journal that you want to search
4. Click the "Journals and newspaper articles" link.
5. On the left hand margin, enter the name of the journal and click Search.
6. The results page will show you which databases contain that journal and for which years

Each database has its own interface and it would be impossible for me to cover all of them, but most of them are self explanatory and user-friendly. You can usually search by author, article title or keyword. Several databases also allow you to browse the issues of the journals in the database.

More to come...

Lecture 7 - Coefficient of Determination

2008-02-26T19:16:00.004-06:00

After calculating b₀ and b₁ to determine the best line to fit the data, we want to quantify how well the line fits the data. It may be the best line, but how good is it?

Sum of Squares
Looking at the graph of the data, we could say that without any modeling or regression at all, we would expect the y-value for any given x to be the mean y, ybar. Most of the observations, of course, would not be equal to the mean. We can measure how far the observations are from the mean by taking the difference between each y_i and ybar, squaring them, and taking the sum of the squares. We call this the total sum of squares or SST.

You probably remember that the variance that we discussed much earlier in the course is this sum of squares divided by n-1.

The total sum of squares is made up of two parts - the part that is explained by the regression (yhat-ybar) and the part that the observation differs from the regression (y_i-yhat). When we square each of these and sum them we compute the regression sum of squares, SSR, and the error sum of squares, SSE.

Coefficient of Determination
The 3 sum of squares terms, SST, SSR and SSE, don't tell us much by themselves. If we're dealing with observations which use large units, these terms may be relatively large even though the variance from a linear relationship is small. On the other hand, if the units of the measurements in our observations are small, the sum of squares terms may be small even when the variance from linearity is great.

Therefore, the objective statistic that we use to assess how well the regression fits the data is the ratio of the regression sum of squares, SSR, to the total sum of squares, SST. We call this statistic the coefficient of determination, r².

r² = SSR / SST
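
Here's a small sketch that computes all three sums of squares and r² for a made-up data set. It fits the line with the usual least squares formulas for b₀ and b₁. The function name and the data are my own.

```python
from statistics import mean

def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least squares line through (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)               # total
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by regression
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # left over (error)
    return sst, ssr, sse

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

sst, ssr, sse = sums_of_squares(x, y)
r_squared = ssr / sst
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}, r² = {r_squared:.2f}")
```

Because SST = SSR + SSE, any one of the three terms can be found from the other two, and r² can also be written as 1 - SSE/SST.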

Lecture 7 - Assumptions in the Method of Least Squares

2008-02-25T19:51:00.005-06:00

Photo courtesy of F. Espenak at MrEclipse.com

Assumptions
In order to use the Least Squares Method, we must make 4 fundamental assumptions about our data and the underlying relationship between the independent and dependent variables, x and y.

1. Linearity - that the variables are truly related to each other in a linear relationship.
2. Independence - that the errors in the observations are independent from one another.
3. Normality - that the errors in the observations are distributed normally at each x-value. A larger error is less likely than a smaller error and the distribution of errors at any x follows the normal distribution.
4. Equal variance - that the distribution of errors at each x (which is normal as in #3 above) has the identical variance. Errors are not more widely distributed at different x-values.

A useful mnemonic device for remembering these assumptions is the word LINE - Linearity, Independence, Normality, Equal variance.

Note that the first assumption, linearity, refers to the true relationship between the variables. The other three assumptions refer to the nature of the errors in the observed values for the dependent variable.

If these assumptions are not true, we need to use a different method to perform the linear regression.

The Least-Squares Method

2008-02-25T19:14:00.005-06:00

The Method of Least Squares
As described in the previous post, the least-squares method minimizes the sum of the squares of the error between the y-values estimated by the model and the observed y-values.

In mathematical terms, we need to minimize the following:
∑ (y_i - (β₀+β₁x_i))²

All the y_i and x_i are known and constant, so this can be looked at as a function of β₀ and β₁. We need to find the β₀ and β₁ that minimize the total sum.

From calculus we remember that to minimize a function, we take the derivative of the function, set it to zero and solve. Since this is a function of two variables, we take two derivatives - the partial derivative with respect to β₀ and the partial derivative with respect to β₁.

Don't worry! We won't need to do any of this in practice - it's all been done years ago and the generalized solutions are well known.

To find b₀ and b₁:
1. Calculate xbar and ybar, the mean values for x and y.
2. Calculate the difference between each x and xbar. Call it xdiff.
3. Calculate the difference between each y and ybar. Call it ydiff.
4. b₁ = [∑(xdiff)(ydiff)] / ∑(xdiff²)
5. b₀ = ybar - b₁xbar
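
The five steps above translate almost line for line into code. This is just a sketch with made-up data, and the function name is mine.

```python
from statistics import mean

def least_squares(x, y):
    """Return (b0, b1) for the least squares line through the points."""
    x_bar, y_bar = mean(x), mean(y)       # step 1
    x_diff = [xi - x_bar for xi in x]     # step 2
    y_diff = [yi - y_bar for yi in y]     # step 3
    b1 = (sum(xd * yd for xd, yd in zip(x_diff, y_diff))
          / sum(xd ** 2 for xd in x_diff))  # step 4
    b0 = y_bar - b1 * x_bar               # step 5
    return b0, b1

# Hypothetical data
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]

b0, b1 = least_squares(x, y)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```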

Notice that we switched from using β to using b? That's because β is used for the regression coefficients of the actual linear relationship. b is used to represent our estimate of the coefficients determined by the least squares method. We may or may not be correctly estimating β with our b. We can only hope!

Lecture 7 - Simple Linear Regression

2008-02-25T18:28:00.006-06:00

Linear Regression essentially means creating a linear model that describes the relationship between two variables.

Our type of linear regression is often referred to as simple linear regression. The simple part of the linear regression refers to the fact that we don't consider other factors in the relationship - just the two variables. When we model how several variables may determine another variable, it's called multiple regression - the topic for a more advanced course (or chapter 13 in our text).

For example, we may think that the total sales at various stores is proportional to the number of square feet of space in the store. If we collect data from a number of stores and plot them in an XY scatter plot, we would probably find that the data points don't lie on a perfectly straight line. However, they may be "more or less" linear to the naked eye. Linear regression involves finding a single line that approximates the relationship. With this line, we can estimate the expected sales at a new store, given the number of square feet it will have.

In mathematical terms, linear regression means finding values for β₀ and β₁ in the equation

y = β₀ + β₁x

such that the resulting equation fits the data points as closely as possible.

The equation above may look more familiar to you in this form:

y = mx + b

That's the form we learned in linear algebra. m is the slope and b is the y-intercept. Similarly, in our statistical form, β₁ is the slope and β₀ is the y-intercept.

How Close is Close?
We said that we want to find a line that fits the data "as closely as possible". How close is that? Well, for any given β₀ and β₁, we can calculate how far off we are by looking at each x-value, calculating what the linear estimate would be according to our regression equation and comparing that to the actual observed y-value. The difference is the error between our regression estimate and the observation. Clearly, we want to find the line that minimizes the total error.

Minimizing the total error is done in practice by minimizing the sum of the squares of the errors. If we used the actual error term, and not the square, positive and negative errors would cancel each other out. We don't use the absolute value of the error term because we will need to differentiate it and the absolute value function is not differentiable at 0.

Generating regression coefficients β₀ and β₁ for the linear model by minimizing the sum of the square of the errors is known as the least-squares method.
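
A tiny made-up example shows why raw errors don't work as a measure of fit. The points below lie exactly on the line y = x, yet a flat line through the mean also has a total error of zero because its positive and negative errors cancel. Only the sum of squares tells the two lines apart.

```python
# Hypothetical data lying exactly on the line y = x
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]

def errors(b0, b1):
    """Observed y minus the estimate b0 + b1 * x at each point."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

flat = errors(b0=3, b1=0)    # horizontal line through ybar
exact = errors(b0=0, b1=1)   # the line the data actually follows

sum_flat = sum(flat)
sse_flat = sum(e ** 2 for e in flat)
sse_exact = sum(e ** 2 for e in exact)

print("flat line:  sum of errors =", sum_flat, " sum of squares =", sse_flat)
print("exact line: sum of squares =", sse_exact)
```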

Homework

2008-02-25T12:15:00.005-06:00

In case you were wondering...
According to an email I got from Prof Selcuk, there is no homework due this week. Homework #5 will be assigned this Thursday Feb 28 and due next Thursday March 6. It will be the last homework of the quarter.

My comment: This will give us time to absorb the material on linear regression before working on homework exercises. Maybe even time to go to the zoo on Sunday instead of doing homework.

Lecture 7 - Using Minitab to Calculate Hypothesis Testing Statistics

2008-02-24T18:14:00.019-06:00

Using Minitab to Calculate Hypothesis Testing Statistics
Minitab can be used to perform some of the calculations that are required in steps 4 and 5 of the critical value approach and step 4 of the p-value approach to hypothesis testing (see previous 2 posts). You still need to do all the study design in steps 1-3 and use them as input to Minitab. You will also need to draw your own conclusions from the calculations that Minitab performs.

Here's how:
1. Load up your data in a Minitab worksheet. (In lecture 7, we used the data in the insurance.mtw worksheet from exercise 9.59.)
2. Select Stat - Basic Statistics from the menu bar. Since we're doing hypothesis testing of the mean, we have two choices from the menu: "1-sample z" or "1-sample t". Since we don't know the standard deviation of the population, we choose the "1-sample t" test.
3. In the dialog box, select the column that has your sample data and click the select button so it appears in the "Samples in columns" box. In the test mean box, enter the historical value for the mean, which in our case is 45.
4. Click the Options button and enter the confidence level ((1-α)×100) and select a testing "alternative". The testing alternative is where you specify the testing condition of the alternative hypothesis.

If H₁ states that the mean is not equal to the historical value, select not equal. Minitab will make calculations for a two-tail test.
If H₁ states that the mean is strictly less than or strictly greater than the historical value, select less than or greater than. In this case, Minitab will calculate values for a one-tail test.

5. Click Ok in the Options dialog and Ok in the main dialog. Minitab displays the calculated values in the Session window. The results from our sample data looked like this:

One-Sample T: Time
Test of mu = 45 vs not = 45

Variable   N     Mean    StDev  SE Mean              95% CI      T      P
Time      27  43.8889  25.2835   4.8658  (33.8871, 53.8907)  -0.23  0.821

Unfortunately, Minitab doesn't take the hypothesis testing all the way to drawing a conclusion about the null hypothesis. We need to do that ourselves in one of two ways: either the critical value or p-value approach.

For the critical value approach, we also need to look up the critical values ±t_0.025,26 = ±2.056. Here 0.025 is α/2, which we use for this two-tail test with α = 0.05 (matching the 95% confidence level in the output), and 26 is n-1, the degrees of freedom. We compare these critical values to the t-score of the sample mean, which Minitab calculated as -0.23. Since -0.23 lies between -2.056 and 2.056, we do not reject H₀.

For the p-value approach, we compare the p-value that Minitab calculated, 0.821, to the level of significance, α, which in our case is 0.05. Since the p-value is larger than α, we do not reject H₀.
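If you want to check Minitab's arithmetic by hand, the following Python sketch recomputes the standard error, the t-score and the confidence interval from the summary statistics in the output above. The critical value 2.056 is typed in from the t-table rather than computed, so the interval differs from Minitab's in the third decimal place. The variable names are my own.

```python
import math

# Summary statistics copied from the Minitab output above.
n = 27            # sample size
x_bar = 43.8889   # sample mean
s = 25.2835       # sample standard deviation
mu_0 = 45         # historical (hypothesized) mean

# Critical value t_0.025,26 read from the t-table (rounded, not computed here).
t_crit = 2.056

se = s / math.sqrt(n)              # standard error of the mean
t_stat = (x_bar - mu_0) / se       # t-score of the sample mean
ci_low = x_bar - t_crit * se       # lower end of the confidence interval
ci_high = x_bar + t_crit * se      # upper end of the confidence interval

# Critical value approach for a two-tail test:
# reject H0 only if the t-score falls outside the critical values.
reject_h0 = abs(t_stat) > t_crit

print(f"SE Mean = {se:.4f}")
print(f"T = {t_stat:.2f}")
print(f"CI = ({ci_low:.4f}, {ci_high:.4f})")
print(f"Reject H0: {reject_h0}")
```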

Hypothesis Testing - p-Value Approach - 5 Step Methodology

2008-02-24T16:38:00.008-06:00

The p-Value Approach
The p-value approach to hypothesis testing is very similar to the critical value approach (see previous post). Rather than deciding whether or not to reject the null hypothesis based on whether the test statistic falls in a rejection region, the p-value approach lets us decide based on whether the p-value of the sample data is more or less than the level of significance, α.

The p-value is the probability of getting a test statistic equal to or more extreme than the sample result, assuming the null hypothesis is true. If the p-value is greater than or equal to α, a result at least this extreme is not unusual under H₀, and thus we do not reject H₀.

If, on the other hand, the p-value is less than α, a result at least this extreme would be unlikely if H₀ were true, and thus we reject H₀.
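What counts as "more extreme" depends on the alternative hypothesis. The sketch below is one way to express that for a z test statistic in Python, using the standard library's NormalDist. The function names and the three alternative labels (borrowed from Minitab's wording) are my own choices for illustration.

```python
from statistics import NormalDist

def z_p_value(z, alternative):
    """p-value of a z test statistic for the given alternative hypothesis.

    alternative is one of "greater than", "less than" or "not equal".
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater than":
        # One-tail test, rejection region in the right tail.
        return 1 - std_normal.cdf(z)
    if alternative == "less than":
        # One-tail test, rejection region in the left tail.
        return std_normal.cdf(z)
    if alternative == "not equal":
        # Two-tail test: count both tails beyond |z|.
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")

def reject_null(p_value, alpha):
    """Decision rule: reject H0 when the p-value is less than alpha."""
    return p_value < alpha

# A two-tail test with z = 1.5 at alpha = 0.05 (values chosen for illustration).
p = z_p_value(1.5, "not equal")
print(f"p-value = {p:.4f}, reject H0: {reject_null(p, 0.05)}")
```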

The five step methodology of the p-value approach to hypothesis testing is as follows:
(Note: The first three steps are identical to the critical value approach described in the previous post. However, step 4, the calculation of the critical value, is omitted in this method. Differences in the final two steps between the critical value approach and the p-value approach are emphasized.)

State the Hypotheses
1. State the null hypothesis, H₀, and the alternative hypothesis, H₁.
Design the Study
2. Choose the level of significance, α, according to the importance of the risk of committing Type I errors. Determine the sample size, n, based on the resources available to collect the data.
3. Determine the test statistic and sampling distribution. When the hypotheses involve the population mean, μ, the test statistic is z when σ is known and t when σ is not known. These test statistics follow the normal distribution and the t-distribution respectively.
Conduct the Study
4. Collect the data and compute the test statistic and the p-value.
Draw Conclusions
5. Evaluate the p-value and determine whether or not to reject the null hypothesis. Summarize the results and state a managerial conclusion in the context of the problem.

Example (we'll look at the same example as the last post, also reviewed at the beginning of Lecture 7):
A phone industry manager thinks that customer monthly cell phone bills have increased and now average over $52 per month. The company asks you to test this claim. The population standard deviation, σ, is known to be equal to 10 from historical data.

The Hypotheses
1. H₀: μ ≤ 52
H₁: μ > 52
Study Design
2. After consulting with the manager and discussing error risk, we choose a level of significance, α, of 0.10. Our resources allow us to sample 64 cell phone bills.
3. Since our hypothesis involves the population mean and we know the population standard deviation, our test statistic is z and follows the normal distribution.
The Study
4. We conduct our study and find that the mean of the 64 sample cell phone bills is 53.1. We compute the test statistic, z = (xbar-μ)/(σ/√n) = (53.1-52)/(10/√64) = 0.88. Next, we look up the p-value for z = 0.88. The cumulative normal distribution table tells us that the area to the left of 0.88 is 0.8106. Since this is a right-tail test, the p-value is the area to the right: 1-0.8106 = 0.1894.
Conclusions
5. Since 0.1894 is greater than the level of significance, α = 0.10, we do not reject the null hypothesis. We report to the company that, based on our testing, there is not sufficient evidence that the mean cell phone bill has increased to more than $52 per month.
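The arithmetic in steps 4 and 5 can be reproduced without a table. This is a small Python sketch using the numbers from the example; NormalDist from the standard library stands in for the cumulative normal distribution table, and the variable names are my own.

```python
import math
from statistics import NormalDist

mu_0 = 52      # hypothesized mean from H0
sigma = 10     # known population standard deviation
n = 64         # sample size
x_bar = 53.1   # sample mean
alpha = 0.10   # level of significance

# Step 4: compute the test statistic and its p-value.
z = (x_bar - mu_0) / (sigma / math.sqrt(n))
area_left = NormalDist().cdf(z)   # what the cumulative table gives us
p_value = 1 - area_left           # right-tail test, so take the area to the right

# Step 5: reject H0 only if the p-value is less than alpha.
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```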

Hypothesis Testing - Critical Value Approach - 6 Step Methodology

2008-02-24T15:44:00.004-06:00

The six-step methodology of the Critical Value Approach to hypothesis testing is as follows:
(Note: The methodology below works equally well for both one-tail and two-tail hypothesis testing.)

State the Hypotheses
1. State the null hypothesis, H₀, and the alternative hypothesis, H₁.
Design the Study
2. Choose the level of significance, α, according to the importance of the risk of committing Type I errors. Determine the sample size, n, based on the resources available to collect the data.
3. Determine the test statistic and sampling distribution. When the hypotheses involve the population mean, μ, the test statistic is z when σ is known and t when σ is not known. These test statistics follow the normal distribution and the t-distribution respectively.
4. Determine the critical values that divide the rejection and non-rejection regions.
Note: For ethical reasons, the level of significance and critical values should be determined prior to conducting the test. The test should be designed so that the predetermined values do not influence the test results.
Conduct the Study
5. Collect the data and compute the test statistic.
Draw Conclusions
6. Evaluate the test statistic and determine whether or not to reject the null hypothesis. Summarize the results and state a managerial conclusion in the context of the problem.

Example (reviewed at the beginning of Lecture 7):
A phone industry manager thinks that customer monthly cell phone bills have increased and now average over $52 per month. The company asks you to test this claim. The population standard deviation, σ, is known to be equal to 10 from historical data.

The Hypotheses
1. H₀: μ ≤ 52
H₁: μ > 52
Study Design
2. After consulting with the manager and discussing error risk, we choose a level of significance, α, of 0.10. Our resources allow us to sample 64 cell phone bills.
3. Since our hypothesis involves the population mean and we know the population standard deviation, our test statistic is z and follows the normal distribution.
4. In determining the critical value, we first recognize this as a one-tail test, since the alternative hypothesis is directional (μ > 52). Therefore the rejection region is entirely on the side of the distribution greater than the historic mean - the right tail.
We want to determine the z-value for which the area to the right is 0.10, our α. Using the cumulative normal distribution table (which gives areas to the left of the z-value), we look for a cumulative area of 0.90 and find z ≈ 1.285. This is our critical value.
The Study
5. We conduct our study and find that the mean of the 64 sample cell phone bills is 53.1. We compute the test statistic, z = (xbar-μ)/(σ/√n) = (53.1-52)/(10/√64) = 0.88.
Conclusions
6. Since 0.88 is less than the critical value of 1.285, we do not reject the null hypothesis. We report to the company that, based on our testing, there is not sufficient evidence that the mean cell phone bill has increased to more than $52 per month.
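Here is a short Python sketch of steps 4 through 6 for this example. Instead of reading the table, it computes the critical value with NormalDist's inverse cumulative function, which gives about 1.2816 rather than the 1.285 read from the table; the conclusion is the same either way. The variable names are my own.

```python
import math
from statistics import NormalDist

mu_0 = 52      # hypothesized mean from H0
sigma = 10     # known population standard deviation
n = 64         # sample size
x_bar = 53.1   # sample mean
alpha = 0.10   # level of significance

# Step 4: critical value for a right-tail test.
# We need the z-value with area 1 - alpha to its left.
z_crit = NormalDist().inv_cdf(1 - alpha)

# Step 5: compute the test statistic.
z = (x_bar - mu_0) / (sigma / math.sqrt(n))

# Step 6: reject H0 only if z falls in the rejection region (right tail).
reject_h0 = z > z_crit

print(f"critical value = {z_crit:.4f}, z = {z:.2f}, reject H0: {reject_h0}")
```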