Conditions for inference
The conditions we need for inference, whether on a proportion or a mean, are:
- Random: The data needs to come from a random sample or randomized experiment.
- Independent: Individual observations need to be independent. If sampling without replacement, our sample size shouldn’t be more than 10% of the population.
- Normal: The sampling distribution of \(\hat{p}\) (for inference on a proportion) or \(\bar{x}\) (for inference on a mean) needs to be approximately normal.
- For inference on a proportion, we need at least 10 expected successes and 10 expected failures (the large counts condition).
- For inference on a mean, this is true if any of these 3 cases holds:
- Case 1: the parent population is normal
- Case 2: the sample is reasonably large (\(n \geq 30\))
- Case 3: for a small sample (\(n < 30\)), the sample data are roughly symmetric and don’t show outliers or strong skew
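As a quick illustration, here is a minimal Python sketch of the two numeric rules of thumb above; the function names and example numbers are made up for demonstration, not from any standard library:

```python
def large_counts_ok(p0: float, n: int) -> bool:
    """Normal condition for a proportion: at least 10 expected successes and failures."""
    return n * p0 >= 10 and n * (1 - p0) >= 10

def ten_percent_ok(n: int, population_size: int) -> bool:
    """Independence rule of thumb when sampling without replacement."""
    return n <= 0.10 * population_size

print(large_counts_ok(0.3, 50))   # 15 expected successes, 35 failures -> True
print(ten_percent_ok(50, 400))    # 50 > 10% of 400 -> False
```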
One-sample inference
Confidence intervals
Confidence intervals about a proportion
- Define the confidence level, e.g. 95%, and get the corresponding critical value \(z^*\)
- Calculate the confidence interval:
\(CI = \hat{p} \pm z^* \sigma_{\hat{p}} \approx \hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\)
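A minimal sketch of this recipe in Python with SciPy, using hypothetical data (56 successes out of 120):

```python
from scipy import stats

n, successes = 120, 56                   # hypothetical sample
p_hat = successes / n

z_star = stats.norm.ppf(0.975)           # critical value z* for 95% confidence
se = (p_hat * (1 - p_hat) / n) ** 0.5    # estimate of sigma_p_hat
print((p_hat - z_star * se, p_hat + z_star * se))
```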
Confidence intervals about a mean
- Define the confidence level, e.g. 95%, and get the corresponding critical value \(t^*\)
- Calculate the confidence interval:
\(CI = \bar{x} \pm t^* \sigma_{\bar{x}} \approx \bar{x} \pm t^* \frac{s_x}{\sqrt{n}}\)
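The same recipe for a mean, sketched with a hypothetical sample; the t critical value uses \(n - 1\) degrees of freedom:

```python
import numpy as np
from scipy import stats

x = np.array([4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6])  # hypothetical sample
n = len(x)
x_bar, s_x = x.mean(), x.std(ddof=1)   # sample mean and sample standard deviation

t_star = stats.t.ppf(0.975, df=n - 1)  # critical value t* for 95% confidence
se = s_x / np.sqrt(n)
print((x_bar - t_star * se, x_bar + t_star * se))
```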
Significance tests
Significance tests about a proportion | Significance tests about a mean |
---|---|
\(\begin{align} H_0 &: p = p_0 \\ H_a &: p \neq p_0 \; & \text{(two-tailed)} \\ H_a &: p > p_0 \; & \text{(upper-tailed)} \\ H_a &: p < p_0 \; & \text{(lower-tailed)} \end{align}\) | \(\begin{align} H_0 &: \mu = \mu_0 \\ H_a &: \mu \neq \mu_0 \; & \text{(two-tailed)} \\ H_a &: \mu > \mu_0 \; & \text{(upper-tailed)} \\ H_a &: \mu < \mu_0 \; & \text{(lower-tailed)} \end{align}\) |
\(z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}\) | \(t = \frac{\bar{x} - \mu_0}{\frac{s_x}{\sqrt{n}}}\) |
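Both tests in the table can be sketched in a few lines of Python; the data and null values below are hypothetical:

```python
import numpy as np
from scipy import stats

# One-proportion z-test: H0: p = 0.5, with 62 successes out of 100.
p0, n, successes = 0.5, 100, 62
p_hat = successes / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
print(z, 2 * stats.norm.sf(abs(z)))        # z-statistic and two-tailed p-value

# One-sample t-test: H0: mu = 5.0.
x = np.array([4.1, 5.0, 4.7, 5.3, 4.4, 4.9, 5.1, 4.6])
t_stat, p_value = stats.ttest_1samp(x, popmean=5.0)   # two-tailed by default
print(t_stat, p_value)
```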
When to use z or t statistics in significance tests?
For significance tests about a proportion,
\[z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}}\]
with \(\sigma_{\hat{p}} = \sqrt{\frac{p_0(1 - p_0)}{n}}\)
Because the null hypothesis specifies \(p_0\), we can compute \(\sigma_{\hat{p}}\) directly and get the z-statistic without any issues.
When we are making confidence intervals or doing significance tests about a mean,
\[z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}}\]
we need to know the standard deviation of \(\bar{x}\):
\[\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\]
We usually don’t know the population standard deviation \(\sigma\), so we substitute the sample standard deviation \(s_x\) as an estimate for \(\sigma\); the resulting estimate \(\frac{s_x}{\sqrt{n}}\) is called the standard error of \(\bar{x}\):
\[\sigma_{\bar{x}} \approx \frac{s_x}{\sqrt{n}}\]
We then use a t-statistic:
\[t = \frac{\bar{x} - \mu_0}{\frac{s_x}{\sqrt{n}}}\]
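To see why the substitution matters, compare the 95% critical values: for small samples \(t^*\) is noticeably larger than \(z^*\), compensating for the extra uncertainty from estimating \(\sigma\) with \(s_x\). A quick check in Python:

```python
from scipy import stats

print(stats.norm.ppf(0.975))                # z* for 95% confidence, about 1.96
for n in (5, 15, 30, 100):
    print(n, stats.t.ppf(0.975, df=n - 1))  # t* approaches z* as n grows
```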
Inference comparing two groups or populations (two-sample inference)
To find out whether there is a difference between two populations, we can construct a confidence interval to estimate the difference, or run a significance test to check whether the observed difference is statistically significant.
Confidence intervals
If the confidence interval contains 0, then \(P \geq \alpha\) and we fail to reject \(H_0\);
If the confidence interval doesn’t contain 0, then \(P < \alpha\) and we reject \(H_0\).
(This correspondence holds for a two-tailed test when the confidence level equals \(1 - \alpha\), e.g. a 95% interval and \(\alpha = 0.05\).)
Confidence intervals for the difference between two proportions
- Define the confidence level, e.g. 95%, and get the corresponding critical value \(z^*\)
- Calculate the confidence interval for \(p_1 - p_2\):
\(CI = (\hat{p}_1 - \hat{p}_2) \pm z^* \sigma_{\hat{p}_1 - \hat{p}_2} \approx (\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}\)
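A sketch of this interval with hypothetical counts (45/150 vs. 30/130 successes):

```python
import numpy as np
from scipy import stats

n1, x1 = 150, 45                  # hypothetical sample 1
n2, x2 = 130, 30                  # hypothetical sample 2
p1, p2 = x1 / n1, x2 / n2

z_star = stats.norm.ppf(0.975)    # 95% confidence
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff = p1 - p2
print((diff - z_star * se, diff + z_star * se))
```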
Confidence intervals for the difference between two means
- Define the confidence level, e.g. 95%, and get the corresponding critical value \(t^*\)
- Calculate the confidence interval for \(\mu_1 - \mu_2\):
\(CI = (\bar{x}_1 - \bar{x}_2) \pm t^* \sigma_{\bar{x}_1 - \bar{x}_2} \approx (\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)
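A sketch for the difference of means, using hypothetical samples; the degrees of freedom below use the conservative choice \(\min(n_1, n_2) - 1\) (software often uses the more exact Welch approximation instead):

```python
import numpy as np
from scipy import stats

a = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.5])  # hypothetical group 1
b = np.array([10.8, 11.6, 10.9, 11.2, 11.5])        # hypothetical group 2
n1, n2 = len(a), len(b)
s1, s2 = a.std(ddof=1), b.std(ddof=1)

se = np.sqrt(s1**2 / n1 + s2**2 / n2)
t_star = stats.t.ppf(0.975, df=min(n1, n2) - 1)     # conservative df
diff = a.mean() - b.mean()
print((diff - t_star * se, diff + t_star * se))
```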
Testing the difference
Testing the difference between two proportions | Testing the difference between two means |
---|---|
\(\begin{align} H_0 &: p_1 - p_2 = 0 \\ H_a &: p_1 - p_2 \neq 0 \; & \text{(two-tailed)} \\ H_a &: p_1 - p_2 > 0 \; & \text{(upper-tailed)} \\ H_a &: p_1 - p_2 < 0 \; & \text{(lower-tailed)} \end{align}\) | \(\begin{align} H_0 &: \mu_1 - \mu_2 = 0 \\ H_a &: \mu_1 - \mu_2 \neq 0 \; & \text{(two-tailed)} \\ H_a &: \mu_1 - \mu_2 > 0 \; & \text{(upper-tailed)} \\ H_a &: \mu_1 - \mu_2 < 0 \; & \text{(lower-tailed)} \end{align}\) |
\(\begin{align} z & = \frac{\hat{p}_1 - \hat{p}_2}{\sigma_{\hat{p}_1 - \hat{p}_2}} \\ & \approx \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_{1 \cup 2}(1 - \hat{p}_{1 \cup 2}) (\frac{1}{n_1} + \frac{1}{n_2})}} \end{align}\) | \(\begin{align} t & = \frac{\bar{x}_1 - \bar{x}_2}{\sigma_{\bar{x}_1 - \bar{x}_2}} \\ & \approx \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \end{align}\) |
where \(\hat{p}_{1 \cup 2}\) is the pooled proportion: the total number of successes in both samples divided by \(n_1 + n_2\).
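Both tests, sketched with the same hypothetical data as the confidence intervals above; SciPy’s `ttest_ind` with `equal_var=False` performs the unpooled (Welch) t-test matching the formula in the table:

```python
import numpy as np
from scipy import stats

# Two-proportion z-test using the pooled proportion p_hat_{1 u 2}.
n1, x1 = 150, 45
n2, x2 = 130, 30
p1, p2 = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)
z = (p1 - p2) / np.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
print(z, 2 * stats.norm.sf(abs(z)))                 # two-tailed p-value

# Two-sample t-test without assuming equal variances.
a = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.5])
b = np.array([10.8, 11.6, 10.9, 11.2, 11.5])
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(t_stat, p_value)
```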
P-value and significance
Once we have the z-statistic or t-statistic, we can calculate the corresponding p-value.
The \(P \text{-value}\) (probability value) is the probability of getting a statistic at least as extreme as the one we observed, assuming the null hypothesis is true.
z-statistic | t-statistic |
---|---|
\(\begin{align} P \text{-value} & = 2 \cdot Pr(Z \geq |z|) \; & \text{(two-tailed)} \\ P \text{-value} & = Pr(Z \geq z) \; & \text{(upper-tailed)} \\ P \text{-value} & = Pr(Z \leq z) \; & \text{(lower-tailed)} \end{align}\) | \(\begin{align} P \text{-value} & = 2 \cdot Pr(T \geq |t|) \; & \text{(two-tailed)} \\ P \text{-value} & = Pr(T \geq t) \; & \text{(upper-tailed)} \\ P \text{-value} & = Pr(T \leq t) \; & \text{(lower-tailed)} \end{align}\) |
A hypothesis test is significant if the \(P \text{-value}\) is less than the significance level \(\alpha\); in that case we reject \(H_0\).
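These tail probabilities are exactly what `scipy.stats` computes; a sketch with hypothetical statistics (df is \(n - 1\) for a one-sample t-test):

```python
from scipy import stats

alpha = 0.05
z = 2.1                             # hypothetical observed z-statistic
print(2 * stats.norm.sf(abs(z)))    # two-tailed
print(stats.norm.sf(z))             # upper-tailed
print(stats.norm.cdf(z))            # lower-tailed

t, df = 2.1, 14                     # hypothetical t-statistic and degrees of freedom
p = 2 * stats.t.sf(abs(t), df=df)   # two-tailed
print(p, p < alpha)                 # significant at the 5% level?
```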