Statistical inference about means and proportions with two populations seems to be one of the most commonly used applications in the field of analytics: comparing campaign response rates between 2 groups of customers, pre- and post-campaign sales, membership renewal rates, etc.
Call it chance or whatever, but whenever these kinds of tasks come up I hear people talking about t-tests only. That is fine as long as you want to compare means, that is, when your target variable is a continuous value. But how or why do people talk about the t-test when they want to compare ratios or proportions? Whatever happened to the Chi-Square test or the Z-test for difference in proportions?
I did a bit of research on the net, a bit of calculation using pen and paper [a very good exercise for the brain in this age of calculators and spreadsheets :) ], read a very good article by Gerard E. Dallal, and I found the answers.
Going back to our introductory class in statistics, let's check out the formulae for the t-tests.
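1. Assuming that the population variances are equal,
$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \qquad \text{(Equation 1)}$$
where
$\bar{X}_1, \bar{X}_2$ = means of samples 1 and 2
$n_1, n_2$ = sizes of samples 1 and 2
$S_p^2$ = pooled variance = $\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$, where $S_1^2$ and $S_2^2$ are the variances of samples 1 and 2
2. Assuming that the population variances are not equal,
$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \qquad \text{(Equation 2)}$$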
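To make the two formulas concrete, here is a minimal sketch in Python that computes both t statistics with only the standard library. The function names `pooled_t` and `unpooled_t` and the sales figures are made up for illustration; the code returns the test statistic only, not a p-value.

```python
import math
from statistics import mean, variance


def pooled_t(x1, x2):
    """Equation 1: t statistic assuming equal population variances."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


def unpooled_t(x1, x2):
    """Equation 2: t statistic without assuming equal population variances."""
    n1, n2 = len(x1), len(x2)
    return (mean(x1) - mean(x2)) / math.sqrt(variance(x1) / n1 + variance(x2) / n2)


# Hypothetical sales figures for two customer groups (illustrative data only).
group_a = [12.0, 15.5, 11.2, 14.8, 13.1, 16.0]
group_b = [10.1, 12.3, 9.8, 11.7, 10.9, 12.5]

print("pooled t:  ", pooled_t(group_a, group_b))
print("unpooled t:", unpooled_t(group_a, group_b))
```

When the two samples have the same size, as in this example, the two statistics coincide; they differ only when the sample sizes differ.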
We have also been taught that the test statistic Z is used to test for a difference between two population proportions, based on the difference between the two sample proportions (P1 - P2).
The formula for the Z statistic is given by
$$Z = \frac{P_1 - P_2}{\sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \qquad \text{(Equation 3)}$$
where
$P_1, P_2$ = proportions of success (or target category) in samples 1 and 2
$n_1, n_2$ = sizes of samples 1 and 2
$P$ = pooled estimate of the sample proportion of successes = $(X_1 + X_2) / (n_1 + n_2)$
$X_1, X_2$ = number of successes (or target category) in samples 1 and 2
$S_1^2, S_2^2$ = variances of samples 1 and 2 (these appear in Equations 1 and 2, not in Equation 3)
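As a sketch of Equation 3 in code, the snippet below computes the Z statistic and a two-sided p-value from the standard normal distribution. The function name `two_proportion_z` and the response counts are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Equation 3: pooled two-proportion z statistic and two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled estimate of the proportion of successes across both samples.
    p = (x1 + x2) / (n1 + n2)
    z = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical campaign: 120 of 1000 responders in group 1, 90 of 1000 in group 2.
z, p_value = two_proportion_z(120, 1000, 90, 1000)
print("z =", z, "p-value =", p_value)
```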
The test statistic Z (Equation 3) is equivalent to the chi-square test of homogeneity of proportions applied to the 2×2 table of outcomes by sample.
But how different are proportions from means? The proportion having the desired outcome is the number of individuals/observations with the outcome divided by the total number of individuals/observations. Suppose we create a variable that equals 1 if the subject has the outcome and 0 if not. The proportion of individuals/observations with the outcome is the mean of this variable, because the sum of these 0s and 1s is the number of individuals/observations with the outcome.
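Let's suppose there are m 1s and (n - m) 0s among the n observations. Then the mean is $\bar{X} = P = m/n$, and the deviation $X_i - \bar{X}$ equals $(1 - m/n)$ for m observations and $(0 - m/n)$ for the other (n - m) observations. Combining these, the sum of squared deviations is
$$\begin{aligned}
\sum_i (X_i - \bar{X})^2 &= m\left(1 - \frac{m}{n}\right)^2 + (n - m)\left(0 - \frac{m}{n}\right)^2 \\
&= m\left(1 - \frac{2m}{n} + \frac{m^2}{n^2}\right) + (n - m)\frac{m^2}{n^2} \\
&= m - \frac{2m^2}{n} + \frac{m^3}{n^2} + \frac{m^2}{n} - \frac{m^3}{n^2} \\
&= m - \frac{m^2}{n} \\
&= m\left(1 - \frac{m}{n}\right) \\
&= nP(1 - P)
\end{aligned}$$
So, variance = $\sum_i (X_i - \bar{X})^2 / n = P(1 - P)$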
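The derivation above can be checked numerically. The sketch below builds a 0/1 variable with made-up values n = 20 and m = 6, then compares its mean with m/n, its sum of squared deviations with nP(1 - P), and its variance (dividing by n, as in the derivation) with P(1 - P).

```python
import math
from statistics import mean, pvariance

n, m = 20, 6  # illustrative values: 6 ones among 20 observations
outcomes = [1] * m + [0] * (n - m)

# The mean of the 0/1 variable is the proportion of 1s.
p = mean(outcomes)

# Sum of squared deviations from the mean, compared with n * P * (1 - P).
sum_sq = sum((x - p) ** 2 for x in outcomes)

# pvariance divides by n, matching the derivation (not by n - 1).
var = pvariance(outcomes)

print("mean equals m/n:           ", math.isclose(p, m / n))
print("sum of squares = nP(1 - P):", math.isclose(sum_sq, n * p * (1 - p)))
print("variance = P(1 - P):       ", math.isclose(var, p * (1 - p)))
```

Note that the sample variance used in the t-test formulas divides by n - 1 rather than n, so it equals nP(1 - P)/(n - 1), which is slightly larger for small samples.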
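Substituting this into Equation 3 (for the Z statistic), we get
$$Z = \frac{P_1 - P_2}{\sqrt{\frac{\text{Variance}}{n_1} + \frac{\text{Variance}}{n_2}}}$$
which has the same form as the t statistic. With a single pooled variance P(1 - P) it mirrors Equation 1, and if each sample's own variance, P1(1 - P1) and P2(1 - P2), is used instead, it is not so different from Equation 2 (the formula for the "equal variances not assumed" version of the t-test).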
As long as the sample size is relatively large, the distributional assumptions are met, and the response is binomial, the t-test and the z-test will give p-values that are very close to one another.
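A rough way to see this is to code a binary response as 0s and 1s, run it through the unpooled t formula, and compare the result with the pooled Z statistic. The sketch below does that for a few sample sizes. The helper names and the 30% versus 20% response rates are assumptions made for illustration, and only the test statistics are compared; p-values for the t statistic would need a t distribution, which the standard library does not provide.

```python
import math
from statistics import mean, variance


def z_stat(x1, n1, x2, n2):
    """Pooled two-proportion z statistic (Equation 3)."""
    p = (x1 + x2) / (n1 + n2)
    return (x1 / n1 - x2 / n2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


def unpooled_t_stat(a, b):
    """Unpooled t statistic (Equation 2) on raw observations."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def as_binary(successes, n):
    """Expand a success count into a list of 1s and 0s."""
    return [1] * successes + [0] * (n - successes)


results = []
for n in (20, 200, 2000):
    # Hypothetical response rates: 30% in group 1, 20% in group 2.
    x1, x2 = 3 * n // 10, 2 * n // 10
    z = z_stat(x1, n, x2, n)
    t = unpooled_t_stat(as_binary(x1, n), as_binary(x2, n))
    results.append((n, z, t))
    print(f"n = {n:5d}  z = {z:.4f}  t = {t:.4f}")
```

The two statistics are not identical, because one uses the pooled variance P(1 - P) and the other uses each sample's own variance with an n - 1 divisor, but in this example they stay within a few percent of each other.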
And in the case where we have only two categories, the z-test and the chi-square test (without a continuity correction) turn out to be exactly equivalent, though the chi-square test is by nature two-tailed. The chi-square distribution with 1 df is just the distribution of the square of a standard normal (z) variable.
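The equivalence can be checked directly. The sketch below computes the Pearson chi-square statistic for a 2×2 table without a continuity correction and compares it with the square of the pooled Z statistic. The function names and the counts are hypothetical.

```python
import math


def z_stat(x1, n1, x2, n2):
    """Pooled two-proportion z statistic (Equation 3)."""
    p = (x1 + x2) / (n1 + n2)
    return (x1 / n1 - x2 / n2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


def chi_square_2x2(x1, n1, x2, n2):
    """Pearson chi-square statistic for a 2x2 table, no continuity correction."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    row_totals = [n1, n2]
    col_totals = [x1 + x2, (n1 - x1) + (n2 - x2)]
    total = n1 + n2
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the hypothesis of equal proportions.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts: 120 of 1000 successes versus 90 of 1000.
z = z_stat(120, 1000, 90, 1000)
chi2 = chi_square_2x2(120, 1000, 90, 1000)

# Upper-tail probability of a chi-square variable with 1 df, and the
# two-sided p-value of a standard normal variable.
p_chi2 = math.erfc(math.sqrt(chi2 / 2))
p_z = math.erfc(abs(z) / math.sqrt(2))

print("chi2 equals z squared:", math.isclose(chi2, z ** 2))
print("p-values agree:       ", math.isclose(p_chi2, p_z))
```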
The various tests and their assumptions, as listed in Wikipedia, are given below:
1. Two-sample pooled t-test, equal variances
(Normal populations or n1 + n2 > 40) and independent observations and σ1 = σ2 and (σ1 and σ2 unknown)
2. Two-sample unpooled t-test, unequal variances
(Normal populations or n1 + n2 > 40) and independent observations and σ1 ≠ σ2 and (σ1 and σ2 unknown)
3. Two-proportion z-test, equal variances
n1 p1 > 5 and n1(1 - p1) > 5 and n2 p2 > 5 and n2(1 - p2) > 5 and independent observations
4. Two-proportion z-test, unequal variances
n1 p1 > 5 and n1(1 - p1) > 5 and n2 p2 > 5 and n2(1 - p2) > 5 and independent observations
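The sample-size condition for the two-proportion z-tests is easy to check before running the test. The sketch below is one way to do it; the function name `z_test_counts_ok` is made up, and it works from the success counts directly, since n1 p1 is simply the number of successes in sample 1 and n1(1 - p1) the number of failures. Independence of the observations cannot be checked this way and still has to be judged from how the data were collected.

```python
def z_test_counts_ok(x1, n1, x2, n2, threshold=5):
    """Rule-of-thumb check: every success and failure count must exceed the threshold."""
    counts = [
        x1,       # n1 * p1: successes in sample 1
        n1 - x1,  # n1 * (1 - p1): failures in sample 1
        x2,       # n2 * p2: successes in sample 2
        n2 - x2,  # n2 * (1 - p2): failures in sample 2
    ]
    return all(count > threshold for count in counts)


# Hypothetical examples.
print(z_test_counts_ok(120, 1000, 90, 1000))  # all four counts are well above 5
print(z_test_counts_ok(3, 50, 10, 50))        # only 3 successes in sample 1
```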