Means and Proportions with two populations
Call it chance or whatever, but whenever this kind of task comes up, I hear people talking only about t-tests. That is no issue as long as you want to compare means, or when your target variable is continuous. But how or why do people talk about the t-test when they want to compare ratios or proportions? Whatever happened to the chi-square test or the z-test for a difference in proportions?
I did a bit of research on the net, a bit of calculation using pen and paper [very good exercise for the brain in this age of calculators and spreadsheets :-) ], read a very good article by Gerard E. Dallal, and I found the answers.
Going back to our introductory class in statistics, let’s check out the formulae for the t-tests.
1. Assuming that the population variances are equal,
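$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \tag{Equation 1}$$
where
$\bar{X}_1, \bar{X}_2$ = means of samples 1 and 2
$n_1, n_2$ = sizes of samples 1 and 2
$S_p^2$ = pooled variance = $\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$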
2. Assuming that the population variances are not equal,
$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \tag{Equation 2}$$
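The two t statistics above can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names and the measurements are hypothetical and not from the original post.

```python
import math
from statistics import mean, variance  # variance() divides by n - 1

def pooled_t(sample1, sample2):
    """Equation 1: t statistic assuming equal population variances."""
    n1, n2 = len(sample1), len(sample2)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    return (mean(sample1) - mean(sample2)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

def unpooled_t(sample1, sample2):
    """Equation 2: t statistic without the equal-variance assumption."""
    n1, n2 = len(sample1), len(sample2)
    se = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    return (mean(sample1) - mean(sample2)) / se

# Hypothetical measurements, for illustration only
a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
b = [4.2, 4.8, 5.0, 4.4, 4.9, 4.6]
print(pooled_t(a, b), unpooled_t(a, b))
```

Because the two hypothetical samples have the same size, the two statistics coincide here; the tests then differ only in their degrees of freedom.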
We have also been taught that the test statistic Z is used to determine the difference between two population proportions based on the difference between the two sample proportions (P1 – P2).
And the formula for the Z statistic is given by
$$Z = \frac{P_1 - P_2}{\sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \tag{Equation 3}$$
where
$P_1, P_2$ = proportions of successes (the target category) in samples 1 and 2
$n_1, n_2$ = sizes of samples 1 and 2
$P$ = pooled estimate of the proportion of successes = $(X_1 + X_2)/(n_1 + n_2)$
$X_1, X_2$ = numbers of successes (the target category) in samples 1 and 2
$S_1^2, S_2^2$ = sample variances of samples 1 and 2 (these appear in Equations 1 and 2, not in Equation 3)
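As a minimal sketch, Equation 3 can be computed directly from the success counts. The function names and the counts below are hypothetical; the two-sided p-value uses the standard normal distribution from Python's standard library.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Equation 3: z statistic using the pooled proportion P."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled estimate of the proportion of successes
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: 60 successes out of 200 versus 45 out of 200
z = two_proportion_z(60, 200, 45, 200)
print(z, two_sided_p(z))
```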
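The test statistic Z (Equation 3) is equivalent to the chi-square test of homogeneity of proportions for a 2×2 table: the square of Z equals the Pearson chi-square statistic. (This is not the same thing as the chi-square goodness-of-fit test, which compares one sample against a fully specified distribution.)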
But how different are proportions from means? The proportion having the desired outcome is the number of individuals/observations with the outcome divided by the total number of individuals/observations. Suppose we create a variable that equals 1 if the subject has the outcome and 0 if not. The proportion of individuals/observations with the outcome is the mean of this variable, because the sum of these 0s and 1s is the number of individuals/observations with the outcome.
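Let's suppose there are m 1s and (n − m) 0s among the n observations. Then the mean is $\bar{X} = P = m/n$, and the deviation $X_i - \bar{X}$ equals $1 - m/n$ for the m observations with the outcome and $0 - m/n$ for the other $n - m$ observations. Combining these, the sum of squared deviations is
$$\begin{aligned}
\sum (X_i - \bar{X})^2 &= m\left(1 - \frac{m}{n}\right)^2 + (n - m)\left(0 - \frac{m}{n}\right)^2 \\
&= m\left(1 - \frac{2m}{n} + \frac{m^2}{n^2}\right) + (n - m)\frac{m^2}{n^2} \\
&= m - \frac{2m^2}{n} + \frac{m^3}{n^2} + \frac{m^2}{n} - \frac{m^3}{n^2} \\
&= m - \frac{m^2}{n} \\
&= m\left(1 - \frac{m}{n}\right) \\
&= nP(1 - P)
\end{aligned}$$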
So, variance = $\sum (X_i - \bar{X})^2 / n = P(1 - P)$
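The identity is easy to check numerically. The sketch below builds a 0/1 variable with m ones out of n observations (the helper name and the values of m and n are made up for illustration) and compares the sum of squared deviations with nP(1 − P).

```python
def indicator_variance(m, n):
    """Return (P, sum of squared deviations, variance) for m ones and n - m zeros."""
    x = [1] * m + [0] * (n - m)
    p = sum(x) / n                        # the mean of the 0/1 variable is the proportion
    ss = sum((xi - p) ** 2 for xi in x)   # sum of squared deviations from the mean
    return p, ss, ss / n                  # variance with the divisor n, as in the text

# Hypothetical example: 30 observations with the outcome out of 100
p, ss, var = indicator_variance(30, 100)
print(p, ss, 100 * p * (1 - p))  # ss should match n * P * (1 - P) up to rounding
print(var, p * (1 - p))          # variance should match P * (1 - P) up to rounding
```

Note that this uses the divisor n. The usual sample variance divides by n − 1, which gives nP(1 − P)/(n − 1); for large n the difference is negligible.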
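Substituting this into Equation 3 (the Z statistic), we get
$$Z = \frac{P_1 - P_2}{\sqrt{\dfrac{P(1 - P)}{n_1} + \dfrac{P(1 - P)}{n_2}}}$$
which has the same structure as the t statistics. With the pooled variance P(1 − P) in both terms it mirrors Equation 1; if each sample's own variance, P1(1 − P1) and P2(1 − P2), is used instead, it mirrors Equation 2 (the "equal variances not assumed" version of the t-test).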
As long as the sample size is relatively large, the distributional assumptions are met, and the response is binomial – the t test and the z test will give p-values that are very close to one another.
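To see how close the two statistics get, here is a rough sketch that codes each observation as 0 or 1, computes the unpooled t statistic on those values, and compares it with the pooled z statistic. The group sizes and success counts are hypothetical.

```python
import math
from statistics import mean, variance

def unpooled_t(sample1, sample2):
    """t statistic without the equal-variance assumption (n - 1 variances)."""
    n1, n2 = len(sample1), len(sample2)
    se = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    return (mean(sample1) - mean(sample2)) / se

def pooled_z(x1, n1, x2, n2):
    """z statistic for two proportions with the pooled proportion."""
    p = (x1 + x2) / (n1 + n2)
    return (x1 / n1 - x2 / n2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

# Hypothetical data: 150 successes out of 500 versus 100 out of 400
x1, n1, x2, n2 = 150, 500, 100, 400
group1 = [1] * x1 + [0] * (n1 - x1)
group2 = [1] * x2 + [0] * (n2 - x2)

t = unpooled_t(group1, group2)
z = pooled_z(x1, n1, x2, n2)
print(t, z)  # the two statistics should differ only slightly
```

With samples this large the t reference distribution is nearly normal, so statistics this close lead to p-values that are close as well.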
And in the case where we have only two categories, the z-test and the chi-square test (without continuity correction) turn out to be exactly equivalent, though the chi-square test is by nature two-tailed. The chi-square distribution with 1 df is just the distribution of the square of a standard normal (z) variable.
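The equivalence can be verified on any 2×2 table. The sketch below, with hypothetical counts and function names, computes the Pearson chi-square statistic (no continuity correction) from observed and expected counts and compares it with the square of the pooled z statistic.

```python
import math

def pooled_z(x1, n1, x2, n2):
    """z statistic for two proportions with the pooled proportion."""
    p = (x1 + x2) / (n1 + n2)
    return (x1 / n1 - x2 / n2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

def pearson_chi_square(x1, n1, x2, n2):
    """Pearson chi-square for the 2x2 table, without continuity correction."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [x1 + x2, total - (x1 + x2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts: 60 successes out of 200 versus 45 out of 200
z = pooled_z(60, 200, 45, 200)
chi2 = pearson_chi_square(60, 200, 45, 200)
print(z ** 2, chi2)  # should agree up to floating-point rounding
```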
The various tests and their assumptions as listed in Wikipedia are given below:
1. Two-sample pooled t-test, equal variances
(Normal populations or n1 + n2 > 40) and independent observations and σ1 = σ2 and (σ1 and σ2 unknown)
2. Two-sample unpooled t-test, unequal variances
(Normal populations or n1 + n2 > 40) and independent observations and σ1 ≠ σ2 and (σ1 and σ2 unknown)
3. Two-proportion z-test, equal variances
n1 p1 > 5 and n1(1 − p1) > 5 and n2 p2 > 5 and n2(1 − p2) > 5 and independent observations
4. Two-proportion z-test, unequal variances
n1 p1 > 5 and n1(1 − p1) > 5 and n2 p2 > 5 and n2(1 − p2) > 5 and independent observations
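The sample-size conditions for the two-proportion z-test are simple to check before running the test. This is a small sketch with a hypothetical helper name; it plugs the sample proportions in for the unknown p1 and p2, which is an assumption on my part rather than something the list above spells out.

```python
def large_sample_ok(x1, n1, x2, n2, threshold=5):
    """Check n*p > threshold and n*(1 - p) > threshold in both samples.

    With the sample proportion p = x / n, n*p is the number of successes
    and n*(1 - p) is the number of failures.
    """
    counts = [x1, n1 - x1, x2, n2 - x2]
    return all(c > threshold for c in counts)

# Hypothetical counts
print(large_sample_ok(60, 200, 45, 200))  # every cell count is above 5
print(large_sample_ok(2, 50, 10, 50))     # only 2 successes in sample 1
```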
Other Posts by Romakanta Irungbam
The Keyword Tree - Spotfire, Data Visualization and Text Mining - February 23, 2011
What's behind your Tree? - December 17, 2010
Analytics: Reality and the Growing Interest - May 31, 2009
A Tale Of Two Banks and One Telecom Service Provider - May 18, 2009
Workforce Analytics - April 28, 2009