Module 2
What is a correlation coefficient?
A correlation coefficient measures the degree and direction of the relationship between two variables. Psychologists commonly use the Pearson product-moment correlation coefficient, or Pearson r.
Why?
If you have two variables, x and y, and want to determine whether there is a relationship between them, you would use a correlation coefficient. For example, suppose you want to know whether there is a relationship between the IQ of one identical twin and that of the other; a correlation coefficient can answer this. Here we would expect the direction of the correlation to be positive: if one twin in the pair had a high IQ, we would expect the other to have a high IQ as well, and if one twin had a low IQ, we would expect the other to have a low IQ, too.
INFORMATION
There is a famous saying in Psychology: Correlation does not imply causation. What this means is that just because there is a correlation between two variables, one cannot infer from this that one variable causes the other. In our IQ example above, obviously a high IQ in one twin doesn’t cause the other twin in the pair to have a high IQ.
Let’s take another illustration. For many years, it was known that people who smoked had a higher risk of developing lung cancer, that is, that there was a correlation between smoking and lung cancer. But the office of the Surgeon General couldn’t quite say yet that smoking causes lung cancer. Why not? The reason is that there were no controlled experiments showing a cause-and-effect relationship between smoking and lung cancer. To establish such a cause-and-effect relationship, you must have a controlled experiment, and no such long-term controlled experiments were available yet. (Now we do have such long-term controlled experiments.) So how else could we explain the correlation, the relationship, between smoking and lung cancer? Perhaps people who smoked engaged in other types of risky behavior, and that is what caused them to have lung cancer, not the smoking per se.
When we talk about correlations, we talk about (1) the direction and (2) the strength of the correlation coefficient.
(1) DIRECTION of a correlation
The direction of a correlation coefficient can be positive, negative or zero.
Positive correlation
In a positive correlation, as one variable increases, so does the other. Similarly, as one variable decreases, so does the other. On a scatter plot, a positive correlation is indicated by a best-fitting line with a positive slope, one that rises from left to right.
Example #1: Income and education have a positive correlation. People with higher incomes also tend to have more years of education. People with fewer years of education tend to have lower income.
Example #2: SAT scores and college achievement have a positive correlation. Among college students, those with higher SAT scores generally have higher GPAs.
Example #3: Number of hours studied and Regents scores have a positive correlation. Students who study more generally get higher scores on Regents exams than those who study less.
When we make a scatter plot, we don’t connect the dots. Instead, we draw the best-fitting straight line through the points.
Negative Correlation
A negative correlation is indicated by a best-fitting line with a negative slope, one that falls from left to right.
In a negative correlation, as one variable increases, the other decreases. Example: As the number of hours a student watches TV the night before a test increases, the score on the test usually decreases.
Zero Correlation
A zero correlation is indicated by a best-fitting line with zero slope, that is, a horizontal line.
It could also look like a scatter plot where the points are arranged in a seemingly random fashion, indicating a zero relationship between the two variables.
A vertical line has an undefined slope. Because one variable does not vary at all, it likewise indicates that no relationship exists between the two variables.
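The three directions can be seen numerically in a short Python sketch. The data below are invented purely for illustration (five hypothetical students), and `pearson_r` is a helper name assumed here; it applies the computational formula for r that this module presents later.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson r for two equal-length lists of paired values."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Hypothetical data for five students (invented for illustration only).
hours = [1, 2, 3, 4, 5]
score_after_studying = [55, 60, 70, 75, 90]  # rises as hours studied rise
score_after_tv = [90, 85, 70, 65, 50]        # falls as hours of TV rise
unrelated_score = [3, 1, 4, 1, 3]            # no linear trend with hours

r_positive = pearson_r(hours, score_after_studying)
r_negative = pearson_r(hours, score_after_tv)
r_zero = pearson_r(hours, unrelated_score)

print(f"positive: {r_positive:.2f}")  # about 0.98
print(f"negative: {r_negative:.2f}")  # about -0.99
print(f"zero:     {r_zero:.2f}")      # 0.00
```

The sign of each result matches the slope of the best-fitting line one would draw through the corresponding scatter plot.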
(2) STRENGTH of a correlation.
A correlation coefficient, r, ranges in value from -1.0 to +1.0. The sign gives the direction of the relationship, and the absolute value gives its strength: the closer r is to +1.0 or -1.0, the stronger the relationship between the two variables, and the closer r is to 0, the weaker it is. For example, the correlation in IQ between identical twins reared together in the same household is .8. This is a very strong relationship. The correlation in IQ between a parent and a foster child is .2, a low correlation.
REQUIREMENTS FOR CALCULATING THE PEARSON r
The two main requirements for calculating the Pearson r are that
(1) the sample of paired data (x, y) is a random sample; and
(2) the underlying relationship between x and y is a linear one. This means that a visual examination of the scatter plot reveals that the points approximate a straight line.
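Requirement (2) matters because the Pearson r only detects straight-line relationships. As a minimal sketch, the made-up data below follow a perfect curve (y = x²), yet r comes out as zero. The helper name `pearson_r` and the data are assumptions introduced for this illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson r for two equal-length lists of paired values."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Invented data: y is completely determined by x, but along a curve.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # [4, 1, 0, 1, 4]

r_curve = pearson_r(xs, ys)
print(f"r for a perfect U-shaped relationship: {r_curve:.2f}")  # 0.00
```

An r of zero here does not mean the variables are unrelated; it means there is no linear relationship, which is why the scatter plot should be examined before r is calculated.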
FORMULA TO CALCULATE THE PEARSON r
If these requirements are satisfied, one can proceed to calculate the Pearson r. These days, when we want to calculate the correlation coefficient, we can do it with Excel using two arrays, one for the X values and one for the Y values. But what formula is being used? The formula is given below:

$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$$
Where
Symbol | Meaning |
---|---|
n | represents the number of pairs of data present. |
Σ | denotes the addition of the items indicated. |
Σx | denotes the sum of all x-values. |
Σx² | indicates that each x-value should be squared and then those squares added. |
(Σx)² | indicates that the x-values should be added and the total then squared. It is extremely important to avoid confusing Σx² and (Σx)². |
Σxy | indicates that each x-value should first be multiplied by its corresponding y-value. After obtaining all such products, find their sum. |
r | represents the linear correlation coefficient for a sample. |
ρ | Greek letter rho, used to represent the linear correlation coefficient for a population. |
Notice that what we have here is the formula for calculating r, the sample correlation coefficient, because it is based on only a sample of data (x,y). We haven’t sampled the entire population! Had we been able to sample the entire population (which is usually impossible to do) what we would have is the population correlation coefficient, ρ.
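The quantities defined above translate directly into code. The following Python sketch is one possible way to compute the sample r from the sums n, Σx, Σy, Σxy, Σx² and Σy²; the function name `pearson_r`, its error handling, and the small test values are assumptions made for illustration. It also shows the difference between Σx² and (Σx)² on a tiny example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample Pearson r computed from the sums n, Σx, Σy, Σxy, Σx², Σy²."""
    if len(xs) != len(ys):
        raise ValueError("x and y must contain the same number of values")
    n = len(xs)                                   # number of pairs
    sum_x, sum_y = sum(xs), sum(ys)               # Σx and Σy
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # Σxy
    sum_x2 = sum(x * x for x in xs)               # Σx²: square, then add
    sum_y2 = sum(y * y for y in ys)               # Σy²
    numerator = n * sum_xy - sum_x * sum_y
    # sum_x ** 2 is (Σx)²: add first, then square.
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    if denominator == 0:
        raise ValueError("r is undefined when x or y does not vary")
    return numerator / denominator


# Σx² versus (Σx)² for the values 1, 2, 3.
values = [1, 2, 3]
sum_of_squares = sum(v * v for v in values)  # 1 + 4 + 9 = 14
square_of_sum = sum(values) ** 2             # (1 + 2 + 3)² = 36
print(sum_of_squares, square_of_sum)

# Points lying exactly on a rising straight line give r = +1.
r_line = pearson_r([1, 2, 3], [2, 4, 6])
print(round(r_line, 6))
```

Mixing up the two quantities changes the denominator, which is why the distinction is stressed above.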
EXAMPLE CALCULATING r
Let’s say we randomly select a group of students in a class and give them two quizzes, a social studies quiz and a math quiz. Suppose there are four students and their scores on the quizzes are as follows:
We want to find out the correlation between performance on one quiz and performance on the other. Do people who do better on one quiz necessarily do better on the other, or is the opposite true? Or perhaps there is no relationship between performance on one quiz and performance on the other?
Let’s see if we meet the criteria for doing a Pearson r. We meet requirement #1 because our sample was randomly selected. We meet requirement #2 because if we look at a scatter plot of our data (given below), we see the data approximate a straight line.
So, we can now proceed to calculate the Pearson r. We can either do this by entering our values into Excel or we can organize our data and perform the necessary calculations using a calculator as follows:
For our given sample of paired data, n = 4 because there are 4 students. We can now use the formula to evaluate r as follows:
What does r = -0.956 mean? It means there is a strong negative correlation between students' scores on one quiz and their scores on the other. Otherwise stated, students in this class who do well on the social studies quiz tend to do poorly on the math quiz, and those who do well on the math quiz tend to do poorly on the social studies quiz.
In cases where we have not set up a true experiment, the correlation coefficient can be used to answer interesting questions about real-world relationships. We consider now an example taken from the nuptials section of the New York Times.
Weddings Activity
Below are wedding announcements from the Sunday New York Times of March 18, 2007. These wedding announcements are probably a random sample of those from the upper middle to the upper class of the tri-state area.
Above we have 10 couples, so n=10. We could use these wedding announcements to answer the following questions:
1. In 2007, do highly educated men marry highly educated women? Or, otherwise stated, is there a correlation between the bride's and groom's highest educational level?
Try to answer the first question. You'll have to use some rating scale, such as the one below, in order to code level of education:
Educational Level | Score |
---|---|
High School | 1 |
Some College Credits | 2 |
College Degree | 3 |
In a Master's Program | 3.5 |
Master's Degree | 4 |
PhD, MD, or Law Degree | 5 |
Some terms and common assumptions you will need to know:
* A postdoctoral fellow is an individual who holds a PhD already and would receive a rating of 5
* A professor usually holds a PhD
* A nephrologist is a kidney doctor. He would receive a rating of 5
* An MBA is a master's degree in business and would receive a rating of 4
In your calculations, let "X" be the bride's highest educational level and let "Y" be the groom's highest educational level.
We would then calculate the relevant statistics by substituting these values into the formula for the Pearson r.
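x | y | xy | x² | y² |
---|---|---|---|---|
3.5 | 4 | 14 | 12.25 | 16 |
3 | 3 | 9 | 9 | 9 |
4 | 4 | 16 | 16 | 16 |
3 | 1 | 3 | 9 | 1 |
4 | 5 | 20 | 16 | 25 |
3 | 3 | 9 | 9 | 9 |
4 | 2 | 8 | 16 | 4 |
3 | 5 | 15 | 9 | 25 |
5 | 4 | 20 | 25 | 16 |
5 | 5 | 25 | 25 | 25 |
Σx = 37.5 | Σy = 36 | Σxy = 139 | Σx² = 146.25 | Σy² = 146 |

$$
\begin{aligned}
r &= \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}} \\
  &= \frac{10(139) - 37.5(36)}{\sqrt{10(146.25) - (37.5)^2}\,\sqrt{10(146) - (36)^2}} \\
  &= \frac{40}{(7.5)(12.806)} \approx \frac{40}{96.05} \approx .42
\end{aligned}
$$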
So, the answer to the first question is that there is a moderate positive correlation (r = .42) between the bride's and groom's educational level.
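The hand calculation can be checked with a few lines of Python. This is only a sketch: the x and y values are the ten coded education levels from the table above, while the variable names are assumptions chosen for readability.

```python
from math import sqrt

# Coded education levels from the table above (x = bride, y = groom).
x = [3.5, 3, 4, 3, 4, 3, 4, 3, 5, 5]
y = [4, 3, 4, 1, 5, 3, 2, 5, 4, 5]

n = len(x)                                  # 10 couples
sum_x, sum_y = sum(x), sum(y)               # Σx, Σy
sum_xy = sum(a * b for a, b in zip(x, y))   # Σxy
sum_x2 = sum(a * a for a in x)              # Σx²
sum_y2 = sum(b * b for b in y)              # Σy²

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # the five column totals
print(round(r, 2))                           # 0.42
```

The same result should come out of Excel's correlation function when the two columns are supplied as its two arrays.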
Exercise
Let's now do the calculations to answer these additional questions:
2. Do individuals marry someone close to themselves in age?
3. Do men with highly educated mothers tend to marry women who are highly educated?
4. Do individuals with highly educated parents tend to marry someone who also has highly educated parents?
5. Using Hollingshead's (Hollingshead, A.B. The four-factor index of social status. Unpublished manuscript, Yale University 1975) index of SES (socioeconomic status), do our data indicate that individuals tend to marry other individuals of similar SES? (Hint: Use the information in the New York Times to assign to the bride and to the groom of each couple a score based on Hollingshead's scale.)
A cautionary note: In the answer to Question 1, we found a moderate correlation of .42. But we need to mention two caveats here:
(1) We used a small set of data, since our n was only 10.
(2) We did not have a sample representative of the entire population. If we had looked at a larger sample of people who got married, not just those who announce their weddings in the New York Times, it may very well be that the correlation between the bride's and groom's educational levels would be higher than the .42 we found. Those individuals who place their wedding announcements in the New York Times are in the upper echelons of society. For them, it may be that other factors, such as social class, wealth, etc., play a more significant role in the choice of a mate than they do for the population at large. The issue of having a sample not representative of the entire population in which we are interested is the issue of truncated or restricted range, which we consider in the next activity.
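To see how a restricted range can lower a correlation, consider the following sketch. The 20 data points are entirely made up (they are not wedding data), and the names `pearson_r`, `offsets`, and the cutoff of 16 are assumptions chosen only to make the effect visible.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson r for two equal-length lists of paired values."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Invented "population": y follows x, plus a small repeating wobble.
offsets = [2, -2, 1, -1]
xs = list(range(1, 21))
ys = [x + offsets[i % 4] for i, x in enumerate(xs)]

# Restricted sample: keep only the pairs at the top of the x range.
top_pairs = [(x, y) for x, y in zip(xs, ys) if x >= 16]
top_xs = [x for x, _ in top_pairs]
top_ys = [y for _, y in top_pairs]

r_full = pearson_r(xs, ys)
r_restricted = pearson_r(top_xs, top_ys)
print(f"full range:       r = {r_full:.2f}")        # about 0.96
print(f"restricted range: r = {r_restricted:.2f}")  # about 0.66
```

The wobble is the same size everywhere, but within the narrow top slice it is large relative to the spread in x, so r drops even though the underlying relationship has not changed.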
Question about this website?
Please email: Dr. Barbara Rumain - barbara.rumain@touro.edu
Copyright © 2007-2017, Touro College and University System.