The Pearson Correlation


  1. The Pearson Correlation Coefficient
  2. By far the most common measure of correlation is the Pearson product-moment correlation.

    Definition: The Pearson correlation measures the degree and direction of a linear relationship between two variables.

    Notation: The Pearson correlation is denoted by the letter r.

    1. Conceptual Formula

      Conceptually, the Pearson correlation is the ratio of the variation shared by X and Y to the variation of X and Y separately. The conceptual formula is:

        r = (variation shared by X and Y) / (variation of X and Y separately)

      Stated in statistical terminology:

        r = (covariability of X and Y) / (variability of X and Y separately)

      When there is a perfect linear relationship, every change in the X variable is accompanied by a corresponding change in the Y variable. In this case, all variation in X is shared with Y, so the ratio given above is r=1.00. At the other extreme, when there is no linear relationship between X and Y, then the numerator is zero, so r=0.00.

    2. Sum of Products of Deviations:

      To calculate the Pearson correlation, it is necessary to introduce one new concept: The sum of the products of corresponding deviation scores for two variables. We have already seen a similar concept, the sum of the squares of the deviation scores for a variable.

      The sum of squares, which is used to measure the amount of variability of a single variable, is defined as:

        SS_X = Σ(X − M_X)²

      where M_X is the mean of the X scores.

      The sum of products, which is used to measure the variability shared between two variables, is defined as:

        SP = Σ(X − M_X)(Y − M_Y)

      Note that the name SP is short for the sum of the products of corresponding deviation scores for two variables.

      To calculate the SP, you first determine the deviation scores for each X and for each Y, then you calculate the products of each pair of deviation scores, and then (last) you sum the products.
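      Spelled out as code, those three steps (deviation scores, pairwise products, then the sum) might look like this minimal Python sketch; the data values are made up for illustration:

```python
def sum_of_products(x, y):
    """SP: sum of the products of corresponding deviation scores."""
    mx = sum(x) / len(x)          # mean of X
    my = sum(y) / len(y)          # mean of Y
    # deviation scores, pairwise products, then the sum
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

sum_of_products([1, 2, 3], [2, 4, 6])  # -> 4.0
```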

    3. The Algebraic Formula:

      As noted above, conceptually the Pearson correlation coefficient is the ratio of the joint covariability of X and Y to the variability of X and Y separately. The formula uses SP as the measure of covariability, and the square root of the product of the SS for X and the SS for Y as the measure of separate variability. That is:

        r = SP / √(SS_X × SS_Y)
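      A minimal Python sketch of this computation, combining SP with the two SS terms (the function and data are illustrative, not from the notes):

```python
import math

def pearson_r(x, y):
    """r = SP / sqrt(SSx * SSy)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sp  = sum((a - mx) * (b - my) for a, b in zip(x, y))   # covariability
    ssx = sum((a - mx) ** 2 for a in x)                    # variability of X
    ssy = sum((b - my) ** 2 for b in y)                    # variability of Y
    return sp / math.sqrt(ssx * ssy)

pearson_r([1, 2, 3], [2, 4, 6])   # perfect positive linear relationship -> 1.0
pearson_r([1, 2, 3], [3, 2, 1])   # perfect negative linear relationship -> -1.0
```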

    4. Z-Scores and Pearson Correlation:

      If we have scores that are expressed as standardized scores -- Z-scores with a mean of zero and a variance of one -- then the formula for the Pearson correlation becomes particularly simple. It is:

        r = Σ(z_X z_Y) / n
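      A sketch of this version: standardize with the population standard deviation (so each variable has variance one, as stated above), then average the products of the paired z-scores:

```python
import math

def pearson_r_from_z(x, y):
    """Standardize (population SD, so variance = 1), then average the products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sdy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    zx = [(a - mx) / sdx for a in x]
    zy = [(b - my) / sdy for b in y]
    return sum(a * b for a, b in zip(zx, zy)) / n
```

      This agrees with the SP/SS formula; it is just the same ratio computed after standardizing.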

  3. Understanding and Interpreting the Pearson Correlation Coefficient
    1. Correlation is NOT Causation!

      One of the most common errors made in interpreting correlations is to assume that a correlation necessarily implies a cause-and-effect relationship between the two variables. Simply stated: Correlation is NOT Causation!

    2. Correlation and Restricted Range

      When a correlation is computed from scores with a restricted range, the correlation coefficient is lower than it would be if it were computed from scores with an unrestricted range.

      This happens when we look at the correlation between SAT and GPA among students in this class, since we see only those students who were admitted to UNC. Those with low SAT scores (who presumably would have had very low GPAs) were not admitted. Thus, we have a restricted range of observed SAT scores, and a lower correlation.

      For example, consider the relationship between Automobile weight and Horsepower shown in the following scatterplot:

      If we lived in a country that restricted cars to have no more than 100 Hp, then the data would be cut off like this:

      The relationship we would see would be based only on the cars with less than 100 HP. We would see:

      Now the correlation is only .71, rather than .92.
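      The automobile data themselves are not reproduced here, but the effect is easy to illustrate with made-up numbers: cutting off the top of a noisy linear relationship lowers r.

```python
import math

def pearson_r(x, y):
    """r = SP / sqrt(SSx * SSy)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sp  = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return sp / math.sqrt(ssx * ssy)

# Made-up data: a linear trend plus a little alternating noise
x = list(range(1, 11))
y = [xi + e for xi, e in zip(x, [0.5, -0.5] * 5)]

r_full = pearson_r(x, y)

# Impose a "cap" like the 100-HP limit: keep only pairs with small y
keep = [i for i, yi in enumerate(y) if yi <= 5.5]
r_restricted = pearson_r([x[i] for i in keep], [y[i] for i in keep])
# r_restricted comes out smaller than r_full
```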

    3. Outliers (Outriders?)

      An outlier (which G&W, for some unknown reason, call an outrider) is an individual observation that has very large values of X and Y relative to all the other values of X and Y. For example, in this scatterplot of the Market Value of many companies plotted against their Assets, the fact that IBM is so large compared to any other company completely obscures the relationship between the two variables.

      The correlation for these variables is .68, which is spuriously high. In fact, the correlation is reduced to .48 when IBM is removed.
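      Again with made-up numbers (not the actual company data), a single extreme observation can inflate an otherwise modest correlation:

```python
import math

def pearson_r(x, y):
    """r = SP / sqrt(SSx * SSy)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sp  = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return sp / math.sqrt(ssx * ssy)

# Four modestly related points...
x, y = [1, 2, 3, 4], [2, 1, 4, 3]
r_without = pearson_r(x, y)               # 0.6
# ...plus one extreme "IBM-like" observation
r_with = pearson_r(x + [100], y + [100])  # jumps to nearly 1
```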

    4. Correlation and Strength of Relationship

      The Pearson correlation measures the degree of relationship between two variables. It is not, however, interpreted as a percentage. On the other hand:

      The Coefficient of Determination:
      The Coefficient of Determination, which is the squared correlation coefficient, measures the percentage of variation shared between the two variables.
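      Using the SAT/GPA correlations reported below in the hypothesis-testing example, squaring converts r to the percentage of shared variance:

```python
# Correlations reported in these notes for the SAT scales with GPA
correlations = {"MathSAT": 0.32, "VerbalSAT": 0.47}

# The coefficient of determination r**2 is the proportion of shared variance
determination = {name: r ** 2 for name, r in correlations.items()}
# MathSAT: 0.1024 (about 10%); VerbalSAT: 0.2209 (about 22%)
```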

  4. Hypothesis Testing with the Pearson Correlation Coefficient
  5. The Pearson correlation coefficient is usually computed for sample data. Often we wish to make inferences from the sample correlation to the value of the correlation in the population. We can use standard inference testing techniques to make this inference.

    The basic question answered by the hypothesis testing procedure for the Pearson correlation coefficient is whether it is significantly different from zero: i.e., whether or not a non-zero correlation exists in the population. Here are the four standard hypothesis testing steps, as augmented by a visualization step for the data:

    1. State the hypotheses

      The hypotheses concern whether or not there exists a non-zero correlation in the population. We have a 2-tailed hypothesis:

      There are also two possible 1-tailed hypotheses. Here's one of them:

    2. Set the decision criterion

      Choose an alpha level.
      The df=n-2, where n is the number of pairs.

    3. Gather the data

      Let's use the data gathered in class about SAT-M and GPA. We can also use the SAT-V and GPA correlation. We observe that

      • The MathSAT/GPA correlation is .32 -- 10% of the variance in GPA is explained by MathSAT.
      • The VerbalSAT/GPA correlation is .47 -- 22% of the variance in GPA is explained by VerbalSAT.
      • Remember that these correlations have been attenuated by restriction of range.
      • If we had a larger sample, and these correlation values stayed about the same, then they would become significant. However, significance isn't everything, as the size of the correlation tells us how strong the relationship is.

    4. Visualize the data

      Here are the two scatterplots:

    5. Evaluate the Hypothesis

      For df=39, alpha=.05, we can interpolate to find the critical one-tailed r.

      • For df = 35 the critical r = .275.
      • For df = 40 the critical r = .257.
        Thus
      • For df = 39 the critical r
        = .275 - (.275 - .257)*(4/5)
        = .275 - .014
        = .261
      Note that we don't really need to interpolate since both observed correlations (.32 and .47) are beyond the larger critical value of .275.

      Therefore, for both relationships, we reject the null hypothesis that there is no positive correlation in the population (in plain English, we say these correlations are significantly different from zero). The two SAT scales DO significantly predict your GPAs.
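      The interpolation and the decision above can be checked with a few lines of Python (the tabled values .275 and .257 are the ones given in these notes):

```python
# Tabled critical one-tailed r values at alpha = .05 (from the notes)
r_crit_35 = 0.275   # df = 35
r_crit_40 = 0.257   # df = 40

# Linear interpolation for df = 39
df = 39
r_crit = r_crit_35 - (r_crit_35 - r_crit_40) * (df - 35) / (40 - 35)
# r_crit is about .261

# Both observed correlations exceed the critical value, so reject H0
observed = {"MathSAT": 0.32, "VerbalSAT": 0.47}
significant = {name: r > r_crit for name, r in observed.items()}
```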

  6. Pearson Correlation and ViSta
  7. ViSta can compute and report Pearson (as well as Spearman, Point-Biserial, and Phi) correlations, but it does not do significance testing for the computed correlation coefficients.

    The ViSta Applet demonstrates that you can compute Pearson correlations in two ways: