# Correlation and agreement: overview and clarification of competing concepts and measures

Jinyuan LIU^{1,*}, Wan TANG^{2}, Guanqin CHEN^{1}, Yin LU^{3,4}, Changyong FENG^{1}, and Xin M. TU^{1,*}

## Abstract

**Summary:** Agreement and correlation are widely-used concepts that assess the association between variables. Although similar and related, they represent completely different notions of association. Assessing agreement between variables assumes that the variables measure the same construct, while correlation of variables can be assessed for variables that measure completely different constructs. This conceptual difference requires the use of different statistical methods, and when assessing agreement or correlation, the statistical method may vary depending on the distribution of the data and the interest of the investigator. For example, the Pearson correlation, a popular measure of correlation between continuous variables, is only informative when applied to variables that have linear relationships; it may be non-informative or even misleading when applied to variables that are not linearly related. Likewise, the intraclass correlation, a popular measure of agreement between continuous variables, may not provide sufficient information for investigators if the nature of poor agreement is of interest. This report reviews the concepts of agreement and correlation and discusses differences in the application of several commonly used measures.

**Keywords:** concordance correlation, intraclass correlation, Kendall’s tau, non-linear association, Pearson’s correlation, Spearman’s rho

## 1. Introduction

Agreement and correlation are widely used concepts in the medical literature. Both are used to indicate the strength of association between variables of interest, but they are conceptually distinct and, thus, require the use of different statistics.

Correlation focuses on the association of changes in two outcomes, outcomes that often measure quite different constructs such as cancer and depression. The Pearson correlation is the most popular measure of the association between two continuous outcomes, but it is only useful when measuring linear relationships between variables. If the relationship is non-linear, the Pearson correlation generally does not provide a good indication of association between the variables. Another problem is that using the standard interpretation of Pearson correlation coefficients can, in some circumstances, lead to incorrect conclusions.

Agreement, also known as reproducibility, is a concept closely related to, but fundamentally different from, correlation. Like correlation, agreement also assesses the relationships between outcomes of interest, but, as the name indicates, the emphasis is on the degree of concordance in the opinions between two or more individuals or in the results between two or more assessments of the variable of interest. An example of agreement in mental health research is the consensus between multiple clinicians about the psychiatric diagnoses of a group of patients. In biomedical sciences agreement can also include measures of the reproducibility (i.e., reliability) of a laboratory test result when repeated in the same center or when conducted in multiple centers under the same conditions. It is not sensible to speak of agreement (reproducibility) between variables that measure different constructs; so when measuring the association between different variables – such as weight and height – one can assess correlation but not agreement. For continuous outcomes, the intraclass correlation (ICC) is a popular measure of agreement. Like the Pearson correlation, the ICC is an estimate of the magnitude of the relationship between variables (in this case, between multiple assessments of the same variable). However, the ICC also takes into account rater bias, the element that distinguishes agreement from correlation; that is, good agreement (reproducibility) not only requires good correlation, it also requires small rater bias.

In this report, we provide an overview of popular measures and statistical methods for assessing the two different notions of association between variables. We also clarify the key differences between the measures and between the methods used to assess the measures. We focus on continuous outcomes and assume all variables are continuous unless stated otherwise.

## 2. Correlation measures

### 2.1. Pearson correlation

Consider a sample of $n$ subjects and a bivariate continuous outcome, $(u_i, v_i)$, from each subject within the sample ($1 \le i \le n$). The Pearson correlation is the most popular statistic for measuring the association between the two variables $u_i$ and $v_i$:^{[1]}

$$\hat{p} = \frac{\sum_{i=1}^{n}(u_i - u_{\cdot})(v_i - v_{\cdot})}{\sqrt{\sum_{i=1}^{n}(u_i - u_{\cdot})^2 \sum_{i=1}^{n}(v_i - v_{\cdot})^2}} \qquad (1)$$

where $u_{\cdot}$ ($v_{\cdot}$) denotes the sample mean of $u_i$ ($v_i$). The Pearson correlation $\hat{p}$ ranges between -1 and 1, with 1 (-1) indicating perfect positive (negative) correlation and 0 indicating no linear association between the variables.
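
The sample Pearson correlation is the sum of centered cross-products divided by the square root of the product of the centered sums of squares. A minimal Python sketch of that calculation follows; the function name `pearson` and the two data sets are hypothetical and chosen only for illustration.

```python
import math


def pearson(u, v):
    """Sample Pearson correlation: centered cross-products over the
    square root of the product of the centered sums of squares."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("u and v must have the same length (at least 2)")
    n = len(u)
    u_mean = sum(u) / n
    v_mean = sum(v) / n
    s_uv = sum((a - u_mean) * (b - v_mean) for a, b in zip(u, v))
    s_uu = sum((a - u_mean) ** 2 for a in u)
    s_vv = sum((b - v_mean) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Hypothetical data: an exactly increasing and an exactly decreasing line.
v = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson([2.0 * x + 1.0 for x in v], v)
r_down = pearson([-0.5 * x + 3.0 for x in v], v)
print(r_up, r_down)
```

The two calls illustrate the extremes of the range: an exact increasing line gives a value of 1 and an exact decreasing line gives -1, up to floating-point rounding.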

As popular as it is, the Pearson correlation is only appropriate for measuring correlation between $u_i$ and $v_i$ when the two variables follow a linear relationship. If the bivariate outcome $(u_i, v_i)$ follows a non-linear relationship, $\hat{p}$ is not an informative measure and is difficult to interpret.

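To see this, let $\mu_u$ ($\mu_v$) and $\sigma_u^2$ ($\sigma_v^2$) denote the (population) mean and (population) variance of the variable $u_i$ ($v_i$). The Pearson correlation is an estimate of the following product-moment correlation:

$$p = \frac{\mathrm{Cov}(u_i, v_i)}{\sqrt{\mathrm{Var}(u_i)\,\mathrm{Var}(v_i)}} = \frac{E[(u_i - \mu_u)(v_i - \mu_v)]}{\sigma_u \sigma_v} \qquad (2)$$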

Unlike $\hat{p}$, which measures correlation between $u_i$ and $v_i$ based on the sample, the product-moment correlation $p$ is the population-level correlation, which cannot be calculated but is estimated by $\hat{p}$. Thus, $\hat{p}$ may also be referred to as the ‘sample product-moment correlation’.

If $u_i$ and $v_i$ have a linear relationship, then $u_i = a v_i + b + \epsilon_i$, where $a$ and $b$ are some constants, and $\epsilon_i$ denotes random errors (independent of $v_i$) with mean 0 and variance $\sigma_\epsilon^2$. By centering $u_i$ ($v_i$) at its mean, we have: $u_i - \mu_u = a(v_i - \mu_v) + \epsilon_i$. It follows that $\sigma_u^2 = a^2\sigma_v^2 + \sigma_\epsilon^2$. If $u_i$ and $v_i$ are perfectly correlated, that is, $\sigma_\epsilon^2 = 0$, it follows from Equation (2) that $p = 1$ (or -1), depending on whether $a$ is positive or negative. Also, if $u_i$ and $v_i$ are uncorrelated, or independent, that is, $a = 0$, then $p = 0$ and vice versa.
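
The limiting values above can be checked by writing the product-moment correlation of the linear model in closed form. The short derivation below assumes, in addition to what is stated, that $\epsilon_i$ is uncorrelated with $v_i$:

$$
\begin{aligned}
\mathrm{Cov}(u_i, v_i) &= \mathrm{Cov}(a v_i + b + \epsilon_i,\; v_i) = a\,\sigma_v^2, \\
p &= \frac{a\,\sigma_v^2}{\sigma_v\sqrt{a^2\sigma_v^2 + \sigma_\epsilon^2}} = \frac{a\,\sigma_v}{\sqrt{a^2\sigma_v^2 + \sigma_\epsilon^2}}.
\end{aligned}
$$

Setting $\sigma_\epsilon^2 = 0$ gives $p = a/|a| = \pm 1$, and setting $a = 0$ gives $p = 0$. For intermediate cases, $|p|$ shrinks as the error variance $\sigma_\epsilon^2$ grows relative to $a^2\sigma_v^2$.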

If $u_i$ and $v_i$ have a non-linear relationship, the product-moment correlation generally does not provide an informative measure of correlation. The example below shows that the Pearson correlation in this case can be quite misleading.

**Example 1.** Suppose that $u_i$ and $v_i$ are perfectly correlated and follow the non-linear relationship, $u_i = v_i^9$. Further, assume that $v_i$ follows a standard normal distribution $N(0, 1)$ with mean 0 and variance 1. Then, the product-moment correlation is:

$$p = \frac{\mathrm{Cov}(u_i, v_i)}{\sqrt{\mathrm{Var}(u_i)\,\mathrm{Var}(v_i)}} = \frac{E(v_i^{10})}{\sqrt{E(v_i^{18})}} = 0.161 \qquad (3)$$

The poor association between $u_i$ and $v_i$ as indicated by the product-moment correlation contradicts the conceptual perfect correlation between the two variables. Thus, the product-moment correlation and its sample counterpart, the Pearson correlation, generally do not apply to non-linear relationships.
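
The value of the product-moment correlation in Example 1 can be checked numerically. The sketch below relies on a standard fact that is not stated in the text: for a standard normal variable $Z$ and even $k$, $E(Z^k) = (k-1)!!$ (the product of the odd numbers up to $k-1$), while all odd moments are 0. The helper name `normal_even_moment` is hypothetical.

```python
import math


def normal_even_moment(k):
    """E[Z^k] for Z ~ N(0, 1) and even k >= 0, i.e. the double factorial (k - 1)!!."""
    if k < 0 or k % 2 != 0:
        raise ValueError("k must be a non-negative even integer")
    result = 1
    for odd in range(1, k, 2):  # 1, 3, ..., k - 1
        result *= odd
    return result


# With u = v**9 and v ~ N(0, 1): E[u] = E[v] = 0, so
# Cov(u, v) = E[v**10], Var(u) = E[v**18] and Var(v) = 1.
cov_uv = normal_even_moment(10)   # 945
var_u = normal_even_moment(18)    # 34459425
var_v = 1
p = cov_uv / math.sqrt(var_u * var_v)
print(round(p, 3))  # 0.161
```

The ratio is small because the eighteenth moment grows much faster than the tenth; the few very large values of $v_i^9$ dominate the variance of $u_i$.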

### 2.2. Spearman’s Rho

Spearman’s rho is also a popular measure of association. Unlike the Pearson correlation, it also applies to non-linear relationships, thereby addressing the aforementioned limitation of the Pearson correlation.

Let $q_i$ ($r_i$) denote the rankings of $u_i$ ($v_i$) ($1 \le i \le n$). Spearman’s rho is defined as:

$$\hat{\rho} = \frac{\sum_{i=1}^{n}(q_i - q_{\cdot})(r_i - r_{\cdot})}{\sqrt{\sum_{i=1}^{n}(q_i - q_{\cdot})^2 \sum_{i=1}^{n}(r_i - r_{\cdot})^2}} \qquad (4)$$

where $q_{\cdot}$ ($r_{\cdot}$) denotes the sample mean of $q_i$ ($r_i$). By comparing (1) and (4), it is clear that $\hat{\rho}$ is really the Pearson correlation when applied to the rankings $(q_i, r_i)$ of the original variables $(u_i, v_i)$. Since the rankings only concern the ordering of the observations, any monotone relationship between the original variables becomes a linear relationship between the rankings, regardless of whether the original variables are linearly related. Thus, Spearman’s rho not only has the same interpretation as the Pearson correlation, but also applies to non-linear (monotone) relationships.
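
Because Spearman’s rho is the Pearson correlation of the rankings, it can be sketched in a few lines of Python. The helper names `ranks`, `pearson` and `spearman` are hypothetical, the data are made up for illustration, and the ranking helper assumes there are no ties (a reasonable simplification for continuous outcomes).

```python
import math


def ranks(x):
    """Rank observations from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0] * len(x)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    u_mean = sum(u) / n
    v_mean = sum(v) / n
    s_uv = sum((a - u_mean) * (b - v_mean) for a, b in zip(u, v))
    s_uu = sum((a - u_mean) ** 2 for a in u)
    s_vv = sum((b - v_mean) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def spearman(u, v):
    """Spearman's rho: the Pearson correlation applied to the rankings."""
    return pearson(ranks(u), ranks(v))


# Hypothetical data with a monotone but non-linear relationship.
v = [-1.5, -0.5, 0.2, 1.0, 2.0]
u = [x ** 3 for x in v]
print(pearson(u, v), spearman(u, v))
```

For these data the rankings of `u` and `v` coincide, so `spearman` returns 1 (up to rounding) while `pearson` stays below 1.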

The Spearman $\hat{\rho}$ ranges between -1 and 1, with 1 (-1) indicating perfect positive (negative) correlation; when $\hat{\rho} = 0$ there is no association between the variables $u_i$ and $v_i$. If $\hat{\rho} = 1$, then $q_i = r_i$, in which case,

$$u_i < u_j,\; v_i < v_j \quad \text{or} \quad u_i > u_j,\; v_i > v_j \quad \text{for all } 1 \le i < j \le n. \qquad (5)$$

If $\hat{\rho} = -1$, then $q_i = n - r_i + 1$, in which case,

$$u_i < u_j,\; v_i > v_j \quad \text{or} \quad u_i > u_j,\; v_i < v_j \quad \text{for all } 1 \le i < j \le n. \qquad (6)$$

Any two pairs of bivariate outcomes $(u_i, v_i)$ and $(u_j, v_j)$ that satisfy (5) or (6) are said to be concordant or discordant, respectively; that is, for a concordant pair, $u_i$ and $v_i$ are either both larger or both smaller than $u_j$ and $v_j$, whereas for a discordant pair one is larger and the other is smaller. Thus, perfect positive (negative) correlation by Spearman’s rho corresponds to perfect concordance (discordance); that is, concordant (discordant) pairs $(u_i, v_i)$ and $(u_j, v_j)$ for all $1 \le i < j \le n$.

**Example 2.** Table 1 shows 12 observations of the bivariate outcome $(u_i, v_i)$ as described in Example 1, and the ranks associated with these observations. Note that $u_i$ and $v_i$ are perfectly related, so their rankings are identical; that is, $q_i = r_i$.

### Table 1

**A sample of 12 bivariate outcomes ($u_i$, $v_i$) simulated with $u_i = v_i^9$ and $v_i$ from standard normal $N(0,1)$.**

| $i$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $v_i$ | 0.26 | 1.49 | 1.39 | 0.65 | -0.49 | -1.38 | 1.168 | 0.87 | -0.96 | 2.15 | -0.03 | -1.08 |
| $u_i$ | 0 | 38.1 | 19.4 | 0.02 | -0.002 | -18.5 | 4.06 | 0.29 | -0.68 | 971.6 | 0 | -2.10 |
| $q_i$ ($r_i$) | 6 | 11 | 10 | 7 | 4 | 1 | 9 | 8 | 3 | 12 | 5 | 2 |

In this example the Pearson correlation $\hat{p} = 0.531$, while Spearman’s $\hat{\rho} = 1$. Thus, only Spearman’s rho captures the perfect non-linear relationship between $u_i$ and $v_i$.

Note that the Pearson correlation $\hat{p} = 0.531$ is considerably larger than the product-moment correlation $p = 0.161$; this upward bias is due to the small sample size, $n = 12$. As the sample size increases, $\hat{p}$ becomes closer to $p$, a property known as ‘consistency’ in statistics. For example, we also simulated $(u_i, v_i)$ with $n = 1000$ and obtained $\hat{p} = 0.173$, much closer to $p$.
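
The comparison in Example 2 can be reproduced approximately from the printed table. The sketch below takes the row of values between -1.38 and 2.15 as $v_i$ and recomputes $u_i = v_i^9$ from those rounded values, so the Pearson correlation agrees with the reported 0.531 only up to rounding. The helper names are hypothetical and ties are assumed absent.

```python
import math


def ranks(x):
    """Rank observations from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0] * len(x)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    u_mean = sum(u) / n
    v_mean = sum(v) / n
    s_uv = sum((a - u_mean) * (b - v_mean) for a, b in zip(u, v))
    s_uu = sum((a - u_mean) ** 2 for a in u)
    s_vv = sum((b - v_mean) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Rounded v values as printed in Table 1; u is recomputed as v**9.
v = [0.26, 1.49, 1.39, 0.65, -0.49, -1.38, 1.168, 0.87, -0.96, 2.15, -0.03, -1.08]
u = [x ** 9 for x in v]

pearson_r = pearson(u, v)
spearman_rho = pearson(ranks(u), ranks(v))
print(ranks(v))
print(round(pearson_r, 2), round(spearman_rho, 2))
```

The single large observation ($v_i = 2.15$) dominates the sums of squares for $u_i$, which is why the Pearson correlation is far from 1 even though the two rankings are identical.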

Like the Pearson correlation, Spearman’s rho in (4) is a statistic based on a sample. This sample Spearman’s rho is an estimate of the following population Spearman’s rho:

$$\rho = 12\,E[I(u_j < u_i)\,I(v_k < v_i)] - 3, \quad \text{for all } 1 \le i < j < k \le n. \qquad (7)$$

In Equation (7), $E[I(u_j < u_i)\,I(v_k < v_i)]$ stands for the mathematical expectation of $I(u_j < u_i)\,I(v_k < v_i)$, and $I(u_j < u_i)$ (similarly $I(v_k < v_i)$) denotes an indicator with $I(u_j < u_i) = 1$ if $u_j < u_i$ and 0 otherwise. It can be shown that $\rho = 1$ (-1) if $(u_i, v_i)$ are perfectly concordant (discordant) and vice versa.
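
The constants 12 and 3 in the population Spearman’s rho can be checked in three reference cases. The sketch below assumes that the subjects are independent and identically distributed with continuous outcomes, and writes the expectation as $E[I(u_j < u_i)\,I(v_k < v_i)]$ for three distinct subjects $i$, $j$, $k$:

$$
\begin{aligned}
\text{independence:} \quad & E[I(u_j < u_i)]\,E[I(v_k < v_i)] = \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}, & \rho &= 12\cdot\tfrac{1}{4} - 3 = 0, \\
\text{perfect concordance:} \quad & P(u_j < u_i,\; u_k < u_i) = \tfrac{1}{3}, & \rho &= 12\cdot\tfrac{1}{3} - 3 = 1, \\
\text{perfect discordance:} \quad & P(u_j < u_i < u_k) = \tfrac{1}{6}, & \rho &= 12\cdot\tfrac{1}{6} - 3 = -1.
\end{aligned}
$$

In the concordant case $v_k < v_i$ is equivalent to $u_k < u_i$, so the expectation is the probability that subject $i$ has the largest of three values; in the discordant case $v_k < v_i$ is equivalent to $u_k > u_i$, so it is the probability of one particular ordering out of six.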

Note that the sample Spearman’s rho in (4) is simply referred to as Spearman’s rho in the literature. Unlike the Pearson correlation, there is no formal name for the population Spearman’s rho in (7). In general, the lack of a formal name for the population version does not cause confusion, since it is usually clear from the context of a discussion which one is meant. As with all statistics, the population version is called a parameter in statistical terminology. The statistic and parameter serve different purposes. For example, only the parameter can be used in stating statistical hypotheses, such as the null hypothesis, $H_0: \rho = 0$, for testing whether the population Spearman’s rho is 0. Values of Spearman’s rho reported by studies are always the sample Spearman’s rho.

### 2.3. Kendall’s Tau

Another alternative for non-linear association is Kendall’s tau.^{[2]} Like Spearman’s rho, Kendall’s tau also exploits the concept of concordance and discordance to derive a measure for bivariate outcomes. Unlike Spearman’s rho, it uses the notion of concordant and discordant pairs directly in the definition of this correlation measure.

Specifically, Kendall’s $\tau$ (sample version) is defined as:

$$\hat{\tau} = \frac{n_c - n_d}{n_t} \qquad (8)$$

In the above, $n_c$ ($n_d$) is the number of concordant (discordant) pairs in the sample and $n_t = \frac{1}{2}n(n-1)$ is the total number of concordant and discordant pairs in the sample. If $n_c = n_t$ ($n_d = n_t$), then $\hat{\tau} = 1$ (-1) and vice versa. Also, if there is no association between $u_i$ and $v_i$, then $n_c$ and $n_d$ should be close to each other and $\hat{\tau}$ should be close to 0 (not exactly 0 due to sampling variability). Thus, like Spearman’s rho, $\hat{\tau} = 1$ (-1) corresponds to perfect concordance (discordance). A value of $\hat{\tau}$ close to 0 indicates weaker or no association between the variables $u_i$ and $v_i$.

Like the Pearson and Spearman correlations, the sample Kendall’s $\hat{\tau}$ in (8) estimates the following population parameter:

$$\tau = 2\,E[I\{(u_i - u_j)(v_i - v_j) > 0\}] - 1, \quad \text{for all } 1 \le i < j \le n,$$

where the indicator $I\{(u_i - u_j)(v_i - v_j) > 0\}$ equals 1 if the pairs $(u_i, v_i)$ and $(u_j, v_j)$ are concordant and 0 otherwise.

Like its sample counterpart, $\tau$ also ranges between -1 and 1. If (5) holds true for all pairs $(u_i, v_i)$ and $(u_j, v_j)$, then $E[I\{(u_i - u_j)(v_i - v_j) > 0\}] = 1$ and $\tau = 1$. Likewise, if (6) holds true for all pairs, then $E[I\{(u_i - u_j)(v_i - v_j) > 0\}] = 0$ and $\tau = -1$. Thus, $\tau = 1$ (-1) corresponds to perfect concordance (discordance). Finally, if $u_i$ and $v_i$ are independent, then $E[I\{(u_i - u_j)(v_i - v_j) > 0\}] = \frac{1}{2}$ and $\tau = 0$. Thus, $\tau = 0$ indicates no association between $u_i$ and $v_i$, and vice versa.

**Example 3.** Consider the data in Example 2. Since the rankings of $u_i$ and $v_i$ are identical, all pairs are concordant and the sample Kendall’s tau is $\hat{\tau} = 1$. Thus, like Spearman’s rho, Kendall’s tau also provides a sensible measure of association for non-linearly related variables.

## 3. Agreement and measures of agreement

Agreement, or reproducibility, is another widely used concept for assessing the relationship among outcomes. As indicated in the Introduction, unlike variables considered in correlation analysis, variables considered for agreement must measure the same construct. Moreover, the measures of correlation considered in Section 2 generally do not apply to agreement, as the following example shows.

**Example 4.** Consider two judges who each rate 5 subjects sampled from a population of interest, using a scale from 1 to 10. Let $u_i$ and $v_i$ denote the two judges’ ratings on the $i$th subject ($1 \le i \le 5$). Suppose that the judges’ ratings of the subjects are as follows:

$(u_i, v_i)$: (1, 6), (2, 7), (3, 8), (4, 9), (5, 10).

Since $u_i$ and $v_i$ are linearly related, the Pearson correlation can be applied, yielding $\hat{p} = 1$, indicating perfect correlation. However, the data clearly do not indicate perfect agreement; in fact, the two judges hardly agree with one another.

The poor agreement in this hypothetical example is due to bias in the judges’ ratings. The mean ratings for the two judges are 3 (for $u_i$) and 8 (for $v_i$). Thus, despite the perfect correlation between the ratings, the two judges do not have good agreement because of bias in their ratings of the subjects; either $u_i$ has downward bias or $v_i$ has upward bias (or both).

The issue of bias does not apply to correlation because the variables considered for correlation generally measure different constructs and, thus, typically have different means. For the Pearson correlation, the sample means $u_{\cdot}$ and $v_{\cdot}$ are removed in the calculation of the correlation in (1); thus, the Pearson correlation is independent of differences between the (sample) means of the variables being correlated.

### 3.1. Intraclass correlation

Intraclass correlation (ICC) is a popular measure of agreement for continuous outcomes. Like the Pearson correlation, the ICC requires a linear relationship between the variables. However, it differs from the Pearson correlation in one key respect; the ICC also takes into account differences in the means of the measures being considered. In addition, the ICC can be applied to situations where there are three or more separate raters.

Consider a study with $n$ subjects and assume each subject is rated by a different group of $K$ judges. Let $y_{ik}$ denote the rating of the $i$th subject by the $k$th judge ($1 \le i \le n$, $1 \le k \le K$). The ICC is defined based on the following linear mixed-effects model:^{[3, 4, 5]}

$$y_{ik} = \mu + \beta_i + \epsilon_{ik}, \quad \beta_i \sim N(0, \sigma_\beta^2), \quad \epsilon_{ik} \sim N(0, \sigma^2). \qquad (9)$$

In the above model, the fixed effect $\mu$ is the (population) mean rating of the study population over all possible judges from the population of judges, and the random effect, or latent variable, $\beta_i$ represents the difference between the mean rating of the $i$th subject and the mean rating of the study population $\mu$. Thus, the sum $\mu + \beta_i$ represents the mean rating of the $i$th subject. The intraclass correlation (ICC) is defined as the variance ratio, $p_{ICC} = \frac{\sigma_\beta^2}{\sigma_\beta^2 + \sigma^2}$, of the variance $\sigma_\beta^2$ of the mean rating of the subjects ($\mu + \beta_i$) to the total variance consisting of $\sigma_\beta^2$ plus the variance $\sigma^2$ of the judges.

If there are only two judges ($K = 2$), then under the linear mixed-effects model in (9) the product-moment correlation between $y_{i1}$ and $y_{i2}$ is the same as the ICC; that is, $\mathrm{Corr}(y_{i1}, y_{i2}) = \frac{\sigma_\beta^2}{\sigma_\beta^2 + \sigma^2}$. Moreover, $y_{i1}$ and $y_{i2}$ have the same mean ($\mu$) and variance ($\sigma_\beta^2 + \sigma^2$). Thus, in this special case, the ICC is the same as the product-moment correlation ($p_{ICC} = p$). Note that this result does not contradict the data in Example 4, since $u_i$ and $v_i$ do not have the same mean; thus, the linear mixed-effects model in (9) does not apply to the data and the ICC no longer serves its intended purpose in this case. However, since differences in means between judges’ ratings decrease the ICC, this agreement index may still be applied in this situation to indicate poorer agreement. Follow-up analyses are necessary to determine whether poor agreement is due to bias or large variability or both between the judges.
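
The equality between the ICC and the product-moment correlation for two judges follows in two lines from a random-intercept model of the form $y_{ik} = \mu + \beta_i + \epsilon_{ik}$. The derivation below assumes that $\beta_i$, $\epsilon_{i1}$ and $\epsilon_{i2}$ are mutually independent, with $\mathrm{Var}(\beta_i) = \sigma_\beta^2$ and $\mathrm{Var}(\epsilon_{ik}) = \sigma^2$:

$$
\begin{aligned}
\mathrm{Cov}(y_{i1}, y_{i2}) &= \mathrm{Cov}(\beta_i + \epsilon_{i1},\; \beta_i + \epsilon_{i2}) = \mathrm{Var}(\beta_i) = \sigma_\beta^2, \\
\mathrm{Var}(y_{i1}) &= \mathrm{Var}(y_{i2}) = \sigma_\beta^2 + \sigma^2, \\
\mathrm{Corr}(y_{i1}, y_{i2}) &= \frac{\sigma_\beta^2}{\sqrt{(\sigma_\beta^2 + \sigma^2)(\sigma_\beta^2 + \sigma^2)}} = \frac{\sigma_\beta^2}{\sigma_\beta^2 + \sigma^2}.
\end{aligned}
$$

The shared subject effect $\beta_i$ is the only source of covariance between the two ratings, so the correlation is exactly the share of the total variance attributable to subjects.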

**Example 5.** Consider again Example 4 and let $y_{i1} = u_i$ and $y_{i2} = v_i$. By fitting the model in (9) to the data, we obtain estimates $\hat{\sigma}_\beta^2 = 0$ and $\hat{\sigma}^2 = 9.167$. Thus, the (sample) ICC based on the data is $\hat{p}_{ICC} = 0$, which is quite different from the Pearson correlation. Although the judges’ ratings are perfectly correlated, agreement between the judges is extremely poor.
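
One simple way to obtain variance estimates of this kind, offered here as an assumption rather than as the software or estimator used by the authors, is the balanced one-way ANOVA moment approach: the subject variance is estimated from the between-subject and within-subject mean squares and truncated at zero, in which case all variation is treated as residual. The function name `icc_oneway` is hypothetical. Applied to the ratings of Example 4, the sketch gives a subject variance of 0 and a residual variance of about 9.167.

```python
def icc_oneway(ratings):
    """Variance components and ICC for a balanced design.

    ratings: list of per-subject lists, each holding the K ratings of one subject.
    Returns (subject variance, residual variance, ICC). Assumes the ratings
    are not all identical, so the total variance is positive.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]

    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum((y - m) ** 2 for row, m in zip(ratings, subject_means) for y in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    var_subject = (ms_between - ms_within) / k
    if var_subject <= 0:
        # Boundary case: no subject effect, so pool all variation as residual.
        var_subject = 0.0
        var_residual = (ss_between + ss_within) / (n * k - 1)
    else:
        var_residual = ms_within
    return var_subject, var_residual, var_subject / (var_subject + var_residual)


# Ratings from Example 4, one row per subject: [first judge, second judge].
example4 = [[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]]
var_subject, var_residual, icc = icc_oneway(example4)
print(var_subject, round(var_residual, 3), icc)
```

The within-subject mean square (12.5) exceeds the between-subject mean square (5) because the 5-point gap between the judges is counted as disagreement within each subject, which drives the subject variance estimate to the boundary.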

Note that $\hat{p}_{ICC}$ is not a valid measure of agreement between $y_{i1}$ and $y_{i2}$ for the data in Example 5, since the assumption of a common mean between $y_{i1}$ and $y_{i2}$ is not met by the data. However, it is precisely this assumption that makes $\hat{p}_{ICC}$ totally different from the Pearson correlation ($\hat{p} = 1$). We may revise the model in (9) to account for the bias in the judges’ ratings and consider:

$$y_{ik} = \mu_k + \beta_i + \epsilon_{ik}, \quad \beta_i \sim N(0, \sigma_\beta^2), \quad \epsilon_{ik} \sim N(0, \sigma^2), \qquad (10)$$

where the added fixed effect $\mu_k$ accounts for the difference between the two judges. By fitting the above model, we obtain estimates $\hat{\sigma}_\beta^2 = 1.256$, $\hat{\sigma}^2 = 0$, $\hat{\mu}_1 = 3$ and $\hat{\mu}_2 = 8$. Once bias is accounted for, the two judges have perfect agreement. The model in (10) also provides mean ratings $\hat{\mu}_k$ for the judges. The positive estimate $\hat{\sigma}_\beta^2$ describes the variability among the subjects. Although (10) is the correct model for the data, the ICC calculated from it can no longer be interpreted as a measure of agreement. In fact, $\frac{\hat{\sigma}_\beta^2}{\hat{\sigma}_\beta^2 + \hat{\sigma}^2} = 1$, the same as the Pearson correlation $\hat{p} = 1$ calculated in Example 4.

Note that since $p_{ICC} \ge 0$, the ICC cannot capture disagreement (negative association) between judges; in such cases we can either reverse code some of the judges’ ratings or use a different index, such as the concordance correlation, discussed below.

### 3.2. Concordance correlation

The concordance correlation (CCC) is another measure of agreement which, unlike the ICC, does not assume a common mean for judges’ ratings at the outset, so it can be used to assess both the level of agreement and the level of disagreement. However, a major limitation of the CCC is that it only applies to two judges at a time.

Consider a study with $n$ subjects and assume each subject is rated by a different group of two judges. Let $y_{ik}$ again denote the rating of the $i$th subject by the $k$th judge ($1 \le i \le n$, $1 \le k \le 2$). Let $\mu_k = E(y_{ik})$ and $\sigma_k^2 = \mathrm{Var}(y_{ik})$ denote the mean and variance of $y_{ik}$, and let $\sigma_{12} = \mathrm{Cov}(y_{i1}, y_{i2})$ denote the covariance between $y_{i1}$ and $y_{i2}$. The CCC is defined as:^{[6]}

$$p_{CCC} = \frac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2} \qquad (11)$$

Unlike the ICC, no statistical model is assumed in the definition of $p_{CCC}$. Further, the two judges can come from two different populations of judges with different means and variances.

The CCC $p_{CCC}$ has a nice decomposition, $p_{CCC} = p\,C_b$, where $p$ is the product-moment correlation in (2) and $C_b$ is called the bias correction factor given by:

$$C_b = \frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2} \qquad (12)$$

It can be shown that $p_{CCC} = 1$ (-1) if and only if $p = 1$ (-1), $\mu_1 = \mu_2$ and $\sigma_1^2 = \sigma_2^2$.^{[6]} Thus, $p_{CCC} = 1$ (-1) if and only if $y_{i1} = y_{i2}$ ($y_{i1} = -y_{i2}$), that is, when there is perfect agreement (disagreement). The bias correction factor $C_b$ ($0 \le C_b \le 1$) in (12) assesses the level of bias, with smaller $C_b$ indicating larger bias. Thus, unlike the ICC, the CCC shows whether poor agreement results from low correlation (small $p$) or large bias (small $C_b$).

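**Example 6.** Consider again Example 5. The (sample) means and variances of $y_{i1}$ and $y_{i2}$, and the (sample) covariance between $y_{i1}$ and $y_{i2}$, are given by: $\hat{\mu}_1 = 3$, $\hat{\mu}_2 = 8$, $\hat{\sigma}_1^2 = 2.5$, $\hat{\sigma}_2^2 = 2.5$ and $\hat{\sigma}_{12} = 2.5$. Thus, it follows from (11) that $\hat{p}_{CCC} = \frac{2\hat{\sigma}_{12}}{\hat{\sigma}_1^2 + \hat{\sigma}_2^2 + (\hat{\mu}_1 - \hat{\mu}_2)^2} = \frac{2(2.5)}{2.5 + 2.5 + (3 - 8)^2} = 0.167$. We can also obtain $\hat{p}_{CCC}$ by using the decomposition result, which in our case yields $\hat{p} = 1$, $\hat{C}_b = 0.167$ and $\hat{p}_{CCC} = \hat{p}\,\hat{C}_b = 0.167$.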
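
The concordance correlation and its decomposition into a correlation and a bias correction factor can be sketched from sample moments. The sketch below assumes sample variances and covariance with $n - 1$ denominators (other conventions divide by $n$ and give slightly different values); the function name `concordance` is hypothetical, and the data are the ratings of Example 4.

```python
import math


def concordance(y1, y2):
    """Return (CCC, Pearson correlation, bias correction factor) from sample moments.

    Uses n - 1 denominators for the variances and the covariance (an assumption).
    """
    n = len(y1)
    m1 = sum(y1) / n
    m2 = sum(y2) / n
    s11 = sum((a - m1) ** 2 for a in y1) / (n - 1)
    s22 = sum((b - m2) ** 2 for b in y2) / (n - 1)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / (n - 1)

    denominator = s11 + s22 + (m1 - m2) ** 2
    ccc = 2 * s12 / denominator
    pearson_r = s12 / math.sqrt(s11 * s22)
    bias_factor = 2 * math.sqrt(s11 * s22) / denominator
    return ccc, pearson_r, bias_factor


# Ratings from Example 4: first judge and second judge.
y1 = [1, 2, 3, 4, 5]
y2 = [6, 7, 8, 9, 10]
ccc, pearson_r, bias_factor = concordance(y1, y2)
print(ccc, pearson_r, bias_factor)
print(abs(ccc - pearson_r * bias_factor) < 1e-12)  # the decomposition holds
```

Returning the three quantities together makes the source of poor agreement visible: here the correlation is perfect, so the low concordance is entirely due to the bias term $(\hat{\mu}_1 - \hat{\mu}_2)^2$ in the denominator.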

Note that, unlike with correlation, the issue of linear versus non-linear association does not arise when assessing agreement. This is because good agreement requires an approximately linear relationship between the outcomes. For example, in the case of two raters, good agreement requires that $y_{i1}$ and $y_{i2}$ are close to each other, with $y_{i1} = y_{i2}$ in the case of perfect agreement.

## 4. Discussion

We discussed the concepts of agreement and correlation and described various measures that can be used to assess the relationships among variables of interest. We focused on the measures and methods for continuous outcomes. For non-continuous outcomes, different methods must be applied. For example, for categorical outcomes a different version of Kendall’s tau, known as Kendall’s tau b, can be used for assessing correlation, and kappa can be used for assessing agreement.^{[7]}

## Biography

Ms. Jinyuan Liu obtained her bachelor’s of science degree in statistics from Nanjing University of Posts and Telecommunications in 2015. She is currently a master’s student in the Department of Biostatistics and Computational Biology at the University of Rochester in New York, USA. Her research interests include categorical data analysis, machine learning, and social networks.

## Funding Statement

The work was supported in part by a grant (GM108337) from the National Institutes of Health and the National Science Foundation (Tang and Tu) and a pilot grant (UR-CTSI GR500208) from the Clinical and Translational Sciences Institute at the University of Rochester Medical Center (Feng and Tu).

## Footnotes

**Conflict of interest statement:** The authors report no conflict of interest.

**Authors’ contributions:** All authors worked together on this manuscript. In particular, JYL, WT and XMT made major contributions to the section on correlation, GQC, YL and CYF made major contributions to the section on agreement, and JYL and XMT drafted and finalized the manuscript. All authors read and approved the final manuscript.