Friday 16 December 2011

A Correlation for the 21st Century

By Terry Speed


Most scientists will be familiar with the use of Pearson's correlation coefficient r to measure the strength of association between a pair of variables: for example, between the height of a child and the average height of their parents (r ≈ 0.5; see the figure, panel A), or between wheat yield and annual rainfall (r ≈ 0.75; panel B). However, Pearson's r captures only linear association, and its usefulness is greatly reduced when associations are nonlinear. What has long been needed is a measure that quantifies associations between variables generally, one that reduces to Pearson's in the linear case, but that behaves as we'd like in the nonlinear case. In a recent article in Science, Reshef et al. introduce the maximal information coefficient, or MIC, which can be used to detect and quantify nonlinear associations in data sets equitably.
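
To make the contrast concrete, here is a minimal sketch (synthetic data, not taken from the article): Pearson's r readily picks up a noisy linear relationship, but can sit near zero for an equally clear, purely nonlinear one.

```python
# Minimal sketch (synthetic data): Pearson's r on a linear and a nonlinear relation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

linear = 0.8 * x + rng.normal(scale=0.5, size=1000)    # noisy linear relationship
quadratic = x**2 + rng.normal(scale=0.5, size=1000)    # noisy nonlinear relationship

r_linear = np.corrcoef(x, linear)[0, 1]
r_quadratic = np.corrcoef(x, quadratic)[0, 1]

print(f"r, linear relation:    {r_linear:.2f}")     # clearly positive
print(f"r, quadratic relation: {r_quadratic:.2f}")  # close to zero despite strong dependence
```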

The common correlation coefficient r was invented in 1888 by Charles Darwin's half-cousin Francis Galton. Galton's method for estimating r was very different from the one we use now, but was amenable to hand calculation for samples of up to 1000 individuals. Francis Ysidro Edgeworth and later Karl Pearson gave us the modern formula for estimating r, and converting 1000 pairs of values into a correlation coefficient with it very definitely required a manual or electromechanical calculator. In marked contrast, the MIC requires a modern digital computer for its calculation; there is no simple formula, and no one could compute it on any calculator. MIC is thus another instance of the computer-intensive methods that now pervade statistics.

It is impossible to discuss measures of association without referring to the concept of independence. Events or measurements are termed probabilistically independent if information about some does not change the probabilities of the others. The outcomes of successive tosses of a coin are independent events: Knowledge of the outcomes of some tosses does not affect the probabilities for the outcomes of other tosses. By convention, any measure of association between two variables must be zero if the variables are independent. Such measures are also called measures of dependence. There are several other natural requirements of a good measure of dependence, including symmetry, and statisticians have struggled with the challenge of defining suitable measures since Galton introduced the correlation coefficient. Many novel measures of association have been invented, including rank correlation; maximal linear correlation after transforming both variables, an idea that has been rediscovered many times; various curve-based methods; and, most recently, distance correlation.
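
To see how a couple of these alternatives behave, here is a small sketch (synthetic data again; the distance_correlation function below is a naive O(n^2) implementation written purely for illustration, not a library routine). On a noisy, non-monotonic relationship, Pearson's r and the rank correlation are both close to zero, while the distance correlation is not.

```python
# Sketch: Pearson, Spearman (rank), and distance correlation on a noisy quadratic.
import numpy as np
from scipy.stats import spearmanr

def distance_correlation(x, y):
    """Naive O(n^2) sample distance correlation (illustrative only)."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[:, None]
    a = np.abs(x - x.T)                                  # pairwise distances in x
    b = np.abs(y - y.T)                                  # pairwise distances in y
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()    # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                               # squared distance covariance
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=500)
y = x**2 + rng.normal(scale=0.1, size=500)   # strong but non-monotonic dependence

rho, _ = spearmanr(x, y)
print(f"Pearson r:            {np.corrcoef(x, y)[0, 1]:.2f}")    # near zero
print(f"Spearman rank corr.:  {rho:.2f}")                        # near zero (not monotonic)
print(f"Distance correlation: {distance_correlation(x, y):.2f}") # clearly positive
```

Distance correlation, like MI, is zero exactly when the variables are independent, which is what qualifies it as a genuine measure of dependence in the sense above.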

To understand where the MIC comes from, we need to go back to Claude Shannon, the founder of information theory. Shannon defined the entropy of a single random variable and laid the groundwork for what we now call the mutual information, MI, of a pair of random variables. This quantity is a measure of dependence in its own right, and it was first proposed as such in 1957. Reshef et al.'s MIC is the culmination of more than 50 years of development of MI.
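
For readers who like to see a definition in action, here is a crude plug-in estimate of MI from a two-dimensional histogram (a sketch only; the choice of binning matters a great deal, which is part of the computational difficulty discussed below).

```python
# Sketch: plug-in estimate of mutual information from a 2-D histogram (in nats).
import numpy as np

def mutual_information(x, y, bins=10):
    """MI = sum_ij p_ij * log(p_ij / (p_i. * p_.j)), estimated from binned data."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()               # joint cell probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of x
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of y
    nz = p_xy > 0                            # avoid log(0)
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))

rng = np.random.default_rng(2)
x = rng.normal(size=2000)
y_dep = x**2 + rng.normal(scale=0.3, size=2000)   # dependent on x
y_ind = rng.normal(size=2000)                     # independent of x

print(f"MI, dependent pair:   {mutual_information(x, y_dep):.2f}")  # well above zero
print(f"MI, independent pair: {mutual_information(x, y_ind):.2f}")  # near zero
```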

What took so long, and wherein lies the novelty of MIC? There were three difficulties holding back MI's acceptance as the right generalization of the correlation coefficient. One was computational. It turns out to be surprisingly tricky to estimate MI well from modest amounts of data, mainly because of the need to carry out two-dimensional smoothing and to calculate logarithms of proportions. Second, unlike the correlation coefficient, MI does not automatically come with a standard numerical range or a ready interpretation of its values. A value of r = 0.5 tells us something about the nature of a cloud of points, but a value of MI = 2.2 does not. The formula [1 − exp(−2MI)]^(1/2) in (10) satisfies all the requirements for a good measure of dependence, apart from ease of computation, and ranges from 0 to 1 as we go from independence to total dependence. But Reshef et al. wanted more, and this takes us to the heart of MIC. Although r was introduced to quantify the association between two variables evident in a scatter plot, it later came to play an important secondary role as a measure of how tightly or loosely the data are spread around the regression line(s). More generally, the coefficient of determination of a set of data relative to an estimated curve is the square of the correlation between the data points and their corresponding fitted values read from the curve. In this context, Reshef et al. want their measure of association to satisfy the criterion of equitability, that is, to assign similar values to “equally noisy relationships of different types.” MI alone will not satisfy this requirement, but the three-step algorithm leading to MIC does.
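
To convey the flavour of the construction, here is a deliberately simplified sketch. It is emphatically not the Reshef et al. algorithm, which searches over optimally placed grids subject to a bound on the total number of cells; this version simply bins both variables into equal-frequency bins at a few resolutions, computes the MI of each resulting contingency table, normalizes it by log(min(kx, ky)) so that every resolution is scored between 0 and 1, and keeps the maximum. The [1 − exp(−2MI)]^(1/2) rescaling mentioned above is shown as well.

```python
# Deliberately simplified, MIC-flavoured score: NOT the Reshef et al. algorithm.
import numpy as np

def grid_mi(x, y, kx, ky):
    """MI (in nats) of the contingency table from equal-frequency binning."""
    xb = np.searchsorted(np.quantile(x, np.linspace(0, 1, kx + 1)[1:-1]), x)
    yb = np.searchsorted(np.quantile(y, np.linspace(0, 1, ky + 1)[1:-1]), y)
    joint = np.zeros((kx, ky))
    np.add.at(joint, (xb, yb), 1.0)          # fill the kx-by-ky contingency table
    p = joint / joint.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

def mic_like_score(x, y, max_bins=8):
    """Maximum over grid sizes of MI normalized by log(min(kx, ky))."""
    return max(grid_mi(x, y, kx, ky) / np.log(min(kx, ky))
               for kx in range(2, max_bins + 1)
               for ky in range(2, max_bins + 1))

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=1000)
noise = rng.normal(scale=0.2, size=1000)
print(f"noisy linear:    {mic_like_score(x, x + noise):.2f}")
print(f"noisy quadratic: {mic_like_score(x, x**2 + noise):.2f}")

# The rescaling of MI to the 0-to-1 range mentioned in the text:
mi = grid_mi(x, x + noise, 8, 8)
print(f"[1 - exp(-2 MI)]^(1/2): {np.sqrt(1.0 - np.exp(-2.0 * mi)):.2f}")
```

Whether such a crude normalization treats equally noisy relationships of different types even-handedly is precisely the equitability question that Reshef et al.'s careful grid optimization is designed to address.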

Is this the end of the Galton-Pearson correlation coefficient r? Not quite. A very important extension of the linear correlation r_XY between a pair of variables X and Y is the partial (linear) correlation r_XY.Z between X and Y while a third variable, Z, is held at some value. In the linear world, the magnitude of r_XY.Z does not depend on the value at which Z is held; in the nonlinear world, it may, and that could be very interesting. Thus, we need extensions of MIC(X,Y) to MIC(X,Y|Z). We will want to know how much data are needed to get stable estimates of MIC, how susceptible it is to outliers, what three- or higher-dimensional relationships it will miss, and more. MIC is a great step forward, but there are many more steps to take.
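
For comparison, the partial linear correlation itself is easily computed from the three pairwise correlations; the sketch below uses the standard textbook formula (nothing specific to the article) on synthetic data in which an apparent association between X and Y is entirely explained by Z.

```python
# Sketch: partial linear correlation r_XY.Z from the three pairwise correlations.
import numpy as np

def partial_corr(x, y, z):
    """r_XY.Z = (r_XY - r_XZ * r_YZ) / sqrt((1 - r_XZ^2) * (1 - r_YZ^2))."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

rng = np.random.default_rng(4)
z = rng.normal(size=1000)
x = z + rng.normal(scale=0.5, size=1000)   # X and Y are both driven by Z
y = z + rng.normal(scale=0.5, size=1000)

print(f"r_XY:   {np.corrcoef(x, y)[0, 1]:.2f}")  # apparently correlated
print(f"r_XY.Z: {partial_corr(x, y, z):.2f}")    # near zero once Z is accounted for
```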

See the original article in Science
