Friday 16 December 2011

A Correlation for the 21st Century

By Terry Speed


Most scientists will be familiar with the use of Pearson's correlation coefficient r to measure the strength of association between a pair of variables: for example, between the height of a child and the average height of their parents (r ≈ 0.5; see the figure, panel A), or between wheat yield and annual rainfall (r ≈ 0.75, panel B). However, Pearson's r captures only linear association, and its usefulness is greatly reduced when associations are nonlinear. What has long been needed is a measure that quantifies associations between variables generally, one that reduces to Pearson's in the linear case, but that behaves as we'd like in the nonlinear case. Reshef et al. now introduce the maximal information coefficient, or MIC, which can be used to determine nonlinear correlations in data sets equitably.
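To make the limitation concrete, here is a minimal sketch in Python (not taken from the article) that computes Pearson's r for a perfectly linear relationship and for a perfectly deterministic but nonlinear one. The function name `pearson_r`, the grid of x values and the two example curves are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative data (an assumption): a symmetric grid on [-5, 5].
xs = [x / 10 for x in range(-50, 51)]
linear = [2 * x + 1 for x in xs]   # y is an exact linear function of x
parabola = [x * x for x in xs]     # y is an exact nonlinear function of x

r_linear = pearson_r(xs, linear)
r_parabola = pearson_r(xs, parabola)
print(r_linear, r_parabola)
```

In both cases y is completely determined by x, yet r is essentially 1 for the line and essentially 0 for the parabola, because the positive and negative halves of the symmetric grid cancel in the covariance.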

The common correlation coefficient r was invented in 1888 by Charles Darwin's half-cousin Francis Galton. Galton's method for estimating r was very different from the one we use now, but was amenable to hand calculation for samples of up to 1000 individuals. Francis Ysidro Edgeworth and later Karl Pearson gave us the modern formula for estimating r, and it very definitely required a manual or electromechanical calculator to convert 1000 pairs of values into a correlation coefficient. In marked contrast, the MIC requires a modern digital computer for its calculation; there is no simple formula, and no one could compute it on any calculator. This is another instance of computer-intensive methods in statistics.

It is impossible to discuss measures of association without referring to the concept of independence. Events or measurements are termed probabilistically independent if information about some does not change the probabilities of the others. The outcomes of successive tosses of a coin are independent events: Knowledge of the outcomes of some tosses does not affect the probabilities for the outcomes of other tosses. By convention, any measure of association between two variables must be zero if the variables are independent. Such measures are also called measures of dependence. There are several other natural requirements of a good measure of dependence, including symmetry, and statisticians have struggled with the challenge of defining suitable measures since Galton introduced the correlation coefficient. Many novel measures of association have been invented, including rank correlation; maximal linear correlation after transforming both variables, which has been rediscovered many times since; curve-based methods; and, most recently, distance correlation.

To understand where the MIC comes from, we need to go back to Claude Shannon, the founder of information theory. Shannon defined the entropy of a single random variable, and laid the groundwork for what we now call the mutual information, MI, of a pair of random variables. This quantity turns out to be a new measure of dependence and was first proposed as such in 1957. Reshef et al.'s MIC is the culmination of more than 50 years of development of MI.
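For discrete variables, entropy and mutual information can be computed directly from a table of joint probabilities. The sketch below is not from the article: it relies on the standard identity MI(X, Y) = H(X) + H(Y) − H(X, Y), uses natural logarithms (so values are in nats), and the helper names `entropy` and `mutual_information` as well as the two example tables are illustrative assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability cells contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """Mutual information in nats from a joint probability table (list of rows)."""
    px = [sum(row) for row in joint]          # marginal distribution of X (rows)
    py = [sum(col) for col in zip(*joint)]    # marginal distribution of Y (columns)
    pxy = [p for row in joint for p in row]   # flattened joint distribution
    return entropy(px) + entropy(py) - entropy(pxy)

# Two fair coin tosses that are independent of each other.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
# Two "tosses" that always agree, that is, total dependence.
dependent = [[0.5, 0.0],
             [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

The independent table gives an MI of zero, as any measure of dependence must, while the fully dependent table gives ln 2, the entropy of a single fair coin.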

What took so long, and wherein lies the novelty of MIC? There were three difficulties holding back MI's acceptance as the right generalization of the correlation coefficient. One was computational. It turns out to be surprisingly tricky to estimate MI well from modest amounts of data, mainly because of the need to carry out two-dimensional smoothing and to calculate logarithms of proportions. Second, unlike the correlation coefficient, MI does not automatically come with a standard numerical range or a ready interpretation of its values. A value of r = 0.5 tells us something about the nature of a cloud of points, but a value of MI = 2.2 does not. The formula [1 − exp(−2MI)]^(1/2) in (10) satisfies all the requirements for a good measure of dependence, apart from ease of computation, and ranges from 0 to 1 as we go from independence to total dependence. But Reshef et al. wanted more, and this takes us to the heart of MIC. Although r was introduced to quantify the association between two variables evident in a scatter plot, it later came to play an important secondary role as a measure of how tightly or loosely the data are spread around the regression line(s). More generally, the coefficient of determination of a set of data relative to an estimated curve is the square of the correlation between the data points and their corresponding fitted values read from the curve. In this context, Reshef et al. want their measure of association to satisfy the criterion of equitability, that is, to assign similar values to “equally noisy relationships of different types.” MI alone will not satisfy this requirement, but the three-step algorithm leading to MIC does.
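The rescaling of MI onto the range 0 to 1 can be checked numerically. The sketch below is an illustration, not part of the article: it assumes MI is measured in nats and uses the standard closed form for a bivariate normal pair with correlation ρ, MI = −½ ln(1 − ρ²), under which the rescaled value equals |ρ|. The function names `mi_dependence` and `gaussian_mi` are hypothetical.

```python
import math

def mi_dependence(mi_nats):
    """Map a mutual information value (in nats) onto [0, 1) via sqrt(1 - exp(-2 MI))."""
    return math.sqrt(1.0 - math.exp(-2.0 * mi_nats))

def gaussian_mi(rho):
    """MI in nats of a bivariate normal pair with correlation rho (standard result)."""
    return -0.5 * math.log(1.0 - rho ** 2)

# For bivariate normal data the rescaled MI recovers the magnitude of rho.
for rho in (0.0, 0.5, -0.75, 0.99):
    print(rho, mi_dependence(gaussian_mi(rho)))

# The otherwise hard-to-read value MI = 2.2 lands very close to 1.
print(mi_dependence(2.2))
```

Under these assumptions, the rescaled value is 0 at independence, approaches 1 as MI grows without bound, and coincides with |r| in the bivariate normal case.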

Is this the end of the Galton-Pearson correlation coefficient r? Not quite. A very important extension of the linear correlation r_XY between a pair of variables X and Y is the partial (linear) correlation r_XY.Z between X and Y while a third variable, Z, is held at some value. In the linear world, the magnitude of r_XY.Z does not depend on the value at which Z is held; in the nonlinear world, it may, and that could be very interesting. Thus, we need extensions of MIC(X,Y) to MIC(X,Y|Z). We will want to know how much data are needed to get stable estimates of MIC, how susceptible it is to outliers, what three- or higher-dimensional relationships it will miss, and more. MIC is a great step forward, but there are many more steps to take.
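In the linear setting, the partial correlation between X and Y given Z can be obtained from the three pairwise correlations using the standard first-order formula, which the article does not spell out. The sketch below is a minimal illustration; the function name `partial_correlation` and the numerical inputs are assumptions chosen for the example.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y given Z from pairwise correlations."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator

# Illustrative values: X and Y are each moderately correlated with Z.
r_partial = partial_correlation(r_xy=0.5, r_xz=0.6, r_yz=0.7)
print(r_partial)

# If the whole X-Y correlation is explained by Z (r_xy == r_xz * r_yz),
# the partial correlation is zero up to rounding error.
r_explained = partial_correlation(r_xy=0.42, r_xz=0.6, r_yz=0.7)
print(r_explained)
```

Note that the formula involves no particular value of Z, which reflects the point that in the linear world the partial correlation does not depend on where Z is held.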

See the original article in Science

Wednesday 7 December 2011

Layer by layer

By Nicola McCarthy

The view that cancer is purely a genetic disease has taken a battering over recent years, perhaps most extensively from the recent discovery that between transcription and translation sits a whole host of regulatory RNAs, chiefly in the guise of microRNAs (miRNAs). Now, we can add yet another layer of regulation: the evidence from three papers that protein-coding and non-coding RNAs influence the interaction of miRNAs with their target RNAs.

Pier Paolo Pandolfi and colleagues had previously suggested that the miRNA response element (MRE) in the 3′ untranslated region (UTR) of RNAs could be used to decipher a network of RNAs that are bound by a common set of miRNAs. RNAs within this network would function as competing endogenous RNAs (ceRNAs) that can regulate one another by competing for specific miRNAs. Using an integrated computer analysis and an experimental validation process that they termed mutually targeted MRE enrichment (MuTaME), Tay et al. identified a set of PTEN ceRNAs in prostate cancer and glioblastoma samples. As predicted, some of these ceRNAs are regulated by the same set of miRNAs that regulate PTEN and have similar expression profiles to PTEN. For example, knockdown of the ceRNAs VAPA or CNOT6L using small interfering RNAs (siRNAs) resulted in reduced expression levels of PTEN; conversely, expression of the ceRNA 3′ UTRs to which the miRNAs bind resulted in an increase in expression of PTEN 3′ UTR–luciferase constructs. Importantly, the link between PTEN, VAPA and CNOT6L was lost in cells that had defective miRNA processing, indicating that miRNAs are crucial for these effects.

Pavel Sumazin, Xuerui Yang, Hua-Sheng Chiu, Andrea Califano and colleagues investigated the mRNA and miRNA network in glioblastoma cells. They found a surprisingly large post-transcriptional regulatory network, involving some 7,000 RNAs that can function as miRNA sponges and 148 genes that affect miRNA–RNA interactions through non-sponge effects. In tumours that have an intact or heterozygously deleted PTEN locus, expression levels of the protein vary substantially, indicating that other modulators of expression are at work. Analysis of 13 genes that are frequently deleted in patients with glioma and that encode miRNA sponges that compete with PTEN in the RNA network showed that a change in their mRNA expression had a significant effect on the level of PTEN mRNA. Specifically, siRNA-mediated silencing of ten of the 13 genes reduced PTEN levels and substantially increased proliferation of glioblastoma cells. Conversely, expression of the PTEN 3′ UTR increased the expression of these 13 miRNA sponges.

These results indicate that reduced expression of a specific set of mRNAs can affect the expression of other RNAs that form part of an miRNA–mRNA network. Moreover, they hint at the subtlety of changes that could be occurring during tumorigenesis, in which a small reduction in the expression level of a few mRNAs could have wide-ranging effects.