Transforming Data

Practical issues

Definition: transformation is a mathematical operation that changes the measurement scale of a variable.
Stabilizing variance: e.g. log, square root.
Normalizing: e.g. square root for Poisson data, log for odds.
Reducing the effect of outliers: e.g. reciprocal.
Making a measurement scale more meaningful: e.g. number needed to treat from absolute risk reduction.
Linearizing a relationship: e.g. weight proportional to length cubed, bone strength proportional to length squared.
Negative values are a problem with log and square root transformations. In order to overcome this problem, an appropriate constant may be added to the original value before taking a log or square root; it is best to seek the advice of a statistician on the choice of constant.
Ranking data is a powerful normalizing technique as it pulls in both tails of a distribution but important information can be lost in doing so.
Trim points are an alternative to transformation with skewed data: e.g. use of mean ± 3 standard deviations or median ± 1.5 * inter-quartile range, instead of a transformation such as log/geometric mean.
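The ladder of powers of transformations (1/x², 1/x, ln(x), sqrt(x), x²) is ordered by power, with ln(x) taking the place of power zero. The lower the power, the more strongly the right hand tail of a distribution is pulled in: sqrt(x) has the mildest effect, ln(x) is stronger, and 1/x and 1/x² are stronger still, whereas x² stretches the right hand tail.
Selecting the best transformation can be a complex issue; a combination of exploratory techniques such as Box-Cox, Manly, skewness index and QQ plots may be required, so it is best to involve a statistician with this.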
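The effect of the ladder of powers on the right hand tail can be sketched with a small simulation. The sample below is hypothetical (log-normal random draws), the skewness function is a simple moment-based estimate, and the reciprocals are negated so that every transformation keeps the observations in the same order; none of these choices come from the text above.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based sample skewness (third standardized moment)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed, strictly positive sample (log-normal draws).
rng = random.Random(1)
sample = [rng.lognormvariate(0.0, 0.5) for _ in range(5000)]

# Ladder of powers, highest power first. Reciprocals are negated so that
# each transformation preserves the ordering of the observations.
ladder = [
    ("x^2", lambda x: x ** 2),
    ("x", lambda x: x),
    ("sqrt(x)", math.sqrt),
    ("ln(x)", math.log),
    ("-1/x", lambda x: -1.0 / x),
    ("-1/x^2", lambda x: -1.0 / x ** 2),
]

skews = {name: skewness([f(x) for x in sample]) for name, f in ladder}
for name, value in skews.items():
    print(f"{name:>8}: skewness = {value:6.2f}")
```

Positive skewness indicates a long right hand tail. Moving down the ladder the skewness should fall steadily; for this particular sample the log is close to symmetric and the reciprocals over-correct into a left hand tail, which is why the choice of step depends on the data.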

Basics

Transformation is a mathematical operation that changes the measurement scale of a variable. This is usually done to make a set of data usable with a particular statistical test or method.

Many statistical methods require data that follow a particular kind of distribution, usually a normal distribution. All of the observations must come from a population that follows a normal distribution. Groups of observations must come from populations that have the same variance or standard deviation. Transformations that normalize a distribution commonly make the variance more uniform and vice versa.

If a population with a normal distribution is sampled at random then the means of the samples will not be correlated with the standard deviations of the samples. This partly explains why normalizing transformations also make variances uniform. The Central Limit Theorem (the means of a large number of random samples tend to follow a normal distribution, even when the population sampled is not itself normal) is a key to understanding this situation.
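The lack of correlation between sample means and sample standard deviations can be checked with a small simulation. This is only a sketch: the normal population (mean 100, standard deviation 15), the skewed log-normal comparison population, the sample size of 10 and the number of samples are all assumptions chosen for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def mean_sd_correlation(draw, n_samples=2000, sample_size=10):
    """Correlate the means and standard deviations of repeated samples."""
    means, sds = [], []
    for _ in range(n_samples):
        sample = [draw() for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
        sds.append(statistics.stdev(sample))
    return pearson(means, sds)


rng = random.Random(7)
# Hypothetical populations: one normal, one right-skewed (log-normal).
r_normal = mean_sd_correlation(lambda: rng.gauss(100.0, 15.0))
r_skewed = mean_sd_correlation(lambda: rng.lognormvariate(0.0, 1.0))

print(f"normal population: r(mean, sd) = {r_normal:.3f}")
print(f"skewed population: r(mean, sd) = {r_skewed:.3f}")
```

For the normal population the correlation should be close to zero, while for the skewed population samples with larger means also tend to have larger standard deviations.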

Many biomedical observations will be a product of different influences, for example the resistance of blood vessels and output from the heart are two of the influences most closely related to blood pressure. In mathematical terms these influences usually multiply together to give an overall influence, so if we take the logarithm of the overall influence then this is the sum of the logarithms of the individual influences [log(A * B) = log(A) + log(B)]. The Central Limit Theorem thus suggests that the logarithm of the product of several independent influences tends to follow an approximately normal distribution.
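The argument about multiplied influences can be illustrated with a simulation. In this sketch each observation is the product of five independent influences drawn from a uniform distribution between 0.5 and 1.5; the number of influences, their distribution and the moment-based skewness estimate are assumptions made purely for illustration.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based sample skewness (third standardized moment)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)


rng = random.Random(42)


def overall_influence(n_factors=5):
    """Product of several independent, positive, hypothetical influences."""
    product = 1.0
    for _ in range(n_factors):
        product *= rng.uniform(0.5, 1.5)
    return product


products = [overall_influence() for _ in range(5000)]
logs = [math.log(p) for p in products]

skew_raw = skewness(products)
skew_log = skewness(logs)
print(f"skewness of product:      {skew_raw:.2f}")
print(f"skewness of log(product): {skew_log:.2f}")
```

The raw products should show a clear right hand tail, whereas their logarithms, being sums of independent terms, should be much closer to symmetric.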

Another general rule is that any relationship between mean and variance is usually simple: for example, variance proportional to the group mean, to the square of the mean, or to some other power of the mean. A transformation is used to cancel out this relationship and thus make the variance independent of the mean. The most common situation is for the variance to be proportional to the square of the mean (i.e. the standard deviation is proportional to the mean); here log transformation is used (e.g. serum cholesterol). Square root transformation is used when the variance is proportional to the mean, for example with Poisson distributed data. Observations that are counted in time and/or space (e.g. cases of meningococcal meningitis in a city in a year) often follow a Poisson distribution; here the mean is equal to the variance. With highly variable quantities such as serum creatinine the standard deviation is often proportional to the square of the mean (i.e. the variance is proportional to the mean to the power of 4); here the reciprocal transformation (1/X) is used.
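The Poisson case can be sketched as follows. The group means (10, 40 and 160) are hypothetical, and because the Python standard library has no Poisson generator the counts are drawn with Knuth's multiplication method, which is an implementation choice for this example only.

```python
import math
import random
import statistics


def poisson_draw(mean, rng):
    """Draw one Poisson count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(3)
group_means = [10, 40, 160]  # hypothetical group means
raw_var = {}
sqrt_var = {}

for mu in group_means:
    counts = [poisson_draw(mu, rng) for _ in range(4000)]
    raw_var[mu] = statistics.variance(counts)
    sqrt_var[mu] = statistics.variance([math.sqrt(c) for c in counts])
    print(f"mean {mu:>3}: variance = {raw_var[mu]:7.2f}, "
          f"variance of sqrt = {sqrt_var[mu]:.3f}")
```

On the original scale the variance should rise roughly in step with the mean, while after the square root transformation it should stay near one quarter in every group, so the variance no longer depends on the mean.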

Transformations that cancel out the relationship between variance and mean also usually normalize the distribution of the data. Common statistical methods can then be used on the transformed data. Only some of the results of such tests, however, can be converted back to the original measurement scale of the data; the rest must be expressed in terms of the transformed variable(s) (e.g. log(serum triglyceride) as a predictor in a regression model). An example of a back-transformed statistic is the geometric mean and its confidence interval: the antilog of the mean of log-transformed data is the geometric mean, and its confidence interval is the antilog of the confidence interval for the mean of the log-transformed data.
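The back-transformation of the geometric mean and its confidence interval can be sketched as below. The triglyceride values are invented for illustration, natural logarithms are used (so the antilog is the exponential function), and the critical value 2.262 is the two-sided 95% point of Student's t distribution with 9 degrees of freedom, supplied by hand because the standard library does not provide t quantiles.

```python
import math
import statistics


def geometric_mean_ci(values, t_crit):
    """Geometric mean and confidence interval via the log scale.

    t_crit is the t critical value for len(values) - 1 degrees of freedom.
    """
    logs = [math.log(v) for v in values]
    mean_log = statistics.fmean(logs)
    se_log = statistics.stdev(logs) / math.sqrt(len(logs))
    lower_log = mean_log - t_crit * se_log
    upper_log = mean_log + t_crit * se_log
    # Antilog (exp) converts each quantity back to the original scale.
    return math.exp(mean_log), math.exp(lower_log), math.exp(upper_log)


# Hypothetical serum triglyceride values (mmol/L), for illustration only.
triglyceride = [0.8, 1.1, 1.3, 1.6, 0.9, 2.4, 1.2, 3.1, 1.0, 1.7]
T_CRIT_95_DF9 = 2.262  # two-sided 95% t critical value, 9 degrees of freedom

gm, ci_low, ci_high = geometric_mean_ci(triglyceride, T_CRIT_95_DF9)
print(f"geometric mean = {gm:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Note that the back-transformed interval is not symmetric about the geometric mean, and the geometric mean is smaller than the arithmetic mean of the same data.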

If you are unsure about the use of a transformation then take the advice of a statistician. There are exploratory statistical techniques (Box-Cox, QQ plots etc.) that statisticians can use to help find an optimal transformation for your data. Proper application of such techniques requires specialist statistical knowledge and skills.