Logistic Regression

Menu location: Analysis_Regression and Correlation_Logistic.

This function fits and analyses logistic models for binary outcome/response data with one or more predictors.

Binomial distributions are used for handling the errors associated with regression models for binary/dichotomous responses (e.g. yes/no, dead/alive) in the same way that the normal distribution is used in general linear regression. Other, less commonly used binomial models include normit/probit and complementary log-log. The logistic model is widely used and has many desirable properties (Hosmer and Lemeshow, 1989; Armitage and Berry, 1994; Altman, 1991; McCullagh and Nelder, 1989; Cox and Snell, 1989; Pregibon, 1981).

Odds = π/(1-π)

[π = proportional response, i.e. r out of n responded so π = r/n]

Logit = log odds = log(π/(1-π))

When a logistic regression model has been fitted, estimates of π are marked with a hat symbol above the Greek letter pi to denote that the proportion is estimated from the fitted regression model. Fitted proportional responses are often referred to as event probabilities (i.e. π hat × n events are expected out of n trials).
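The odds and logit definitions above can be sketched in a few lines of Python. The function names and the figures r = 20, n = 100 are illustrative assumptions, not part of StatsDirect.

```python
import math

def odds(p):
    """Odds corresponding to a proportion p (0 <= p < 1)."""
    return p / (1.0 - p)

def logit(p):
    """Log odds of a proportion p (0 < p < 1)."""
    return math.log(odds(p))

def inverse_logit(x):
    """Proportion corresponding to a logit x."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical example: r = 20 responders out of n = 100 trials
r, n = 20, 100
pi = r / n
print(odds(pi))                  # odds of response
print(logit(pi))                 # log odds of response
print(inverse_logit(logit(pi)))  # converts the logit back to the proportion
```

The inverse logit is the step that turns a fitted logit back into a fitted proportion.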

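The difference between two logits is the log of the odds ratio, which demonstrates one of the important uses of logistic regression models:

$$\operatorname{logit}(\pi_1) - \operatorname{logit}(\pi_2) = \log\left(\frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}\right) = \log(\text{odds ratio})$$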

Logistic models provide important information about the relationship between response/outcome and exposure. It makes no difference to logistic models whether outcomes have been sampled prospectively or retrospectively; this is not the case with other binomial models.

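The general form of a logistic regression is:

$$\operatorname{logit}(\hat{\pi}) = \log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

- where π hat is the expected proportional response for the logistic model with regression coefficients b1 to bk and intercept b0 when the values for the predictor variables are x1 to xk.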

Classifier predictors

If one of the predictors in a regression model classifies observations into more than two classes (e.g. blood group) then you should consider splitting it into separate dichotomous variables as described under dummy variables.
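As a minimal sketch of this kind of splitting, the following Python codes a multi-class predictor as 0/1 indicator columns. It assumes one class is chosen as the reference and omitted, which is a common convention but not necessarily the layout StatsDirect produces; the function name and the blood group values are hypothetical.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator column per non-reference class."""
    classes = sorted(set(values) - {reference})
    return {c: [1 if v == c else 0 for v in values] for c in classes}

# Hypothetical blood groups for six observations, with group O as the reference
blood_group = ["O", "A", "B", "AB", "A", "O"]
dummies = dummy_code(blood_group, reference="O")
for name, column in dummies.items():
    print(name, column)
```

Each indicator column can then be entered as a separate dichotomous predictor.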

Data preparation

For individual responses that are dichotomous (e.g. yes/no), enter the total number as 1 and the response as 1 or 0 for each observation (usually 1 for yes and 0 for no).

For responses that are proportional, either enter the total number then the number responding or enter the total number as 1 and then a proportional response (r/n).
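The relationship between the grouped layout (total number, number responding) and the individual layout (total entered as 1, response as 1 or 0) can be illustrated with a short Python sketch. The helper name is hypothetical; the row of 60 men with 5 hypertensive is taken from the example worksheet further down this page.

```python
def expand_grouped(total, responding):
    """Expand one grouped row into `total` individual rows of (total=1, response)."""
    return [(1, 1)] * responding + [(1, 0)] * (total - responding)

# Grouped row from the example worksheet: 60 men, 5 of them hypertensive
individual_rows = expand_grouped(60, 5)
print(len(individual_rows))                          # number of individual rows
print(sum(response for _, response in individual_rows))  # number responding
```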

Rows with missing data are left out of the model. If missing data are encountered you are warned that missing data can cause bias.

Deviance and model analysis

Deviance is minus twice the log of the likelihood ratio for models fitted by maximum likelihood (Hosmer and Lemeshow, 1989; Cox and Snell, 1989; Pregibon, 1981). Log likelihood and deviance are given under the model analysis option of logistic regression in StatsDirect.

The value of adding a parameter to a logistic model can be tested by subtracting the deviance of the model with the new parameter from the deviance of the model without the new parameter. This difference is then tested against a chi-square distribution with degrees of freedom equal to the difference between the degrees of freedom of the old and new models. The model analysis option tests the model you specify against a model with only one parameter, the intercept; this tests the combined value of the specified predictors/covariates in the model.
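The deviance difference test can be sketched in Python using only the standard library. The upper-tail chi-square probability is computed here with a simple recurrence for integer degrees of freedom, which is an illustrative choice and not necessarily how StatsDirect computes it. The two deviances are those reported in the example on this page.

```python
import math

def chi_square_upper_tail(x, df):
    """P(chi-square with integer df > x), using the recurrence
    Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1) with z = x / 2."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)             # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # Q(1/2, z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

deviance_without = 14.1259   # model with the intercept only
deviance_with = 1.618403     # model with Smoking, Obesity and Snoring
extra_parameters = 3         # difference in degrees of freedom

chi_square = deviance_without - deviance_with
p_value = chi_square_upper_tail(chi_square, extra_parameters)
print(round(chi_square, 4), round(p_value, 4))
```

The result should agree with the deviance (likelihood ratio) chi-square and P value shown in the example output.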

Some statistical packages offer stepwise logistic regression that performs systematic tests for different combinations of predictors/covariates. Automatic model building procedures such as these can be misleading because they do not consider the real-world importance of each predictor; for this reason StatsDirect does not include stepwise selection.

Three goodness of fit tests are given for the overall fit of a model: Pearson, deviance and Hosmer-Lemeshow (Hosmer and Lemeshow, 1989). Note that the Hosmer-Lemeshow (decile of risk) test is only applicable when the number of observations tied at any one covariate pattern is small in comparison with the total number of observations, and when all predictors are continuous variables.

Influential data and odds ratios

The following are provided under the fits and residuals option for the purpose of identifying influential data:

leverages (diagonal elements of the logistic "hat" matrix)
deviance residuals
Pearson residuals
standardized variances
deletion displacements
covariance

Approximate confidence intervals are given for the odds ratios derived from the covariates.
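As an illustration, an odds ratio and an approximate confidence interval can be derived from a coefficient and its standard error. The sketch below assumes a Wald-type interval, exp(b ± z × SE); this assumption reproduces the obesity figures in the example on this page, but the function name is hypothetical and the method is not confirmed as the one StatsDirect uses.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(coefficient, standard_error, level=0.95):
    """Odds ratio with a Wald-type confidence interval (an assumed method)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for 95%
    return (math.exp(coefficient),
            math.exp(coefficient - z * standard_error),
            math.exp(coefficient + z * standard_error))

# Obesity coefficient and standard error from the example output
odds_ratio, lower, upper = odds_ratio_ci(0.69531, 0.285085)
print(round(odds_ratio, 4), round(lower, 4), round(upper, 4))
```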

Bootstrap estimates

A bootstrap procedure may be used to cross-validate confidence intervals calculated for odds ratios derived from fitted logistic models (Efron and Tibshirani, 1997; Gong, 1986). The bootstrap confidence intervals used here are the 'bias-corrected' type.

The mechanism that StatsDirect uses is to draw a specified number of random samples (with replacement, i.e. some observations are drawn once only, others more than once and some not at all) from your data. These 're-samples' are fed back into the logistic regression and 'bootstrap' estimates of confidence intervals for the model parameters are made by examining the model parameters calculated at each cycle of the process. The bias statistic shows how much each mean model parameter from the bootstrap distribution deviates from observed model parameters.
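The resampling step and the bias statistic can be sketched as follows. This is a simplified illustration: a sample proportion stands in for refitting the logistic model at each cycle, the function names are hypothetical, and the bias-corrected interval itself is not implemented.

```python
import random
from statistics import mean

def bootstrap_estimates(rows, estimator, n_resamples=1000, seed=1):
    """Apply `estimator` to n_resamples samples drawn with replacement from rows."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_resamples):
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return estimates

def bootstrap_bias(rows, estimator, **kwargs):
    """Mean of the bootstrap estimates minus the estimate from the observed data."""
    return mean(bootstrap_estimates(rows, estimator, **kwargs)) - estimator(rows)

def proportion(responses):
    """Stand-in estimator; a real analysis would refit the model here."""
    return sum(responses) / len(responses)

# Hypothetical individual responses: 5 responders among 60 observations
responses = [1] * 5 + [0] * 55
estimates = bootstrap_estimates(responses, proportion, n_resamples=200, seed=1)
print(min(estimates), max(estimates))
print(bootstrap_bias(responses, proportion, n_resamples=200, seed=1))
```

Fixing the seed makes the re-samples reproducible, which helps when comparing runs.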

Classification and ROC curve

The confidence interval given with the likelihood ratios in the classification option is constructed using the robust approximation given by Koopman (1984) for ratios of binomial proportions. The 'near' cut-off in the classification option is the rounding cut-off that gives the maximum sum of sensitivity and specificity. This value should be the shoulder at the top left of the ROC (receiver operating characteristic) curve.
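A minimal sketch of finding the cut-off that maximises the sum of sensitivity and specificity is given below. The fitted probabilities and outcomes are hypothetical, the function name is an assumption, and ties are resolved in favour of the lowest cut-off, which may differ from the rule StatsDirect applies.

```python
def best_cutoff(probabilities, outcomes):
    """Return (cutoff, sensitivity, specificity) with the largest
    sensitivity + specificity, classifying as positive when p >= cutoff."""
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    best = None
    for cutoff in sorted(set(probabilities)):
        true_pos = sum(1 for p, y in zip(probabilities, outcomes) if p >= cutoff and y == 1)
        true_neg = sum(1 for p, y in zip(probabilities, outcomes) if p < cutoff and y == 0)
        sensitivity = true_pos / positives
        specificity = true_neg / negatives
        if best is None or sensitivity + specificity > best[1] + best[2]:
            best = (cutoff, sensitivity, specificity)
    return best

# Hypothetical fitted probabilities and observed outcomes (1 = event)
fitted = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
observed = [0, 0, 0, 1, 0, 1, 1, 1]
print(best_cutoff(fitted, observed))
```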

Prediction and adjusted means

The prediction option allows you to calculate values of the outcome (as response proportion) using your fitted logistic model coefficients with a specified set of values for the predictors (X1…p). A confidence interval is given for each prediction.
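The point prediction can be sketched by evaluating the fitted linear predictor and converting the logit back to a proportion. The coefficients are those reported in the example on this page; the function name is hypothetical, and the confidence interval is omitted because it needs the covariance matrix of the coefficients.

```python
import math

# Coefficients reported in the example output on this page
coefficients = {"Intercept": -2.377661, "Smoking": -0.067775,
                "Obesity": 0.69531, "Snoring": 0.871939}

def predicted_proportion(coefficients, x):
    """Inverse logit of b0 + b1*x1 + ... + bk*xk for predictor values x."""
    linear_predictor = coefficients["Intercept"] + sum(
        coefficients[name] * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Non-smoking, obese man who snores
p_obese_snorer = predicted_proportion(
    coefficients, {"Smoking": 0, "Obesity": 1, "Snoring": 1})
# Man with none of the three exposures
p_baseline = predicted_proportion(
    coefficients, {"Smoking": 0, "Obesity": 0, "Snoring": 0})
print(round(p_obese_snorer, 4), round(p_baseline, 4))
```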

The default X values shown are those required to calculate the overall regression mean for the model, which is the mean of Y adjusted for all X. For continuous predictors the mean of X is used. For categorical predictors you should use X as 1/k, where k is the number of categories. StatsDirect attempts to identify categorical variables but you should check the values against these rules if you are using categorical predictors in this way.

For example, if a model of Y = logit(proportion of population who are hypertensive), X1 = sex, X2 = age was fitted, and you wanted to know the age and sex adjusted prevalence of hypertension in the population that you sampled, you could use the prediction function to give the regression mean as the answer, i.e. with X1 as 0.5 and X2 as mean age. If you wanted to know the age-adjusted prevalence of hypertension for males in your population then you would set X1 to 1 (if male sex is coded as 1 in your data).

Further methods

GLIM provides many generalised linear models with link functions including binomial (see non-linear models). SAS provides an extension of logistic regression to ordinal responses, known as ordered logistic regression. Generic modelling software such as R and S+ can also be used. Exploratory regression modelling should be attempted only under the expert guidance of a statistician.

Technical validation

The logits of the response data are fitted using an iteratively re-weighted least squares method to find maximum likelihood estimates of the parameters in the logistic model (McCullagh and Nelder, 1989; Cox and Snell, 1989; Pregibon, 1981).
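A minimal sketch of iteratively re-weighted least squares for grouped binomial data is shown below, using only the Python standard library. It is an illustration of the general technique, not StatsDirect's implementation; the function names, the zero starting values and the convergence rule are assumptions. The data are the example worksheet from this page.

```python
import math

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_logistic(totals, responses, predictors, tolerance=1e-7, max_iterations=25):
    """Maximum likelihood coefficients (intercept first) for grouped data."""
    design = [[1.0] + list(row) for row in predictors]  # leading 1 for the intercept
    k = len(design[0])
    beta = [0.0] * k
    for _ in range(max_iterations):
        xtwx = [[0.0] * k for _ in range(k)]
        xtwz = [0.0] * k
        for m_j, y_j, x_j in zip(totals, responses, design):
            eta = sum(b * x for b, x in zip(beta, x_j))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = m_j * p * (1.0 - p)            # weight for this covariate pattern
            z = eta + (y_j - m_j * p) / w      # working response
            for r in range(k):
                xtwz[r] += x_j[r] * w * z
                for c in range(k):
                    xtwx[r][c] += x_j[r] * w * x_j[c]
        new_beta = solve(xtwx, xtwz)           # weighted least squares step
        change = max(abs(new - old) for new, old in zip(new_beta, beta))
        beta = new_beta
        if change < tolerance:
            break
    return beta

men = [60, 17, 8, 2, 187, 85, 51, 23]
hypertensive = [5, 2, 1, 0, 35, 13, 15, 8]
# Columns: Smoking, Obesity, Snoring
exposures = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0),
             (0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
fitted_coefficients = fit_logistic(men, hypertensive, exposures)
print([round(c, 4) for c in fitted_coefficients])
```

The printed coefficients should be close to the Constant, Smoking, Obesity and Snoring coefficients in the example output.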

Residuals and case-wise diagnostic statistics are calculated as follows (Hosmer and Lemeshow, 1989):

Leverages are the diagonal elements of the logistic equivalent of the hat matrix in general linear regression (where leverages are proportional to the distances of the jth covariate pattern from the mean of the data). The jth diagonal element of the logistic equivalent of the hat matrix is calculated as:

$$h_j = m_j \hat{\pi}_j (1 - \hat{\pi}_j)\, x_j^{\top} (X^{\top} V X)^{-1} x_j$$

- where m_j is the number of trials with the jth covariate pattern, π hat is the expected proportional response, x_j is the jth covariate pattern, X is the design matrix containing all covariates (first column as 1 if intercept calculated) and V is a diagonal matrix with the general element m_j π hat(1 - π hat).
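The leverage calculation can be sketched for the example data on this page, assuming pattern-level weights m_j π hat(1 - π hat) on the diagonal of V and the coefficients reported in the example output. The function names are hypothetical. A useful check is that the leverages sum to the number of fitted parameters.

```python
import math

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def leverages(totals, design, beta):
    """h_j = v_j * x_j' (X'VX)^-1 x_j with v_j = m_j * p_j * (1 - p_j)."""
    k = len(beta)
    weights = []
    for m_j, x_j in zip(totals, design):
        eta = sum(b * x for b, x in zip(beta, x_j))
        p = 1.0 / (1.0 + math.exp(-eta))
        weights.append(m_j * p * (1.0 - p))
    xtvx = [[sum(v * x[r] * x[c] for v, x in zip(weights, design))
             for c in range(k)] for r in range(k)]
    result = []
    for v_j, x_j in zip(weights, design):
        u = solve(xtvx, list(x_j))            # u = (X'VX)^-1 x_j
        result.append(v_j * sum(xi * ui for xi, ui in zip(x_j, u)))
    return result

men = [60, 17, 8, 2, 187, 85, 51, 23]
# Columns: intercept, Smoking, Obesity, Snoring
design = [(1, 0, 0, 0), (1, 1, 0, 0), (1, 0, 1, 0), (1, 1, 1, 0),
          (1, 0, 0, 1), (1, 1, 0, 1), (1, 0, 1, 1), (1, 1, 1, 1)]
beta = [-2.377661, -0.067775, 0.69531, 0.871939]   # from the example output
h = leverages(men, design, beta)
print([round(value, 4) for value in h])
print(round(sum(h), 6))   # equals the number of fitted parameters
```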

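Deviance residuals are used to detect ill-fitting covariate patterns, and they are calculated as:

$$d_j = \pm\sqrt{2\left[\,y_j \ln\left(\frac{y_j}{m_j \hat{\pi}_j}\right) + (m_j - y_j)\ln\left(\frac{m_j - y_j}{m_j (1 - \hat{\pi}_j)}\right)\right]}$$

- where m_j is the number of trials with the jth covariate pattern, π hat is the expected proportional response and y_j is the number of successes with the jth covariate pattern. The sign of d_j is the sign of (y_j - m_j π hat).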

Pearson residuals are used to detect ill-fitting covariate patterns, and they are calculated as:

$$r_j = \frac{y_j - m_j \hat{\pi}_j}{\sqrt{m_j \hat{\pi}_j (1 - \hat{\pi}_j)}}$$

- where m_j is the number of trials with the jth covariate pattern, π hat is the expected proportional response and y_j is the number of successes with the jth covariate pattern.
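Pearson and deviance residuals can be sketched for the example data on this page using the coefficients from the example output. The formulas are the standard ones in Hosmer and Lemeshow (1989), taken here as an assumption about what StatsDirect computes; the function names are hypothetical. The sums of the squared residuals give the Pearson and deviance goodness of fit statistics.

```python
import math

men = [60, 17, 8, 2, 187, 85, 51, 23]
hypertensive = [5, 2, 1, 0, 35, 13, 15, 8]
# Columns: Smoking, Obesity, Snoring
exposures = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0),
             (0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
intercept = -2.377661
slopes = (-0.067775, 0.69531, 0.871939)   # from the example output

def fitted_proportion(x):
    """Expected proportional response for covariate pattern x."""
    eta = intercept + sum(b * v for b, v in zip(slopes, x))
    return 1.0 / (1.0 + math.exp(-eta))

def pearson_residual(m, y, p):
    return (y - m * p) / math.sqrt(m * p * (1.0 - p))

def deviance_residual(m, y, p):
    def term(observed, expected):
        # A zero count contributes nothing to the deviance
        return observed * math.log(observed / expected) if observed > 0 else 0.0
    squared = 2.0 * (term(y, m * p) + term(m - y, m * (1.0 - p)))
    # The residual takes the sign of (observed - expected)
    return math.copysign(math.sqrt(max(squared, 0.0)), y - m * p)

proportions = [fitted_proportion(x) for x in exposures]
pearson = [pearson_residual(m, y, p) for m, y, p in zip(men, hypertensive, proportions)]
deviance = [deviance_residual(m, y, p) for m, y, p in zip(men, hypertensive, proportions)]

pearson_chi_square = sum(r * r for r in pearson)
deviance_chi_square = sum(d * d for d in deviance)
print(round(pearson_chi_square, 3), round(deviance_chi_square, 3))
```

The two sums should be close to the Pearson chi-square and deviance goodness of fit values in the model analysis output.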

Standardized Pearson residuals are used to detect ill-fitting covariate patterns, and they are calculated as:

$$\mathit{rs}_j = \frac{r_j}{\sqrt{1 - h_j}}$$

- where r_j is the Pearson residual for the jth covariate pattern and h_j is the leverage for the jth covariate pattern.

Deletion displacement (delta beta) measures the change caused by deleting all observations with the jth covariate pattern. The statistic is used to detect observations that have a strong influence upon the regression estimates. This change in regression coefficients is calculated as:

$$\Delta\hat{\beta}_j = \frac{r_j^{2}\, h_j}{1 - h_j}$$

- where r_j is the Pearson residual for the jth covariate pattern and h_j is the leverage for the jth covariate pattern.

Standardized deletion displacement (std delta beta) measures the change caused by deleting all observations with the jth covariate pattern. The statistic is used to detect observations that have a strong influence upon the regression estimates. This change in regression coefficients is calculated as:

$$\Delta\hat{\beta}_{sj} = \frac{\mathit{rs}_j^{\,2}\, h_j}{1 - h_j}$$

- where rs_j is the standardized Pearson residual for the jth covariate pattern and h_j is the leverage for the jth covariate pattern.

Deletion chi-square (delta chi-square) measures the change in the Pearson chi-square statistic (for the fit of the regression) caused by deleting all observations with the jth covariate pattern. The statistic is used to detect ill-fitting covariate patterns. This change in chi-square is calculated as:

$$\Delta\chi^{2}_j = \frac{r_j^{2}}{1 - h_j}$$

- where r_j is the Pearson residual for the jth covariate pattern and h_j is the leverage for the jth covariate pattern.

Example

From Altman (1991).

Test workbook (Regression worksheet: Men, Hypertensive, Smoking, Obesity, Snoring).

Smoking, obesity and snoring were related to hypertension in 433 men aged 40 or over.

Men	Hypertensive	Smoking	Obesity	Snoring
60	5	0	0	0
17	2	1	0	0
8	1	0	1	0
2	0	1	1	0
187	35	0	0	1
85	13	1	0	1
51	15	0	1	1
23	8	1	1	1

To analyse these data using StatsDirect you must first enter them into five columns of a workbook. Alternatively, open the test workbook using the file open function of the file menu. Then select Logistic from the Regression and Correlation section of the analysis menu. Choose the option to enter grouped data when prompted. Select the column marked "Men" when asked for total number and select "Hypertensive" when asked for response. Then select "Smoking", "Obesity" and "Snoring" in one action when you are asked for predictors. Make sure the intercept option is checked and the weighted analysis option is unchecked.

For this example:

Logistic regression

Deviance goodness of fit chi-square = 1.618403	df = 4	P = 0.8055
Deviance (likelihood ratio) chi-square = 12.507498	df = 3	P = 0.0058

Parameter	Odds Ratio	95% Conf. Int.	Z Value	P(>|Z|)
Intercept	n/a		-6.253967	P < .0001
Smoking	0.934471	(0.541784 to 1.611779)	-0.243686	P = .8075
Obesity	2.00433	(1.146316 to 3.504564)	2.438954	P = .0147
Snoring	2.391544	(1.097143 to 5.213072)	2.193152	P = .0283

logit Hypertensive = -2.377661 -0.067775 Smoking +0.69531 Obesity +0.871939 Snoring

Thus with 95% confidence we can infer that the odds of hypertension in obese people are between 1.15 and 3.5 times the odds in non-obese people.

Logistic regression - model analysis

Accuracy = 1.00E-07

Log likelihood with all covariates = -199.4582

Deviance with all covariates = 1.618403, df = 4, rank = 4

Akaike = 9.618403

Schwartz = 12.01561

Deviance with no covariates = 14.1259

Deviance (likelihood ratio) chi-square = 12.507498, df = 3, P = 0.0058

Pearson chi-square goodness of fit = 1.364272, df = 4, P = 0.8504

Deviance goodness of fit = 1.618403, df = 4, P = 0.8055

Hosmer-Lemeshow test = 0.453725, df = 2, P = 0.797

Parameter	Coefficient	Standard Error
Constant	-2.377661	0.380185
Smoking	-0.067775	0.278124
Obesity	0.69531	0.285085
Snoring	0.871939	0.397574

These data provide no evidence that smoking is associated with hypertension (P = 0.8075), so smoking could be dropped from the model. Remember that there may be important interactions between predictors. The fits and residuals option gives you the covariances. It would be prudent to seek statistical advice on the interpretation of covariance and influential data.
