Poisson Regression
Menu location: Analysis_Regression and Correlation_Poisson
This function fits a Poisson regression model for multivariate analysis of numbers of uncommon events in cohort studies.
The multiplicative Poisson regression model is fitted as a log-linear regression (i.e. a log link and a Poisson error distribution), with an offset equal to the natural logarithm of person-time if person-time is specified (McCullagh and Nelder, 1989; Frome, 1983; Agresti, 2002). With the multiplicative Poisson model, the exponents of coefficients are equal to the incidence rate ratio (relative risk). These baseline relative risks give values relative to named covariates for the whole population. You can define relative risks for a sub-population by multiplying that sub-population's baseline relative risk with the relative risks due to other covariate groupings, for example the relative risk of dying from lung cancer if you are a smoker who has lived in a high radon area. StatsDirect offers sub-population relative risks for dichotomous covariates.
The outcome/response variable is assumed to come from a Poisson distribution. Note that a Poisson distribution is the distribution of the number of events in a fixed time interval, provided that the events occur at random, independently in time and at a constant rate. Poisson distributions are used for modelling events per unit space as well as time, for example number of particles per square centimetre.
Poisson regression can also be used for log-linear modelling of contingency table data, and for multinomial modelling. For contingency table counts you would create r + c indicator/dummy variables as the covariates, representing the r rows and c columns of the contingency table:
r1c1 | r1c2 | r1c3 |
r2c1 | r2c2 | r2c3 |
r3c1 | r3c2 | r3c3 |
Response | x_r1 | x_r2 | x_r3 | x_c1 | x_c2 | x_c3 |
r1c1 | 1 | 0 | 0 | 1 | 0 | 0 |
r1c2 | 1 | 0 | 0 | 0 | 1 | 0 |
r1c3 | 1 | 0 | 0 | 0 | 0 | 1 |
r2c1 | 0 | 1 | 0 | 1 | 0 | 0 |
r2c2 | 0 | 1 | 0 | 0 | 1 | 0 |
r2c3 | 0 | 1 | 0 | 0 | 0 | 1 |
r3c1 | 0 | 0 | 1 | 1 | 0 | 0 |
r3c2 | 0 | 0 | 1 | 0 | 1 | 0 |
r3c3 | 0 | 0 | 1 | 0 | 0 | 1 |
Adequacy of the model
In order to assess the adequacy of the Poisson regression model you should first look at the basic descriptive statistics for the event count data. If the count mean and variance are very different (equivalent in a Poisson distribution) then the model is likely to be over-dispersed.
The model analysis option gives a scale parameter (sp) as a measure of over-dispersion; this is equal to the Pearson chi-square statistic divided by the number of observations minus the number of parameters (covariates and intercept). The variances of the coefficients can be adjusted by multiplying by sp. The goodness of fit test statistics and residuals can be adjusted by dividing by sp. Using a quasi-likelihood approach sp could be integrated with the regression, but this would assume a known fixed value for sp, which is seldom the case. A better approach to over-dispersed Poisson models is to use a parametric alternative model, the negative binomial.
The deviance (likelihood ratio) test statistic, G², is the most useful summary of the adequacy of the fitted model. It represents the change in deviance between the fitted model and the model with a constant term and no covariates; therefore G² is not calculated if no constant is specified. If this test is significant then the covariates contribute significantly to the model.
The deviance goodness of fit test reflects the fit of the data to a Poisson distribution in the regression. If this test is significant then a red asterisk is shown by the P value, and you should consider other covariates and/or other error distributions such as negative binomial.
StatsDirect does not exclude/drop covariates from its Poisson regression if they are highly correlated with one another. Models that are not of full (rank = number of parameters) rank are fully estimated in most circumstances, but you should usually consider combining or excluding variables, or possibly excluding the constant term. You should seek expert statistical if you find yourself in this situation.
Technical validation
The deviance function is:
- where y is the number of events, n is the number of observations and μ is the fitted Poisson mean.
The log-likelihood function is:
The maximum likelihood regression proceeds by iteratively re-weighted least squares, using singular value decomposition to solve the linear system at each iteration, until the change in deviance is within the specified accuracy.
The Pearson chi-square residual is:
The Pearson goodness of fit test statistic is:
The deviance residual is (Cook and Weisberg, 1982):
-where D(observation, fit) is the deviance and sgn(x) is the sign of x.
The Freeman-Tukey, variance stabilized, residual is (Freeman and Tukey, 1950):
The standardized residual is:
- where h is the leverage (diagonal of the Hat matrix).
Example
Test workbook (Regression worksheet: Cancers, Subject-years, Veterans, Age group).
To analyse these data using StatsDirect you must first open the test workbook using the file open function of the file menu. Next generate a set of dummy variables to represent the levels of the "Age group" variable using the Dummy Variables function of the Data menu. Then select Poisson from the Regression and Correlation section of the Analysis menu. Click on the option "Counts of events and exposure (person-time), and select the response data type as "Individual". Select the column marked "Cancers" when asked for the response. Then select "Subject-years" when asked for person-time. Then select "Veterans", "Age group (25-29)" , "Age group (30-34)" etc. in one action when you are asked for predictors.
For this example:
Poisson regression
Deviance (likelihood ratio) chi-square = 2067.700372 df = 11 P < 0.0001
Intercept | b0 = -9.324832 | z = -45.596773 | P < 0.0001 |
Veterans | b1 = -0.003528 | z = -0.063587 | P = 0.9493 |
Age group (25-29) | b2 = 0.679314 | z = 2.921869 | P = 0.0035 |
Age group (30-34) | b3 = 1.371085 | z = 6.297824 | P < 0.0001 |
Age group (35-39) | b4 = 1.939619 | z = 9.14648 | P < 0.0001 |
Age group (40-44) | b5 = 2.034323 | z = 9.413835 | P < 0.0001 |
Age group (45-49) | b6 = 2.726551 | z = 12.269534 | P < 0.0001 |
Age group (50-54) | b7 = 3.202873 | z = 14.515926 | P < 0.0001 |
Age group (55-59) | b8 = 3.716187 | z = 17.064363 | P < 0.0001 |
Age group (60-64) | b9 = 4.092676 | z = 18.801188 | P < 0.0001 |
Age group (65-69) | b10 = 4.23621 | z = 18.892791 | P < 0.0001 |
Age group (70+) | b11 = 4.363717 | z = 19.19183 | P < 0.0001 |
log Cancers [offset log(Veterans)] = -9.324832 -0.003528 Veterans +0.679314 Age group (25-29) +1.371085 Age group (30-34) +1.939619 Age group (35-39) +2.034323 Age group (40-44) +2.726551 Age group (45-49) +3.202873 Age group (50-54) +3.716187 Age group (55-59) +4.092676 Age group (60-64) +4.23621 Age group (65-69) +4.363717 Age group (70+)
Poisson regression - incidence rate ratios
Inference population: whole study (baseline risk)
Parameter | Estimate | IRR | 95% CI |
Veterans | -0.003528 | 0.996479 | 0.89381 to 1.11094 |
Age group (25-29) | 0.679314 | 1.972524 | 1.250616 to 3.111147 |
Age group (30-34) | 1.371085 | 3.939622 | 2.571233 to 6.036256 |
Age group (35-39) | 1.939619 | 6.956098 | 4.590483 to 10.540786 |
Age group (40-44) | 2.034323 | 7.647073 | 5.006696 to 11.679905 |
Age group (45-49) | 2.726551 | 15.280093 | 9.884869 to 23.620062 |
Age group (50-54) | 3.202873 | 24.60311 | 15.96527 to 37.914362 |
Age group (55-59) | 3.716187 | 41.107367 | 26.825601 to 62.992647 |
Age group (60-64) | 4.092676 | 59.899957 | 39.096281 to 91.773558 |
Age group (65-69) | 4.23621 | 69.145275 | 44.555675 to 107.305502 |
Age group (70+) | 4.363717 | 78.54856 | 50.303407 to 122.653248 |
Poisson regression - model analysis
Accuracy = 1.00E-07
Log likelihood with all covariates = -66.006668
Deviance with all covariates = 5.217124, df = 10, rank = 12
Akaike information criterion = 29.217124
Schwartz information criterion = 45.400676
Deviance with no covariates = 2072.917496
Deviance (likelihood ratio, G²) = 2067.700372, df = 11, P < 0.0001
Pseudo (McFadden) R-square = 0.997483
Pseudo (likelihood ratio index) R-square = 0.939986
Pearson goodness of fit = 5.086063, df = 10, P = 0.8854
Deviance goodness of fit = 5.217124, df = 10, P = 0.8762
Over-dispersion scale parameter = 0.508606
Scaled G² = 4065.424363, df = 11, P < 0.0001
Scaled Pearson goodness of fit = 10, df = 10, P = 0.4405
Scaled Deviance goodness of fit = 10.257687, df = 10, P = 0.4182
Parameter | Coefficient | Standard Error |
Constant | -9.324832 | 0.204506 |
Veterans | -0.003528 | 0.055478 |
Age group (25-29) | 0.679314 | 0.232493 |
Age group (30-34) | 1.371085 | 0.217708 |
Age group (35-39) | 1.939619 | 0.212062 |
Age group (40-44) | 2.034323 | 0.216099 |
Age group (45-49) | 2.726551 | 0.222221 |
Age group (50-54) | 3.202873 | 0.220645 |
Age group (55-59) | 3.716187 | 0.217775 |
Age group (60-64) | 4.092676 | 0.217682 |
Age group (65-69) | 4.23621 | 0.224224 |
Age group (70+) | 4.363717 | 0.227374 |
Parameter | Scaled Standard Error | Scaled Wald z | |
Constant | 0.145847 | -63.935674 | P < 0.0001 |
Veterans | 0.039565 | -0.089162 | P = 0.929 |
Age group (25-29) | 0.165806 | 4.097037 | P < 0.0001 |
Age group (30-34) | 0.155262 | 8.830792 | P < 0.0001 |
Age group (35-39) | 0.151235 | 12.825169 | P < 0.0001 |
Age group (40-44) | 0.154115 | 13.200054 | P < 0.0001 |
Age group (45-49) | 0.158481 | 17.204308 | P < 0.0001 |
Age group (50-54) | 0.157357 | 20.354193 | P < 0.0001 |
Age group (55-59) | 0.15531 | 23.927605 | P < 0.0001 |
Age group (60-64) | 0.155243 | 26.362975 | P < 0.0001 |
Age group (65-69) | 0.159909 | 26.491421 | P < 0.0001 |
Age group (70+) | 0.162155 | 26.910733 | P < 0.0001 |
With 95% confidence you can infer that the risk of cancer in these veterans compared with non-veterans lies between 0.89 and 1.11, i.e. a statistically non-significant effect.