Poisson Regression

Menu location: Analysis_Regression and Correlation_Poisson

This function fits a Poisson regression model for multivariate analysis of numbers of uncommon events in cohort studies.

The multiplicative Poisson regression model is fitted as a log-linear regression (i.e. a log link and a Poisson error distribution), with an offset equal to the natural logarithm of person-time if person-time is specified (McCullagh and Nelder, 1989; Frome, 1983; Agresti, 2002). With the multiplicative Poisson model, the exponents of coefficients are equal to the incidence rate ratio (relative risk). These baseline relative risks give values relative to named covariates for the whole population. You can define relative risks for a sub-population by multiplying that sub-population's baseline relative risk with the relative risks due to other covariate groupings, for example the relative risk of dying from lung cancer if you are a smoker who has lived in a high radon area. StatsDirect offers sub-population relative risks for dichotomous covariates.

The outcome/response variable is assumed to come from a Poisson distribution. Note that a Poisson distribution is the distribution of the number of events in a fixed time interval, provided that the events occur at random, independently in time and at a constant rate. Poisson distributions are used for modelling events per unit space as well as time, for example number of particles per square centimetre.

Poisson regression can also be used for log-linear modelling of contingency table data, and for multinomial modelling. For contingency table counts you would create r + c indicator/dummy variables as the covariates, representing the r rows and c columns of the contingency table:

r1c1	r1c2	r1c3
r2c1	r2c2	r2c3
r3c1	r3c2	r3c3

Response	x_r1	x_r2	x_r3	x_c1	x_c2	x_c3
r1c1	1	0	0	1	0	0
r1c2	1	0	0	0	1	0
r1c3	1	0	0	0	0	1
r2c1	0	1	0	1	0	0
r2c2	0	1	0	0	1	0
r2c3	0	1	0	0	0	1
r3c1	0	0	1	1	0	0
r3c2	0	0	1	0	1	0
r3c3	0	0	1	0	0	1

Adequacy of the model

In order to assess the adequacy of the Poisson regression model you should first look at the basic descriptive statistics for the event count data. If the count mean and variance are very different (equivalent in a Poisson distribution) then the model is likely to be over-dispersed.

The model analysis option gives a scale parameter (sp) as a measure of over-dispersion; this is equal to the Pearson chi-square statistic divided by the number of observations minus the number of parameters (covariates and intercept). The variances of the coefficients can be adjusted by multiplying by sp. The goodness of fit test statistics and residuals can be adjusted by dividing by sp. Using a quasi-likelihood approach sp could be integrated with the regression, but this would assume a known fixed value for sp, which is seldom the case. A better approach to over-dispersed Poisson models is to use a parametric alternative model, the negative binomial.

The deviance (likelihood ratio) test statistic, G², is the most useful summary of the adequacy of the fitted model. It represents the change in deviance between the fitted model and the model with a constant term and no covariates; therefore G² is not calculated if no constant is specified. If this test is significant then the covariates contribute significantly to the model.

The deviance goodness of fit test reflects the fit of the data to a Poisson distribution in the regression. If this test is significant then a red asterisk is shown by the P value, and you should consider other covariates and/or other error distributions such as negative binomial.

StatsDirect does not exclude/drop covariates from its Poisson regression if they are highly correlated with one another. Models that are not of full (rank = number of parameters) rank are fully estimated in most circumstances, but you should usually consider combining or excluding variables, or possibly excluding the constant term. You should seek expert statistical if you find yourself in this situation.

Technical validation

The deviance function is:

- where y is the number of events, n is the number of observations and μ is the fitted Poisson mean.

The log-likelihood function is:

The maximum likelihood regression proceeds by iteratively re-weighted least squares, using singular value decomposition to solve the linear system at each iteration, until the change in deviance is within the specified accuracy.

The Pearson chi-square residual is:

The Pearson goodness of fit test statistic is:

The deviance residual is (Cook and Weisberg, 1982):

-where D(observation, fit) is the deviance and sgn(x) is the sign of x.

The Freeman-Tukey, variance stabilized, residual is (Freeman and Tukey, 1950):

The standardized residual is:

- where h is the leverage (diagonal of the Hat matrix).

Example

From Armitage et al. (2001):

Test workbook (Regression worksheet: Cancers, Subject-years, Veterans, Age group).

To analyse these data using StatsDirect you must first open the test workbook using the file open function of the file menu. Next generate a set of dummy variables to represent the levels of the "Age group" variable using the Dummy Variables function of the Data menu. Then select Poisson from the Regression and Correlation section of the Analysis menu. Click on the option "Counts of events and exposure (person-time), and select the response data type as "Individual". Select the column marked "Cancers" when asked for the response. Then select "Subject-years" when asked for person-time. Then select "Veterans", "Age group (25-29)" , "Age group (30-34)" etc. in one action when you are asked for predictors.

For this example:

Poisson regression

Deviance (likelihood ratio) chi-square = 2067.700372 df = 11 P < 0.0001

Intercept	b0 = -9.324832	z = -45.596773	P < 0.0001
Veterans	b1 = -0.003528	z = -0.063587	P = 0.9493
Age group (25-29)	b2 = 0.679314	z = 2.921869	P = 0.0035
Age group (30-34)	b3 = 1.371085	z = 6.297824	P < 0.0001
Age group (35-39)	b4 = 1.939619	z = 9.14648	P < 0.0001
Age group (40-44)	b5 = 2.034323	z = 9.413835	P < 0.0001
Age group (45-49)	b6 = 2.726551	z = 12.269534	P < 0.0001
Age group (50-54)	b7 = 3.202873	z = 14.515926	P < 0.0001
Age group (55-59)	b8 = 3.716187	z = 17.064363	P < 0.0001
Age group (60-64)	b9 = 4.092676	z = 18.801188	P < 0.0001
Age group (65-69)	b10 = 4.23621	z = 18.892791	P < 0.0001
Age group (70+)	b11 = 4.363717	z = 19.19183	P < 0.0001

log Cancers [offset log(Veterans)] = -9.324832 -0.003528 Veterans +0.679314 Age group (25-29) +1.371085 Age group (30-34) +1.939619 Age group (35-39) +2.034323 Age group (40-44) +2.726551 Age group (45-49) +3.202873 Age group (50-54) +3.716187 Age group (55-59) +4.092676 Age group (60-64) +4.23621 Age group (65-69) +4.363717 Age group (70+)

Poisson regression - incidence rate ratios

Inference population: whole study (baseline risk)

Parameter	Estimate	IRR	95% CI
Veterans	-0.003528	0.996479	0.89381 to 1.11094
Age group (25-29)	0.679314	1.972524	1.250616 to 3.111147
Age group (30-34)	1.371085	3.939622	2.571233 to 6.036256
Age group (35-39)	1.939619	6.956098	4.590483 to 10.540786
Age group (40-44)	2.034323	7.647073	5.006696 to 11.679905
Age group (45-49)	2.726551	15.280093	9.884869 to 23.620062
Age group (50-54)	3.202873	24.60311	15.96527 to 37.914362
Age group (55-59)	3.716187	41.107367	26.825601 to 62.992647
Age group (60-64)	4.092676	59.899957	39.096281 to 91.773558
Age group (65-69)	4.23621	69.145275	44.555675 to 107.305502
Age group (70+)	4.363717	78.54856	50.303407 to 122.653248

Poisson regression - model analysis

Accuracy = 1.00E-07

Log likelihood with all covariates = -66.006668

Deviance with all covariates = 5.217124, df = 10, rank = 12

Akaike information criterion = 29.217124

Schwartz information criterion = 45.400676

Deviance with no covariates = 2072.917496

Deviance (likelihood ratio, G²) = 2067.700372, df = 11, P < 0.0001

Pseudo (McFadden) R-square = 0.997483

Pseudo (likelihood ratio index) R-square = 0.939986

Pearson goodness of fit = 5.086063, df = 10, P = 0.8854

Deviance goodness of fit = 5.217124, df = 10, P = 0.8762

Over-dispersion scale parameter = 0.508606

Scaled G² = 4065.424363, df = 11, P < 0.0001

Scaled Pearson goodness of fit = 10, df = 10, P = 0.4405

Scaled Deviance goodness of fit = 10.257687, df = 10, P = 0.4182

Parameter	Coefficient	Standard Error
Constant	-9.324832	0.204506
Veterans	-0.003528	0.055478
Age group (25-29)	0.679314	0.232493
Age group (30-34)	1.371085	0.217708
Age group (35-39)	1.939619	0.212062
Age group (40-44)	2.034323	0.216099
Age group (45-49)	2.726551	0.222221
Age group (50-54)	3.202873	0.220645
Age group (55-59)	3.716187	0.217775
Age group (60-64)	4.092676	0.217682
Age group (65-69)	4.23621	0.224224
Age group (70+)	4.363717	0.227374

Parameter	Scaled Standard Error	Scaled Wald z
Constant	0.145847	-63.935674	P < 0.0001
Veterans	0.039565	-0.089162	P = 0.929
Age group (25-29)	0.165806	4.097037	P < 0.0001
Age group (30-34)	0.155262	8.830792	P < 0.0001
Age group (35-39)	0.151235	12.825169	P < 0.0001
Age group (40-44)	0.154115	13.200054	P < 0.0001
Age group (45-49)	0.158481	17.204308	P < 0.0001
Age group (50-54)	0.157357	20.354193	P < 0.0001
Age group (55-59)	0.15531	23.927605	P < 0.0001
Age group (60-64)	0.155243	26.362975	P < 0.0001
Age group (65-69)	0.159909	26.491421	P < 0.0001
Age group (70+)	0.162155	26.910733	P < 0.0001

With 95% confidence you can infer that the risk of cancer in these veterans compared with non-veterans lies between 0.89 and 1.11, i.e. a statistically non-significant effect.

P values

confidence intervals