Poisson regression is fundamental to the modeling of count data. It is the simplest count regression model; and similar to regular multiple linear regression except that the dependent variable is an observed count that follows the Poisson distribution. The Poisson distribution is specified with a single parameter μ – the mean incidence rate. The Poisson distribution has also the property that its mean and variance are equal.
Example
We will use the medpar data (The US national Medicare inpatient hospital database data for Arizona in 1991) as an example to run the Poisson regression using BESH Stat. You can download the source data in csv format here.
We will use following variables from the dataset:
- los – length of hospital stay (how many days was patient registered in a hospital for a specific disease)
- hmo – Patients belongs to a Health Maintenance Organization
- white – Patient identifies themselves as Caucasian
- type2 – Urgent admission
- type3 – Elective admission
and fit the model los = hmo + white + type2 + type3.
Note that that data do not have 0 counts (check the minimum value for the los variable in the descriptive statistics table), and also check the histogram indicating that data are skewed.
los | |
Valid data | 1495 |
Mean | 9.854181 |
Median | 8 |
SD | 8.832906 |
SEM | 0.228446 |
Variance | 78.02022 |
Coefficient of Variation | 0.896361 |
Skewness | 3.651526 |
Kurtosis | 29.20186 |
Q1 | 4 |
Q3 | 13 |
IQR | 9 |
Minimum | 1 |
Maximum | 116 |
Range | 115 |
To fit the Poisson model open the medpar.csv and select the BESH Stat main menu on the Excel Add-ins tab > Regression models > Poisson regression. On the Select Variables tab put los as a Response (Outcome) variable and hmo, white, type2, and type3 as a predictor variable(s). Then on the Model specification tab select all 4 predictor variables as an effects. You don’t need to change anything on the Settings tab (this procedure uses log link function by default and it cannot be changed; use the Generalized linear model userform (coming soon in BESH stat) if you need more settings.); and hit OK button to run the analysis.
A new excel workbook will be created with two sheets named – Poisson Regression containing the parameter estimates, standard errors, z statistic, p-value and 95% confidence intervals; and Model analysis table. And the Predictions and Residuals sheet.
Model Analysis | Poisson |
Link Function | Log |
Null deviance | 8901.134077 |
Residual deviance | 8142.666001 |
Full Log Likelihood | -6928.907786 |
Deviance G² (likelihood ratio) chisq | 758.4680756 |
df | 4 |
p-value | 7.5998E-163 |
Deviance goodness of fit chisq | 8142.666001 |
df | 1490 |
p-value | 0 |
Pseudo (McFadden) R-square | 0.085210274 |
Pearson goodness of fit chisq | 9327.983216 |
df | 1490 |
p-value | 0 |
AIC | 13867.81557 |
AICc | 13867.85587 |
BIC | 13894.36498 |
Over-dispersion scale parameter | 6.26039142 |
Number of Iterations | 4 |
Relative Log-Likelihood Change | 2.91038E-11 |
Converged? | TRUE |
Parameter estimates:
Variable | Coefficient | Std. Error | Z | P-value | 95% CI Lower | 95% CI Upper |
Intercept | 2.3329 | 0.0272 | 85.743 | 0 | 2.2796 | 2.3863 |
hmo | -0.0715 | 0.0239 | -2.988 | 0.0028 | -0.1185 | -0.0246 |
white | -0.1539 | 0.0274 | -5.613 | 1.98717E-08 | -0.2076 | -0.1001 |
type2 | 0.2217 | 0.0211 | 10.529 | 0 | 0.1804 | 0.2629 |
type3 | 0.7095 | 0.0261 | 27.146 | 0 | 0.6583 | 0.7607 |
Interpretation of parameter estimates (Coefficient column in the BESH Stat output) in Poisson regression is in general the change in the log-count (note the log is a link function in the Poisson regression) of the response for a one-unit change in the predictor. For a binary predictor (all predictors in our example are binary), it is the change in the log-count value of the response when the value of the predictor changes from 0 to 1. For example in case of type2 variable in our model, the Urgent admission increase the log-number of length of hospital stay by 0.2217 compared with a non-Urgent admission at its mean (or e0.2217 = 1.248 when expressed the length of hospital stay in number of days instead of log-days).
The model without the adjusted SEs had all p-values as statistically significant. BESH Stat results match the R glm results (see the R code and output at the bottom of this post). Recall that the data do not have 0 counts. The assumption of the Poisson model is that response variable is a count and zero counts must be allowed as a possibility even though there may not be any in a given modeling situation. However it is not the case in our example because the length of the hospital stay is always at least 1 day. This situation often leads to overdispersion. Note the high value of dispersion statistic (Over-dispersion scale parameter) at 6.26 (value 1 indicates dispersion as assumed by the model variance). (The Pearson dispersion parameter is calculated as the Pearson goodness of fit Chi2 statistic divided by degrees of freedom.)
Dispersion corrected estimates are automatically computed by BESH stat. It uses a scaled standard error calculated as square root of dispersion parameter multiplied by the original standard errors (The R glm function with family=quasipoisson use the scaling method described here. See the R output at the bottom of the post.). When scaling standard errors, the Pearson dispersion statistic better captures the excess variability in the data, adjusting model standard errors in such a manner as to reflect what the standard errors would be if the excess variability were not present in the data1. Scaled standard errors calculated for this model:
Variable | Coefficient | Std. Error | Z | P-value | 95% CI Lower | 95% CI Upper |
Intercept | 2.3329 | 0.0681 | 34.269 | 2.8105E-190 | 2.1994 | 2.4665 |
hmo | -0.0715 | 0.0599 | -1.194 | 0.2326 | -0.1891 | 0.0460 |
white | -0.1539 | 0.0686 | -2.243 | 0.0250 | -0.2884 | -0.0193 |
type2 | 0.2217 | 0.0527 | 4.208 | 2.72982E-05 | 0.1183 | 0.3250 |
type3 | 0.7095 | 0.0654 | 10.849 | 1.9021E-26 | 0.5812 | 0.8378 |
Note that the parameter estimates remain unaffected when standard errors are scaled. Scaling is an easy, quick-and-dirty method of adjusting standard errors for overdispersion1. It would be more appropriate to fit the negative binomial model, or zero-truncated model instead. The NB2 negative binomial model can be understand as an extension of Poisson model (i.e. Poisson-gamma mixture) with additional parameter (dispersion parameter) that can better handle over-dispersed data. Negative binomial NB2 regression is provided by BESH Stat starting version 0.21. See how to fit Negative Binomial model using BESH Stat in this post.
R code to fit the same model on medpar data:
library(msme)
data(medpar)
attach(medpar)
m1<- glm(los ~ hmo + white + factor(type), family=poisson, data=medpar)
summary(m1)
Call: glm(formula = los ~ hmo + white + factor(type), family = poisson, data = medpar) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.33293 0.02721 85.744 < 2e-16 *** hmo -0.07155 0.02394 -2.988 0.00281 ** white -0.15387 0.02741 -5.613 1.99e-08 *** factor(type)2 0.22165 0.02105 10.529 < 2e-16 *** factor(type)3 0.70948 0.02614 27.146 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 8901.1 on 1494 degrees of freedom Residual deviance: 8142.7 on 1490 degrees of freedom AIC: 13868 Number of Fisher Scoring iterations: 5m2<-glm(los ~ hmo + white + factor(type), family=quasipoisson)
summary(m2)
Call: glm(formula = los ~ hmo + white + factor(type), family = quasipoisson) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.33293 0.06808 34.269 < 2e-16 *** hmo -0.07155 0.05991 -1.194 0.233 white -0.15387 0.06859 -2.243 0.025 * factor(type)2 0.22165 0.05267 4.208 2.73e-05 *** factor(type)3 0.70948 0.06539 10.849 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for quasipoisson family taken to be 6.260405) Null deviance: 8901.1 on 1494 degrees of freedom Residual deviance: 8142.7 on 1490 degrees of freedom AIC: NA Number of Fisher Scoring iterations: 5
References
- Joseph M. Hilbe. Modeling Count Data. Cambridge University Press, 2014