Example Code for Section Chapter 8.5

Note: In the following examples we assume that you have some experience using R. The program is cross-platform, open-source, and free. You can download the program and also find a wealth of introductor and more advanced guides at the Comprehensive R Archive Network (http://cran.us.r-project.org/). We also use three extra packages, which you can install using the menus in R or using this command:

install.packages(c("ggplot2", "binom", "Barnard"))

The data set

Let us assume that we we sampled and tested for infection 20 individuals from each of eight ponds, labeled A–H. The ponds vary in distance (in meters) from roads, which may relate to their quality as habitat, and thus potentially the susceptibility of the animals within them, or perhaps ponds near roads are more likely to have ranavirus introduced.

We have the data in a comma-delimited spreadsheet, so we can simply read the file into R, assign it to the dataframe, “mydata”, and then display it.

mydata <- read.csv("mydata.csv")
mydata

##   Pond Distance Infected Tested
## 1    A       16       16     20
## 2    B      152       12     20
## 3    C      509        1     20
## 4    D        3       15     20
## 5    E      190       13     20
## 6    F      214        5     20
## 7    G       42       12     20
## 8    H       44       10     20

To get a better sense of our data, we should first graph it. Base graphics are fine, but for reasons that will become apparent, we will use the package, ggplot2 here.

library(ggplot2) # load the package
qplot(x=Distance, y=Infected/Tested, data=mydata, ylim=c(0,1)) # plot the prevalence against distance

plot of chunk plotdata

Confidence intervals on proportions

It does look like there is a trend towards decreasing prevalence with increasing distance from the road. But we have not accounted for sampling error. That is, we should look at the confidence intervals on these estimates of prevalence before drawing any conclusions. We can use the binom.confint function in the binom package to calculate the confidence interval on these proportions with the Wilson method. Note that many other methods are available in this package as well.

library(binom)
CIs <- binom.confint(x=mydata$Infected, n=mydata$Tested, methods="wilson")
CIs

##   method  x  n mean    lower  upper
## 1 wilson 16 20 0.80 0.583983 0.9193
## 2 wilson 12 20 0.60 0.386582 0.7812
## 3 wilson  1 20 0.05 0.008881 0.2361
## 4 wilson 15 20 0.75 0.531299 0.8881
## 5 wilson 13 20 0.65 0.432854 0.8188
## 6 wilson  5 20 0.25 0.111862 0.4687
## 7 wilson 12 20 0.60 0.386582 0.7812
## 8 wilson 10 20 0.50 0.299298 0.7007

# Now add the CIs to the data.frame
mydata <- cbind(mydata, CIs[,5:6])
mydata

##   Pond Distance Infected Tested    lower  upper
## 1    A       16       16     20 0.583983 0.9193
## 2    B      152       12     20 0.386582 0.7812
## 3    C      509        1     20 0.008881 0.2361
## 4    D        3       15     20 0.531299 0.8881
## 5    E      190       13     20 0.432854 0.8188
## 6    F      214        5     20 0.111862 0.4687
## 7    G       42       12     20 0.386582 0.7812
## 8    H       44       10     20 0.299298 0.7007

We can then add the confidence intervals to the plot, which is helpful. (Note: this is simple in ggplot2, but rather involved with base graphics in R).

qplot(x=Distance, y=Infected/Tested, ymin=lower, ymax=upper, data=mydata, ylim=c(0,1), geom = "pointrange") # the geom = "pointrange" adds a point at y and a linerange from ymin to ymax. Many other geom's are available

plot of chunk plotdataCIs

Accounting for the relatively small samples sizes we see confidence intervals that range by 20–30% or more. (Remember, these are not standard errors, but confidence intervals.) When we add them to the graph, the clear pattern we first observed now seems a little less certain. We will statistically test for a relationship between distance from roads and the probability or prevalence of infection shortly, but first let us see how to compare prevalence in particular populations.

Chi-square and Barnard’s exact test

Imagine that we are interested in ponds B, E, and F in particular. They are very similar, but pond F has an invasive species in it.

BEF <- subset(mydata, Pond %in% c("B","E","F")) # selecting ponds in the group, B, E, & F
qplot(x=Pond, y=Infected/Tested, ymin=lower, ymax=upper, data=BEF, # same as before
            ylim=c(0,1), geom = c("bar", "linerange"))  # but now using two geometries

plot of chunk compareBEF

If we wanted to compare the prevalence in these ponds statistically, we might use a Chi-square test.

BEF$Not <- BEF$Tested - BEF$Infected # create a column with the count of uninfected animals in each pond
test <- chisq.test(BEF[ , c("Infected", "Not")]) # read as: all rows of the matrix BEF, just columns labeled "Infected" and "Not"
# Note: chisq.test(BEF$Infected, BEF$Tested-BEF$Infected) will not work. 
# The chisq.test() function assumes that rows in a data.frame are individual observations, not counts of infected and uninfected animals in each pond. Instead, by treating the BEF data.frame as if it were a matrix, the function works just fine. 
test

## 
##  Pearson's Chi-squared test
## 
## data:  BEF[, c("Infected", "Not")]
## X-squared = 7.6, df = 2, p-value = 0.02237

Based on this analysis, the prevalence of infection is statistically different in these three ponds (\(\chi^2 =\) 7.6, \(P =\) 0.022). This does not tell you which ponds are different, just that there is significant deviation across ponds from what one would expect if they all had the same prevalence. It can be helpful to determine which ponds deviate from this expectation most. You can see the observed and expected values, along with the Pearson residuals (\((Obs - Exp)/\sqrt{Exp}\)) in the Chi square table by selecting the appropriate slots. (See help(chisq.test) for more information.)

test$observed

##   Infected Not
## 2       12   8
## 5       13   7
## 6        5  15

test$expected

##   Infected Not
## 2       10  10
## 5       10  10
## 6       10  10

test$residuals

##   Infected     Not
## 2   0.6325 -0.6325
## 5   0.9487 -0.9487
## 6  -1.5811  1.5811

The third row (labeled 6 for the 6th row of the original table) corresponding to pond F clearly has the greatest deviation from the expected prevalence of 0.5 (=10/20). But is this one pond responsible for the statistical significance? A useful feature of Chi-square tests is that you can (re)run the test on subsets of the the contingency table. But of course with multiple comparisons we need to adjust our P-values (or alpha levels) with, for instance, Bonferroni corrections, although other methods are usually more powerful.

(a <- chisq.test(BEF[c(1,2), c("Infected", "Not")]) ) # read as: just rows 1 & 2 of BEF, just columns labeled "Infected" and "Not"

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  BEF[c(1, 2), c("Infected", "Not")]
## X-squared = 0, df = 1, p-value = 1

(b <- chisq.test(BEF[c(1,3), c("Infected", "Not")]) )# read as: just rows 1 & 3 of BEF, just columns labeled "Infected" and "Not"

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  BEF[c(1, 3), c("Infected", "Not")]
## X-squared = 3.683, df = 1, p-value = 0.05497

(c <- chisq.test(BEF[c(2,3), c("Infected", "Not")]) )# read as: just rows 2 & 3 of BEF, just columns labeled "Infected" and "Not"

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  BEF[c(2, 3), c("Infected", "Not")]
## X-squared = 4.949, df = 1, p-value = 0.0261

p.adjust(p=c(a$p.value, b$p.value, c$p.value), method = "bonferroni")

## [1] 1.00000 0.16492 0.07829

So without the third pond (F), there is no evidence that the prevalence varies between ponds (and, after accounting for multiple comparisons, the difference between pond F and the others is no longer significant, either). As we can see in the bar graph, above, the difference we observed in the original Chi-square test was all due to pond F.

Before moving on to logistic regression, it is worth introducing Barnard’s unconditional test. It is very appropriate for prevalence data because it assumes that one of the margins is fixed (i.e., the sample size in each pond is fixed and animals then fall into the infected or uninfected column). It side-steps the issues with Chi-square tests that counts in cell of the contingency table are ≥ 5. It is also more powerful (and flexible) than Fisher’s exact test. Note that Barnard’s test only accomodates \(2 \times 2\) matrices and the way in which data are entered in the R-version (found in the Barnard package) is different.

library(Barnard)
b2 <- barnardw.test(12,8,5,15) # comparing the number infected/not in pond B vs pond F
b2

## $dp
## [1] 0.001
## 
## $contingency.matrix
##      [,1] [,2]
## [1,]   12    8
## [2,]    5   15
## 
## $alternative
## [1] "One Sided" "Two Sided"
## 
## $wald.statistic
## [1] -2.239
## 
## $nuisance.parameter
## [1] 0.477 0.500
## 
## $p.value
## [1] 0.01399 0.02508

p.adjust(p=b2$p.value, method = "bonferroni")

## [1] 0.02798 0.05016

Note that the Wald statistic is \(\chi^2\)-distributed and the second P-value listed corresponds to the two-sided test. So it appears that while a Chi-square test did not find a significant difference, the more powerful Barnard’s test (almost) did.

Logistic regression

We began our example by discussing the influence of distance from ponds. We now return to that question using logistic regression. As in normal linear regression, we are interested in whether the parameter estimate for the effect of distance is significanlty different from zero, only now our parameters relate to the log-odds of infection (\(=logit[P_{infection}] = ln[P_{infection}/(1-P_{infection})]\)). In R we can use the glm function for logistic models and other general linear models.

logist <- glm(cbind(Infected, Tested-Infected) ~ Distance, family=binomial, data = mydata)
summary(logist)

## 
## Call:
## glm(formula = cbind(Infected, Tested - Infected) ~ Distance, 
##     family = binomial, data = mydata)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.593  -0.874  -0.179   0.900   1.941  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.01676    0.24887    4.09  4.4e-05 ***
## Distance    -0.00675    0.00149   -4.52  6.1e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 40.999  on 7  degrees of freedom
## Residual deviance: 10.666  on 6  degrees of freedom
## AIC: 39.77
## 
## Number of Fisher Scoring iterations: 4

Note that for binary responses, the glm function either takes a single vector of zero and ones (one for each individual) or a matrix with two columns, the first representing the number that responded (here, individuals that were infected) in each unit (here, pond) and the second column representing the number that did not respond (here, those not infected). Also note that in the glm function we need to specify the family of data, which in our case is binomial. The family=binomial defaults to a logit-link (i.e., logistic regression), but you could specify family=binomial(link="probit") if you wanted use a probit regression.

Both the intercept and the coefficient for Distance (i.e., slope) are highly significant. The fact that the the slope is negative indicates that the log-odds of infection decrease with distance from the road. In fact, with every 100m distance, the log-odds decrease by 0.675 (=-0.00675 \(\times\) 100). It is not intuitive to think on a log-odds scale, however. We can exponentiate the coefficients to get a better sense of how the odds change with Distance.

exp(coef(logist))

## (Intercept)    Distance 
##      2.7642      0.9933

We can read this as the odds of being infected at 0m from the road are 2.76 to 1, and with with every meter further from the road, the odds decrease by ~1% (=1-0.9933). It is not true, however, that 100m from road the odds will have decreased by 100% (=1% \(\times\) 100). This is because the relationship with Distance (or any other parameters) is linear only on the log-odds scale. To back transform to the odds scale, we need to multiply the coefficient by 100m and then exponentiate the product:

exp(coef(logist)["Distance"]*100)

## Distance 
##   0.5094

Thus, by 100m the odds of being infected have decreased by ~50%. But decreased from what, exactly? From the odds of infection in a pond right at the edge of a road, which is given by the exponentiated intercept. We can caculate the odds of being infected at 100m by using the full regression equation.

exp(coef(logist)["(Intercept)"] + coef(logist)["Distance"]*100)

## (Intercept) 
##       1.408

So at 100m from the road, the odds are predicted to be 1.41 to 1 that any given animal is infected. (Notice that this is 2.7642 \(\times\) 0.5094.) We can convert odds into probabilities by remembering that \(odds = P_{infection}/(1-P_{infection})\), so \(P_{infection} = 1/(1+odds)\).

odds_100m <- exp(coef(logist)["(Intercept)"] + coef(logist)["Distance"]*100)
odds_100m/(1+odds_100m)

## (Intercept) 
##      0.5847

# Note that we could get the same result using the plogis() function
plogis(coef(logist)["(Intercept)"] + coef(logist)["Distance"]*100)

## (Intercept) 
##      0.5847

Thus, at 100m from the road, we would expect ~58% of animals to be infected, all else equal.

This process is rather laborious, but there are many useful helper function in R. The predict function will predict values on the scale of the linear predictors (i.e., log-odds) or the response (i.e., probability) for user-supplied values of the predictor variable(s).

new_dist <- data.frame(Distance = 0:max(mydata$Distance)) # create a dataframe with distances from zero to the furthest pond 
new_dist$probs <- predict(logist, newdata=new_dist, type="response") # Get the predictions on the response scale

# What is the probability predicted at 100m?
new_dist[new_dist$Distance == 100, ]

##     Distance  probs
## 101      100 0.5847

Note that if the model had multiple predictors in it, the new_dist dataframe would have to have columns for each of those preditors.

Lastly, it would be nice to plot the predicted relationship we found between the probability of infection and distance from roads, ideally in relation to the data. We first plot the predicted line in the new_dist dataframe and then add in the points and confidence intervals in the mydata dataframe.

qplot(x=Distance, y=probs, data=new_dist, ylim=c(0,1), geom = "line") + 
    geom_pointrange(data = mydata, aes(y=Infected/Tested, ymin=lower, ymax=upper))

plot of chunk plotdata_line

We can also go one step further and plot the confidence envelope of the logistic regression line, again using the predict function, but this time asking for the predictions (fit) and their standard error (se.fit) on the link (i.e., log-odds) scale.

preds_logit <- predict(logist, newdata=new_dist, type="link", se.fit=TRUE) 
preds_logit <- as.data.frame(preds_logit) # predict gives a list, but we want a dataframe

# Next we create standard confidence intervals around these predictions on the logit scale
preds_logit$lower <- preds_logit$fit - 1.96*preds_logit$se.fit
preds_logit$upper <- preds_logit$fit + 1.96*preds_logit$se.fit

# Then we back-transform these confidence intervals to the response (i.e., probability) scale. 
# We could do this by hand, as above, but for simplicity we simply use the plogis() function
new_dist$probs <- plogis(preds_logit$fit) # should be the same predictions we got before
new_dist$lower <- plogis(preds_logit$lower)
new_dist$upper <- plogis(preds_logit$upper)

# Lastly, we can plot the predictions and confidence envelope with a line and ribbon, respectively
# and then add the data points as before
qplot(x=Distance, y=probs, ymin=lower, ymax=upper, data=new_dist, ylim=c(0,1), geom=c("ribbon", "line"), alpha = I(0.5)) +  # the alpha setting makes the ribbon semitransparent
    geom_pointrange(data = mydata, aes(y=Infected/Tested)) + # adding in the actual data
    labs(x="Distance from road in meters", y="Probability of infection") # and then add some nice axis labels

plot of chunk plotdataCIs_line

Overall, a pretty nice fit. It’s too bad these data are made up!