# Confounding & collinearity

**MATHEMATICS IN MEDICINE**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Confounding and collinearity

## Introduction

In this blog article, I will discuss about the bias introduced in estimation of coefficient of a given explanatory variable due to the presence of confounding factors. After that, I will try to demonstrate about the effect of variable collinearity on estimation of coefficient.

## Underlying structure

Let us assume that we know about the underlying relationship between the explanatory variable (E), outcome variable (O), confounder ( C ), variable correlated with E (VE) and variable correlated with O (VO).

Let us assume that the real underlying relation between various variables are as given below:

O = 2.E + 2.C + 2.VO + 0.VE + error E, C and VE are correlated with each other with correlation coef of 0.7 each. All the random variables are normally distributed, with mean 0 and sd 5. error is distributed as standard normal distribution.

The directed acyclic graph depicts the relationship between the variables.

## Samples for simulation

library(MASS) library(dplyr) ## ## Attaching package: 'dplyr' ## ## The following objects are masked from 'package:Hmisc': ## ## src, summarize ## ## The following objects are masked from 'package:lubridate': ## ## intersect, setdiff, union ## ## The following object is masked from 'package:MASS': ## ## select ## ## The following objects are masked from 'package:stats': ## ## filter, lag ## ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union set.seed(100) sam <- mvrnorm(n = 10000, mu = c(0, 0, 0), Sigma = matrix(c(5, 4, 4, 4, 5, 4, 4, 4, 5), 3, 3, byrow = T)) df <- data.frame(E = sam[, 1], C = sam[, 2], VE = sam[, 3]) df <- mutate(df, VO = rnorm(10000, 0, 5), O = 2 * E + 2 * C + 2 * VO + rnorm(10000))

## Scenarios

### Scenario 1 (Only E as covariate, typical univariate analysis)

mod.E <- lm(O ~ E, data = df) summary(mod.E) ## ## Call: ## lm(formula = O ~ E, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -40.47 -7.05 -0.05 7.01 45.04 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.0069 0.1034 0.07 0.95 ## E 3.6420 0.0465 78.40 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 10.3 on 9998 degrees of freedom ## Multiple R-squared: 0.381, Adjusted R-squared: 0.381 ## F-statistic: 6.15e+03 on 1 and 9998 DF, p-value: <2e-16 confint(mod.E) ## 2.5 % 97.5 % ## (Intercept) -0.1959 0.2097 ## E 3.5509 3.7330

The model overestimates coefficient of E, its value is the sum of coef of E and C. This type of bias is known as **bias due to confounding** and occurs when confounders are not included as co-variates.

### Scenario 2 (E and C as covariates)

This model is incorrect as it does not include the other covariate, VO.

mod.EC <- lm(O ~ E + C, data = df) summary(mod.EC) ## ## Call: ## lm(formula = O ~ E + C, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -35.99 -6.75 -0.11 6.71 43.32 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.0194 0.0998 -0.19 0.85 ## E 2.0293 0.0741 27.40 <2e-16 *** ## C 2.0231 0.0740 27.35 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.98 on 9997 degrees of freedom ## Multiple R-squared: 0.424, Adjusted R-squared: 0.424 ## F-statistic: 3.68e+03 on 2 and 9997 DF, p-value: <2e-16 confint(mod.EC) ## 2.5 % 97.5 % ## (Intercept) -0.215 0.1762 ## E 1.884 2.1745 ## C 1.878 2.1681

Due to addition of C, the estimate of E is unbiased. Exclusion of VO has not influenced the coefficient of E.

### Scenario 3 (E and VO as covariates)

E and VO are not correlated. C has been excluded from the model.

mod.EVO <- lm(O ~ E + VO, data = df) summary(mod.EVO) ## ## Call: ## lm(formula = O ~ E + VO, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.638 -1.938 0.005 1.954 12.780 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.03893 0.02884 1.35 0.18 ## E 3.59175 0.01295 277.27 <2e-16 *** ## VO 2.00167 0.00581 344.39 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.88 on 9997 degrees of freedom ## Multiple R-squared: 0.952, Adjusted R-squared: 0.952 ## F-statistic: 9.88e+04 on 2 and 9997 DF, p-value: <2e-16 confint(mod.EVO) ## 2.5 % 97.5 % ## (Intercept) -0.01761 0.09546 ## E 3.56636 3.61714 ## VO 1.99028 2.01307

There is again overestimation of coefficient of E (as in scenario 1), due to exclusion of C. Estimate of VO is unbiased as it has no correlation with either E, VO or VE.

### Scenario 4 (E and VE as covariates)

E and VE are correlated, but VE is not associated with O.

mod.EVE <- lm(O ~ E + VE, data = df) summary(mod.EVE) ## ## Call: ## lm(formula = O ~ E + VE, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -41.55 -7.00 -0.02 6.93 43.46 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.0106 0.1026 0.1 0.92 ## E 2.8584 0.0761 37.6 <2e-16 *** ## VE 0.9857 0.0761 12.9 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 10.3 on 9997 degrees of freedom ## Multiple R-squared: 0.391, Adjusted R-squared: 0.391 ## F-statistic: 3.21e+03 on 2 and 9997 DF, p-value: <2e-16 confint(mod.EVE) ## 2.5 % 97.5 % ## (Intercept) -0.1905 0.2117 ## E 2.7093 3.0075 ## VE 0.8364 1.1350

Estimates of both E and VE are biased. But coefficient of E is less biased than scenario 1. Reason for biasness of coefficient of VE is unknown to me.

### Scenario 5 (E, VE, VO, C as covariates) (Complete model)

mod.comp <- lm(O ~ E + C + VO + VE, data = df) summary(mod.comp) ## ## Call: ## lm(formula = O ~ E + C + VO + VE, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.434 -0.677 -0.014 0.678 3.616 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.01297 0.01000 1.30 0.19 ## E 1.98794 0.00826 240.62 <2e-16 *** ## C 1.99993 0.00830 240.81 <2e-16 *** ## VO 2.00033 0.00202 992.51 <2e-16 *** ## VE 0.01214 0.00831 1.46 0.14 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1 on 9995 degrees of freedom ## Multiple R-squared: 0.994, Adjusted R-squared: 0.994 ## F-statistic: 4.29e+05 on 4 and 9995 DF, p-value: <2e-16 confint(mod.comp) ## 2.5 % 97.5 % ## (Intercept) -0.006635 0.03258 ## E 1.971745 2.00413 ## C 1.983651 2.01621 ## VO 1.996381 2.00428 ## VE -0.004159 0.02844

Unbiased coefficients of all the covariates.

### Scenario 6 (E, VE, C as covariates)

mod.EVEC <- lm(O ~ E + VE + C, data = df) summary(mod.EVEC) ## ## Call: ## lm(formula = O ~ E + VE + C, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -36.18 -6.74 -0.08 6.71 43.21 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.0185 0.0998 -0.19 0.85 ## E 1.9892 0.0824 24.13 <2e-16 *** ## VE 0.0920 0.0830 1.11 0.27 ## C 1.9817 0.0829 23.92 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.98 on 9996 degrees of freedom ## Multiple R-squared: 0.424, Adjusted R-squared: 0.424 ## F-statistic: 2.45e+03 on 3 and 9996 DF, p-value: <2e-16 confint(mod.EVEC) ## 2.5 % 97.5 % ## (Intercept) -0.21411 0.1771 ## E 1.82758 2.1507 ## VE -0.07061 0.2546 ## C 1.81930 2.1441

Addition of C has removed the biasedness from the coefficients of E and VE (see Scenario 4).

### Scenario 7 (E, VO and C as covariates)

mod.EVOC <- lm(O ~ E + VO + C, data = df) summary(mod.EVOC) ## ## Call: ## lm(formula = O ~ E + VO + C, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.427 -0.676 -0.014 0.680 3.641 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.01285 0.01000 1.29 0.2 ## E 1.99323 0.00742 268.50 <2e-16 *** ## VO 2.00036 0.00202 992.52 <2e-16 *** ## C 2.00539 0.00741 270.45 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1 on 9996 degrees of freedom ## Multiple R-squared: 0.994, Adjusted R-squared: 0.994 ## F-statistic: 5.72e+05 on 3 and 9996 DF, p-value: <2e-16 confint(mod.EVOC) ## 2.5 % 97.5 % ## (Intercept) -0.006751 0.03246 ## E 1.978682 2.00779 ## VO 1.996410 2.00431 ## C 1.990857 2.01993

Performance of this model is almost the same as Scenario 5

## Performance of different scenarios

Models | Complete (E, C, VO, VE) | E, C, VE | E, C, VO | E, C | E, VE | E, VO | E only |
---|---|---|---|---|---|---|---|

E (2) | 1.9879, 0.0083 | 1.9892, 0.0824 | 1.9932, 0.0074 | 2.0293, 0.0741 | 2.8584, 0.0761 | 3.5918, 0.013 | 3.642, 0.0465 |

C (2) | 1.9999, 0.0083 | 1.9817, 0.0829 | 2.0054, 0.0074 | 2.0231, 0.074 | - | - | - |

VE (0) | 0.0121, 0.0083 | 0.092, 0.083 | - | - | 0.9857, 0.0761 | - | - |

VO (2) | 2.0003, 0.002 | - | 2.0004, 0.002 | - | - | 2.0017, 0.0058 | - |

E (2): means covariate E and its actual value.

–,–: coefficient, standard error.

## Conclusions

Models with C, will have unbiased estimate of E. Models with correlated variables (E, C, VE) has larger SE due to near collinearity.

**leave a comment**for the author, please follow the link and comment on their blog:

**MATHEMATICS IN MEDICINE**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.