Several instructors have come before us and have made this work possible, those instructors include: Nat MacNell, Anne, among others.
All thanks to those above, but all errors are our own.
In this series of homeworks, we will analyze the effect of prenatal care (exposure) on preterm birth (outcome). To estimate this effect we will need to do the basics of any epidemiology project:
As an added bonus, these homeworks will also cover some basic spatial analyses and other “advanced techniques”.
The purpose of this homework is to introduce you to the under-the-hood machinery of R and provide some context/practice with data wrangling.
setwd()
.# Please note this is an example of code chunks you may have at the top of a .R but would differ for a Rmarkdown file (which wraps the markdown language)
#............................
# Births 2012 Analysis for EPID 799C: PNC decrease PT birth? ####
# Mike Dolan Fliss, August 2017
# Aim: Does early prenatal care (<month 5) decrease the risk of preterm (<week 37) birth?
# (This Descriptive header is HW1.Q1a)
#............................ (<- example of HW1.Q1b, comment bar)
# Project Notes:
# Birth Certificate Revisions: https://www.cdc.gov/nchs/nvss/vital_certificate_revisions.htm (HW1.Q1d, clickable link)
#............................
#............................
# Libraries, directories, load data (HW1.Q1f) ####
#............................
# NOTE: May need to install these packages in advance on your local machine.
# install.packages("tableone")
library(tidyverse) # for ggplot, dplyr in HW1-4 (<- example of a post-line code comment, HW1.Q1c)
library(lubridate) # for dates in HW2
devtools::install_github("kaz-yos/tableone") # used in HW2, need for github for kableone functipn from PR #32 (https://github.com/kaz-yos/tableone/pull/32)
library(tableone)
# Set directories.
data_dir = paste0(getwd(), "/data")
output_dir = paste0(getwd(), "/data")
# map_dir = paste0(data_dir, "/GIS") # used later
# Note you'll need to set different ones with setwd()! Since we're building this as a github repo the wd defaults to script dir.
# More on git here: http://r-bio.github.io/intro-git-rstudio/ . There's also a datacamp on it.
#............................
tidyverse::read_csv
but in this case use base::read.csv
)#............................
# Read data (HW1.Q2) ####
#............................
births = read.csv("data/births2012.csv", stringsAsFactors = F) # (HW1.Q2a)
names(births) = tolower(names(births)) #drop names to lowercase
head(births) # (HW1.Q2b)
## x deltype bedcode stocc coocc cityocc stres cores cityresf dob
## 1 1 1 K NC 49 46340 NC 49 30120 2012-01-01
## 2 2 1 I NC 47 73660 NC 47 0 2012-01-01
## 3 3 1 O NC 81 31400 NC 81 31400 2012-01-01
## 4 4 1 O NC 81 31400 NC 81 31400 2012-01-01
## 5 5 1 O NC 127 57500 NC 127 0 2012-01-01
## 6 6 1 N NC 179 43920 NC 179 0 2012-01-01
## dobmonth dobday dobyear tob sex plur sord liveb bwgrp mrace methnic
## 1 1 1 2012 621 1 1 99 99 6 1 N
## 2 1 1 2012 1659 1 1 99 99 6 1 N
## 3 1 1 2012 2211 1 1 99 99 6 2 N
## 4 1 1 2012 1144 2 1 99 99 7 2 N
## 5 1 1 2012 950 2 1 99 99 7 1 N
## 6 1 1 2012 1711 2 1 99 99 7 1 N
## meduc mage mnat marital attend mtran visits mdif kessner kotel zip wic
## 1 3 24 NC 1 1 2 15 2 1 4 28532 Y
## 2 2 25 FL 1 1 2 13 2 1 3 28472 N
## 3 2 29 IL 2 1 2 4 7 3 1 27262 Y
## 4 2 19 NC 2 1 2 6 6 2 1 27262 Y
## 5 1 25 NC 1 1 2 10 4 2 3 27822 Y
## 6 3 23 NC 1 1 2 15 4 2 4 28112 Y
## totpreg lbliving lbdead term dllb dllbmo dllbyear dtlterm dtlmo
## 1 2 1 0 0 92006 9 2006 888888 88
## 2 1 0 0 0 888888 88 8888 888888 88
## 3 6 3 0 2 42005 4 2005 62004 6
## 4 2 1 0 0 22010 2 2010 888888 88
## 5 4 3 0 0 82010 8 2010 888888 88
## 6 1 0 0 0 888888 88 8888 888888 88
## dtlyear lastdelv cigbef cigdur pay pdiab gdiab phype ghype ppb ppo inft
## 1 8888 1 N N 2 N N N N N N N
## 2 8888 9 Y N 2 N N N N N N N
## 3 2004 1 N N 1 N N Y N N N N
## 4 8888 1 N N 1 N N N N N N N
## 5 8888 1 Y Y 1 N N N N N N N
## 6 8888 9 N N 1 N N N N N N N
## psec npsec gon syph cham hepb hepc cerv toc ecvs ecvf prom pric prol
## 1 Y 1 N N N N N N N N N Y N N
## 2 N 0 N N N N N N N N N N N N
## 3 N 0 N N N N N N N N N N N N
## 4 N 0 N N N N N N N N N N N N
## 5 N 0 N N N N N N N N N N N N
## 6 N 0 N N N N N N N N N N N N
## indl augl nvpr ster antb chor mecs fint esan attf attv pres rout tlab
## 1 N N N N Y N N N N N N 1 4 N
## 2 N Y N N N N N N Y N N 1 1 X
## 3 Y N N N N N N Y Y N N 1 1 X
## 4 N N N N N N N N Y N N 1 1 X
## 5 N N N N N N N N Y N N 1 1 X
## 6 N N N N N N N N N N N 1 1 X
## mtr plac rut uhys aint uopr gest apgar5 apgar10 aven1 aven6 nicu surf
## 1 N N N N N N 36 8 88 N N N N
## 2 N N N N N N 39 8 88 N N N N
## 3 N N N N N N 40 7 88 N N N N
## 4 N N N N N N 39 9 88 N N N N
## 5 N N N N N N 39 9 88 N N N N
## 6 N N N N N N 40 9 88 N N N N
## anti seiz binj anen mnsb cchd cdh omph gast limb cl cp dowt cdit hypo
## 1 N N N N N N N N N N N N N N N
## 2 N N N N N N N N N N N N N N N
## 3 N N N N N N N N N N N N N N N
## 4 N N N N N N N N N N N N N N N
## 5 N N N N N N N N N N N N N N N
## 6 N N N N N N N N N N N N N N N
## itran bfed ehype inft_drg inft_art bmig wksgest
## 1 2 Y N X X 3 38
## 2 2 Y N X X 3 40
## 3 2 N N X X 3 39
## 4 2 Y N X X 3 37
## 5 2 Y N X X 4 40
## 6 2 Y N X X 4 40
# ......................................
# Exploring Data - (HW1Q3) ####
# ......................................
dim(births) # (HW1.Q3b)
## [1] 122513 117
summary(births[,c("wksgest", "mdif")]) # (HW1.Q3c)
## wksgest mdif
## Min. :18.00 Min. : 1.000
## 1st Qu.:38.00 1st Qu.: 2.000
## Median :39.00 Median : 3.000
## Mean :38.83 Mean : 6.183
## 3rd Qu.:40.00 3rd Qu.: 4.000
## Max. :99.00 Max. :99.000
str(births[,c("mage", "wksgest", "mdif")]) # (HW1.Q3d)
## 'data.frame': 122513 obs. of 3 variables:
## $ mage : int 24 25 29 19 25 23 20 22 36 32 ...
## $ wksgest: int 38 40 39 37 40 40 37 39 42 40 ...
## $ mdif : int 2 2 7 6 4 4 4 2 1 2 ...
class(births$mage)
## [1] "integer"
class(births$wksgest)
## [1] "integer"
class(births$mdif)
## [1] "integer"
str(births$mage)
## int [1:122513] 24 25 29 19 25 23 20 22 36 32 ...
str(births$wksgest)
## int [1:122513] 38 40 39 37 40 40 37 39 42 40 ...
str(births$mdif)
## int [1:122513] 2 2 7 6 4 4 4 2 1 2 ...
# ......................................
# Exploring Data - (HW1Q4) ####
# ......................................
## (HW1.Q4)
births_sample = births[1:1000, c("mage", "mdif", "visits", "wksgest", "mrace")] #A
plot(births_sample) #B
births_sample = births[sample(nrow(births), 1000), c("mage", "mdif", "visits", "wksgest", "mrace")] #C
# install.packages(c("car", "GGally"))
car::scatterplotMatrix(births_sample)
GGally::ggpairs(births_sample)
plot(density(x))
] or histogram.lubridate
package’s github page and skim it?lubridate
and vignette(“lubridate”)
to explore the packagelubridate::ymd()
to convert the dob string variable into a date-typed version called dob_d## (HW1.Q5)
### A. Prenatal Care (Exposure)
#### i
table(births$mdif, useNA = "always")
##
## 1 2 3 4 5 6 7 8 9 88 99 <NA>
## 9292 36987 41252 14783 7142 3984 2336 1556 971 2055 2155 0
#### ii
# See data dictionary, but 88 means no early prenatal care and 99 means missing
#### iii
births$mdif[births$mdif==99] <- NA
#### iv
boxplot(births$mdif[births$mdif != 88], main="Month Prenatal Care Began, \n Excluding Unkowns and No Prenatal Care")
#### v
births$pnc5 <- ifelse(births$mdif <= 5, 1, 0)
#### vi
births$pnc5_f <- factor(births$pnc5, levels=c(0,1), labels=c("No Early PNC", "Early PNC"))
#### vii
table(births$mdif, births$pnc5_f, useNA = "always")
##
## No Early PNC Early PNC <NA>
## 1 0 9292 0
## 2 0 36987 0
## 3 0 41252 0
## 4 0 14783 0
## 5 0 7142 0
## 6 3984 0 0
## 7 2336 0 0
## 8 1556 0 0
## 9 971 0 0
## 88 2055 0 0
## <NA> 0 0 2155
# (HW1.Q5 continued)
### B. Preterm Birth (Outcome)
#### i
hist(births$wksgest) #quick look
#### ii
births$preterm <- ifelse(births$wksgest >= 20 &births$wksgest < 37, 1,
ifelse(births$wksgest >= 37 & births$wksgest < 99, 0, NA)
)
#### iii
births$preterm_f <- factor(births$preterm, levels=c(0,1), labels=c("term", "preterm"))
#### iv
births$preterm_f <- cut(births$wksgest, breaks = c(0,19,36,99), labels = c("too young", "preterm", "term")) #should be no births too young at this point
# (HW1.Q5 continued)
### C. Plurality (covariate)
#### i
table(births$plur, useNA = "always") # looks fine, no 9 values for unknown
##
## 1 2 3 4 5 <NA>
## 118306 4080 118 4 5 0
# (HW1.Q5 continued)
### D. Maternal Age (covariate)
#### i
table(births$mage, useNA = "always")
##
## 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## 1 1 5 23 102 355 846 1665 2877 4490 5539 6244 6564 6581 6533
## 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
## 6625 6716 7084 6953 7005 6934 6670 6026 5361 4681 4034 3266 2642 2030 1570
## 40 41 42 43 44 45 46 47 48 49 50 51 52 54 55
## 1152 806 510 293 156 64 36 26 8 6 5 2 2 1 1
## 99 <NA>
## 22 0
#### ii
# See data dictionary, but 99 means misssing
#### iii
births$mage[births$mage==99] <- NA
#### iv
par(mfrow=c(1,2))
boxplot(births$mage, main="Boxplot of Maternal Age Distribution")
plot(density(births$mage[!is.na(births$mage)]), main="Density Plot of Maternal Age Distribution")
dev.off() # turning back to default
## null device
## 1
#### v
births$mage_centered <- births$mage - mean(births$mage, na.rm = T)
hist(births$mage) # looks to be expected
summary(births$mage)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.00 23.00 27.00 27.56 32.00 55.00 22
# (HW1.Q5 continued)
### E. Cigarette Use (covariate)
#### i
table(births$cigdur, useNA = "always") # check to see what values are in cigdur
##
## N U Y <NA>
## 108952 610 12951 0
births$cigdur <- ifelse(births$cigdur == "Y", 1,
ifelse(births$cigdur == "N", 0, NA)) # NA for when case is U - i.e. unknown
#### ii
births$smoker_f <- factor(births$cigdur, levels=c(0,1), labels=c("Non-smoker", "Smoker"))
#### ii
table(births$cigdur, births$smoker_f, useNA = "always")
##
## Non-smoker Smoker <NA>
## 0 108952 0 0
## 1 0 12951 0
## <NA> 0 0 610
# (HW1.Q5 continued)
### F. Date of Birth (covariate)
#### i
# https://github.com/tidyverse/lubridate
#### ii
?lubridate
# vignette("lubridate")
#### iii
library(lubridate) # normally this would go at the top of the script under a packages chunk
#### iv
births$dob_d <- lubridate::ymd(births$dob)
# (HW1.Q5 continued)
### G. Sex (covariate)
#### i
table(births$sex, useNA = "always") # check what initial values are
##
## 1 2 9 <NA>
## 62523 59988 2 0
births$sex[births$sex==9] <- NA
births$sex_f <- factor(births$sex, levels=c(1,2), labels=c("Male", "Female"))
table(births$sex_f, births$sex, useNA="always")
##
## 1 2 <NA>
## Male 62523 0 0
## Female 0 59988 0
## <NA> 0 0 2
# (HW1.Q5 continued)
### G. Maternal Race/Ethnicity (covariate)
#### i
table(births$mrace, births$methnic, useNA = "always") # check what initial values are
##
## N U Y <NA>
## 1 68995 47 3026 0
## 2 28842 12 487 0
## 3 1645 0 81 0
## 4 4877 12 14489 0
## <NA> 0 0 0 0
## ii
mergedf <- data.frame(
mrace = c(sort(rep(seq(1,4),3))),
methnic = as.character(rep(c("Y", "N", "U"), 4)),
raceeth_f =factor(c("White Hispanic", "White non-Hispanic", "Other",
rep("African American", 3),
rep("American Indian or Alaska Native", 3),
rep("Other", 3)
)
),
stringsAsFactors = F)
births <- merge(x=births, y=mergedf, by=c("mrace", "methnic"))
#### iii
table(births$mrace, births$methnic, births$raceeth_f, useNA = "always") # looks good
## , , = African American
##
##
## N U Y <NA>
## 1 0 0 0 0
## 2 28842 12 487 0
## 3 0 0 0 0
## 4 0 0 0 0
## <NA> 0 0 0 0
##
## , , = American Indian or Alaska Native
##
##
## N U Y <NA>
## 1 0 0 0 0
## 2 0 0 0 0
## 3 1645 0 81 0
## 4 0 0 0 0
## <NA> 0 0 0 0
##
## , , = Other
##
##
## N U Y <NA>
## 1 0 47 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 4877 12 14489 0
## <NA> 0 0 0 0
##
## , , = White Hispanic
##
##
## N U Y <NA>
## 1 0 0 3026 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## <NA> 0 0 0 0
##
## , , = White non-Hispanic
##
##
## N U Y <NA>
## 1 68995 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## <NA> 0 0 0 0
##
## , , = NA
##
##
## N U Y <NA>
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## <NA> 0 0 0 0
# ......................................
# ......................................
# Optional data exploration (HW1.Q6) ####
# ......................................
# (HW1.Q6)
## A
#CreateTableOne(vars = c("preterm_f"), strata = c("mage", "pnc5_f", "smoker_f", "sex_f"), data = births)
## B
# mice::md.pattern(births) # the output here is a lot, so I have opted so comment it out for space concerns
## C
# GGally::ggpairs(data = births[1:1000, c("pnc5_f", "preterm_f")], title="Exposure versue Outcome for First 1,00 Obserations")
The purpose of this homework is to familiarize you with the apply
family, which are powerful tools that build on vectorization (which converts to speed in R).
# ......................................
# Eligibility criteria & core recoding (HW2 part 1)
# ......................................
# Create many inclusion variables, all of which have to be =1 to be included.
births$weeknum = week(births$dob_d) #thanks, lubridate! (HW2.1.Q1)
variable = 1
meaning it passes that inclusion test, and 0 meaning it does not pass. We’ll use apply()
functions where helpful to help reduce our code. The as.numeric() function may be helpful to cast Booleans to 1s and 0s, though T
and F
are interpreted intelligently as 1 and 0 as well – these instructions are based on 1s and 0s, but you may use T
and F
if that makes more sense to you.
# a. Has gestation data (weeks), and at least 20 weeks gestation pre-birth (weeks >= 20)
births$incl_hasgest = as.numeric(!is.na(births$wksgest)) #(HW2.1.Q2a)
births$incl_enoughgest = as.numeric(births$wksgest >= 20) #(HW2.1.Q2b)
# b. Less than 20 weeks gestation before Jan 1 of year (LMP > Aug 20), or weeks-weeknum>=19, w/ weeknum=1 for Jan 1-7
births$incl_lateenough = as.numeric(births$wksgest - births$weeknum <= 19) #(HW2.1.Q2c)
# c. Start date of gestation was 45 (max gest in'11) w prior to Jan 1 of next year, so all births are observed in year
births$incl_earlyenough = as.numeric(births$weeknum - births$wksgest <=7) #(HW2.1.Q2d)
# d. singletons only
births$incl_singleton = as.numeric(births$plur == 1) #(HW2.Q2e)
apply()
c("anen","mnsb","cchd", "cdh", "omph","gast", "limb", "cl","cp","dowt","cdit", "hypo")
apply()
with the all()
function on the congenital anomaly columns of births to return a 1 if all congenital anomaly variables are “N”, 0 otherwise and assign it to variable incl_noanomalies.# e. no congential anomalies - Let's use apply for efficiency
congen_anom = c("anen","mnsb","cchd", "cdh", "omph","gast", "limb", "cl","cp","dowt","cdit", "hypo") #(HW2.1.Q3a)
# below is just a useful aside to check that there is No "U" (i.e. none missing) before running the code to make the new variable
births$incl_hasanomdata = as.numeric(apply(births[,congen_anom]!="U", FUN=all, 1)) #All not missing = not any missing
births$incl_noanomalies = as.numeric(apply(births[,congen_anom]=="N", FUN=all, 1)) #All not present #(HW2.1.Q3b)
birth=old_births
to reset your work, since this dataset is relatively small (sometimes I’ll save these sorts of save points in a comment… ).grepl
function. To take advantage of our incl_
naming convention, we’ll use a text search function, grepl
. As an example, try (and note the misspelling of banana) : grepl("a", c("banana", "peach", "ornge"))
apply()
and sum over the column margins instead of rows. The grepl
function may be useful to subset births to the inclusion criteria of interest.grepl()
function to find just the variables with incl_ in their name, subset births to be just those that have a 1 in every incl_ variable. This is similar to what’s done in 3c, above, but looks at rows instead of columns.nrow()
or dim()
.old_births <- births #(HW2.1.Q4a)
# Below is the example grepl (HW2.1.Q4b)
grepl("a", c("banana", "peach", "ornge"))
[1] TRUE TRUE FALSE
eligibility_drops = nrow(births) - apply(births[,grepl("incl_", names(births))], FUN=sum, MARGIN = 2, na.rm=T) #(HW2.1.Q4c optional)
eligibility_drops #(HW2.1.Q4c optional)
incl_hasgest incl_enoughgest incl_lateenough incl_earlyenough
0 39 42317 15014
incl_singleton incl_hasanomdata incl_noanomalies 4207 761 1051
births$include_allpass = apply(births[, grepl("incl_", names(births))], FUN=all, MARGIN = 1) #(HW2.1.Q4d indicator for the inclusion criteria yes/no)
births = births[births$include_allpass, ] #(HW2.1.Q4e now apply the inclusion criteria to subset your births data set)
# to see what warnings are being thrown you can type warnings() -- below:
warnings() # Just letting me know I'm casting those 1s as TRUEs. That's ok.
dim(old_births) #(HW2.1.Q4f)
[1] 122513 134
dim(births) #(HW2.1.Q4f)
[1] 62370 135
#(HW2.1.Q5 Optional Challenge)
# Results of exclusion
cat("Leaving eligibility checks with n =", formatC(nrow(births), big.mark = ","),
"births of original", formatC(nrow(old_births), big.mark = ",")) #(HW2.1.Q4d)
Leaving eligibility checks with n = 62,370 births of original 122,513
# names(births)
vars_of_interest = c("pnc5_f", "preterm_f", "smoker_f", "wksgest", "sex", "meduc", "raceeth_f", "weeknum", "include_allpass")
tableone::kableone(tableone::CreateTableOne(data=births[, vars_of_interest])) #(HW2.1.Q5)
Overall | |
---|---|
n | 62370 |
pnc5_f = Early PNC (%) | 56087 ( 91.2) |
preterm_f (%) | |
too young | 0 ( 0.0) |
preterm | 6114 ( 9.8) |
term | 56256 ( 90.2) |
smoker_f = Smoker (%) | 6642 ( 10.7) |
wksgest (mean (sd)) | 38.91 (2.44) |
sex (mean (sd)) | 1.49 (0.50) |
meduc (mean (sd)) | 2.79 (1.23) |
raceeth_f (%) | |
African American | 14840 ( 23.8) |
American Indian or Alaska Native | 884 ( 1.4) |
Other | 10013 ( 16.1) |
White Hispanic | 1581 ( 2.5) |
White non-Hispanic | 35052 ( 56.2) |
weeknum (mean (sd)) | 33.03 (8.11) |
include_allpass = TRUE (%) | 62370 (100.0) |
# let's look at who was included and excluded in our database
# This is the same as question 4 but now I am running it on the birth dataset that has not been subsetted (i.e. we could have already saved this variable earlier on or we can easily recreate it)
old_births$include_allpass <- apply(old_births[, grepl("incl_", names(old_births))], FUN=all, MARGIN = 1)
tableone::kableone(tableone::CreateTableOne(data=old_births[, vars_of_interest], strata="include_allpass"))
FALSE | TRUE | p | test | |
---|---|---|---|---|
n | 60143 | 62370 | ||
pnc5_f = Early PNC (%) | 53369 (90.7) | 56087 ( 91.2) | 0.009 | |
preterm_f (%) | <0.001 | |||
too young | 39 ( 0.1) | 0 ( 0.0) | ||
preterm | 8041 (13.4) | 6114 ( 9.8) | ||
term | 52063 (86.6) | 56256 ( 90.2) | ||
smoker_f = Smoker (%) | 6309 (10.6) | 6642 ( 10.7) | 0.737 | |
wksgest (mean (sd)) | 38.76 (3.99) | 38.91 (2.44) | <0.001 | |
sex (mean (sd)) | 1.49 (0.50) | 1.49 (0.50) | 0.425 | |
meduc (mean (sd)) | 2.83 (1.33) | 2.79 (1.23) | <0.001 | |
raceeth_f (%) | 0.136 | |||
African American | 14501 (24.1) | 14840 ( 23.8) | ||
American Indian or Alaska Native | 842 ( 1.4) | 884 ( 1.4) | ||
Other | 9412 (15.6) | 10013 ( 16.1) | ||
White Hispanic | 1445 ( 2.4) | 1581 ( 2.5) | ||
White non-Hispanic | 33943 (56.4) | 35052 ( 56.2) | ||
weeknum (mean (sd)) | 20.68 (17.71) | 33.03 (8.11) | <0.001 | |
include_allpass = TRUE (%) | 0 ( 0.0) | 62370 (100.0) | <0.001 |
# ......................................
# NOTES: Finishes with 62,370 of 122,513
# ......................................
In this section of the homework we will start with the tidyverse
, and focus on the dplyr
and ggplot2
packages. As discussed in class, there are powerful packages that make cleaning, analyzing, and graphing data “easy” (or at least tidy!).
#(HW3.1.Q1A) Graph weeks
npop = formatC(sum(!is.na(births$wksgest)), big.mark=',') #(HW3.1.Q1a)
fig1a <- ggplot(data=births, aes(x=wksgest, y=..prop..))+geom_bar() + #Does what you need, as does geom_histogram()
labs(title="Figure 1: Proportional distribution of gestational age at birth",
subtitle=paste0("From ", npop, " births in NC in 2012 from the NC SCHS, \n posted at Odum Institute"),
x="Weeks of gestation", y="Proportion of population") +
theme_bw()
#(HW3.1.Q1a): Working with weeknum. - should double check this. Just used %V above.
# hist(births$weeknum, breaks = min(births$weeknum):max(births$weeknum)) # Looks weird
#(HW3.1.Q1B) : Graph weeknum
fig1b <- ggplot(births, aes(x=weeknum, y=..prop..)) + geom_bar() +
labs(title="Throwaway: Investigating Weeknum",
subtitle="Note to self : Weeknum needs some work, \n doesn't quite match SAS",
x="Week number of birth", y="Proportion of population") +
theme_bw()
#(HW3.1.Q1C) weeks vs. weeknum
fig1c <- ggplot(births, aes(x=wksgest, y=weeknum)) +
geom_jitter(width=1, height=1, pch=".", alpha=0.3)+
labs(title="Throwaway: week vs. weeknum",
subtitle="Note to self: Does this match what we expect given the inclusion criteria?",
x="weeks of gestation", y="week number of birth in year")+
theme_bw()
plot(gridExtra::arrangeGrob(fig1a, fig1b, fig1c, layout_matrix = rbind(c(1,2), c(3,3))))
4. Change a graph from questions a, b or c using ggplot’s “+” syntax to add a layer. You can do anything you like, but as a suggestion, try taking a saved plot c and adding geom_bin2d(binwidth=1, alpha=0.8).
#(HW3.1.Q1D)
# Note here that you can ggplot objects that are "appendable"
# which is to say you can add on new features by adding on LAYERS
plotObj <- ggplot(births, aes(x=wksgest, y=weeknum)) +
geom_jitter(width=1, height=1, pch=".", alpha=0.3)+
labs(title="Throwaway: week vs. weeknum",
subtitle="Note to self: Does this match what we expect given the inclusion criteria?",
x="weeks of gestation", y="week number of birth in year")+
theme_bw()
plot(plotObj)
## append new feature
plotObj + geom_bin2d(binwidth=1, alpha=0.8) + labs(caption = "Look we appended a layer here!")
5. Create your own! Create and submit any ggplot2 graphic you like built from the births dataset and using our variables of interest. Use at least two geometry layers, a title, a subtitle, and a caption that describes your interpretation of the punchline of your graph.
#install.packages("ggridges")
library("ggridges")
ggplot() +
geom_density_ridges_gradient(data=births, aes(x=mage, y=factor(cut(wksgest, 10)), fill=..x..)) +
scale_fill_gradientn(colors=c("#fc8d59", "#ffffbf", "#91cf60")) +
facet_grid(meduc~raceeth_f) +
ggtitle("Distribution of Weeks Gestation by Birth Outcome") +
labs(subtitle="This is the continuous variable we dichotomized") +
ylab("Weeks of Gestations factorized") + xlab("mage") +
theme_bw()
This section of the homework will focus a bit more on “piping” with magrittr
and using dplyr
.
Create a summary table for maternal age: Using dplyr, create a summary table for mage from the births dataset whose head looks like below:
Investigate why maternal age of 54 gives a NaN for the newly created pct_earlyPNC
variable (hint – what would the mean of all missing be?).
# ......................................
# Functional Form of maternal age w/ dplyr and ggplot2 #(HW3 part 2)
# ......................................
mage_df = births %>% group_by(mage) %>% #(HW3.2.2) setup
summarize(n=n(),
pct_earlyPNC = mean(pnc5, na.rm=T),
pct_preterm = mean(preterm, na.rm=T)) %>%
mutate_if(is.numeric, round, 2)
DT::datatable(mage_df) #(HW3.2.2) answer
#(HW3.2.3A)
fig2a <- ggplot(mage_df, aes(mage, pct_preterm))+
geom_point(aes(size=n))+
geom_smooth(aes(weight=n), color="blue", method="loess")+
labs(title="% Preterm vs. Maternal Age",
x="maternal age", y="% preterm",
subtitle="Investigating function form of maternal age",
caption="Note for future glms: Seems quadratic. Blue is loess, red is square linear.")
#(HW3.2.3B) Optional Challenge
fig2b <- ggplot(mage_df, aes(mage, pct_preterm)) +
geom_point(aes(size=n)) +
geom_smooth(aes(weight=n), color="blue") +
geom_smooth(aes(weight=n), method="lm", formula=y ~ poly(x, 2), color="red", lty=2) + # this is the square error term
labs(title="% Preterm vs. Maternal Age",
x="maternal age", y="% preterm",
subtitle="Investigating function form of maternal age",
caption="Note for future glms: Seems quadratic. Blue is loess, red is square linear.")
plot(gridExtra::arrangeGrob(fig2a, fig2b))