Air pollutant exposure estimation based on the home address

It is crucial to assess the indivudal exposure of environmental factors at different exposure periods (i.e, progestation and second trimester), so as to accurately assess how the air pollutant exposure work on the certain outcomes (i.e, birth defect and preterm labor). Compared with the traditional way which only uses the average concnetration among the whole monitoring sites, the EnvExpInd package taken the spatial heterogeneity into consideration and obtained a more accurate assessment of environmental exposure.

Dataset

Generally, three datasets including the concentration of environmental factors at each time point, individual information and the environmental monitoring sites should be prepared. Here, we take a simple example to illustrate how we can use the EnvExpInd packageto estimate the individual exposure of air pollutants.

monitoring site: data.frame, contains the names of monitoring sites and their corresponding address.
air pollution data: the concentration of environmental factors at each time point.
individual information: detailed information of address and the start time to be estimated.

Procedure

Obtain the longitude and latitude information for each individual as well as the monitoring sites according to their detailed address.
Calculate the matched monitoring site (nearest monitoring site used as the matched site) or the corresponding spatial position after the interpolation.
Based on the the time point of outcome and the assessment period, we can easily get the individual exposure in the targte period by re-running the second step.

Install

The package hasn’t been published in R-Project，whereas, you can install it from the github (Spatial-R):

devtools::install_github("Spatial-R/EnvExpInd")

Load the dataset

After you have installed the EnvExpInd package, you can load the “envid” dataset in the package using the following codes: data("envind). The envid dataset contains the information about the air pollutants, montoring site and the detailed information for each patient. In order to omit the potential error in the following procedure, you should make sure that the variables regarding date information in the air pollutants and patient dataset should be formated as “date/time”.

 library(sp);
 library(gstat);
 library(EnvExpInd)
 data("envind")
 individual_data_tem$date <- as.Date(individual_data_tem$date);
 pollutant_data$date <- as.Date(pollutant_data$date)
 pollutant_data_tem <- pollutant_data

The individual_data_tem contains at least three variables: id (unique ID for the each patient), date (date of outcome occured, formated as “2014-10-20”) and address (detailed address for each individual).

The pollutant_data also contains at least three variables: date (date of montoring, formated as “2014-10-20”), site_name (the name for the monitoring site) and concentration of certain air pollutants. (Notes: There were some missing values in the pullutant dataset, therefore, you should make some interpolations to fill with)

The site_data contains at least two variables: site_name (the name for the monitoring site) and address (detailed address for the montoring site).

Analysis

Longitude and latitude information

There are several ways to get the longitude and latitude information based on the detailed address: Baidu, Google, and Amap. For the user in China, you can use the API of Baidu and Amap to transform the address into the longitude and latitude. Here, we provide a simple function get_latlon_china to help the uses to realise the process.
(Note: Daily call limitation for each api_key was 6000)

site_data <- get_latlon_china(data = site_data, add_var = "address", api_key = "your key")
individual_data_tem <- get_latlon_china(data = individual_data_tem, add_var = "address", api_key = "your key")

Exposure estimation

Generally, there were two ways to calculate the environmental exposure on the individual level: simple or complicated. The simple one here is that we choose the air pollutant concentration in the nearest monitoring site as the reference for each individual; however, for the complicated one, we use the spatial interpolation method (i.e, inverse distance weighted, Ordinary kriging, and land use regression) to assess the air pollutant concentration. In the current version of EnvExpInd, there are only two spatial interpolation methods available: inverse distance weighted and kriging.

Simple one

Two improtant functions can be used to realised this aim: the function get_refrence_id_simple and expoure_estimate_simple. The function get_refrence_id_simple is used to match the nearest monitoring site for each individual. There were eight arugments (four arugments for the individual dataset and four arugments for the monitoring sites) that we should define firstly:

individual_data: dataframe with at least three variables: longitude, latitude and unique id.
individual_lat: column name of latitude for each individual.
individual_lon: column name of longitude for each individual.
individual_id: column name of unique id for each individual.
site_data: dataframe with at least three variables: longitude, latitude and unique id.
site_lat: column name of latitude for each monitoring site.
site_lon: column name of longitude for each monitoring site.
site_id: column name of unique id for each monitoring site.

The codes are as followed:

individual_data_tem$refrence_id <- get_refrence_id_simple(individual_data = individual_data_tem, 
                                                      individual_lat = "lat",
                                                      individual_lon ="lon",
                                                      individual_id = "id",
                                                      site_data = site_data,
                                                      site_lat = "lat",
                                                      site_lon = "lon",
                                                      site_id = "site")
kable(head(individual_data_tem[,-3]))

After that, we can use the function expoure_estimate_simple to get the exposure during the study period:

pollutant_data_tem$date <- as.Date(pollutant_data_tem$date)
individual_data_tem$date <- as.Date(individual_data_tem$date)
exposure.simple <- expoure_estimate_simple(individual_data = individual_data_tem,
                                           individual_id = "id",
                                           refrence_id = "refrence_id",
                                           exposure_date = "date",
                                           pollutant_data = pollutant_data_tem,
                                           pollutant_site = "site.name",
                                           pollutant_date = "date",
                                           pollutant_name = c("PM10","SO2"),
                                           estimate_interval = c(0:1))

kable(head(exposure.simple$PM10),digits= 1)   #### PM10 estimation

id	day.0	day.1
2564	202	206
2563	206	220
2562	206	220
2561	230	200
2560	212	146
2559	176	136

kable(head(exposure.simple$SO2),digits= 1)    #### SO2 estimation

id	day.0	day.1
2564	27	47
2563	47	60
2562	47	60
2561	82	52
2560	54	72
2559	44	52

Inverse distance weighted

The spatial interpolation method can be realised using the gstat package. In the EnvExpInd package, we can use exposure_estimate_idw function to estimate the individual level exposure based on the inverse distance weighted method.

pollutant_data_tem_idw <- merge(pollutant_data_tem,site_data[,c(1,3,4)],by.x = "site.name",by.y = "site", all.x = T)

exposure.idw <- exposure_estimate_idw(individual_data = individual_data_tem,
                                      individual_id = "id",
                                      exposure_date ="date",
                                      individual_lat ="lat",
                                      individual_lon ="lon",
                                      pollutant_data = pollutant_data_tem_idw,
                                      pollutant_date = "date",
                                      pollutant_site_lat = "lat",
                                      pollutant_site_lon = "lon",
                                      pollutant_name = c("PM10","SO2"),
                                      estimate_interval = c(0:1))  


kable(head(exposure.idw$PM10),digits = 1)   #### PM10 estimation

id	day.0	day.1
2564	196.1	198.6
2563	220.2	198.6
2562	227.0	194.7
2561	191.0	224.1
2560	141.6	186.6
2559	136.1	176.1

kable(head(exposure.idw$SO2),digits= 1)   #### SO2 estimation

id	day.0	day.1
2564	52.9	33.1
2563	72.1	53.0
2562	73.3	53.2
2561	50.8	81.0
2560	58.7	44.5
2559	51.9	44.0

Ordinary kriging

Ordinary kriging to estimate the environmental exposure can be realised by the function exposure_estimate_krige. Generally, you should define the semivariogram firstly.

example.date <- range(pollutant_data_tem$date)[2]
test.pollutant <- filter(pollutant_data_tem,date == example.date)[,c(2,5)]
test.pollutant <- merge(test.pollutant,site_data,by.x = "site.name",by.y = "site")
coordinates(test.pollutant) <- ~lat + lon
m <- fit.variogram(variogram(PM10~1, test.pollutant), vgm(1, "Sph", 200, 1))

estimate.krige <- exposure_estimate_krige(individual_data = individual_data_tem,
                                     individual_id = "id",
                                     exposure_date = "date",
                                     individual_lat = "lat",
                                     individual_lon = "lon",
                                     pollutant_data = pollutant_data_tem_idw,
                                     pollutant_date = "date",
                                     pollutant_site_lat = "lat",
                                     pollutant_site_lon = "lon",
                                     pollutant_name = c("PM10","SO2"),
                                     estimate_interval = c(0:1),
                                     krige_model = m,
                                     nmax = 7,
                                     krige_method = "med")


kable(head(estimate.krige$PM10),digits = 1)   #### PM10 estimation

id	day.0	day.1
2564	194	198
2563	220	198
2562	220	194
2561	192	220
2560	136	178
2559	136	178

kable(head(estimate.krige$SO2),digits = 1)   #### SO2 estimation

id	day.0	day.1
2564	58.0	35.0
2563	82.0	58.0
2562	80.0	58.0
2561	46.7	82.0
2560	52.0	46.7
2559	52.0	46.7

Air pollutant exposure estimation based on the home address

Bing Zhang

2020-10-15