Calculating Guidelines with EnviroDataR • EnviroDataR

Introduction

This tutorial will provide an overview of how EnviroDataR can be used to calculate water quality guidelines. Note that in order to use this method, your data should be stored in EnviroData, with the appropriate guidelines set up and linked with cofactors and analytes. For information on how to do this, check out the EnviroData Documentation.

In this example, we will assume we have a list of stations and analytes for which we need to calculate guidelines. We will also assume we have a list of reference stations that need to be used for cofactor data acquisition. The exact steps for accomplishing this task will vary depending on the specifics of your use-case. But broadly, the workflow can be broken down as follows:

Collect the dataset for relevant stations and analytes from EnviroData.
Prepare a reference dataframe consisting of cofactor data
Join the measurement dataframe with the cofactor reference dataframe
Ensure all analytes and cofactors have correctly formatted named and units
Calculate the guidelines!

Initial Setup and Collect Data

As usual, we’ll start by attaching the packages to the namespace. This step assumes you have the packages listed below installed. For instructions on how to install EnviroDataR and other Hatfield Packages, see the Install instructions on the homepage.

library(EnviroDataR)
library(dplyr)
library(stringr)
library(DT) # Attaching for tutorial. You generally don't need this.

We will set the log threshold to “ERROR” so that only errors are logged in our logfile. Otherwise we run the risk of having a logfile that’s tens of MB in size, which is not ideal.

The first step to collecting data is to define some constants. This includes our start date, end date, and a vector of station names.

TIP Use get_primary_station_from_alias_name() to ensure you have the correct primary names of the stations as seen in EnviroData. If the station name provided to EnviroDataR does not match the EnviroData station name, you will not get any data!

start_date <- "Jan-01-2019"
end_date <- "Dec-31-2021"

stations <- c(
  "WQ2",
  "WQ3",
  "WQ4",
  "WQ5a",
  "WQ7",
  "WQ11",
  "WQ13",
  "WQ20",
  "WQ21A",
  "WQ30",
  "WQ31",
  "WQ32",
  "WQ33",
  "WQ34",
  "WQ36",
  "WQ37",
  "WQ38",
  "WQ39"
)

stations <- EnviroDataR::get_primary_station_from_alias_name(alias_names = stations)

Next we’ll get the details for each station. This allows us to know which analytes are available for query. This step is not necessarily essential but is highly recommended, as having a good understanding of your dataset is key.

We’ll use purrr::map_dfr function from purrr to send each station name to the get_wq_locations_details() function and then bind those results into a single dataframe. Alternatively, you could use a list, but vectorized functions like purrr::map_dfr are R’s specialty. Finally, we’ll filter the station details to only include those involving “water” (i.e., not climate or sediment) and then set the unique set of analyte names to a vector.

stations_details <- do.call(rbind, lapply(stations, EnviroDataR::get_wq_locations_details))

## Warning in log_details(location_names, parsed_response, "Location"): The request for data from WQ37 returned zero rows of data.
##     This is a warning, not an error.

media_type <- "Water"
stations_with_water_details <- stations_details %>%
      filter(grepl(media_type, MediaType, ignore.case = T))
analytes <- unique(stations_with_water_details$Analyte)

Now we use get_wq_report_data to collect the entire dataset for the stations, dates, and analytes.

df <- EnviroDataR::get_wq_report_data(start_date = start_date,
                                          end_date = end_date,
                                          stations = stations,
                                          analytes = analytes,
                                          media = "Water",
                                          guidelines = NULL,
                                          includeQaQc = F,
                                          make_nice_names = F)

Now we have our dataset. Below I’ve highlighted two important columns: Analyte and DisplayName. EnviroData has two names for every analyte. “Analyte” is the name of the analyte as it is saved in the database. This name contains underscores and isn’t pretty to humans. For this reason, “DisplayName” column shows the analyte the name in a human readable format, and is how analytes are displayed in EnviroData.

The calculate_guidelines function in EnviroDataR needs to be provided the analyte names as they are saved in the database. For this reason, it’s crucial to keep your analytes in EnviroData organized. If you have several “pH” analytes, it will be a pain trying to consolidate them. Another advantage of using EnviroData is that we don’t need to do anything further with our analyte names before sending them for calculation. This is in contrast to using a dataset outside of EnviroData, where the analyte names could be anything, which would require reformatting them to the correct names before using them in the wqProcessing package to calculate guidelines.

df %>% select(Analyte, DisplayName) %>% distinct() %>% slice(1:10)

##                                                Analyte
## 1   Alkalinity (Total as CaCO3)_Concentration (liquid)
## 2      Alkalinity (PP as CaCO3)_Concentration (liquid)
## 3            Bicarbonate (HCO3)_Concentration (liquid)
## 4               Carbonate (CO3)_Concentration (liquid)
## 5                Hydroxide (OH)_Concentration (liquid)
## 6       Chloride (Cl)-Dissolved_Concentration (liquid)
## 7  Organic Carbon (C)-Dissolved_Concentration (liquid)
## 8                  Fluoride (F)_Concentration (liquid)
## 9        Hardness (CaCO3)-Total_Concentration (liquid)
## 10   Hardness (CaCO3)-Dissolved_Concentration (liquid)
##                     DisplayName
## 1   Alkalinity (Total as CaCO3)
## 2      Alkalinity (PP as CaCO3)
## 3            Bicarbonate (HCO3)
## 4               Carbonate (CO3)
## 5                Hydroxide (OH)
## 6       Chloride (Cl)-Dissolved
## 7  Organic Carbon (C)-Dissolved
## 8        Fluoride (F)-Dissolved
## 9        Hardness (CaCO3)-Total
## 10   Hardness (CaCO3)-Dissolved

Prepare Cofactor Reference Data

Our next step is to prepare the cofactor dataframe which is needed for calculation of guidelines. This can feel like a daunting process, but it’s actually straightforward when we break it down into steps:

Identify which cofactors are needed for the guidelines that you want.
Identify which analytes will be used for each of these cofactors.
Identify which reference station should be queried for cofactor data on each individual measurement of your dataset.
Collect the cofactor data
Turn it into wide format
Join the wide-format cofactor table with the original dataset.

Below we’ll start with step 1.

Cofactor Initialization

We know we need certain cofactor data. So we’ll create a vector named cofactor. These cofactor names must remain as shown below when using EnviroDataR to calculate guidelines.

cofactors <- c(
      "Hardness",
      "labpH",
      "Chloride",
      "fieldpH",
      "Temperature",
      "DOC",
      "Ca",
      "Mg",
      "K",
      "Na",
      "Alkalinity",
      "SO4",
      "Cu"
    )

Cofactor-Analyte Mapping

If you recall our list of analytes, you’ll remember that the “Hardness” analyte was not named “Hardness” but was named “Hardness (CaCO3)-Total_Concentration (liquid)” This means we’ll need to map the EnviroData analyte names to the correct format that EnviroDataR needs. We do this by making a vector of the same cofactors shown above, but using the Analyte names as they’re shown in EnviroData.

We’ll then combine these two vectors together into one dataframe.

envirodata_analyte_name <- c(
      "Hardness (CaCO3)-Total_Concentration (liquid)",
      "pH_PH",
      "Chloride (Cl)-Dissolved_Concentration (liquid)",
      "Field pH_PH",
      "Field Temperature_Temperature",
      "Dissolved Organic Carbon_Concentration (liquid)",
      "Calcium (Ca)-Total_Concentration (liquid)",
      "Magnesium (Mg)-Total_Concentration (liquid)",
      "Potassium (K)-Total_Concentration (liquid)",
      "Sodium (Na)-Total_Concentration (liquid)",
      "Alkalinity (Total as CaCO3)_Concentration (liquid)",
      "Sulphate (SO4)-Dissolved_Concentration (liquid)",
      "Copper (Cu)-Dissolved_Concentration (liquid)"
    )

cofactors_mapping_df <- data.frame(cofactor = cofactors, envirodata_analyte_name = envirodata_analyte_name )

DT::datatable(cofactors_mapping_df, rownames = FALSE, filter="top", 
              options = list(pageLength = 5, scrollX=T))

Note that we have a row for labpH and fieldpH. Eventually these will be consolidated into a single pH cofactor. If you only have data for one pH measurement, you should remove labpH and fieldpH and replace them with a single “pH” colName.

Identify Reference Stations

Before getting cofactor data, we need to know WHERE to get the cofactor data from. This is known as the reference station. If the reference stations remain unchanged throughout time, we can simply make a two column dataframe consisting of “Location” (where the original measurement is taken) and “reference.station” (where the cofactor needs to come from). Below you’ll see there’s a few locations that get their cofactor data from a different station.

reference_stations <-  data.frame(
  Location = c("WQ2", "WQ4", "WQ11", "WQ20", "WQ21A", "WQ36"),
  reference.station = c("WQ2", "WQ4", "WQ4", "WQ20", "WQ36", "WQ36"))

Every datapoint in our dataset needs a calculated guideline. This means we need a set of cofactor values for every datapoint in our dataset. So we’ll create a reference_metadata dataframe by first getting the date and location of every row in our dataset. Then we’ll do a left_join with reference_stations to add the Location of the reference station to this dataframe.

Finally, we’ll use get_primary_station_from_alias_name() to ensure our provided reference station names match the names in EnviroData.

unique_measurements <- df %>% distinct(Location, DateTime)

reference_metadata <- unique_measurements %>%
  left_join(reference_stations)

## Joining with `by = join_by(Location)`

reference_metadata$reference.station <- EnviroDataR::get_primary_station_from_alias_name(
  alias_names = reference_metadata$reference.station)

DT::datatable(reference_metadata, rownames = FALSE, filter="top", 
              options = list(pageLength = 5, scrollX=T))

Collect Cofactor Data

Now we know the date of every measurement, where it took place, what that measurement’s reference station is, and what cofactor analytes we need. So let’s collect the data from EnviroData. We’ll use our cofactors_mapping_df from above to supply the analyte names.

ref_stations_data <-  EnviroDataR::get_wq_report_data(
  stations = unique(reference_metadata$reference.station), 
  media = "Water", 
  start_date = start_date, 
  end_date = end_date, 
  analytes = cofactors_mapping_df$envirodata_analyte_name, 
  guidelines = NULL, 
  includeQaQc = F, 
  make_nice_names = F
  )

## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data.
##     This is a warning, not an error.

## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data.
##     This is a warning, not an error.

## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data.
##     This is a warning, not an error.

## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data.
##     This is a warning, not an error.

DT::datatable(ref_stations_data, rownames = FALSE, filter="top", 
              options = list(pageLength = 5, scrollX=T))

Spread Cofactor Data

We’re going to turn this dataframe into a wide format. This means we’ll have a column for each of our analytes, and a row for each individual measurement. Recall that the name of these columns needs to follow certain formatting rules, so we’ll use our cofactors_mapping_df$cofactor column to determine the names. We do this by joining it with our ref_stations_data dataframe.

ref_stations_data <- ref_stations_data %>%
  left_join(cofactors_mapping_df, by=c("Analyte" = "envirodata_analyte_name"))

DT::datatable(ref_stations_data, rownames = FALSE, filter="top",
              options = list(pageLength = 5, scrollX=T))

Sometimes multiple measurements of a cofactor analyte are taken at a given location and time. In those instances, we want to use their average. The following code uses the group_by and summarise functions from the dplyr package to calculate averages within a group. Note the use of as.numeric(). This is needed because EnviroDataR::get_wq_report_data returns lab results as character strings (in column “Value”)

# spread with cofactors as columns so that the df is in a format recognized by guideline calculation function
ref_stations_data <- ref_stations_data %>%
  select(Location, DateTime, cofactor, Value) %>%
  group_by(Location, DateTime, cofactor) %>%
  summarise(Value = mean(as.numeric(Value))) %>%
  distinct()

## `summarise()` has grouped output by 'Location', 'DateTime'. You can override
## using the `.groups` argument.

Now we will “spread” the ref_stations_data dataframe so that each cofactor becomes a column. We rename the “Location” column to “reference.station” to avoid confusion when we join with our original df. We’ll also copy “DateTime” into another column named “referenceDate”. This is so that we still know when the reference data was collected after we merge it into our original df.

ref_stations_data_wide <- ref_stations_data %>%
  tidyr::pivot_wider(names_from = cofactor, values_from = Value) %>%
  rename(reference.station = Location) %>%
  mutate(referenceDate = DateTime)

We now have a dataframe showing the correct cofactor data for each measurement in our dataset.

DT::datatable(ref_stations_data_wide, rownames = FALSE, filter="top", 
              options = list(pageLength = 5, scrollX=T))

Merge Dataset and Cofactor Reference Data

Now we will join our ref_stations_data_wide to our original df dataset, by matching the “reference.station” field in both dataframes (exactly), then further matching to the nearest date in field “DateTime”. This is called a rolling join, and it isn’t fully supported in dplyr, so we will instead carry it out using the data.table package. It’s a good idea to do a brief read-through of the getting started vignette in data.table if you are struggling to follow what’s going on below.

df_with_ref_stations <- df %>% dplyr::left_join(reference_stations)

## Joining with `by = join_by(Location)`

dt = data.table::data.table(df_with_ref_stations)
dt2 = data.table::data.table(ref_stations_data_wide)
dt3 <- dt2[dt, on = .(reference.station = reference.station,
                      DateTime = DateTime), roll = "nearest"]

df_with_ref_data <- as.data.frame(dt3)

df_with_ref_data <- df_with_ref_data %>% 
  relocate(Location, DateTime, Analyte, reference.station, referenceDate)

DT::datatable(head(df_with_ref_data), rownames = FALSE, filter="top", 
              options = list(pageLength = 5, scrollX=T))

You may need to add in continuous data from reference stations (where available) if you are missing lots of cofactor data from discrete measurements. This will follow a similar workflow as above, except using the appropriate EnviroDataR functions for aquarius stations.

The join function in the code chunk above simply matched our two dataframes on location, and by nearest date. Sometimes though, the nearest date is still months apart. If you think this may be the case, it’s a good idea to calculate the difference between DateTime and reference.date columns. If the difference is greater than a certain number (e.g., 7 days), then you set that cofactor data to NA. We won’t do that here since this is just an example.

Calculate Guidelines

We need to prep our dataframe to be sent into the calculate guidelines function. As mentioned above, we have two measurement methods for pH. We will use our lab pH and field pH values to create a pH column. If you only have one type of pH measurement, just use that.

df_with_ref_data <- df_with_ref_data %>%
  mutate(pH = labpH) %>%
  mutate(pH = ifelse((DisplayName %in% c("Ammonia (N)", "Copper (Cu)-Dissolved") & !is.na(fieldpH)),
  fieldpH,
  pH))

Since we used EnviroData to collect our dataset, we have lots of columns that we don’t need (or want). You can remove them. At the very least, we need to remove columns containing “guideline” since that’s the info we’re going to be receiving soon.

df_with_ref_data <- df_with_ref_data  %>% 
  select(-contains("Guideline", ignore.case = T))

The dataframe is now ready to be sent off for guideline calculation. The function requires three arguments:

our dataframe
the name (or names) of guidelines we want to calculate. These names correspond to the names shown in EnviroData.
EnviroData API URL (a holdover from previous versions)

The demonstration code below fetch guidelines for the first 500 rows of our data frame, then prints the first few rows having non-empty guidelines.

df_with_guidelines <-EnviroDataR::CalculateGuidelinesWithRefSt(df_with_ref_data = head(df_with_ref_data, 500), 
                                                guidelines = c("BC_AWQG_LTAVE"), 
                                                wrd_url = EDR_URL)

# print only the first few rows with non-empty Guideline
DT::datatable(filter(df_with_guidelines, nchar(Guideline) > 0),
              rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))