calculate_guidelines_example.Rmd
This tutorial will provide an overview of how EnviroDataR can be used to calculate water quality guidelines. Note that in order to use this method, your data should be stored in EnviroData, with the appropriate guidelines set up and linked with cofactors and analytes. For information on how to do this, check out the EnviroData Documentation.
In this example, we will assume we have a list of stations and analytes for which we need to calculate guidelines. We will also assume we have a list of reference stations that need to be used for cofactor data acquisition. The exact steps for accomplishing this task will vary depending on the specifics of your use-case. But broadly, the workflow can be broken down as follows:
Collect the dataset for relevant stations and analytes from EnviroData.
Prepare a reference dataframe consisting of cofactor data
Join the measurement dataframe with the cofactor reference dataframe
Ensure all analytes and cofactors have correctly formatted named and units
Calculate the guidelines!
As usual, we’ll start by attaching the packages to the namespace. This step assumes you have the packages listed below installed. For instructions on how to install EnviroDataR and other Hatfield Packages, see the Install instructions on the homepage.
library(EnviroDataR)
library(dplyr)
library(stringr)
library(DT) # Attaching for tutorial. You generally don't need this.
We will set the log threshold to “ERROR” so that only errors are logged in our logfile. Otherwise we run the risk of having a logfile that’s tens of MB in size, which is not ideal.
The first step to collecting data is to define some constants. This includes our start date, end date, and a vector of station names.
TIP Use
get_primary_station_from_alias_name()
to ensure you have
the correct primary names of the stations as seen in EnviroData. If the
station name provided to EnviroDataR does not match the EnviroData
station name, you will not get any data!
start_date <- "Jan-01-2019"
end_date <- "Dec-31-2021"
stations <- c(
"WQ2",
"WQ3",
"WQ4",
"WQ5a",
"WQ7",
"WQ11",
"WQ13",
"WQ20",
"WQ21A",
"WQ30",
"WQ31",
"WQ32",
"WQ33",
"WQ34",
"WQ36",
"WQ37",
"WQ38",
"WQ39"
)
stations <- EnviroDataR::get_primary_station_from_alias_name(alias_names = stations)
Next we’ll get the details for each station. This allows us to know which analytes are available for query. This step is not necessarily essential but is highly recommended, as having a good understanding of your dataset is key.
We’ll use purrr::map_dfr
function from purrr to send
each station name to the get_wq_locations_details()
function and then bind those results into a single dataframe.
Alternatively, you could use a list, but vectorized functions like
purrr::map_dfr
are R’s specialty. Finally, we’ll filter the
station details to only include those involving “water” (i.e., not
climate or sediment) and then set the unique set of analyte names to a
vector.
stations_details <- do.call(rbind, lapply(stations, EnviroDataR::get_wq_locations_details))
## Warning in log_details(location_names, parsed_response, "Location"): The request for data from WQ37 returned zero rows of data.
## This is a warning, not an error.
media_type <- "Water"
stations_with_water_details <- stations_details %>%
filter(grepl(media_type, MediaType, ignore.case = T))
analytes <- unique(stations_with_water_details$Analyte)
Now we use get_wq_report_data to collect the entire dataset for the stations, dates, and analytes.
df <- EnviroDataR::get_wq_report_data(start_date = start_date,
end_date = end_date,
stations = stations,
analytes = analytes,
media = "Water",
guidelines = NULL,
includeQaQc = F,
make_nice_names = F)
Now we have our dataset. Below I’ve highlighted two important columns: Analyte and DisplayName. EnviroData has two names for every analyte. “Analyte” is the name of the analyte as it is saved in the database. This name contains underscores and isn’t pretty to humans. For this reason, “DisplayName” column shows the analyte the name in a human readable format, and is how analytes are displayed in EnviroData.
The calculate_guidelines function in EnviroDataR needs to be provided the analyte names as they are saved in the database. For this reason, it’s crucial to keep your analytes in EnviroData organized. If you have several “pH” analytes, it will be a pain trying to consolidate them. Another advantage of using EnviroData is that we don’t need to do anything further with our analyte names before sending them for calculation. This is in contrast to using a dataset outside of EnviroData, where the analyte names could be anything, which would require reformatting them to the correct names before using them in the wqProcessing package to calculate guidelines.
## Analyte
## 1 Alkalinity (Total as CaCO3)_Concentration (liquid)
## 2 Alkalinity (PP as CaCO3)_Concentration (liquid)
## 3 Bicarbonate (HCO3)_Concentration (liquid)
## 4 Carbonate (CO3)_Concentration (liquid)
## 5 Hydroxide (OH)_Concentration (liquid)
## 6 Chloride (Cl)-Dissolved_Concentration (liquid)
## 7 Organic Carbon (C)-Dissolved_Concentration (liquid)
## 8 Fluoride (F)_Concentration (liquid)
## 9 Hardness (CaCO3)-Total_Concentration (liquid)
## 10 Hardness (CaCO3)-Dissolved_Concentration (liquid)
## DisplayName
## 1 Alkalinity (Total as CaCO3)
## 2 Alkalinity (PP as CaCO3)
## 3 Bicarbonate (HCO3)
## 4 Carbonate (CO3)
## 5 Hydroxide (OH)
## 6 Chloride (Cl)-Dissolved
## 7 Organic Carbon (C)-Dissolved
## 8 Fluoride (F)-Dissolved
## 9 Hardness (CaCO3)-Total
## 10 Hardness (CaCO3)-Dissolved
Our next step is to prepare the cofactor dataframe which is needed for calculation of guidelines. This can feel like a daunting process, but it’s actually straightforward when we break it down into steps:
Below we’ll start with step 1.
We know we need certain cofactor data. So we’ll create a vector named cofactor. These cofactor names must remain as shown below when using EnviroDataR to calculate guidelines.
cofactors <- c(
"Hardness",
"labpH",
"Chloride",
"fieldpH",
"Temperature",
"DOC",
"Ca",
"Mg",
"K",
"Na",
"Alkalinity",
"SO4",
"Cu"
)
If you recall our list of analytes, you’ll remember that the “Hardness” analyte was not named “Hardness” but was named “Hardness (CaCO3)-Total_Concentration (liquid)” This means we’ll need to map the EnviroData analyte names to the correct format that EnviroDataR needs. We do this by making a vector of the same cofactors shown above, but using the Analyte names as they’re shown in EnviroData.
We’ll then combine these two vectors together into one dataframe.
envirodata_analyte_name <- c(
"Hardness (CaCO3)-Total_Concentration (liquid)",
"pH_PH",
"Chloride (Cl)-Dissolved_Concentration (liquid)",
"Field pH_PH",
"Field Temperature_Temperature",
"Dissolved Organic Carbon_Concentration (liquid)",
"Calcium (Ca)-Total_Concentration (liquid)",
"Magnesium (Mg)-Total_Concentration (liquid)",
"Potassium (K)-Total_Concentration (liquid)",
"Sodium (Na)-Total_Concentration (liquid)",
"Alkalinity (Total as CaCO3)_Concentration (liquid)",
"Sulphate (SO4)-Dissolved_Concentration (liquid)",
"Copper (Cu)-Dissolved_Concentration (liquid)"
)
cofactors_mapping_df <- data.frame(cofactor = cofactors, envirodata_analyte_name = envirodata_analyte_name )
DT::datatable(cofactors_mapping_df, rownames = FALSE, filter="top",
options = list(pageLength = 5, scrollX=T))
Note that we have a row for labpH and fieldpH. Eventually these will be consolidated into a single pH cofactor. If you only have data for one pH measurement, you should remove labpH and fieldpH and replace them with a single “pH” colName.
Before getting cofactor data, we need to know WHERE to get the cofactor data from. This is known as the reference station. If the reference stations remain unchanged throughout time, we can simply make a two column dataframe consisting of “Location” (where the original measurement is taken) and “reference.station” (where the cofactor needs to come from). Below you’ll see there’s a few locations that get their cofactor data from a different station.
reference_stations <- data.frame(
Location = c("WQ2", "WQ4", "WQ11", "WQ20", "WQ21A", "WQ36"),
reference.station = c("WQ2", "WQ4", "WQ4", "WQ20", "WQ36", "WQ36"))
Every datapoint in our dataset needs a calculated guideline. This means we need a set of cofactor values for every datapoint in our dataset. So we’ll create a reference_metadata dataframe by first getting the date and location of every row in our dataset. Then we’ll do a left_join with reference_stations to add the Location of the reference station to this dataframe.
Finally, we’ll use get_primary_station_from_alias_name()
to ensure our provided reference station names match the names in
EnviroData.
unique_measurements <- df %>% distinct(Location, DateTime)
reference_metadata <- unique_measurements %>%
left_join(reference_stations)
## Joining with `by = join_by(Location)`
reference_metadata$reference.station <- EnviroDataR::get_primary_station_from_alias_name(
alias_names = reference_metadata$reference.station)
DT::datatable(reference_metadata, rownames = FALSE, filter="top",
options = list(pageLength = 5, scrollX=T))
Now we know the date of every measurement, where it took place, what that measurement’s reference station is, and what cofactor analytes we need. So let’s collect the data from EnviroData. We’ll use our cofactors_mapping_df from above to supply the analyte names.
ref_stations_data <- EnviroDataR::get_wq_report_data(
stations = unique(reference_metadata$reference.station),
media = "Water",
start_date = start_date,
end_date = end_date,
analytes = cofactors_mapping_df$envirodata_analyte_name,
guidelines = NULL,
includeQaQc = F,
make_nice_names = F
)
## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data.
## This is a warning, not an error.
## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data.
## This is a warning, not an error.
## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data.
## This is a warning, not an error.
## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data.
## This is a warning, not an error.
We’re going to turn this dataframe into a wide format. This means we’ll have a column for each of our analytes, and a row for each individual measurement. Recall that the name of these columns needs to follow certain formatting rules, so we’ll use our cofactors_mapping_df$cofactor column to determine the names. We do this by joining it with our ref_stations_data dataframe.
ref_stations_data <- ref_stations_data %>%
left_join(cofactors_mapping_df, by=c("Analyte" = "envirodata_analyte_name"))
DT::datatable(ref_stations_data, rownames = FALSE, filter="top",
options = list(pageLength = 5, scrollX=T))
Sometimes multiple measurements of a cofactor analyte are taken at a
given location and time. In those instances, we want to use their
average. The following code uses the group_by
and
summarise
functions from the dplyr package to calculate
averages within a group. Note the use of as.numeric()
. This
is needed because EnviroDataR::get_wq_report_data
returns
lab results as character strings (in column “Value”)
# spread with cofactors as columns so that the df is in a format recognized by guideline calculation function
ref_stations_data <- ref_stations_data %>%
select(Location, DateTime, cofactor, Value) %>%
group_by(Location, DateTime, cofactor) %>%
summarise(Value = mean(as.numeric(Value))) %>%
distinct()
## `summarise()` has grouped output by 'Location', 'DateTime'. You can override
## using the `.groups` argument.
Now we will “spread” the ref_stations_data
dataframe so
that each cofactor becomes a column. We rename the “Location” column to
“reference.station” to avoid confusion when we join with our original
df
. We’ll also copy “DateTime” into another column named
“referenceDate”. This is so that we still know when the reference data
was collected after we merge it into our original df
.
ref_stations_data_wide <- ref_stations_data %>%
tidyr::pivot_wider(names_from = cofactor, values_from = Value) %>%
rename(reference.station = Location) %>%
mutate(referenceDate = DateTime)
We now have a dataframe showing the correct cofactor data for each measurement in our dataset.
Now we will join our ref_stations_data_wide
to our
original df
dataset, by matching the “reference.station”
field in both dataframes (exactly), then further matching to the
nearest date in field “DateTime”. This is called a rolling
join, and it isn’t fully supported in dplyr, so we will instead carry it
out using the data.table package. It’s a good idea to do a brief
read-through of the getting started vignette in data.table if you are
struggling to follow what’s going on below.
## Joining with `by = join_by(Location)`
dt = data.table::data.table(df_with_ref_stations)
dt2 = data.table::data.table(ref_stations_data_wide)
dt3 <- dt2[dt, on = .(reference.station = reference.station,
DateTime = DateTime), roll = "nearest"]
df_with_ref_data <- as.data.frame(dt3)
df_with_ref_data <- df_with_ref_data %>%
relocate(Location, DateTime, Analyte, reference.station, referenceDate)
DT::datatable(head(df_with_ref_data), rownames = FALSE, filter="top",
options = list(pageLength = 5, scrollX=T))
You may need to add in continuous data from reference stations (where available) if you are missing lots of cofactor data from discrete measurements. This will follow a similar workflow as above, except using the appropriate EnviroDataR functions for aquarius stations.
The join function in the code chunk above simply matched our two dataframes on location, and by nearest date. Sometimes though, the nearest date is still months apart. If you think this may be the case, it’s a good idea to calculate the difference between DateTime and reference.date columns. If the difference is greater than a certain number (e.g., 7 days), then you set that cofactor data to NA. We won’t do that here since this is just an example.
We need to prep our dataframe to be sent into the calculate guidelines function. As mentioned above, we have two measurement methods for pH. We will use our lab pH and field pH values to create a pH column. If you only have one type of pH measurement, just use that.
df_with_ref_data <- df_with_ref_data %>%
mutate(pH = labpH) %>%
mutate(pH = ifelse((DisplayName %in% c("Ammonia (N)", "Copper (Cu)-Dissolved") & !is.na(fieldpH)),
fieldpH,
pH))
Since we used EnviroData to collect our dataset, we have lots of columns that we don’t need (or want). You can remove them. At the very least, we need to remove columns containing “guideline” since that’s the info we’re going to be receiving soon.
The dataframe is now ready to be sent off for guideline calculation. The function requires three arguments:
The demonstration code below fetch guidelines for the first 500 rows of our data frame, then prints the first few rows having non-empty guidelines.
df_with_guidelines <-EnviroDataR::CalculateGuidelinesWithRefSt(df_with_ref_data = head(df_with_ref_data, 500),
guidelines = c("BC_AWQG_LTAVE"),
wrd_url = EDR_URL)
# print only the first few rows with non-empty Guideline
DT::datatable(filter(df_with_guidelines, nchar(Guideline) > 0),
rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))