This tutorial will provide an overview of how EnviroDataR can be used to calculate water quality guidelines. Note that in order to use this method, your data should be stored in EnviroData, with the appropriate guidelines set up and linked with cofactors and analytes. For information on how to do this, check out the EnviroData Documentation.
In this example, we will assume we have a list of stations and analytes for which we need to calculate guidelines. We will also assume we have a list of reference stations that need to be used for cofactor data acquisition. The exact steps for accomplishing this task will vary depending on the specifics of your use-case. But broadly, the workflow can be broken down as follows:
Collect the dataset for relevant stations and analytes from EnviroData.
Prepare a reference dataframe consisting of cofactor data
Join the measurement dataframe with the cofactor reference dataframe
Ensure all analytes and cofactors have correctly formatted named and units
Calculate the guidelines!
As usual, we’ll start by attaching the packages to the namespace. This step assumes you have the packages listed below installed. For instructions on how to install EnviroDataR and other Hatfield Packages, see the Install instructions on the homepage.
We will set the log threshold to “ERROR” so that only errors are logged in our logfile. Otherwise we run the risk of having a logfile that’s tens of MB in size, which is not ideal.
The first step to collecting data is to define some constants. This includes our start date, end date, and a vector of station names.
get_primary_station_from_alias_name() to ensure you have
the correct primary names of the stations as seen in EnviroData. If the
station name provided to EnviroDataR does not match the EnviroData
station name, you will not get any data!
start_date = "Jan-01-2019" end_date = "Dec-31-2021" stations <- c( "WQ2", "WQ3", "WQ4", "WQ5a", "WQ7", "WQ11", "WQ13", "WQ20", "WQ21A", "WQ30", "WQ31", "WQ32", "WQ33", "WQ34", "WQ36", "WQ37", "WQ38", "WQ39" ) stations <- EnviroDataR::get_primary_station_from_alias_name(alias_names = stations)
Next we’ll get the details for each station. This allows us to know which analytes are available for query. This step is not necessarily essential but is highly recommended, as having a good understanding of your dataset is key.
purrr::map_dfr function from purrr to send
each station name to the
function and then bind those results into a single dataframe.
Alternatively, you could use a list, but vectorized functions like
purrr::map_dfr are R’s specialty. Finally, we’ll filter the
station details to only include those involving “water” (i.e., not
climate or sediment) and then set the unique set of analyte names to a
## Warning in log_details(location_names, parsed_response, "Location"): The request for data from WQ37 returned zero rows of data. ## This is a warning, not an error.
Now we use get_wq_report_data to collect the entire dataset for the stations, dates, and analytes.
Now we have our dataset. Below I’ve highlighted two important columns: Analyte and DisplayName. EnviroData has two names for every analyte. “Analyte” is the name of the analyte as it is saved in the database. This name contains underscores and isn’t pretty to humans. For this reason, “DisplayName” column shows the analyte the name in a human readable format, and is how analytes are displayed in EnviroData.
The calculate_guidelines function in EnviroDataR needs to be provided the analyte names as they are saved in the database. For this reason, it’s crucial to keep your analytes in EnviroData organized. If you have several “pH” analytes, it will be a pain trying to consolidate them. Another advantage of using EnviroData is that we don’t need to do anything further with our analyte names before sending them for calculation. This is in contrast to using a dataset outside of EnviroData, where the analyte names could be anything, which would require reformatting them to the correct names before using them in the wqProcessing package to calculate guidelines.
## Analyte ## 1 Alkalinity (Total as CaCO3)_Concentration (liquid) ## 2 Alkalinity (PP as CaCO3)_Concentration (liquid) ## 3 Bicarbonate (HCO3)_Concentration (liquid) ## 4 Carbonate (CO3)_Concentration (liquid) ## 5 Hydroxide (OH)_Concentration (liquid) ## 6 Chloride (Cl)-Dissolved_Concentration (liquid) ## 7 Organic Carbon (C)-Dissolved_Concentration (liquid) ## 8 Fluoride (F)_Concentration (liquid) ## 9 Hardness (CaCO3)-Total_Concentration (liquid) ## 10 Hardness (CaCO3)-Dissolved_Concentration (liquid) ## DisplayName ## 1 Alkalinity (Total as CaCO3) ## 2 Alkalinity (PP as CaCO3) ## 3 Bicarbonate (HCO3) ## 4 Carbonate (CO3) ## 5 Hydroxide (OH) ## 6 Chloride (Cl)-Dissolved ## 7 Organic Carbon (C)-Dissolved ## 8 Fluoride (F)-Dissolved ## 9 Hardness (CaCO3)-Total ## 10 Hardness (CaCO3)-Dissolved
Our next step is to prepare the cofactor dataframe which is needed for calculation of guidelines. This can feel like a daunting process, but it’s actually straightforward when we break it down into steps:
Below we’ll start with step 1.
We know we need certain cofactor data. So we’ll create a vector named cofactor. These cofactor names must remain as shown below when using EnviroDataR to calculate guidelines.
cofactors <- c( "Hardness", "labpH", "Chloride", "fieldpH", "Temperature", "DOC", "Ca", "Mg", "K", "Na", "Alkalinity", "SO4", "Cu" )
If you recall our list of analytes, you’ll remember that the “Hardness” analyte was not named “Hardness” but was named “Hardness (CaCO3)-Total_Concentration (liquid)” This means we’ll need to map the EnviroData analyte names to the correct format that EnviroDataR needs. We do this by making a vector of the same cofactors shown above, but using the Analyte names as they’re shown in EnviroData.
We’ll then combine these two vectors together into one dataframe.
envirodata_analyte_name <- c( "Hardness (CaCO3)-Total_Concentration (liquid)", "pH_PH", "Chloride (Cl)-Dissolved_Concentration (liquid)", "Field pH_PH", "Field Temperature_Temperature", "Dissolved Organic Carbon_Concentration (liquid)", "Calcium (Ca)-Total_Concentration (liquid)", "Magnesium (Mg)-Total_Concentration (liquid)", "Potassium (K)-Total_Concentration (liquid)", "Sodium (Na)-Total_Concentration (liquid)", "Alkalinity (Total as CaCO3)_Concentration (liquid)", "Sulphate (SO4)-Dissolved_Concentration (liquid)", "Copper (Cu)-Dissolved_Concentration (liquid)" ) cofactors_mapping_df <- data.frame(cofactor = cofactors, envirodata_analyte_name = envirodata_analyte_name ) DT::datatable(cofactors_mapping_df, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))
Note that we have a column for labpH and fieldpH. Eventually these will be consolidated into a single pH cofactor. If you only have data for one pH measurement, you should remove labpH and fieldpH and replace them with a single “pH” colName.
Before getting cofactor data, we need to know WHERE to get the cofactor data from. This is known as the reference station. If the reference stations remain unchanged throughout time, we can simply make a two column dataframe consisting of “Location” (where the original measurement is taken) and “reference.station” (where the cofactor needs to come from). Below you’ll see there’s a few locations that get their cofactor data from a different station.
reference_stations <- data.frame( Location = c("WQ2", "WQ4", "WQ11", "WQ20", "WQ21A", "WQ36"), reference.station = c("WQ2", "WQ4", "WQ4", "WQ20", "WQ36", "WQ36"))
Every datapoint in our dataset needs a calculated guideline. This means we need a set of cofactor values for every datapoint in our dataset. So we’ll create a reference_metadata dataframe by first getting the date and location of every row in our dataset. Then we’ll do a left_join with reference_stations to add the Location of the reference station to this dataframe.
Finally, we’ll use
to ensure our provided reference station names match the names in
## Joining with `by = join_by(Location)`
Now we know the date of every measurement, where it took place, what that measurement’s reference station is, and what cofactor analytes we need. So let’s collect the data from EnviroData. We’ll use our cofactors_mapping_df from above to supply the analyte names.
ref_stations_data <- EnviroDataR::get_wq_report_data( stations = unique(reference_metadata$reference.station), media = "Water", start_date = start_date, end_date = end_date, analytes = cofactors_mapping_df$envirodata_analyte_name, guidelines = NULL, includeQaQc = F, make_nice_names = F )
## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data. ## This is a warning, not an error. ## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data. ## This is a warning, not an error. ## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data. ## This is a warning, not an error. ## Warning in log_details(analytes, parsed_response, "Analyte"): The request for data from Dissolved Organic Carbon_Concentration (liquid) returned zero rows of data. ## This is a warning, not an error.
We’re going to turn this dataframe into a wide format. This means we’ll have a column for each of our analytes, and a row for each individual measurement. Recall that the name of these columns needs to follow certain formatting rules, so we’ll use our cofactors_mapping_df$cofactor column to determine the names. We do this by joining it with our ref_stations_data dataframe.
Sometimes multiple measurements of a cofactor analyte are taken at a given location and time. In those instances, we want to use their average.
Then we will “spread” the reference_data dataframe so that each cofactor is a column.
Finally, we rename Location column to be reference_station so it’s not confusing when we join with our original df. We’ll also copy DateTime into another column named referenceDate. This is so that we still know when the reference data was collected after we merge it into our original df.
## `summarise()` has grouped output by 'Location', 'DateTime'. You can override ## using the `.groups` argument.
## `mutate_all()` ignored the following grouping variables: ## • Columns `reference.station`, `DateTime` ## ℹ Use `mutate_at(df, vars(-group_cols()), myoperation)` to silence the message.
ref_stations_data_wide$referenceDate <- ref_stations_data_wide$DateTime
We now have a dataframe showing the correct cofactor data for each measurement in our dataset.
Now we will join our ref_stations_data_wide to our dataset. This is done by matching reference.station in both dataframes and matching to the nearest DateTime. This is done using the data.table package. It’s a good idea to do a brief read-through of the getting started vignette to understand what’s going on below.
Once joined, we’ll rename reference.station column to “Site”, as this is what’s preferred by our calculate guidelines function.
## Joining with `by = join_by(Location)`
dt = data.table::data.table(df_with_ref_stations) dt2 = data.table::data.table(ref_stations_data_wide) dt3 <- dt2[dt, on = .(reference.station = reference.station, DateTime = DateTime), roll = "nearest"] df_with_ref_data <- as.data.frame(dt3) df_with_ref_data <- df_with_ref_data %>% relocate(Location, DateTime, Analyte, reference.station, referenceDate) DT::datatable(head(df_with_ref_data), rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))
We may need to add in continuous data from reference stations (if available) if we’re missing lots of cofactor data from discrete measurements). This will follow a similar workflow as above, except using the appropriate EnviroDataR functions for aquarius stations.
The join function in the code chunk above simply matched our two dataframes on location, and by nearest date. Sometimes though, the nearest date is still months apart. If you think this may be the case, it’s a good idea to calculate the difference between DateTime and reference.date columns. If the difference is greater than a certain number (e.g., 7 days), then you set that cofactor data to NA. We won’t do that here since this is just an example.
We need to prep our dataframe to be sent into the calculate guidelines function. As mentioned above, we have two measurement methods for pH. We will use our lab pH and field pH values to create a pH column. If you only have one type of pH measurement, just use that.
Since we used EnviroData to collect our dataset, we have lots of columns that we don’t need (or want). You can remove them. At the very least, we need to remove columns containing “guideline” since that’s the info we’re going to be receiving soon.
The dataframe is now ready to be sent off for guideline calculation. The function requires three arguments:
df_with_guidelines <-EnviroDataR::CalculateGuidelinesWithRefSt(df_with_ref_data = df_with_ref_data, guidelines = c("BC_AWQG_LTAVE"), wrd_url = EDR_URL) head(df_with_guidelines)
## # A tibble: 6 × 38 ## # Rowwise: ## Location DateTime Analyte DisplayName Value Unit DL Guideline ## <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 WQ2 2019-01-21 10:37:00 Alkalini… Alkalinity… 157 mg/L 1 "" ## 2 WQ2 2019-02-17 16:15:00 Alkalini… Alkalinity… 135 mg/L 1 "" ## 3 WQ2 2019-03-17 10:20:00 Alkalini… Alkalinity… 130 mg/L 1 "" ## 4 WQ2 2019-04-14 10:20:00 Alkalini… Alkalinity… 133 mg/L 1 "" ## 5 WQ2 2019-05-20 14:55:00 Alkalini… Alkalinity… 67.5 mg/L 1 "" ## 6 WQ2 2019-05-20 14:55:00 Alkalini… Alkalinity… 67.5 mg/L 1 "" ## # ℹ 30 more variables: GuidelineName <chr>, `Guideline Value` <chr>, ## # `Guideline Unit` <chr>, ComparisonOperation <chr>, MediaType <chr>, ## # SampleType <chr>, LabSampleId <chr>, ClientSampleId <chr>, ## # `COC Code` <chr>, Flags <chr>, Datum <chr>, DatumDepthHeightValue <chr>, ## # reference.station <chr>, referenceDate <dttm>, PH <lgl>, Temperature <dbl>, ## # ColorRef <lgl>, Hardness <dbl>, labpH <dbl>, Chloride <dbl>, fieldpH <dbl>, ## # Ca <dbl>, Mg <dbl>, K <dbl>, Na <dbl>, Alkalinity <dbl>, SO4 <dbl>, …