Recommended Data Management Practices#
The following is a list of recommendations for managing EnviroData datasets. The list is not exhaustive, and the recommendations are not unique to EnviroData: they are useful for any form of digital data management.
Filenames (and also column header names and variable names)#
Think about your data or files. What is the common theme (e.g., “water quality”, “fish survey”)? What pieces of information distinguish one file or dataset from another (e.g., date, field study number, file version, lab conducting the analysis)?
Take the common theme and the pieces of information that distinguish files and create a short, descriptive filename, e.g., “20220101_crab_survey_ael.xlsx”.
Use the ISO 8601 basic format when recording dates: YYYYMMDD
Never use spaces or special symbols
Use underscores or dashes to separate words
Pad numbers below 10 with a leading zero when numbering files (e.g., 01, 02) so that the files sort in numeric order.
Keep filenames short. Consider only using one or two pieces of information in the filename to distinguish files from one another.
Place the information that distinguishes one file from another at the start of the filename. This will make file ordering more useful. For example:
----20220101_crab_survey_ael.xlsx
----20220102_crab_survey_ael.xlsx
Split files that have different contents or formats into different folders. A good rule of thumb is if a computer cannot use the same exact instructions to read each file in a folder, the files should probably be in different folders.
Distinguishing variables can be used to create multiple folders where different data files will go. See the following example directory where survey data for crabs is split into two folders according to the lab conducting the analysis:
----data
--------surveys
------------ael_labs
----------------20220101_crab_survey_ael.xlsx
----------------20220102_crab_survey_ael.xlsx
------------bv_labs
----------------20220101_crab_survey_bv.xlsx
----------------20220102_crab_survey_bv.xlsx
Avoid mixing different file types in the same folder. For example, do not keep pdf and xlsx files in the same folder.
Be consistent with your naming format. Do not switch from snake_case to camelCase. Do not randomly capitalize some words while keeping others lowercase. Keep the order of information consistent (e.g., do not place the date at the start of some filenames and at the end of others).
Include your naming conventions in a readme file.
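The naming rules above can be sketched in a few lines of Python. This is a minimal illustration, not an EnviroData tool: the function names, the `lab` argument, and the regular expression are assumptions chosen to match the example filename “20220101_crab_survey_ael.xlsx”. The pattern only checks the shape of the name (eight digits, lowercase words joined by underscores, an extension); it does not check that the digits form a real date.

```python
import re
from datetime import date

# Hypothetical pattern for the example convention used in this guide:
# YYYYMMDD_<lowercase words separated by underscores>.<extension>
FILENAME_PATTERN = re.compile(r"^\d{8}_[a-z0-9]+(?:_[a-z0-9]+)*\.[a-z0-9]+$")


def build_filename(day: date, theme: str, lab: str, extension: str = "xlsx") -> str:
    """Build a name such as 20220101_crab_survey_ael.xlsx."""
    # Lowercase the theme and replace runs of spaces or symbols with underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", theme.lower()).strip("_")
    return f"{day:%Y%m%d}_{slug}_{lab.lower()}.{extension}"


def is_valid_filename(name: str) -> bool:
    """Return True if the name has the shape described by FILENAME_PATTERN."""
    return FILENAME_PATTERN.fullmatch(name) is not None


def padded_number(n: int, width: int = 2) -> str:
    """Zero-pad a file number so that 2 sorts before 10."""
    return f"{n:0{width}d}"
```

A check like `is_valid_filename` could be run over a folder before sharing it, so that names that drift from the convention are caught early.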
Excel Spreadsheets#
Follow all of the rules in the Filenames section when naming column headers and files.
Store your data in long (tidy) format rather than wide format. Search for “tidy format” if you do not know what this means.
If you have extra metadata or values that are not part of a table, always place them before the first row of the table. The final row of the data table should be the last row of the spreadsheet. It’s much easier to write code that finds the start of a table with column names than it is to find the end of a table with data that can potentially have different values.
When storing the same variable across separate Excel files, use the same name for it in every file. For example, do not name your column with temperature data “Temp” in one file and “Temperature” in another. Give them both the same name.
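To illustrate why metadata belongs above the table and why long format is convenient, here is a minimal sketch in plain Python. It assumes the sheet has already been loaded into a list of rows (for example with a spreadsheet library); the function names, the column names, and the sample rows are all hypothetical.

```python
def find_header_row(rows, expected_columns):
    """Return the index of the first row containing every expected column name."""
    expected = set(expected_columns)
    for index, row in enumerate(rows):
        if expected.issubset(set(row)):
            return index
    raise ValueError("no row contains all expected column names")


def read_table(rows, expected_columns):
    """Skip the metadata above the header and return the table as dicts."""
    start = find_header_row(rows, expected_columns)
    header = rows[start]
    # Everything after the header is data, because the table ends the sheet.
    return [dict(zip(header, row)) for row in rows[start + 1:]]


def wide_to_long(records, id_columns, variable_name, value_name):
    """Turn one column per variable into one row per observation."""
    long_records = []
    for record in records:
        for column, value in record.items():
            if column in id_columns:
                continue
            long_row = {key: record[key] for key in id_columns}
            long_row[variable_name] = column
            long_row[value_name] = value
            long_records.append(long_row)
    return long_records


# Hypothetical sheet contents: two metadata rows, then the table.
rows = [
    ["site", "Example Bay"],
    ["collected_by", "ael"],
    ["date", "temperature", "salinity"],
    ["20220101", 10.5, 31.2],
    ["20220102", 11.0, 30.8],
]

table = read_table(rows, ["date", "temperature", "salinity"])
long_table = wide_to_long(table, ["date"], "variable", "value")
```

Because the column names are known in advance, locating the start of the table is a simple lookup. Locating the end of a table that is followed by free-form notes would require guessing at what counts as data.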
Data#
If possible, be consistent with units of measurement. For example, when recording temperatures, do not switch from Celsius to Fahrenheit unless there’s a good reason (your thermometer broke).
Avoid mixing types when recording data. By type, we mean string vs numeric. For example, do not store a temperature datapoint as “around 10 degrees F”. Rather, split it into 3 different columns like so:
value | units | comment |
---|---|---|
10 | degrees Fahrenheit | not exact. temperature was fluctuating a lot |
The above rule also applies when indicating that a value is below the detection limit. Do not store the value as “<10”. Instead, store it in two separate columns:
qualifier | value |
---|---|
< | 10 |
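As a rough sketch of the rules in this section, the Python below splits a raw entry such as “<10” into a qualifier and a numeric value, and converts Fahrenheit readings to Celsius so that a column can be kept in one unit. The function names and the set of accepted qualifiers (`<`, `>`, `<=`, `>=`) are assumptions made for illustration. Free-text entries are rejected rather than guessed at.

```python
import re

# Optional qualifier (<, >, <=, >=) followed by a plain decimal number.
MEASUREMENT_PATTERN = re.compile(r"^\s*(<=|>=|<|>)?\s*(-?\d+(?:\.\d+)?)\s*$")


def split_qualifier(raw: str):
    """Split a string such as "<10" into a (qualifier, value) pair."""
    match = MEASUREMENT_PATTERN.match(raw)
    if match is None:
        # Entries like "around 10 degrees F" need manual review.
        raise ValueError(f"cannot parse measurement: {raw!r}")
    qualifier = match.group(1) or ""
    return qualifier, float(match.group(2))


def fahrenheit_to_celsius(value: float) -> float:
    """Convert a Fahrenheit reading to Celsius."""
    return (value - 32) * 5 / 9
```

Keeping the qualifier in its own column leaves the value column purely numeric, so it can be summed, averaged, or plotted without extra parsing.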