File management • harsat

Overview

The R package harsat is designed to make it easy for you to work with your own files. harsat uses the following different kinds of file:

Data files. These are stored in a data directory anywhere on your file system. Your actual data (contaminant measurments and station dictionary) will be in these files. They will be TAB files if they were downloaded from the ICES webservice (i.e. in ICES format) or CSV files if you have put them together yourself (in the simpler external format).
Analysis files. These drive your actual analysis. Three files are particulary important: the determinand file, the species file (if doing a biota assessment) and the threshold file. You will normally keep these in an analysis directory somewhere in your file system. These will be CSV files, but they will be smaller and won’t change anywhere near as often as the data files.
Configuration files. There are several additional configuration files which provide information about, for example, chemical methods or pivot values for sediment normalisation. The harsat package provides default versions of these files which will be fine for most assessments. However, you may override the defaults by putting
modified copies of the files in the analysis directory described above. The configuration files are also CSV files.

The Datasets page provides zip files of:

data and assessment files for each vignette
analysis files for recent OSPAR, HELCOM and AMAP assessments
the default additional configuration files

If you need to assemble a reliable set of files for an assessment, you should have both a data directory and an analysis directory. To support full reproducibility, it would be good practice to also put a copy of the all the configuration files (whether modified or not) in the analysis directory. This is because updates to the R harsat package may change the contents of the default configuration files. Your own data files and analysis files, and any copied or modified configuration files you have put into your analysis directory, will not be affected.

File encodings

harsat currently expects all files to be encoded using UTF-8. Standard ASCII files are also fine, as UTF-8 is a superset of ASCII. If you are using accented characters in another encoding, such as the common Latin-1 (also known as iso-8859-1) then we would ask you to convert them to UTF-8 first. If you are using Microsoft Excel to prepare them, this simply means saving them as CSV files with UTF-8 encoding.

Note: the reason we do not allow people to choose a file encoding option for input files, is that harsat reads quite a few files, and it would be hard to specify encodings for all of them if they different file encodings. UTF-8 is very standard now, it is automatically used for ICES data anyway, so it harsat will instead check for UTF-8, and if your files aren’t encoded as expected, it will warn you.

If you are using other tools, please check the documentation for those tools to make sure that they aren’t converting files to Latin-1 behind your back.

To check the encoding of a file, use the file command, which is slightly different depending on your operating system.

On a Mac:

$ file -I file.csv
file.csv: text/csv; charset=utf-8

On Windows or Linux (on Windows, you may need File for Windows):

$ file file.csv
file.csv: text/csv; charset=utf-8

If the charset is not reported as utf-8 (or us-ascii), you will need to convert it. The easiest way to do this is using iconv, which is built in on Macs, and available as a download on Windows.

iconv -f <old charset> -t utf-8 file.csv > file-utf8.csv

You can then use the converted version file-utf8.csv in your workflows.

Typical workflow

Let’s imagine you want to run an analysis with harsat, and have already installed the R package (as described in the Getting started page).

Now you need some data. Typically you would get this from the ICES webservice, or put together your own data files using the simpler external format.
But for now let’s imagine you want to try out the OSPAR vignette. So you can navigate to our Datasets page, and look for an approprate zip file to download for the OSPAR vignette.

If you download and unzip this file (you can unzip it anywhere you like, but let’s pretend we’re using Windows and we unzip it at: C:\Users\stuart\OSPAR_vignette) you’ll see that your disk contains files as follows:

+ C:\Users\stuart\OSPAR_vignette
  |
  + data
  | |
  | + test_data.csv
  | + station_dictionary.csv
  | + quality_assurance.csv
  |
  + analysis 
    |
    + determinand.csv
    + species.csv
    + thresholds_biota.csv

This means that your directories are as follows:

Data directory: C:\Users\stuart\OSPAR_vignette\data
Analysis directory: C:\Users\stuart\OSPAR_vignette\analysis

Obviously, you can put these directories anywhere you like on your file system. You can even put them on a removable disk if you like, or a network shared drive. You can also call your directories something else. For example, you might call them data_vignette and analysis_vignette to distinguish them from other assessments.

So, now let’s see how you might use these to run an analysis.

Reading your data files

Virtually all the work you need to do involves the call to harsat’s read_data() function. Let’s suppose your R working directory (or R Project) is C:\Users\stuart\OSPAR_vignette\. Your call will then typically look like this:

biota_data <- read_data(
  compartment = "biota", 
  purpose = "OSPAR",                               
  contaminants = "test_data.csv", 
  stations = "station_dictionary.csv", 
  data_format = "ICES",
)

By default, the function looks for your data and analysis files in the directories called data and assessment that are nested inside your working directory. If you have called them something else, then you can use the data_dir and analysis_dir arguments. For example:

biota_data <- read_data(
  compartment = "biota", 
  purpose = "OSPAR",                               
  contaminants = "test_data.csv", 
  stations = "station_dictionary.csv",
  data_dir = "data_vignette",
  data_format = "ICES",
  analysis_dir = "analysis_vignette"
)

You can also specify absolute path names

biota_data <- read_data(
  compartment = "biota", 
  purpose = "OSPAR",                               
  contaminants = "test_data.csv", 
  stations = "station_dictionary.csv",
  data_dir = file.path("C:", "Users", "stuart", "OSPAR_vignette", "data"),
  data_format = "ICES",
  analysis_dir = file.path("C:", "Users", "stuart", "somewhere_else", "assessment"),
)

There are a few important things to see here:

Note the use of file.path() here to make portable pathnames. Of course, each user can use whatever filename pattern works best for them.
The info_path parameter can be a vector as well as a single string. harsat will actually search through every directory in this vector, looking for files like determinand.csv. If the file is found in your local analysis directory, it gets read and used. If not, we may try any other directories in this vector. If we get to the end and we’ve still not found a particular file (especially for common standard ones like matrix.csv which translates common codes) then harsat’s built-in directory of configuration files gets used as a last resort. If a file is really essential and we still can’t find it, then harsat will immediately throw an error so you can intervene.

When harsat does all this, it will log the file it actually used, and also log a “thumbprint” of the file contents – typically something like a string of hexadecimal digits. This will be the same wherever the file comes from, so long as the contents of the file are the same. If you move a file into a different directory but don’t edit it, the same thumbprint will show. So, the thumbprints are a vital tool in tracking reproducibility, as they change when the contents of the data changes.