Overview
The R package harsat is designed to make it easy for you to work with your own files. harsat uses the following kinds of file:
Data files. These are stored in a data directory anywhere on your file system. Your actual data (contaminant measurements and station dictionary) will be in these files. They will be TAB files if they were downloaded from the ICES webservice (i.e. in ICES format) or CSV files if you have put them together yourself (in the simpler external format).
Analysis files. These drive your actual analysis. Three files are particularly important: the determinand file, the species file (if doing a biota assessment) and the threshold file. You will normally keep these in an analysis directory somewhere in your file system. These will be CSV files, but they will be smaller and won’t change anywhere near as often as the data files.
Configuration files. There are several additional configuration files which provide information about, for example, chemical methods or pivot values for sediment normalisation. The harsat package provides default versions of these files, which will be fine for most assessments. However, you may override the defaults by putting modified copies of the files in the analysis directory described above. The configuration files are also CSV files.
The Datasets page provides zip files of:
- data and assessment files for each vignette
- analysis files for recent OSPAR, HELCOM and AMAP assessments
- the default additional configuration files
If you need to assemble a reliable set of files for an assessment,
you should have both a data directory and an analysis directory. To
support full reproducibility, it is good practice to also put a
copy of all the configuration files (whether modified or not) in the
analysis directory. This is because updates to the R harsat
package may change the contents of the default configuration
files. Your own data files and analysis files, and any copied or
modified configuration files you have put into your analysis directory,
will not be affected.
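As a minimal sketch of that practice, the base R snippet below copies every CSV file from a folder of default configuration files into an analysis directory, without overwriting copies you have already modified. The helper name copy_config_files() is hypothetical and not part of harsat, and the sketch assumes you have already unzipped the default configuration files from the Datasets page into a local folder.

```r
# Hypothetical helper, not part of harsat: copy default configuration
# files into an analysis directory so the assessment is self-contained.
copy_config_files <- function(config_dir, analysis_dir) {
  # Make sure the target directory exists
  dir.create(analysis_dir, showWarnings = FALSE, recursive = TRUE)

  # All CSV files in the folder holding the default configuration files
  config_files <- list.files(config_dir, pattern = "\\.csv$", full.names = TRUE)

  # overwrite = FALSE keeps any modified copy already in the analysis directory
  copied <- file.copy(config_files, analysis_dir, overwrite = FALSE)

  # Return the names of the files that were actually copied
  basename(config_files[copied])
}
```

Calling copy_config_files() with the unzipped folder and your analysis directory returns the names of the files it copied; files that were already present are left untouched.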
File encodings
harsat currently expects all files to be encoded using
UTF-8. Standard ASCII files are also fine, as UTF-8 is a superset of
ASCII. If you are using accented characters in another encoding, such as
the common Latin-1 (also known as iso-8859-1) then we would ask you to
convert them to UTF-8 first. If you are using Microsoft Excel to prepare
them, this simply means saving them as CSV files with UTF-8
encoding.
Note: we do not offer a file encoding option for input files because harsat reads quite a few files, and it would be hard to specify encodings for all of them if they used different encodings. UTF-8 is now very standard and is used automatically for ICES data anyway, so harsat instead checks for UTF-8 and warns you if your files aren’t encoded as expected.
If you are using other tools, please check the documentation for those tools to make sure that they aren’t converting files to Latin-1 behind your back.
To check the encoding of a file, use the file command, which is slightly different depending on your operating system.
On a Mac:
$ file -I file.csv
file.csv: text/csv; charset=utf-8
On Windows or Linux (on Windows, you may need File for Windows):
$ file -i file.csv
file.csv: text/csv; charset=utf-8
If the charset is not reported as utf-8 (or us-ascii), you will need to convert the file. The easiest way to do this is using iconv, which is built in on Macs, and available as a download on Windows.
iconv -f <old charset> -t utf-8 file.csv > file-utf8.csv
You can then use the converted version file-utf8.csv in your workflows.
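If the file and iconv commands are not to hand, a similar check and conversion can be sketched in base R. The helper names is_utf8_file() and convert_to_utf8() below are hypothetical and not part of harsat; the sketch relies only on the base functions validUTF8() and iconv(), and assumes the source encoding is Latin-1 unless you say otherwise.

```r
# Hypothetical helpers, not part of harsat.

# TRUE if the file's bytes form valid UTF-8 (plain ASCII also passes)
is_utf8_file <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # NUL bytes (typical of UTF-16) cannot be held in an R string
  if (any(bytes == as.raw(0))) return(FALSE)
  validUTF8(rawToChar(bytes))
}

# Re-encode a file to UTF-8, writing the result to out_path
convert_to_utf8 <- function(path, out_path, from = "latin1") {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  converted <- iconv(rawToChar(bytes), from = from, to = "UTF-8")
  if (is.na(converted)) stop("could not convert ", path, " from ", from)
  # Write the converted bytes exactly as they are
  writeBin(charToRaw(converted), out_path)
  invisible(out_path)
}
```

Note that a byte-level check can only tell you that a file is not valid UTF-8; it cannot tell you which encoding it actually uses, so you still need to know the source encoding before converting.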
Typical workflow
Let’s imagine you want to run an analysis with harsat, and have already installed the R package (as described in the Getting started page).
Now you need some data. Typically you would get this from the ICES
webservice, or put together your own data files using the simpler
external format.
But for now let’s imagine you want to try out the OSPAR vignette. So you
can navigate to our Datasets page, and look
for an appropriate zip file to download for the OSPAR vignette.
If you download and unzip this file (you can unzip it anywhere you like, but let’s pretend we’re using Windows and we unzip it at C:\Users\stuart\OSPAR_vignette), you’ll see that your disk contains files as follows:
+ C:\Users\stuart\OSPAR_vignette
  |
  + data
  | |
  | + test_data.csv
  | + station_dictionary.csv
  | + quality_assurance.csv
  |
  + analysis
    |
    + determinand.csv
    + species.csv
    + thresholds_biota.csv
This means that your directories are as follows:
- Data directory: C:\Users\stuart\OSPAR_vignette\data
- Analysis directory: C:\Users\stuart\OSPAR_vignette\analysis
Obviously, you can put these directories anywhere you like on your
file system. You can even put them on a removable disk if you like, or a
network shared drive. You can also call your directories something else.
For example, you might call them data_vignette and analysis_vignette to distinguish them from other assessments.
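Before going further, it can be worth confirming that the files are where you think they are. The sketch below is one way to do that in base R; the helper name missing_files() is hypothetical, and the paths and file names are simply those from the listing above.

```r
# Hypothetical helper, not part of harsat: report which of the expected
# files are absent from a directory.
missing_files <- function(dir, expected) {
  expected[!file.exists(file.path(dir, expected))]
}

# Location used in this walkthrough; change it to wherever you unzipped
root <- file.path("C:", "Users", "stuart", "OSPAR_vignette")

missing_files(
  file.path(root, "data"),
  c("test_data.csv", "station_dictionary.csv", "quality_assurance.csv")
)
missing_files(
  file.path(root, "analysis"),
  c("determinand.csv", "species.csv", "thresholds_biota.csv")
)
```

Each call returns an empty character vector when everything is in place, and otherwise the names of the files that could not be found.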
So, now let’s see how you might use these to run an analysis.
Reading your data files
Virtually all the work you need to do involves the call to harsat’s read_data() function. Let’s suppose your R working directory (or R Project) is C:\Users\stuart\OSPAR_vignette\. Your call will then typically look like this:
library(harsat)

biota_data <- read_data(
  compartment = "biota",
  purpose = "OSPAR",
  contaminants = "test_data.csv",
  stations = "station_dictionary.csv",
  data_format = "ICES"
)
By default, the function looks for your data and analysis files in the directories called data and assessment that are nested inside your working directory. If you have called them something else, then you can use the data_dir and analysis_dir arguments. For example:
biota_data <- read_data(
  compartment = "biota",
  purpose = "OSPAR",
  contaminants = "test_data.csv",
  stations = "station_dictionary.csv",
  data_dir = "data_vignette",
  data_format = "ICES",
  analysis_dir = "analysis_vignette"
)
You can also specify absolute path names:
biota_data <- read_data(
  compartment = "biota",
  purpose = "OSPAR",
  contaminants = "test_data.csv",
  stations = "station_dictionary.csv",
  data_dir = file.path("C:", "Users", "stuart", "OSPAR_vignette", "data"),
  data_format = "ICES",
  analysis_dir = file.path("C:", "Users", "stuart", "somewhere_else", "assessment")
)
There are a few important things to see here:
- Note the use of file.path() here to make portable pathnames. Of course, each user can use whatever filename pattern works best for them.
- The info_path parameter can be a vector as well as a single string. harsat will actually search through every directory in this vector, looking for files like determinand.csv. If the file is found in your local analysis directory, it gets read and used. If not, we may try any other directories in this vector. If we get to the end and we’ve still not found a particular file (especially for common standard ones like matrix.csv, which translates common codes), then harsat’s built-in directory of configuration files gets used as a last resort. If a file is really essential and we still can’t find it, then harsat will immediately throw an error so you can intervene.
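The search order described above can be illustrated with a short base R sketch. This is not harsat’s actual implementation, which may differ in detail; the function name find_info_file() and its builtin_dir argument are hypothetical stand-ins for the package’s internal lookup and its built-in directory of configuration files.

```r
# Illustrative only, not harsat's own code: return the first copy of a
# file found in the search directories, falling back to a built-in
# directory, and fail loudly if it is nowhere to be found.
find_info_file <- function(file_name, search_dirs, builtin_dir) {
  # User-supplied directories are tried first, in the order given;
  # the built-in directory is the last resort
  for (dir in c(search_dirs, builtin_dir)) {
    candidate <- file.path(dir, file_name)
    if (file.exists(candidate)) {
      return(candidate)
    }
  }
  stop("could not find ", file_name, " in any search directory")
}
```

The point of the ordering is that a modified copy in your own analysis directory always takes precedence over the packaged default of the same name.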
When harsat does all this, it will log the file it
actually used, and also log a “thumbprint” of the file contents
– typically something like a string of hexadecimal digits. This will be
the same wherever the file comes from, so long as the contents of the
file are the same. If you move a file into a different directory but
don’t edit it, the same thumbprint will show. So, the thumbprints are a
vital tool in tracking reproducibility, as they change when the contents
of the data change.
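To see how a content-based thumbprint behaves, here is a small sketch using tools::md5sum(), which ships with R. The hash function harsat itself uses is not stated here, so MD5 is an assumption made purely for illustration, and file_thumbprint() is a hypothetical helper name.

```r
# Hypothetical helper: a content-based thumbprint of a file.
# MD5 is an assumption here; harsat may use a different hash.
file_thumbprint <- function(path) {
  unname(tools::md5sum(path))
}

# Two files with identical contents in different locations
original <- tempfile(fileext = ".csv")
moved <- tempfile(fileext = ".csv")
writeLines(c("determinand,unit", "CD,ug/kg"), original)
file.copy(original, moved)

# Same contents, so the thumbprints match even though the paths differ
same_before <- file_thumbprint(original) == file_thumbprint(moved)

# Editing one copy changes its thumbprint
cat("PB,ug/kg\n", file = moved, append = TRUE)
same_after <- file_thumbprint(original) == file_thumbprint(moved)
```

Here same_before is TRUE and same_after is FALSE, which is the property that makes thumbprints useful for checking that two runs really used the same inputs.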