Reference table file formats
These are the column headers for CSV-formatted reference table files. Ideally, the files should be UTF-8 encoded.
`harsat` uses three different reference tables:
- A species file
- A determinand file
- A threshold file
All these files should be in your information files directory.
Do not use any forward or backward slashes in the reference files.
Missing values should be supplied as blank cells, not as `NA`.
The order of the columns does not matter, as long as they are named consistently with the specification below.
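As a rough pre-flight check, the sketch below decodes a file's bytes as UTF-8 and lists any cell that holds the literal text `NA` instead of being blank. It is not part of `harsat`; the function name `find_literal_na` and the sample rows are made up for illustration.

```python
import csv
import io

def find_literal_na(csv_bytes):
    """Return (row_number, column_name) for every cell that holds the text NA."""
    # utf-8-sig decodes UTF-8 and drops a byte-order mark if one is present;
    # a UnicodeDecodeError here means the file is not UTF-8 encoded.
    text = csv_bytes.decode("utf-8-sig")
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    hits = []
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        for column, value in zip(header, row):
            if value.strip() == "NA":
                hits.append((row_number, column))
    return hits

# Hypothetical file content: the second data row uses NA where a blank is expected.
sample = (
    "reference_species,common_name,MU_drywt\n"
    "Mytilus edulis,blue mussel,\n"
    "Gadus morhua,cod,NA\n"
).encode("utf-8")

print(find_literal_na(sample))
```

Columns are looked up by header name rather than by position, which matches the rule that column order does not matter.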
Species file format
The species file will be `species.csv` in your information files directory.
column name | type | mandatory | NA allowed | comments |
---|---|---|---|---|
`reference_species` | character | yes | no | Reference species name (usually its Latin name) |
`submitted_species` | character | yes | no | Submitted species name: sometimes the same species can have several Latin names, but these are all mapped to the `reference_species` name |
`common_name` | character | yes | no | Common name for the species: you can call them whatever you want |
`species_group` | character | yes | no | Species group, should be one of: `Bird`, `Bivalve`, `Crustacean`, `Echinoderm`, `Fish`, `Gastropod`, `Macrophyte`, `Mammal`, `Other` |
`species_subgroup` | character | yes | no | This is a convenience column used for matching species to thresholds. For example, OSPAR sometimes has different thresholds for mussels and oysters, so these form convenient subgroups. You can call things what you want, provided they match any reference to the `species_subgroup` column in the threshold file. If you don’t need this facility, use the `species_group` value |
`assess` | boolean | yes | no | Whether or not to include this species in the assessment, should be either `TRUE` or `FALSE` |
`MU_drywt` | numeric | no | yes | Muscle dry weight (%). A typical value for the species. This is only needed if you want to convert thresholds from one basis to another. Leave cells blank if you do not have a suitable value. |
`LI_drywt` | numeric | no | yes | Liver dry weight (%). See above. |
`SB_drywt` | numeric | no | yes | Soft body dry weight (%) |
`EH_drywt` | numeric | no | yes | Egg homogenate dry weight (%) |
`MU_lipidwt` | numeric | no | yes | Muscle lipid weight (%) |
`LI_lipidwt` | numeric | no | yes | Liver lipid weight (%) |
`SB_lipidwt` | numeric | no | yes | Soft body lipid weight (%) |
`EH_lipidwt` | numeric | no | yes | Egg homogenate lipid weight (%) |
You can have as many dry and lipid weight columns as you have tissue types. For example, if you have blood (BL) data, then you can have columns `BL_drywt` and `BL_lipidwt`. (Remember that there is a difference between BL (blood) and ER (erythrocytes, the red blood cells in vertebrates).) You can also omit all these columns if you don’t need to convert thresholds from one basis to another.
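The species file rules can be checked before running an assessment. The sketch below is one possible validator, not `harsat` code: `check_species` is a hypothetical helper and the sample rows (including the dry weight value) are invented. It checks only the mandatory columns, the allowed `species_group` values and the `TRUE`/`FALSE` rule for `assess`.

```python
import csv
import io

MANDATORY = [
    "reference_species", "submitted_species", "common_name",
    "species_group", "species_subgroup", "assess",
]
SPECIES_GROUPS = {
    "Bird", "Bivalve", "Crustacean", "Echinoderm", "Fish",
    "Gastropod", "Macrophyte", "Mammal", "Other",
}

def check_species(csv_text):
    """Return a list of human-readable problems found in species CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in MANDATORY if name not in header]
    if problems:
        return problems  # row checks need all mandatory columns
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        for name in MANDATORY:
            if not (row[name] or "").strip():
                problems.append(f"row {row_number}: {name} is blank")
        group = row["species_group"]
        if group and group not in SPECIES_GROUPS:
            problems.append(f"row {row_number}: unknown species_group {group!r}")
        if row["assess"] and row["assess"] not in ("TRUE", "FALSE"):
            problems.append(f"row {row_number}: assess must be TRUE or FALSE")
    return problems

# Hypothetical rows: the second data row has two deliberate mistakes.
sample = (
    "reference_species,submitted_species,common_name,species_group,species_subgroup,assess,MU_drywt\n"
    "Mytilus edulis,Mytilus edulis,blue mussel,Bivalve,Mussel,TRUE,\n"
    "Gadus morhua,Gadus morhua,cod,Fishes,Fish,yes,19\n"
)

for problem in check_species(sample):
    print(problem)
```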
Determinand file format
The determinand file will be `determinand.csv` in your information files directory.
column name | type | mandatory | NA allowed | comments |
---|---|---|---|---|
`determinand` | character | yes | no | Determinand code: this is a key to the thresholds file records and will usually be the ICES PARAM code |
`common_name` | character | yes | yes | Common name for the determinand, e.g., `Aluminium`, `Glyphosate`. You can call them whatever you want. If missing, the determinand code is used. |
`pargroup` | character | yes | no | ICES parameter group reference code, e.g. `I-MET`, `O-PAH`, `OC-CB`, `OC-DD`, etc. (see: https://vocab.ices.dk/?ref=78) |
`biota_group` | character | yes* | yes | Grouping for a biota assessment (under development): currently one of `Metals`, `PAH_parent`, `PAH_alkylated`, `Chlorobiphenyls`, `PBDEs`, `Organobromines`, `Organotins`, `Organochlorines`, `Organofluorines`, `Pesticides`, `Dioxins`, `Effects`, `Auxiliary`, `Metabolites`, or `Imposex`. Missing values are only allowed for determinands that will not be used (in any way). |
`sediment_group` | character | yes* | yes | Grouping for a sediment assessment. See `biota_group` for allowable values and additional comments. Note that normalisers, such as AL or CORG, should be classed as `Auxiliary` |
`water_group` | character | yes* | yes | Grouping for a water assessment. See `biota_group` for allowable values and additional comments. |
`biota_assess` | boolean | yes* | no | Whether the determinand is to be assessed: should be either `TRUE` or `FALSE`. For the time being, `Auxiliary` determinands must be set to `FALSE` |
`sediment_assess` | boolean | yes* | no | Whether the determinand is to be assessed. See `biota_assess` for more details |
`water_assess` | boolean | yes* | no | Whether the determinand is to be assessed. See `biota_assess` for more details |
`biota_unit` | character | yes* | yes | The unit that all measurements will be converted to in a biota assessment, e.g. `ug/kg`. Missing values are only allowed for determinands that will not be used. |
`sediment_unit` | character | yes* | yes | The unit that all measurements will be converted to in a sediment assessment, e.g. `mg/kg`. Missing values are only allowed for determinands that will not be used. |
`water_unit` | character | yes* | yes | The unit that all measurements will be converted to in a water assessment, e.g. `ug/l`. Missing values are only allowed for determinands that will not be used. |
`biota_auxiliary` | character | yes* | yes | Identifies all the auxiliary measurements that should be associated with the determinand. These should be separated by a `~`. For example, for chemical contaminants, this might be `DRYWT%~LIPIDWT%~LNMEA`. The list gets more interesting for effects measurements. |
`sediment_auxiliary` | character | yes* | yes | Identifies all the auxiliary measurements that should be associated with the determinand. These should be separated by a `~`. For example, for metal contaminants, this might be `AL~LI~CORG` |
`water_auxiliary` | character | yes* | yes | Identifies all the auxiliary measurements that should be associated with the determinand. These should be separated by a `~`. Not currently used much for water assessments |
`biota_sd_constant` | character | no | yes | If supplied, this allows the imputation of measurement uncertainty for determinands in a biota assessment when they are missing from the data file. `sd_constant` is the constant error with units given by `biota_unit` |
`biota_sd_variable` | character | no | yes | If supplied, this allows the imputation of measurement uncertainty for determinands in a biota assessment when they are missing from the data file. `sd_variable` is the proportional error expressed as a percentage (%) |
`sediment_sd_constant` | character | no | yes | See `biota_sd_constant` |
`sediment_sd_variable` | character | no | yes | See `biota_sd_variable` |
`water_sd_constant` | character | no | yes | See `biota_sd_constant` |
`water_sd_variable` | character | no | yes | See `biota_sd_variable` |
`distribution` | character | yes | yes | A distribution type; for chemical contaminants, this should be set to `lognormal`. Only required for determinands where at least one of `biota_assess`, `sediment_assess` or `water_assess` is `TRUE` |
`good_status` | character | yes | yes | Whether high or low values of the determinand indicate a healthy environment. Only required for determinands where at least one of `biota_assess`, `sediment_assess` or `water_assess` is `TRUE` |
A `yes*` in the mandatory column marks a compartment-specific column: for example, `biota_group`, `biota_assess`, etc. are mandatory for a biota assessment, but can be omitted if there are only going to be sediment and water assessments.
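The `yes*` rule amounts to building the list of required columns from the compartments you intend to assess. The sketch below illustrates that idea, together with splitting a tilde-separated auxiliary cell. Both helper names are hypothetical, and the column lists are read off the table above rather than taken from the `harsat` source.

```python
ALWAYS_REQUIRED = ["determinand", "common_name", "pargroup", "distribution", "good_status"]
PER_COMPARTMENT = ["group", "assess", "unit", "auxiliary"]  # the yes* columns

def missing_determinand_columns(header, compartments):
    """Return the required column names that are absent from the header."""
    required = list(ALWAYS_REQUIRED)
    for compartment in compartments:
        required += [f"{compartment}_{suffix}" for suffix in PER_COMPARTMENT]
    return [name for name in required if name not in header]

def split_auxiliary(cell):
    """Split a tilde-separated auxiliary cell into codes; a blank cell gives []."""
    return [code for code in cell.split("~") if code]

# Hypothetical header for a file prepared for a sediment-only assessment.
header = [
    "determinand", "common_name", "pargroup",
    "sediment_group", "sediment_assess", "sediment_unit", "sediment_auxiliary",
    "distribution", "good_status",
]

print(missing_determinand_columns(header, ["sediment"]))
print(missing_determinand_columns(header, ["biota", "sediment"]))
print(split_auxiliary("AL~CORG"))
```

With this header, a sediment-only assessment has nothing missing, while adding biota reports the four `biota_*` columns.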
Threshold file format
The threshold files are compartment-specific and will be called `thresholds_biota.csv`, `thresholds_sediment.csv` and `thresholds_water.csv`. The format of each differs somewhat, but the principle is the same: they provide the threshold values for each determinand. For biota, the threshold values are linked to a particular `species_group`, `species_subgroup` or `reference_species` (see the species file format) and to other supporting variables such as `matrix` (tissue type). For sediment, the threshold values can (optionally) be linked to one of the regional variables in the stations file. For water, the threshold values are linked to `filtration`. The water file is the simplest, so we describe that first.
Water threshold files
The water threshold file will be `thresholds_water.csv` in your information files directory. For illustration, suppose there is just one threshold, the EQS. The thresholds file will then have the following four columns:
column name | type | mandatory | NA allowed | comments |
---|---|---|---|---|
`determinand` | character | yes | no | The determinand code, which must key to the determinand file |
`filtration` | character | yes | no | A string, either `filtered` or `unfiltered`. |
`EQS_basis` | character | yes | yes | The basis on which the EQS is expressed: for water this will always be `W`. The basis must always be given if there is an EQS value in the next column. |
`EQS` | numeric | yes | yes | The value of the EQS. The units must key to the `water_unit` column in the determinand file |
If there is another threshold, for example, the BAC, then add two columns called `BAC_basis` and `BAC`. You can have as many thresholds as you want.
If, for a particular determinand, the same (set of) threshold value(s) is to be applied to both `filtered` and `unfiltered` time series, then set `filtration` to `filtered~unfiltered`.
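To show how a `filtered~unfiltered` entry can be read, the sketch below expands each row into one lookup entry per filtration value. It is an illustration only: `water_threshold_lookup` is a hypothetical helper, the determinand codes and EQS values in the sample are invented, and `harsat` itself may process the file differently.

```python
import csv
import io

def water_threshold_lookup(csv_text, threshold="EQS"):
    """Map (determinand, filtration) to (basis, value) for one threshold."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text, newline="")):
        value = row[threshold].strip()
        if not value:
            continue  # blank cell: no threshold for this determinand
        basis = row[f"{threshold}_basis"]
        # A tilde-separated filtration applies the same value to each entry.
        for filtration in row["filtration"].split("~"):
            lookup[(row["determinand"], filtration)] = (basis, float(value))
    return lookup

# Hypothetical thresholds, not real EQS values.
sample = (
    "determinand,filtration,EQS_basis,EQS\n"
    "CD,filtered,W,0.2\n"
    "PFOS,filtered~unfiltered,W,0.00013\n"
    "HG,unfiltered,,\n"
)

lookup = water_threshold_lookup(sample)
print(lookup[("PFOS", "unfiltered")])
```

Passing `threshold="BAC"` would read the `BAC_basis` and `BAC` pair instead, assuming those columns exist in the file.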
Sediment threshold files
The sediment threshold file will be `thresholds_sediment.csv` in your information files directory. For illustration, suppose there are two thresholds, the BAC and EAC. The threshold file will then have the following five columns:
column name | type | mandatory | NA allowed | comments |
---|---|---|---|---|
`determinand` | character | yes | no | The determinand code, which must key to the determinand file |
`BAC_basis` | character | yes | yes | The basis on which the BAC is expressed: for sediment this will always be `D`. This must always be provided if there is a BAC value. |
`EAC_basis` | character | yes | yes | The basis on which the EAC is expressed. See above. |
`BAC` | numeric | yes | yes | The value of the BAC. The units must key to the `sediment_unit` column in the determinand file. The units are also assumed to be normalised for grain size in the same way as the data are normalised (in `create_timeseries`). |
`EAC` | numeric | yes | yes | The value of the EAC. See above. |
Again, you can have as many thresholds as you want. For example, OSPAR assessments typically use the ERL, EQS and FEQG in addition to the BAC and EAC.
You can also have extra columns that match columns in the station dictionary. Typically, these will be used to apply different threshold values to different regions. For example, the OSPAR threshold file has a column `ospar_subregion`, which allows one set of threshold values to be applied in the Iberian Coast and Gulf of Cadiz and another set to be applied in the rest of the OSPAR area.
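The rule that a basis must accompany every threshold value applies to each value and basis pair, however many thresholds the file carries. The sketch below checks it by treating any column `X` that has a partner `X_basis` as a threshold column. The helper name `basis_problems` and the sample values are hypothetical.

```python
import csv
import io

def basis_problems(csv_text):
    """List (row_number, threshold) pairs where a value has no basis."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    header = reader.fieldnames or []
    # Every column X with a partner column X_basis counts as a threshold.
    thresholds = [name for name in header if f"{name}_basis" in header]
    problems = []
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        for name in thresholds:
            has_value = bool((row[name] or "").strip())
            has_basis = bool((row[f"{name}_basis"] or "").strip())
            if has_value and not has_basis:
                problems.append((row_number, name))
    return problems

# Hypothetical values: the PB row gives an EAC without an EAC_basis.
sample = (
    "determinand,BAC_basis,EAC_basis,BAC,EAC\n"
    "CD,D,D,0.31,1.2\n"
    "PB,D,,38,47\n"
    "HG,,,,\n"
)

print(basis_problems(sample))
```

Because thresholds are discovered from the header, the same check covers extra pairs such as `ERL_basis` and `ERL` without changes.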
Biota threshold files
The biota threshold file will be `thresholds_biota.csv` in your information files directory. Again, suppose there are two thresholds, the BAC and EAC. The threshold file will then have the following eleven columns:
column name | type | mandatory | NA allowed | comments |
---|---|---|---|---|
`determinand` | character | yes | no | The determinand code, which must key to the determinand file |
`species_group` | character | yes | no | The species group, a key to the `species.csv` file |
`species_subgroup` | character | yes | no | The species subgroup, a key to the `species.csv` file |
`species` | character | yes | yes | The species, a key to the `species.csv` file (and specifically the `reference_species` column) |
`matrix` | character | yes | no | The tissue type (ICES code); for example `EH`, `MU`, `SB`. If the threshold can be applied to multiple tissues then provide all the relevant tissues separated by a tilde. For example, `LI~MU~SB`. |
`method_analysis` | character | yes | yes | The method of analysis (ICES code). Only provide values for PAH metabolites, otherwise leave blank. |
`sex` | character | yes | yes | The sex (ICES code). Only provide for EROD, otherwise leave blank. |
`BAC_basis` | character | yes | yes | The basis on which the BAC is expressed. For contaminants, this will either be `W`, `D` or `L` and must be provided if there is a BAC value. Leave blank for effects. |
`EAC_basis` | character | yes | yes | The basis on which the EAC is expressed. See above. |
`BAC` | numeric | yes | yes | The value of the BAC. The units must key to the `biota_unit` column in the determinand file. |
`EAC` | numeric | yes | yes | The value of the EAC. See above. |
The columns `species_group`, `species_subgroup`, and `species` are used to provide flexibility in matching thresholds to species. For some determinands, a threshold is applied to all species in a species group, in which case the `species_group` column should be populated. For other determinands, threshold values might differ between species subgroups (e.g. between mussels and oysters), in which case the `species_subgroup` column should be populated. Use the `species` column if the threshold value is species-specific.
At least one of `species_group`, `species_subgroup` and `species` must always be provided.
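As a minimal sketch of that last rule, the code below reports which of the three species columns each threshold row fills in, splits the tilde-separated `matrix` cell, and flags rows where all three species columns are blank. The helper names and sample values are hypothetical, and the sketch does not attempt to reproduce how `harsat` chooses between a group, subgroup and species match.

```python
import csv
import io

SPECIES_COLUMNS = ("species_group", "species_subgroup", "species")

def biota_row_summary(csv_text):
    """For each row, list the filled species columns and the tissues covered."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    summary = []
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        filled = [name for name in SPECIES_COLUMNS if (row.get(name) or "").strip()]
        tissues = [code for code in (row.get("matrix") or "").split("~") if code]
        summary.append({"row": row_number, "species_keys": filled, "tissues": tissues})
    return summary

def rows_without_species_key(csv_text):
    """Return the row numbers where none of the species columns is filled."""
    return [item["row"] for item in biota_row_summary(csv_text) if not item["species_keys"]]

# Hypothetical thresholds: the PB row leaves all three species columns blank.
sample = (
    "determinand,species_group,species_subgroup,species,matrix,"
    "method_analysis,sex,BAC_basis,EAC_basis,BAC,EAC\n"
    "CD,,Mussel,,SB,,,D,,960,\n"
    "HG,Fish,,,LI~MU,,,W,,35,\n"
    "PB,,,,SB,,,D,,1300,\n"
)

print(rows_without_species_key(sample))
```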