External data file formats • harsat

These are the column headers for CSV-formatted external data files. The files should be UTF-8 encoded.

Missing values should be supplied as blank cells, not as NA or some other code.

Other columns can also be supplied, but will typically be ignored.
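
As an illustration of the two rules above (UTF-8 encoding, blank cells for missing values), here is a minimal Python sketch of writing such a file. The helper `write_rows` and the example values are hypothetical and are not part of harsat; the sketch shows only one way of making sure that a missing value ends up as an empty cell rather than as `NA` or `None`.

```python
import csv
import io

def write_rows(rows, fieldnames, handle):
    """Write dict rows as CSV, emitting missing values (None) as blank cells."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Replace None (or an absent key) with an empty string so that no
        # "NA" or "None" text reaches the file.
        writer.writerow({k: ("" if row.get(k) is None else row[k]) for k in fieldnames})

# Hypothetical example values, for illustration only.
rows = [
    {"country": "Sweden", "station_code": "A1", "value": 0.12, "censoring": None},
    {"country": "Sweden", "station_code": "A1", "value": 0.05, "censoring": "<"},
]
buffer = io.StringIO()
write_rows(rows, ["country", "station_code", "value", "censoring"], buffer)
csv_text = buffer.getvalue()
```

When writing to disk instead of an in-memory buffer, opening the file with `open(path, "w", encoding="utf-8", newline="")` should give the UTF-8 encoding asked for above.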

Contaminant data

The data file has one row for each measurement.

column name	type	mandatory	comments
`country`	character	yes	identifies the source of the data; for international assessments this is typically the country of origin, but for national assessments it could be a local monitoring authority; must match `country` in station file; no missing values
`station_code`	alphanumeric	yes	the station (code) where the sample was collected; must match `station_code` in station file; no missing values
`station_name`	alphanumeric	yes	the station (name) where the sample was collected; this is often more intuitive to a user than `station_code`; must match `station_name` in station file; no missing values
`sample_latitude`	numeric (decimal degrees)		need not match `station_latitude` in station file
`sample_longitude`	numeric (decimal degrees)		need not match `station_longitude` in station file
`year`	integer	yes	monitoring year; doesn’t necessarily match `date`, since a sampling season running from e.g. November 2021 to May 2022 might all be considered the 2022 monitoring year; no missing values
`date`	date: use ISO 8601 standard e.g. 2023-06-28		sampling date
`depth`	numeric (m)		sediment: assumed to be a surface sediment sample with depth being the lower depth of the grab; water: assumed to be a surface water sample with depth being the upper depth of the sample; biota: not used, so can supply whatever is useful (or omit)
`species`	character	yes (biota)	Latin name, which must match a `submitted_species` in the species reference table
`sex`	character		see ICES reference codes for SEXCO; required for EROD assessments; desirable if sex is used to subdivide timeseries (see `subseries`)
`n_individual`	integer		number of pooled individuals in the sample; required for imposex assessments
`subseries`	character		used to split up timeseries by e.g. sex or age; for example: `juvenile`, `adult_male`, `adult_female`; missing values indicate that all records in a timeseries will be considered together (no subdivision)
`sample`	alphanumeric	yes	links measurements made on the same individuals (biota), in the same sediment grab or in the same water sample; no missing values; don’t use the same value for samples collected in different years, at different stations or in different species
`determinand`	character	yes	must match values in determinand reference table; most will be in ICES reference codes for PARAM, but can provide own values; no missing values
`matrix`	character	yes	see ICES reference codes for MATRX
`basis`	character	yes (biota & sediment)	`W`, `D` or `L`; no missing values for chemical measurements in biota or sediment; not mandatory for water, where basis is always taken to be `W`
`unit`	character	yes	see ICES reference codes for MUNIT no missing values
`value`	numeric	yes	no missing values
`censoring`	character		typically `D`, `Q` or `<`, indicating a value less than the limit of detection, less than the limit of quantification, or some other (non-specified) less-than; a missing value indicates that the measurement is not a less-than (i.e. is uncensored)
`limit_detection`	numeric		same unit as value
`limit_quantification`	numeric		same unit as value
`uncertainty`	numeric		analytical uncertainty in the measurement; same unit as value
`unit_uncertainty`	character		`SD`, `U2` or `%`; if `uncertainty` is present, `unit_uncertainty` must also be present
`method_pretreatment`	character		use ICES reference codes for METPT
`method_analysis`	character		use ICES reference codes for METOA; required for bile metabolite measurements
`method_extraction`	character		use ICES reference codes for METCX; required for sediment normalisation (typically for metals)
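
Several of the rules in the table can be checked mechanically before a file is handed to harsat. The following Python sketch is one possible pre-check and does not describe harsat's own validation: the function `check_contaminant_row` and its messages are hypothetical. It covers only the columns that the table explicitly marks "no missing values", the ISO 8601 `date` format, the allowed `basis` codes, and the rule that `unit_uncertainty` must accompany `uncertainty`. Checks against the ICES reference codes are deliberately left out.

```python
from datetime import date

# Columns that the table above explicitly marks "no missing values".
NO_MISSING = ["country", "station_code", "station_name", "year",
              "sample", "determinand", "unit", "value"]

def check_contaminant_row(row):
    """Return a list of problems found in one contaminant row (a dict of cells)."""
    problems = []

    def blank(name):
        # A cell counts as missing if it is absent, None, or only whitespace.
        cell = row.get(name)
        return cell is None or str(cell).strip() == ""

    for name in NO_MISSING:
        if blank(name):
            problems.append(f"{name}: missing value")

    if not blank("date"):
        try:
            date.fromisoformat(str(row["date"]).strip())
        except ValueError:
            problems.append("date: not an ISO 8601 date")

    if not blank("basis") and str(row["basis"]).strip() not in {"W", "D", "L"}:
        problems.append("basis: expected W, D or L")

    if not blank("uncertainty"):
        if blank("unit_uncertainty"):
            problems.append("unit_uncertainty: required when uncertainty is present")
        elif str(row["unit_uncertainty"]).strip() not in {"SD", "U2", "%"}:
            problems.append("unit_uncertainty: expected SD, U2 or %")

    return problems

# Hypothetical row, as it might come from csv.DictReader (all cells are strings).
example = {"country": "Sweden", "station_code": "A1", "station_name": "Alpha",
           "year": "2023", "date": "28/06/2023", "sample": "S1",
           "determinand": "CD", "unit": "mg/kg", "value": "0.12",
           "uncertainty": "0.01", "unit_uncertainty": ""}
example_problems = check_contaminant_row(example)
```

For the example row, the sketch flags the non-ISO date and the missing `unit_uncertainty`. The `basis` requirement for biota and sediment is not enforced, because that would need the `matrix` codes to be interpreted.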

Station data

The station file has one row for each station.

column name	type	mandatory	comments
`OSPAR_region`	character		the regional columns can be called anything (and are optional); for OSPAR assessments, use `OSPAR_region` and `OSPAR_subregion`; for HELCOM assessments, use `HELCOM_subbasin`, `HELCOM_L3` and `HELCOM_L4`; for other assessments, any regional columns must be explicitly identified when calling `read_data` using the `control` argument
`OSPAR_subregion`	character		see above
`country`	character	yes	no missing values
`station_code`	alphanumeric	yes	no missing values
`station_name`	character	yes	no missing values
`station_longname`	character		typically a more intuitive name for the station than `station_name`
`station_latitude`	numeric (decimal degrees)	yes	no missing values
`station_longitude`	numeric (decimal degrees)	yes	no missing values
`station_type`	character		see ICES reference codes for MSTAT; typically `B` (baseline), `RH` (representative) or `IH` (impacted)
`waterbody_type`	character		see ICES reference codes for WLTYP; typically a code indicating transitional (estuarine) waters, coastal waters or open sea
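
The contaminant table requires `country`, `station_code` and `station_name` to match the station file. A minimal Python sketch of that cross-check follows. It assumes the three columns are matched jointly as one key, which the tables do not state explicitly. The function names and all data values are hypothetical placeholders, and this is not how harsat itself performs the match.

```python
import csv
import io

# Columns that the contaminant file must share with the station file.
KEYS = ("country", "station_code", "station_name")

def station_keys(handle):
    """Collect the distinct (country, station_code, station_name) tuples in a CSV."""
    return {tuple(row[k] for k in KEYS) for row in csv.DictReader(handle)}

def unmatched_stations(contaminant_handle, station_handle):
    """Return the keys used in the contaminant file that the station file lacks."""
    return sorted(station_keys(contaminant_handle) - station_keys(station_handle))

# Hypothetical in-memory files, for illustration only.
station_csv = io.StringIO(
    "country,station_code,station_name,station_latitude,station_longitude\n"
    "Sweden,A1,Alpha,58.0,11.5\n"
)
contaminant_csv = io.StringIO(
    "country,station_code,station_name,year,sample,determinand,matrix,unit,value\n"
    "Sweden,A1,Alpha,2023,S1,CD,SED,mg/kg,0.12\n"
    "Sweden,B2,Beta,2023,S2,CD,SED,mg/kg,0.30\n"
)
missing = unmatched_stations(contaminant_csv, station_csv)
```

Here `missing` holds the one key, for station `B2`, that appears in the contaminant rows but has no row in the station file.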