15 November 2023

Ten problems with data near you

While working on data management protocols, we identified some of the most common issues that limit data longevity. We list them here to help everyone involved in data collection and management avoid them.

  1. Metadata is often not recorded and data points are missing. Metadata is vital for others to interpret and use your data. Missing metadata and data can make datasets unusable.
  2. Recorded species names often aren’t recognised, accepted names. Most often, species are recorded under canonical names rather than full accepted names that include the authors. For plants, use the accepted names from the South African National Plant Checklist, which is updated annually at http://opus.sanbi.org/. For animals, national taxon-specific lists are not yet available, but the Catalogue of Life provides a global checklist.
  3. The dates of data collection and data entry are often absent. Reams and reams of monitoring and other data are unusable without a collection date.
  4. Dates often aren’t in a standardised format. Three separate columns with day, month and year should be available for date data in a database. The required Darwin Core Archive date format is yyyy-mm-dd, and it can easily be obtained from the three columns.
  5. The names of who collected the data, identified the species and recorded the data often aren’t included. Data verification is important, especially the verification of species identifications.
  6. Coordinates often aren’t recorded in a standardised format and with an indication of resolution. Latitude (N-S) and Longitude (W-E) columns need to be available in decimal degrees. If degrees, minutes and seconds are recorded, they need to be held in three separate columns, from which decimal degrees can be calculated as degrees + (minutes/60) + (seconds/3600), with southern latitudes and western longitudes given a negative sign.
  7. Measurements often aren’t recorded in a standardised, identified unit. Each column heading should state the unit in which that column is recorded, and the cells should hold values only. Numbers shouldn’t appear alongside units in the data, and units must be consistent: different units shouldn’t appear in one column.
  8. Data often remain in hard copy. Reams of data aren’t captured electronically, meaning that these data are not used in decisions and trends go unnoticed.
  9. Data often aren’t analysed and used. Many data are collected but never analysed or reflected on, which wastes the resources spent on collecting them. By making data more available, it is hoped that more data will be analysed and more strategic data collection encouraged.
  10. Data are often not widely shared on platforms such as GBIF. Large volumes of data are not collated and shared. Through the JRS project, SANParks is working towards making data collected in parks by SANParks staff and external researchers available on GBIF. This enables broader engagement with researchers and managers both nationally and globally.
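
For readers who manage their data in scripts, the date and coordinate conversions described in points 4 and 6 can be sketched in Python roughly as follows. This is a minimal illustration, not a prescribed workflow: the function names `iso_date` and `dms_to_decimal` and the `hemisphere` parameter are hypothetical, and the code assumes the usual convention that southern latitudes and western longitudes are negative.

```python
from datetime import date


def iso_date(year: int, month: int, day: int) -> str:
    """Combine separate year, month and day columns into yyyy-mm-dd."""
    # date() raises ValueError for impossible dates such as 31 February,
    # which doubles as a basic check on the recorded values.
    return date(year, month, day).isoformat()


def dms_to_decimal(degrees: float, minutes: float, seconds: float,
                   hemisphere: str) -> float:
    """Convert degrees, minutes and seconds to decimal degrees.

    hemisphere is a hypothetical extra column holding N, S, E or W.
    """
    hemi = hemisphere.strip().upper()
    if hemi not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range 0 to 60")
    # Formula from point 6: degrees + (minutes/60) + (seconds/3600).
    decimal = abs(degrees) + minutes / 60 + seconds / 3600
    # Assumed convention: south and west are negative.
    return -decimal if hemi in {"S", "W"} else decimal


print(iso_date(2021, 3, 7))              # 2021-03-07
print(dms_to_decimal(25, 30, 0, "S"))    # -25.5
```

Keeping the day, month and year (and the degrees, minutes and seconds) in separate columns means conversions like these can be rerun at any time, and obvious capture errors surface as exceptions rather than passing silently into the dataset.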

This article was originally published in the 2021/2022 Research Report.