Many Earth science observation datasets are inherently tabular: rows and columns of numbers and text providing measurements of particular quantities at specified times and locations. Often these data are plain text files whose values are separated by commas (CSV) or other delimiters. Such files are easy for humans to load into a spreadsheet or pandas DataFrame, either interactively or using ad-hoc code that understands the structure of a particular file.
Unfortunately, tabular data files are heterogeneous. There are no mandatory standards or schemas for important characteristics such as the presence of header rows, the naming and ordering of columns, the units used, and so forth. Even if there were a standard approach, a data archive facility may be obligated to accept data as submitted rather than converting to another format. As a result of this variety, human intervention is required to inspect and understand the contents of each new file; automated data ingestion and verification are difficult.
To address this problem, a number of approaches have been proposed for machine-readable descriptors that provide metadata about the syntax and semantics of the rows of data. Examples include the World Wide Web Consortium (W3C) CSV on the Web (CSVW) technical recommendation (which uses JSON format), Table Schema (also in JSON), NOAA ERDDAP's NCCSV and the British Atmospheric Data Centre's BADC-CSV (both of which use CSV text), CSV YAML (CSVY), the NASA Ames Format Specification (text), possibly NcML (an XML format not designed for this purpose but perhaps adaptable), and doubtless others. In each case the descriptor is either a separate sidecar file or a block of additional metadata lines in the data file itself, placed before the actual CSV-style rows of data values.
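As a rough illustration of the sidecar idea, the sketch below reads a small CSV using a JSON descriptor that lists each column's name and type, loosely modeled on the "fields" layout used by Table Schema. The column names, station code, and data values are invented for illustration, the helper read_with_descriptor is hypothetical, and only the Python standard library is used; a real workflow would more likely rely on a library that implements one of the specifications above.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical sidecar descriptor: column names and types only.
# The layout loosely follows Table Schema's "fields" array; it is not a complete descriptor.
DESCRIPTOR = json.loads("""
{
  "fields": [
    {"name": "time", "type": "datetime"},
    {"name": "latitude", "type": "number"},
    {"name": "longitude", "type": "number"},
    {"name": "station", "type": "string"},
    {"name": "sst_degC", "type": "number"}
  ]
}
""")

# Invented sample data standing in for a submitted CSV file.
CSV_TEXT = """time,latitude,longitude,station,sst_degC
2020-01-01T00:00:00,36.6,-121.9,M1,13.2
2020-01-01T01:00:00,36.6,-121.9,M1,13.1
"""

# Map descriptor type names to Python conversion functions.
CASTS = {
    "string": str,
    "integer": int,
    "number": float,
    "datetime": datetime.fromisoformat,
}


def read_with_descriptor(csv_text, descriptor):
    """Check the header against the descriptor, then cast each value by its declared type."""
    fields = descriptor["fields"]
    expected = [field["name"] for field in fields]

    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if header != expected:
        raise ValueError(f"header {header} does not match descriptor {expected}")

    rows = []
    for raw in reader:
        if not raw:  # skip blank lines
            continue
        rows.append(
            {field["name"]: CASTS[field["type"]](value) for field, value in zip(fields, raw)}
        )
    return rows


rows = read_with_descriptor(CSV_TEXT, DESCRIPTOR)
for row in rows:
    print(row["time"].isoformat(), row["station"], row["sst_degC"])
```

Because the descriptor carries the column names and types, the same function can ingest any file that ships with a matching sidecar, and a header mismatch is reported automatically instead of being discovered by manual inspection.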
This session will invite discussion of various approaches and their benefits or limitations including ease of creation, actual machine-readability, level of standardization, availability of tools, and breadth of community adoption.
Agenda:
- Welcome and overview - Jeff de La Beaujardière/NCAR (15 min)
- W3C CSV on the Web (CSVW) at Italian Ministry of Transportation - Paolo Starace/SciamLab (15 min)
- ERDDAP's datasets.xml as a File Description System - Bob Simons/NOAA NMFS (15 min)
- CSV YAML (CSVY) at ICARUS - Tran Nguyen/UC Davis (15 min)
- Open discussion (30 min)