Wednesday, April 22, 2015

CSVW for Tabular Clinical Trial Data and Metadata


W3C has developed a set of working drafts for tabular data and metadata called CSV on the Web (CSVW) and is now seeking comments and implementations.

The drafts describe:
  • Metadata vocabulary for tabular data
    A JSON-based format for expressing metadata about tabular data to inform validation, conversion, display and data entry for tabular data
  • Model for tabular data and metadata
    An abstract model for tabular data, and how to locate metadata that enables users to better understand what the data holds; this specification also contains non-normative guidance on how to parse CSV files.
  • Procedures and rules to be applied when converting tabular data into JSON and RDF 
These are based on a series of use cases and recommendations, including, for example, Publication of National Statistics and Analyzing Scientific Spreadsheets. I can see some interesting opportunities in this for tabular Clinical Trial Datasets.

A small example

Check out Ed Summers' (@edsu) very nice, small CSVW example mentioning one of the authors of the drafts, Dan Brickley (@danbri, Developer Advocate at Google). Below are the CSV example, the related metadata, and the annotated, linked data.

CSV
isbn,title,author
0470402377,"Bricklin on Technology","Dan Bricklin"

Metadata
{
  "@context": {
    "@vocab": "http://www.w3.org/ns/csvw#", 
    "dc": "http://purl.org/dc/terms/"
  }, 
  "@type": "Table", 
  "url": "example.csv",
  "dc:creator": "Dan Bricklin", 
  "dc:title": "My Spreadsheet", 
  "dc:modified": "2014-05-09T15:44:58Z", 
  "dc:publisher": "My Books", 
  "tableSchema": {
    "aboutUrl": "http://librarything.com/isbn/{isbn}",
    "primaryKey": "isbn",
    "columns": [
      {
        "name": "isbn",
        "titles": "ISBN-10",
        "datatype": "string",
        "unique": true,
        "propertyUrl": "http://purl.org/dc/terms/identifier"
      },
      {
        "name": "title", 
        "titles": "Book Title",
        "datatype": "string", 
        "propertyUrl": "http://purl.org/dc/terms/title"
      },
      {
        "name": "author",
        "titles": "Book Author",
        "datatype": "string",
        "propertyUrl": "http://purl.org/dc/terms/creator"
      }
    ]
  }
}


Annotated, linked data 
(RDF model serialized as JSON-LD)
{
  "@context": {
    "csvw": "http://www.w3.org/ns/csvw#",
    "dc": "http://purl.org/dc/terms/",
    "prov": "http://www.w3.org/ns/prov#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@graph": [
    {
      "@id": "_:g69960879269460",
      "@type": "prov:Usage",
      "prov:entity": {
        "@id": "example.csv-metadata.json"
      },
      "prov:hadRole": {
        "@id": "csvw:tabularMetadata"
      }
    },
    {
      "@id": "_:g69960879270660",
      "@type": "prov:Usage",
      "prov:entity": {
        "@id": "example.csv"
      },
      "prov:hadRole": {
        "@id": "csvw:csvEncodedTabularData"
      }
    },
    {
      "@id": "_:g69960879273280",
      "@type": "prov:Activity",
      "prov:endedAtTime": {
        "@value": "2015-04-22T20:21:11Z",
        "@type": "xsd:dateTime"
      },
      "prov:qualifiedUsage": [
        {
          "@id": "_:g69960879270660"
        },
        {
          "@id": "_:g69960879269460"
        }
      ],
      "prov:startedAtTime": {
        "@value": "2015-04-22T20:21:10Z",
        "@type": "xsd:dateTime"
      },
      "prov:wasAssociatedWith": {
        "@id": "http://rubygems.org/gems/rdf-tabular"
      }
    },
    {
      "@id": "_:g69960879277480",
      "@type": "csvw:Row",
      "csvw:describes": {
        "@id": "http://librarything.com/isbn/0470402377"
      },
      "csvw:rownum": {
        "@value": "1",
        "@type": "xsd:integer"
      },
      "csvw:url": {
        "@id": "#row=2"
      }
    },
    {
      "@id": "_:g69960879413940",
      "@type": "csvw:Table",
      "csvw:row": {
        "@id": "_:g69960879277480"
      },
      "csvw:url": {
        "@id": "example.csv"
      },
      "dc:creator": "Dan Bricklin",
      "dc:modified": "2014-05-09T15:44:58Z",
      "dc:publisher": "My Books",
      "dc:title": "My Spreadsheet"
    },
    {
      "@id": "_:g69960879425260",
      "@type": "csvw:TableGroup",
      "csvw:table": {
        "@id": "_:g69960879413940"
      },
      "prov:wasGeneratedBy": {
        "@id": "_:g69960879273280"
      }
    },
    {
      "@id": "http://librarything.com/isbn/0470402377",
      "dc:creator": "Dan Bricklin",
      "dc:identifier": "0470402377",
      "dc:title": "Bricklin on Technology"
    }
  ]
}
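
To make the link between the metadata and the annotated output more concrete, here is a minimal Python sketch of how `aboutUrl` and `propertyUrl` could be applied to each row. It is not the rdf-tabular implementation: the `expand` and `describe_rows` helpers are hypothetical names, only plain `{name}` placeholders are handled (the drafts rely on full URI templates), and the provenance and table-level nodes are left out.

```python
import csv
import io
import re

CSV_TEXT = 'isbn,title,author\n0470402377,"Bricklin on Technology","Dan Bricklin"\n'

# Only the parts of the metadata above that drive the row-to-subject mapping.
TABLE_SCHEMA = {
    "aboutUrl": "http://librarything.com/isbn/{isbn}",
    "columns": [
        {"name": "isbn", "propertyUrl": "http://purl.org/dc/terms/identifier"},
        {"name": "title", "propertyUrl": "http://purl.org/dc/terms/title"},
        {"name": "author", "propertyUrl": "http://purl.org/dc/terms/creator"},
    ],
}


def expand(template, row):
    """Replace each {column} placeholder with the row's value.

    Simplification: this is plain substitution, not a full URI template engine.
    """
    return re.sub(r"\{(\w+)\}", lambda m: row[m.group(1)], template)


def describe_rows(csv_text, schema):
    """Return one subject description per data row of the CSV text."""
    subjects = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = {"@id": expand(schema["aboutUrl"], row)}
        for column in schema["columns"]:
            subject[column["propertyUrl"]] = row[column["name"]]
        subjects.append(subject)
    return subjects


subjects = describe_rows(CSV_TEXT, TABLE_SCHEMA)
```

For the one-row example this gives a single subject whose `@id` is http://librarything.com/isbn/0470402377, with the identifier, title and creator properties filled from the row, which corresponds to the last node in the graph above.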


A clinical trial data example?

Tabular data has been the traditional way to organize how clinical trial data is captured, stored and submitted. So I think it would be very interesting to explore how to bind data to its metadata in a similar way, that is, to make things like variable labels, date/time formats etc. explicit.
  • What could the metadata for a small example of, e.g., demographic data look like?
  • What would the annotated, linked data look like for such a small example?
I would love to see some early ideas on how this could be implemented in the two main languages/environments we use today for clinical data, SAS and R, similar to the early implementation of CSVW in Ruby described in a nice blog post from Gregg Kellogg (@Gkellogg, one of the authors of the drafts).
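
As a starting point for the first question, the sketch below builds one possible metadata document for a tiny demographics table, using Python only to assemble and print the JSON. Everything specific in it is an assumption for illustration: the file name `dm.csv`, the `http://example.org/study/ABC123` base URI, the `column` helper, and the choice of four variables named and labelled in the style of a CDISC SDTM demographics domain. The date column assumes the vocabulary's `base`/`format` form for datatypes.

```python
import json

# Hypothetical base URI for an imaginary study; not a real namespace.
BASE = "http://example.org/study/ABC123"


def column(name, label, datatype):
    """Build one column description with an illustrative property URI."""
    return {
        "name": name,
        "titles": label,  # the variable label, made explicit
        "datatype": datatype,
        "propertyUrl": f"{BASE}/variable/{name}",
    }


dm_metadata = {
    "@context": {"@vocab": "http://www.w3.org/ns/csvw#"},
    "@type": "Table",
    "url": "dm.csv",
    "tableSchema": {
        # One subject URI per row, keyed on the subject identifier.
        "aboutUrl": BASE + "/subject/{USUBJID}",
        "primaryKey": "USUBJID",
        "columns": [
            column("USUBJID", "Unique Subject Identifier", "string"),
            column("AGE", "Age", "integer"),
            column("SEX", "Sex", "string"),
            # The date format is stated instead of being implied by the text.
            column("RFSTDTC", "Subject Reference Start Date/Time",
                   {"base": "date", "format": "yyyy-MM-dd"}),
        ],
    },
}

print(json.dumps(dm_metadata, indent=2))
```

The same structure could just as well be written by hand as a JSON file; a SAS or R version would mainly need to derive the column entries from the dataset's existing variable attributes.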

I think such a first example would trigger interesting ideas for best practices and potential extensions to the metadata vocabulary and model, and also to the procedures and rules to create annotated JSON and RDF representations, such as:
  • Templates for the URIs to be assigned to each captured and derived data point?
  • Representing implied formats in varchar fields such as dates and precision.
  • Making explicit the metadata implied by the actual data, such as encoded lab test codes and units.
  • How to leverage the RDF schemas representing CDISC standards?
  • How to best use W3C's Provenance ontology to capture the life cycle of a data point in a clinical trial?
I think questions such as these are important to address, especially in the context of transparency and reuse of clinical trial data; see also an earlier blog post: Clinical Trial Data Transparency and Linked Data.
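
The point about implied formats in character fields can be illustrated with a small Python sketch. It assumes that dates are stored as ISO 8601 text that may be partial (year only, year and month, and so on) and that numeric results are stored as text whose trailing digits carry the reported precision. The function names `implied_datatype` and `decimal_places` are hypothetical, and the mapping to XML Schema datatypes is one possible choice, not something the drafts prescribe.

```python
import re

# Ordered from most to least specific; each pattern covers one level of detail.
_PATTERNS = [
    ("xsd:dateTime", re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$")),
    ("xsd:date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("xsd:gYearMonth", re.compile(r"^\d{4}-\d{2}$")),
    ("xsd:gYear", re.compile(r"^\d{4}$")),
]


def implied_datatype(value):
    """Return the XML Schema datatype implied by a date string, or None."""
    for datatype, pattern in _PATTERNS:
        if pattern.match(value):
            return datatype
    return None


def decimal_places(value):
    """Return the number of digits after the decimal point in a numeric string."""
    _, _, fraction = value.partition(".")
    return len(fraction)


print(implied_datatype("2015-04"))     # xsd:gYearMonth
print(implied_datatype("2015-04-22"))  # xsd:date
print(decimal_places("5.40"))          # 2
```

A check like this only looks at the shape of the text; it does not validate that the month or day is in range, and values with time zones or reduced time precision would need additional patterns.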

So, I hope this blog post will spark some interesting responses from the SAS and R communities, and discussions in groups like CDISC and the PhUSE Semantic Technology project.