Sunday, August 19, 2012

A Simplified Medical Document format based on static web pages semantic annotation

Summary: Static HTML web pages augmented with semantic markup are sufficient to use as medical documents, such as radiology reports, and can satisfy all the goals of HL7 CDA, but by using only free and open standards used by all industries and consumers; RDFa seems to have advantages over MicroData for the semantic markup and is sufficient.

(Very) Long Version.

For some time I have been railing against the evils of HL7 attempting to extort fees from "users" of its standards; putting aside whether or not they have any legal or ethical basis to stand on, there is clearly a need to find a truly open alternative to HL7 CDA as a medical document format.

Once one stops drinking the "CDA is the way" Kool-Aid and starts down the road of exploring alternatives that are consumer-industry-based, one comes across a plethora of alternative options, ranging from de novo XML creations, through existing XML-based document formats used in publishing (such as DocBook and DITA), through to page-layout-based formats like PDF (which though managed by Adobe, is surprisingly open, especially compared to HL7).

But the more one explores these questions, and especially as one searches the web for information with a browser, whether it be from one's desktop, laptop, tablet or phone, the more obvious it becomes that the choice of ubiquitous open document format is already clear - it is the HTML web page.

Most folks know that web pages are HTML files served up for rendering to browsers, and in contemporary web usage these typically contain a lot of dynamic (scripted) content or even require plugins for rendering of multi-media content. But they can be simplified down to their original basics, their bare essentials: files that contain a single page of static information without frames, making use of relatively limited rendering flexibility, and most importantly, without "hyperlinks" to external content; i.e., they can be self-contained and detached from their environment and remain fully meaningful. This is analogous to the use of HTML as a common email message format, for example, except without the cutesy pictures and embedded links to advertising in the spam that one receives. These are, by any other name, "documents".

Formalizing the use of static HTML for medical document use would require placing constraints on what HTML base features (version) were required and what features (scripts, frames, etc.) were prohibited, but that is easily done. And for XML aficionados, there exist XHTML flavors of HTML that browsers happily render.
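A minimal sketch of what checking such constraints could look like, using only the Python standard library. The list of prohibited elements and the rule about external references are assumptions made for illustration; an actual profile would have to define its own.

```python
from html.parser import HTMLParser

# Hypothetical constraint set; a real profile would have to define its own.
PROHIBITED_TAGS = {"script", "frame", "frameset", "iframe", "object", "embed", "applet"}
# Attributes that pull in, or point to, content outside the document.
LINK_ATTRS = {"a": "href", "img": "src", "link": "href"}


class StaticHtmlChecker(HTMLParser):
    """Collects violations of the (assumed) static-document constraints."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in PROHIBITED_TAGS:
            self.violations.append(f"prohibited element <{tag}>")
        # Inline event handlers such as onload are scripting too.
        for name in attrs:
            if name.startswith("on"):
                self.violations.append(f"event handler {name} on <{tag}>")
        # Fragment links and data: URIs stay inside the document itself.
        link_attr = LINK_ATTRS.get(tag)
        value = attrs.get(link_attr) if link_attr else None
        if value and not value.startswith(("#", "data:")):
            self.violations.append(f"external reference in <{tag} {link_attr}>")


def check_static_html(source):
    """Return a list of human-readable violations (empty if none were found)."""
    checker = StaticHtmlChecker()
    checker.feed(source)
    checker.close()
    return checker.violations
```

Calling check_static_html() on a plain page of headings and paragraphs returns an empty list; a page with a script element, an onload handler or a link to another site returns one entry per problem.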

It turns out that folks have been suggesting to HL7 for many years that XHTML should be used in place of CDA's own flavor of narrative content specification, and that it is likely that CDA R3 will take this course (see Keith Boone's blog post on "XHTML for CDA Release 3").

What about how to handle the identification and management information that is needed to use a document, and how to identify any coded or structured content therein? Can this be done without inventing a medically-specific "header" or "wrapper" format around the HTML content, and most importantly, can it be done in a pan-industry standard way?

For years the "semantic web" folks have been trying to convince everyone to use standards like RDF to say stuff about web resources, and this finally seems to be coming of age, if for no other reason than the "enlightened self-interest" of the major search engine vendors, particularly Google, Bing and Yahoo. The more formal RDFa effort has been competing against the MicroData effort, but it seems that both will be supported and arguably there are some advantages to RDFa (and particularly RDFaLite if it proves sufficient).

What I propose is that we use either RDFa or MicroData to "semantically annotate" narrative content in HTML form so that the resulting "document" meets certain minimal requirements for identification, management, coded and structured information, sufficient for use as a medical document. I am further proposing that there not be a "header" at all per se, at least not in the traditional sense, nor that such information be confined to the HTML header (contents of the HEAD element). Rather such information can be distributed throughout the document, in the place where it naturally occurs in the narrative, and just be tagged as having a specific meaning. For ugly but important stuff (like unique identifiers that are not human readable and do not need to be rendered), there would need to be a means to hide such stuff from visibility.

It turns out, not surprisingly, that I am not the first to suggest HTML based narrative content; I gather it has come up on the HL7 Structured Documents mailing list, and again I point you to Keith Boone's excellent blog, where he too has been considering this sort of thing; see for example "A Wickedly Fresh look at CDA through Microdata and HTML5"; Grahame Grieve has also been considering XHTML for his FHIR effort (which, by the way, he has contributed to HL7, with the proviso that it remain open and free; he is one of the good guys). But both these guys (if I am understanding them right) are still thinking "wrapper" and "header", and encapsulating the (X)HTML, and not using the HTML as the entire and complete document in its own right; see for example Grahame's description of his DocumentHeader resource. There may well have been other prior work along similar lines; I haven't attempted to perform an exhaustive literature search.

Let's get concrete now, and see what this sort of thing might look like. I will start with a toy radiology report, and then move on to a real one, and then consider another simple document that contains something else like vital signs.

Suppose we have a very simple CT Chest report that contains only the following information:

We can create a very simple minimal static HTML document that would result in this content being rendered in an ordinary browser; i.e., this:


would result in this being rendered in a typical browser:

Ugly, but no worse than the typical plain text report one often sees. Obviously one can get more fancy and add prettier formatting, fonts, institutional details and the like, but this suffices for the purpose of a basis to start to semantically annotate this report.

So, now let's add a little RDFa markup that uses the existing identification and management information. Here we will do two things: mark up the information, and in some cases add alternative (hidden) values for data that needs to be in a "standard" format, in this case the dates and times:


This does not change the resulting rendered output in the browser in any way at all, but it does let us extract the information using an RDF-aware tool; here for example, I have pasted the above HTML source into the RDFa/Play tool, and the result is a visualization that describes the extracted semantic content:

Here is the so-called "Turtle" (Terse RDF Triple) form of RDF that is also extracted by this tool (in the Raw Data panel):

The point here is not to learn the specifics of RDF, but rather just to recognize that the information that one would normally find in the "header" of a CDA document (or PDF or DICOM document for that matter) is extractable, and most importantly using conventional, non-medically-specific tools. Also, though the markup is not terribly pretty, it is clear and straightforward.

Without getting into too much detail, the way this works is to specify one or more "vocabularies" that define the "typeof" something (in this case I have defined the document as being a "typeof" "RadiologyReport"), and the vocabulary is specified by the URL "". That "entity" (thing) then has "properties" (which are a subset of "relationships"), several of which I have enumerated with text values, such as the "patientName" property. I am sure you get the basic idea.
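A minimal sketch of the mechanism, in Python with only the standard library. The embedded markup is not the report markup discussed above: the vocabulary URL (http://example.org/radiology/), the "studyDate" property and all of the values are made up for illustration; only "RadiologyReport" and "patientName" are taken from the text. The extractor handles just flat properties on a single entity, which is far less than a real RDFa processor does, and it assumes any markup nested inside a property element has explicit end tags.

```python
from html.parser import HTMLParser

# Hypothetical markup: the vocabulary URL and the property values are made up.
REPORT = """
<html><body vocab="http://example.org/radiology/" typeof="RadiologyReport">
<p>Patient: <span property="patientName">Smith^Mary</span></p>
<p>Date: <span property="studyDate" content="2012-08-19">August 19, 2012</span></p>
<p>Findings: no abnormality.</p>
</body></html>
"""


class PropertyExtractor(HTMLParser):
    """Very small subset of RDFa Lite: flat properties on a single entity."""

    def __init__(self):
        super().__init__()
        self.vocab = None
        self.entity_type = None
        self.properties = {}
        self._open = None   # name of the property whose text is being read
        self._depth = 0     # nesting depth inside that property's element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.vocab = attrs.get("vocab", self.vocab)
        self.entity_type = attrs.get("typeof", self.entity_type)
        if self._open is not None:
            self._depth += 1
        elif "property" in attrs:
            name = attrs["property"]
            if "content" in attrs:
                # The content attribute supplies the machine-readable value;
                # the rendered text of the element is then ignored.
                self.properties[name] = attrs["content"]
            else:
                self._open, self._depth = name, 1
                self.properties[name] = ""

    def handle_endtag(self, tag):
        if self._open is not None:
            self._depth -= 1
            if self._depth == 0:
                self._open = None

    def handle_data(self, data):
        if self._open is not None:
            self.properties[self._open] += data


def extract(source):
    """Return (vocab, typeof, {property: value}) for one annotated page."""
    parser = PropertyExtractor()
    parser.feed(source)
    parser.close()
    return parser.vocab, parser.entity_type, parser.properties
```

Note how the date is rendered for the human reader in one form while the content attribute carries the "standard" form for the machine; extract(REPORT) returns the latter.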

What would need to be medically-specific would be the "vocabulary", and in this case I have defined a hypothetical one, in which I have started to use concepts from DICOM as properties of a hypothetical RadiologyReport document entity ("type").  What is more, there already exist many vocabularies that have been used for other semantic web activities (such as the Dublin Core for describing resources), as well as a more recent initiative in support of MicroData and RDFa (and reusable for either), specifically the effort. The latter, surprisingly enough, already includes some vocabulary intended for the markup of medical web pages, including a means of referencing existing medical codes through the MedicalCode entity. I don't know who was responsible for doing the medical vocabulary, but they have done a nice job so far.

Indeed, we can use some of the stuff as well as some of the more advanced RDFa features to mark up our primitive report to include more than just the identification and management data in text form; we can code the procedure type, and we can code the diagnosis and impression. Here we are also going to use the "rel" attribute rather than the "property" attribute, since it allows us to point to a vocabulary term (resource) as a value, rather than plain text (this requires RDFa, not just RDFaLite, by the way). Once again, the rendered HTML in the browser looks just as boring, but the extracted semantic content is richer:

There is a little more fluff in this example, in order to introduce the mechanism for defining more than one vocabulary (needs the "prefix" attribute), since I have shown the use of RadLex and ICD-9-CM as sources of codes for the report content and the clinical history, respectively, and I have also used a RadLex Playbook ID for the CT procedure code. Note that in some cases I have defined the values as URL based resources (a la and in other cases I have used the traditional coded "triplet" of value, scheme and meaning with I have also illustrated two different ways of hiding the coded content (using the "content" attribute or a "visibility:hidden" style; the former is probably preferable), and that the data can be typed (see the xsd:string type for one of the codes). This is what it looks like in RDFa/Play, visually:
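A minimal Python sketch of how the "prefix" attribute lets more than one vocabulary coexist: the attribute value is a whitespace-separated list of "name: IRI" pairs, and a value such as "radlex:RID0000" is expanded by replacing the prefix with its IRI. The IRIs and the code RID0000 below are placeholders rather than real RadLex or ICD-9-CM identifiers, and real RDFa processors apply more rules (predefined prefixes, terms, safe CURIEs) than this does.

```python
def parse_prefixes(prefix_attr):
    """Parse an RDFa prefix attribute value: 'name: IRI name2: IRI2 ...'."""
    tokens = prefix_attr.split()
    if len(tokens) % 2:
        raise ValueError("prefix attribute needs name/IRI pairs")
    mapping = {}
    for name, iri in zip(tokens[0::2], tokens[1::2]):
        if not name.endswith(":"):
            raise ValueError(f"prefix name must end with a colon: {name!r}")
        mapping[name[:-1]] = iri
    return mapping


def expand(term, prefixes, vocab):
    """Expand 'prefix:reference' using prefixes, or a bare term using vocab."""
    prefix, sep, reference = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference
    if sep:
        return term  # Unknown prefix: treat as an absolute IRI and leave it alone.
    return vocab + term


# Placeholder IRIs, for illustration only.
PREFIXES = parse_prefixes(
    "radlex: http://example.org/radlex/ icd9cm: http://example.org/icd9cm/"
)
VOCAB = "http://example.org/radiology/"
```

With these definitions, expand("radlex:RID0000", PREFIXES, VOCAB) yields an IRI in the RadLex placeholder namespace, while a bare term such as "patientName" falls back to the default vocabulary.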

and as Turtle:


Note that the visual rendering does not show the nesting ("chaining" in RDFa-speak), whereas the extracted Turtle does.

Another good tool for extracting the RDF content from the web page is the Green Turtle plugin for Google's Chrome Browser, which provides an extra tab that shows the semantic content when it is detected, and also provides an interactive graphic rendering of the semantic information that one can use to highlight specific nodes and their tuples (and which does illustrate the chaining visually):

OK, I am probably getting too deep into the technical details, so I won't go any further, since I don't want to detract from the primary message, which is that static HTML web pages with semantic annotation are sufficient to encode medical documents. To illustrate the point I will finish with two final examples.

The first is a "real world" radiology report that was distributed in plain text form via fax, which I then redacted to remove all identifying information (except the source institution), then recreated as a simple HTML page using limited formatting features, and to which I added the semantic markup for the identification and management information, as well as a limited amount of coded content along the same lines as the basic example above. I deliberately did not attempt to structure and code the bulk of the narrative, though this would obviously be possible, since I do not want to detract from the message that this mechanism is useful for existing reports, not just for future hypothetical structured content authoring systems.

Here is the original redacted content:

Here is a screenshot of the recreated HTML as rendered in a browser, with an attempt to preserve as much of the original formatting and style as possible, and with the insertion of synthetic values for names and identifiers and dates and times:

Here is the corresponding HTML source code:

and here is the Turtle extract:

To belabor the obvious, there is nothing radiology-report-specific about this approach, and so here is the final example of a single "vital signs" reading of body temperature, using an example from Bob Dolin's CDA R2 JAMIA article (his Figure 4); here is the browser rendering, extended a little from Bob's example to render the date and time:

and here is the HTML used to generate it, with the semantic markup:

and here is the RDFa/Play visualization of the RDF tuples extracted from it:

In this example, I have chosen to imply that the referenced vocabulary defines an entity that is a VitalSignBodyTemperature, since I wanted to show a model of date, value and units, but alternatively I could have just used a MedicalCode reference to a SNOMED code like the CDA example did (but since SNOMED is a closed standard too, let's not go there for now).

Obviously there remain many details to flesh out, such as what constraints on HTML to specify, what basic vocabulary elements are the minimum mandatory set (without getting too carried away trying to reinvent or entirely map DICOM or HL7 headers or XDS meta-data), and what style-related issues raise questions of what the attested rendered content is. But I think that these are probably all surmountable.

In conclusion, I believe that we can do away with CDA entirely and resort to a pure web-standard-based document format with medical specifics defined only in constraints in the form of schemas, templates and vocabularies. I am deliberately glossing over the effort or complexity of defining adequate vocabularies (and free ones at that), but we have a large playing field to choose from and decades of experience if we have to do it over.

Instances of such a document format could be distributed and exchanged by all of the normal "transport" mechanisms for documents, whether it be via HTTP, ordinary email, one of the NHIN DIRECT flavors of transport, IHE XDS or XDR, on physical media, etc. I.e., as with all "documents", the transport mechanism is independent of the "payload" that is the encoded document itself, and all of the usual additional (non-medical-specific) security and digital signature mechanisms could be applied as necessary.

I propose that we call such a new standard, which would define the necessary constraints, the "Simplified Medical Document" format (SMD, or SMDF perhaps), and resolve that it should be and should remain a truly open standard; if necessary, we can produce an entirely HL7-free solution for the whole world to use without fear of the evil empire and their lawyers.

PS. Before someone mentions it, and I saw this raised as a comment on one of Keith's blog entries, at first glance this might seem somewhat reminiscent of the controversial "tagged data element" approach mentioned in the much maligned PCAST report (which I too criticized previously for a host of reasons), at least until one digs into the details of what was being proposed. I still disagree with most of that report, especially its access control suggestions; and I am not suggesting that relevant meta-data not be pre-extracted and indexed in the normal manner. Rather, the semantic annotation approach here is largely just a repositioning and reformatting of conventional document "meta-data". Nor is there a need for a new "universal language" as suggested in that report, except to the extent (and perhaps one could interpret the report in that way) that specific vocabularies do need to be defined (or re-envisaged) to support the RDFa or MicroData approach. In particular, I am not suggesting, as the PCAST report does, that "each unit of data [be] accompanied by a mandatory metadata tag that describes the attributes, provenance, and required privacy protections of the data"; quite the contrary. The PCAST report essentially suggests a data-unit rather than document-centric approach, and that is most definitely NOT what I am proposing. That said, the semantic annotation approach would indeed allow for "tagged data elements" to be extracted from documents as necessary, and they would be meaningful as long as sufficient parent and sibling context was also extracted and remained related. To be fair, there is probably some point at which the opposite approaches converge into a similar solution, however.


dee said...


OK I buy your arguments.....Soooo, are you proposing that the structured content of a DICOM SR is expressed using the methodology described above? I am assuming that the mapping from a complex SR content tree is not as straightforward as the simple headings and finding identification described in the post. If this is meant to be a universal format for reporting, evidence documents created by automated systems should be trans-code-able and exchangeable.


David Clunie said...

Hi Dee

Certainly transcoding an "evidence document" style SR (like an echocardiography or radiation dose report) to HTML+RDFa is pretty straightforward, especially if one uses the MedicalCode approach to transcoding all the coded concepts. Nesting of hierarchical content is straightforward; relationships by reference are possible I believe, but I haven't tried them out yet (and don't know what the tooling does with them; like @rel rather than @property, that probably goes beyond RDFaLite and requires more RDFa features).
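A rough Python sketch of the nesting part only, assuming the content tree has already been read into plain dictionaries. The dictionary layout and every type and property name below (DoseReport, irradiationEvent, ctdiVol and so on) are hypothetical and are not taken from DICOM or from any existing vocabulary. It relies on the RDFa behaviour that an element carrying both "property" and "typeof" starts a new nested entity that becomes the value of that property; an enclosing element would still have to supply the "vocab".

```python
from html import escape


def to_rdfa(node, indent=0):
    """Render one content item (a dict) and its children as nested RDFa markup.

    Returns a list of lines. Hypothetical node layout:
      {"type": ..., "property": ... (absent on the root),
       "values": {name: text}, "children": [node, ...]}
    """
    pad = "  " * indent
    attrs = f'typeof="{escape(node["type"])}"'
    if "property" in node:
        # property + typeof on one element chains a new entity to its parent.
        attrs = f'property="{escape(node["property"])}" ' + attrs
    lines = [f"{pad}<div {attrs}>"]
    for name, value in node.get("values", {}).items():
        lines.append(
            f'{pad}  <span property="{escape(name)}">{escape(str(value))}</span>'
        )
    for child in node.get("children", []):
        lines.extend(to_rdfa(child, indent + 1))
    lines.append(f"{pad}</div>")
    return lines


# Hypothetical two-level tree, for illustration only.
TREE = {
    "type": "DoseReport",
    "values": {"patientName": "Smith^Mary"},
    "children": [
        {
            "property": "irradiationEvent",
            "type": "IrradiationEvent",
            "values": {"ctdiVol": "12.5"},
        }
    ],
}
```

Joining the lines returned by to_rdfa(TREE) with newlines gives a div for the report containing a nested div for the event, mirroring the parent and child relationship of the source tree.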

I will post more examples when I have done some more experiments along these lines.


Manu Sporny said...

Hi David,

I'm the current chair of the RDFa Working Group at the World Wide Web Consortium, editor of the HTML5+RDFa spec, and the chair of the JSON-LD CG at W3C.

Absolutely /fantastic/ blog post. It is so refreshing to see somebody applying RDFa to something that could help millions of people. You've also done an excellent job at understanding the spec and showing how it could be used to achieve a truly open medical record format.

Medical records were one of the use cases that we thought through when doing RDFa. In fact, I have been personally involved in writing software for a (now failed) electronic medical record company. We didn't have RDFa at the time, and it was one of the events that led me to work on the RDFa stuff in more depth.

If you like RDFa, you may also be interested in JSON-LD:

Data extracted from RDFa documents can be stored and transmitted (losslessly) in JSON-LD. In fact, the Drupal community, as well as Wikidata (a Wikipedia project), are looking into JSON-LD as the data transport mechanism that they use. The reason I point this out is because there is a complete ecosystem around RDFa. You can not only mark up data, but you can extract it and store it directly into a database (like MongoDB) using JSON-LD. That allows you to query and process the data more easily.
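A small Python sketch of what that could look like for a single extracted entity, using the JSON-LD "@context", "@vocab" and "@type" keywords. The vocabulary URL, the type and the property values are hypothetical stand-ins, and the function name is made up for this example.

```python
import json


def to_json_ld(vocab, entity_type, properties):
    """Serialize one extracted entity as a JSON-LD document (a JSON string)."""
    doc = {
        # @vocab makes bare property names resolve against the vocabulary.
        "@context": {"@vocab": vocab},
        "@type": entity_type,
    }
    doc.update(properties)
    return json.dumps(doc, indent=2, sort_keys=True)


# Hypothetical values, for illustration only.
EXAMPLE = to_json_ld(
    "http://example.org/radiology/",
    "RadiologyReport",
    {"patientName": "Smith^Mary", "studyDate": "2012-08-19"},
)
```

The resulting string is ordinary JSON, so it can be stored in a document database or passed to any JSON tooling without further conversion.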

If you have any questions, both the RDFa and JSON-LD communities would be very happy to help. Just drop us a line here:

or here:

or e-mail me directly: