Sunday, June 17, 2007

Where to get images for research and testing - Public collections, routine re-use, and the possibility of direct patient contributions

Summary: Large useful collections of publicly accessible medical images for testing and research are few in number; despite public initiatives to build such collections, progress is slow though improving; the additional possibility of having individual members of the public contribute their own images and data directly has been raised; logistic and legal concerns are significant but surmountable, and there would seem to be few privacy and human research regulation issues.

Long version:

I have long fantasized about the existence of a large collection of complete sets of images suitable for research and testing purposes, whether it be for testing image pixel data for different types of compression, display, analysis, or similar studies, or for more mundane tasks like checking for DICOM compliance or testing DICOM-capable tools like PACS and workstations against the installed base of equipment. Indeed, I first developed an interest in DICOM in the early nineties not for clinical interchange, but as a means of formatting and organizing my own teaching and research collections. Little did I know where that would lead !

Traditionally in academic research studies, one begins with a laborious exercise of collecting patient-related images prospectively or retrospectively; this often involves multi-site collaboration, approval by Institutional Review Boards (IRBs), etc.; this is very expensive, time consuming, and frankly, beyond the capabilities of many scientists, engineers, programmers and students who just want to test their ideas, algorithms and code. Further, the folks who need the images may not have the academic affiliations, credibility or stature to even get to first base as far as funding or approval is concerned.

Some of us are fortunate enough to be actively engaged in large scale multi-center clinical trials and industry testing collaborations and we can often find ways of re-purposing and reusing images gathered for other purposes, with the appropriate approvals and permissions. This avenue is not open to many folks who need images though. Some of the NIH folks are keen to remedy this problem by recruiting images from other studies and making them publicly accessible via such mechanisms as the National Cancer Imaging Archive (NCIA) and the Alzheimer's Disease Neuroimaging Initiative (ADNI) projects, to name just two of several. These projects emphasize the importance of gathering not just any images, but complete sets, in a relatively homogeneous manner with respect to acquisition protocol, at multiple time points in the course of diseases that need to be followed over time, and with additional related data, such as experts' assessment of lesion location and outcome and historical data where relevant. Such efforts still require significant resources and involve sometimes difficult negotiations with respect to funding and permission.

Another option that I have considered in the past is to somehow capture images and associated information as a "side effect" of routine clinical use. For example, many facilities are partially or totally digital already, with respect to images, diagnosis codes and reports if not the entire medical record. Further, many such sites already use "off-site storage" provided by third-parties either as their primary archive or to support disaster recovery. Would it be a difficult step to go a little further and automatically collect and de-identify all such image and related data and make it publicly available for research ?

From a legal perspective, possibly all it would take would be for the facility to add consent and authorization for such routine (as opposed to prospectively identified) re-use purposes; however, each IRB would undoubtedly weigh in with policy and risk-management related issues that might be difficult to get by. And frankly, many physicians might feel threatened by releasing what they otherwise consider their proprietary material, which potentially provides them with a competitive advantage with respect to grant applications and publishing papers. To put it another way, one would need to provide a facility with one hell of an incentive to get by the obstacles that naysayers might raise.

One such incentive might be to provide free or really cheap storage; how many CFOs or CIOs would drool over the possibility of reducing or eliminating bulk data storage costs if a third party (such as a non-profit organization established for the benefit of the public research community) were to underwrite these costs, on the proviso that their de-identified form be made available ? Such an incentive might serve to significantly undermine any opposition within an institution. It might be possible to leverage the capabilities of existing commercial providers of off-site archives, who could offer a reduced price for such data sets. Conversely however, less well intentioned folks might see this as a commercial opportunity and explore the possibility of selling the data instead of making it publicly available for free.

Some existing archive providers also provide the opportunity for patients to contribute and maintain their own images, allowing access to their health care providers as appropriate, myNDMA being an example (though I noticed as I was researching this post that myNDMA are "accepting no new registrations at this time"). The concept of patient empowerment and patient-centric control of one's own destiny is perhaps a concept whose time has come, though obviously only a subset of the population will be willing to or capable of taking on such responsibility. An example of extending this concept to one's entire record is the MedCommons project.

On a previous occasion, frustrated by the difficulty of getting images from a broad range of installed modalities to test DICOM software, I had considered setting up a publicly accessible archive that would also allow anybody from the public at large to contribute. My plan was to canvass the community of digital imaging and PACS users as well as ordinary people undergoing imaging to submit material that I would then de-identify and make available for testing. At the time my primary interest was in the "DICOM-ishness" of the data and not the research applications, though I was interested in complete sets rather than individual images. I did not pursue this, since about the same time NEMA was initiating an effort to gather images from modality vendors for similar sorts of testing (the NEMA DICOM Object Library). However I was sorely disappointed when, despite my strong protests, the NEMA vendors decided to keep this a closed and secret database not accessible to non-NEMA members or the public, which it remains to this date. Bet you didn't even know about it, did you ?

However, I was reminded of the possibility of direct patient contribution to image archives at a recent Cancer Research and Prevention Foundation Lung Cancer Workshop, during which the concept of approaching patients, people undergoing screening, and survivors for image contributions was raised. A lively conversation among the participants ensued, led by Jim Mulshine, David Yankelevitz and Rick Avila. In essence, most of the attendees were quite excited by this concept, particularly since there is an opportunity to leverage the good will of the survivor-driven charitable organizations to organize and promote such an activity. KitWare has kindly volunteered to coordinate some of this work and you can follow along on their Wiki once it gets under way. Though this was discussed in the context of lung cancer, and particularly with respect to gathering images for CAD testing and validation, the concept is obviously generalizable.

For example, in lieu of there being a good publicly available collection of images for digital (as opposed to digitized) mammography image compression research, one might consider attempting to build such a collection with the assistance of contributions from individual women. One of the obvious problems with this is the relatively low prevalence of disease; i.e., one might receive far more normal contributions than abnormal, which makes performing research on disease-enriched data more difficult, or conversely, means storing and curating a large amount of data for a relatively low yield of useful information. However, unlike the unfortunate situation for lung cancer, a far higher proportion of women either have a negative biopsy or survive their disease, and potentially a high yield of images with positive findings could be obtained from this group.

Another problem is the matter of gathering additional outcome data; for many types of experiment it is necessary to have some knowledge of the truth beyond what can be ascertained from the images themselves. Contribution of pathology reports and/or follow-up images would be desirable. The former presents problems in that these reports are less often accessible to patients (or screening participants) in digital form, though perhaps they could be scanned or faxed. The latter might be contributed on a separate occasion, but if de-identified, how are they to be linked to the same (anonymized) individual ?

In general, the problem of reliable de-identification and anonymization (or pseudonymization) on a large scale is hard. Sure, one can clean the DICOM header information well enough, especially if one can discard most of the descriptive string attributes and private attributes without affecting reuse, though even that is non-trivial in the general case. Identifying information burned in to the pixel data can at least be detected in a subset of images (by automated algorithms examining header patterns as well as OCR-like analysis of pixel data), which can then be sequestered for manual review. Anything that is not an image though, such as a scanned or faxed document, or even a PDF, HL7 plain text or DICOM structured report, will likely require manual (and hence error prone) attention. The resource burden of manual de-identification (and of the QC process to check on it) is not to be underestimated.
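
As a rough illustration of the header cleaning and sequestering steps just described, here is a minimal Python sketch. It assumes the header has already been parsed into a plain dictionary keyed by (group, element) tag; the function names, the short list of attributes to drop, and the heuristic for suspecting burned-in text are illustrative choices of mine, not a complete or validated de-identification profile.

```python
# Minimal sketch only: the header is modelled as a dict keyed by
# (group, element). A real tool would use a DICOM toolkit and a
# much longer attribute list.

# Standard attributes that commonly carry identity (illustrative subset).
IDENTIFYING_TAGS = {
    (0x0008, 0x0050),  # Accession Number
    (0x0008, 0x0080),  # Institution Name
    (0x0008, 0x0090),  # Referring Physician's Name
    (0x0008, 0x1030),  # Study Description (free text)
    (0x0010, 0x0010),  # Patient's Name
    (0x0010, 0x0020),  # Patient ID
    (0x0010, 0x0030),  # Patient's Birth Date
}

MODALITY = (0x0008, 0x0060)
BURNED_IN_ANNOTATION = (0x0028, 0x0301)

# Assumption: modalities whose images often have text in the pixels
# (ultrasound, and "other", which is often used for captured screens).
SUSPECT_MODALITIES = {"US", "OT"}


def is_private(tag):
    """Private attributes are those with an odd group number."""
    group, _element = tag
    return group % 2 == 1


def clean_header(header):
    """Return a copy without identifying or private attributes."""
    return {
        tag: value
        for tag, value in header.items()
        if tag not in IDENTIFYING_TAGS and not is_private(tag)
    }


def needs_manual_review(header):
    """Flag images whose header suggests burned-in identification."""
    if str(header.get(BURNED_IN_ANNOTATION, "")).upper() == "YES":
        return True
    return header.get(MODALITY) in SUSPECT_MODALITIES
```

Images for which needs_manual_review() returns True would be set aside for a human to look at, and the rest would go on to any OCR-like check of the pixel data; the header test alone is a cheap first filter, not evidence that the pixels are clean.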

One approach would be to have the contributor themselves actually perform the de-identification by providing them with the appropriate web-deployed tool to use to contribute, view and edit the content; that way they could both do the work and absolve the archiver from future responsibility in this respect. Indeed, if all the work were performed client-side, the central server would not ever need to have access to or knowledge of the actual Protected Health Information (PHI), which might considerably simplify the necessary security measures. Continuity across contributions would be more difficult but could be achieved with some sort of registration or identity hash based mechanism. It would be a shame if this additional burden were to prove a disincentive to contribute, though.
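
The identity hash idea could be sketched along the following lines in Python, using only the standard library. Everything here is an assumption for illustration: the choice of PBKDF2, the ARCHIVE_SALT value, the iteration count and the function name are hypothetical, and in the client-side scheme described above this derivation would run on the contributor's own machine, so that only the resulting pseudonym is ever sent to the archive.

```python
import hashlib

# Hypothetical archive-wide salt distributed with the contribution tool.
ARCHIVE_SALT = b"example-archive-salt"
# Illustrative iteration count; a higher value slows guessing of phrases.
ITERATIONS = 200_000


def contributor_pseudonym(secret_phrase):
    """Derive a stable pseudonym from a phrase only the contributor knows."""
    digest = hashlib.pbkdf2_hmac(
        "sha256",
        secret_phrase.encode("utf-8"),
        ARCHIVE_SALT,
        ITERATIONS,
    )
    # Shorten the hex digest to a more manageable identifier.
    return digest.hex()[:32]
```

The same phrase always yields the same pseudonym, so follow-up images or a pathology report contributed later can be filed with the earlier contribution without the archive learning who the contributor is. The cost is that a forgotten phrase breaks the link, which is part of the extra burden on the contributor.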

Thorough de-identification in the general case remains non-trivial though, especially if one goes so far as to consider facial information possibly recognizable from a 3D rendering of images of the head; there are means to disrupt the data to prevent this, but that would make it useless for many (though not all) potential future uses. Though trials on the matter of recognizability are currently under way, there is no consensus on this yet, and perhaps it would be easiest just to have the contributor consent around this issue.

Indeed, on the matter of consent, this might be more challenging than all the procedural and technical and resource issues put together. One would have to be sure that the contribution agreement would stand legal scrutiny, cover all potential uses of the data, irrevocably, and allow for the archive maintainer to disclaim any liability. Liability might include not only privacy concerns, but also responsibility to feed back any findings with respect to the data to the contributor. For example, in the case of CAD testing, one would not want the contributor to have the (unrealistic) expectation that if a future CAD experiment found something undesirable that they would receive feedback that would impact their care. Such an agreement would somehow need to be "signed", presumably, to have any legal standing, and a mechanism to do this via the web at the time of contribution and to archive the signature would be necessary.

Note that I distinguish the matter of the individual contribution agreement with respect to permission and liability from the matter of permission from others. To my knowledge, at least in the US, there are no regulations that would govern the establishment of such a repository of images. Whilst the HIPAA Privacy and Security rules might provide helpful guidance, the repository would not in and of itself be a Covered Entity, and hence would not be subject to the rules. Further, since contributions would be directly from individuals rather than Covered Entities, no HIPAA provisions on the sending side would come into play.

Would some form of IRB approval be required, either to contribute, maintain or to use any of the data ? The US federal regulation on Protection of Human Subjects, which potentially applies to federally funded activities, specifically exempts "research involving the collection or study of existing data ... if these sources are publicly available or if the information is recorded ... in such a manner that subjects cannot be identified ..." (45 CFR 46.101(b)(4)).

However, whilst there might be no formal need for an IRB approval, review of the policies and procedures and agreements by some form of central IRB might well be worthwhile to mitigate any concern that the rights of the contributors are being abused. Perhaps the NCI's Central IRB (CIRB) Initiative might be willing to take on this responsibility. One could envisage drafting a set of standard "open source" pre-approved documents that would allow any number of willing organizations to implement and replicate this strategy.

This is of course a somewhat US-centric view of the privacy and human research situation biased by my own experience; since any such repository might be open to global contributions, a further analysis of the issues in other countries is desirable.

But the bottom line is that there would seem to be few if any restrictions on a person who has access to their own record in electronic form using it in any manner they see fit, and hence contributing it to such a research collection for the public good. Whilst one may debate about who actually "owns" the data, I hope few would be so crass as to attempt to restrict an individual's use of their own personal data in such a manner.

What remains now is for those of us who see merit in this approach to take action to make it happen, and in such a manner that the data becomes useful in advancing the state of the art.



Unknown said...

As a PACS test engineer I can vigorously affirm that good testing - and design for that matter - stems principally from the availability of good ‘field’ data. Without undue effort, a good test archive enables verification of a wide range of realistic clinical scenarios and hence reduces the chances of nasty surprises at deployment.
Currently, such data is like gold dust in my organization.
This proposed public collection is therefore a great objective - and your ideas for getting there are pragmatic and informed.
And yes, I would definitely contribute my own health record to such an archive (but perhaps I am biased).

Unknown said...

Dave, I think it's a great idea to have a public collection of data.

I'm on the ultrasound side of things now - miss the good old RadPharm MRI days :-). Anyway, I find myself getting scanned to get some test data, which I'd be willing to share; assuming, of course, that this ever gets started. I'd even be willing to contribute some time and possibly moderate the site as well. Let me know.
I think you'd be surprised how many folks will be willing to participate in this project.

Cheryl said...

Hi David! I'm a nurse in interventional radiology. I think a database such as you describe would be great! I work for a huge research and teaching hospital and feel that something like that would be an asset to facilities such as mine.