In our first two posts we looked at the principles of building standards in a rapidly innovating field like modern biological and biomedical imaging. We also reviewed the technical concepts we have used to build the OME Data Model and Bio-Formats, to explain OME’s efforts to deliver a common library for accessing scientific image data. In this next entry, we describe how we work with the community to build Bio-Formats.
Bio-Formats is a Java library, available under an open source license (GPL version 2). It plugs into many other software tools, including ImageJ, MATLAB, KNIME, and CellProfiler.
We distribute Bio-Formats under an open source license so that the technology is available for the widest possible set of use cases, and to allow us to engage, hopefully rapidly and productively, with the community. Scientists and software developers can use Bio-Formats for whatever purpose they wish and can make whatever changes they need to its source code. The architecture of Bio-Formats allows it to be used as a “plug-in”, enabling any tool that uses it to access a large number of proprietary file formats (PFFs). This flexibility has resulted in substantial uptake of Bio-Formats: it is installed at many thousands of sites worldwide and is started more than 10,000 times each day.
The success of the Bio-Formats project is heartening, but what really differentiates it is the way its development process links in with the community.
As noted above, there are several hundred PFFs in biological and biomedical imaging, and of course many more image formats in the wider scientific community, for example in astronomy, remote sensing and geology. We are not aware of a definitive count of scientific image file formats, but the total is likely at least several hundred.
This panoply of PFFs raises the question of how a project like Bio-Formats could ever succeed. During the initial phases of Bio-Formats development, ca. 2002-2004, Kevin Eliceiri and Jason Swedlow approached several commercial imaging system vendors, requesting specifications for each vendor’s PFFs (at that time, several imaging companies each maintained several different imaging formats for different imaging modalities). Proposing to build software against these PFF specifications made sense. Simultaneously, the team at LOCI began reverse-engineering PFFs they had access to and including support for them in the early versions of Bio-Formats as well. It rapidly became apparent that specifications, in the few cases where they were available, were sometimes out of date, often incomplete and, critically, didn’t cover all the details and complexities that were revealed in real examples of PFFs received from various scientists. The most valuable source of information about PFFs was the files generated by scientists performing real experiments.
The effect of that realization was profound. Over the years, simply by asking Bio-Formats users over and over (and over and over…) again to “send us the data”, the OME team has built up a substantial library of datasets submitted by users. Currently we hold over 30,000 datasets amounting to more than 2 TB of image data. Each of the represented PFFs has been reverse engineered and supported in Bio-Formats. Since 2012, we have used essentially this whole library of images to test Bio-Formats on a daily basis (to view this testing live, see the OME CI system). This repository is by no means complete and, as noted above, the datasets we hold are at best barely up to date with the latest changes in file formats and the newest imaging technologies. It is, however, unquestionably a great reference for our work.
Rarely a day goes by when the OME team is not contacted about some issue or problem with the use of Bio-Formats. In 2014, most problems involve a file format we attempt to support but currently mishandle. Sometimes this is a mistake on our part (a bug), but often we were simply not aware of a particular usage of metadata.
We have observed a trend towards coalescence of file formats by the commercial vendors. Previously, each vendor created and supported multiple file formats, one for each imaging system they manufactured. More recently the vendors have tried to coalesce these multiple file formats into single (corporate) PFFs that support all of the different modalities they manufacture. This is certainly an attractive goal and initially we believed it would make our work much easier.
However, the jury is still out on the impact of this trend. We now routinely see multiple versions of the “same” PFF, because the different imaging systems manufactured by a single company use different implementations of the metadata structures. Thus while the file extensions have unified, the diversity of image file sets still seems to be increasing. For Bio-Formats, this trend has created a problem: a file that appears to be supported, judging by its file name or extension, may in fact not be readable, simply because we have never received an example of that particular version of the PFF. The only solution we have for improving this situation is to work more closely with the commercial vendors to get a better specification of how their different imaging systems are supported in their file formats.
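The gap between "the extension matches" and "this version is supported" can be made concrete with a small sketch. Everything in it is hypothetical: the `.vnd` extension, the `VNDF` magic bytes, the one-byte version field and the set of known versions are invented for illustration, and this is not how any real vendor format or Bio-Formats reader is laid out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: not Bio-Formats code and not a real vendor format.
class FormatVersionSketch {

  enum Support { UNKNOWN_FORMAT, SUPPORTED, UNSEEN_VERSION }

  // Assumed header layout: 4 magic bytes "VNDF" followed by one version byte.
  static final byte[] MAGIC = "VNDF".getBytes(StandardCharsets.US_ASCII);

  // Versions for which example files have been received (illustrative values).
  static final Set<Integer> KNOWN_VERSIONS = Set.of(1, 2);

  static Support classify(String fileName, byte[] header) {
    boolean extensionMatches =
        fileName.toLowerCase(Locale.ROOT).endsWith(".vnd");
    boolean magicMatches = header.length > MAGIC.length
        && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    if (!extensionMatches || !magicMatches) {
      return Support.UNKNOWN_FORMAT;
    }
    // Same extension and same magic bytes, but the version decides
    // whether we have ever seen an example of this variant.
    int version = header[MAGIC.length] & 0xFF;
    return KNOWN_VERSIONS.contains(version)
        ? Support.SUPPORTED
        : Support.UNSEEN_VERSION;
  }

  public static void main(String[] args) {
    byte[] v2 = {'V', 'N', 'D', 'F', 2};
    byte[] v3 = {'V', 'N', 'D', 'F', 3};
    System.out.println(classify("cells.vnd", v2)); // SUPPORTED
    System.out.println(classify("cells.vnd", v3)); // UNSEEN_VERSION
  }
}
```

Both files carry the same extension, yet only one is readable. A reader that decides on the name alone cannot tell them apart, which is why example files for each version matter so much.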
Over the years the OME Data Model has grown to support many different imaging modalities. In the near future, OME’s goal is to continue to expand the imaging modalities supported by Bio-Formats, especially to target new 3D imaging technologies (e.g. LSFM, HREM, and OPT). We also note an increasing scientific drive towards correlative imaging, in which a single sample is imaged at two or more different resolution levels using two separate, complementary modalities. The metadata structures that relate to these different modalities must be supported. To this end, we are actively updating the OME Data Model and Bio-Formats to support units (for pixel sizes, camera speed, temperature, etc.). This is part of an explicit effort to build the foundation for expanding Bio-Formats’ capabilities into new domains and to express the linkages between different domains.
One of the challenges of building and maintaining Bio-Formats is the scale of its user community. The number of requests for updates to Bio-Formats that arrive each day swamps our current development capacity. We are currently seeking funding to grow the number of developers on the Bio-Formats team and hope to expand the team in 2015.
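To see why explicit units matter for correlative imaging, consider a minimal sketch in which every pixel size carries its own unit. The names `LengthUnit` and `PixelSize` and the example values are assumptions made for this illustration only; they are not the unit types of the OME Data Model or Bio-Formats.

```java
// Hypothetical sketch: illustrative types, not the OME Data Model's unit classes.
class UnitsSketch {

  enum LengthUnit {
    NANOMETRE(1e-9), MICROMETRE(1e-6), MILLIMETRE(1e-3);

    final double metres; // size of one unit expressed in metres

    LengthUnit(double metres) {
      this.metres = metres;
    }
  }

  // A pixel size that always travels with its unit.
  static final class PixelSize {
    final double value;
    final LengthUnit unit;

    PixelSize(double value, LengthUnit unit) {
      this.value = value;
      this.unit = unit;
    }

    // Convert to another unit by going through metres.
    PixelSize to(LengthUnit target) {
      return new PixelSize(value * unit.metres / target.metres, target);
    }
  }

  // How many pixels of the fine image span one pixel of the coarse image.
  static double scaleRatio(PixelSize coarse, PixelSize fine) {
    return coarse.to(LengthUnit.NANOMETRE).value
        / fine.to(LengthUnit.NANOMETRE).value;
  }

  public static void main(String[] args) {
    // Illustrative values: a light microscopy image and an electron
    // microscopy image of the same sample, recorded in different units.
    PixelSize light = new PixelSize(0.1, LengthUnit.MICROMETRE);
    PixelSize electron = new PixelSize(5, LengthUnit.NANOMETRE);
    System.out.printf("scale ratio ~ %.1f%n", scaleRatio(light, electron));
  }
}
```

Without the unit attached, the two raw numbers (0.1 and 5) would suggest the wrong relationship between the images. With it, the two modalities can be related to each other directly.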
Most of the time, a request for support for a new or updated PFF comes as a simple request for work. We accept these in the best spirit of open source development: the user submitting the request needs this update to Bio-Formats to support their scientific progress. However, we now require any request to be supplemented with example files. Without these we simply cannot authoritatively perform any testing or updates. Of course, we always welcome code submissions or guidance on the nature of the problem. Indeed, several individuals have contributed to Bio-Formats over the years. These contributions are highly valued, as they slowly spread the expertise of working with Bio-Formats. The next step in this progression will be to distribute the burden of supporting different PFFs to different individuals and entities, while maintaining the commitment to code review, QA and testing that we have developed over the last several years. Watch this space… or maybe get involved!
As noted above, the rapid pace of innovation and scale of usage means that the Bio-Formats team is always facing a very long queue of requests, bugs, etc. (in software management-speak, our “backlog”). We treat this queue somewhat flexibly. When we see very large numbers of requests around a single file format, we will prioritize that file format — more users means more impact for our software. Sometimes there is work to do that is a priority for us — adding support for units or a whole new imaging domain are examples. Inevitably, there are times when updates to PFFs take much longer than we would like. Anyone who makes a request to the Bio-Formats development team can be cc’d on the tickets that track our progress and is always welcome to contact the team to check on status. Repeated requests from a single individual do count, but we are strongly influenced by similar requests from multiple individuals.
As we work to provide a single common access mechanism for scientific image data, please do send us examples of the file formats you are using. Get involved if you can: help us test, send us feedback, or even develop new solutions. Building a strong consortium of academic, industrial and commercial partners is the best way to make the project successful and useful for our whole community.
— November 14, 2014