Why Nightingale built a computing platform for researchers
February 23, 2023
Josh Risley
CTO, Nightingale

At first glance, people are often curious why a research data repository like Nightingale would also build a computing platform that seemingly confines researchers to our hosted environment. Wouldn't it be easier for everyone if you could download data or plug in your own cloud account?

The answer is quite simple. Yes, that would be much easier. Unfortunately, there are precious few downloadable medical datasets.

It's easy to understand why not. While healthcare institutions are quite literally drowning in data, they are profoundly and justifiably incentivized to protect patient privacy. When a large public or private institution thinks about exposing medical research data to the public, the risks are immediate while the upsides to the institution are indirect at best. And so after considering the balance of risk and reward, despite the goodwill that decision-makers might feel toward the research community, it's very difficult for them to justify making de-identified research data publicly available.

Instead, what actually happens is this: A prominent and well-funded professor at a prestigious university spends years trying to get access to research data. They spend hundreds of thousands of dollars. They employ lawyers and negotiate with software vendors and IT contractors. Occasionally, they even succeed. If they get access, it will be for themselves and a tiny research team. They might publish a result. If they do, you can see their paper and maybe some code, but not the data they used.

As we've written before, this situation hinders progress in the field of computational medicine. Without a different pathway for medical data to reach the research community, access will continue to be limited to a privileged few who have big endowments and the institutional pull necessary to get data.

The Nightingale platform exists to provide one such alternative pathway—one that reshapes the risk profile for healthcare decision-makers. In a sense, it wasn't built to satisfy researchers, although clearly that's the end goal. We actually built the platform to mitigate risks assumed by health systems so that they could, in turn, help satisfy the research community's increasing demand for more and larger datasets.

In our first year, Nightingale and partnering health systems published five new datasets totaling nearly 150 terabytes. That could easily double in the next year or two. It's clear that this model works for a growing list of contributing healthcare institutions, and we’re continuing to expand our data catalog.

We have user-researchers from more than 40 countries around the world, most of whom come from academic institutions other than the U.S. giants. And this reveals another important way the Nightingale platform benefits the research community.

[Photo: server rack and wires, by Massimo Botturi on Unsplash]

To illustrate, consider our breast cancer pathology dataset, which at more than 140 terabytes is probably the largest such public dataset in the world. Few researchers have thought about what it takes to store a dataset of this size, let alone sustain the I/O throughput it demands. High-performance storage at that scale is complex and expensive, and even the best-funded labs may struggle to buy and implement the necessary solutions. (If you’re curious about what an economical solution would entail, imagine 24 servers with two 3-terabyte disks each, plus a metadata server, plus rack, power supply, cooling, and networking equipment.) For a research project that might only last a few months, it probably isn’t worth it.
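To make that parenthetical concrete, here is a rough back-of-the-envelope sketch. The server and disk counts come from the description above; the per-disk throughput figure is an illustrative assumption, and the sketch ignores redundancy, filesystem overhead, and network limits, so treat it as a ballpark rather than a design.

```python
# Back-of-the-envelope math for the storage layout described above.
# Server/disk counts come from the post; per-disk throughput is an
# assumed illustrative figure, not a measured number.

DATASET_TB = 140            # breast cancer pathology dataset size
SERVERS = 24                # storage servers (plus a separate metadata server)
DISKS_PER_SERVER = 2
DISK_TB = 3                 # capacity of each disk
ASSUMED_DISK_MBPS = 200     # assumed sequential throughput per spinning disk

raw_capacity_tb = SERVERS * DISKS_PER_SERVER * DISK_TB
aggregate_mbps = SERVERS * DISKS_PER_SERVER * ASSUMED_DISK_MBPS

# Time for one full sequential pass over the dataset,
# assuming every disk streams in parallel with no overhead.
hours_per_full_pass = (DATASET_TB * 1e6) / aggregate_mbps / 3600

print(f"Raw capacity: {raw_capacity_tb} TB vs. {DATASET_TB} TB dataset")
print(f"Aggregate throughput (assumed): {aggregate_mbps / 1000:.1f} GB/s")
print(f"Hours for one full read of the dataset: {hours_per_full_pass:.1f}")
```

Note that 144 TB of raw capacity barely fits a 140 TB dataset with no replication at all; any real deployment would need more disks for redundancy, which only strengthens the point about cost and complexity.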

Yet 140 terabytes is small compared to what researchers really want in this domain. For this sort of dataset and the types of algorithms currently being investigated, we'll soon be talking about exabyte scale. And so even for rich universities, the Nightingale platform provides a commons that yields much-needed efficiencies. For everyone else, it makes the impossible possible.

Storage is just one example. How would you onboard fifty or one hundred researchers from institutions around the world to collaborate in real time on a 150-terabyte dataset? (Say, in response to a global pandemic.) With Nightingale, your team would be up and running in minutes.

So if you initially bristle at the idea of being confined to a remote research environment, that’s understandable. After all, you might be someone who has access to all the computing power you could ever want. But you probably don’t have hundreds of terabytes of excess storage capacity right now, and you probably have better ways to spend your research time than begging for data. So it’s a trade-off.

We exist to support the research community, from pioneers in the field to those who are just now taking their first steps and those who are instructing them. Please help make Nightingale OS better by telling us about your experience with the data and our computational environment. We are a tiny, nonprofit team, but we've come a long way already, and we want to do everything we can to help you use Nightingale data as productively as possible.
