In talking with AI innovators who are tackling some of the most complex challenges in healthcare, three things are clear:
- Development of advanced solutions that can improve outcomes across today’s large and diverse patient population is slowed by insufficient access to real-world data;
- Validation of new AI-based healthcare solutions is hampered by the challenges associated with managing disparate data sets across multiple sites and systems; and
- Maintenance of models over the lifetime of an AI-based solution is not yet up to par; it will require not only ongoing access to large amounts of data but also infrastructure and tools for regulatory compliance as a model evolves.
At the heart of these challenges lie the issues of data sharing and patient privacy. Protecting personally identifiable information (PII) is obviously of the utmost importance. But this doesn’t mean that AI development must be slow. The future of healthcare requires the development of advanced solutions in a way that embraces data privacy.
Today, hospital researchers and AI developers typically collaborate in one of three ways:
- Direct Data Sharing: Many companies developing healthcare AI solutions will buy data from a small number of hospitals. This works quite well for developing prototypes of AI models. But it doesn’t scale. Striking these data agreements requires a large amount of time and resources from both the companies and the hospitals. This may have worked when only a handful of companies were developing healthcare AI, but it won’t continue working now that there are hundreds, and soon thousands, of them.
- On-Premise Development: Today, there’s some promising AI innovation happening within specific hospitals (primarily large academic medical centers), with researchers utilizing the data they have readily available in their institutions. Translating this research into clinical-grade products, however, is not easy. In addition, these models trained on narrow data sets don’t perform well in clinical settings with diverse patient populations.
- Shared Data Lakes: Anonymizing data and moving it to a shared data lake works well in some instances. But there are some very real constraints to this approach – both from a technical and a business perspective. Much of the data most useful to creating AI-based diagnostics is imaging-centric – think CT scans, MRIs, and digital pathology scans. Moving petabytes of data in and out of the cloud is a heavy and expensive lift. It requires new business operations and governance processes. Even if the adoption of cloud accelerates, there is virtually no scenario in which all the world’s hospitals agree to pool their data into a single shared data lake. So we’ll inevitably be left with many different data silos.
Where We’re Headed
The healthcare data world is headed towards having a plethora of data silos on-premise and in the cloud. Some of these data silos will be from groups of hospitals with a shared data lake in a virtual private cloud, and some will be from single hospitals on-prem. Regardless, there will be many different silos of healthcare data around the world.
Making the data in these silos useful will require data federation. In order to develop advanced healthcare solutions that work well for diverse patient populations, we must be able to utilize data from these different data silos. And that must happen in a way that protects patient data privacy. We can think of these data silos as a worldwide distributed and disconnected healthcare database. The best way to perform calculations on such a database is with distributed computation.
This is why federated learning is the right solution for healthcare AI. It allows connecting data from different data silos while not requiring any movement of patient data. No huge files to transfer and store. And, most importantly, no risk to patient privacy.
Data remains with the hospital where it was created. Copies of the AI model are sent to each hospital, and training is performed at each hospital with its local data. Only aggregate information is shared back and used to create a global model. This makes it possible for healthcare AI developers to utilize data across hospitals and health systems without those care providers ever moving data, transferring ownership, or risking patient privacy.
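The training loop described above is commonly implemented as federated averaging: each site trains on its own data, and only model parameters, weighted by sample count, flow back to the server. The sketch below is a minimal illustration of that idea; the linear model, the synthetic "hospital" data sets, and all hyperparameters are illustrative assumptions, not any particular vendor's implementation.

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """One hospital trains on its local data; raw data never leaves the site."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w, len(y)  # only weights and a sample count are shared back

def federated_round(global_w, silos):
    """Server aggregates locally trained weights, weighted by data size."""
    updates = [local_train(global_w, X, y) for X, y in silos]
    total = sum(n for _, n in updates)
    return sum(w * (n / total) for w, n in updates)

# Three hypothetical hospital silos with different patient volumes.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
silos = []
for n in (50, 80, 30):
    X = rng.normal(size=(n, 2))
    silos.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(20):  # repeated rounds refine the shared global model
    w = federated_round(w, silos)
```

Note that the server never sees a single patient record: each round it receives only aggregate parameters from every silo, which is what lets the global model learn from all sites at once.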
It also makes it much easier for AI developers to access larger, more diverse data sets that better represent today’s real-world patient population. This will make models more robust for different patient populations, enabling AI-based healthcare solutions to scale globally at an unprecedented pace.
But training a model once on a diverse data set is not enough. The world is constantly changing, and AI models must be able to adapt and evolve with these changes. AI developers need access to a continuously updating data set, reflecting the changes to patient populations and treatment protocols. They also need a way to evolve their AI models over time to adapt to these changes and to deploy their updated models to production environments in a safe way, in line with regulations.
The FDA is actively evolving its guidance and requirements to address this problem. Hospitals and AI developers alike are also embracing the opportunity to do better – from the initial phases of developing an AI model, through training, validation, and continuous improvement.
But delivering on these good intentions will require a practical, effective solution to the very real challenges of accessing large volumes of distributed, ever-evolving, clinically relevant data from diverse groups of patients.
Now is the time to break down data silos and obstacles to innovation. Federated learning is the way forward in healthcare AI.
Photo: karsty, Getty Images