Federated Machine Learning

Healthcare is one of the most privacy-sensitive domains: the management and sharing of patient health data is strictly regulated by regional, national and international legislation (e.g. GDPR, HIPAA). It is therefore challenging to build machine learning models that combine datasets from different healthcare and health research organizations/institutes. The classic approach is to collect the datasets of the different institutes in a central place and train the machine learning models there. However, sharing this data (i.e. moving it outside the institutes’ boundaries) is subject to strict privacy-related policies and regulations, including patient consent requirements, and is usually not possible.

In the context of the FAIR4Health project, we designed and implemented a federated machine learning architecture under the general concept of privacy-preserving distributed data mining (PPDDM). The source code is open and can be found on GitHub.

General representation of the Federated Machine Learning Architecture. At each deployment site on the left, an Agent is responsible for training the local (weak) models; the Manager orchestrates the collection of these models from the different Agents and builds a boosted model. The boosted model can then be used for predictions on the Manager.
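To make the agent-manager split concrete, here is a minimal sketch in plain Python rather than the project's Scala/Spark code. It is not the FAIR4Health implementation: the class names (`Agent`, `Manager`, `Stump`), the one-feature decision stump used as the weak model, and the weighted majority vote used as a stand-in for the boosted model are all assumptions made for illustration. The one property it is meant to show is that raw rows stay inside each agent and only trained models reach the manager.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[float, int]  # (feature value, label in {0, 1}); hypothetical toy schema


@dataclass(frozen=True)
class Stump:
    """A one-feature threshold classifier used here as the weak model."""
    threshold: float
    polarity: int   # 1: predict 1 when x >= threshold; -1: predict 1 when x < threshold
    weight: float   # vote weight derived from the local training error

    def predict(self, x: float) -> int:
        above = x >= self.threshold
        return int(above) if self.polarity == 1 else int(not above)


class Agent:
    """Runs at a deployment site; the rows never leave this object."""

    def __init__(self, rows: List[Row]):
        self._rows = rows

    def train_weak_model(self) -> Stump:
        best = None  # (error rate, threshold, polarity)
        for threshold in sorted({x for x, _ in self._rows}):
            for polarity in (1, -1):
                candidate = Stump(threshold, polarity, 1.0)
                wrong = sum(candidate.predict(x) != y for x, y in self._rows)
                err = wrong / len(self._rows)
                if best is None or err < best[0]:
                    best = (err, threshold, polarity)
        err, threshold, polarity = best
        err = min(max(err, 1e-6), 1 - 1e-6)  # clamp to keep the log finite
        return Stump(threshold, polarity, 0.5 * math.log((1 - err) / err))


class Manager:
    """Collects only the trained models and combines them by weighted vote."""

    def __init__(self, agents: List[Agent]):
        self._agents = agents
        self._models: List[Stump] = []

    def build_model(self) -> None:
        self._models = [agent.train_weak_model() for agent in self._agents]

    def predict(self, x: float) -> int:
        score = sum(m.weight * (1 if m.predict(x) == 1 else -1) for m in self._models)
        return 1 if score >= 0 else 0


# Three sites with made-up data; only Stump objects cross the site boundary.
agents = [
    Agent([(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)]),
    Agent([(0.5, 0), (3.0, 0), (5.0, 1), (8.0, 1)]),
    Agent([(2.5, 0), (4.0, 1), (9.0, 1)]),
]
manager = Manager(agents)
manager.build_model()
print(manager.predict(1.5), manager.predict(6.5))
```

In this sketch the manager never touches `_rows`; it sees only a threshold, a polarity and a weight per site, which is the kind of separation the architecture relies on.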

I am the lead software architect and lead developer of the Federated Machine Learning (ML) system. Together with an international team of software engineers, machine learning experts and data scientists, we designed the agent-manager federated ML system around strict security and privacy requirements: the data never left the agents, yet we could still train classification/regression models and mine association patterns in our system. Under my coordination, we developed the agent and the manager as two separate modules that communicate asynchronously through RESTful services. We developed the system in Scala and used Apache Spark, together with its MLlib library, as the data processing environment. The source code is on GitHub and the full tech stack can be found on StackShare.
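The asynchronous exchange between the two modules can be illustrated with a small simulation. This is only a sketch under assumptions: it uses Python's `asyncio` with an in-process stand-in instead of real HTTP calls, and the submit-then-poll pattern, the `AgentStub` class, the `collect_models` function and the job identifier are hypothetical, not the project's actual REST API.

```python
import asyncio
from typing import Dict, List, Optional


class AgentStub:
    """In-process stand-in for an agent's REST interface (hypothetical)."""

    def __init__(self, name: str, training_seconds: float):
        self.name = name
        self._training_seconds = training_seconds
        self._results: Dict[str, dict] = {}
        self._tasks: set = set()  # keep references so background tasks are not dropped

    async def submit_training(self, job_id: str) -> None:
        # Stands in for a request that returns at once while training continues.
        task = asyncio.create_task(self._train(job_id))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)

    async def _train(self, job_id: str) -> None:
        await asyncio.sleep(self._training_seconds)  # simulated local training
        self._results[job_id] = {"agent": self.name, "model": f"weak-model-{self.name}"}

    async def fetch_result(self, job_id: str) -> Optional[dict]:
        # Stands in for a status request; None means "not ready yet".
        return self._results.get(job_id)


async def collect_models(agents: List[AgentStub], job_id: str,
                         poll_interval: float = 0.01, timeout: float = 1.0) -> List[dict]:
    """Manager side: start training everywhere, then poll until all agents answer."""
    await asyncio.gather(*(agent.submit_training(job_id) for agent in agents))
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    pending, results = list(agents), []
    while pending:
        if loop.time() > deadline:
            raise TimeoutError("some agents did not finish in time")
        still_pending = []
        for agent in pending:
            result = await agent.fetch_result(job_id)
            if result is None:
                still_pending.append(agent)
            else:
                results.append(result)
        pending = still_pending
        if pending:
            await asyncio.sleep(poll_interval)
    return results


results = asyncio.run(
    collect_models([AgentStub("site-a", 0.03), AgentStub("site-b", 0.01)], "job-1")
)
print(sorted(r["agent"] for r in results))
```

The point of the pattern is that the manager is never blocked by a slow site: each agent trains at its own pace and the manager gathers whatever is ready on each polling round.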

Description

FAIR4Health Privacy-Preserving Distributed Data Mining (PPDDM) Framework