
Three Privacy Preserving Machine Learning Techniques Solving This Decade’s Most


By Amogh Tarcar, Machine Learning and AI Researcher, Persistent Systems.

Data privacy, according to experts across a wide range of domains, will be the most important issue of this decade. This is particularly true for machine learning (ML), where algorithms are fed reams of data.

Traditionally, ML modeling techniques have relied on centralizing data from multiple sources into a single data center. After all, ML models are at their most powerful when they have access to huge quantities of data. However, this approach comes with a host of privacy challenges. Aggregating diverse data from multiple sources is less feasible today because of regulations such as HIPAA, GDPR, and CCPA. Furthermore, centralizing data increases the scope and scale of data misuse and of security threats such as data leaks.

To overcome these challenges, several pillars of privacy preserving machine learning (PPML) have been developed with specific techniques that reduce privacy risk and ensure that data remains reasonably secure. Here are a few of the most important:

1. Federated Learning

Federated learning is an ML training technique that flips the data aggregation problem on its head. Instead of aggregating data to create a single ML model, federated learning aggregates ML models themselves. This ensures that data never leaves its source location, and it allows multiple parties to collaborate and build a common ML model without directly sharing sensitive data.

It works like this. You start with a base ML model that is then shared with each client node. These nodes then run local training on this model using their own data. Model updates are periodically shared with the coordinator node, which processes these updates and fuses them together to obtain a new global model. In this way, you get the insights from diverse datasets without having to share these datasets.
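To make this concrete, here is a minimal sketch of one federated round using weight averaging (in the spirit of FedAvg) in PyTorch. The model, client data loaders, and hyperparameters are hypothetical placeholders rather than anything specified in the article; a real deployment would also add secure transport, client selection, and safeguards for the model updates themselves.

# Minimal sketch of one federated averaging round in PyTorch.
# The clients' data loaders and the model are hypothetical placeholders.
import copy
import torch
import torch.nn as nn

def local_update(global_model, data_loader, epochs=1, lr=0.01):
    """Client node: train a copy of the global model on local data only."""
    model = copy.deepcopy(global_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for features, labels in data_loader:
            optimizer.zero_grad()
            loss_fn(model(features), labels).backward()
            optimizer.step()
    return model.state_dict()  # only model weights leave the client, never the data

def federated_round(global_model, client_loaders):
    """Coordinator node: fuse client updates by averaging their weights."""
    client_states = [local_update(global_model, loader) for loader in client_loaders]
    fused_state = copy.deepcopy(client_states[0])
    for key in fused_state:
        fused_state[key] = torch.stack(
            [state[key].float() for state in client_states]
        ).mean(dim=0)
    global_model.load_state_dict(fused_state)  # new global model for the next round
    return global_model

A common refinement is to weight each client's update by the size of its local dataset, so that nodes with more data have proportionally more influence on the fused model.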

[Figure: federated learning workflow. Source: Persistent Systems]

In the context of healthcare, this is an incredibly powerful, privacy-aware tool for keeping patient data safe while giving researchers the wisdom of the crowd. By not aggregating the data, federated learning adds an extra layer of security. However, the models and model updates themselves still present a security risk if left vulnerable.

2. Differential Privacy

ML models are often targets of membership inference attacks. Say that you were to share your healthcare data with a hospital in order to help develop a cancer vaccine. The hospital keeps your data secure, but uses federated learning to train a publicly available ML model. A few months later, hackers use a membership inference attack to determine whether your data was used in the model’s training or not. They then pass insights to an insurance company, which, based on your risk of cancer, could raise your premiums.

Differential privacy ensures that adversarial attacks on ML models cannot reliably identify the specific data points used in training, mitigating the risk of exposing sensitive training data. This is done by applying “statistical noise” to perturb the data or the model’s parameters during training, making it difficult for an attacker to determine whether a particular individual’s data was used to train the model.

For instance, Facebook recently released Opacus, a high-speed library for training PyTorch models with a differentially private training algorithm called Differentially Private Stochastic Gradient Descent (DP-SGD), which injects noise during training to mask the contribution of individual data points.

This noise is governed by a parameter called epsilon. A low epsilon value gives the model strong data privacy but poor utility and accuracy; a high epsilon value improves accuracy at the cost of privacy. The trick is to strike a balance that optimizes for both.
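As a rough illustration of how this looks in practice, the sketch below wraps an ordinary PyTorch training loop with Opacus. The toy model, data, and the noise_multiplier and max_grad_norm values are illustrative assumptions rather than recommendations from the article, and the PrivacyEngine arguments may differ slightly between Opacus versions.

# Hedged sketch: adding DP-SGD to a PyTorch training loop with Opacus.
# The model, data, and privacy parameters below are illustrative placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = nn.Linear(10, 2)  # toy model standing in for a real one
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
data_loader = DataLoader(dataset, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,  # more noise -> lower epsilon -> stronger privacy
    max_grad_norm=1.0,     # clip each sample's gradient before noise is added
)

loss_fn = nn.CrossEntropyLoss()
for features, labels in data_loader:
    optimizer.zero_grad()
    loss_fn(model(features), labels).backward()
    optimizer.step()

# Privacy spent so far, for a chosen delta: a lower epsilon means stronger privacy.
print("epsilon:", privacy_engine.get_epsilon(delta=1e-5))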

3. Homomorphic Encryption

Standard encryption traditionally is incompatible with machine learning because…


