Privacy-Preserving Machine Learning
This report collects insights at the intersection of machine learning and privacy preservation. The first insight is the escalating importance of data privacy, illustrating how and why it has become one of the primary public concerns of the decade. The second is a major gap in the field: the lack of reliable anonymization techniques. The third is the ongoing development of differential privacy, which could address that lack of dependable data-anonymity frameworks.
Data Privacy has Become a Major Public Issue
- The importance of data privacy, security concerns, and information gathering has been a persistent trend that industry analysts have been covering for years.
- Per Pew Research, roughly 6 in 10 Americans believe that corporations or governments collect data about them on a daily basis, signifying how widespread the concern over data harvesting and potential loss has become.
- Health data collected by wearable devices, and DNA samples sent to consumer testing companies, are not protected by health industry regulations such as HIPAA, creating a gray area around who owns the harvested data and what can be done with it.
- Without data privacy protections, data on individuals, including their likes and preferences, is mined, scored for industry use, and sold to companies as profiles. The inability to control this data is the chief public complaint.
- The lack of privacy exhibited by social media platforms has led to a public #deleteFacebook campaign, which further illustrates the importance of privacy to consumers.
- The escalating data-collection potential represented by 5G and quantum computing will make data privacy regulation all the more necessary.
- As AI grows more capable of processing information, both the volume of data collected and the likelihood that sensitive data can be linked back to individual identities will increase.
Anonymized Data can be Reversed to Identify Individuals
- By combining one or more anonymized data sets, enough information about an individual can often be assembled to identify them. In other words, the industry lacks a bulletproof method of anonymization.
- One of several documented examples occurred in 1997, when the Massachusetts Group Insurance Commission released de-identified data to researchers.
- Researchers cross-referenced publicly available databases with the anonymized data and matched records to individuals, including the sitting Governor of Massachusetts. Several other instances of anonymized data being re-identified have occurred since.
- One method sometimes used to anonymize data is a one-way hash. While widely used, it is not a fully reliable way to remove identifiers: European data protection authorities recently released a guide covering how one-way hashes can be undone.
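The weakness of hashing as an anonymization technique can be sketched briefly. When the hashed value comes from a small input space (a PIN, a phone number, a national ID), an attacker can simply hash every possible input and compare. This is a minimal illustration; the PIN value is invented:

```python
import hashlib

# Hypothetical "anonymized" record: a SHA-256 hash of a 4-digit PIN.
# Because there are only 10,000 possible inputs, the hash is reversible
# by exhaustive search, even though SHA-256 itself is not broken.
published_hash = hashlib.sha256(b"4831").hexdigest()

def reverse_pin_hash(target_hash):
    """Brute-force a SHA-256 hash drawn from a small input space."""
    for i in range(10_000):
        candidate = f"{i:04d}"
        if hashlib.sha256(candidate.encode()).hexdigest() == target_hash:
            return candidate
    return None

recovered = reverse_pin_hash(published_hash)  # recovers "4831"
```

The same logic scales to phone numbers or ID numbers with modest hardware, which is why hashing a low-entropy identifier does not, by itself, anonymize it.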
Differential Privacy Could be a Solution for the Lack of Data Anonymity
- Differential privacy (DP) is a concept first formalized in 2006 that applies to data collected for machine learning and serves to sever the connection between personal data and personal identity.
- The goal of DP is "to ensure that anything that can be learned about an individual from the released information, can be learned without that individual’s data being included."
- DP works by injecting calibrated random noise as data is collected or queried, so that aggregate results remain useful while the contribution of any single individual is obscured.
- A fully workable model for DP has not yet been produced or widely implemented. However, Harvard Medical School released the first outlined methodology for practical use in 2018, representing a potential solution to the lack of data anonymity, though further research is needed.
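The noise-injection idea above can be sketched with the classic Laplace mechanism for a counting query. This is a minimal illustration of the general DP concept, not the Harvard methodology mentioned above; the dataset values and the epsilon parameter are invented for the example:

```python
import math
import random

def laplace_noise(scale, rng):
    # Sample from Laplace(0, scale) via the inverse-CDF transform.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(records, predicate, epsilon, rng):
    # A counting query has sensitivity 1: adding or removing one person's
    # record changes the count by at most 1. Laplace noise with scale
    # 1/epsilon then yields epsilon-differential privacy for this query.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical dataset: ages of eight individuals (values invented).
ages = [34, 29, 41, 57, 23, 62, 38, 45]
rng = random.Random(0)
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
```

The released `noisy` value stays close to the true count (4 here) but varies from run to run, so an observer cannot tell from the output whether any particular individual's record was included.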