Unembedding: reverse engineering PII from lists of numbers

Capture this in your threat model

TL;DR: When you embed your data, apply the same privacy and threat considerations to the embedded records as to the corresponding source records.

Exploitation scenario: A team of data scientists at a large multinational organisation has recently developed an advanced predictive modelling algorithm that processes and stores data in vector format. The algorithm is groundbreaking, with applications in industries ranging from climate data management to stock market prediction. The scientists share their work with international colleagues to support collaboration across the organisation.

The vectors, which encode sensitive and proprietary information, are embedded into the organisation's AI systems and databases worldwide. The data is supposedly secured by the company's in-house encryption software.

One day, an independent research team publishes a paper and an accompanying tool for accurately reconstructing source data from the embedded data in a vector store. The team has experimented with multiple types of vector stores and consistently recovered the original data.

Unaware of this development, the corporation continues to allow data from its proprietary AI system to be embedded and shared across its many branches.

After reading the research paper, a rogue employee at one of the branches decides to exploit this weakness. Using the research team's tooling, he reconstructs the source data from the embedded vectors in the company's AI system and gains access to highly valuable and sensitive proprietary information.

This fictitious scenario shows how strings of numbers representing embedded data can be reverse-engineered to access confidential and valuable information.
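
To make the idea concrete, here is a deliberately simplified sketch of why a stored vector can leak its source record. It is not the tooling from the scenario, and real inversion attacks would be considerably more involved. Everything in it is an assumption for illustration: `toy_embed` is a hypothetical stand-in for an embedding model (hashed character trigrams), and the attacker is assumed to hold a stolen vector, access to the same embedding function, and a list of candidate records to test.

```python
import hashlib
import math

DIM = 256  # toy vector size, chosen arbitrarily for this sketch


def toy_embed(text):
    """Hypothetical stand-in for an embedding model:
    hashed character trigrams, normalised to unit length."""
    vec = [0.0] * DIM
    padded = "  " + text.lower() + "  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        digest = hashlib.sha256(trigram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def invert_by_guessing(stolen_vector, candidates):
    """Return the candidate whose embedding is closest to the stolen vector."""
    return max(candidates, key=lambda c: cosine(stolen_vector, toy_embed(c)))


# The vector store holds only the numbers, never the text.
secret_record = "Jane Doe, jane.doe@example.com"
stolen_vector = toy_embed(secret_record)

# The attacker embeds each guess and keeps the best match.
candidates = [
    "John Smith, john.smith@example.com",
    "Jane Doe, jane.doe@example.com",
    "Quarterly revenue forecast",
]
recovered = invert_by_guessing(stolen_vector, candidates)
print(recovered)
```

The attacker never sees the text, only the list of numbers, yet can confirm which candidate it encodes. That is why the vector deserves the same handling as the record it was built from.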

Dec 10 2023