
Methods to Mask PII data with FPE Using Azure Synapse


Learn how to do Format Preserving Encryption (FPE) at scale and securely move data from production to test environments

Masking data — image by Mika Baumeister on Unsplash

Many enterprises require representative data in their test environments. Typically, this data is copied from production to test environments. However, Personally Identifiable Information (PII) is usually part of production environments and must first be masked. Azure Synapse can be leveraged to mask data using format-preserving encryption and then copy the data to test environments. See also the architecture below.

1. Format Preserving Encryption at scale using Azure Synapse — image by author

In this blog post and the accompanying repo azure-synapse_mask-data_format-preserved-encryption, it is discussed how a scalable and secure masking solution can be created in Synapse. In the next chapter, the properties of the project are discussed. Then the project is deployed in chapter 3, tested in chapter 4, and a conclusion follows in chapter 5.

Properties of the PII masking application in Synapse are as follows:

  • Extendable masking functionality: Building on open source Python libraries like ff3, FPE can be achieved for IDs, names, phone numbers and emails. Examples of encryption are 06-23112312 => 48-78322271,
    Kožušček123a => Sqxbblkd659p, bremersons@hotmail.com => h0zek2fbtw@fr5wsdh.com
  • Security: The Synapse Analytics workspace that is used has the following security in place: private endpoints to connect to the Storage Account, Azure SQL (public access can be disabled) and 100+ other data sources (including on-premises); Managed Identity to authenticate to the Storage account, Azure SQL and the Azure Key Vault in which the secrets used by ff3 for encryption are stored; RBAC authorization to grant access to Azure Storage, Azure SQL and Azure Key Vault; and Synapse data exfiltration protection to prevent data from being taken out of the tenant by a malicious insider
  • Performance: Scalable solution in which Spark is used. The solution can be scaled up by using more vcores, scaled out by using more executors (VMs) and/or by using more Spark pools. In a basic test, 250 MB of data with 6 columns was encrypted and written to storage in 1m45s using a medium-sized Spark pool with 2 executors (VMs) and 8 vcores (threads) per executor (16 vcores/threads in total)
  • Orchestration: Synapse pipelines can orchestrate the process end to end. That is, data can be fetched from cloud/on-premises databases using over 100 different connectors, staged to Azure Storage, masked, and then sent back to the lower environment for testing.
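The project itself relies on the ff3 library for FPE. As a self-contained illustration of the underlying idea (not the project's actual code), the sketch below implements a small Feistel-based format-preserving cipher over digit strings using only the Python standard library; the function names and the key are illustrative.

```python
import hashlib
import hmac


def _round_output(key: bytes, rnd: int, half: str, width: int) -> int:
    """Round function: HMAC-SHA256 of (round number, other half), reduced mod 10^width."""
    digest = hmac.new(key, bytes([rnd]) + half.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % (10 ** width)


def fpe_encrypt(key: bytes, digits: str, rounds: int = 8) -> str:
    """Encrypt a digit string (length >= 2) into a digit string of the same length."""
    mid = len(digits) // 2
    left, right = digits[:mid], digits[mid:]
    for rnd in range(rounds):
        # Feistel step: replace the right half, then swap the halves.
        new_right = (int(right) + _round_output(key, rnd, left, len(right))) % (10 ** len(right))
        left, right = str(new_right).zfill(len(right)), left
    return left + right


def fpe_decrypt(key: bytes, digits: str, rounds: int = 8) -> str:
    """Invert fpe_encrypt for the same key and number of rounds."""
    mid = len(digits) // 2
    left, right = digits[:mid], digits[mid:]
    for rnd in reversed(range(rounds)):
        # Undo one Feistel step: unswap, then subtract the round output.
        prev_right = (int(left) - _round_output(key, rnd, right, len(left))) % (10 ** len(left))
        left, right = right, str(prev_right).zfill(len(left))
    return left + right


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # illustrative key only
masked = fpe_encrypt(key, "000001")
print(masked)                    # six digits again, deterministic for this key
print(fpe_decrypt(key, masked))  # prints 000001
```

Because the ciphertext is again a digit string of the same length, it can be stored in the same column with the same schema constraints — which is the whole point of FPE. The ff3 library implements the NIST-specified FF3-1 variant of this construction.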

In the architecture below, the security properties are depicted.

2. Security properties of masking application — image by author

In the next chapter, the masking application will be deployed and configured, including test data.

In this chapter, the project comes to life and will be deployed in Azure. The following steps are executed:

  • 3.1 Prerequisites
  • 3.2 Deploy resources
  • 3.3 Configure resources

3.1 Prerequisites

The following resources are required in this tutorial:

Finally, clone the git repo below to your local computer. In case you don’t have git installed, you can just download a zip file from the web page.

3.2 Deploy resources

The following resources need to be deployed:

  • 3.2.1 Azure Synapse Analytics workspace: Deploy Synapse with data exfiltration protection enabled. Make sure that a primary storage account is created. Also make sure that Synapse is deployed 1) with managed VNET enabled, 2) with a private endpoint to the storage account, and 3) allowing outbound traffic only to approved targets, see also the screenshot below:
3.2. Azure Synapse with managed VNET and data exfiltration protection enabled — image by author

3.3 Configure resources

The following resources need to be configured:

  • 3.3.1 Storage Account – File systems: In the storage account, create two new file systems called bronze and gold. Then upload the csv file Data/SalesLT.Customer.txt from the repo. In case you want to use a bigger dataset, see this set of 250 MB and 1M records
  • 3.3.2 Azure Key Vault – Secrets: Create secrets called fpekey and fpetweak. Make sure that hexadecimal values are added for both secrets. In case the Azure Key Vault was deployed with public access enabled (in order to be able to create the secrets via the Azure Portal), public access is no longer needed and can be disabled (since a private link connection between Synapse and Azure Key Vault will be created in 3.3.4)
  • 3.3.3 Azure Key Vault – Access control: Make sure that in the access policies of the Azure Key Vault the Synapse Managed Identity has get access to secrets, see also the image below.
3.3.3 Synapse Managed Identity having get access on secrets in Key Vault — image by author
  • 3.3.4 Azure Synapse Analytics – Private link to Azure Key Vault: Create a private endpoint between the Azure Synapse workspace managed VNET and your key vault. The request is initiated from Synapse and needs to be approved in the AKV networking settings. See the image below, in which the private endpoint is approved.
3.3.4 Private Link Connection between Synapse and Key Vault — image by author
  • 3.3.5 Azure Synapse Analytics – Linked service to Azure Key Vault: Create a linked service from the Azure Synapse workspace to your key vault, see also the image below
3.3.5 Linked Service between Synapse and Key Vault to get secrets — image by author
  • 3.3.6 Azure Synapse Analytics – Spark cluster: Create a Spark cluster that is medium-sized, has 3 to 10 nodes and can be scaled to 2 to 3 executors, see also the image below.
3.3.6 Create Spark Cluster in Synapse — image by author
  • 3.3.7 Azure Synapse Analytics – Library upload: Notebook Synapse/mask_data_fpe_ff3.ipynb uses ff3 for encryption. Since the Azure Synapse Analytics workspace is created with data exfiltration protection enabled, ff3 cannot be installed by fetching it from pypi.org, since that requires outbound connectivity outside the Azure AD tenant. Download the pycryptodome wheel here, the ff3 wheel here and the Unidecode library here (the Unidecode library is leveraged to convert Unicode to ASCII first, to prevent that overly large alphabets have to be used in ff3 to encrypt data). Then upload the wheels to the workspace to make them trusted, and finally attach them to the Spark cluster, see the image below.
3.3.7 Attached Python packages to Spark cluster from Synapse workspace — image by author
  • 3.3.8 Azure Synapse Analytics – Notebooks upload: Upload the notebooks Synapse/mask_data_fpe_prefixcipher.ipynb and Synapse/mask_data_fpe_ff3.ipynb to your Azure Synapse Analytics workspace. Make sure that in the notebooks, the values of the storage account, file system, key vault name and key vault linked service are substituted.
  • 3.3.9 Azure Synapse Analytics – Notebooks – Spark session: Open the Spark session of notebook Synapse/mask_data_fpe_prefixcipher.ipynb, make sure you select more than 2 executors and run it using a Managed Identity, see also the screenshot below.
3.3.9 Run Spark session as Managed Identity — image by author

After all resources are deployed and configured, the notebooks can be run. Notebook Synapse/mask_data_fpe_prefixcipher.ipynb contains functionality to mask numeric values, alphanumeric values, phone numbers and email addresses, see the example output below.

000001 => 359228
Bremer => 6paCYa
Bremer & Sons!, LTD. => OsH0*VlF(dsIGHXkZ4dK
06-23112312 => 48-78322271
bremersons@hotmail.com => h0zek2fbtw@fr5wsdh.com
Kožušček123a => Sqxbblkd659p

In case the 1M-record dataset is used and 6 columns are encrypted, processing takes around 2 minutes. This can easily be scaled by 1) scaling up using more vcores (from medium to large), 2) scaling out using more executors, or 3) creating a second Spark pool. See also the screenshot below.

4. Notebook successfully run — image by author

In Synapse, notebooks can easily be embedded in pipelines. These pipelines can be used to orchestrate the activities: first copy the data from the production source to storage, then run the notebook to mask the data, and finally copy the masked data to the test target. An example pipeline can be found in Synapse/synapse_pipeline.json
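The copy → mask → copy pattern described above can be sketched in pipeline JSON roughly as follows (activity and reference names here are illustrative, not taken from the repo's Synapse/synapse_pipeline.json):

```json
{
  "name": "MaskPiiPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyProductionToBronze",
        "type": "Copy"
      },
      {
        "name": "MaskDataNotebook",
        "type": "SynapseNotebook",
        "dependsOn": [
          { "activity": "CopyProductionToBronze", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "notebook": { "referenceName": "mask_data_fpe_ff3", "type": "NotebookReference" }
        }
      },
      {
        "name": "CopyGoldToTestTarget",
        "type": "Copy",
        "dependsOn": [
          { "activity": "MaskDataNotebook", "dependencyConditions": [ "Succeeded" ] }
        ]
      }
    ]
  }
}
```

The `dependsOn` entries with `Succeeded` conditions ensure that masked data is only copied to the test target if the masking notebook completed without errors.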

Many enterprises need representative sample data in their test environments. Typically, this data is copied from a production environment to a test environment. In this blog post and the git repo azure-synapse_mask-data_format-preserved-encryption, a scalable and secure masking solution is discussed that leverages the power of Spark, Python and the open source library ff3, see also the architecture below.

5. Format Preserving Encryption at scale using Azure Synapse — image by author
