Learn how to do Format Preserving Encryption (FPE) at scale and securely move data from production to test environments
A lot of enterprises require representative data in their test environments. Typically, this data is copied from production to test environments. However, Personally Identifiable Information (PII) is usually part of production data and must first be masked. Azure Synapse can be leveraged to mask data using format preserving encryption and then copy the data to test environments. See also the architecture below.
In this blog and repo azure-synapse_mask-data_format-preserved-encryption, it is discussed how a scalable and secure masking solution can be created in Synapse. In the next chapter, the properties of the project are discussed. The project is then deployed in chapter 3 and tested in chapter 4, and a conclusion follows in chapter 5.
Properties of the PII masking application in Synapse are as follows:
- Extendable masking functionality: Building on open source Python libraries like ff3, FPE can be achieved for IDs, names, phone numbers and emails. Examples of encryption are 06–23112312 => 48–78322271, Kožušček123a => Sqxbblkd659p, bremersons@hotmail.com => h0zek2fbtw@fr5wsdh.com
- Security: The Synapse Analytics workspace that is used has the following security in place: private endpoints to connect to the Storage Account, Azure SQL (public access can be disabled) and 100+ other data sources (including on-premises); Managed Identity to authenticate to the Storage Account, Azure SQL and the Azure Key Vault in which the secrets used by ff3 for encryption are stored; RBAC authorization to grant access to Azure Storage, Azure SQL and Azure Key Vault; and Synapse data exfiltration protection to prevent data from leaving the tenant via a malicious insider
- Performance: Scalable solution in which Spark is used. The solution can be scaled up by using more vcores, scaled out by using more executors (VMs) and/or by using more Spark pools. In a basic test, 250MB of data with 6 columns was encrypted and written to storage in 1m45s using a Medium sized Spark pool with 2 executors (VMs) and 8 vcores (threads) per executor (16 vcores/threads in total)
- Orchestration: Synapse pipelines can orchestrate the process end to end. That is, data can be fetched from cloud/on-premises databases using over 100 different connectors, staged to Azure Storage, masked and then sent back to a lower environment for testing.
In the architecture below, the security properties are defined.
In the next chapter, the masking application will be deployed and configured, including test data.
In this chapter, the project comes to life and is deployed in Azure. The following steps are executed:
- 3.1 Prerequisites
- 3.2 Deploy resources
- 3.3 Configure resources
3.1 Prerequisites
The following resources are required in this tutorial:
Finally, clone the git repo below to your local computer. In case you don't have git installed, you can just download a zip file from the web page.
3.2 Deploy resources
The following resources need to be deployed:
- 3.2.1 Azure Synapse Analytics workspace: Deploy Synapse with data exfiltration protection enabled. Make sure that a primary storage account is created. Also make sure that Synapse is deployed with 1) Managed VNET enabled, 2) a private endpoint to the storage account and 3) outbound traffic allowed only to approved targets, see also the screenshot below:
3.3 Configure resources
The following resources need to be configured:
- 3.3.1 Storage Account – File Systems: In the storage account, create new file systems called `bronze` and `gold`. Then upload the csv file in `Data/SalesLT.Customer.txt`. In case you want to use a bigger dataset, see this set of 250MB and 1M records
- 3.3.2 Azure Key Vault – Secrets: Create secrets called `fpekey` and `fpetweak`. Make sure that hexadecimal values are added for both secrets. In case Azure Key Vault was deployed with public access enabled (in order to be able to create secrets via the Azure Portal), this is not needed anymore and public access can be disabled (since a private link connection will be created between Synapse and Azure Key Vault in 3.3.4)
- 3.3.3 Azure Key Vault – access control: Make sure that in the access policies of the Azure Key Vault the Synapse Managed Identity has get access to secrets, see also the image below.
- 3.3.4 Azure Synapse Analytics – Private link to Azure Key Vault: Create a private endpoint from the Azure Synapse Workspace managed VNET to your key vault. The request is initiated from Synapse and needs to be approved in the AKV networking settings. See also the screenshot below in which the private endpoint is approved.
- 3.3.5 Azure Synapse Analytics – Linked Service to Azure Key Vault: Create a linked service from the Azure Synapse Workspace to your key vault, see also the image below
- 3.3.6 Azure Synapse Analytics – Spark Cluster: Create a Spark cluster that is Medium sized, has 3 to 10 nodes and can be scaled to 2 to 3 executors, see also the image below.
- 3.3.7 Azure Synapse Analytics – Library upload: Notebook `Synapse/mask_data_fpe_ff3.ipynb` uses ff3 for encryption. Since Azure Synapse Analytics is created with data exfiltration protection enabled, the library cannot be installed by fetching it from pypi.org, since that requires outbound connectivity outside the Azure AD tenant. Download the pycryptodome wheel here, the ff3 wheel here and the Unidecode library here (the Unidecode library is leveraged to first convert unicode to ascii, to prevent that extensive alphabets need to be used in ff3 to encrypt data). Then upload the wheels to the Workspace to make them trusted and finally attach them to the Spark cluster, see the image below.
- 3.3.8 Azure Synapse Analytics – Notebooks upload: Upload the notebooks `Synapse/mask_data_fpe_prefixcipher.ipynb` and `Synapse/mask_data_fpe_ff3.ipynb` to your Azure Synapse Analytics Workspace. Make sure that in the notebooks, the values of the storage account, file system, key vault name and key vault linked service are substituted.
- 3.3.9 Azure Synapse Analytics – Notebooks – Spark session: Open the Spark session of notebook `Synapse/mask_data_fpe_prefixcipher.ipynb`, make sure to select more than 2 executors and run it using the Managed Identity, see also the screenshot below.
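As an aside on step 3.3.2: the hexadecimal values for the `fpekey` and `fpetweak` secrets can be generated locally before adding them to Key Vault. A minimal sketch, assuming a 128-bit key and a 56-bit FF3-1 tweak, using only Python's standard library:

```python
import secrets

# 16 random bytes -> 32 hex characters (a 128-bit AES key for ff3)
fpekey = secrets.token_hex(16)
# 7 random bytes -> 14 hex characters (a 56-bit FF3-1 tweak)
fpetweak = secrets.token_hex(7)

print("fpekey:", fpekey)
print("fpetweak:", fpetweak)
```

The `secrets` module is the stdlib's cryptographically strong random source, so the generated values are suitable as key material; paste them into the two Key Vault secrets.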
After all resources are deployed and configured, the notebooks can be run. Notebook `Synapse/mask_data_fpe_prefixcipher.ipynb` contains functionality to mask numeric values, alphanumeric values, phone numbers and email addresses, see the functionality below.
000001 => 359228
Bremer => 6paCYa
Bremer & Sons!, LTD. => OsH0*VlF(dsIGHXkZ4dK
06-23112312 => 48-78322271
bremersons@hotmail.com => h0zek2fbtw@fr5wsdh.com
Kožušček123a => Sqxbblkd659p
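The email example above keeps the user@domain.tld shape intact. The structure of such a masking function can be sketched as follows; `fpe_encrypt` here is a hypothetical, non-reversible stand-in for the real ff3 cipher call used in the notebook:

```python
import hashlib

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def fpe_encrypt(text: str) -> str:
    # Hypothetical stand-in for an ff3 FPE call: deterministic,
    # length-preserving, output drawn from a lowercase-alphanumeric alphabet.
    digest = hashlib.sha256(text.encode()).digest()
    return "".join(ALPHABET[digest[i % len(digest)] % len(ALPHABET)]
                   for i in range(len(text)))

def mask_email(email: str) -> str:
    # Encrypt the local part and the domain label separately, so the
    # user@domain.tld shape (and the top-level domain) is preserved.
    local, _, domain = email.partition("@")
    label, _, tld = domain.rpartition(".")
    return f"{fpe_encrypt(local)}@{fpe_encrypt(label)}.{tld}"

print(mask_email("bremersons@hotmail.com"))
```

With a real FPE cipher in place of `fpe_encrypt`, the same structure makes the masking reversible for whoever holds the key and tweak.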
In case the 1M record dataset is used and 6 columns are encrypted, processing takes around 2 minutes. This can easily be scaled by 1) scaling up using more vcores (from Medium to Large), 2) scaling out using more executors or 3) simply creating a second Spark pool. See also the screenshot below.
In Synapse, notebooks can easily be embedded in pipelines. These pipelines can be used to orchestrate the activities by first uploading the data from the production source to storage, running the notebook to mask the data and then copying the masked data to the test target. An example pipeline can be found in Synapse/synapse_pipeline.json
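Such a pipeline definition has roughly the following shape (the activity and notebook names below are illustrative, not taken from the repo):

```json
{
  "name": "MaskDataPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyProductionToBronze",
        "type": "Copy",
        "typeProperties": { "note": "source/sink dataset definitions go here" }
      },
      {
        "name": "MaskDataNotebook",
        "type": "SynapseNotebook",
        "dependsOn": [
          { "activity": "CopyProductionToBronze", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": {
          "notebook": { "referenceName": "mask_data_fpe_ff3", "type": "NotebookReference" }
        }
      },
      {
        "name": "CopyGoldToTestTarget",
        "type": "Copy",
        "dependsOn": [
          { "activity": "MaskDataNotebook", "dependencyConditions": [ "Succeeded" ] }
        ],
        "typeProperties": { "note": "source/sink dataset definitions go here" }
      }
    ]
  }
}
```

The `dependsOn` conditions chain the three activities so masking only runs after staging succeeds, and the copy to the test target only runs after masking succeeds.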
A lot of enterprises require representative sample data in test environments. Typically, this data is copied from a production environment to a test environment. In this blog and git repo azure-synapse_mask-data_format-preserved-encryption, a scalable and secure masking solution is discussed that leverages the power of Spark, Python and the open source library ff3, see also the architecture below.