Data De-identification and HIPAA

Companies are collecting and storing more data than ever before. As storage and processing costs decrease, companies are able to perform analytics on vast amounts of data. This provides insights which enable companies to serve their customers better and more efficiently, ultimately driving more value for the business. 

However, these companies are struggling to ensure that the data they have collected is used appropriately and in accordance with both internal policy and external regulations. Companies are subject to a variety of regulations related to the privacy and security of customer data: HIPAA, GDPR, and CPRA, to name a few.

In this post, we will focus on the HIPAA Privacy Rule. First, we will clarify exactly what HIPAA considers to be protected health information. Next, we will review how companies must handle such information. Finally, we will present a solution for doing so automatically and at scale.

Protected Health Information (PHI)

According to the US Department of Health & Human Services (HHS), the HIPAA Privacy Rule protects most “individually identifiable health information held or transmitted by a covered entity or its business associate, in any form or medium, whether electronic, on paper, or oral. The Privacy Rule calls this information protected health information (PHI)”. The rule applies to covered entities (health plans, health care clearinghouses, and most health care providers) and to the business associates that handle PHI on their behalf.

In practice, this means that most companies in the healthcare industry are collecting and storing PHI and, as a result, are subject to HIPAA. Now let’s take a look at what HIPAA requires these companies to do with the PHI they store.


As companies amass larger and larger data sets, their ability to perform meaningful studies increases. HIPAA recognizes this and allows such analysis, provided that the stored data cannot be attributed back to the data subject. HHS guidance on the HIPAA Privacy Rule states: “The process of de-identification, by which identifiers are removed from the health information, mitigates privacy risks to individuals and thereby supports the secondary use of data for comparative effectiveness studies, policy assessment, life sciences research, and other endeavors.” The most common methods of de-identification are tokenization, encryption, and masking.

Under its Safe Harbor method, the HIPAA Privacy Rule calls out 18 types of identifiers that must be removed. These include names, phone numbers, and email addresses, among other ways to identify someone from the data.
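To make two of these methods concrete, here is a minimal Python sketch of masking and tokenization applied to a single record. The field names, the sample record, and the `SECRET_KEY` value are all hypothetical, and the keyed-hash token is just one simple way to build a token; many commercial products use a token vault instead. It illustrates the mechanics only and is not a statement of what satisfies HIPAA.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real deployment would load this
# from a key management system rather than hard-coding it.
SECRET_KEY = b"example-key-not-for-production"


def mask(value: str, visible: int = 0, fill: str = "*") -> str:
    """Replace characters with `fill`, optionally leaving the last `visible` ones."""
    if visible <= 0 or len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


def tokenize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic keyed token: the same input and key give the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]


# Hypothetical record: two identifier fields and one non-identifying field.
record = {"name": "Jane Doe", "phone": "555-867-5309", "diagnosis_code": "E11.9"}

deidentified = {
    "name": tokenize(record["name"]),   # identifier replaced by a token
    "phone": mask(record["phone"]),     # identifier fully masked
    "diagnosis_code": record["diagnosis_code"],  # left as-is for analysis
}
```

Because the token is deterministic, records for the same person can still be joined for analysis without exposing the name. Masking discards the value entirely, which is simpler but removes that ability.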

Fortunately, there are a number of companies that provide technology to implement de-identification. The specific methodology varies and includes encryption, tokenization, masking, and variants of each. Suffice it to say the ability to de-identify data exists… once you know what data to de-identify. But how do you first identify the data that needs to be de-identified? That leads us to data classification.

Data Classification

Companies store data in databases such as SQL Server, MySQL, PostgreSQL, Redshift, BigQuery, Snowflake, Databricks and others. These are known as structured data stores. Despite the large number of database providers, they all share the same basic structure, which ultimately comes down to tables, much like a typical spreadsheet. Each table is made up of columns. Each column has a header followed by multiple values. Data classification is the process of determining what type of data each column contains.
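As a rough illustration of the idea, the Python sketch below classifies a column by sampling its values and checking them against a few regular expressions. The patterns, the labels, the `threshold` parameter, and the sample table are assumptions made for this example; a production classifier would cover many more data types and would typically also use column names and other signals.

```python
import re

# Illustrative patterns only; real classifiers handle far more types and formats.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(values, threshold=0.8):
    """Return the best-matching label if it covers at least `threshold`
    of the non-empty values in the column, otherwise None."""
    sample = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not sample:
        return None
    best_label, best_ratio = None, 0.0
    for label, pattern in PATTERNS.items():
        ratio = sum(1 for v in sample if pattern.match(v)) / len(sample)
        if ratio > best_ratio:
            best_label, best_ratio = label, ratio
    return best_label if best_ratio >= threshold else None


# Hypothetical table: column header -> sampled values.
table = {
    "contact": ["jane@example.com", "sam@example.org", None, "lee@example.net"],
    "callback": ["555-867-5309", "(555) 123-4567", "555.222.3333"],
    "visit_notes": ["follow up in 2 weeks", "no change", "refer to cardiology"],
}

labels = {column: classify_column(values) for column, values in table.items()}
```

Classifying on the values rather than the header matters because headers such as `contact` or `callback` say little about what a column actually holds.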

Bringing it All Together

Dasera makes de-identification continuous and automated.

Dasera provides a platform that will automatically discover the data stores in use within an organization and then look inside each data store to perform data classification, as described above. Once the data is classified, Dasera can be configured to pass the location details of all columns which contain PHI to the customer’s existing de-identification software so that the data can be de-identified.
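The hand-off described above can be pictured with a short Python sketch. This is not Dasera's API: `ColumnLocation`, `PHI_LABELS`, `hand_off`, and the `deidentify` callback are hypothetical names standing in for the classification output and for whatever interface the customer's de-identification software exposes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class ColumnLocation:
    """Hypothetical record of where a classified column lives."""
    data_store: str
    table: str
    column: str
    label: Optional[str]  # classification result, or None if unclassified


# Assumed set of labels treated as PHI for this example.
PHI_LABELS = {"name", "email", "us_phone", "us_ssn"}


def hand_off(
    classified: Iterable[ColumnLocation],
    deidentify: Callable[[ColumnLocation], None],
) -> List[ColumnLocation]:
    """Pass only the PHI columns to the de-identification callback
    and return the list of columns that were sent."""
    targets = [c for c in classified if c.label in PHI_LABELS]
    for target in targets:
        deidentify(target)
    return targets


# Hypothetical classification output across two data stores.
classified = [
    ColumnLocation("warehouse", "patients", "email", "email"),
    ColumnLocation("warehouse", "patients", "visit_notes", None),
    ColumnLocation("billing_db", "claims", "member_phone", "us_phone"),
]

sent = []  # stand-in for the external de-identification tool
targets = hand_off(classified, sent.append)
```

Only the two columns labeled as PHI reach the callback; the unclassified `visit_notes` column is left untouched.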

Leveraging this capability, companies can avoid wholesale, coarse encryption, which tends to be problematic for their business, and instead perform fine-grained de-identification on only the fields that contain PHI. Because de-identified health information is no longer considered protected health information, the HIPAA Privacy Rule does not restrict its use or disclosure.

An example of a de-identification process leveraging Dasera’s continuous and automated orchestration capabilities.

Check us out!

Dasera is built to help companies overcome the security and data challenges that the modern cloud presents. If you are ready for the next phase of your data security journey, our team is here to support you. Please reach out to us with any questions on how to get started!



Lee Isenman