3 Essential Requirements of Advanced PII Data Classification

Guy Gertner
Apr 21, 2023
September 7, 2023

Data classification technology has more or less remained the same for well over a decade. Despite some automation being introduced into the technology, it has largely been a process-driven exercise that has frustrated security, data, and IT professionals, along with other enterprise employees. If you stumble upon archived analyst research reports or dated articles on the topic from the mid-2010s, you’ll notice that the challenges professionals bemoaned then are similar to the ones that haunt us today.

But a major shift is happening in the data security space, driven by cloud-native, agentless technology. We’ll get to that in a minute, but let’s cover the data classification that many professionals have experienced and discuss why it’s quickly becoming a legacy approach.

Yesterday’s Data Classification

What are the challenges of yesterday’s data classification?

Hard to implement

Yesterday’s data classification is hard to implement. Teams must first inventory their data, decide on the data store they want to classify, and then rope in the development team to help manually configure connections to that data store. 

Because the tool has to point to specific data stores, only known data stores can be classified. And because the process is so time-consuming, teams limit the scope of their classification initiative to a small subset of the environment. 

Provides limited information

Yesterday’s data classification provides two main outputs: sensitivity and semantic labels. 

Sensitivity (or confidentiality) indicates the level of risk if the data is compromised. Common sensitivity labels include “very sensitive,” “sensitive,” “internal only,” and “public.” Both the naming convention for and the number of sensitivity labels vary widely across organizations. In the absence of proper governance and alignment around sensitivity labels, that number can balloon, resulting in confusing and inconsistent labeling of data. 

Semantic labels, or simply “data classes,” are short descriptions of the type of data. Many tools, including data catalogs, DLP, and public clouds, offer basic data classification capabilities. Often, the data classes these tools output are labels that essentially restate the column header name within a table. There’s no additional context to describe the data itself.

Requires constant, human intervention

Even with just two main outputs (sensitivity and semantic labels), the classification results are incomplete and inaccurate. It’s not uncommon to see a classification tool assign labels to some of the data but miss the rest. The predefined patterns that yesterday’s data classification leverages cannot keep up with the growing variety and formats of data found in data stores.

The lack of completeness and accuracy in the classification output means that someone has to manually review and validate the results. This prevents the classification initiative from ever scaling to keep up with the growth in data. 

What is Advanced PII Data Classification?

Data classification is the process of organizing data into relevant categories to make it simpler to retrieve, sort, use, store, and protect. Going further, Advanced PII Classification is a cloud-native and agentless solution that not only classifies data, but captures deep context about the data with high accuracy and speed. 

The three essential requirements of Advanced PII Classification are:

  • Speed and accuracy
  • Deep context
  • Dynamic identification

Speed and accuracy

Because data is constantly changing and moving, classification needs to be easy and fast. 

  • Within minutes, it connects to your cloud environments
  • Within hours, you’ll get a data store inventory, including the ones you weren’t aware of
  • Within day(s), you’ll get classifications along with deep data context surrounding your sensitive data 

The process requires no agents, adds no overhead, and causes no performance degradation. It is highly automated, leveraging unsupervised machine learning to scan petabytes of data at incredible speeds. 

But that speed is not useful unless data classification is highly accurate. Classification forms the foundation for Data Loss Prevention (DLP) and Data Access Governance (DAG) policies. The data class tells us what controls are most appropriate for the level of risk that the data presents. It tells us what protections apply to the data, who should have access to the data, and how it should be obfuscated. Highly accurate classification makes DLP and DAG policies work more effectively to protect sensitive data. 

DLP policies work by detecting the sensitivity or classification of the data and taking predefined, protective actions on it. For example, you can set a DLP policy to block data labeled as high risk from being copied or moved to an unapproved environment. If the sensitivity label is wrong (say, high-risk data marked as public), the DLP policy will fail to act when the data is copied or moved. 
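The dependency between labels and DLP actions can be sketched in a few lines. This is a minimal illustration, not Cyera's or any specific DLP product's API; the label names and ranking are assumptions for the example.

```python
# Hypothetical sensitivity ranking, from the label examples earlier in the article.
SENSITIVITY_RANK = {"public": 0, "internal only": 1, "sensitive": 2, "very sensitive": 3}

def dlp_allows_copy(label: str, destination_approved: bool) -> bool:
    """Illustrative DLP rule: block copies of high-risk data to unapproved environments."""
    high_risk = SENSITIVITY_RANK.get(label, 0) >= SENSITIVITY_RANK["sensitive"]
    return destination_approved or not high_risk

# Correctly labeled high-risk data is blocked from an unapproved destination...
assert not dlp_allows_copy("very sensitive", destination_approved=False)
# ...but the same data mislabeled "public" slips through, which is the failure mode described above.
assert dlp_allows_copy("public", destination_approved=False)
```

The point of the sketch is that the policy logic itself can be sound; a wrong label upstream still defeats it.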

Similar to DLP policies, DAG policies use sensitivity or classification labels as conditions for action. DAG policies determine who should have access to the data and how the data should be obfuscated, whether left in plaintext or encrypted. For example, you can set a DAG policy to encrypt highly sensitive data and restrict access to the department that owns the data. When a sensitivity label is wrong, the obfuscation method and access controls for the data will not be applied correctly. 
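A DAG rule can be sketched the same way. Again, this is an illustrative stand-in, not a real product API; the label and department names are hypothetical.

```python
def dag_decision(label: str, requester_dept: str, owner_dept: str) -> dict:
    """Illustrative DAG rule: encrypt highly sensitive data and restrict
    access to the department that owns it; leave other data open."""
    if label == "very sensitive":
        return {
            "access_granted": requester_dept == owner_dept,
            "obfuscation": "encrypt",
        }
    return {"access_granted": True, "obfuscation": "plaintext"}

# The owning department gets (encrypted) access; other departments are denied.
assert dag_decision("very sensitive", "finance", "finance")["access_granted"]
assert not dag_decision("very sensitive", "marketing", "finance")["access_granted"]
# If the same data were mislabeled "internal only", it would be served in plaintext to anyone.
assert dag_decision("internal only", "marketing", "finance")["obfuscation"] == "plaintext"
```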

By ensuring that classification is easy to implement, executes quickly, and results in highly accurate data classes, Cyera helps businesses keep up with the pace of change in the cloud.

Deep context

Context can be broken down into four categories: data, surface, controls, and risk. 

  • Data context – tells us the characteristics that define the data
  • Surface context – tells us about the environment where the data is stored
  • Controls context – tells us what protection is in place to ensure security and integrity
  • Risk context – tells us the frameworks that regulate the data 
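The four categories above could be grouped into a single record per dataset. This is a hypothetical shape for illustration only, not how any particular platform models context.

```python
from dataclasses import dataclass, field

@dataclass
class DataContextProfile:
    """Illustrative grouping of the four context categories for one dataset."""
    data: dict = field(default_factory=dict)      # characteristics of the data itself
    surface: dict = field(default_factory=dict)   # where the data is stored
    controls: dict = field(default_factory=dict)  # protections in place
    risk: dict = field(default_factory=dict)      # regulatory frameworks that apply

profile = DataContextProfile(
    data={"classes": ["partial_credit_card", "password_hash"]},
    surface={"cloud": "external", "store_type": "backup"},
    controls={"protection": "hashed", "is_backup": True},
    risk={"frameworks": ["PCI DSS"]},
)
```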

Let’s explore how deep data context, broken into these categories, informs our security posture, using recent data breach examples. 

Example 1: News Corp 2022

Hackers targeted News Corp journalists who covered contentious geo-political topics. The hackers had infiltrated the News Corp network for two years, giving them ample opportunity to conduct reconnaissance and identify vulnerabilities. Dozens of employees had their PII compromised. 

Data Context:

  • Data subject role: Hackers specifically targeted employees.
  • Residency: Journalists in certain regions, say the U.S. or Taiwan, were likely targeted.
  • Identifiability: Identifiability tells us whether the compromised data can be linked to a specific individual; if so, it is more valuable to hackers. Some data classes, such as first name, gender, or age, may be considered sensitive, but in isolation do not link to a specific individual. 
  • Uniqueness: Context reveals data classes unique to a business. For example, a journalist’s “topic areas” is a data class likely unique to News Corp and other mass media companies.

Example 2: Bonobos, a Walmart subsidiary 2021

Hackers gained access to a backup database in an external cloud environment, stealing a 70GB SQL file containing customer addresses, partial credit card numbers, and password histories. 

Controls context:

  • Protection method: Thankfully, only the last four digits of the credit card numbers were stored and the passwords were hashed. Context tells us if the data was redacted, encrypted, transformed by another method, or exposed as plaintext.
  • Backups: Hackers infiltrated a backup. Context reveals the existence of backups and whether or not those backups contain sensitive data.

Risk context:

  • Regulatory risks: PCI compliance puts forth requirements around the secure storage of credit card information.

Example 3: Capital One 2019

A hacker scanned for misconfigured AWS accounts, gained access, and downloaded the data. She stole over 140,000 Social Security numbers and caused $250M worth of damages. 

Controls context:

  • Access: The hacker targeted misconfigured accounts that likely had overly permissive access, making them an easy target.

Data context:

  • Toxic combinations: Social Security numbers with linked bank account details were stolen. The combination of the two data classes increases the likelihood that the data could be used for fraudulent activities.

Surface context:

  • Cloud deployment: The hacker scanned AWS data stores.
  • Environment type and volume of data: The hacker likely conducted reconnaissance to determine the most high-value targets, which were likely production environments with large volumes of sensitive data.

Dynamic identification

If data is fluid, then our understanding of data and its risks must also be fluid. Yesterday’s data classification provides a static description of data: if that data was labeled as non-sensitive, then it remains non-sensitive despite the changes to data and its environment. 

Dynamic identification provides an extremely high degree of accuracy in our understanding of data because it registers changes to data by analyzing the relationships among data classes within a dataset. For example:

  • “first name” alone does not link to an individual
  • “first name” + “last name” + “age” combined link to an individual
  • “first name” + “last name” + “age” + “social security number” makes the data confidential or private

By identifying when proximity among data classes creates private, sensitive information, dynamic identification helps us prioritize issues for defense and compliance assurance: it tells us about the changing risk levels of data and pinpoints the toxic data combinations with the highest potential for misuse.
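The escalation in the bullets above can be sketched as a rule over the set of data classes present in a dataset. This is an assumed, simplified decision rule for illustration; a real classifier would consider far more classes and combinations.

```python
def dataset_risk(columns: set) -> str:
    """Illustrative rule: a dataset's risk depends on which data classes co-occur,
    mirroring the first-name / last-name / SSN progression described above."""
    # An SSN alongside any name field makes the dataset confidential.
    if "social_security_number" in columns and columns & {"first_name", "last_name"}:
        return "confidential"
    # A full name (plus other attributes) links the data to an individual.
    if {"first_name", "last_name"} <= columns:
        return "identifiable"
    return "non-identifying"

assert dataset_risk({"first_name"}) == "non-identifying"
assert dataset_risk({"first_name", "last_name", "age"}) == "identifiable"
assert dataset_risk({"first_name", "last_name", "age", "social_security_number"}) == "confidential"
```

Note that adding or removing a single column changes the whole dataset's risk, which is why a one-time, static label goes stale.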

Yesterday’s data classification fails to capture the nuances of data: is it customer PII or employee PII? Regular-expression-based solutions from DLP and other providers can easily misrepresent data, labeling individual data classes like email or name as PII because they lack the context to decipher what is truly PII. For example, a personal email address is PII, but a corporate email address is not. 
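The email example makes the gap concrete. In this sketch, the regex and the notion of a known set of corporate domains are both assumptions for illustration; real context-aware classification would draw on much richer signals.

```python
import re

# A deliberately simple email pattern, standing in for a DLP-style rule.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def regex_is_pii(value: str) -> bool:
    """A pure pattern match flags every email address as PII."""
    return bool(EMAIL_RE.fullmatch(value))

def context_is_pii(value: str, corporate_domains: set) -> bool:
    """With context (here, an assumed list of the organization's own domains),
    corporate addresses can be excluded from PII."""
    if not EMAIL_RE.fullmatch(value):
        return False
    domain = value.rsplit("@", 1)[1].lower()
    return domain not in corporate_domains

# The regex approach treats both addresses identically...
assert regex_is_pii("jane@acme-corp.com") and regex_is_pii("jane@gmail.com")
# ...while the context-aware check distinguishes corporate from personal email.
assert not context_is_pii("jane@acme-corp.com", {"acme-corp.com"})
assert context_is_pii("jane@gmail.com", {"acme-corp.com"})
```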

Cyera is the only data security platform to give you dynamic identification that results in advanced PII detection, providing you with a full picture of PII across your data landscape and more accurate visibility, risk management and compliance reporting. This empowers you to both address your data security posture and operationalize effective incident response.

Getting Past Yesterday

There are a lot of claims about which technology vendors can actually deliver Advanced PII Classification. 

Here are key questions to ask when looking for Advanced PII Classification: 

  • What are concrete examples of context that the technology can reveal?
  • How quickly can data be classified?
  • How do you detect and classify data that I don’t know about already? 
  • How accurate are the classification outputs?
  • And can you show me in under 5 minutes?

Or simply ask, “what would you say, you do here (for context to data)?”

Moving Forward with Advanced PII Classification

Cyera’s data security platform provides deep context on your data, applying correct, continuous controls to assure cyber-resilience and compliance.

Cyera takes a data-centric approach to security, assessing the exposure to your data at rest and in use and applying multiple layers of defense. Because Cyera applies deep data context holistically across your data landscape, we are the only solution that can empower security teams to know where their data is, what exposes it to risk, and take immediate action to remediate exposures and assure compliance without disrupting the business. 

See what data classes and context Cyera can reveal about your environment by scheduling a demo today.