4 Reasons Data Discovery and Classification in DLP Is Broken (And How to Fix It)

Aug 1, 2023

May 15, 2024

Jonathan Sharabi

The built-in data foundation claimed by Data Loss Prevention (DLP) technology is broken. Although there are many types of enterprise DLP tools (email, endpoint, cloud, and network, among others), they aren’t designed to intelligently understand your sensitive data landscape. As a result, DLP tools generate too many false positive alerts, because their policies are tied to an incomplete or inaccurate understanding of the data.

Without a solid data foundation that tells you what data you have and what it represents, it’s impossible to write effective DLP policies that actually prevent sensitive data from leaving secure environments. This constrains both business activities and IT resources.

In this post, we’ll discuss how Cyera can help you gain the data transparency you need to operationalize DLP policies to more effectively protect sensitive data.

Why Discovery and Classification in DLP Tools Doesn't Work

1. Do-It-Yourself Data Discovery

The built-in data discovery functionality of DLP tools, at best, identifies the datastores you already know about. It does not uncover unknown datastores, because it requires admins to point it at a known source and maintain that connection, sometimes via agents. Only then can you initiate scanning. Most data discovery functionality does not automatically detect when data has been added to or changed in a datastore. To capture these updates, admins must therefore either manually trigger the next scan or schedule one for some point in the future.

2. Rule-Based Data Classification

DLP tools have a limited understanding of where data is and how sensitive it is. This is because the built-in data classification functionality is rule- or regular-expression-based. It’s not uncommon for users to build a handful of rules to try to identify a single classifier, such as “employee ID,” but still only correctly identify “employee ID” in roughly half of the tables analyzed. This incomplete or inaccurate data classification leads to ineffective protection, false positives, and other data security challenges.
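To make the limitation concrete, here is a minimal sketch of rule-based classification. The ID formats, table names, and the 80% match threshold are all hypothetical, invented for illustration; they do not describe any particular DLP product. In this scenario every column really does hold employee IDs, but the rules only anticipate two of the four formats.

```python
import re

# Hypothetical rules an admin might write to find "employee ID" values.
# The formats are invented for illustration only.
EMPLOYEE_ID_RULES = [
    re.compile(r"EMP-\d{6}"),  # e.g. EMP-004213
    re.compile(r"E\d{7}"),     # e.g. E0042131
]


def looks_like_employee_id(value: str) -> bool:
    """Return True if any rule matches the whole value."""
    return any(rule.fullmatch(value) for rule in EMPLOYEE_ID_RULES)


def classify_column(values: list[str], threshold: float = 0.8) -> bool:
    """Label a column as employee IDs if enough of its values match a rule."""
    if not values:
        return False
    hits = sum(looks_like_employee_id(v) for v in values)
    return hits / len(values) >= threshold


# Hypothetical sample columns; all four hold employee IDs in this scenario.
tables = {
    "hr.staff": ["EMP-004213", "EMP-004214"],
    "payroll.runs": ["E0042131", "E0042140"],
    "badge.access": ["4213", "4214"],        # bare numbers: no rule matches
    "legacy.crm": ["emp_4213", "emp_4214"],  # older format: no rule matches
}

detected = {name: classify_column(vals) for name, vals in tables.items()}
```

Only the first two columns are labeled. Closing the gap means writing and maintaining another rule for every format that turns up, and a looser rule (for example, any four-digit number) would start matching values that are not employee IDs at all.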

The most common ways DLPs implement data classification are:

Data classes, or short descriptions of the data, such as emails, names, and addresses. These labels often lack additional context about the data itself.
Sensitivity labels, such as restricted, internal, public, and other descriptions. These labels can quickly become confusing or inconsistent if there’s not proper governance in place.
Static analysis, in which data is scanned and analyzed at one point in time, so the results quickly become stale. Active datastores are constantly being updated and data itself is fluid. When data discovery or classification does not automatically detect changes, you end up with an outdated view of your data.
Microsoft Information Protection (MIP) tags, which are sensitivity labels specific to Microsoft platforms. MIP tags are only effective if they’re consistently applied to data, but the process is typically very manual and requires constant human oversight and tuning.

The problem is that data classes are often incomplete or inaccurate, and sensitivity labeling is rarely comprehensive. By relying on legacy methods of data classification, DLP requires humans to manually review and validate data labels on a regular basis. Because classification is slow while data is generated and changed rapidly, the classification systems most DLPs use provide an outdated and incomplete picture of data.

3. Lack of Context

DLPs also lack the context surrounding data, which means their policies end up treating all data the same way, without considering additional information about it. This leads to a lot of false positives, where security teams receive alerts for data leak violations that aren’t actually an issue. In some cases, a DLP without proper context can also block actions like data sharing that should actually be allowed. It also results in false negatives, where toxic combinations of data (two or more pieces of information that together reveal the identity of an individual, even though no single piece does) aren’t recognized, leaving the business exposed to cyber resilience and compliance threats.

Deep context includes characteristics that define the data, information about the environment where it’s stored, the controls in place to ensure data security and integrity, and the framework that regulates the data. This deep context about data is required to implement DLP policies that are better aligned with real-world risks in different circumstances.

For example, there might not be any need to block synthetic data from being shared, but sharing real customer data could violate an internal policy. Sensitive data can also have a different risk profile depending on whether it’s stored as plaintext or masked. In these situations, the same data could have different DLP policies based on additional context.
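As a rough sketch of what that could look like, the function below decides whether a share should go ahead based on context attached to the data rather than on the data class alone. The `DataContext` fields and the three outcomes ("allow", "allow_with_audit", "block") are assumptions made for this example, not the policy model of any specific DLP tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContext:
    """Hypothetical context attached to a dataset by a classification step."""
    data_class: str     # e.g. "customer_record"
    is_synthetic: bool  # generated test data rather than real customer data
    is_masked: bool     # stored masked rather than as plaintext


def share_decision(ctx: DataContext) -> str:
    """Return an illustrative policy outcome for an outbound share."""
    if ctx.is_synthetic:
        return "allow"             # no real individuals are exposed
    if ctx.is_masked:
        return "allow_with_audit"  # lower risk, but keep a record
    return "block"                 # real plaintext data stays inside


synthetic = DataContext("customer_record", is_synthetic=True, is_masked=False)
masked = DataContext("customer_record", is_synthetic=False, is_masked=True)
plaintext = DataContext("customer_record", is_synthetic=False, is_masked=False)

decisions = [share_decision(c) for c in (synthetic, masked, plaintext)]
```

All three datasets carry the same data class, so a policy keyed on class alone would have to treat them identically: either blocking the synthetic data (a false positive) or allowing the plaintext data (a false negative).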

4. Limited User Insights

DLPs don’t usually know who has access to certain data and why they need it. If security teams have limited user insights, it’s difficult to define all the different ways data could be accessed and then write DLP policies to secure them. This means DLPs might be overly restrictive and prevent users from doing their job, or not restrictive enough and allow sensitive information to leave the organization.

Without context about why users are accessing data, many DLP policies are either blockers to the business or ineffective at preventing data loss. When employees can’t transfer or access the data they need, they’ll constantly turn to security teams for permission or find ways to circumvent the DLP restrictions altogether.

How to Effectively Operationalize Your DLP

As you can see, DLP tools have limitations that prevent them from doing what they’re supposed to do: prevent sensitive data from leaving secure environments. Here’s how you can effectively operationalize your DLP tool with Cyera’s Data Security Platform.

Automatically generate a view of sensitive data:
Cyera can automatically discover your data sources across different datastores, including cloud storage buckets, databases, containers, virtual machines, and SaaS applications. This ensures you have a complete view of your data, even in fragmented and sprawling cloud environments.

Then Cyera will classify your data based on classes, sensitivity, and MIP with a deep contextual understanding. This creates a holistic picture of your data, laying the foundation for more optimal DLP policies.

Write DLP policies that address the intended data:
Once you have a deep understanding of your data, you can write policies that fit your specific data requirements in a way that reduces false positives. You’ll be able to create DLP policies that consider where the data is stored, the sensitivity of the data, the context in which it’s being accessed, and other factors.

Optimize DLP policies by monitoring changes to data:
Your data is constantly changing, but Cyera will continue to scan your datastores and adapt to reflect its current state. This overcomes the limitations of DLP data discovery and classification processes that require manual effort and are slow to adapt to changes.

Tuning your DLP policies to current data minimizes false positives and alert fatigue in the long run. If data becomes more sensitive over time, you can implement more restrictive DLP policies. You’ll also be able to protect new datastores as soon as they’re created to stay ahead of potential security risks.
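The general idea of change-driven rescanning can be sketched in a few lines. This is a generic illustration, not a description of how Cyera or any DLP product works internally: the metadata fields, store names, and the fingerprinting approach are all assumptions. It compares a fingerprint of each datastore's metadata against the one recorded at the last scan and flags stores that are new or have changed.

```python
import hashlib
import json


def fingerprint(metadata: dict) -> str:
    """Hash a datastore's metadata in a stable way (sorted keys)."""
    encoded = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def stores_to_rescan(previous: dict[str, str], current: dict[str, dict]) -> list[str]:
    """Return names of datastores that are new or whose metadata changed."""
    return sorted(
        name
        for name, meta in current.items()
        if previous.get(name) != fingerprint(meta)
    )


# Hypothetical state recorded at the last scan.
last_scan = {
    "orders_db": fingerprint({"tables": ["orders"], "rows": 1000}),
    "archive_db": fingerprint({"tables": ["old_orders"], "rows": 500}),
}

# Hypothetical inventory observed now: one store changed, one is new.
inventory = {
    "orders_db": {"tables": ["orders", "refunds"], "rows": 1400},
    "archive_db": {"tables": ["old_orders"], "rows": 500},
    "new_bucket": {"objects": 12},
}

changed = stores_to_rescan(last_scan, inventory)
```

Here `changed` contains `new_bucket` and `orders_db`, while the untouched `archive_db` is skipped. Classification can then be rerun only where something actually moved, instead of waiting for the next manually triggered or scheduled scan.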

Improving Data Security with Cyera

The key “data piece” within DLP technology might be broken, but Cyera can help you fix it. By automatically building an up-to-date data foundation with Cyera, you’ll be able to write more effective DLP policies that reduce false positives and increase data security.

Besides enabling DLPs to better protect data, Cyera has additional data security capabilities that improve cyber-resilience and compliance. The platform takes a data-centric approach to security, assessing the exposure to your data at rest and in use and applying multiple layers of defense. Cyera’s data security posture management (DSPM) capabilities highlight data protection issues and prioritize appropriate actions to mitigate them.

Because Cyera applies deep data context holistically across your data landscape, we are the only solution that can empower security teams to know where their data is and what exposes it to risk, and to take immediate action to remediate exposures and assure compliance without disrupting the business.

Want to learn more about using Cyera to build a better data foundation for DLP? Contact us or schedule a demo today.
