The pixel



CSP

An acronym for Cloud Service Provider. A CSP is any company that sells a cloud computing service, whether PaaS, IaaS, or SaaS.


CUI

An acronym for Controlled Unclassified Information.

CUI is information created or owned by the US government that requires safeguarding. While CUI is not classified information, it is considered sensitive. CUI is governed under a number of government policies and frameworks, including the Department of Defense Instruction (DoDI) 5200.48 and the Cybersecurity Maturity Model Certification (CMMC). According to DoDI 5200.48, safeguarding CUI is a shared responsibility between Defense Industrial Base contractors and the Department of Defense.

Cloud Native Database

A database service which is deployed and delivered through a cloud service provider (CSP) platform.

Data Catalog

An organized inventory of data assets in the organization. Data catalogs use metadata to help organizations manage their data. They also help data professionals collect, organize, access, and enrich metadata to support data discovery and governance.

Data Categorization

The process of dividing data into groups of entities whose members are in some way similar to each other. Data privacy and security professionals can then categorize that data as high-, medium-, or low-sensitivity data.

Data Class

A definition that allows each type of data in a data store to be programmatically detected, typically using a test or algorithm. Data privacy and security professionals associate data classes with rules that define the actions to take when a given data class is detected. For example, sensitive information or PII should be tagged with a business term or classification, and some sensitive data classes should also have a specific data quality constraint applied.
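As a minimal sketch of this idea, the Python below pairs each data class with a detection test and a rule to apply when the class is detected. The class names, the tag strings, and the detection tests (a Luhn checksum for card numbers and a deliberately loose email pattern) are illustrative assumptions, not a description of any particular product.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DataClass:
    name: str
    detect: Callable[[str], bool]  # test that decides whether a value belongs to the class
    action: str                    # rule to apply when the class is detected (hypothetical tags)


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, which payment card numbers satisfy."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_card_number(value: str) -> bool:
    digits = re.sub(r"[ -]", "", value)  # ignore common separators
    return bool(re.fullmatch(r"\d{13,19}", digits)) and luhn_valid(digits)


def is_email(value: str) -> bool:
    # Loose pattern for illustration only; real email validation is more involved.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None


DATA_CLASSES = [
    DataClass("payment_card", is_card_number, "tag:PCI"),
    DataClass("email_address", is_email, "tag:PII"),
]


def classify(value: str) -> list[tuple[str, str]]:
    """Return (data class, action) for every data class whose test matches."""
    return [(dc.name, dc.action) for dc in DATA_CLASSES if dc.detect(value)]


print(classify("4111 1111 1111 1111"))
print(classify("alice@example.com"))
```

Keeping the test and the rule together in one record means that adding a new data class is a one-line change, which is one reasonable way to keep detection logic and policy in sync.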

Data Classification

Data classification is the process of organizing data into relevant categories to make it simpler to retrieve, sort, use, store, and protect.

A data classification policy, properly executed, makes the process of finding and retrieving critical data easier. This is important for risk management, legal discovery, and regulatory compliance. When creating written procedures and guidelines to govern data classification policies, it is critical to define the criteria and categories the organization will use to classify data.

Data classification can make data easier to search and track. This is achieved by tagging the data. Data tagging allows organizations to label data clearly so that it is easy to find and identify. Tags also help you to manage data better and identify risks more readily. Tagged data can also be processed automatically, which helps ensure the timely and reliable access to data that some state and federal regulations require.

Most data classification projects help to eliminate duplication of data. By discovering and eliminating duplicate data, organizations can reduce storage and backup costs as well as reduce the risk of confidential data or sensitive data being exposed in case of a data breach.
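One common way to discover exact duplicates is to group files by a hash of their contents. The sketch below assumes the file contents are already in memory as bytes, and the file names are hypothetical; a real tool would stream files from storage and would also need a strategy for near-duplicates.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group paths whose contents have the same SHA-256 digest."""
    by_digest: dict[str, list[str]] = defaultdict(list)
    for path, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(path)
    # Only digests shared by more than one path indicate duplicated content.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


files = {
    "a.csv": b"id,name\n1,Ann\n",
    "backup/a.csv": b"id,name\n1,Ann\n",  # byte-for-byte copy of a.csv
    "b.csv": b"id,name\n2,Bob\n",
}
print(find_duplicates(files))
```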

Specifying data stewardship roles and responsibilities for employees inside the organization is part of data classification systems. Data stewardship is the tactical coordination and implementation of an organization's data assets, while data governance focuses on more high-level data policies and procedures.

The Purpose of Data Classification

Data classification increases data accessibility, enables organizations to meet regulatory compliance requirements more easily, and helps them to achieve business objectives. Often, organizations must ensure that data is searchable and retrievable within a specified timeframe. This requirement is impossible to meet without robust processes for classifying data quickly and accurately.

Data classification is essential to meeting data security objectives. It facilitates appropriate security responses based on the types of data being retrieved, copied, or transmitted. Without a data classification process, it is challenging to identify and appropriately protect sensitive data.

Data classification provides visibility into all data within an organization and enables it to use, analyze, and protect the vast quantities of data available through data collection. Effective data classification facilitates better protection for such data and promotes compliance with security policies.

Challenges with Legacy Data Classification Tools

Data classification tools are intended to provide data discovery capabilities; however, they often analyze data stores only for metadata or well-known identifiers. In complex environments, data discovery is ineffective if it can discover only dates but cannot identify whether they are a date of birth, a transaction date, or the dateline of an article. Without this additional information, these discovery tools cannot identify whether data is sensitive and therefore needs protection.

“The best DSPs will have semantic and contextual capabilities for data classification — judging what something really is, rather than relying on preconfigured identifiers.” Gartner: 2023 Strategic Roadmap for Data Security Platform Adoption

Modern data security platforms must include semantic and contextual capabilities for data classification, to identify what a piece of data is rather than using preconfigured identifiers, which are less accurate and reliable. Because organizations are increasing the use of cloud computing services, more sensitive data is now in the cloud. However, a lot of the sensitive data is unstructured, which makes it harder to secure.
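The date example can be made concrete with a small sketch. A pattern alone can only say "this is a date"; the surrounding context is what distinguishes a date of birth from a transaction date. The keyword hints and labels below are hypothetical, and a keyword lookup on field names is a much simpler stand-in for the semantic and contextual analysis described above.

```python
import re

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

# Hypothetical context hints: substrings of a field name that suggest what a date means.
CONTEXT_HINTS = {
    "date_of_birth": ("birth", "dob", "born"),
    "transaction_date": ("transaction", "txn", "purchase", "payment"),
    "publication_date": ("published", "dateline", "posted"),
}
SENSITIVE_LABELS = {"date_of_birth"}


def classify_date(field_name: str, value: str) -> tuple[str, bool]:
    """Return (label, is_sensitive) for a value, using the field name as context."""
    if not DATE_PATTERN.fullmatch(value):
        return ("not_a_date", False)
    name = field_name.lower()
    for label, hints in CONTEXT_HINTS.items():
        if any(hint in name for hint in hints):
            return (label, label in SENSITIVE_LABELS)
    # The pattern matched, but there is no context to say what kind of date it is.
    return ("unknown_date", False)


print(classify_date("customer_dob", "1990-04-12"))
print(classify_date("txn_date", "2023-01-05"))
print(classify_date("created", "2023-01-05"))
```

The last call shows the limitation the text describes: without context, a tool that only recognizes the pattern cannot decide whether the value needs protection.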

Data Classification Schemes

A data classification scheme enables you to identify security standards that specify appropriate handling practices for each data category. Storage standards that define the data's lifecycle requirements must be addressed as well. A data classification policy can help an organization achieve its data protection goals by applying data categories to external and internal data consistently.

Data Discovery

Data discovery and inventory tools help organizations identify resources that contain high-risk data and sensitive data on endpoints and corporate network assets. These tools help organizations identify the locations of both sensitive structured data and unstructured data by analyzing hosts, database columns and rows, web applications, file shares, and storage networks.

Types of Data Classification

Tagging or applying labels to data helps to classify data. This is an essential part of the data classification process. These tags and labels define the type of data, the degree of confidentiality, and the data integrity. The level of sensitivity is typically based on levels of importance or confidentiality, which aligns with the security measures applied to protect each classification level. Industry standards for data classification include three types:

  • Content-based classification, which inspects the contents of files for sensitive information (such as financial records and personally identifiable information).
  • Context-based classification, which analyzes data based on the location, application, creator, and so on, as indirect indicators of sensitive information.
  • User-based classification, which requires user knowledge and discretion to decide whether to flag sensitive documents during creation, editing, review, or distribution.

While each approach has a place in data classification, user-based classification is manual, time-consuming, and error-prone. It is not effective at categorizing data at scale and may put protected data and restricted data at risk.

Data Sensitivity and Risk

It is important for data classification efforts to include the determination of the relative risk associated with diverse types of data, how to manage that data, and where and how to store and send that data. There are three broad levels of risk for data and systems:

  • Low risk: Public data that is easy to recover is a good example of low-risk data. This level covers any information that can be used, reused, and redistributed freely without local, regional, national, or international restrictions on access or usage. Within an organization, this data includes job descriptions, publicly available marketing materials, and press releases or articles.
  • Moderate risk: If data is not public or is used internally only, but is not critical to operations or sensitive, it may be classified as moderate risk. Company documentation, non-sensitive presentations, and operating procedures may fall into this category.
  • High risk: If the data or system is sensitive or critical to operational security, it belongs in the high-risk category. In addition, any data that is difficult to recover is considered high risk. Any confidential data, sensitive data, internal-only data, and necessary data also fall into this category. Examples include social security numbers, driver's license numbers, bank and debit account information, and other highly sensitive data.
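The three levels above can be sketched as a small decision function. The attribute names are assumptions made for illustration, and the function encodes only the leading condition stated for each level; a real policy would weigh more factors.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MODERATE = "moderate"
    HIGH = "high"


@dataclass
class Asset:
    # Hypothetical attributes describing a data set or system.
    name: str
    is_public: bool
    is_sensitive: bool = False
    is_critical: bool = False
    hard_to_recover: bool = False


def risk_level(asset: Asset) -> Risk:
    # High risk: sensitive, critical to operations, or difficult to recover.
    if asset.is_sensitive or asset.is_critical or asset.hard_to_recover:
        return Risk.HIGH
    # Low risk: public data that is easy to recover.
    if asset.is_public:
        return Risk.LOW
    # Moderate risk: not public, but neither sensitive nor critical.
    return Risk.MODERATE


print(risk_level(Asset("press release", is_public=True)))
print(risk_level(Asset("operating procedures", is_public=False)))
print(risk_level(Asset("payroll records", is_public=False, is_sensitive=True)))
```

Checking the high-risk conditions first means that a public but hard-to-recover data set is still treated as high risk, which matches the ordering of the definitions above.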

Automated Data Classification

Automated tools can perform classification that defines personal data and highly sensitive data based on defined data classification levels. A platform that includes a classification engine can identify data stores that contain sensitive data in any file, table, or column in an environment. It can also provide ongoing protection by continuously scanning the environment to detect changes in the data landscape. New solutions can identify sensitive data and where it resides, as well as apply the context-based classification needed to decide how to protect it.

Data Classification Examples

Classifying data as restricted, private, or public is an example of data classification. As with risk levels, public data is the least sensitive and has the lowest security requirements. Restricted data receives the highest security classification and includes the most sensitive data, such as health data. A successful data classification process extends to include additional identification and tagging procedures to ensure data protection based on data sensitivity.

Why Data Classification Is Important

Security and risk leaders can only protect sensitive data and intellectual property if they know the data exists, where it is, why it is valuable, and who has access to use it. Data classification helps them to identify and protect corporate data, customer data, and personal data. Labeling data appropriately helps organizations to protect data and prevent unauthorized disclosure.

The General Data Protection Regulation (GDPR), among other data privacy and protection regulations, increases the importance of data classification for any organization that stores, transfers, or processes data. Classifying data helps ensure that anything covered by the GDPR is quickly identified so that appropriate security measures are in place. GDPR also increases protection for personal data related to racial or ethnic origin, political opinions, and religious or philosophical beliefs, and classifying these types of data can help to reduce the risk of compliance-related issues.

Organizations must meet the requirements of established frameworks such as the GDPR, the California Consumer Privacy Act (CCPA), the Health Insurance Portability and Accountability Act (HIPAA), the Payment Card Industry Data Security Standard (PCI DSS), the Gramm-Leach-Bliley Act (GLBA), and the Health Information Technology for Economic and Clinical Health (HITECH) Act. To do so, they must evaluate sensitive structured and unstructured data posture across Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) environments and contextualize risk as it relates to security, privacy, and other regulatory frameworks.

Data Security Posture Management (DSPM)

Data is every business’s most crucial asset – the foundation of any security program. Data Security Posture Management (DSPM) is an emerging security trend named by Gartner in its 2022 Hype Cycle for Data Security. The aim of DSPM solutions is to enable security and compliance teams to answer three fundamental questions:

  • Where is our sensitive data?
  • What sensitive data is at risk?
  • How can we take action to remediate that risk?

The cloud has fundamentally changed how businesses function. Moving workloads and data assets is now simpler than ever, and is a boon for productivity, enabling businesses to quickly respond to customer demands and create new revenue opportunities. However, the pace and permissive nature of the cloud also dramatically expands a company’s threat surface and raises the likelihood of a data breach. Put simply, the distributed nature of the cloud seriously complicates data security.

Historically, a number of technologies have attempted to address challenges related to data security, including:

  • Data Discovery and Classification
  • Data Loss Prevention (DLP)
  • Data Access Governance (DAG)

DSPM solutions combine capabilities from all three of these areas and represent the next-generation approach in cloud data security.

DSPM represents a next-generation approach to data security

DSPM vendors are taking a cloud-first approach to make it easier to discover, classify, assess, prioritize, and remediate data security issues. They are solving cloud security concerns by automating data detection and protection activities in a dynamic environment and at a massive scale.

Gartner Research summarizes the DSPM space, saying, “Data security posture management provides visibility as to where sensitive data is, who has access to that data, how it has been used, and what the security posture of the data store or application is. In simple terms, DSPM vendors and products provide ‘data discovery+’ — that is, in-depth data discovery plus varying combinations of data observability features. Such features may include real-time visibility into data flows, risk, and compliance with data security controls. The objective is to identify security gaps and undue exposure. DSPM accelerates assessments of how data security posture can be enforced through complementary data security controls.”

The foundation of a DSPM offering is data discovery and classification. Reports like Forrester’s Now Tech: Data Discovery And Classification, Q4 2020 dive deep into data discovery and classification technologies, which Forrester divides into five segments: data management, information governance, privacy, security, and specialist concerns. These segments align with three major buying centers: global risk and compliance, security, and business units/product owners.

DSPM focuses on delivering automated, continuous, and highly accurate data discovery and classification for security teams. The following list shows how these segments align with the buying centers, all of which need data discovery and classification but use it for different purposes:

  • Global Risk and Compliance Teams, including governance, IT, and privacy groups, use:
      • Data management, which prepares data for use and typically supports efforts like data governance, data quality, and accuracy, as well as data mapping and lineage analysis.
      • Information governance, which supports data lifecycle management and helps with ROT (redundant, obsolete, trivial) reduction, cloud migration, storage reduction and infrastructure optimization, and data lifecycle requirements like retention, deletion, and disposition.
      • Privacy, which facilitates privacy processes and compliance, helps to fulfill data subject access requests (DSARs) such as data access or deletion requests, tracks cross-border data transfers, and manages privacy processes to support requirements like CCPA and GDPR.
  • Security Teams aim to understand data so that they can apply controls, develop a resilient posture, minimize their threat surface, and improve ransomware resilience. They use:
      • Data Loss Prevention (DLP), which enables teams to take actions to protect their data and enforce security policies.
      • Data Access Governance (DAG), which focuses on the implementation of data security access policies for unstructured data.
      • Tokenization and Format-Preserving Encryption (FPE) solutions, which aim to protect sensitive data or create a deidentified copy of a dataset.
  • Specialists correspond to business units or product owners. Products that appeal to this buying center can include an emphasis on user-driven classification labels, or identification of specific types of intellectual property like source code or sensitive data like non-disclosure agreements.

Posture management solutions abound

Today there are three prevailing types of security tools that offer posture management: cloud security posture management (CSPM), SaaS security posture management (SSPM), and data security posture management (DSPM). The solutions can be differentiated as follows:

  • CSPM focuses on the cloud infrastructure, seeking to provide cloud assets visibility and alerts on risky misconfigurations. 
  • SSPM identifies misconfigurations, unnecessary user accounts, excessive user permissions, compliance risks, and other security issues in SaaS applications.
  • DSPM focuses on the data itself and its application context by analyzing data both at rest and in motion, classifying the data for its sensitivity, such as PII, PHI, and financial information, and providing remediation guidance as well as workflows to automatically close security gaps. 

While DSPM solutions have focused on a cloud-first approach, data security is not limited to cloud environments. Therefore, more mature DSPM solutions will also include on-prem use cases, since most businesses maintain some form of on-prem data and will for years to come. In addition, as the DSPM space evolves and solutions gain maturity, some will become more robust data security platforms, which will include the ability to:

  • Discover and classify sensitive data
  • Reduce the attack surface
  • Detect and respond to data security issues
  • Automate risk remediation workflows
  • Maintain operational resilience and preparedness

DSPM solutions address key security use cases

Businesses thrive on collaboration. The current reality of highly distributed environments - many of which leverage cloud technologies - means that any file or data element can be easily shared at the click of a button. DSPM provides the missing piece to complete most security programs’ puzzles – a means of identifying, contextualizing, and protecting sensitive data.

DSPM solutions empower security teams to:

  • Understand the data an enterprise manages, and what’s at risk - agentless integration gives security teams immediate visibility into all of their data assets. DSPM solutions automatically classify and assess the security of an enterprise’s data, giving actionable insights to reduce risk.
  • Protect sensitive data from breaches and data leaks - proactive assessments of internet-facing exposure, and access permissions, coupled with detection and response capabilities, keep an enterprise’s most precious data assets safe from attack.
  • Anticipate threats and respond to attacks faster - intelligent machine learning algorithms eliminate cumbersome manual regular expression tuning, and learn the patterns of interaction between systems, users, and data, allowing detection of anomalous activity in real-time.
  • Empower distributed teams to leverage data, securely - user permission graphs highlight the sensitive data a given identity can access, which informs data access governance as well as facilitating access permission trimming, and enabling data to be shared safely.
  • Increase productivity by simplifying audits - continuously updated sensitive data inventories save time and effort when complying with subject access requests, as well as privacy and compliance audits by always knowing the data an enterprise has, where it is located, and who has access.
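The user permission graph mentioned in the list above can be illustrated with a small traversal. In this sketch the graph is a plain adjacency mapping from identities to groups, roles, and data stores; all node names are hypothetical, and real permission models (deny rules, conditions, inheritance) are considerably richer.

```python
from collections import deque


def reachable_sensitive(graph: dict[str, list[str]], sensitive: set[str], identity: str) -> set[str]:
    """Breadth-first search from an identity to every sensitive store it can reach."""
    seen = {identity}
    queue = deque([identity])
    found: set[str] = set()
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt in sensitive:
                found.add(nxt)
            queue.append(nxt)
    return found


# Hypothetical permission graph: an edge means "is a member of" or "is granted access to".
graph = {
    "alice": ["group:analysts"],
    "group:analysts": ["role:read-warehouse"],
    "role:read-warehouse": ["store:customers", "store:web-logs"],
    "bob": ["store:web-logs"],
}
sensitive = {"store:customers"}

print(reachable_sensitive(graph, sensitive, "alice"))
print(reachable_sensitive(graph, sensitive, "bob"))
```

Because access is often granted indirectly through groups and roles, following the graph transitively is what reveals that an identity can reach a sensitive store, which is the input needed for trimming permissions.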

Data Sprawl

Data sprawl refers to the significant quantities of data, or digital information, that many organizations create daily. Data is a valuable resource because it enables business leaders to make data-driven decisions about how to best serve their client base, grow their business, and improve their processes. However, managing vast amounts of data and so many data sources can be a serious challenge.

Large businesses, particularly enterprises, are generating a staggering amount of data due to the wide variety of software products in use, as well as newly introduced data formats, multiple storage systems in the cloud and in on-premises environments, and huge quantities of log data generated by applications. There is an overwhelming amount of data being generated and stored in the modern world.

Where Does Data Come From?

As organizations scale and increasingly use data for analysis and investigation, that data is being stored in operating systems, servers, applications, networks, and other technologies. Many organizations generate massive quantities of new data all day, every day, including:

  • Financial data, including data types such as bank transactions, web data, geolocation data, and credit card and point-of-sale transaction data from vendors.
  • Sales data, which may include revenue by salesperson, conversion rate, the average length of a sales cycle, average deal size, number of calls made, age and status of sales leads, loss rates, and number of emails sent.
  • Transactional data, which may include customer data, purchase order information, employee hours worked, insurance costs, insurance claims, shipping status, bank deposits, and withdrawals.
  • Social media, email, and SMS communications, which may include social media metrics, demographics, times of day, hashtags, subject matter, and content types.
  • Event data, which describes actions performed by entities (essentially, behavior data). It includes the action, the timestamp, and the state (information about entities related to the event). Event data is critical for performing analytics.

 These files and records are dispersed across multiple locations, which makes inventorying, securing, and analyzing all that data extremely difficult.  

How Does Data Sprawl Happen?

Data sprawl is the ever-expanding amount of data that organizations produce every day. The shift to the cloud amplifies it, because organizations can scale more rapidly and produce more and more data. New uses for big data continue to develop, increasing the amount of data stored in operating systems, servers, networks, applications, and other technologies.

Further complicating matters, databases, analytics pipelines, and business workflows have been migrating rapidly to the cloud, moving across multiple cloud service providers (CSPs) and across structured and unstructured formats. This shift to the cloud is ongoing, and new data stores are created all the time. Security and risk management (SRM) leaders are struggling to identify and deploy data security controls consistently in this environment.

"...unstructured data sprawl (both on-premises and hybrid/multi-cloud) is difficult to detect and control when compared to structured data."

Gartner, Hype Cycle for Data Security, 2022

Organizations generate new data every hour of every day. The customer data in customer relationship management (CRM) systems may also include financial data, which is also in an accounting database or enterprise resource planning (ERP) system. Sales data and transactional data may be in those systems as well, and siloed by different departments, branches, and devices. To get the benefits promised by data analytics, data analysts need to cross-reference multiple sources, which can make it difficult to reach accurate and informed decisions.

Ultimately, organizations need data to facilitate day-to-day workflows and generate analytical insights for smarter decision-making. The problem is that the amount of data organizations generate is spiraling out of control. According to a recent IDC study, the Global DataSphere is expected to more than double from 2022 to 2026. The Global DataSphere is a measure of how much new data is created, captured, replicated, and consumed each year, and the Enterprise DataSphere is growing twice as fast as the Consumer DataSphere.

Challenges of Data Sprawl

As organizations generate data at a faster pace, it is becoming harder to manage this information. Organizations might have data stored in various locations, making it hard to access business-critical information and generate accurate insights. Team members must cross-reference data in multiple formats from multiple sources, making analytics difficult. Managing dispersed information across different silos wastes time and money. Data may become corrupted during transmission, storage, and processing. Data corruption compromises the value of data, and the likelihood of corruption may increase alongside increasing data sprawl.

In addition, effort is wasted when employees duplicate data because they could not find it where they expected, which can also result in ghost data. This duplicate data is considered redundant. Other data may be obsolete (out of date) or trivial (not valuable for business insights). This excess data results in excessive resource utilization and increases cloud storage costs.
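A rough sketch of how redundant, obsolete, and trivial data might be labeled is shown below. The record fields, the one-year age threshold, and the use of file suffixes as a proxy for "trivial" are all assumptions chosen for illustration; what counts as obsolete or trivial is ultimately a business decision.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FileRecord:
    path: str
    digest: str          # content hash, assumed to be computed elsewhere
    last_modified: date


def label_rot(records, today, max_age_days=365, trivial_suffixes=(".tmp", ".bak")):
    """Label each record as redundant, obsolete, trivial, or keep."""
    seen_digests = set()
    labels = {}
    for record in records:
        if record.digest in seen_digests:
            labels[record.path] = "redundant"   # same content as an earlier record
        elif (today - record.last_modified).days > max_age_days:
            labels[record.path] = "obsolete"    # out of date by the assumed threshold
        elif record.path.endswith(trivial_suffixes):
            labels[record.path] = "trivial"     # assumed to carry no business value
        else:
            labels[record.path] = "keep"
        seen_digests.add(record.digest)
    return labels


records = [
    FileRecord("report.xlsx", "aaa", date(2023, 12, 1)),
    FileRecord("copy/report.xlsx", "aaa", date(2023, 12, 15)),
    FileRecord("old_plan.doc", "bbb", date(2020, 5, 1)),
    FileRecord("scratch.tmp", "ccc", date(2023, 12, 30)),
]
print(label_rot(records, today=date(2024, 1, 1)))
```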

Employees may be handling data carelessly, not understanding how the way they share and handle data can introduce risk. Unauthorized users may also have access to sensitive information, particularly when the data produced and stored is not appropriately managed. Manually classifying data is time-consuming and error-prone and may increase the risk of sensitive data exposure, so finding automated solutions is essential for managing large stores of data.  

Data sprawl compromises data value and presents significant security risks. Too much data is difficult to control, which increases the chances of data breaches and other security incidents. Furthermore, organizations that do not manage data sprawl may jeopardize the trust of customers and face strict penalties for non-compliance with the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), or other data protection legislation.

Managing Data Sprawl

Getting data sprawl under control requires a structured approach to data management. It is essential to have a solution in place to discover and classify data. Because data is spread across on-premises and cloud environments, it is critical to identify the environments where data is stored to ensure that all data is identified and managed. Tools that can discover and classify data in SaaS, IaaS, and PaaS environments are important, as are those that can find and classify structured and unstructured data. The goal of these tools is to create a unified view across the environment.

Identifying a central place to store data is one way to manage data sprawl. Cloud security standards continue to improve, making a centralized cloud repository an appealing option for many organizations. Cloud storage platforms are an excellent method of storing data in a way that creates a single source of truth that is more accessible to employees in many locations. At the same time, companies must establish data access governance (DAG) policies that outline how data should be collected, processed, and stored. These policies must also govern the data itself, including access controls, retention, risk management, compliance, and data disposition (how data is disposed of at the end of its lifecycle). DAG policies complement data loss prevention (DLP) programs. Data security posture management (DSPM) combines data discovery and classification, data loss prevention, and data access governance to create a next-generation approach to cloud data security.

Data Sprawl Solutions

For organizations that want to manage data sprawl, it is imperative to know what data exists in the environment, where it is located, and who has access to it. Different tools exist to manage all the data that organizations store, but few can prevent data sprawl.

Automated data discovery and data classification solutions must be able to identify and classify sensitive data. Artificial intelligence (AI) and machine learning (ML) can more accurately classify difficult-to-identify data, such as intellectual property and sensitive corporate data.

Data sprawl solutions can also increase overall data security by helping to locate and identify duplicate and redundant data. Once sprawling data has been identified and classified, it becomes easier to dispose of stale data or extraneous data. This can save on storage costs as well as eliminate duplicate and irrelevant data.

Enterprises collect data daily, and it is easy to create multiple copies. The first step for companies that wish to manage access to data and prevent data loss is to fully understand their data: where it resides now, including data stores that IT or security teams may not be aware of, and where new data stores are created in the future. Identifying sensitive data and who has access to it can help prevent data breaches by ensuring that appropriate security controls are enforced.

Data Store

A repository for storing, managing and distributing data sets on an enterprise level.

Defense Industrial Base

Defense Industrial Base (DIB) contractors are companies that conduct business with the US military and are part of the military-industrial complex responsible for research, production, delivery, and service.

DIB contractors are responsible for meeting compliance requirements set by government policies and frameworks, including the Department of Defense Instruction (DoDI) 5200.48 and the Cybersecurity Maturity Model Certification (CMMC).

According to DoDI 5200.48, safeguarding Controlled Unclassified Information is a shared responsibility between DIB contractors and the Department of Defense.


ELN

An Electronic Lab Notebook (also Electronic Laboratory Notebook, or ELN) is the digital form of a paper lab notebook. In the pharmaceutical industry, it is used by researchers, scientists, and technicians to document observations, progress, and results from their experiments performed in a laboratory.

While an ELN enables information to be documented and shared electronically, it can also expose proprietary information to malicious insiders or external hackers. As a result, ELNs should be subject to appropriate security controls to prevent misuse or loss.

Encrypted Data

Encryption is the method of converting plaintext into ciphertext so that only authorized parties can decrypt and read the information. Unencrypted usually refers to data or information that is stored unprotected, without any encryption. Encryption is an important way for individuals and companies to protect sensitive information from hacking. For example, websites that transmit credit card and bank account numbers encrypt this information to prevent identity theft and fraud.

Exact Matching

Where the result of a query, algorithm, or search registers a match only if there is a 100% match.

Learn More
File Clustering

An unsupervised learning method whereby a series of files is divided into multiple groups, so that the grouped files are more similar to the files in their own group and less similar to those in the other groups.
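As a rough illustration of grouping files by similarity, the sketch below uses a simple greedy, threshold-based grouping over word overlap (Jaccard similarity). This is a stand-in chosen for brevity, not the method any particular product uses; the function names, the sample file contents, and the 0.3 threshold are all assumptions.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of distinct words that two files have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_files(files: Dict[str, str], threshold: float = 0.3) -> List[List[str]]:
    """Greedily place each file in the first group whose first member is similar enough."""
    clusters: List[Tuple[Set[str], List[str]]] = []
    for name, text in files.items():
        tokens = set(text.lower().split())
        for representative, members in clusters:
            if jaccard(tokens, representative) >= threshold:
                members.append(name)
                break
        else:
            # No existing group is similar enough, so start a new one.
            clusters.append((tokens, [name]))
    return [members for _, members in clusters]


# Hypothetical file contents, for illustration only.
sample_files = {
    "a.txt": "invoice total amount due",
    "b.txt": "invoice amount due date",
    "c.txt": "patient blood test results",
}
groups = cluster_files(sample_files)
```

No labels are supplied up front, which is what makes the approach unsupervised: the groups emerge only from how similar the files are to one another.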

Learn More
Fuzzy Matching

Where the score of a result can fall anywhere from 0 to 100, based on the degree to which the search data and file data values match.
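The difference between exact and fuzzy matching can be sketched in a few lines. The example below is only an illustration: it assumes Python's standard-library `difflib.SequenceMatcher` as the similarity measure and scales its ratio to 0 to 100; the function names are hypothetical, and real matching engines may score differently.

```python
from difflib import SequenceMatcher


def exact_match(search_value: str, file_value: str) -> bool:
    """Register a match only when the two values are identical."""
    return search_value == file_value


def fuzzy_score(search_value: str, file_value: str) -> float:
    """Return a similarity score from 0 to 100 (case-insensitive)."""
    ratio = SequenceMatcher(None, search_value.lower(), file_value.lower()).ratio()
    return ratio * 100


# "Jon Smith" is not an exact match for "John Smith",
# but it still receives a high fuzzy score.
is_exact = exact_match("Jon Smith", "John Smith")
score = fuzzy_score("Jon Smith", "John Smith")
```

A caller would typically pick a cutoff score above which a fuzzy result counts as a match; the right cutoff depends on the data and is not specified here.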

Learn More
Ghost Data

Ghost data is backups or snapshots of data stores where the original has been deleted. Ghost data is a type of shadow data, which includes unmanaged data store copies and snapshots or log data that are not included in an organization’s backup and recovery plans. Ghost data refers to data that still exists within a database or storage system but is no longer actively used or known to be accessible. For example, if a data store is created for a product and the product has been discontinued, the production data is usually removed because there is no longer a business justification to maintain it. However, if copies of the data remain in staging or development environments, they would be considered ghost data.


Ghost data occurs or is created due to a few reasons, such as when a user or program deletes a file or database entry, but the data is not permanently removed from the system. Ghost data also happens when data is migrated to a new system, but the old data is not completely erased from the original system.
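One way to picture the definition is as a check over an inventory: any snapshot whose source data store no longer exists is a ghost data candidate. The sketch below assumes such an inventory is already available; the `Snapshot` record, its field names, and the sample identifiers are hypothetical and do not correspond to any cloud provider's API.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Snapshot:
    snapshot_id: str
    source_store_id: str  # the data store this snapshot was taken from


def find_ghost_snapshots(
    snapshots: Iterable[Snapshot], live_store_ids: Iterable[str]
) -> List[Snapshot]:
    """Return snapshots whose original data store has been deleted."""
    live = set(live_store_ids)
    return [s for s in snapshots if s.source_store_id not in live]


# Hypothetical inventory: the store "db-orders" was deleted, its snapshot was not.
inventory = [
    Snapshot("snap-001", "db-customers"),
    Snapshot("snap-002", "db-orders"),
]
ghosts = find_ghost_snapshots(inventory, live_store_ids=["db-customers"])
```

In practice the hard part is building a complete inventory across accounts and providers; the comparison itself is simple once that inventory exists.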

Cloud adoption led to a proliferation of data. Much of that data is structured, secured, and monitored, but a considerable proportion of that data is unstructured, unsecured, and unmonitored. This presents real risks to organizations today. And while data collection and analysis can yield important business benefits, it can also increase the risk to the organization if not effectively managed. Ghost data presents significant risks to organizations because it cannot be effectively managed. 

Problems with Ghost Data

Ghost data can cause problems for organizations because it:

  • Takes up valuable storage space in cloud storage environments
  • Increases storage costs for data that is no longer useful or necessary
  • Creates potential security vulnerabilities if sensitive information is left behind in data stores that are not appropriately classified and protected


Ghost data may include sensitive data, including customer and employee personally identifiable information (PII).


“Over 30% of scanned cloud data stores are ghost data, and more than 58% of the ghost data stores contain sensitive or very sensitive data.” – Cyera Research

Where Ghost Data Originates

The problem with ghost data begins with how data is stored today. In the past, organizations had storage capability limited by hardware capacity. If an organization or team needed more storage space, the IT team purchased additional hardware, reviewed the data to determine what could be purged, or both.


The wide adoption of cloud storage and services changed that equation. Because the cloud is extensible, organizations can continually expand storage to accommodate the accumulation of data. Data is also being generated and stored at an unprecedented rate, creating an expansive data landscape. Further increasing the quantity of data, most organizations store multiple copies of data in different formats and different cloud providers. This makes it more difficult to identify duplicate and redundant data and much easier to have data stores that have become lost or are no longer tracked or protected.


Few companies delete older versions of data as it becomes obsolete. It is easy to store more data, but most organizations have no limits in place to trigger a review of the data across all environments, including multiple cloud environments. This results in data sprawl and creates challenges in data classification efforts.


“35% [of respondents] utilize at least two public cloud providers from a list that included Amazon Web Services, Google Cloud, Microsoft Azure, Alibaba, IBM, and Oracle; 17% of respondents rely on three or more.” – Cyera Research


Many organizations choose to keep older copies of data in case it is needed at some point. If there is no review or verification of the data — where it is, how much there is, what sensitive information exists in the data, or whether the data is securely stored — this ghost data both increases storage costs and poses a significant business risk.

Where Ghost Data Exists

Frequently, teams copy data to non-production environments. This not only creates an additional copy of the data but places it in a less secure environment. Non-production environments are not secured with the same rigor as production environments, therefore the sensitive data they contain is more susceptible to inadvertent disclosure or exfiltration. Frequently, ghost data is accessible to users that have no business justification for accessing that data, increasing data security risks.


These data copies also represent a potential EU General Data Protection Regulation (GDPR) violation. GDPR specifies that personal data be kept only as long as it is required to achieve the purpose for which it was collected (except for scientific or historical research). After this period, the data must be disposed of appropriately, but when personal data exists in ghost data, it is likely to remain in the environment, increasing organizational risk. It can be difficult for IT teams to delete ghost data because they are often unaware of it.


“60% of the data security posture issues present in cloud accounts stem from unsecured sensitive data.” – Cyera Research

Security Implications of Ghost Data

Sometimes, the database may be gone but snapshots are still there. Sometimes those snapshots are unencrypted, while other times the data stores exist in the wrong region. That exposes organizations to both increased costs and security risks. The additional data, unencrypted and in unknown locations, increases the attack surface for the organization.


Ghost data can increase the risk of ransomware because attackers do not care whether the data is up-to-date or accurate. They only care about what is easy to access and what is not being monitored. While the organization may not be aware of its ghost data, that lack of visibility does not protect it from attackers.


Stolen ghost data can be exfiltrated and used for malicious purposes. Cyber attackers can prove that they have access to the data and thereby execute a successful ransomware attack. Double extortion attacks are as successful with ghost data as with any other data because attackers have the same increased leverage. Attackers do not rely only on encryption, which is of little concern to an organization when the encrypted data is ghost data; they can also threaten to publicly release the stolen data to encourage payment of the ransom. Robust backups cannot help with the issue of ghost data because the leverage to release data publicly remains the same.

Visibility into Cloud Data 

Unfortunately, cloud providers offer limited visibility into what data customers have. Cloud service providers (CSPs) do not identify how sensitive data is. CSPs also do not provide specific advice on how to improve the security and risk posture of data across their cloud estate. This results in increased risks to cyber resilience and compliance. An organization’s lack of visibility into its data across all cloud providers increases the risk of exposing sensitive data. Similarly, ghost data or any other data store that is not adequately identified and classified is likely to have overly permissive access.


Significant changes in how data is managed and stored in cloud and hybrid environments have also led to new questions, including:

  • Who oversees securing data?
  • What is the role of the security team in identifying and securing data?
  • What responsibilities do data creators have for managing data?

In modern corporate environments, it is important for all teams involved to understand their responsibilities when it comes to managing, securing, and protecting data. It is a joint effort between builders and the security team. However, managing data on an ongoing basis remains a challenge without the technology to discover and classify sensitive data automatically.

Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning

Modern software solutions and products have had a significant impact in terms of creating data, increasing the data available for analytics, and growing complexity in corporate environments. AI/ML can help address the challenges created by these technological advances. In particular, AI/ML can help identify ghost data and increase data security by using continuous learning and automation to:

  • Identify data
  • Analyze anomalous data being stored
  • Identify anomalous access
  • Prioritize intelligently
  • Correlate different data in various places in the environment

Robust AI/ML data classification solutions can accurately classify data that previously was challenging to identify, including intellectual property and other sensitive corporate data. AI/ML can also help enable organizations to make smarter decisions and create better policies about what to do with data and how to protect sensitive data.

Data and Security

To begin with, it is important to think of data as a layer of security in its own right. In the past, data was not considered a layer of security because there was no straightforward way to deal with it. Today, with AI/ML, it is far easier to access, understand, and know the data within an organization and across all its disparate environments.


As technology has changed, the focus of security has moved from infrastructure to data-related security. While CISOs remain in charge of the technical aspects of security, new challenges in business and cybersecurity require more collaboration across the business team, IT, security, and privacy office to move forward and meet data security and data privacy requirements.


Regulations and requirements are becoming more stringent globally, requiring organizations to take more responsibility for the data they are collecting. This includes all the data spread across environments, including ghost data. Managing that data requires robust data discovery and data classification.

Learn More
Managed Database

A database with storage, data, and compute services that is managed and maintained by a third-party provider instead of by an organization's IT staff.

Learn More

Data that describes other data. For databases, metadata describes properties of the data store itself, as well as the definition of the schema.

Learn More

The idea that organizations should only retain information as long as it is pertinent.

Learn More
Risk Assessment

In cybersecurity, a risk assessment is a comprehensive analysis of an organization to identify vulnerabilities and threats. The goal of a risk assessment is to identify an organization’s risks and make recommendations for mitigating those risks. Risk assessments may be requested after a specific trigger, to complete an assessment before moving forward as part of larger governance and risk processes, or to assess a portfolio periodically as part of meeting an enterprise risk management or compliance objective. 

Two popular risk assessment frameworks are the National Institute of Standards and Technology (NIST) Cybersecurity Framework and the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) 27001:2022 standard.

Risk assessments may be based on different methodologies: qualitative, quantitative, or a hybrid of the two. A quantitative assessment provides concrete data that includes the probability and potential impact of a threat based on data collection and statistical analysis. A qualitative assessment provides a more subjective, generalized view of what would happen to operations and productivity for different internal teams if one of the risks occurred.
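As a small worked example of the quantitative approach, a common convention is to multiply the yearly probability of a threat by its estimated impact to get an expected annual loss, then rank risks by that figure. The sketch below follows that convention; the function names, risk names, and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Dict, List, Tuple


def expected_annual_loss(probability_per_year: float, impact: float) -> float:
    """Expected yearly loss = probability of occurrence x cost if it occurs."""
    if not 0.0 <= probability_per_year <= 1.0:
        raise ValueError("probability_per_year must be between 0 and 1")
    return probability_per_year * impact


def rank_risks(risks: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float]]:
    """Sort risks from highest to lowest expected annual loss."""
    scored = [(name, expected_annual_loss(p, i)) for name, (p, i) in risks.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical inputs: (probability per year, impact in dollars).
risks = {
    "stolen laptop": (0.25, 20_000.0),
    "exposed snapshot": (0.5, 100_000.0),
}
ranking = rank_risks(risks)
```

Here the exposed snapshot ranks first at 50,000 versus 5,000 for the stolen laptop. A qualitative assessment would replace these numbers with ratings such as high, medium, and low.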

A risk assessment should include an up-to-date inventory of the systems, vendors, and applications in scope for the assessment. This information helps security risk management leaders understand the risk associated with:

  • The technology in use
  • How business processes are dependent on those assets
  • What business value they provide

A single risk assessment provides a point-in-time snapshot of the risks currently present and how to mitigate them. Ongoing or continuous risk assessments provide a more holistic view into the shifting risk landscape that exists in most organizations.

Risk assessments also help organizations assess and prioritize the risks to their information, including their data and their information systems. An assessment also helps security and technology leaders communicate risks in business terms to internal stakeholders, specifically the executive team and the board of directors. This information helps them make educated decisions about which areas of the cybersecurity program need to be prioritized and how to allocate resources in alignment with business goals. 

The growth of digital business and related assets requires better management of complex technology environments, which today include:

  • The Internet of Things (IoT)
  • Artificial intelligence (AI)
  • Machine learning (ML)
  • Cloud delivery models
  • As-a-service offerings

It also creates a growing volume of data, types of data, and technology assets. A comprehensive risk assessment should include these assets, allowing an organization to gain visibility into all its data, provide insight into whether any of that data is exposed, and identify any serious security issues. A data risk assessment can help an organization secure and minimize exposure of sensitive data by providing a holistic view of sensitive data, identifying overly permissive access, and discovering stale and ghost data.

Learn More
Stale Data

Stale data is data collected that is no longer needed by an organization for daily operations. Sometimes the data collected was never needed at all. Most organizations store a significant amount of stale data, which may include:

  • Old employee lists
  • Multiple versions of presentation decks
  • Outdated personal data
  • Old usage data
  • Historical behavioral data
  • Outdated research data 

Simply creating an updated version of a file and sharing it but not deleting the obsolete versions increases the quantity of stale or inactive data. This type of activity happens many times a day in the typical organization. 

Increasingly, petabytes of data are stored in different public and private cloud platforms and are dispersed around the world. These file shares and document management systems, often poorly secured, present an appealing target for cyber attackers. If organizations store a significant amount of unstructured data, they are unlikely to have visibility into their data surface footprint, and even less likely to be protecting it adequately. Stale and unstructured data may be:

  • Easily accessible
  • Poorly secured
  • Unmonitored for data access 

Stale data is also not relevant to daily operations and therefore can impede a business’s ability to make good business decisions based on current market conditions. A study by Dimensional Research showed that “82 percent of companies are making decisions based on stale information” and “85 percent state this stale data is leading to incorrect decisions and lost revenue.” 

The shift to the cloud creates several challenges. Many organizations do not know what data they have, where it is located (on premises, in public or private cloud environments, or a mix of these), why it is being stored, and how the data is protected.

Although big data and data analysis can provide actionable insights and improve automation capabilities, much of the data organizations collect, process, and store is unorganized and unstructured. Unfortunately, stale or inactive data can increase storage costs and security risks alike, without providing any business benefit at all. To reduce risks, organizations must identify stale data and then decide whether to move the data (storing it more securely), archive the data, or delete it. Organizations must also establish a consistent policy to identify and manage stale data on an ongoing basis. 
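The identification step can be sketched as a simple age check over an inventory of last-access times. This is only an illustration: the 365-day cutoff, the function name, and the sample file names are assumptions, and a real policy would set its own thresholds and then decide whether to move, archive, or delete each item.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_stale(
    last_accessed: Dict[str, datetime], now: datetime, max_age_days: int = 365
) -> List[str]:
    """Return names of items not accessed within the last max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, ts in last_accessed.items() if ts < cutoff)


# Hypothetical inventory of file names and last-access dates.
now = datetime(2024, 1, 1)
inventory = {
    "employees_2015.csv": datetime(2016, 3, 1),
    "q4_deck_v7.pptx": datetime(2023, 11, 20),
    "usage_2019.parquet": datetime(2020, 1, 15),
}
stale = find_stale(inventory, now)
```

Running a check like this on a schedule, rather than once, matches the need for a consistent, ongoing policy.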

Learn More
Structured Data

Data in a standardized format, with a well-defined structure that is easily readable by humans and programs. Structured data is typically stored in a database. Though structured data only comprises 20 percent of data stored worldwide, its ease of accessibility and accuracy of outcomes make it the foundation of current big data research and applications.

Learn More
Tokenized Data

Tokenization entails the substitution of sensitive data with a non-sensitive equivalent, known as a token. This token then maps back to the original sensitive data through a tokenization system, and tokens are practically impossible to reverse without access to that system. Many such systems leverage random numbers to produce secure tokens. Tokenization is often used to secure financial records, bank accounts, medical records, and many other forms of personally identifiable information (PII).
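A minimal sketch of the idea is an in-memory lookup table that hands out random tokens and keeps the mapping private. The `TokenVault` class and the `tok_` prefix below are hypothetical; a production tokenization system would also need durable, access-controlled storage, which this example leaves out.

```python
import secrets
from typing import Dict


class TokenVault:
    """Toy tokenization system: random tokens mapped to the original values."""

    def __init__(self) -> None:
        self._token_to_value: Dict[str, str] = {}
        self._value_to_token: Dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        """Return a random token for value, reusing the token if one exists."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it reveals nothing about the original value.
        token = "tok_" + secrets.token_hex(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only possible with access to the vault."""
        return self._token_to_value[token]


vault = TokenVault()
card_number = "4111 1111 1111 1111"  # a well-known test card number
token = vault.tokenize(card_number)
original = vault.detokenize(token)
```

Because the token is generated randomly rather than derived from the value, someone who obtains only the token has nothing to reverse; the mapping lives solely in the vault.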

Learn More
Unmanaged Data Stores

Unmanaged data stores are deployments that must be completely supported by development or infrastructure teams, without the assistance of the cloud service provider. This additional logistical burden may be undertaken by teams aiming to comply with data sovereignty requirements, abide by private network or firewall requirements for security purposes, or meet resource requirements beyond the size or IOPS limits of the provider's database as a service (DBaaS).

Learn More
Unstructured Data

Data lacking a pre-defined model of organization or that does not follow one. Such data is often text-heavy, but can also include facts, figures and time and date information. The resulting irregularities and ambiguities make unstructured data much harder for programs to understand than data stored in databases with fields or documents with annotations. Many estimates claim unstructured data comprises the vast majority of global data, and that this category of data is growing rapidly.

Learn More