Ghost Data

Ghost Data

Ghost data is backups or snapshots of data stores where the original has been deleted. Ghost data is a type of shadow data, which includes unmanaged data store copies and snapshots or log data that are not included in an organization’s backup and recovery plans. Ghost data refers to data that still exists within a database or storage system but is no longer actively used or known to be accessible. For example, if a data store is created for a product and the product has been discontinued, the production data is usually removed as there is no longer a business justification to maintain it. However if copies of the data remain in staging or development environments they would be considered ghost data.


Ghost data occurs or is created due to a few reasons, such as when a user or program deletes a file or database entry, but the data is not permanently removed from the system. Ghost data also happens when data is migrated to a new system, but the old data is not completely erased from the original system.

Cloud adoption led to a proliferation of data. Much of that data is structured, secured, and monitored, but a considerable proportion of that data is unstructured, unsecured, and unmonitored. This presents real risks to organizations today. And while data collection and analysis can yield important business benefits, it can also increase the risk to the organization if not effectively managed. Ghost data presents significant risks to organizations because it cannot be effectively managed. 

Problems with Ghost Data

Ghost data can cause problems for organizations because it:

  • Takes up valuable storage space in cloud storage environments
  • Increases storage costs for data that is no longer useful or necessary
  • Creates potential security vulnerabilities if sensitive information is left behind in data stores that are not appropriately classified and protected


Ghost data may include sensitive data, including customer and employee personally identifiable information (PII).


“Over 30% of scanned cloud data stores are ghost data, and more than 58% of the ghost data stores contain sensitive or very sensitive data.” – Cyera Research

Where Ghost Data Originates

The problem with ghost data begins with how data is stored today. In the past, organizations had storage capability limited by hardware capacity. If an organization or team needed more storage space, the IT team purchased additional hardware, reviewed the data to determine what could be purged, or both.


The wide adoption of cloud storage and services changed that equation. Because the cloud is extensible, organizations can continually expand storage to accommodate the accumulation of data. Data is also being generated and stored at an unprecedented rate, creating an expansive data landscape. Further increasing the quantity of data, most organizations store multiple copies of data in different formats and different cloud providers. This makes it more difficult to identify duplicate and redundant data and much easier to have data stores that have become lost or are no longer tracked or protected.


Few companies delete older versions of data as it becomes obsolete. It is easy to store more data, but most organizations have no limits in place to trigger a review of the data across all environments, including multiple cloud environments. This results in data sprawl and creates challenges in data classification efforts.


“35% [of respondents] utilize at least two public cloud providers from a list that included Amazon Web Services, Google Cloud, Microsoft Azure, Alibaba, IBM, and Oracle; 17% of respondents rely on three or more.” – Cyera Research


Many organizations choose to keep older copies of data in case it is needed at some point. If there is no review or verification of the data — where it is, how much there is, what sensitive information exists in the data, or whether the data is securely stored — this ghost data both increases storage costs and poses a significant business risk.

Where Ghost Data Exists

Frequently, teams copy data to non-production environments. This not only creates an additional copy of the data but places it in a less secure environment. Non-production environments are not secured with the same rigor as production environments, therefore the sensitive data they contain is more susceptible to inadvertent disclosure or exfiltration. Frequently, ghost data is accessible to users that have no business justification for accessing that data, increasing data security risks.


These data copies also represent a potential EU General Data Protection Regulation (GDPR) violation. GDPR specifies that personal data be kept only as long as the data are required to achieve the business purpose it was collected for (except for scientific or historical research). After this period, the data must be disposed of appropriately, but when personal data exists in ghost data, it is likely to remain in the environment, increasing organizational risk. It can sometimes be difficult for IT teams to delete ghost data because they are unaware of it.


“60% of the data security posture issues present in cloud accounts stem from unsecured

sensitive data.” – Cyera Research 

Security Implications of Ghost Data

Sometimes, the database may be gone but snapshots are still there. Sometimes those snapshots are unencrypted, while other times the data stores exist in the wrong region. That exposes organizations to both increased costs and security risks. The additional data, unencrypted and in unknown locations, increases the attack surface for the organization.


Ghost data can increase the risk of ransomware because attackers do not care whether the data is up-to-date or accurate. They only care about what is easy to access and what is not being monitored. While the organization may not be aware of its ghost data, that lack of visibility does not protect it from attackers.


Stolen ghost data can be exfiltrated and used for malicious purposes. Cyber attackers can prove that they have access to the data and thereby execute a successful ransomware attack. Double extortion attacks are as successful with ghost data as with any other data because attackers have the same increased leverage. The attackers rely not only on encryption (which would not be of concern to an organization as it relates to ghost data). They can also publicly release the stolen data to encourage payment of the ransom. Robust backups cannot help with the issue of ghost data because the leverage to release data publicly remains the same.

Visibility into Cloud Data 

Unfortunately, cloud providers offer limited visibility into what data customers have. Cloud service providers (CSPs) do not identify how sensitive data is. CSPs also do not provide specific advice on how to improve the security and risk posture of data across their cloud estate. This results in increased risks to cyber resilience and compliance. An organization’s lack of visibility into its data across all cloud providers increases the risk of exposing sensitive data. Similarly, ghost data or any other data store that is not adequately identified and classified is likely to have overly permissive access.


Significant changes in how data is managed and stored in cloud and hybrid environments have also led to new questions, including:

  • Who oversees securing data?
  • What is the role of the security team in identifying and securing data?
  • What responsibilities do data creators have for managing data?

In modern corporate environments, it is important for all teams involved to understand their responsibilities when it comes to managing, security, and protecting data. It is a joint effort between builders and the security team. However, managing data on an ongoing basis remains a challenge without the technology to discover and classify sensitive data automatically.

Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning

Modern software solutions and products have had a significant impact in terms of creating data, increasing the data available for analytics, and growing complexity in corporate environments. AI/ML can help address the challenges created by these technological advances. In particular, AI/ML can help identify ghost data and increase data security by using continuous learning and automation to:

  • Identify data
  • Analyze anomalous data being stored
  • Identify anomalous access
  • Prioritize intelligently
  • Correlate different data in various places in the environment

Robust AI/ML data classification solutions can accurately classify data that previously was challenging to identify, including intellectual property and other sensitive corporate data. AI/ML can also help enable organizations to make smarter decisions and create better policies about what to do with data and how to protect sensitive data.

Data and Security

To begin with, it is important to think of data as being at an advanced layer of security. In the past, data was not considered a layer of security. This is because there was no straightforward way to deal with data in the past. Today, with AI/ML, it is far easier to access, understand, and know the data within an organization and across all its disparate environments.


As technology has changed, the focus of security has moved from infrastructure to data-related security. While CISOs remain in charge of the technical aspects of security, new challenges in business and cybersecurity require more collaboration across the business team, IT, security, and privacy office to move forward and meet data security and data privacy requirements.


Regulations and requirements are becoming more stringent globally, requiring organizations to take more responsibility for the data they are collecting. This includes all the data spread across environments, including ghost data. Managing that data requires robust data discovery and data classification.