Data Sprawl

Data sprawl refers to the ever-growing quantity of data an organization creates daily and the proliferation of locations where that data ends up stored. Data is a valuable resource because it enables business leaders to make data-driven decisions about how best to serve their client base, grow their business, and improve their processes. However, managing vast amounts of data spread across so many data sources can be a serious challenge.

Large businesses, particularly enterprises, generate a staggering amount of data due to the wide variety of software products in use, newly introduced data formats, multiple storage systems in the cloud and in on-premises environments, and huge quantities of log data produced by applications.

Where Does Data Come From?

As organizations scale and increasingly use data for analysis and investigation, that data is being stored in operating systems, servers, applications, networks, and other technologies. Many organizations generate massive quantities of new data all day, every day, including:

  • Financial data, such as bank transactions, web data, geolocation data, and credit card and point-of-sale transaction data from vendors.
  • Sales data, which may include revenue by salesperson, conversion rate, the average length of a sales cycle, average deal size, number of calls made, age and status of sales leads, loss rates, and number of emails sent.
  • Transactional data, which may include customer data, purchase order information, employee hours worked, insurance costs, insurance claims, shipping status, and bank deposits and withdrawals.
  • Social media, email, and SMS communications, which may include social media metrics, demographics, times of day, hashtags, subject matter, and content types.
  • Event data, which describes actions performed by entities (essentially, behavior data) and includes the action, a timestamp, and state (information about the entities related to the event). Event data is critical for performing analytics.
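As a simple illustration of the event data described above, a single record might carry the action, a timestamp, and the related entity state. This is a minimal sketch; the field and event names are invented for illustration, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Event:
    """One behavioral event: what happened, when, and the related entity state."""
    action: str                                 # e.g., "checkout_completed"
    timestamp: datetime                         # when the action occurred
    state: dict = field(default_factory=dict)   # entity attributes at event time

# Example: a purchase event captured for later analytics (values are illustrative)
event = Event(
    action="checkout_completed",
    timestamp=datetime.now(timezone.utc),
    state={"customer_id": "C-1042", "cart_total": 59.90, "channel": "web"},
)
print(event.action, event.state["cart_total"])
```

Keeping the state snapshot on the event itself is what makes event data so useful for analytics: each record is self-describing, even when the source systems it came from have since changed.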

 These files and records are dispersed across multiple locations, which makes inventorying, securing, and analyzing all that data extremely difficult.  

How Does Data Sprawl Happen?

Data sprawl grows as organizations produce more data every day. The shift to the cloud amplifies it: organizations can scale more rapidly, producing more and more data, while new uses for big data continue to develop, increasing how much data is stored in operating systems, servers, networks, applications, and other technologies.

Further complicating matters, databases, analytics pipelines, and business workflows have been migrating rapidly to the cloud, moving across multiple cloud service providers (CSPs) and across structured and unstructured formats. This shift to the cloud is ongoing, and new data stores are created all the time. Security and risk management (SRM) leaders are struggling to identify and deploy data security controls consistently in this environment.

"...unstructured data sprawl (both on-premises and hybrid/multi-cloud) is difficult to detect and control when compared to structured data."

Gartner, Hype Cycle for Data Security, 2022

Organizations generate new data every hour of every day. The customer data in customer relationship management (CRM) systems may also include financial data, which also lives in an accounting database or enterprise resource planning (ERP) system. Sales data and transactional data may be in those systems as well, siloed by different departments, branches, and devices. To realize the benefits promised by data analytics, analysts must cross-reference these multiple sources, and inconsistencies between them make it difficult to reach accurate, informed decisions.

Ultimately, organizations need data to facilitate day-to-day workflows and generate analytical insights for smarter decision-making. The problem is that the amount of data organizations generate is spiraling out of control. According to a recent IDC study, the Global DataSphere is expected to more than double from 2022 to 2026. The Global DataSphere measures how much new data is created, captured, replicated, and consumed each year, and the Enterprise DataSphere is growing twice as fast as the Consumer DataSphere.

Challenges of Data Sprawl

As organizations generate data at a faster pace, it is becoming harder to manage this information. Organizations might have data stored in various locations, making it hard to access business-critical information and generate accurate insights. Team members must cross-reference data in multiple formats from multiple sources, making analytics difficult. Managing dispersed information across different silos wastes time and money. Data may become corrupted during transmission, storage, and processing. Data corruption compromises the value of data, and the likelihood of corruption may increase alongside increasing data sprawl.

In addition, effort is wasted when employees who cannot find the data they need where they expect it create duplicate copies, which can also result in ghost data. This duplicate data is redundant; other data may be obsolete (out of date) or trivial (not valuable for business insights). All of this excess data drives up resource utilization and cloud storage costs.

Employees may also handle data carelessly, not understanding how the way they share and store it can introduce risk. Unauthorized users may gain access to sensitive information, particularly when the data produced and stored is not appropriately managed. Manually classifying data is time-consuming and error-prone and may increase the risk of sensitive data exposure, so finding automated solutions is essential for managing large stores of data.
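As a minimal sketch of the automated classification suggested above, a rule-based classifier might tag text by matching patterns for common sensitive-data types. The patterns and labels below are illustrative only; real tools combine far more robust detection, such as checksums, contextual signals, and ML models:

```python
import re

# Illustrative patterns for common sensitive-data types (not a production ruleset)
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify(text: str) -> set[str]:
    """Return the set of sensitive-data labels whose pattern matches the text."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}

print(classify("Contact jane@example.com, SSN 123-45-6789"))
```

Even a crude pass like this shows why automation scales where manual review does not: every new file can be scanned and labeled at ingest, instead of relying on employees to tag data correctly.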

Data sprawl compromises data value and presents significant security risks: the more data there is, the harder it is to control, which increases the chances of data breaches. Furthermore, organizations that do not manage data sprawl may jeopardize the trust of customers and face strict penalties for non-compliance with the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), or other data protection legislation.

Managing Data Sprawl

Getting data sprawl under control requires a structured approach to data management. It is essential to have a solution in place to discover and classify data. Because data is spread across on-premises and cloud environments, it is critical to identify the environments where data is stored to ensure that all data is identified and managed. Tools that can discover and classify data in SaaS, IaaS, and PaaS environments are important, as are those that can find and classify structured and unstructured data. The goal of these tools is to create a unified view across the environment.

Identifying a central place to store data is one way to manage data sprawl. Cloud security standards continue to improve, making a centralized cloud repository an appealing option for many organizations. Cloud storage platforms are an excellent way to store data so that it forms a single source of truth that is more accessible to employees in many locations. At the same time, companies must establish data access governance (DAG) policies that outline how data should be collected, processed, and stored. These policies must also govern the data itself, including access controls, retention, risk management, compliance, and data disposition (how data is disposed of at the end of its lifecycle). DAG policies complement data loss prevention (DLP) programs. Data security posture management (DSPM) combines data discovery and classification, data loss prevention, and data access governance to create a next-generation approach to cloud data security.

Data Sprawl Solutions

For organizations that want to manage data sprawl, it is imperative to know what data exists in the environment, where it is located, and who has access to it. Different tools exist to manage all the data that organizations store, but few can prevent data sprawl.

Automated data discovery and data classification solutions must be able to identify and classify sensitive data. Artificial intelligence (AI) and machine learning (ML) can more accurately classify difficult-to-identify data, such as intellectual property and sensitive corporate data.

Data sprawl solutions can also increase overall data security by helping to locate and identify duplicate and redundant data. Once sprawling data has been identified and classified, it becomes easier to dispose of stale data or extraneous data. This can save on storage costs as well as eliminate duplicate and irrelevant data.
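One common way to locate the duplicate copies mentioned above is content hashing: files whose contents produce the same digest are byte-identical. This sketch assumes the files are small enough to hash in full; the paths and contents are illustrative:

```python
import hashlib
from collections import defaultdict

def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file paths whose contents are byte-identical, keyed by SHA-256 digest."""
    by_digest = defaultdict(list)
    for path, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]

# Example: the same report saved in three places, plus one unique file
files = {
    "/crm/report.csv": b"q1,q2\n10,20\n",
    "/share/report_copy.csv": b"q1,q2\n10,20\n",
    "/backup/report.csv": b"q1,q2\n10,20\n",
    "/crm/notes.txt": b"meeting notes\n",
}
print(find_duplicates(files))
```

After grouping, all but one copy in each group is a candidate for disposal, which is how identifying duplicates translates directly into storage savings.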

Enterprises collect data daily, and it is easy to create multiple copies. The first step for companies that wish to manage access to data and prevent data loss is to fully understand their data: where it is now (whether or not IT or security teams are aware of those data stores) and any data stores created in the future. Identifying sensitive data and who has access to it can help prevent data breaches by ensuring that appropriate security controls are enforced.