The New Frontier: Data Science in the Decentralized Era

Tags: cloud, edge, decentralization

Published: February 7, 2024

Data science has transformed industries over recent decades, revolutionizing fields like healthcare, telecommunications, energy, and more. By analyzing massive amounts of data with machine learning and AI, data scientists have unlocked new insights that impact our daily lives. But as powerful as it is, the traditional, centralized approach to data science—processing everything in large, remote cloud data centers—brings substantial challenges that go beyond the technical. These challenges affect data privacy, sovereignty, and our environment, sparking a growing interest in decentralized alternatives that tackle these issues head-on.

The Evolution of Data Science: From Centralized to Decentralized

The initial promise of cloud computing was undeniable: massive storage and computational resources accessible from anywhere, centralized control over data, and simpler operations. This approach worked well for early applications of data science, where data from various sources (IoT sensors, devices, and business applications) could be aggregated in the cloud for extensive processing.

However, centralization also introduced new challenges, particularly in handling the growing volume of data from connected devices. According to Cisco, 500 billion devices are projected to be connected to the internet by 2030, and half of the world’s data will come from sensors alone. As data generation shifts heavily toward these edge devices, the conventional cloud model struggles to keep up.

Autonomous vehicles are one sector grappling with these issues. They rely on a wealth of sensitive information, from location to driver behavior, that must remain secure. Autonomous navigation also requires real-time processing, so the latency of a round trip to a remote data center can directly impact safety. These complexities make it clear that a centralized cloud model may not suit every data need, particularly in industries that prioritize privacy, security, and rapid processing.

Beyond autonomous vehicles, applications such as network monitoring and threat detection, autonomous cameras, and industrial robots face similar challenges. They generate and process large amounts of sensitive data in real time, further emphasizing the need for decentralized models that prioritize local data processing.

The Rising Challenge of Data Privacy, Security, and Sovereignty

Data Privacy Risks in a Centralized Cloud Model

Transmitting vast amounts of data to centralized servers for processing carries inherent risks. Cloud breaches are becoming increasingly common: in a 2023 survey, 39% of companies reported having experienced a data breach in their cloud environment, up from 35% the year before [1]. The consequences of these breaches go beyond unauthorized access. They disrupt services and erode trust, with recovery often taking days or longer.

The handling of personal data, often with little oversight of who can access it, has also raised alarms. Scandals over the misuse of personal data have pushed governments to enforce stricter data protection regulations, such as the European Union’s General Data Protection Regulation (GDPR) [2], which mandates rigorous standards for handling user data. With privacy concerns escalating, a growing number of organizations are questioning whether cloud centralization is truly in their best interests.

The Environmental Impact of Cloud Computing

The Alarming Carbon Footprint

The cost of cloud computing is more than just financial. The digital sector accounts for approximately 4% of global greenhouse gas emissions, and this share is set to rise. Data centers, the heart of cloud computing, are energy-intensive. In 2016, data centers consumed about 200 terawatt-hours (TWh) of electricity; this figure is projected to rise to 2,967 TWh by 2030, nearly a fifteenfold increase [3].
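As a back-of-the-envelope check on the two consumption figures above, the sketch below computes the compound annual growth rate they imply. It assumes smooth year-over-year compounding between 2016 and 2030, which is a simplification made here for illustration and not a claim from the cited source. The function name `implied_annual_growth` is hypothetical.

```python
def implied_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Return the constant yearly growth rate that turns start_value into end_value.

    Assumes smooth compounding, which is an illustrative simplification.
    """
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the article: 200 TWh in 2016, 2967 TWh projected for 2030.
rate = implied_annual_growth(200.0, 2967.0, 2030 - 2016)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption, the projection corresponds to energy use growing by roughly 21% per year.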

Moreover, data centers require significant amounts of water for cooling, which adds to their environmental impact. This demand for water resources has serious implications, particularly in regions where water scarcity is already a pressing issue.

Cloud computing’s model of horizontal scaling, which requires adding more servers to manage increasing data loads, leads to even higher energy and resource demands. This trend is especially concerning for machine learning and deep learning models, which require vast computational resources. As a result, cloud computing’s environmental footprint has grown to unsustainable levels, with data centers becoming one of the most significant energy and resource consumers globally.

Financial Costs of Cloud Dependency

Cloud computing is not only environmentally taxing; it also comes with steep financial costs. Cloud service providers often charge for ingress (data entering the cloud), egress (data exiting the cloud), storage, and processing. These charges can make a company’s cloud bill unexpectedly high and opaque, because actual expenses depend on usage patterns that are not always easy to predict or optimize. Hyperscale data centers, operated by large cloud providers, have driven a sharp rise in operational costs for companies relying on these services.
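To make the four charge types concrete, here is a minimal sketch of a usage-based monthly bill. Every rate and usage figure in it is a hypothetical placeholder chosen for illustration, not any provider's actual pricing, and the names `CloudRates`, `MonthlyUsage`, and `monthly_bill` are assumptions of this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CloudRates:
    """Hypothetical unit prices in dollars; not real provider pricing."""
    ingress_per_gb: float
    egress_per_gb: float
    storage_per_gb_month: float
    compute_per_hour: float


@dataclass(frozen=True)
class MonthlyUsage:
    ingress_gb: float
    egress_gb: float
    stored_gb: float
    compute_hours: float


def monthly_bill(rates: CloudRates, usage: MonthlyUsage) -> dict[str, float]:
    """Break a month's bill into the four charge types plus a total."""
    bill = {
        "ingress": rates.ingress_per_gb * usage.ingress_gb,
        "egress": rates.egress_per_gb * usage.egress_gb,
        "storage": rates.storage_per_gb_month * usage.stored_gb,
        "processing": rates.compute_per_hour * usage.compute_hours,
    }
    bill["total"] = sum(bill.values())
    return bill


# Illustrative numbers only.
rates = CloudRates(ingress_per_gb=0.0, egress_per_gb=0.09,
                   storage_per_gb_month=0.02, compute_per_hour=0.50)
baseline = MonthlyUsage(ingress_gb=5000, egress_gb=2000,
                        stored_gb=10000, compute_hours=720)

baseline_bill = monthly_bill(rates, baseline)
# Same workload, but twice as much data leaves the cloud this month.
busy_bill = monthly_bill(rates, replace(baseline, egress_gb=4000))

for name, bill in (("baseline", baseline_bill), ("double egress", busy_bill)):
    print(name, {item: round(cost, 2) for item, cost in bill.items()})
```

Because each line item scales with a different usage dimension, a change in a single pattern, such as how much data leaves the cloud, moves the total independently of the others. That is one reason such bills are hard to forecast.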

In some cases, servers remain underutilized, operating at only 10–15% of their capacity while still consuming considerable energy. Additionally, zombie servers, which are inactive yet still drawing power, make up about 30% of all data center servers. Combined, these inefficiencies highlight an unsustainable cycle. As companies continue to expand their cloud usage, energy demand grows, along with the environmental and economic impacts.
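A rough calculation shows how these two inefficiencies compound. The sketch below assumes, purely for illustration, that both figures apply to the same fleet: 30% of servers are zombies doing no useful work, and the remaining servers run at 10–15% of capacity. The article reports the two figures separately, so combining them is an assumption of this example, and the function name `useful_capacity_fraction` is hypothetical.

```python
def useful_capacity_fraction(zombie_share: float, active_utilization: float) -> float:
    """Fraction of installed capacity doing useful work.

    zombie_share: share of servers that are powered on but idle (0..1).
    active_utilization: average utilization of the remaining servers (0..1).
    """
    if not (0.0 <= zombie_share <= 1.0 and 0.0 <= active_utilization <= 1.0):
        raise ValueError("both inputs must be between 0 and 1")
    return (1.0 - zombie_share) * active_utilization


ZOMBIE_SHARE = 0.30  # figure quoted in the article

low = useful_capacity_fraction(ZOMBIE_SHARE, 0.10)
high = useful_capacity_fraction(ZOMBIE_SHARE, 0.15)
print(f"Useful share of installed capacity: {low:.1%} to {high:.1%}")
```

Under these assumptions, only about 7% to 10.5% of the installed capacity performs useful work, while the whole fleet continues to draw power.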

Conclusion: A Vision for the Future

The current cloud-centered approach to data science has delivered significant advancements, but it has also introduced serious challenges regarding privacy, sovereignty, and environmental impact. With 4% of global greenhouse gas emissions attributed to the digital sector and cloud data centers consuming massive amounts of energy and water resources, it’s clear that we need to rethink our approach.

At the heart of Manta’s vision is a shift from cloud dependency to a compute-to-data model, where data processing happens close to where data is generated. By developing middleware that supports edge computing and collaborative processing, Manta aims to help companies overcome the limitations of the cloud, empowering them to innovate with secure, efficient, and sustainable data solutions. In our next article, we’ll explore how Manta’s middleware brings computation to the data, reducing the environmental impact and enhancing security—paving the way for a smarter, decentralized future in data science.
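The compute-to-data idea can be illustrated with a minimal sketch. This is not Manta's middleware or its API; all names here (`LocalSummary`, `summarize_on_node`, `combine_summaries`) are hypothetical. Each node reduces its own readings to a small summary, and only those summaries travel to a coordinator, which combines them into a global average without ever seeing the raw data.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class LocalSummary:
    """The only data that leaves a node: a count and a sum."""
    count: int
    total: float


def summarize_on_node(readings: Sequence[float]) -> LocalSummary:
    """Runs where the data lives; raw readings never leave the node."""
    return LocalSummary(count=len(readings), total=sum(readings))


def combine_summaries(summaries: Iterable[LocalSummary]) -> float:
    """Runs on the coordinator; computes the global mean from summaries alone."""
    count = 0
    total = 0.0
    for summary in summaries:
        count += summary.count
        total += summary.total
    if count == 0:
        raise ValueError("no readings reported by any node")
    return total / count


# Illustrative sensor readings held on two separate edge nodes.
node_a_readings = [20.0, 22.0, 24.0]
node_b_readings = [30.0]

summaries = [summarize_on_node(node_a_readings), summarize_on_node(node_b_readings)]
global_mean = combine_summaries(summaries)
print(f"Global mean from {len(summaries)} summaries: {global_mean}")
```

The result matches what a centralized computation over all four readings would give, but each node transmits two numbers regardless of how many readings it holds. Real deployments would also need to handle authentication, node failures, and statistics that do not decompose this simply.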

References

[1] Thales Group, “2023 cloud security: Cyberattacks and data breaches,” Thales Group Newsroom, 2023. Available: https://cpl.thalesgroup.com/about-us/newsroom/2023-cloud-security-cyberattacks-data-breaches-press-release
[2] “GDPR - general data protection regulation,” 2024. Available: https://gdpr-info.eu/
[3] J. Doe and J. Smith, “The environmental impact of cloud computing,” Journal of Green Technology, vol. 10, no. 2, pp. 123–145, 2022, doi: 10.1007/s10586-022-03713-0.