The elephant in the data center


What is dark data and why does it matter? A perspective from Mark Kidd, EVP & GM, Iron Mountain Data Centers & Asset Lifecycle Management.

October 18, 2023 | 7 mins

By Mark Kidd, EVP & GM, Iron Mountain Data Centers & Asset Lifecycle Management

‘Dark data’, a term coined by Gartner, is defined as the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes. Like dark matter, dark data takes up huge amounts of space in data centers and is virtually invisible. This doesn’t mean we can ignore it. I think it’s worth taking a moment to think about the nature of dark data, its impact, and what we might be able to do to improve things.

Personal footprint

Dark data is easiest to grasp and deal with at a personal level. For most of us it consists of unused photos and videos. In the old days, film was precious and development expensive, but now we can take 20 shots to get the one we want, and we can edit easily, creating more backup files in the process. In 2020, Google said it stored 4 trillion photos, with 28 billion new photos and videos uploaded each week. Google Photos is just one photo service, and those upload rates have no doubt grown in the last few years.

This personal dark data also creates a privacy issue. However secure our cloud service is, there is always the possibility that ID photos, personal chat screengrabs, and private files can be used by cybercriminals. The answer? Think before you shoot, tidy up caches and archives regularly, and be particularly careful not to leave sensitive files lying around.

Hidden losses

For companies, the challenge is on a larger scale and affects the bottom line. Dark data consists of things like near-identical images or documents, IoT data sets, log files, and applications. This data takes up server space, and powering these servers takes up energy and equipment, which not only costs money, but can also mean significant emissions if low-carbon or renewable power is not being used. Dark data is also unstructured and unexplored, which brings with it privacy and compliance risks.

No organization is unaffected. Estimated levels of commercial dark data vary by sector from 40 to 90 percent, so it’s extremely likely that the majority of your company’s data is dark. According to the World Economic Forum, companies generate 1.3 trillion gigabytes of dark data every day. Storing that data for a year using non-renewables generates as much CO2 as three million flights from London to New York. So, if we’re interested in decarbonizing the data center industry, and we should be, we should tackle this issue.

Technology lag

For many businesses, the level of dark data reflects a lack of data structuring processes. An organization’s ability to collect data can exceed the rate at which it can analyze that data. In some cases, the organization may not even be aware the data is being collected.

Organizations retain dark data for a multitude of reasons. Often it is stored for regulatory compliance and record keeping, but equally often the complexity of compliance, privacy and data discovery is the reason that these data lakes are allowed to build up. Some organizations believe that dark data could be useful to them in the future, once they have acquired better analytic and business intelligence technology to process the information.

New tools & standards

There is good news here. The scale of the task may appear daunting for CIOs and CDOs, but AI and machine learning have now advanced to the point that they can help automate the data structuring process. Only a tiny percentage of dark data needs to be reviewed at the outset by humans to kickstart the process. This can then be followed up with a reinforcement learning model to assess the relevance of remaining data and prioritize it. From then on, a virtuous cycle of tagging and analysis makes the process easier to manage.
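
To make that loop concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular product: a simple token-weighting scorer stands in for the reinforcement learning model mentioned above, and the function names (`train_token_weights`, `score`, `prioritize`) and sample records are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def train_token_weights(labeled):
    """Learn per-token weights from a small human-reviewed seed set.

    labeled: list of (text, is_relevant) pairs.
    Returns a dict mapping token -> smoothed log-odds of relevance.
    """
    relevant, irrelevant = Counter(), Counter()
    for text, is_relevant in labeled:
        (relevant if is_relevant else irrelevant).update(tokenize(text))
    vocab = set(relevant) | set(irrelevant)
    # Add-one smoothing; a positive weight means "seen more often in relevant items".
    return {t: math.log((relevant[t] + 1) / (irrelevant[t] + 1)) for t in vocab}


def score(text, weights):
    """Average token weight; tokens never seen in the seed set count as 0."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)


def prioritize(unlabeled, weights):
    """Order the unreviewed backlog from most to least likely relevant."""
    return sorted(unlabeled, key=lambda text: score(text, weights), reverse=True)


# Hypothetical seed set: a handful of items tagged by a human reviewer.
seed = [
    ("customer contract renewal 2022", True),
    ("invoice customer payment terms", True),
    ("debug log heartbeat ok", False),
    ("temp cache heartbeat dump", False),
]
backlog = ["heartbeat log dump", "customer invoice archive", "misc notes"]

weights = train_token_weights(seed)
for item in prioritize(backlog, weights):
    print(f"{score(item, weights):+.2f}  {item}")
```

Each newly reviewed item can be appended to the seed set and the weights retrained, which is the tagging-and-analysis cycle in miniature.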

Measurement would also help to benchmark progress; considering the scale of the problem, there may be a case for setting standards for effective data use. Perhaps a Data Usage Effectiveness (DUE) metric could sit alongside CUE (Carbon), WUE (Water), and PUE (Power), where 1 = 100% elimination of non-essential single-use data. This, or some similar metric, would be well worth working towards, and could also have value as a digital performance indicator. However, it may be too early to measure, while so much dark data remains invisible.
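
One way such a metric might be calculated, by analogy with PUE (total facility energy divided by IT equipment energy, with 1.0 as the ideal), is total data stored divided by data actively used. The sketch below is only that: the ratio definition, the function names and the sample figures are assumptions for illustration, not a proposed standard.

```python
def data_usage_effectiveness(total_stored_tb: float, actively_used_tb: float) -> float:
    """Hypothetical DUE: total data stored divided by data actively used.

    Modeled on PUE, so 1.0 is the ideal (everything stored is in use) and
    larger values mean a larger share of the estate is dark.
    """
    if total_stored_tb <= 0:
        raise ValueError("total_stored_tb must be positive")
    if not 0 < actively_used_tb <= total_stored_tb:
        raise ValueError("actively_used_tb must be in (0, total_stored_tb]")
    return total_stored_tb / actively_used_tb


def dark_share(due: float) -> float:
    """Fraction of stored data that is dark, implied by a DUE value."""
    return 1.0 - 1.0 / due


# Illustrative figures only: 500 TB stored, of which 200 TB is actively used.
due = data_usage_effectiveness(500.0, 200.0)
print(f"DUE = {due:.2f}, dark share = {dark_share(due):.0%}")
```

Under this definition, the hard part is not the arithmetic but agreeing on what counts as "actively used", which is exactly where dark data's invisibility makes measurement difficult.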

The role of the data center

While any data usage standards would need to be introduced by individual companies, I believe there is a role here for the data center provider, as an ecosystem aggregator with responsibility to customers.

Iron Mountain is better placed than most to help, as our group portfolio covers intelligent data management across the whole analog and digital data lifecycle. We have discovered that the key to successful storage is retaining and enhancing, or elevating, value. Back in the mid-twentieth century Iron Mountain stored physical assets in a secure but accessible way, and as soon as the technology was available we led the digitization process, creating the world’s most trusted digital archives. Now, using the global data center and cloud platforms where the bulk of data lives, not only can we digitize anything, we can distribute it anywhere, anytime. From taping and scanning to SaaS-based, AI-driven services like Iron Mountain InSight®, we have spent decades shedding light on dark data.

Let’s talk

Whatever dark data means to you or your business, it is an ‘elephant in the room’ for data centers, and the more we talk about it the likelier we are to come up with incremental improvements. For individual data users there are things we can do to reduce single-use data. For organizations it’s a bit more complicated, but approaches and tools are emerging. These should be discussed and shared.

As with energy efficiency, identifying and eliminating waste at source is the most obvious opportunity. According to IBM, 60% of data loses its value within milliseconds of being acquired, and any scheme to use data more effectively must first address the issue of collecting useless data. A robust approach to data gathering is the key here: assessing how data can be used, or whether it is usable at all.

The next step is structuring the data we keep. Structured data is not only more valuable, but easier to track and, if necessary, delete. By making data more visible, it should be possible to reduce the environmental and financial burden of storage at the same time as using our valuable data to empower our organizations and serve our customers better.