4 ways to deal with unstructured data


The volume of data organizations generate is exploding. If you're not sure how to deal with unstructured data deluges, these emerging solutions can help.

Paul Gillin

September 24, 2020 · 7 mins

How to deal with unstructured data: It's a problem that's becoming more urgent every day. The amount of data that organizations generate is expected to more than quintuple over the next five years, according to an IDC report shared by Network World. Of that massive hunk of data, 80% will be unstructured — meaning it won't fit into neatly labeled categories or tidy rows and columns.

Capturing and storing all that information on one data storage tier in a data center or cloud is neither practical nor economical, particularly when on-the-spot decisions are required. Here are four promising trends that can help you deal with the oncoming data deluge.

1. Throw It Away

The reality is that much of the data organizations collect isn't very interesting or useful, but it still takes up a lot of storage space. Devices such as smart cameras and machine sensors create huge amounts of data, little of which is needed if everything is operating normally. Rather than storing all that data, a better solution is evaluating it before it hits the network and discarding what isn't needed. Edge computing, a type of distributed processing that makes decisions close to where data is gathered, is a promising way to do just that.

Edge AI is a special category of artificial intelligence that's specifically intended to make decisions requiring immediate attention, such as controlling the brakes of a car or determining the likelihood that a machine is about to fail, according to Forbes. Edge AI can also be used to scour data streams and quickly identify what:

Can be discarded.
Requires immediate attention.
Should be stored for analysis.
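The three-way sort above can be sketched in a few lines of Python. This is a minimal illustration rather than a real edge AI model: the triage function, the temperature thresholds and the sample readings are all hypothetical, and a production system would likely use a trained model instead of fixed cutoffs.

```python
from enum import Enum


class Action(Enum):
    DISCARD = "discard"
    ALERT = "alert"
    STORE = "store"


def triage(reading: float, normal_low: float, normal_high: float, critical: float) -> Action:
    """Classify one sensor reading. All thresholds are hypothetical."""
    if reading >= critical:
        return Action.ALERT      # requires immediate attention
    if normal_low <= reading <= normal_high:
        return Action.DISCARD    # normal operation, nothing worth keeping
    return Action.STORE          # unusual but not urgent: keep for analysis


# Hypothetical machine temperatures in degrees Celsius.
readings = [70.1, 70.4, 84.9, 70.2, 103.5]
actions = [
    triage(r, normal_low=65.0, normal_high=75.0, critical=100.0)
    for r in readings
]
```

In this sample, three of the five readings fall in the normal band and never need to leave the device, which is the storage saving the approach is after.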

There are high hopes for edge computing; many researchers expect the market to grow more than 30% annually for the next several years, according to Allied Market Research.

2. Deduplicate It

Have you ever been on the distribution list of a mass email that included a 15-megabyte PowerPoint attachment? Organizations generate an enormous amount of duplicate data, so much so that a 2013 IDC study estimated that companies spend $44 billion annually to store copies of data they already have. That figure has likely only grown since then.

High-speed, in-line deduplication can flag duplicate records and either hold them for review or delete them automatically. Savings vary, but two-thirds of the companies in an Enterprise Strategy Group survey shared by TechTarget reported that deduplication cut their storage space by more than a factor of 10.
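One common way to spot duplicates is to fingerprint each file's content with a cryptographic hash and keep only one copy per fingerprint. The sketch below assumes whole-file, in-memory deduplication with SHA-256. The deduplicate function and the mailbox file names are hypothetical, and commercial products typically work on smaller blocks rather than whole files.

```python
import hashlib


def deduplicate(blobs: dict[str, bytes]) -> tuple[dict[str, bytes], dict[str, str]]:
    """Keep one copy per unique content and map each name to its content hash."""
    store: dict[str, bytes] = {}  # content hash -> the single stored copy
    index: dict[str, str] = {}    # file name -> content hash
    for name, data in blobs.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)  # only the first copy is kept
        index[name] = digest
    return store, index


# The same attachment lands in three hypothetical mailboxes.
attachment = b"quarterly deck" * 1000
mailboxes = {
    "alice/deck.pptx": attachment,
    "bob/deck.pptx": attachment,
    "carol/deck.pptx": attachment,
    "carol/notes.txt": b"meeting notes",
}

store, index = deduplicate(mailboxes)
raw_bytes = sum(len(data) for data in mailboxes.values())
stored_bytes = sum(len(data) for data in store.values())
```

Every file name still resolves to its content through the index, but the attachment is stored once instead of three times.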

3. Tier It

If your organization treats all data the same, it's flushing money down the drain. Only a tiny percentage of data is typically mission-critical enough that it needs to be instantly available on expensive storage media. Most data can be relegated to spinning disks or tape. Tiered storage automatically assigns data to the most appropriate storage medium based upon policies, which often results in significant savings.
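A tiering policy can be as simple as a rule based on how recently data was accessed. The following sketch assumes three tiers and age cutoffs of 30 and 365 days. The assign_tier function, the tier names, the cutoffs and the dataset names are all hypothetical and would be set by each organization's own policy.

```python
from datetime import date


def assign_tier(last_accessed: date, today: date, mission_critical: bool = False) -> str:
    """Pick a storage tier from a simple age-based policy (hypothetical cutoffs)."""
    if mission_critical:
        return "flash"  # must be instantly available regardless of age
    idle_days = (today - last_accessed).days
    if idle_days <= 30:
        return "flash"
    if idle_days <= 365:
        return "disk"
    return "tape"       # rarely touched: cheapest archival tier


today = date(2020, 9, 24)
# Hypothetical datasets: (last access date, mission-critical flag).
datasets = {
    "orders-db": (date(2020, 9, 23), True),
    "q2-reports": (date(2020, 6, 1), False),
    "2017-camera-footage": (date(2018, 1, 15), False),
}

placement = {
    name: assign_tier(last_seen, today, critical)
    for name, (last_seen, critical) in datasets.items()
}
```

Running a rule like this on a schedule is what lets data drift to cheaper media as it ages, without anyone moving it by hand.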

Horison Information Strategies estimates that between 63% and 85% of a typical organization's data can be moved to secondary or long-term storage without impacting operations. If your organization isn't moving infrequently accessed data from disk to tape or from cloud to tape, you should take a fresh look at this low-cost archival option.

Tape is the most cost-effective storage medium, and its retrieval times are approaching those of disk, thanks to recent advances in technology such as redundant arrays of independent tape. What's more, tape is stored offline, making it nearly impervious to malware attacks. Additionally, cloud services can archive data to tape automatically, which can help you save big while still retaining ready access to your data.

4. Structure It

Machine learning algorithms are great at finding patterns, including those in unstructured data sources such as text documents and images. By repeatedly scanning similar documents with some human oversight, machines can quickly learn that a given sequence of digits is more likely to be a Social Security number than a phone number, or learn to identify the people in a photo or video. This semi-structured data can then be loaded into databases for analytical processing.
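To make the end result concrete, here is a minimal sketch of turning free text into labeled fields. Note that it swaps the machine learning step for hand-written regular expressions, which only work when the formats are known in advance. The US-style patterns, the extract_fields function and the sample note are assumptions for illustration.

```python
import re

# US-style formats, assumed for illustration only.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_fields(text: str) -> dict[str, list[str]]:
    """Pull labeled fields out of free text so they can be loaded into a table."""
    return {"ssn": SSN.findall(text), "phone": PHONE.findall(text)}


# Hypothetical unstructured note with two different digit sequences.
note = "Applicant called from 555-867-5309 and gave SSN 078-05-1120 for verification."
record = extract_fields(note)
```

The resulting dictionary has a fixed set of named fields, so it can be written to a database row. A trained model plays the same role when the formats are too varied to spell out as rules.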

When deciding how to deal with unstructured data, look to cloud and tape for ideas. Deduplicating data helps minimize how much data is stored, and AI helps with information processing and analysis, but cloud-to-tape storage ensures your data is protected in the most secure, accessible and cost-effective manner possible.